Imports#
We're importing pandas, as well as setting it to display up to 200 columns at a time.
import pandas as pd
pd.set_option("display.max_columns", 200)
%matplotlib inline
Importing our data#
The city delivered multiple files to us, so we'll need to filter them down to just 2013. What seems straightforward might actually be an editorial decision: when we say "a pothole from 2013," what exactly do we mean? Filed in 2013? Filled in 2013? Filed and filled?
We're going to consider a pothole as being from 2013 if it was entered in 2013.
potholes_2010 = pd.read_excel("data/2010-2013 POTHOLES.xls")
potholes_2013 = pd.read_excel("data/2013-2017 POTHOLES.xls")
potholes = pd.concat([potholes_2010, potholes_2013])
potholes.head()
How many do we have?
potholes = potholes[potholes.EnterDt.str.startswith('2013')]
potholes.shape
Double-checking our dates#
Let's confirm that our date range is what we expected. We'll also need to mind missing data: if a pothole was never filled, it will be missing a ResolvDt.
potholes.EnterDt.dropna().agg(['min', 'max'])
potholes.ResolvDt.dropna().agg(['min', 'max'])
Feature engineering: fill time#
Let's calculate how long it took to fill potholes in 2013. We'll save this into a new column.
# We'll need to convert our columns into datetimes to find the difference
potholes['EnterDt'] = pd.to_datetime(potholes.EnterDt)
potholes['ResolvDt'] = pd.to_datetime(potholes.ResolvDt)
potholes['difference'] = potholes.ResolvDt - potholes.EnterDt
potholes.head()
What kind of datatype is 17 days 10:18:00?
potholes.difference.dtype
Hm, we probably need to turn that into an integer. Otherwise we won't be able to do regression on it!
potholes['wait_days'] = potholes['difference'].dt.components['days'] + \
potholes['difference'].dt.components['hours'] / 24
potholes.head()
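Note that the days-plus-hours arithmetic above drops minutes and seconds. A minimal sketch of an alternative (not what the analysis above used) that keeps full precision via total_seconds — the example timedeltas here are made up:

```python
import pandas as pd

# Two made-up wait times, standing in for the ResolvDt - EnterDt differences
diffs = pd.Series(pd.to_timedelta(['17 days 10:18:00', '2 days 12:00:00']))

# Convert the whole timedelta to fractional days, minutes included
wait_days = diffs.dt.total_seconds() / 86400  # 86400 seconds per day

print(wait_days)
```

Either approach works for a regression; the differences are tiny at the scale of multi-day waits.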
Census data#
Data from the US Government is often Latin-1 encoded instead of UTF-8, so we'll need to specify the encoding. We also want to demand that Geo_FIPS be read in as a string so we don't lose leading zeroes.
census = pd.read_csv("data/R12216099_SL140.csv", encoding='latin-1', dtype={'Geo_FIPS': 'str'})
census.head(2)
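To see why the dtype argument matters, here's a toy demonstration with made-up FIPS codes (not the real census file) of how a leading zero disappears without it:

```python
import io
import pandas as pd

# A made-up two-row CSV standing in for the census download
text = "Geo_FIPS,pop\n01073000100,4000\n55079000101,3500"

as_number = pd.read_csv(io.StringIO(text))                             # default: parsed as an integer
as_string = pd.read_csv(io.StringIO(text), dtype={'Geo_FIPS': 'str'})  # kept as text

print(as_number.Geo_FIPS[0])  # 1073000100, leading zero lost
print(as_string.Geo_FIPS[0])  # 01073000100, leading zero preserved
```

A FIPS code with a missing leading zero will silently fail to match during a later merge, which is why we force the string dtype up front.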
Feature engineering: Race#
Let's use this dataset to create a second dataframe of all of our features. Each row should contain:
- The census tract number
- The percent of the population that is White
- The percent of the population that is Black
- The percent of the population that is Hispanic
- The percent of the population that is a minority (non-White)
We'll need to read the data dictionary to know exactly what the columns mean.
census['pct_white'] = census.SE_A04001_003 / census.SE_A04001_001 * 100
census['pct_black'] = census.SE_A04001_004 / census.SE_A04001_001 * 100
census['pct_hispanic'] = census.SE_A04001_010 / census.SE_A04001_001 * 100
census['pct_minority'] = 100 - census.pct_white
census = census[['Geo_FIPS', 'pct_white', 'pct_black', 'pct_hispanic', 'pct_minority']]
census.head()
Combine datasets#
We need to create a new dataframe by merging our street addresses with our census data. It would be nice to merge on census tract code...
potholes.head(2)
...but our addresses data does not have a census tract on them!
Connecting addresses to census tracts#
To make this work, we need to go through a multi-step process. First we need to find the lat/lon coordinates for each address, then we need to find out which census tract it's in.
To start with, I made a list of unique addresses and saved them to a file. I then geocoded that file, converting each address to a latitude and longitude. You can use a tool like Little Geocoder, Geocodio, or the Google Maps API to perform the conversion.
Next, I opened these latitude/longitude pairs in QGIS. The QGIS command Join attributes by location merges datasets that geographically overlap each other: if you have one layer that's a list of lat/lon points and another layer that's a shapefile with census tract information, it can create a new layer of lat/lon points that also carries the census tract information. After downloading a shapefile of census tracts from the Census Bureau, I was able to combine the two datasets.
Finally, I exported the result by right-clicking the layer, selecting Export As, and making sure the output is another CSV.
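If you'd rather script the spatial join than click through QGIS, libraries like geopandas offer a spatial join. Since we can't assume that's installed, here is a deliberately toy stand-in in plain pandas that treats each "tract" as an axis-aligned bounding box. Every name and number below is made up, and real census tracts are irregular polygons, so this only illustrates the idea of matching points to areas:

```python
import pandas as pd

# Made-up geocoded addresses (lat/lon points)
points = pd.DataFrame({
    'address': ['100 E MAIN ST', '200 W OAK AVE'],
    'lat': [43.04, 43.10],
    'lon': [-87.91, -87.95],
})

# Made-up "tracts" as bounding boxes (real tracts are irregular polygons)
tracts = pd.DataFrame({
    'GEOID': ['55079000101', '55079000102'],
    'lat_min': [43.00, 43.08], 'lat_max': [43.08, 43.15],
    'lon_min': [-87.95, -88.00], 'lon_max': [-87.90, -87.94],
})

# Pair every point with every tract, then keep the pairs where
# the point falls inside the tract's bounding box
joined = points.merge(tracts, how='cross')
inside = (
    joined.lat.between(joined.lat_min, joined.lat_max)
    & joined.lon.between(joined.lon_min, joined.lon_max)
)
addresses_tracts = joined[inside][['address', 'GEOID']]
print(addresses_tracts)
```

In practice you'd want a real point-in-polygon test (QGIS's Join attributes by location, or geopandas's sjoin), since a bounding box will mis-assign points near tract boundaries.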
Merging#
We now have three datasets:
- Addresses and pothole fill times
- Census data with census tract codes
- Addresses with census tract codes
We'll need to merge them each together to create one whole dataset.
addresses_tracts = pd.read_csv("addresses_with_tract.csv", dtype={'GEOID': 'str'})
addresses_tracts.head()
merged = addresses_tracts.merge(census, left_on='GEOID', right_on='Geo_FIPS')
merged.head()
potholes.head()
# Build a full street address from the house number and street name columns
potholes['address'] = potholes.A.astype(str) + ' ' + potholes.Street
potholes.head()
merged = merged.merge(potholes, on='address')
merged.head()
Missing data#
Linear regression doesn't (always) deal well with missing data, so we'll remove any rows that are missing any of our fields.
merged.shape
merged = merged.dropna()
merged.shape
As a result we removed 964 points from our dataset, about 7.5%.
Linear regression#
Using the statsmodels package, we'll run a linear regression to find the coefficient relating percent minority to pothole fill times.
import statsmodels.formula.api as smf
model = smf.ols("wait_days ~ pct_minority", data=merged)
result = model.fit()
result.summary()
Translate that into the form "every X percentage-point change in the minority population translates to a Y change in pothole fill times"
# Every +1 percentage point change in the minority population translates to a 0.04-day change in pothole fill times
Do you feel comfortable that someone can understand that? Can you reword it to make it more easily understandable?
# times 25
# Every 25 percentage point change in the minority population translates to a 1-day change in pothole fill times
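The rescaling is simple arithmetic: multiply the coefficient by N points to get days, or divide 1 by the coefficient to get "percentage points per extra day." A sketch using the 0.04 coefficient from above:

```python
# Regression coefficient: days of extra wait per one-percentage-point
# increase in minority population
coef = 0.04

# Days added by a 25-point increase in minority population
print(coef * 25)           # 1.0 day

# Percentage-point increase associated with one extra day of waiting
print(round(1 / coef, 1))  # 25.0 points
```

Picking a round number of days (here, 1) and solving for the percentage-point change is usually the most readable framing.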
Other methods of explanation#
While the regression is technically correct, it just doesn't sound very nice, which makes it hard to communicate.
What other options do we have?
Averages#
What's the average wait to fill a pothole between majority-white and majority-minority census tracts? We can create a new column to specify whether the census tract is majority white or not.
merged['majority_white'] = (merged.pct_white > 50).astype(int)
merged.groupby('majority_white').wait_days.describe()
This is how the following sentence in the first story came to be:
potholes in mostly minority census tracts took an average of 11 days to repair, while potholes in mostly white census tracts took seven days.
That's much easier to understand than any regression output!
Binning our data#
Our initial regression answered the question "how does the average wait time to fill a pothole change as more minorities live in an area?", but outliers and odd distributions can always cause problems with regression. Instead of trying to rely on each small increase in minority population to have the same effect, we could also separate our data into different bins.
# Bins between 0-100, with steps of 20
# i.e. 0-20, 20-40, 40-60, 60-80, 80-100
bins = range(0, 101, 20)
# Cut pct_minority into separate bins
merged['bin'] = pd.cut(merged.pct_minority, bins=bins)
merged.head()
What's the distribution look like?
merged.bin.value_counts().sort_index().plot(kind='bar')
With the majority/minority white version, we split our data into two bins. Now instead we've split it into five different bins. We can do the same analysis we did before with grouping and describing.
merged.groupby('bin').wait_days.describe()
How does the median change as the percentage of minority population increases?
merged.groupby('bin').wait_days.median().plot(kind='bar')
Adding income to our regression#
R12216226_SL140.csv contains income data for each census tract in Wisconsin. Let's add it into our analysis to take into account differences in income.
income = pd.read_csv("data/R12216226_SL140.csv", encoding='latin-1', dtype={'Geo_FIPS': 'str'})
income.head(2)
income = income[['Geo_FIPS', 'SE_A14006_001']]
income.head(2)
Individual dollars will make our results difficult to read, so we'll convert it to tens of thousands of dollars.
income['income_10k'] = income.SE_A14006_001 / 10000
income.head(2)
merged = merged.merge(income, on='Geo_FIPS')
merged.head(2)
Perform the regression#
import statsmodels.formula.api as smf
model = smf.ols("wait_days ~ pct_minority + income_10k", data=merged)
result = model.fit()
result.summary()
# Cut income into ten equal-width bins
merged['income_bin'] = pd.cut(merged.income_10k, bins=10)
merged.groupby('income_bin').wait_days.median().plot(kind='bar')
Writing our results#
After controlling for income, our coefficient for pct_minority was raised from 0.04 to 0.06. How do we put that into words?
Every +1 percentage point change in the minority population translates to a 0.06-day change in pothole fill times
Which, if we multiply by about 16.6, we can turn into an actual day measurement:
Every +16.6 percentage point change in the minority population translates to a 1-day change in pothole fill times
Review#
In this section we partially reproduced a linear regression from the Milwaukee Journal Sentinel regarding pothole fill times. We discussed the different ways of communicating your findings to the reader, along with other options for analysis besides just regression on a field, such as majority/minority splits or binning.
Discussion topics#
What technique of communicating our findings was the most helpful to the reader? What was most useful to us as researchers?
The published piece includes this phrase: "the higher the minority population percentage in a census tract, the longer it took crews to fix potholes." How does that compare to the other approaches we've taken in this analysis?
The piece also states "The analysis could not account for the remaining factors: the age or size of roadways, citywide traffic counts or pothole severity, which the city considers as priorities." How do we feel about a regression that leaves out so many other variables? Is it still meaningful?
What kind of expert would you look for to double-check our analysis? Who did the Milwaukee Journal Sentinel use?