Imports#
We're importing pandas, as well as setting it to display up to 200 columns at a time.
import pandas as pd
pd.set_option("display.max_columns", 200)
%matplotlib inline
Importing our data#
The city delivered multiple files to us, so we'll need to filter them down to just 2013. What seems straightforward might actually be an editorial decision: when we say "a pothole from 2013," what exactly do we mean? Filed in 2013? Filled in 2013? Filed and filled?
We're going to consider a pothole as being from 2013 if it was entered in 2013.
potholes_2010 = pd.read_excel("data/2010-2013 POTHOLES.xls")
potholes_2013 = pd.read_excel("data/2013-2017 POTHOLES.xls")
potholes = pd.concat([potholes_2010, potholes_2013])
potholes.head()
How many do we have?
potholes = potholes[potholes.EnterDt.str.startswith('2013')]
potholes.shape
Double-checking our dates#
Let's confirm that our date range is what we expected. We'll also need to mind missing data: if a pothole was never filled, it will be missing a ResolvDt.
potholes.EnterDt.dropna().agg(['min', 'max'])
potholes.ResolvDt.dropna().agg(['min', 'max'])
Feature engineering: fill time#
Let's calculate how long it took to fill potholes in 2013. We'll save this into a new column.
# We'll need to convert our columns into datetimes to find the difference
potholes['EnterDt'] = pd.to_datetime(potholes.EnterDt)
potholes['ResolvDt'] = pd.to_datetime(potholes.ResolvDt)
potholes['difference'] = potholes.ResolvDt - potholes.EnterDt
potholes.head()
What kind of datatype is 17 days 10:18:00?
potholes.difference.dtype
Hm, we probably need to turn that into an integer. Otherwise we won't be able to do regression on it!
potholes['wait_days'] = potholes['difference'].dt.components['days'] + \
potholes['difference'].dt.components['hours'] / 24
potholes.head()
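Note that the days-plus-hours arithmetic above drops minutes and seconds. A minimal sketch of an alternative (not what the analysis above used) that keeps full precision via total_seconds — the example timedeltas here are made up:

```python
import pandas as pd

# Two made-up wait times, standing in for the ResolvDt - EnterDt differences
diffs = pd.Series(pd.to_timedelta(['17 days 10:18:00', '2 days 12:00:00']))

# Convert the whole timedelta to fractional days, minutes included
wait_days = diffs.dt.total_seconds() / 86400  # 86400 seconds per day

print(wait_days)
```

Either approach works for a regression; the differences are tiny at the scale of multi-day waits.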
Census data#
Data from the US Government is often Latin-1 encoded instead of UTF-8, so we'll need to specify the encoding. We also want to demand that Geo_FIPS be read in as a string so we don't lose leading zeroes.
census = pd.read_csv("data/R12216099_SL140.csv", encoding='latin-1', dtype={'Geo_FIPS': 'str'})
census.head(2)
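To see why the dtype argument matters, here's a toy demonstration with made-up FIPS codes (not the real census file) of how a leading zero disappears without it:

```python
import io
import pandas as pd

# A made-up two-row CSV standing in for the census download
text = "Geo_FIPS,pop\n01073000100,4000\n55079000101,3500"

as_number = pd.read_csv(io.StringIO(text))                             # default: parsed as an integer
as_string = pd.read_csv(io.StringIO(text), dtype={'Geo_FIPS': 'str'})  # kept as text

print(as_number.Geo_FIPS[0])  # 1073000100, leading zero lost
print(as_string.Geo_FIPS[0])  # 01073000100, leading zero preserved
```

A FIPS code with a missing leading zero will silently fail to match during a later merge, which is why we force the string dtype up front.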
Feature engineering: Race#
Let's use this dataset to create a second dataframe of all of our features. Each row should contain:
- The census tract number
- The percent of the population that is White
- The percent of the population that is Black
- The percent of the population that is Hispanic
- The percent of the population that is a minority (non-White)
We'll need to read the data dictionary to know exactly what the columns mean.
census['pct_white'] = census.SE_A04001_003 / census.SE_A04001_001 * 100
census['pct_black'] = census.SE_A04001_004 / census.SE_A04001_001 * 100
census['pct_hispanic'] = census.SE_A04001_010 / census.SE_A04001_001 * 100
census['pct_minority'] = 100 - census.pct_white
census = census[['Geo_FIPS', 'pct_white', 'pct_black', 'pct_hispanic', 'pct_minority']]
census.head()
Combine datasets#
We need to create a new dataframe by merging our street addresses with our census data. It would be nice to merge on census tract code...
potholes.head(2)
...but our addresses data does not have a census tract on them!
Connecting addresses to census tracts#
To make this work, we need to go through a multi-step process. First we need to find the lat/lon coordinates for each address, then we need to find out which census tract it's in.
To start with, I made a list of unique addresses and saved them to a file. I then geocoded that file, converting each address to a latitude and longitude. You can use a tool like Little Geocoder, Geocodio, or the Google Maps API to perform the conversion.
Next, I opened these latitude/longitude pairs in QGIS. The QGIS command Join attributes by location merges datasets that geographically overlap each other: if you have one layer that's a list of lat/lon points and another layer that's a shapefile with census tract information, it can create a new layer of lat/lon points that also carries the census tract information. After downloading a shapefile of census tracts from the Census Bureau, I was able to combine the two datasets.
Finally, I exported the result by right-clicking the layer, selecting Export As, and making sure the output is another CSV.
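If you'd rather script the spatial join than click through QGIS, libraries like geopandas offer a spatial join. Since we can't assume that's installed, here is a deliberately toy stand-in in plain pandas that treats each "tract" as an axis-aligned bounding box. Every name and number below is made up, and real census tracts are irregular polygons, so this only illustrates the idea of matching points to areas:

```python
import pandas as pd

# Made-up geocoded addresses (lat/lon points)
points = pd.DataFrame({
    'address': ['100 E MAIN ST', '200 W OAK AVE'],
    'lat': [43.04, 43.10],
    'lon': [-87.91, -87.95],
})

# Made-up "tracts" as bounding boxes (real tracts are irregular polygons)
tracts = pd.DataFrame({
    'GEOID': ['55079000101', '55079000102'],
    'lat_min': [43.00, 43.08], 'lat_max': [43.08, 43.15],
    'lon_min': [-87.95, -88.00], 'lon_max': [-87.90, -87.94],
})

# Pair every point with every tract, then keep the pairs where
# the point falls inside the tract's bounding box
joined = points.merge(tracts, how='cross')
inside = (
    joined.lat.between(joined.lat_min, joined.lat_max)
    & joined.lon.between(joined.lon_min, joined.lon_max)
)
addresses_tracts = joined[inside][['address', 'GEOID']]
print(addresses_tracts)
```

In practice you'd want a real point-in-polygon test (QGIS's Join attributes by location, or geopandas's sjoin), since a bounding box will mis-assign points near tract boundaries.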
Merging#
We now have three datasets:
- Addresses and pothole fill times
- Census data with census tract codes
- Addresses with census tract codes
We'll need to merge them each together to create one whole dataset.
addresses_tracts = pd.read_csv("addresses_with_tract.csv", dtype={'GEOID': 'str'})
addresses_tracts.head()
merged = addresses_tracts.merge(census, left_on='GEOID', right_on='Geo_FIPS')
merged.head()
potholes.head()
# Build a full street address from the house number and street name columns
potholes['address'] = potholes.A.astype(str) + ' ' + potholes.Street
potholes.head()
merged = merged.merge(potholes, on='address')
merged.head()
Missing data#
Linear regression doesn't (always) deal well with missing data, so we'll remove any rows that are missing any of our fields.
merged.shape
merged = merged.dropna()
merged.shape
As a result we removed 964 points from our dataset, about 7.5%.
Linear regression#
Using the statsmodels package, we'll run a linear regression to find the coefficient relating percent minority to pothole fill times.
import statsmodels.formula.api as smf
model = smf.ols("wait_days ~ pct_minority", data=merged)
result = model.fit()
result.summary()
Translate that into the form "every X percentage-point change in the minority population translates to a Y change in pothole fill times"
# Every +1 percentage point change in the minority population translates to a 0.04-day change in pothole fill times
Do you feel comfortable that someone can understand that? Can you reword it to make it more easily understandable?
# times 25
# Every 25 percentage point change in the minority population translates to a 1-day change in pothole fill times
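The rescaling is simple arithmetic: multiply the coefficient by N points to get days, or divide 1 by the coefficient to get "percentage points per extra day." A sketch using the 0.04 coefficient from above:

```python
# Regression coefficient: days of extra wait per one-percentage-point
# increase in minority population
coef = 0.04

# Days added by a 25-point increase in minority population
print(coef * 25)           # 1.0 day

# Percentage-point increase associated with one extra day of waiting
print(round(1 / coef, 1))  # 25.0 points
```

Picking a round number of days (here, 1) and solving for the percentage-point change is usually the most readable framing.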
Other methods of explanation#
While the regression is technically correct, it just doesn't sound very nice, which makes it hard to communicate.
What other options do we have?
Averages#
What's the average wait to fill a pothole between majority-white and majority-minority census tracts? We can create a new column to specify whether the census tract is majority white or not.
merged['majority_white'] = (merged.pct_white > 50).astype(int)
merged.groupby('majority_white').wait_days.describe()
This is how the following sentence in the first story came to be:
potholes in mostly minority census tracts took an average of 11 days to repair, while potholes in mostly white census tracts took seven days.
That's much easier to understand than any regression output!
Binning our data#
Our initial regression answered the question "how does the average wait time to fill a pothole change as more minorities live in an area?", but outliers and odd distributions can always cause problems with regression. Instead of trying to rely on each small increase in minority population to have the same effect, we could also separate our data into different bins.
# Bins between 0-100, with steps of 20
# i.e. 0-20, 20-40, 40-60, 60-80, 80-100
bins = range(0, 101, 20)
# Cut pct_minority into separate bins
merged['bin'] = pd.cut(merged.pct_minority, bins=bins)
merged.head()
What's the distribution look like?
merged.bin.value_counts().sort_index().plot(kind='bar')
With the majority/minority white version, we split our data into two bins. Now instead we've split it into five different bins. We can do the same analysis we did before with grouping and describing.
merged.groupby('bin').wait_days.describe()
How does the median change as the percentage of minority population increases?
merged.groupby('bin').wait_days.median().plot(kind='bar')
Adding income to our regression#
R12216226_SL140.csv contains income data for each census tract in Wisconsin. Let's add it into our analysis to take into account differences in income.
income = pd.read_csv("data/R12216226_SL140.csv", encoding='latin-1', dtype={'Geo_FIPS': 'str'})
income.head(2)
income = income[['Geo_FIPS', 'SE_A14006_001']]
income.head(2)
Individual dollars will make our results difficult to read, so we'll convert it to tens of thousands of dollars.
income['income_10k'] = income.SE_A14006_001 / 10000
income.head(2)
merged = merged.merge(income, on='Geo_FIPS')
merged.head(2)
Perform the regression#
import statsmodels.formula.api as smf
model = smf.ols("wait_days ~ pct_minority + income_10k", data=merged)
result = model.fit()
result.summary()
# Cut income into ten equal-width bins
merged['income_bin'] = pd.cut(merged.income_10k, bins=10)
merged.groupby('income_bin').wait_days.median().plot(kind='bar')
Writing our results#
After controlling for income, our coefficient for pct_minority was raised from 0.04 to 0.06. How do we put that into words?
Every +1 percentage point change in the minority population translates to a 0.06-day change in pothole fill times
Which, if we multiply by about 16.6, we can turn into an actual day measurement:
Every +16.6 percentage point change in the minority population translates to a 1-day change in pothole fill times
Review#
In this section we partially reproduced a linear regression from the Milwaukee Journal Sentinel regarding pothole fill times. We discussed the different ways of communicating your findings to the reader, along with other options for analysis besides just regression on a field, such as majority/minority splits or binning.
Discussion topics#
What technique of communicating our findings was the most helpful to the reader? What was most useful to us as researchers?
The published piece includes this phrase: "the higher the minority population percentage in a census tract, the longer it took crews to fix potholes." How does that compare to the other approaches we've taken in this analysis?
The piece also states "The analysis could not account for the remaining factors: the age or size of roadways, citywide traffic counts or pothole severity, which the city considers as priorities." How do we feel about a regression that leaves out so many other variables? Is it still meaningful?
What kind of expert would you look for to double-check our analysis? Who did the Milwaukee Journal Sentinel use?