The Milwaukee Journal Sentinel and pothole fill times (No merging)#
Story: Race gap found in pothole patching
Author: Keegan Kyle, Grant Smith and Ben Poston, Milwaukee Journal Sentinel
Topics: Census Data, Geocoding, QGIS Spatial Joins, Linear Regression
Datasets
- potholes-cleaned-merged.csv: a series of merged datasets (minus the income dataset). The datasets include:
- 2007-2010 POTHOLES.xls: Pothole data, July 2007-July 2010 from the Milwaukee DPW
- 2010-2013 POTHOLES.xls: Pothole data, July 2010-July 2013 from the Milwaukee DPW
- 2013-2017 POTHOLES.xls: Pothole data, July 2013-July 2017 from the Milwaukee DPW
- tl_2013_55_tract.zip: 2013 census tract boundaries from the US Census Bureau
- addresses_geocoded.csv: a large selection of addresses in Milwaukee, geocoded by Geocod.io
- R12216099_SL140.csv: ACS 2013 5-year, tract level, from Social Explorer
- (Table A04001, Hispanic or Latino by Race,
R12216099.txt
is the data dictionary
- (Table A04001, Hispanic or Latino by Race,
- R12216226_SL140.csv ACS 2013 5-year, tract level, from Social Explorer
- Table A14006, 2013 Median Household income
- Data dictionary is here
What's the story?#
We're trying to figure out if the time it took Milwaukee to fill pot holes is related to the racial makeup of a census tract.
Do your imports#
You'll also want to set pandas to display up to 200 columns at a time.
Read in your data#
We're just reading in potholes-cleaned-merged.csv
for now. It's a lot lot lot of other files, somewhat cleaned and all merged together.
- Tip: Both
GEOID
andGeo_FIPS
are census tract identifiers. You'll want to read them in as strings so they don't lose leading zeroes
What is the maximum and minimum EnterDt
and ResolvDt
?#
Use this to confirm that your date range is what you expected. If it isn't, take a look at what might have happened with your dataset.
- Tip: Missing data might be a headache
Calculate how long it took to fill potholes in 2013#
Save it into a new column.
- Tip: It's possible to subtract two dates
Hrm, well, I think we need that difference to be an integer#
If your new column isn't an integer, create another column that is.
- Tip: Just like you might use
.str.strip()
on a string column, if your column is a datetime you can use.dt.components
to get the days, hours, minutes, seconds, etc of the column.
Cleaning up your census data#
The SE_
columns are all from the census data, you can find out what they mean by reading the data dictionary R12216099.txt
.
Add new columns to create:
pct_white
The percent of the population that is Whitepct_black
The percent of the population that is Blackpct_hispanic
The percent of the population that is Hispanicpct_minority
The percent of the population that is a minority (non-White)
The column names don't match exactly, but you can figure it out.
Feel free to drop the census if you're not interested in them any more.
Linear regression#
Using the statsmodels
package, run a linear regression to find the coefficient relating percent minority and pothole fill times.
- Tip: Be sure to remove missing data with
.dropna()
first. How many rows get removed? - Tip: Don't forget to use
sm.add_constant
. Why do we use it?
Translate that into the form "every X percentage point change in the minority population translates to a Y change in pot hole fill times"
Do you feel comfortable that someone can understand that? Can you reword it to make it more easily understandable?
Other methods of explanation#
While the regression is technically correct, it just doesn't sound very nice. What other options do we have?
What's the average wait to fill a pothole between majority-white and majority-minority census tracts?#
You'll need to create a new column to specify whether the census tract is majority White or not.
How does the average wait time to fill a pothole change as more minorities live in an area?#
- Tip: Use
.cut
to split the percent minority (or white) into a few different bins.
Analyzing Income#
R12216226_SL140.csv
contains income data for each census tract in Wisconsin. Add it into your analysis.
If you run a multivariate regression also including income, how does this change things?
- Tip: Be sure to read in
Geo_FIPS
as a string so leading zeroes don't get removed - Tip: You can use this data dictionary to understand what column you're interested in.
Filter out every column except the one you'll be joining on and the median income#
Merge with your existing dataset on census tract#
Run another regression, this time including both percent minority and income#
The income coefficient is very unfriendly!#
Try to explain what it means in normal words. Or... don't, and just skip to the next question.
Create a new column that stands for income in $10,000 increments, and try the regression again#
Explain that in normal human-being words#
Controlling for minority population, for an X change in income, there is a Y change in how long it takes to get potholes filled.
...does that make sense?
Bin income levels and graph it#
This seems unexpected, maybe?#
Not like we were hoping for the race this, but this seems like not what we were going to get. That means there's either a story or we're forgetting something obvious. What might be causing this trend? How could we investigate it?