The Milwaukee Journal Sentinel and pothole fill times (No merging)#

Story: Race gap found in pothole patching

Author: Keegan Kyle, Grant Smith and Ben Poston, Milwaukee Journal Sentinel

Topics: Census Data, Geocoding, QGIS Spatial Joins, Linear Regression


  • potholes-cleaned-merged.csv: a series of merged datasets (minus the income dataset). The datasets include:
    • 2007-2010 POTHOLES.xls: Pothole data, July 2007-July 2010 from the Milwaukee DPW
    • 2010-2013 POTHOLES.xls: Pothole data, July 2010-July 2013 from the Milwaukee DPW
    • 2013-2017 POTHOLES.xls: Pothole data, July 2013-July 2017 from the Milwaukee DPW
    • 2013 census tract boundaries from the US Census Bureau
    • addresses_geocoded.csv: a large selection of addresses in Milwaukee, geocoded by
    • R12216099_SL140.csv: ACS 2013 5-year, tract level, from Social Explorer
      • (Table A04001, Hispanic or Latino by Race, R12216099.txt is the data dictionary
  • R12216226_SL140.csv ACS 2013 5-year, tract level, from Social Explorer
    • Table A14006, 2013 Median Household income
    • Data dictionary is here

What's the story?#

We're trying to figure out if the time it took Milwaukee to fill pot holes is related to the racial makeup of a census tract.

Do your imports#

You'll also want to set pandas to display up to 200 columns at a time.


Read in your data#

We're just reading in potholes-cleaned-merged.csv for now. It's a lot lot lot of other files, somewhat cleaned and all merged together.

  • Tip: Both GEOID and Geo_FIPS are census tract identifiers. You'll want to read them in as strings so they don't lose leading zeroes

What is the maximum and minimum EnterDt and ResolvDt?#

Use this to confirm that your date range is what you expected. If it isn't, take a look at what might have happened with your dataset.

  • Tip: Missing data might be a headache

Calculate how long it took to fill potholes in 2013#

Save it into a new column.

  • Tip: It's possible to subtract two dates

Hrm, well, I think we need that difference to be an integer#

If your new column isn't an integer, create another column that is.

  • Tip: Just like you might use .str.strip() on a string column, if your column is a datetime you can use .dt.components to get the days, hours, minutes, seconds, etc of the column.

Cleaning up your census data#

The SE_ columns are all from the census data, you can find out what they mean by reading the data dictionary R12216099.txt.

Add new columns to create:

  • pct_white The percent of the population that is White
  • pct_black The percent of the population that is Black
  • pct_hispanic The percent of the population that is Hispanic
  • pct_minority The percent of the population that is a minority (non-White)

The column names don't match exactly, but you can figure it out.


Feel free to drop the census if you're not interested in them any more.

Linear regression#

Using the statsmodels package, run a linear regression to find the coefficient relating percent minority and pothole fill times.

  • Tip: Be sure to remove missing data with .dropna() first. How many rows get removed?
  • Tip: Don't forget to use sm.add_constant. Why do we use it?

Translate that into the form "every X percentage point change in the minority population translates to a Y change in pot hole fill times"


Do you feel comfortable that someone can understand that? Can you reword it to make it more easily understandable?


Other methods of explanation#

While the regression is technically correct, it just doesn't sound very nice. What other options do we have?

What's the average wait to fill a pothole between majority-white and majority-minority census tracts?#

You'll need to create a new column to specify whether the census tract is majority White or not.


How does the average wait time to fill a pothole change as more minorities live in an area?#

  • Tip: Use .cut to split the percent minority (or white) into a few different bins.

Analyzing Income#

R12216226_SL140.csv contains income data for each census tract in Wisconsin. Add it into your analysis.

If you run a multivariate regression also including income, how does this change things?

  • Tip: Be sure to read in Geo_FIPS as a string so leading zeroes don't get removed
  • Tip: You can use this data dictionary to understand what column you're interested in.

Filter out every column except the one you'll be joining on and the median income#


Merge with your existing dataset on census tract#


Run another regression, this time including both percent minority and income#


The income coefficient is very unfriendly!#

Try to explain what it means in normal words. Or... don't, and just skip to the next question.


Create a new column that stands for income in $10,000 increments, and try the regression again#


Explain that in normal human-being words#

Controlling for minority population, for an X change in income, there is a Y change in how long it takes to get potholes filled.


...does that make sense?

Bin income levels and graph it#


This seems unexpected, maybe?#

Not like we were hoping for the race this, but this seems like not what we were going to get. That means there's either a story or we're forgetting something obvious. What might be causing this trend? How could we investigate it?