Predicting delays in patching potholes based on demographics

An analysis of the relationship between race and city sanitation services in Milwaukee.

linear regression, multivariable regression, geocoding, census data, joining datasets, feature engineering


This chapter reviews a piece from the Milwaukee Journal Sentinel that combines geographic data cleaning with a linear regression on census data. While the end result is similar to other regression analyses, there are a lot of data cleaning steps between us and the finish line.

After requesting pothole fill data from the city, we receive a file containing the addresses of reported potholes, along with the times the potholes were reported and fixed. To turn this into geographic data, we need to convert these addresses to latitude/longitude pairs (geocoding), and then use a spatial join to find out which census tract each address is in.
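To make the spatial join step concrete, here is a minimal sketch of the idea behind it: a point-in-polygon test that assigns each geocoded pothole to a tract. It assumes the addresses have already been geocoded. The tract IDs, polygon corners, addresses, and coordinates are all made up for illustration, and in practice a tool such as QGIS would likely do this work against real tract boundaries.

```python
# Toy stand-in for a spatial join: find which tract polygon contains each point.
# Tract IDs, polygon corners, addresses, and coordinates are hypothetical.

def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count the polygon edges crossed by a ray going east from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses the point's latitude.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_tract(lon, lat, tracts):
    """Return the ID of the first tract containing the point, or None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(lon, lat, polygon):
            return tract_id
    return None


# Two made-up rectangular "tracts" as lists of (lon, lat) corners.
tracts = {
    "tract_A": [(-87.95, 43.00), (-87.90, 43.00), (-87.90, 43.05), (-87.95, 43.05)],
    "tract_B": [(-87.90, 43.00), (-87.85, 43.00), (-87.85, 43.05), (-87.90, 43.05)],
}

# Made-up potholes, already geocoded to longitude/latitude.
potholes = [
    {"address": "100 Example St", "lon": -87.93, "lat": 43.02},
    {"address": "200 Sample Ave", "lon": -87.87, "lat": 43.03},
    {"address": "300 Outside Rd", "lon": -87.80, "lat": 43.02},
]

for p in potholes:
    p["tract"] = find_tract(p["lon"], p["lat"], tracts)
```

The third pothole falls outside both polygons and gets `None`, which is the kind of row worth inspecting by hand, since it often signals a bad geocode.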

For each census tract, the Census Bureau publishes data on race, income, and population. We'll then download this data and join it to the potholes dataset, allowing us to see the demographic details at each pothole's location.
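The join itself can be sketched in a few lines. This example assumes each pothole already carries a tract ID and that the census figures are keyed by the same ID. The column names and numbers are hypothetical, and converting raw counts to percentages is one plausible bit of feature engineering, not necessarily the one used in the original analysis.

```python
# Hypothetical tract-level census counts, keyed by tract ID (values are made up).
census = {
    "tract_A": {"population": 4000, "white": 3000, "black": 600},
    "tract_B": {"population": 2500, "white": 500, "black": 1750},
}

# Potholes that have already been assigned a tract by the spatial join.
potholes = [
    {"address": "100 Example St", "tract": "tract_A"},
    {"address": "200 Sample Ave", "tract": "tract_B"},
    {"address": "300 Outside Rd", "tract": None},
]

joined = []
for p in potholes:
    demo = census.get(p["tract"])
    if demo is None:
        continue  # no matching tract: drop the row, as an inner join would
    row = dict(p)
    # Turn raw counts into percentages so tracts of different sizes are comparable.
    row["pct_white"] = 100 * demo["white"] / demo["population"]
    row["pct_black"] = 100 * demo["black"] / demo["population"]
    joined.append(row)
```

Dropping unmatched rows keeps the sketch short. In a real analysis it is worth counting how many potholes fail to match before discarding them.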

Once all the data is in one place - how long it took the potholes to be fixed along with demographic data for each pothole's location - we are finally able to use a linear regression to determine how an area's demographics relate to the number of days it takes to fix a pothole.

The original piece uses more features than just race, but in its current form our analysis uses only race. Joining in the dataset used for the AP regression on life expectancy would be a quick upgrade.

Notebooks, Assignments, and Walkthroughs

Complete walkthrough

A start-to-finish walkthrough of the Milwaukee Journal Sentinel's analysis.

Multi-page walkthrough

Pothole geographic analysis and linear regression, complete walkthrough

A start-to-finish analysis of pothole fill times in Milwaukee, including the geographic parts and the linear regression.

Pothole demographics linear regression, no spatial analysis

A quicker version of the full analysis that skips the QGIS and spatial analysis steps.

Discussion topics