← Previous: Pothole geographic analysis and linear regression, complete walkthrough

The Milwaukee Journal Sentinel and pothole fill times (No merging)#

Story: Race gap found in pothole patching

Author: Keegan Kyle, Grant Smith and Ben Poston, Milwaukee Journal Sentinel

Topics: Census Data, Geocoding, QGIS Spatial Joins, Linear Regression

Datasets

potholes-cleaned-merged.csv: a series of merged datasets (minus the income dataset). The datasets include:
- 2007-2010 POTHOLES.xls: Pothole data, July 2007-July 2010 from the Milwaukee DPW
- 2010-2013 POTHOLES.xls: Pothole data, July 2010-July 2013 from the Milwaukee DPW
- 2013-2017 POTHOLES.xls: Pothole data, July 2013-July 2017 from the Milwaukee DPW
- tl_2013_55_tract.zip: 2013 census tract boundaries from the US Census Bureau
- addresses_geocoded.csv: a large selection of addresses in Milwaukee, geocoded by Geocod.io
- R12216099_SL140.csv: ACS 2013 5-year, tract level, from Social Explorer
  - (Table A04001, Hispanic or Latino by Race, R12216099.txt is the data dictionary
R12216226_SL140.csv ACS 2013 5-year, tract level, from Social Explorer
- Table A14006, 2013 Median Household income
- Data dictionary is here

What's the story?#

We're trying to figure out if the time it took Milwaukee to fill pot holes is related to the racial makeup of a census tract.

Read online Download notebook Interactive version

Do your imports#

You'll also want to set pandas to display up to 200 columns at a time.

Read in your data#

We're just reading in potholes-cleaned-merged.csv for now. It's a lot lot lot of other files, somewhat cleaned and all merged together.

Tip: Both GEOID and Geo_FIPS are census tract identifiers. You'll want to read them in as strings so they don't lose leading zeroes

What is the maximum and minimum `EnterDt` and `ResolvDt`?#

Use this to confirm that your date range is what you expected. If it isn't, take a look at what might have happened with your dataset.

Tip: Missing data might be a headache

Calculate how long it took to fill potholes in 2013#

Save it into a new column.

Tip: It's possible to subtract two dates

Hrm, well, I think we need that difference to be an integer#

If your new column isn't an integer, create another column that is.

Tip: Just like you might use .str.strip() on a string column, if your column is a datetime you can use .dt.components to get the days, hours, minutes, seconds, etc of the column.

Cleaning up your census data#

The SE_ columns are all from the census data, you can find out what they mean by reading the data dictionary R12216099.txt.

Add new columns to create:

pct_white The percent of the population that is White
pct_black The percent of the population that is Black
pct_hispanic The percent of the population that is Hispanic
pct_minority The percent of the population that is a minority (non-White)

The column names don't match exactly, but you can figure it out.

Feel free to drop the census if you're not interested in them any more.

Linear regression#

Using the statsmodels package, run a linear regression to find the coefficient relating percent minority and pothole fill times.

Tip: Be sure to remove missing data with .dropna() first. How many rows get removed?
Tip: Don't forget to use sm.add_constant. Why do we use it?

Translate that into the form "every X percentage point change in the minority population translates to a Y change in pot hole fill times"

Do you feel comfortable that someone can understand that? Can you reword it to make it more easily understandable?

Other methods of explanation#

While the regression is technically correct, it just doesn't sound very nice. What other options do we have?

What's the average wait to fill a pothole between majority-white and majority-minority census tracts?#

You'll need to create a new column to specify whether the census tract is majority White or not.

How does the average wait time to fill a pothole change as more minorities live in an area?#

Tip: Use .cut to split the percent minority (or white) into a few different bins.

Analyzing Income#

R12216226_SL140.csv contains income data for each census tract in Wisconsin. Add it into your analysis.

If you run a multivariate regression also including income, how does this change things?

Tip: Be sure to read in Geo_FIPS as a string so leading zeroes don't get removed
Tip: You can use this data dictionary to understand what column you're interested in.

Filter out every column except the one you'll be joining on and the median income#

Merge with your existing dataset on census tract#

Run another regression, this time including both percent minority and income#

The income coefficient is very unfriendly!#

Try to explain what it means in normal words. Or... don't, and just skip to the next question.

Create a new column that stands for income in $10,000 increments, and try the regression again#

Explain that in normal human-being words#

Controlling for minority population, for an X change in income, there is a Y change in how long it takes to get potholes filled.

...does that make sense?

Bin income levels and graph it#

This seems unexpected, maybe?#

Not like we were hoping for the race this, but this seems like not what we were going to get. That means there's either a story or we're forgetting something obvious. What might be causing this trend? How could we investigate it?

← Previous: Pothole geographic analysis and linear regression, complete walkthrough

Next: Summary →

The Milwaukee Journal Sentinel and pothole fill times (No merging)#

What's the story?#

Do your imports#

Read in your data#

What is the maximum and minimum `EnterDt` and `ResolvDt`?#

Calculate how long it took to fill potholes in 2013#

Hrm, well, I think we need that difference to be an integer#

Cleaning up your census data#

Linear regression#

Other methods of explanation#

What's the average wait to fill a pothole between majority-white and majority-minority census tracts?#

How does the average wait time to fill a pothole change as more minorities live in an area?#

Analyzing Income#

Filter out every column except the one you'll be joining on and the median income#

Merge with your existing dataset on census tract#

Run another regression, this time including both percent minority and income#

The income coefficient is very unfriendly!#

Create a new column that stands for income in $10,000 increments, and try the regression again#

Explain that in normal human-being words#

Bin income levels and graph it#

This seems unexpected, maybe?#

Text analysis

Putting things in categories automatically

How X affects Y

Python data science reference

All Projects

The Milwaukee Journal Sentinel and pothole fill times (No merging)#

What's the story?#

Do your imports#

Read in your data#

What is the maximum and minimum EnterDt and ResolvDt?#

Calculate how long it took to fill potholes in 2013#

Hrm, well, I think we need that difference to be an integer#

Cleaning up your census data#

Linear regression#

Other methods of explanation#

What's the average wait to fill a pothole between majority-white and majority-minority census tracts?#

How does the average wait time to fill a pothole change as more minorities live in an area?#

Analyzing Income#

Filter out every column except the one you'll be joining on and the median income#

Merge with your existing dataset on census tract#

Run another regression, this time including both percent minority and income#

The income coefficient is very unfriendly!#

Create a new column that stands for income in $10,000 increments, and try the regression again#

Explain that in normal human-being words#

Bin income levels and graph it#

This seems unexpected, maybe?#

Text analysis

Putting things in categories automatically

How X affects Y

Python data science reference

All Projects

What is the maximum and minimum `EnterDt` and `ResolvDt`?#