2.1 Turning our question into measurables
First, we’ll need to break this question down into something measurable, then figure out where we can get the necessary information.
2.1.1 Data: Potholes
We’ll start with the time it takes to fix potholes.
People report potholes, Milwaukee puts the reports into their system, and then the potholes get fixed. Let’s assume we can get this dataset, and that it will be in a reasonable form. To do our job, we’d probably want these three columns:
- A street address (to know where the pothole is)
- The time the pothole was reported
- The time the pothole was filled in
A “time it took to fill in the pothole” column might be nice, but it’s kind of unrealistic! We can compute that ourselves by subtracting the report time from the filled-in time.
If you start searching around about potholes in Milwaukee, you quickly end up on this page about reporting a pothole. These potholes are reported to, and fixed by, the Department of Public Works.
According to the County of Milwaukee’s Open Records Request page:
Each county department and elected official is the custodian of their respective records. As a result, each department and each elected official fulfills their own records requests. Therefore, to obtain the records you seek, you need to direct your open records requests to the appropriate record custodian.
For example, if you are looking for pension information, Human Resources would be the custodian of those records. If you are looking for an accident report from an incident on the freeway, the Sheriff’s Department would be the custodian of those records.
And so you might file a request with them (as I did!) and come away with a few Excel files with exactly the columns we asked for.
import pandas as pd
pd.set_option("display.max_columns", 20)
pd.set_option("display.max_colwidth", 200)
df = pd.read_excel("data/2007-2010 POTHOLES.xls")
df.head(3)
A | Street | EnterDt | PrintDt | ResolvDt |
---|---|---|---|---|
1846 | W HALSEY AV | 2010-07-15 16:33 | 2010-07-16 15:44 | 2010-07-19 15:14 |
9324 | W PARK HILL AV | 2010-07-15 16:06 | 2010-07-16 10:05 | 2010-07-21 06:02 |
1020 | E MANITOBA ST | 2010-07-15 15:13 | 2010-07-15 15:33 | 2010-07-16 14:35 |
Our dataset is all pothole requests filed between July 15, 2007 and July 15, 2017.
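Here’s a minimal sketch of the “compute it ourselves” step, using only the three rows from the preview above. It assumes that EnterDt is when the pothole was reported and ResolvDt is when it was filled in; that reading of the column names is a guess you’d want to confirm with the Department of Public Works. The `wait` and `wait_days` column names are made up for this example.

```python
import pandas as pd

# Three rows copied from the preview above; the real file has many more.
sample = pd.DataFrame(
    {
        "A": [1846, 9324, 1020],
        "Street": ["W HALSEY AV", "W PARK HILL AV", "E MANITOBA ST"],
        "EnterDt": ["2010-07-15 16:33", "2010-07-15 16:06", "2010-07-15 15:13"],
        "ResolvDt": ["2010-07-19 15:14", "2010-07-21 06:02", "2010-07-16 14:35"],
    }
)

# Assumption: EnterDt = time reported, ResolvDt = time filled in.
# Convert the text timestamps into real datetimes so we can subtract them.
for col in ["EnterDt", "ResolvDt"]:
    sample[col] = pd.to_datetime(sample[col])

# Filled-in time minus report time gives a Timedelta for each pothole.
sample["wait"] = sample["ResolvDt"] - sample["EnterDt"]

# A plain number of days is easier to average and chart later on.
sample["wait_days"] = sample["wait"].dt.total_seconds() / 86400

print(sample[["A", "Street", "wait", "wait_days"]])
```

If you loaded the file with `pd.read_excel`, the date columns may already be datetimes, in which case the conversion step does nothing harmful.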
2.1.2 Data: Number of minorities in a neighborhood
“Number of minorities in a neighborhood” is a very, very vague statement, but it’s a fine starting point.
The Census Bureau collects information on who lives where, and releases data each and every year through the American Community Survey, so it sounds like it’ll be a good resource.
But neighborhoods? Unfortunately, they aren’t included on the Census, because in most places they’re vaguely defined.
Instead, the Census has a crazy set of different levels, including:
- Census blocks
- Block groups
- Census tracts
- ZIP code tabulation areas
- School districts
- Counties
- States
- …many, many more.
I’ll give you a secret: most of the time, your answer is going to be census tracts. They’re pretty small - but not too small - and most types of data are available for them.
If we use Social Explorer, one of the first questions we have to answer is which year we want our data for.
2.1.2.1 Picking a year (editorial choice)
Our pothole data is from 2007-2017, but I’m sure the city of Milwaukee has changed a lot during that time. Neighborhoods shift, people move in and out, and what an area was like in 2007 isn’t necessarily the same a decade later.
We have three options:
- Pick a year in the middle and assume it will, on average, be like 2007 and 2017.
- Download census data for each year, matching up the years between the potholes and the census data, and combining them all together.
- Pick one year and do it just for that one year.
Picking an average year seems like the easiest way to get the most pothole data into our analysis, but it’s really not the most responsible method. If we have pothole data from each year and census data from each year, we’d only really be doing this because we’re too lazy to match up the years!
It also isn’t very good from the perspective of finding interesting stories: imagine if a neighborhood were rapidly changing, with a lot of wealthy white people moving in, and suddenly potholes were getting filled much more quickly - wouldn’t you want to be able to notice that?
Downloading every year of data is definitely the most thorough approach, but there could be a downside depending on how we do our analysis. If we study all ten years at once, we might miss changes over time, or things that happened in smaller windows. For example, what if in the past 3 years the Department of Public Works became much, much worse at filling potholes? We don’t want to miss that one!
It’s also a lot of work!
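To see why lumping every year together can hide a change, here’s a small sketch with made-up numbers (these are not real Milwaukee data). It assumes the same EnterDt and ResolvDt columns as our pothole file; `wait_days` and `year` are names invented for the example.

```python
import pandas as pd

# Made-up rows for illustration only; not real Milwaukee data.
potholes = pd.DataFrame(
    {
        "EnterDt": pd.to_datetime(
            ["2008-03-01", "2008-06-10", "2013-04-02",
             "2013-09-15", "2016-05-20", "2016-11-01"]
        ),
        "ResolvDt": pd.to_datetime(
            ["2008-03-03", "2008-06-14", "2013-04-05",
             "2013-09-16", "2016-05-30", "2016-11-21"]
        ),
    }
)

# Days between report and fix, plus the year each request was filed.
potholes["wait_days"] = (
    potholes["ResolvDt"] - potholes["EnterDt"]
).dt.total_seconds() / 86400
potholes["year"] = potholes["EnterDt"].dt.year

# One number for the whole period...
overall_median = potholes["wait_days"].median()

# ...versus one number per year, which shows any shift over time.
median_by_year = potholes.groupby("year")["wait_days"].median()

print(overall_median)
print(median_by_year)
```

In this invented example the overall median looks unremarkable, while the per-year medians show the most recent year being far slower than the others.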
Picking just one year is a simple way to do the analysis, and allows you to expand to more years later on if you find anything interesting. Since we’re probably on a deadline, that’ll be our approach.
We’re going to pick the year 2013 for this walkthrough. You can always repeat the analysis for an earlier year, like 2007, or a more recent one, like 2017, if you want.
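Restricting the pothole data to a single year might look something like the sketch below. It assumes we decide a request “belongs” to the year it was reported (EnterDt) rather than the year it was fixed, which is itself a small editorial choice. The rows are made up for illustration.

```python
import pandas as pd

# Made-up report dates for illustration only.
requests = pd.DataFrame(
    {
        "EnterDt": pd.to_datetime(
            ["2012-12-30", "2013-01-02", "2013-12-29", "2014-01-03"]
        )
    }
)

# Keep only requests that were reported during 2013.
# A pothole reported in late December 2013 and filled in January 2014
# still counts as a 2013 request under this rule.
in_2013 = requests[requests["EnterDt"].dt.year == 2013]

print(len(in_2013))
```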
2.1.2.2 Picking a data table (editorial choice)
Since we’re looking at “number of minorities,” our first instinct is to use Table A03001: Race. If we look at the data in the table, though, it breaks the population down into these categories:
- White Alone
- Black or African-American Alone
- American Indian or Alaska Native Alone
- Asian Alone
- Native Hawaiian and Other Pacific Islander Alone
- Some Other Race Alone
- Two or More Races
Does this seem okay? It honestly depends on what you mean by “number of minorities” - is that just a way of saying “the Black population,” or “non-White people,” or something else altogether?
One thing you might notice is that this table doesn’t include Hispanic/Latinx as a breakdown. According to the Census Bureau, people of Hispanic origin can be any race. There’s a lot of interesting history regarding how “Hispanic” wound up on the Census form and race on the Census in general, but for now we’ll say we’re interested in that population, and we need to track it down.
There’s another table, right after our “Race” table, called Table A04001: Hispanic or Latino by Race. It breaks people down in a similar way to the race table, but also includes whether they identify as Hispanic/Latino:
- Not Hispanic or Latino
  - White Alone
  - Black or African-American Alone
  - American Indian or Alaska Native Alone
  - Asian Alone
  - Native Hawaiian and Other Pacific Islander Alone
  - Some Other Race Alone
  - Two or More Races
- Hispanic or Latino
  - White Alone
  - Black or African-American Alone
  - American Indian or Alaska Native Alone
  - Asian Alone
  - Native Hawaiian and Other Pacific Islander Alone
  - Some Other Race Alone
  - Two or More Races
This table seems a lot more useful if we’re looking for a definition of “minority” that includes Hispanic/Latinx populations. With this dataset, we’ll be examining non-Hispanic White Alone as white, and everyone else as a minority.
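As a rough sketch, that definition turns into arithmetic like this: minority population is the total population minus non-Hispanic White Alone. The column names below (`total_pop`, `nonhispanic_white_alone`) and the numbers are invented for the example; the real download uses coded column names that you’d look up in the data dictionary.

```python
import pandas as pd

# Made-up tracts and counts; column names are hypothetical stand-ins
# for the coded columns in the real download.
tracts = pd.DataFrame(
    {
        "tract": ["tract_a", "tract_b", "tract_c"],
        "total_pop": [4000, 2500, 0],
        "nonhispanic_white_alone": [3000, 500, 0],
    }
)

# Everyone who is not "Not Hispanic or Latino: White Alone" counts as
# a minority under the definition we chose above.
tracts["minority_pop"] = tracts["total_pop"] - tracts["nonhispanic_white_alone"]

# Tracts with zero residents would cause a divide-by-zero, so swap
# their population for NaN and let the percentage come out missing.
safe_total = tracts["total_pop"].where(tracts["total_pop"] > 0)
tracts["pct_minority"] = tracts["minority_pop"] / safe_total * 100

print(tracts)
```

Working with a percentage instead of a raw count makes tracts of different sizes comparable.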
2.1.2.3 Downloading our data
When you download your data, make sure you scroll to the bottom and get the Data Dictionary. The dataset itself is full of weird codes, and the data dictionary will enable us to understand them.
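Once you have the data dictionary, one way to tame the codes is to rename the columns you care about. The codes below (`X0001`, `X0003`) are placeholders invented for this sketch, not the real codes from the download; you’d copy the actual ones out of the data dictionary by hand.

```python
import pandas as pd

# Placeholder data: "X0001" and "X0003" are made-up codes standing in
# for whatever the data dictionary actually lists.
census = pd.DataFrame(
    {"GEO_ID": ["tract_a", "tract_b"], "X0001": [4000, 2500], "X0003": [3000, 500]}
)

# Mapping typed in from the data dictionary: code -> readable name.
code_to_name = {
    "X0001": "total_pop",
    "X0003": "nonhispanic_white_alone",
}

census = census.rename(columns=code_to_name)

# List any columns we have not given a readable name yet.
unmapped = [col for col in census.columns if col not in code_to_name.values()]

print(census.columns.tolist())
print(unmapped)
```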