2.5 Final prep for analysis

Let’s go back to our question: “in Milwaukee, is there a relationship between how many minorities are in a neighborhood, and how long it takes to fix potholes there?”

We have two things we’re measuring:

  1. The amount of minorities
  2. How long it takes to fix potholes

Before we do our analysis, we’ll need to see how (or if) our dataset is able to represent these concepts. Let’s start by reading in our final (?) merged dataset and taking a look at what we have.

df = pd.read_csv("data/2013-complete.csv")
df.head()
EnterDt PrintDt ResolvDt address GEOID Geo_FIPS SE_A04001_001 SE_A04001_002 SE_A04001_003 SE_A04001_004 SE_A04001_005 SE_A04001_006 SE_A04001_007 SE_A04001_008 SE_A04001_009 SE_A04001_010 SE_A04001_011 SE_A04001_012 SE_A04001_013 SE_A04001_014 SE_A04001_015 SE_A04001_016 SE_A04001_017
2013-07-15 23:35 2013-07-16 05:46 2013-07-17 05:50 3839 N 10TH ST 55079004500 55079004500 3160 3053 76 2854 1 30 0 8 84 107 16 17 0 0 0 74 0
2013-07-15 20:05 2013-07-16 05:46 2013-07-24 16:58 4900 W MELVINA ST 55079003800 55079003800 2323 2276 205 2007 0 0 0 0 64 47 0 0 0 0 0 47 0
2013-07-15 20:00 2013-07-16 05:56 2013-07-25 14:42 2400 W WISCONSIN AV 55079014900 55079014900 1275 1113 514 538 0 50 0 0 11 162 148 3 0 0 0 11 0
2013-07-15 19:55 2013-07-16 05:46 2013-07-18 06:06 1800 W HAMPTON AV 55079002300 55079002300 4078 4063 179 3765 0 38 0 0 81 15 8 0 0 0 0 0 7
2013-07-15 19:50 2013-07-16 05:46 2013-08-02 06:08 4718 N 19TH ST 55079002300 55079002300 4078 4063 179 3765 0 38 0 0 81 15 8 0 0 0 0 0 7

A lot of those columns are weird census codes, but we’ll let them stay for now - we can always look them up in the census table’s data dictionary later.

2.5.1 “The amount of minorities”

First, what counts as “minorities?”

We decided that everyone whose race is not White - Asian, African-American, etc - gets counted as a minority, along with everyone who is marked as Hispanic/Latino, even if they’re White. So we’re looking specifically for White non-Hispanic.

If we open up the data dictionary - R12217898.txt - we can see that A04001_003 is the column for our White non-Hispanic population. In fact it’s SE_A04001_003 thanks to Social Explorer, but you get the idea.
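If you’d rather search the dictionary from Python than open it by hand, a tiny helper does the trick. This is just a sketch - the function name and the two-line sample dictionary are made up, and on the real file you’d pass in the contents of R12217898.txt instead:

```python
def find_in_dictionary(text, code):
    """Return every line of a data dictionary that mentions a column code."""
    return [line for line in text.splitlines() if code in line]

# A made-up two-line dictionary, just to show the shape of the output
sample = "A04001_001: Total Population\nA04001_003: Not Hispanic or Latino: White Alone"
print(find_in_dictionary(sample, "A04001_003"))
```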

Now, in our sentence we said “the amount of minorities” - but is this raw number really what we want? No way! We want a percentage of minorities. We could add up each and every one of those other columns, or we could just… calculate the percent of non-Hispanic Whites instead. It’s easier, right?

According to our data dictionary A04001_001 is our total population and A04001_003 is our non-Hispanic white population, so simple division will get us what we need. After the computation, we can drop all of the other columns that came along with the census.

# Calculate
df['pct_white'] = df.SE_A04001_003 / df.SE_A04001_001

# Drop the extra columns
se_columns = df.columns[df.columns.str.contains("SE_")]
df = df.drop(columns=se_columns)

df.head()
##             EnterDt           PrintDt          ResolvDt              address        GEOID     Geo_FIPS  pct_white
## 0  2013-07-15 23:35  2013-07-16 05:46  2013-07-17 05:50       3839 N 10TH ST  55079004500  55079004500  0.0240506
## 1  2013-07-15 20:05  2013-07-16 05:46  2013-07-24 16:58    4900 W MELVINA ST  55079003800  55079003800  0.0882480
## 2  2013-07-15 20:00  2013-07-16 05:56  2013-07-25 14:42  2400 W WISCONSIN AV  55079014900  55079014900  0.4031373
## 3  2013-07-15 19:55  2013-07-16 05:46  2013-07-18 06:06    1800 W HAMPTON AV  55079002300  55079002300  0.0438941
## 4  2013-07-15 19:50  2013-07-16 05:46  2013-08-02 06:08       4718 N 19TH ST  55079002300  55079002300  0.0438941

This calculation gives us a number between 0 and 1 for the percent of non-Hispanic Whites, so 0.25 would be 25% and 0.75 would be 75%. Nothing too crazy, and we’ll leave it alone for now.

If we really want the percent of minorities, we can just subtract the percent white from 1. We might use it later, so let’s do that calculation now.

df['pct_minority'] = 1 - df.pct_white
df.head()
##             EnterDt           PrintDt          ResolvDt              address        GEOID     Geo_FIPS  pct_white  pct_minority
## 0  2013-07-15 23:35  2013-07-16 05:46  2013-07-17 05:50       3839 N 10TH ST  55079004500  55079004500  0.0240506     0.9759494
## 1  2013-07-15 20:05  2013-07-16 05:46  2013-07-24 16:58    4900 W MELVINA ST  55079003800  55079003800  0.0882480     0.9117520
## 2  2013-07-15 20:00  2013-07-16 05:56  2013-07-25 14:42  2400 W WISCONSIN AV  55079014900  55079014900  0.4031373     0.5968627
## 3  2013-07-15 19:55  2013-07-16 05:46  2013-07-18 06:06    1800 W HAMPTON AV  55079002300  55079002300  0.0438941     0.9561059
## 4  2013-07-15 19:50  2013-07-16 05:46  2013-08-02 06:08       4718 N 19TH ST  55079002300  55079002300  0.0438941     0.9561059
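Since pct_minority is defined as 1 minus pct_white, the two columns should always sum to one. A quick sanity check, using a couple of pct_white values copied from the table above (rather than the full dataset):

```python
import pandas as pd

# Two pct_white values from the rows above
check = pd.DataFrame({"pct_white": [0.0240506, 0.4031373]})
check["pct_minority"] = 1 - check.pct_white

# Allow for floating-point wiggle room rather than testing exact equality
assert ((check.pct_white + check.pct_minority - 1).abs() < 1e-9).all()
print("pct_white + pct_minority sums to 1 for every row")
```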

2.5.2 “How long it takes to fix potholes”

The time it takes to fix potholes seems pretty easy - it’s just the difference between when it got filed and when it got fixed, right? First we’ll convert the columns to datetimes, then we’ll subtract them.

df['EnterDt'] = pd.to_datetime(df.EnterDt)
df['ResolvDt'] = pd.to_datetime(df.ResolvDt)
df['wait_time'] = df.ResolvDt - df.EnterDt
df[['EnterDt', 'ResolvDt', 'wait_time']].head()
##               EnterDt            ResolvDt        wait_time
## 0 2013-07-15 23:35:00 2013-07-17 05:50:00  1 days 06:15:00
## 1 2013-07-15 20:05:00 2013-07-24 16:58:00  8 days 20:53:00
## 2 2013-07-15 20:00:00 2013-07-25 14:42:00  9 days 18:42:00
## 3 2013-07-15 19:55:00 2013-07-18 06:06:00  2 days 10:11:00
## 4 2013-07-15 19:50:00 2013-08-02 06:08:00 17 days 10:18:00

The problem in this case isn’t the result - it looks fine - it’s that we actually need a number. So we need to convert “1 days, 6 hours, 4 minutes and 34 seconds” into some sort of decimal.

Besides the technical question, there’s an emotional one: do we talk about this in terms of hours, or in terms of days? It seems to me that saying “It took 2 days longer to fix a pothole” sounds more natural than “it took 48 hours longer,” so let’s go with days.

df['wait_days'] = df.wait_time.dt.days
df.head(5)
##               EnterDt           PrintDt            ResolvDt  \
## 0 2013-07-15 23:35:00  2013-07-16 05:46 2013-07-17 05:50:00   
## 1 2013-07-15 20:05:00  2013-07-16 05:46 2013-07-24 16:58:00   
## 2 2013-07-15 20:00:00  2013-07-16 05:56 2013-07-25 14:42:00   
## 3 2013-07-15 19:55:00  2013-07-16 05:46 2013-07-18 06:06:00   
## 4 2013-07-15 19:50:00  2013-07-16 05:46 2013-08-02 06:08:00   
## 
##                address         GEOID      Geo_FIPS  pct_white  pct_minority  \
## 0       3839 N 10TH ST  5.507900e+10  5.507900e+10   0.024051      0.975949   
## 1    4900 W MELVINA ST  5.507900e+10  5.507900e+10   0.088248      0.911752   
## 2  2400 W WISCONSIN AV  5.507901e+10  5.507901e+10   0.403137      0.596863   
## 3    1800 W HAMPTON AV  5.507900e+10  5.507900e+10   0.043894      0.956106   
## 4       4718 N 19TH ST  5.507900e+10  5.507900e+10   0.043894      0.956106   
## 
##          wait_time  wait_days  
## 0  1 days 06:15:00        1.0  
## 1  8 days 20:53:00        8.0  
## 2  9 days 18:42:00        9.0  
## 3  2 days 10:11:00        2.0  
## 4 17 days 10:18:00       17.0

The problem is that .dt.days truncates to whole days - we lose all of the time in between, and everything comes out as a whole number: 3 days, 5 days, etc. Fractions of days might be a little better, but it means we can’t use the nice and friendly .dt.days. Instead we’ll need to use .dt.components['days'] and the like, which is a little slower.

# Hours get divided by 24, so 12 hours is 0.5 days (minutes and seconds are ignored)
df['wait_days'] = df.wait_time.dt.components['days'] + (df.wait_time.dt.components['hours'] / 24)
df.head(5)
##               EnterDt           PrintDt            ResolvDt  \
## 0 2013-07-15 23:35:00  2013-07-16 05:46 2013-07-17 05:50:00   
## 1 2013-07-15 20:05:00  2013-07-16 05:46 2013-07-24 16:58:00   
## 2 2013-07-15 20:00:00  2013-07-16 05:56 2013-07-25 14:42:00   
## 3 2013-07-15 19:55:00  2013-07-16 05:46 2013-07-18 06:06:00   
## 4 2013-07-15 19:50:00  2013-07-16 05:46 2013-08-02 06:08:00   
## 
##                address         GEOID      Geo_FIPS  pct_white  pct_minority  \
## 0       3839 N 10TH ST  5.507900e+10  5.507900e+10   0.024051      0.975949   
## 1    4900 W MELVINA ST  5.507900e+10  5.507900e+10   0.088248      0.911752   
## 2  2400 W WISCONSIN AV  5.507901e+10  5.507901e+10   0.403137      0.596863   
## 3    1800 W HAMPTON AV  5.507900e+10  5.507900e+10   0.043894      0.956106   
## 4       4718 N 19TH ST  5.507900e+10  5.507900e+10   0.043894      0.956106   
## 
##          wait_time  wait_days  
## 0  1 days 06:15:00   1.250000  
## 1  8 days 20:53:00   8.833333  
## 2  9 days 18:42:00   9.750000  
## 3  2 days 10:11:00   2.416667  
## 4 17 days 10:18:00  17.416667
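For what it’s worth, .dt.components only gets you as fine-grained as the pieces you add up - the version above drops the minutes and seconds. If that ever matters, dividing .dt.total_seconds() by 86,400 (the number of seconds in a day) gives the exact fraction in one step. A small sketch on a hand-built timedelta series matching our first two rows:

```python
import pandas as pd

# Two example wait times matching the first rows above
wait = pd.to_timedelta(["1 days 06:15:00", "8 days 20:53:00"])

# 24 hours * 60 minutes * 60 seconds = 86400 seconds per day
days_exact = wait.total_seconds() / 86400
print(days_exact)
```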

Now we have our dataset ready for analysis, with each row having a time that it took to fix the potholes and a number describing the demographics of that census tract.

2.5.3 Removing missing data

When we’re doing our final analysis, we want to make sure we’re only working with data that actually exists. Not only does it make sense to skip missing data, but missing data also causes a lot of analysis packages to throw errors!

We’ll get rid of any rows that are missing the fields we’re interested in - the race percentages, along with the wait in days. We’ll compare the number of rows before and after to make sure we don’t lose too many.

df.shape
## (12836, 10)
df = df.dropna(subset=['pct_white', 'wait_days'])
df.shape
## (12783, 10)

If you end up with a lot of rows that disappear between these two stages, you should take a look to see if something might have gone wrong with your data. In this case we didn’t lose too many, though, so we should be okay.
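If the gap had been bigger, a useful first step is counting the missing values in each column, so you know which field is responsible. A sketch on a tiny made-up frame (on the real data you’d run the same .isna().sum() against df):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column of interest
sample = pd.DataFrame({
    "pct_white": [0.02, np.nan, 0.40],
    "wait_days": [1.25, 8.83, np.nan],
})

# How many NaNs does each column we care about have?
missing = sample[["pct_white", "wait_days"]].isna().sum()
print(missing)
```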

2.5.3.1 One more cleaning step

While we’re fixing up our data, I’m going to remove the datetime fields we created. I normally wouldn’t do this, but they’re actually messing with my publishing system!

df = df.drop(columns=['EnterDt','PrintDt','ResolvDt','wait_time'])