2.5 Final prep for analysis

Let’s go back to our question: “in Milwaukee, is there a relationship between how many minorities are in a neighborhood, and how long it takes to fix potholes there?”

We have two things we’re measuring:

  1. The amount of minorities
  2. How long it takes to fix potholes

Before we do our analysis, we’ll need to see how (or if) our dataset is able to represent these concepts. Let’s start by reading in our final (?) merged dataset and taking a look at what we have.

df = pd.read_csv("data/2013-complete.csv")
df.head()
EnterDt PrintDt ResolvDt address GEOID Geo_FIPS SE_A04001_001 SE_A04001_002 SE_A04001_003 SE_A04001_004 SE_A04001_005 SE_A04001_006 SE_A04001_007 SE_A04001_008 SE_A04001_009 SE_A04001_010 SE_A04001_011 SE_A04001_012 SE_A04001_013 SE_A04001_014 SE_A04001_015 SE_A04001_016 SE_A04001_017
2013-07-15 23:35 2013-07-16 05:46 2013-07-17 05:50 3839 N 10TH ST 55079004500 55079004500 3160 3053 76 2854 1 30 0 8 84 107 16 17 0 0 0 74 0
2013-07-15 20:05 2013-07-16 05:46 2013-07-24 16:58 4900 W MELVINA ST 55079003800 55079003800 2323 2276 205 2007 0 0 0 0 64 47 0 0 0 0 0 47 0
2013-07-15 20:00 2013-07-16 05:56 2013-07-25 14:42 2400 W WISCONSIN AV 55079014900 55079014900 1275 1113 514 538 0 50 0 0 11 162 148 3 0 0 0 11 0
2013-07-15 19:55 2013-07-16 05:46 2013-07-18 06:06 1800 W HAMPTON AV 55079002300 55079002300 4078 4063 179 3765 0 38 0 0 81 15 8 0 0 0 0 0 7
2013-07-15 19:50 2013-07-16 05:46 2013-08-02 06:08 4718 N 19TH ST 55079002300 55079002300 4078 4063 179 3765 0 38 0 0 81 15 8 0 0 0 0 0 7

A lot of those columns are weird census codes, but we’ll let them stay for now - we can always look them up in the census table’s data dictionary later.

2.5.1 “The amount of minorities”

First, what counts as “minorities?”

We decided that everyone whose race is not White - Asian, African-American, etc - gets counted as a minority, along with everyone who is marked as Hispanic/Latino, even if they’re White. So we’re looking specifically for White non-Hispanic.

If we open up the data dictionary - R12217898.txt - we can see that A04001_003 is the column for our White non-Hispanic population. In fact it’s SE_A04001_003 thanks to Social Explorer, but you get the idea.
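If you’d rather search the dictionary from Python than open it by hand, a tiny helper does the trick. This is just a sketch - the function name and the two-line sample dictionary are made up, and on the real file you’d pass in the contents of R12217898.txt instead:

```python
def find_in_dictionary(text, code):
    """Return every line of a data dictionary that mentions a column code."""
    return [line for line in text.splitlines() if code in line]

# A made-up two-line dictionary, just to show the shape of the output
sample = "A04001_001: Total Population\nA04001_003: Not Hispanic or Latino: White Alone"
print(find_in_dictionary(sample, "A04001_003"))
```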

Now, in our sentence we said “the amount of minorities” - but is this raw number really what we want? No way! We want a percentage of minorities. We could add up each and every one of those other columns, or we could just… calculate the percent of non-Hispanic Whites instead. It’s easier, right?

According to our data dictionary A04001_001 is our total population and A04001_003 is our non-Hispanic white population, so simple division will get us what we need. After the computation, we can drop all of the other columns that came along with the census.

# Calculate
df['pct_white'] = df.SE_A04001_003 / df.SE_A04001_001

# Drop the extra columns
se_columns = df.columns[df.columns.str.contains("SE_")]
df = df.drop(columns=se_columns)

df.head()
##             EnterDt           PrintDt          ResolvDt              address        GEOID     Geo_FIPS  pct_white
## 0  2013-07-15 23:35  2013-07-16 05:46  2013-07-17 05:50       3839 N 10TH ST  55079004500  55079004500  0.0240506
## 1  2013-07-15 20:05  2013-07-16 05:46  2013-07-24 16:58    4900 W MELVINA ST  55079003800  55079003800  0.0882480
## 2  2013-07-15 20:00  2013-07-16 05:56  2013-07-25 14:42  2400 W WISCONSIN AV  55079014900  55079014900  0.4031373
## 3  2013-07-15 19:55  2013-07-16 05:46  2013-07-18 06:06    1800 W HAMPTON AV  55079002300  55079002300  0.0438941
## 4  2013-07-15 19:50  2013-07-16 05:46  2013-08-02 06:08       4718 N 19TH ST  55079002300  55079002300  0.0438941

This calculation gives us a number between 0 and 1 for the percent of non-Hispanic Whites, so 0.25 would be 25% and 0.75 would be 75%. Nothing too crazy, and we’ll leave it alone for now.

If we really want the percent of minorities, we can just subtract the percent white from 1. We might use it later, so let’s do that calculation now.

df['pct_minority'] = 1 - df.pct_white
df.head()
##             EnterDt           PrintDt          ResolvDt              address        GEOID     Geo_FIPS  pct_white  pct_minority
## 0  2013-07-15 23:35  2013-07-16 05:46  2013-07-17 05:50       3839 N 10TH ST  55079004500  55079004500  0.0240506     0.9759494
## 1  2013-07-15 20:05  2013-07-16 05:46  2013-07-24 16:58    4900 W MELVINA ST  55079003800  55079003800  0.0882480     0.9117520
## 2  2013-07-15 20:00  2013-07-16 05:56  2013-07-25 14:42  2400 W WISCONSIN AV  55079014900  55079014900  0.4031373     0.5968627
## 3  2013-07-15 19:55  2013-07-16 05:46  2013-07-18 06:06    1800 W HAMPTON AV  55079002300  55079002300  0.0438941     0.9561059
## 4  2013-07-15 19:50  2013-07-16 05:46  2013-08-02 06:08       4718 N 19TH ST  55079002300  55079002300  0.0438941     0.9561059
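Since pct_minority is defined as 1 minus pct_white, the two columns should always sum to one. A quick sanity check, using a couple of pct_white values copied from the table above (rather than the full dataset):

```python
import pandas as pd

# Two pct_white values from the rows above
check = pd.DataFrame({"pct_white": [0.0240506, 0.4031373]})
check["pct_minority"] = 1 - check.pct_white

# Allow for floating-point wiggle room rather than testing exact equality
assert ((check.pct_white + check.pct_minority - 1).abs() < 1e-9).all()
print("pct_white + pct_minority sums to 1 for every row")
```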

2.5.2 “How long it takes to fix potholes”

The time it takes to fix potholes seems pretty easy - it’s just the difference between when it got filed and when it got fixed, right? First we’ll convert the columns to datetimes, then we’ll subtract them.

df['EnterDt'] = pd.to_datetime(df.EnterDt)
df['ResolvDt'] = pd.to_datetime(df.ResolvDt)
df['wait_time'] = df.ResolvDt - df.EnterDt
df[['EnterDt', 'ResolvDt', 'wait_time']].head()
##               EnterDt            ResolvDt        wait_time
## 0 2013-07-15 23:35:00 2013-07-17 05:50:00  1 days 06:15:00
## 1 2013-07-15 20:05:00 2013-07-24 16:58:00  8 days 20:53:00
## 2 2013-07-15 20:00:00 2013-07-25 14:42:00  9 days 18:42:00
## 3 2013-07-15 19:55:00 2013-07-18 06:06:00  2 days 10:11:00
## 4 2013-07-15 19:50:00 2013-08-02 06:08:00 17 days 10:18:00

The problem in this case isn’t the result - it looks fine - it’s that we actually need a number. So we need to convert “1 days, 6 hours, 4 minutes and 34 seconds” into some sort of decimal.

Besides the technical question, there’s an emotional one: do we talk about this in terms of hours, or in terms of days? It seems to me that saying “It took 2 days longer to fix a pothole” sounds more natural than “it took 48 hours longer,” so let’s go with days.

df['wait_days'] = df.wait_time.dt.days
df.head(5)
##               EnterDt           PrintDt            ResolvDt  \
## 0 2013-07-15 23:35:00  2013-07-16 05:46 2013-07-17 05:50:00   
## 1 2013-07-15 20:05:00  2013-07-16 05:46 2013-07-24 16:58:00   
## 2 2013-07-15 20:00:00  2013-07-16 05:56 2013-07-25 14:42:00   
## 3 2013-07-15 19:55:00  2013-07-16 05:46 2013-07-18 06:06:00   
## 4 2013-07-15 19:50:00  2013-07-16 05:46 2013-08-02 06:08:00   
## 
##                address         GEOID      Geo_FIPS  pct_white  pct_minority  \
## 0       3839 N 10TH ST  5.507900e+10  5.507900e+10   0.024051      0.975949   
## 1    4900 W MELVINA ST  5.507900e+10  5.507900e+10   0.088248      0.911752   
## 2  2400 W WISCONSIN AV  5.507901e+10  5.507901e+10   0.403137      0.596863   
## 3    1800 W HAMPTON AV  5.507900e+10  5.507900e+10   0.043894      0.956106   
## 4       4718 N 19TH ST  5.507900e+10  5.507900e+10   0.043894      0.956106   
## 
##          wait_time  wait_days  
## 0  1 days 06:15:00        1.0  
## 1  8 days 20:53:00        8.0  
## 2  9 days 18:42:00        9.0  
## 3  2 days 10:11:00        2.0  
## 4 17 days 10:18:00       17.0

The problem is that .dt.days truncates to whole days - we lose all of the time in between, and everything comes out as a whole number: 3 days, 5 days, etc. Fractions of days might be a little better, but it means we can’t use the nice and friendly .dt.days. Instead we’ll need to use .dt.components['days'] and the like, which is a little slower.

# Hours get divided by 24, so 12 hours is 0.5 days (minutes and seconds are ignored)
df['wait_days'] = df.wait_time.dt.components['days'] + (df.wait_time.dt.components['hours'] / 24)
df.head(5)
##               EnterDt           PrintDt            ResolvDt  \
## 0 2013-07-15 23:35:00  2013-07-16 05:46 2013-07-17 05:50:00   
## 1 2013-07-15 20:05:00  2013-07-16 05:46 2013-07-24 16:58:00   
## 2 2013-07-15 20:00:00  2013-07-16 05:56 2013-07-25 14:42:00   
## 3 2013-07-15 19:55:00  2013-07-16 05:46 2013-07-18 06:06:00   
## 4 2013-07-15 19:50:00  2013-07-16 05:46 2013-08-02 06:08:00   
## 
##                address         GEOID      Geo_FIPS  pct_white  pct_minority  \
## 0       3839 N 10TH ST  5.507900e+10  5.507900e+10   0.024051      0.975949   
## 1    4900 W MELVINA ST  5.507900e+10  5.507900e+10   0.088248      0.911752   
## 2  2400 W WISCONSIN AV  5.507901e+10  5.507901e+10   0.403137      0.596863   
## 3    1800 W HAMPTON AV  5.507900e+10  5.507900e+10   0.043894      0.956106   
## 4       4718 N 19TH ST  5.507900e+10  5.507900e+10   0.043894      0.956106   
## 
##          wait_time  wait_days  
## 0  1 days 06:15:00   1.250000  
## 1  8 days 20:53:00   8.833333  
## 2  9 days 18:42:00   9.750000  
## 3  2 days 10:11:00   2.416667  
## 4 17 days 10:18:00  17.416667
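For what it’s worth, .dt.components only gets you as fine-grained as the pieces you add up - the version above drops the minutes and seconds. If that ever matters, dividing .dt.total_seconds() by 86,400 (the number of seconds in a day) gives the exact fraction in one step. A small sketch on a hand-built timedelta series matching our first two rows:

```python
import pandas as pd

# Two example wait times matching the first rows above
wait = pd.to_timedelta(["1 days 06:15:00", "8 days 20:53:00"])

# 24 hours * 60 minutes * 60 seconds = 86400 seconds per day
days_exact = wait.total_seconds() / 86400
print(days_exact)
```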

Now we have our dataset ready for analysis, with each row having a time that it took to fix the potholes and a number describing the demographics of that census tract.

2.5.3 Removing missing data

When we’re doing our final analysis, we want to make sure we’re only working with data that actually exists. Not only does it make sense to skip missing data, but missing data also causes a lot of analysis packages to throw errors!

We’ll get rid of any rows that are missing the fields we’re interested in - the race percentages, along with the wait in days. We’ll compare the number of rows before and after to make sure we don’t lose too many.

df.shape
## (12836, 10)
df = df.dropna(subset=['pct_white', 'wait_days'])
df.shape
## (12783, 10)

If you end up with a lot of rows that disappear between these two stages, you should take a look to see if something might have gone wrong with your data. In this case we didn’t lose too many, though, so we should be okay.
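If the gap had been bigger, a useful first step is counting the missing values in each column, so you know which field is responsible. A sketch on a tiny made-up frame (on the real data you’d run the same .isna().sum() against df):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column of interest
sample = pd.DataFrame({
    "pct_white": [0.02, np.nan, 0.40],
    "wait_days": [1.25, 8.83, np.nan],
})

# How many NaNs does each column we care about have?
missing = sample[["pct_white", "wait_days"]].isna().sum()
print(missing)
```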

2.5.3.1 One more cleaning step

While we’re fixing up our data, I’m going to remove the datetime fields we created. I normally wouldn’t do this, but they’re actually messing with my publishing system!

df = df.drop(columns=['EnterDt','PrintDt','ResolvDt','wait_time'])