2.5 Final prep for analysis
Let’s go back to our question: “in Milwaukee, is there a relationship between how many minorities are in a neighborhood, and how long it takes to fix potholes there?”
We have two things we’re measuring:
- The amount of minorities
- How long it takes to fix potholes
Before we do our analysis, we’ll need to see how (or if) our dataset is able to represent these concepts. Before we dive in, let’s read in our final (?) merged dataset and take a look at what we have.
EnterDt | PrintDt | ResolvDt | address | GEOID | Geo_FIPS | SE_A04001_001 | SE_A04001_002 | SE_A04001_003 | SE_A04001_004 | SE_A04001_005 | SE_A04001_006 | SE_A04001_007 | SE_A04001_008 | SE_A04001_009 | SE_A04001_010 | SE_A04001_011 | SE_A04001_012 | SE_A04001_013 | SE_A04001_014 | SE_A04001_015 | SE_A04001_016 | SE_A04001_017 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013-07-15 23:35 | 2013-07-16 05:46 | 2013-07-17 05:50 | 3839 N 10TH ST | 55079004500 | 55079004500 | 3160 | 3053 | 76 | 2854 | 1 | 30 | 0 | 8 | 84 | 107 | 16 | 17 | 0 | 0 | 0 | 74 | 0 |
2013-07-15 20:05 | 2013-07-16 05:46 | 2013-07-24 16:58 | 4900 W MELVINA ST | 55079003800 | 55079003800 | 2323 | 2276 | 205 | 2007 | 0 | 0 | 0 | 0 | 64 | 47 | 0 | 0 | 0 | 0 | 0 | 47 | 0 |
2013-07-15 20:00 | 2013-07-16 05:56 | 2013-07-25 14:42 | 2400 W WISCONSIN AV | 55079014900 | 55079014900 | 1275 | 1113 | 514 | 538 | 0 | 50 | 0 | 0 | 11 | 162 | 148 | 3 | 0 | 0 | 0 | 11 | 0 |
2013-07-15 19:55 | 2013-07-16 05:46 | 2013-07-18 06:06 | 1800 W HAMPTON AV | 55079002300 | 55079002300 | 4078 | 4063 | 179 | 3765 | 0 | 38 | 0 | 0 | 81 | 15 | 8 | 0 | 0 | 0 | 0 | 0 | 7 |
2013-07-15 19:50 | 2013-07-16 05:46 | 2013-08-02 06:08 | 4718 N 19TH ST | 55079002300 | 55079002300 | 4078 | 4063 | 179 | 3765 | 0 | 38 | 0 | 0 | 81 | 15 | 8 | 0 | 0 | 0 | 0 | 0 | 7 |
A lot of those columns are weird census codes, but we’ll let them stay for now - we can always look them up in the census table’s data dictionary later.
2.5.1 “The amount of minorities”
First, what counts as “minorities?”
We decided that everyone whose race is not White - Asian, African-American, etc - gets counted as a minority, along with everyone who is marked as Hispanic/Latino, even if they’re White. So we’re looking specifically for White non-Hispanic.
If we open up the data dictionary - R12217898.txt
- we can see that A04001_003
is the column of our White Non-Hispanic. In fact it’s SE_A04001_003
thanks to Social Explorer, but you get the idea.
Now, in our sentence we said “the amount of minorities” - but is this raw number really what we want? No way! We want a percentage of minorities. We could add up each and every one of those other columns, or we could just… calculate the percent of non-Hispanic Whites instead. It’s easier, right?
According to our data dictionary A04001_001
is our total population and A04001_003
is our non-Hispanic white population, so simple division will get us what we need. After the computation, we can drop all of the other columns that came along with the census.
# Calculate
df['pct_white'] = df.SE_A04001_003 / df.SE_A04001_001
# Drop the extra columns
se_columns = df.columns[df.columns.str.contains("SE_")]
df = df.drop(columns=se_columns)
df.head()
EnterDt | PrintDt | ResolvDt | address | GEOID | Geo_FIPS | pct_white |
---|---|---|---|---|---|---|
2013-07-15 23:35 | 2013-07-16 05:46 | 2013-07-17 05:50 | 3839 N 10TH ST | 55079004500 | 55079004500 | 0.0240506 |
2013-07-15 20:05 | 2013-07-16 05:46 | 2013-07-24 16:58 | 4900 W MELVINA ST | 55079003800 | 55079003800 | 0.0882480 |
2013-07-15 20:00 | 2013-07-16 05:56 | 2013-07-25 14:42 | 2400 W WISCONSIN AV | 55079014900 | 55079014900 | 0.4031373 |
2013-07-15 19:55 | 2013-07-16 05:46 | 2013-07-18 06:06 | 1800 W HAMPTON AV | 55079002300 | 55079002300 | 0.0438941 |
2013-07-15 19:50 | 2013-07-16 05:46 | 2013-08-02 06:08 | 4718 N 19TH ST | 55079002300 | 55079002300 | 0.0438941 |
This calculation gives us a number 0-1 for percent of non-Hispanic Whites, so 0.25 would be 25% and 0.75 would be 75%. Nothing too crazy, and we’ll leave it alone for now.
If we really want the percent of minorities, we can just subtract the percent white from 1. We might use it later, so let’s do that calculation now.
EnterDt | PrintDt | ResolvDt | address | GEOID | Geo_FIPS | pct_white | pct_minority |
---|---|---|---|---|---|---|---|
2013-07-15 23:35 | 2013-07-16 05:46 | 2013-07-17 05:50 | 3839 N 10TH ST | 55079004500 | 55079004500 | 0.0240506 | 0.9759494 |
2013-07-15 20:05 | 2013-07-16 05:46 | 2013-07-24 16:58 | 4900 W MELVINA ST | 55079003800 | 55079003800 | 0.0882480 | 0.9117520 |
2013-07-15 20:00 | 2013-07-16 05:56 | 2013-07-25 14:42 | 2400 W WISCONSIN AV | 55079014900 | 55079014900 | 0.4031373 | 0.5968627 |
2013-07-15 19:55 | 2013-07-16 05:46 | 2013-07-18 06:06 | 1800 W HAMPTON AV | 55079002300 | 55079002300 | 0.0438941 | 0.9561059 |
2013-07-15 19:50 | 2013-07-16 05:46 | 2013-08-02 06:08 | 4718 N 19TH ST | 55079002300 | 55079002300 | 0.0438941 | 0.9561059 |
2.5.2 “How long it takes to fix potholes”
The time it takes to fix potholes seems pretty easy - it’s just the difference between when it got filed and when it got fixed, right? First we’ll convert the columns to datetimes, then we’ll subtract them.
df['EnterDt'] = pd.to_datetime(df.EnterDt)
df['ResolvDt'] = pd.to_datetime(df.ResolvDt)
df['wait_time'] = df.ResolvDt - df.EnterDt
df[['EnterDt', 'ResolvDt', 'wait_time']].head()
## EnterDt ResolvDt wait_time
## 0 2013-07-15 23:35:00 2013-07-17 05:50:00 1 days 06:15:00
## 1 2013-07-15 20:05:00 2013-07-24 16:58:00 8 days 20:53:00
## 2 2013-07-15 20:00:00 2013-07-25 14:42:00 9 days 18:42:00
## 3 2013-07-15 19:55:00 2013-07-18 06:06:00 2 days 10:11:00
## 4 2013-07-15 19:50:00 2013-08-02 06:08:00 17 days 10:18:00
The problem in this case isn’t the result - it looks fine - it’s that we actually need a number. So we need to convert “1 days, 6 hours, 4 minutes and 34 seconds” into some sort of decimal.
Besides the technical question, there’s an emotional one: do we talk about this in terms of hours, or in terms of days? It seems to me that saying “It took 2 days longer to fix a pothole” sounds more natural than “it took 48 hours longer,” so let’s go with days.
## EnterDt PrintDt ResolvDt \
## 0 2013-07-15 23:35:00 2013-07-16 05:46 2013-07-17 05:50:00
## 1 2013-07-15 20:05:00 2013-07-16 05:46 2013-07-24 16:58:00
## 2 2013-07-15 20:00:00 2013-07-16 05:56 2013-07-25 14:42:00
## 3 2013-07-15 19:55:00 2013-07-16 05:46 2013-07-18 06:06:00
## 4 2013-07-15 19:50:00 2013-07-16 05:46 2013-08-02 06:08:00
##
## address GEOID Geo_FIPS pct_white pct_minority \
## 0 3839 N 10TH ST 5.507900e+10 5.507900e+10 0.024051 0.975949
## 1 4900 W MELVINA ST 5.507900e+10 5.507900e+10 0.088248 0.911752
## 2 2400 W WISCONSIN AV 5.507901e+10 5.507901e+10 0.403137 0.596863
## 3 1800 W HAMPTON AV 5.507900e+10 5.507900e+10 0.043894 0.956106
## 4 4718 N 19TH ST 5.507900e+10 5.507900e+10 0.043894 0.956106
##
## wait_time wait_days
## 0 1 days 06:15:00 1.0
## 1 8 days 20:53:00 8.0
## 2 9 days 18:42:00 9.0
## 3 2 days 10:11:00 2.0
## 4 17 days 10:18:00 17.0
The problem is that we’re really missing all of the time between days here, and it’s all whole numbers - 3 days, 5 days, etc. Fractions of days might be a little better, but it means we can’t use the nice and friendly .dt.days
. Instead we’ll need to use .dt.components['days']
and the like, which is a little slower.
# Hours get divided by 24, so 12 hours is 0.5 days
df['wait_days'] = df.wait_time.dt.components['days'] + (df.wait_time.dt.components['hours'] / 24)
df.head(5)
## EnterDt PrintDt ResolvDt \
## 0 2013-07-15 23:35:00 2013-07-16 05:46 2013-07-17 05:50:00
## 1 2013-07-15 20:05:00 2013-07-16 05:46 2013-07-24 16:58:00
## 2 2013-07-15 20:00:00 2013-07-16 05:56 2013-07-25 14:42:00
## 3 2013-07-15 19:55:00 2013-07-16 05:46 2013-07-18 06:06:00
## 4 2013-07-15 19:50:00 2013-07-16 05:46 2013-08-02 06:08:00
##
## address GEOID Geo_FIPS pct_white pct_minority \
## 0 3839 N 10TH ST 5.507900e+10 5.507900e+10 0.024051 0.975949
## 1 4900 W MELVINA ST 5.507900e+10 5.507900e+10 0.088248 0.911752
## 2 2400 W WISCONSIN AV 5.507901e+10 5.507901e+10 0.403137 0.596863
## 3 1800 W HAMPTON AV 5.507900e+10 5.507900e+10 0.043894 0.956106
## 4 4718 N 19TH ST 5.507900e+10 5.507900e+10 0.043894 0.956106
##
## wait_time wait_days
## 0 1 days 06:15:00 1.250000
## 1 8 days 20:53:00 8.833333
## 2 9 days 18:42:00 9.750000
## 3 2 days 10:11:00 2.416667
## 4 17 days 10:18:00 17.416667
Now we have our dataset ready for analysis, with each row having a time that it took to fix the potholes and a number describing the demographics of that census tract.
2.5.3 Removing missing data
When we’re doing our final analysis, we want to make sure we’re only working with data that’s actually exists. While it makes sense to not use missing data, missing data also causes a lot of analysis packages to throw errors!
We’ll get rid of anything that’s missing the fields we’re interested in - race percentages, along with number of days wait. We’ll want to compare the number of rows before and after to make sure we don’t lose too many.
## (12836, 10)
## (12783, 10)
If you end up with a lot of rows that disappear between these two stages, you should take a look to see if something might have gone wrong with your data. In this case we didn’t lose too many, though, so we should be okay.