2.2 Combining our datasets
The Milwaukee Department of Public Works can’t create one Excel file big enough to hold ten years of records, so they sent the data to us in three separate files:
2007-2010 POTHOLES.xls
2010-2013 POTHOLES.xls
2013-2017 POTHOLES.xls
First, we’re going to open them each up and confirm their columns are all the same.
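Here’s a rough sketch of how we might read them in with pandas - the variable names are just my choice, and old `.xls` files may need the `xlrd` package installed for `pd.read_excel` to work:

import pandas as pd

# Read each Excel file into its own dataframe
potholes_2007 = pd.read_excel("2007-2010 POTHOLES.xls")
potholes_2010 = pd.read_excel("2010-2013 POTHOLES.xls")
potholes_2013 = pd.read_excel("2013-2017 POTHOLES.xls")

# Peek at the first few rows of each one (repeating for the other two dataframes)
potholes_2007.head(3)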
A | Street | EnterDt | PrintDt | ResolvDt |
---|---|---|---|---|
1846 | W HALSEY AV | 2010-07-15 16:33 | 2010-07-16 15:44 | 2010-07-19 15:14 |
9324 | W PARK HILL AV | 2010-07-15 16:06 | 2010-07-16 10:05 | 2010-07-21 06:02 |
1020 | E MANITOBA ST | 2010-07-15 15:13 | 2010-07-15 15:33 | 2010-07-16 14:35 |
A | Street | EnterDt | PrintDt | ResolvDt |
---|---|---|---|---|
3839 | N 10TH ST | 2013-07-15 23:35 | 2013-07-16 05:46 | 2013-07-17 05:50 |
4900 | W MELVINA ST | 2013-07-15 20:05 | 2013-07-16 05:46 | 2013-07-24 16:58 |
2400 | W WISCONSIN AV | 2013-07-15 20:00 | 2013-07-16 05:56 | 2013-07-25 14:42 |
A | Street | EnterDt | PrintDt | ResolvDt |
---|---|---|---|---|
7741 | N 59TH ST | 2017-07-15 08:55 | 2017-07-17 05:33 | 2017-07-21 04:51 |
5517 | N 39TH ST | 2017-07-14 22:36 | 2017-07-17 05:33 | 2017-07-25 15:29 |
8242 | N GRANVILLE RD | 2017-07-14 18:30 | 2017-07-17 05:33 | 2017-07-17 06:55 |
Looks good! Matching columns, and the dates seem to fall in the right ranges.
Let’s do an extra check to make sure the files don’t have any overlap - it’d be no good if 2007-2010 and 2010-2013 both included 2010 and we ended up double-counting the year.
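One way to check, sketched below with the dataframe names from the reading step, is to look at the earliest and latest `EnterDt` in each file (assuming `EnterDt` was parsed as a datetime):

# Earliest and latest entry date in each file (in a notebook, one per cell)
potholes_2007.EnterDt.agg(['min', 'max'])
potholes_2010.EnterDt.agg(['min', 'max'])
potholes_2013.EnterDt.agg(['min', 'max'])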
## EnterDt
## min 2007-07-16 07:53
## max 2010-07-15 16:33
## EnterDt
## min 2010-07-16 06:34
## max 2013-07-15 23:35
## EnterDt
## min 2013-07-16 05:40
## max 2017-07-15 08:55
Okay, we’re looking pretty good!
Let’s combine the three dataframes into a single dataframe.
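Since the columns match, `pd.concat` can stack them end to end - a sketch, reusing the names from above:

# Stack the three dataframes into one and check the size
potholes = pd.concat([potholes_2007, potholes_2010, potholes_2013], ignore_index=True)
potholes.shape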
## (120186, 5)
Now we’re looking at 120,186 rows of potholes across those ten years.
Also, a little more cleaning. See anything that feels weird about the columns in our dataset?
 | A | Street | EnterDt | PrintDt | ResolvDt |
---|---|---|---|---|---|
0 | 1846 | W HALSEY AV | 2010-07-15 16:33 | 2010-07-16 15:44 | 2010-07-19 15:14 |
1 | 9324 | W PARK HILL AV | 2010-07-15 16:06 | 2010-07-16 10:05 | 2010-07-21 06:02 |
2 | 1020 | E MANITOBA ST | 2010-07-15 15:13 | 2010-07-15 15:33 | 2010-07-16 14:35 |
3 | 3200 | S 72ND ST | 2010-07-15 15:12 | 2010-07-15 15:33 | 2010-07-16 14:36 |
4 | 6001 | W WARNIMONT AV | 2010-07-15 15:11 | 2010-07-15 15:33 | 2010-07-16 14:35 |
Instead of a single address column, we have a street - `Street` - and a house number - `A`. It makes sense to combine those to get a new column that’s the whole address, yeah?
# The house number was read in as a number, so we need to convert it to a string
potholes['address'] = potholes.A.astype(str) + " " + potholes.Street
potholes = potholes.drop(columns=['A', 'Street'])
potholes.head()
 | EnterDt | PrintDt | ResolvDt | address |
---|---|---|---|---|
0 | 2010-07-15 16:33 | 2010-07-16 15:44 | 2010-07-19 15:14 | 1846 W HALSEY AV |
1 | 2010-07-15 16:06 | 2010-07-16 10:05 | 2010-07-21 06:02 | 9324 W PARK HILL AV |
2 | 2010-07-15 15:13 | 2010-07-15 15:33 | 2010-07-16 14:35 | 1020 E MANITOBA ST |
3 | 2010-07-15 15:12 | 2010-07-15 15:33 | 2010-07-16 14:36 | 3200 S 72ND ST |
4 | 2010-07-15 15:11 | 2010-07-15 15:33 | 2010-07-16 14:35 | 6001 W WARNIMONT AV |
Now that our data is nice and clean we can save it. We might have to send it to someone else at some point, and we wouldn’t want to do all this cleaning again, right?
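Saving is one line with `to_csv` - the filename here is just a placeholder:

# Save the cleaned, combined dataset (index=False skips the row numbers)
potholes.to_csv("potholes_cleaned.csv", index=False)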
2.2.1 Merging Census data and pothole data
If we look at our Census dataset, we see a couple of columns that identify the census tract: `Geo_FIPS` and `Geo_NAME`.
Geo_FIPS | Geo_GEOID | Geo_NAME | Geo_QName | Geo_STUSAB | Geo_SUMLEV | Geo_GEOCOMP | Geo_FILEID | Geo_LOGRECNO | Geo_US | Geo_REGION | Geo_DIVISION | Geo_STATECE | Geo_STATE | Geo_COUNTY | Geo_COUSUB | Geo_PLACE | Geo_PLACESE | Geo_TRACT | Geo_BLKGRP | Geo_CONCIT | Geo_AIANHH | Geo_AIANHHFP | Geo_AIHHTLI | Geo_AITSCE | Geo_AITS | Geo_ANRC | Geo_CBSA | Geo_CSA | Geo_METDIV | Geo_MACC | Geo_MEMI | Geo_NECTA | Geo_CNECTA | Geo_NECTADIV | Geo_UA | Geo_UACP | Geo_CDCURR | Geo_SLDU | Geo_SLDL | Geo_VTD | Geo_ZCTA3 | Geo_ZCTA5 | Geo_SUBMCD | Geo_SDELM | Geo_SDSEC | Geo_SDUNI | Geo_UR | Geo_PCI | Geo_TAZ | Geo_UGA | Geo_BTTR | Geo_BTBG | Geo_PUMA5 | Geo_PUMA1 | SE_A04001_001 | SE_A04001_002 | SE_A04001_003 | SE_A04001_004 | SE_A04001_005 | SE_A04001_006 | SE_A04001_007 | SE_A04001_008 | SE_A04001_009 | SE_A04001_010 | SE_A04001_011 | SE_A04001_012 | SE_A04001_013 | SE_A04001_014 | SE_A04001_015 | SE_A04001_016 | SE_A04001_017 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
55079000101 | 14000US55079000101 | Census Tract 1.01, Milwaukee County, Wisconsin | Census Tract 1.01, Milwaukee County, Wisconsin | wi | 140 | 0 | ACSSF | 4717 | NaN | NaN | NaN | NaN | 55 | 79 | NaN | NaN | NaN | 101 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5306 | 5015 | 1343 | 3567 | 13 | 70 | 0 | 16 | 6 | 291 | 37 | 0 | 0 | 0 | 0 | 254 | 0 |
55079000102 | 14000US55079000102 | Census Tract 1.02, Milwaukee County, Wisconsin | Census Tract 1.02, Milwaukee County, Wisconsin | wi | 140 | 0 | ACSSF | 4718 | NaN | NaN | NaN | NaN | 55 | 79 | NaN | NaN | NaN | 102 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3648 | 3549 | 1034 | 2195 | 36 | 118 | 0 | 0 | 166 | 99 | 67 | 32 | 0 | 0 | 0 | 0 | 0 |
We had to read `Geo_FIPS` in as a string, otherwise pandas would convert it into a very long number.
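If you’re curious, that’s just a `dtype` option when reading the file - a sketch, with a placeholder filename for the Census export:

# Keep Geo_FIPS as a string so pandas doesn't mangle it into a number
census = pd.read_csv("census_data.csv", dtype={'Geo_FIPS': str})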
If we look at our pothole dataset, can we match any of the columns with either of those columns?
EnterDt | PrintDt | ResolvDt | address |
---|---|---|---|
2010-07-15 16:33 | 2010-07-16 15:44 | 2010-07-19 15:14 | 1846 W HALSEY AV |
2010-07-15 16:06 | 2010-07-16 10:05 | 2010-07-21 06:02 | 9324 W PARK HILL AV |
Unfortunately not! Even though we have the street address, which we think of as the “location,” we can’t join it onto the Census dataset because the two don’t share a matching column.
What we need is to add a new column that includes the census tract name or number, which means we need to find out which addresses are inside which census tracts.
While there might be a service we can feed the address to and get the census tract back, we’re going to do something a little more generally useful!
From the Census Bureau, we can get a geographic file (a shapefile) of where each census tract is. If we can put the street addresses on the same map, we’ll be able to say “find out which shape you’re in, and take the census tract ID from that shape.”
(Picture a map here: the census tract shapes, with the pothole addresses plotted on top.)
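To preview what that lookup will eventually look like in code, here’s a rough geopandas sketch of a point-in-polygon join. The filenames, the `geocoded` dataframe, and its `latitude`/`longitude` columns are all assumptions - we’ll build the real version once the geocoding is done:

import geopandas as gpd

# Census tract shapes from the Census Bureau (placeholder filename)
tracts = gpd.read_file("census_tracts.shp")

# Turn geocoded addresses (assumed latitude/longitude columns) into points
points = gpd.GeoDataFrame(
    geocoded,
    geometry=gpd.points_from_xy(geocoded.longitude, geocoded.latitude),
    crs="EPSG:4326").to_crs(tracts.crs)

# For each point, find the tract shape it falls inside and grab that tract's ID
# (older versions of geopandas use op= instead of predicate=)
points_with_tracts = gpd.sjoin(points, tracts, how="left", predicate="within")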
But how do we put the street addresses on a map? For that, we need to convert each street address to latitude and longitude, a process called geocoding. Geocoding (or georeferencing) does just that: it takes a “normal” way of describing a place, like a street address or a ZIP code or a city name, and converts it into latitude and longitude.
For most geocoding services we’re going to be paying per address. I’m cheap, so let’s look at how many rows we have:
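`.shape` gives us (rows, columns):

potholes.shape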
## (120186, 4)
Oof, look at that! Geocoding 120,000-plus addresses is going to cost more than we want to pay - are there any shortcuts we can think of to geocoding all these pothole addresses? Luckily, there’s one!
Each row isn’t actually an address - it’s a pothole at an address. This means the same address might show up in our dataset multiple times. Let’s take a look:
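`value_counts` will count how often each address appears - here’s a sketch showing the ten most common:

# Count potholes per address, most common first
potholes.address.value_counts().head(10)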
address | count |
---|---|
4700 S HOWELL AV | 84 |
6000 W HAMPTON AV | 77 |
6000 W OKLAHOMA AV | 74 |
2700 S 60TH ST | 73 |
6000 W FOND DU LAC AV | 72 |
7400 W APPLETON AV | 71 |
2700 W MORGAN AV | 70 |
7600 W HAMPTON AV | 65 |
10700 W GOOD HOPE RD | 63 |
3100 S 60TH ST | 60 |
Yowza, that’s an impressive number of potholes at one address! It also means we’re lucky! Instead of geocoding each row individually - which would mean geocoding `4700 S HOWELL AV` 84 times - we can pull out the addresses, remove the duplicates, geocode them, and then join them back into our dataset.
How many unique addresses do we have?
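One way to ask, and it also gives us a list we can reuse in a moment:

# Pull out every distinct address and count them
unique_addresses = potholes.address.unique()
len(unique_addresses)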
## 53172
Wow, that’s more than a 50% savings compared to the original list! Let’s put the unique addresses into a dataframe to make them easier to work with.
address_df = pd.DataFrame({
'address': unique_addresses,
'city': 'Milwaukee',
'state': 'Wisconsin'
})
address_df.head()
address | city | state |
---|---|---|
1846 W HALSEY AV | Milwaukee | Wisconsin |
9324 W PARK HILL AV | Milwaukee | Wisconsin |
1020 E MANITOBA ST | Milwaukee | Wisconsin |
3200 S 72ND ST | Milwaukee | Wisconsin |
6001 W WARNIMONT AV | Milwaukee | Wisconsin |
I added the city and state because, depending on what you use to geocode, they might come in handy. Let’s save this list of addresses in case we need it as a separate file.
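Again, one quick `to_csv` - the filename is just my placeholder:

# Save the deduplicated addresses so they can be uploaded to a geocoding service
address_df.to_csv("addresses_for_geocoding.csv", index=False)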
2.2.2 Performing the geocoding
While I wrote some code to do the geocoding in Python with Google’s API, I ended up using Good.io instead because it’s a really, really, really easy service, and Google would cost over twice as much! You just drag and drop the CSV file and it emails you when it’s finished.
Geocoding this dataset of 50k addresses cost me about $15, and Good.io even offers to add census information onto our geocoded addresses for an extra fee. I didn’t do that because we’re going to do it for free in a minute.
When the geocoding was completed, I saved the new dataset as `addresses_geocoded.csv`.
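Reading the results back in is another one-liner. Which extra columns you get - latitude, longitude, accuracy, and so on - depends on the geocoding service, so the exact names are something to check rather than assume:

# Read the geocoder's output back in and take a look
geocoded = pd.read_csv("addresses_geocoded.csv")
geocoded.head()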