2.2 Combining our datasets

The Milwaukee Department of Public Works couldn't fit ten years of records into a single Excel file - the older .xls format tops out at 65,536 rows - so they sent the data to us in three separate files:

  • 2007-2010 POTHOLES.xls
  • 2010-2013 POTHOLES.xls
  • 2013-2017 POTHOLES.xls

First, we’re going to open them each up and confirm their columns are all the same.

# Open up 2007-2010
df_2007 = pd.read_excel("data/2007-2010 POTHOLES.xls")
df_2007.head(3)
##       A           Street           EnterDt           PrintDt          ResolvDt
## 0  1846      W HALSEY AV  2010-07-15 16:33  2010-07-16 15:44  2010-07-19 15:14
## 1  9324   W PARK HILL AV  2010-07-15 16:06  2010-07-16 10:05  2010-07-21 06:02
## 2  1020    E MANITOBA ST  2010-07-15 15:13  2010-07-15 15:33  2010-07-16 14:35
# Open up 2010-2013
df_2010 = pd.read_excel("data/2010-2013 POTHOLES.xls")
df_2010.head(3)
##       A          Street           EnterDt           PrintDt          ResolvDt
## 0  3839       N 10TH ST  2013-07-15 23:35  2013-07-16 05:46  2013-07-17 05:50
## 1  4900    W MELVINA ST  2013-07-15 20:05  2013-07-16 05:46  2013-07-24 16:58
## 2  2400  W WISCONSIN AV  2013-07-15 20:00  2013-07-16 05:56  2013-07-25 14:42
# Open up 2013-2017
df_2013 = pd.read_excel("data/2013-2017 POTHOLES.xls")
df_2013.head(3)
##       A          Street           EnterDt           PrintDt          ResolvDt
## 0  7741       N 59TH ST  2017-07-15 08:55  2017-07-17 05:33  2017-07-21 04:51
## 1  5517       N 39TH ST  2017-07-14 22:36  2017-07-17 05:33  2017-07-25 15:29
## 2  8242  N GRANVILLE RD  2017-07-15 18:30  2017-07-17 05:33  2017-07-17 06:55

Looks good! The columns match, and the dates we can see fall inside each file's range.
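Eyeballing works, but we can also have pandas do the comparison for us. This is just a sketch of a sanity check using the dataframes we already loaded:

# Confirm all three files share exactly the same columns, in the same order
assert list(df_2007.columns) == list(df_2010.columns) == list(df_2013.columns)

If the columns ever drift apart, the assert will complain instead of letting us silently combine mismatched files.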

Let’s do an extra check to make sure the files don’t have any overlap - it’d be no good if 2007-2010 and 2010-2013 both included 2010 and we ended up double-counting the year.

# min/max of the date in 2007-2010
df_2007.agg({'EnterDt': ['min', 'max']})
##               EnterDt
## min  2007-07-16 07:53
## max  2010-07-15 16:33
# min/max of the date in 2010-2013
df_2010.agg({'EnterDt': ['min', 'max']})
##               EnterDt
## min  2010-07-16 06:34
## max  2013-07-15 23:35
# min/max of the date in 2013-2017
df_2013.agg({'EnterDt': ['min', 'max']})
##               EnterDt
## min  2013-07-16 05:40
## max  2017-07-15 08:55

Okay, we’re looking pretty good!

Let’s combine the three dataframes into a single dataframe.

potholes = pd.concat([
  df_2007,
  df_2010,
  df_2013
])
potholes.shape
## (120186, 5)

Now we’re looking at 120,186 rows of potholes across those ten years.

Now, a little more cleaning. See anything that feels weird about the columns in our dataset?

potholes.head()
##       A           Street           EnterDt           PrintDt          ResolvDt
## 0  1846      W HALSEY AV  2010-07-15 16:33  2010-07-16 15:44  2010-07-19 15:14
## 1  9324   W PARK HILL AV  2010-07-15 16:06  2010-07-16 10:05  2010-07-21 06:02
## 2  1020    E MANITOBA ST  2010-07-15 15:13  2010-07-15 15:33  2010-07-16 14:35
## 3  3200        S 72ND ST  2010-07-15 15:12  2010-07-15 15:33  2010-07-16 14:36
## 4  6001   W WARNIMONT AV  2010-07-15 15:11  2010-07-15 15:33  2010-07-16 14:35

Instead of a single address column, we have a street name column (Street) and a house number column (A). It makes sense to combine those into a new column that's the whole address, yeah?

# The house number was read in as a number, so convert it to a string before combining
potholes['address'] = potholes.A.astype(str) + " " + potholes.Street
potholes = potholes.drop(columns=['A', 'Street'])
potholes.head()
##             EnterDt           PrintDt          ResolvDt              address
## 0  2010-07-15 16:33  2010-07-16 15:44  2010-07-19 15:14     1846 W HALSEY AV
## 1  2010-07-15 16:06  2010-07-16 10:05  2010-07-21 06:02  9324 W PARK HILL AV
## 2  2010-07-15 15:13  2010-07-15 15:33  2010-07-16 14:35   1020 E MANITOBA ST
## 3  2010-07-15 15:12  2010-07-15 15:33  2010-07-16 14:36       3200 S 72ND ST
## 4  2010-07-15 15:11  2010-07-15 15:33  2010-07-16 14:35  6001 W WARNIMONT AV

Now that our data is nice and clean, we can save it. We might have to send it to someone else at some point, and we wouldn't want to do all this cleaning again, right?

potholes.to_csv("data/potholes-merged.csv", index=False)
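One caveat about CSVs: they don't store column types, so when this file gets read back in, the three date columns will come back as plain text. If whoever opens it next needs real dates, pd.read_csv can parse them on the way in - a minimal sketch:

# Re-read the merged file, converting the date columns back into datetimes
potholes = pd.read_csv("data/potholes-merged.csv",
                       parse_dates=['EnterDt', 'PrintDt', 'ResolvDt'])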

2.2.1 Merging Census data and pothole data

If we look at our Census dataset, we see a couple columns that identify the census tract: Geo_FIPS and Geo_NAME.

census = pd.read_csv("data/R12216099_SL140.csv", dtype={'Geo_FIPS': 'str'})
census.head(2)
##       Geo_FIPS           Geo_GEOID                                         Geo_NAME  ...  SE_A04001_016  SE_A04001_017
## 0  55079000101  14000US55079000101  Census Tract 1.01, Milwaukee County, Wisconsin  ...            254              0
## 1  55079000102  14000US55079000102  Census Tract 1.02, Milwaukee County, Wisconsin  ...              0              0
##
## [2 rows x 72 columns]

We had to read Geo_FIPS in as a string; otherwise pandas would convert it into a very long number, which would drop any leading zeros and wouldn't match the text-based tract IDs we'll want to join on later.
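If you're curious what goes wrong, try reading the file without the dtype= argument - pandas guesses that Geo_FIPS is an integer:

# Without dtype=, pandas reads Geo_FIPS as int64 instead of text
census_raw = pd.read_csv("data/R12216099_SL140.csv")
census_raw.Geo_FIPS.dtype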

If we look at our pothole dataset, can we match any of the columns with either of those columns?

potholes = pd.read_csv("data/potholes-merged.csv")
potholes.head(2)
##             EnterDt           PrintDt          ResolvDt              address
## 0  2010-07-15 16:33  2010-07-16 15:44  2010-07-19 15:14     1846 W HALSEY AV
## 1  2010-07-15 16:06  2010-07-16 10:05  2010-07-21 06:02  9324 W PARK HILL AV

Unfortunately not! Even though we have the street address - which we think of as the "location" - we can't join it onto the Census dataset, because a join needs a column the two datasets share, and nothing here matches Geo_FIPS or Geo_NAME.

What we need is to add a new column that includes the census tract name or number, which means we need to find out which addresses are inside which census tracts.

While there might be a service we can feed the address to and get the census tract back, we’re going to do something a little more generally useful!

From the Census Bureau, we can get a geographic file (a shapefile) of where each census tract is. If we can put the street addresses on the same map, we’ll be able to say “find out which shape you’re in, and take the census tract ID from that shape.”

[Figure: pothole addresses plotted on top of census tract shapes, each point picking up the ID of the tract it falls inside]
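When we get to that step, the point-in-polygon work will probably look something like the geopandas sketch below. Everything in it is a stand-in - we haven't downloaded the tract shapefile yet, and addresses (with latitude and longitude columns) is the geocoded dataframe we're about to build:

import geopandas as gpd

# Hypothetical inputs: the Census Bureau's tract shapefile and our
# soon-to-be-geocoded addresses (with latitude/longitude columns)
tracts = gpd.read_file("data/tracts.shp")
points = gpd.GeoDataFrame(
    addresses,
    geometry=gpd.points_from_xy(addresses.longitude, addresses.latitude),
    crs=tracts.crs)

# For each address point, find the tract polygon it falls inside and
# copy that tract's columns (including its ID) onto the point
joined = gpd.sjoin(points, tracts, how="left", predicate="within")

Note that predicate= needs a reasonably recent geopandas - older versions called it op=.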

But how do we put the street addresses on a map? For that, we need to convert each street address to latitude and longitude, a process called geocoding. Geocoding (or georeferencing) does just that: it takes a "normal" way of describing a place - a street address, a ZIP code, a city name - and converts it into latitude and longitude.
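To make that concrete, here's what a single lookup might look like with the geopy library and OpenStreetMap's free Nominatim service - just a sketch, since Nominatim's rate limits make it a poor fit for a hundred thousand addresses:

from geopy.geocoders import Nominatim

# Nominatim requires a user_agent string identifying your application
geolocator = Nominatim(user_agent="milwaukee-potholes")
location = geolocator.geocode("1846 W HALSEY AV, Milwaukee, Wisconsin")
print(location.latitude, location.longitude)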

For most geocoding services, we're going to be paying per address. I'm cheap, so let's look at how many rows we have:

potholes.shape
## (120186, 4)

Oof, look at that! Geocoding 120,000 addresses is going to cost more than we want to pay - is there any shortcut to geocoding all these pothole addresses? Luckily, there is!

Each row isn’t actually an address - it’s a pothole at an address. This means the same address might show up in our dataset multiple times. Let’s take a look:

potholes.address.value_counts().head(10)
## 4700 S HOWELL AV         84
## 6000 W HAMPTON AV        77
## 6000 W OKLAHOMA AV       74
## 2700 S 60TH ST           73
## 6000 W FOND DU LAC AV    72
## 7400 W APPLETON AV       71
## 2700 W MORGAN AV         70
## 7600 W HAMPTON AV        65
## 10700 W GOOD HOPE RD     63
## 3100 S 60TH ST           60
## Name: address, dtype: int64

Yowza, that’s an impressive number of potholes at one address! It also means we’re lucky! Instead of geocoding each row individually - which would mean geocoding 4700 S HOWELL AV 84 times - we can pull out the addresses, remove the duplicates, geocode them, and then join them back into our dataset.

How many unique addresses do we have?

# Unique list of addresses
unique_addresses = potholes.address.unique()
len(unique_addresses)
## 53172

Wow, that’s about a 50% savings from the original list! Let’s put it into a dataframe to make the unique addresses easier to work with.

address_df = pd.DataFrame({
    'address': unique_addresses,
    'city': 'Milwaukee',
    'state': 'Wisconsin'
})
address_df.head()
##                address       city      state
## 0     1846 W HALSEY AV  Milwaukee  Wisconsin
## 1  9324 W PARK HILL AV  Milwaukee  Wisconsin
## 2   1020 E MANITOBA ST  Milwaukee  Wisconsin
## 3       3200 S 72ND ST  Milwaukee  Wisconsin
## 4  6001 W WARNIMONT AV  Milwaukee  Wisconsin

I added the city and state because, depending on what you use to geocode, they might come in handy. Let's save this list of addresses in case we need it as a separate file.

address_df.to_csv("data/addresses.csv", index=False)

2.2.2 Performing the geocoding

While I wrote some code to do the geocoding in Python with Google’s API, I ended up using Good.io instead because it’s a really, really, really easy service, and Google would cost over twice as much! You just drag and drop the CSV file and it emails you when it’s finished.

Geocoding this dataset of 50k addresses cost me about $15, and Good.io even offers to add census information to our geocoded addresses for an extra fee. I didn't do that because we're going to do it for free in a minute.

When the geocoding was completed, I saved the new dataset as addresses_geocoded.csv.
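From there, attaching the coordinates back onto the full pothole dataset is a single merge on the address column. The latitude/longitude column names below are assumptions - use whatever names your geocoder actually puts in the output file:

# Join each pothole to its geocoded coordinates by address
# (the column names 'latitude'/'longitude' are hypothetical)
geocoded = pd.read_csv("data/addresses_geocoded.csv")
potholes = potholes.merge(geocoded[['address', 'latitude', 'longitude']],
                          on='address', how='left')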