2.2 Combining our datasets
The Milwaukee Department of Public Works can’t create one Excel file big enough to hold ten years of records, so they sent the data to us in three separate files:
2007-2010 POTHOLES.xls
2010-2013 POTHOLES.xls
2013-2017 POTHOLES.xls
First, we’re going to open them each up and confirm their columns are all the same.
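Here’s a rough sketch of how we might read them in with pandas - the variable names are just my choice, and old `.xls` files may need the `xlrd` package installed for `pd.read_excel` to work:

import pandas as pd

# Read each Excel file into its own dataframe
potholes_2007 = pd.read_excel("2007-2010 POTHOLES.xls")
potholes_2010 = pd.read_excel("2010-2013 POTHOLES.xls")
potholes_2013 = pd.read_excel("2013-2017 POTHOLES.xls")

# Peek at the first few rows of each one (repeating for the other two dataframes)
potholes_2007.head(3)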
A | Street | EnterDt | PrintDt | ResolvDt |
---|---|---|---|---|
1846 | W HALSEY AV | 2010-07-15 16:33 | 2010-07-16 15:44 | 2010-07-19 15:14 |
9324 | W PARK HILL AV | 2010-07-15 16:06 | 2010-07-16 10:05 | 2010-07-21 06:02 |
1020 | E MANITOBA ST | 2010-07-15 15:13 | 2010-07-15 15:33 | 2010-07-16 14:35 |
A | Street | EnterDt | PrintDt | ResolvDt |
---|---|---|---|---|
3839 | N 10TH ST | 2013-07-15 23:35 | 2013-07-16 05:46 | 2013-07-17 05:50 |
4900 | W MELVINA ST | 2013-07-15 20:05 | 2013-07-16 05:46 | 2013-07-24 16:58 |
2400 | W WISCONSIN AV | 2013-07-15 20:00 | 2013-07-16 05:56 | 2013-07-25 14:42 |
A | Street | EnterDt | PrintDt | ResolvDt |
---|---|---|---|---|
7741 | N 59TH ST | 2017-07-15 08:55 | 2017-07-17 05:33 | 2017-07-21 04:51 |
5517 | N 39TH ST | 2017-07-14 22:36 | 2017-07-17 05:33 | 2017-07-25 15:29 |
8242 | N GRANVILLE RD | 2017-07-14 18:30 | 2017-07-17 05:33 | 2017-07-17 06:55 |
Looks good! Matching columns, and the dates seem to fall in the right ranges.
Let’s do an extra check to make sure the files don’t have any overlap - it’d be no good if 2007-2010 and 2010-2013 both included 2010 and we ended up double-counting the year.
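One way to check, sketched below with the dataframe names from the reading step, is to look at the earliest and latest `EnterDt` in each file (assuming `EnterDt` was parsed as a datetime):

# Earliest and latest entry date in each file (in a notebook, one per cell)
potholes_2007.EnterDt.agg(['min', 'max'])
potholes_2010.EnterDt.agg(['min', 'max'])
potholes_2013.EnterDt.agg(['min', 'max'])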
## EnterDt
## min 2007-07-16 07:53
## max 2010-07-15 16:33
## EnterDt
## min 2010-07-16 06:34
## max 2013-07-15 23:35
## EnterDt
## min 2013-07-16 05:40
## max 2017-07-15 08:55
Okay, we’re looking pretty good!
Let’s combine the three dataframes into a single dataframe.
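Since the columns match, `pd.concat` can stack them end to end - a sketch, reusing the names from above:

# Stack the three dataframes into one and check the size
potholes = pd.concat([potholes_2007, potholes_2010, potholes_2013], ignore_index=True)
potholes.shape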
## (120186, 5)
Now we’re looking at 120,186 rows of potholes across those ten years.
Also, a little more cleaning. See anything that feels weird about the columns in our dataset?
 | A | Street | EnterDt | PrintDt | ResolvDt |
---|---|---|---|---|---|
0 | 1846 | W HALSEY AV | 2010-07-15 16:33 | 2010-07-16 15:44 | 2010-07-19 15:14 |
1 | 9324 | W PARK HILL AV | 2010-07-15 16:06 | 2010-07-16 10:05 | 2010-07-21 06:02 |
2 | 1020 | E MANITOBA ST | 2010-07-15 15:13 | 2010-07-15 15:33 | 2010-07-16 14:35 |
3 | 3200 | S 72ND ST | 2010-07-15 15:12 | 2010-07-15 15:33 | 2010-07-16 14:36 |
4 | 6001 | W WARNIMONT AV | 2010-07-15 15:11 | 2010-07-15 15:33 | 2010-07-16 14:35 |
Instead of a single address column, we have a street - `Street` - and a house number - `A`. It makes sense to combine those to get a new column that’s the whole address, yeah?
# The house number was read in as a number, so we need to convert it to a string
potholes['address'] = potholes.A.astype(str) + " " + potholes.Street
potholes = potholes.drop(columns=['A', 'Street'])
potholes.head()
 | EnterDt | PrintDt | ResolvDt | address |
---|---|---|---|---|
0 | 2010-07-15 16:33 | 2010-07-16 15:44 | 2010-07-19 15:14 | 1846 W HALSEY AV |
1 | 2010-07-15 16:06 | 2010-07-16 10:05 | 2010-07-21 06:02 | 9324 W PARK HILL AV |
2 | 2010-07-15 15:13 | 2010-07-15 15:33 | 2010-07-16 14:35 | 1020 E MANITOBA ST |
3 | 2010-07-15 15:12 | 2010-07-15 15:33 | 2010-07-16 14:36 | 3200 S 72ND ST |
4 | 2010-07-15 15:11 | 2010-07-15 15:33 | 2010-07-16 14:35 | 6001 W WARNIMONT AV |
Now that our data is nice and clean we can save it. We might have to send it to someone else at some point, and we wouldn’t want to do all this cleaning again, right?
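Saving is one line with `to_csv` - the filename here is just a placeholder:

# Save the cleaned, combined dataset (index=False skips the row numbers)
potholes.to_csv("potholes_cleaned.csv", index=False)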
2.2.1 Merging Census data and pothole data
If we look at our Census dataset, we see a couple of columns that identify the census tract: `Geo_FIPS` and `Geo_NAME`.
Geo_FIPS | Geo_GEOID | Geo_NAME | Geo_QName | Geo_STUSAB | Geo_SUMLEV | Geo_GEOCOMP | Geo_FILEID | Geo_LOGRECNO | Geo_US | Geo_REGION | Geo_DIVISION | Geo_STATECE | Geo_STATE | Geo_COUNTY | Geo_COUSUB | Geo_PLACE | Geo_PLACESE | Geo_TRACT | Geo_BLKGRP | Geo_CONCIT | Geo_AIANHH | Geo_AIANHHFP | Geo_AIHHTLI | Geo_AITSCE | Geo_AITS | Geo_ANRC | Geo_CBSA | Geo_CSA | Geo_METDIV | Geo_MACC | Geo_MEMI | Geo_NECTA | Geo_CNECTA | Geo_NECTADIV | Geo_UA | Geo_UACP | Geo_CDCURR | Geo_SLDU | Geo_SLDL | Geo_VTD | Geo_ZCTA3 | Geo_ZCTA5 | Geo_SUBMCD | Geo_SDELM | Geo_SDSEC | Geo_SDUNI | Geo_UR | Geo_PCI | Geo_TAZ | Geo_UGA | Geo_BTTR | Geo_BTBG | Geo_PUMA5 | Geo_PUMA1 | SE_A04001_001 | SE_A04001_002 | SE_A04001_003 | SE_A04001_004 | SE_A04001_005 | SE_A04001_006 | SE_A04001_007 | SE_A04001_008 | SE_A04001_009 | SE_A04001_010 | SE_A04001_011 | SE_A04001_012 | SE_A04001_013 | SE_A04001_014 | SE_A04001_015 | SE_A04001_016 | SE_A04001_017 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
55079000101 | 14000US55079000101 | Census Tract 1.01, Milwaukee County, Wisconsin | Census Tract 1.01, Milwaukee County, Wisconsin | wi | 140 | 0 | ACSSF | 4717 | NaN | NaN | NaN | NaN | 55 | 79 | NaN | NaN | NaN | 101 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5306 | 5015 | 1343 | 3567 | 13 | 70 | 0 | 16 | 6 | 291 | 37 | 0 | 0 | 0 | 0 | 254 | 0 |
55079000102 | 14000US55079000102 | Census Tract 1.02, Milwaukee County, Wisconsin | Census Tract 1.02, Milwaukee County, Wisconsin | wi | 140 | 0 | ACSSF | 4718 | NaN | NaN | NaN | NaN | 55 | 79 | NaN | NaN | NaN | 102 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3648 | 3549 | 1034 | 2195 | 36 | 118 | 0 | 0 | 166 | 99 | 67 | 32 | 0 | 0 | 0 | 0 | 0 |
We had to read `Geo_FIPS` in as a string, otherwise pandas would convert it into a very long number.
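If you’re curious, that’s just a `dtype` option when reading the file - a sketch, with a placeholder filename for the Census export:

# Keep Geo_FIPS as a string so pandas doesn't mangle it into a number
census = pd.read_csv("census_data.csv", dtype={'Geo_FIPS': str})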
If we look at our pothole dataset, can we match any of the columns with either of those columns?
EnterDt | PrintDt | ResolvDt | address |
---|---|---|---|
2010-07-15 16:33 | 2010-07-16 15:44 | 2010-07-19 15:14 | 1846 W HALSEY AV |
2010-07-15 16:06 | 2010-07-16 10:05 | 2010-07-21 06:02 | 9324 W PARK HILL AV |
Unfortunately not! Even though we have the street address, which we think of as the “location,” we can’t join it onto the Census dataset because the two don’t share a matching column.
What we need is to add a new column that includes the census tract name or number, which means we need to find out which addresses are inside which census tracts.
While there might be a service we can feed the address to and get the census tract back, we’re going to do something a little more generally useful!
From the Census Bureau, we can get a geographic file (a shapefile) of where each census tract is. If we can put the street addresses on the same map, we’ll be able to say “find out which shape you’re in, and take the census tract ID from that shape.”
(Picture a map here: the census tract shapes, with the pothole addresses plotted on top.)
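To preview what that lookup will eventually look like in code, here’s a rough geopandas sketch of a point-in-polygon join. The filenames, the `geocoded` dataframe, and its `latitude`/`longitude` columns are all assumptions - we’ll build the real version once the geocoding is done:

import geopandas as gpd

# Census tract shapes from the Census Bureau (placeholder filename)
tracts = gpd.read_file("census_tracts.shp")

# Turn geocoded addresses (assumed latitude/longitude columns) into points
points = gpd.GeoDataFrame(
    geocoded,
    geometry=gpd.points_from_xy(geocoded.longitude, geocoded.latitude),
    crs="EPSG:4326").to_crs(tracts.crs)

# For each point, find the tract shape it falls inside and grab that tract's ID
# (older versions of geopandas use op= instead of predicate=)
points_with_tracts = gpd.sjoin(points, tracts, how="left", predicate="within")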
But how do we put the street addresses on a map? For that, we need to convert each street address to latitude and longitude, a process called geocoding. Geocoding (or georeferencing) does just that: it takes a “normal” way of describing a place, like a street address or a ZIP code or a city name, and converts it into latitude and longitude.
For most geocoding services we’re going to be paying per address. I’m cheap, so let’s look at how many rows we have:
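`.shape` gives us (rows, columns):

potholes.shape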
## (120186, 4)
Oof, look at that! Geocoding 120,000-plus addresses is going to cost more than we want to pay - are there any shortcuts we can think of to geocoding all these pothole addresses? Luckily, there’s one!
Each row isn’t actually an address - it’s a pothole at an address. This means the same address might show up in our dataset multiple times. Let’s take a look:
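`value_counts` will count how often each address appears - here’s a sketch showing the ten most common:

# Count potholes per address, most common first
potholes.address.value_counts().head(10)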
address | count |
---|---|
4700 S HOWELL AV | 84 |
6000 W HAMPTON AV | 77 |
6000 W OKLAHOMA AV | 74 |
2700 S 60TH ST | 73 |
6000 W FOND DU LAC AV | 72 |
7400 W APPLETON AV | 71 |
2700 W MORGAN AV | 70 |
7600 W HAMPTON AV | 65 |
10700 W GOOD HOPE RD | 63 |
3100 S 60TH ST | 60 |
Yowza, that’s an impressive number of potholes at one address! It also means we’re lucky! Instead of geocoding each row individually - which would mean geocoding `4700 S HOWELL AV` 84 times - we can pull out the addresses, remove the duplicates, geocode them, and then join them back into our dataset.
How many unique addresses do we have?
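One way to ask, and it also gives us a list we can reuse in a moment:

# Pull out every distinct address and count them
unique_addresses = potholes.address.unique()
len(unique_addresses)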
## 53172
Wow, that’s more than a 50% savings compared to the original list! Let’s put the unique addresses into a dataframe to make them easier to work with.
address_df = pd.DataFrame({
'address': unique_addresses,
'city': 'Milwaukee',
'state': 'Wisconsin'
})
address_df.head()
address | city | state |
---|---|---|
1846 W HALSEY AV | Milwaukee | Wisconsin |
9324 W PARK HILL AV | Milwaukee | Wisconsin |
1020 E MANITOBA ST | Milwaukee | Wisconsin |
3200 S 72ND ST | Milwaukee | Wisconsin |
6001 W WARNIMONT AV | Milwaukee | Wisconsin |
I added the city and state because, depending on what you use to geocode, they might come in handy. Let’s save this list of addresses in case we need it as a separate file.
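Again, one quick `to_csv` - the filename is just my placeholder:

# Save the deduplicated addresses so they can be uploaded to a geocoding service
address_df.to_csv("addresses_for_geocoding.csv", index=False)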
2.2.2 Performing the geocoding
While I wrote some code to do the geocoding in Python with Google’s API, I ended up using Good.io instead because it’s a really, really, really easy service, and Google would cost over twice as much! You just drag and drop the CSV file and it emails you when it’s finished.
Geocoding this dataset of 50k addresses cost me about $15, and Good.io even offers to add census information onto our geocoded addresses for an extra fee. I didn’t do that because we’re going to do it for free in a minute.
When the geocoding was completed, I saved the new dataset as `addresses_geocoded.csv`.
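Reading the results back in is another one-liner. Which extra columns you get - latitude, longitude, accuracy, and so on - depends on the geocoding service, so the exact names are something to check rather than assume:

# Read the geocoder's output back in and take a look
geocoded = pd.read_csv("addresses_geocoded.csv")
geocoded.head()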