2.4 Merging our datasets

Let’s review the data that we have:

  • A CSV file of addresses and pothole report/fill times
  • A CSV file of addresses and geographic information, including census tract identifiers
  • A CSV file of census tract identifiers and race data

We’ll need to do a few merges to build our finished dataset.

2.4.0.1 Our base dataset: potholes

Let’s start with our pothole dataset. Right after we read it in, we’re going to filter for 2013. That’s easy to do because pandas read the dates in as text!

potholes = pd.read_csv("data/potholes-merged.csv")
potholes = potholes[potholes.EnterDt.str.startswith('2013')]
potholes.head()
EnterDt PrintDt ResolvDt address
38113 2013-07-15 23:35 2013-07-16 05:46 2013-07-17 05:50 3839 N 10TH ST
38114 2013-07-15 20:05 2013-07-16 05:46 2013-07-24 16:58 4900 W MELVINA ST
38115 2013-07-15 20:00 2013-07-16 05:56 2013-07-25 14:42 2400 W WISCONSIN AV
38116 2013-07-15 19:55 2013-07-16 05:46 2013-07-18 06:06 1800 W HAMPTON AV
38117 2013-07-15 19:50 2013-07-16 05:46 2013-08-02 06:08 4718 N 19TH ST
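Filtering on text works here because of the YYYY-MM-DD format, but if you’d rather be explicit about dates, the same filter can be done on real datetimes. A minimal sketch using made-up rows, not the actual data file:

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for potholes-merged.csv (rows invented for illustration)
raw = StringIO(
    "EnterDt,address\n"
    "2012-12-30 10:00,100 E MAIN ST\n"
    "2013-07-15 23:35,3839 N 10TH ST\n"
)
potholes = pd.read_csv(raw)

# Convert the text column to real datetimes, then filter on the year
potholes['EnterDt'] = pd.to_datetime(potholes['EnterDt'])
in_2013 = potholes[potholes.EnterDt.dt.year == 2013]
```

The `.dt` accessor gives you the year, month, and so on once the column is an actual datetime, which pays off later if you want to group by month or compute durations.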

2.4.0.2 Merging pothole addresses with tracts

To give these addresses census tract identifiers, we’ll join them to our geocoded tract dataset. As I mentioned before, we’re only interested in a couple of the columns - address and GEOID, the tract identifier - so we’ll ignore the lat/lon, city, state, and all of that.

tracts = pd.read_csv("data/2013-addresses-tracts.csv", 
                     dtype={'GEOID': 'str'},
                     usecols=['address', 'GEOID'])
tracts.head()
address GEOID
8900 N 124TH ST 55133200202
2200 N TERRACE AV 55079186900
1000 N PROSPECT AV 55079186900
1700 N PROSPECT AV 55079186900
2298 N TERRACE AV 55079186900

Any time you read in a geographic identifier - ZIP codes, census tract numbers, and so on - you’ll want to read them in as strings. Since they look like numbers, pandas would take a ZIP code like 06002 and turn it into the integer 6002. To head off problems like that, we preemptively read in all of our geographic codes as strings.
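To see the problem in miniature, here’s a sketch with a tiny made-up file - the ZIP codes are invented for illustration:

```python
import pandas as pd
from io import StringIO

# A made-up two-row file of ZIP codes, one with a leading zero
raw = "zip\n06002\n60647\n"

# Default behavior: the column looks numeric, so pandas makes it int64
bad = pd.read_csv(StringIO(raw))
# bad['zip'].iloc[0] is now the integer 6002 - the leading zero is gone

# Reading as strings keeps the code intact
good = pd.read_csv(StringIO(raw), dtype={'zip': 'str'})
# good['zip'].iloc[0] is the string '06002'
```

Besides losing leading zeros, the integer version would also refuse to match a string column during a merge, which is exactly the trap we’re avoiding with GEOID.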

Now we can join them together based on the address, giving each pothole report a census tract. Since both datasets use the same column name for the address - address - we can just use on='address'.

potholes_tracts = potholes.merge(tracts, on='address', how='left')
potholes_tracts.head()
EnterDt PrintDt ResolvDt address GEOID
0 2013-07-15 23:35 2013-07-16 05:46 2013-07-17 05:50 3839 N 10TH ST 55079004500
1 2013-07-15 20:05 2013-07-16 05:46 2013-07-24 16:58 4900 W MELVINA ST 55079003800
2 2013-07-15 20:00 2013-07-16 05:56 2013-07-25 14:42 2400 W WISCONSIN AV 55079014900
3 2013-07-15 19:55 2013-07-16 05:46 2013-07-18 06:06 1800 W HAMPTON AV 55079002300
4 2013-07-15 19:50 2013-07-16 05:46 2013-08-02 06:08 4718 N 19TH ST 55079002300

By default pandas performs an inner merge, dropping any row that doesn’t have a match in both datasets. We add how='left' so that every pothole report is kept, even if its address isn’t in our geocoded dataset. It’s good to be able to check and see if you’re missing anything!

How many are missing census tracts?

potholes_tracts.GEOID.isna().value_counts()
## False    12822
## True        14
## Name: GEOID, dtype: int64
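Another way to audit a left merge - not what we did above, just an alternative worth knowing - is pandas’ indicator= option, which adds a _merge column recording where each row came from. A small sketch with invented addresses:

```python
import pandas as pd

# Two made-up datasets: one address geocodes, the other doesn't
potholes = pd.DataFrame({'address': ['3839 N 10TH ST', '1 FAKE ST']})
tracts = pd.DataFrame({'address': ['3839 N 10TH ST'],
                       'GEOID': ['55079004500']})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
merged = potholes.merge(tracts, on='address', how='left', indicator=True)

# Rows marked 'left_only' had no match in the geocoded dataset
unmatched = merged[merged['_merge'] == 'left_only']
```

This is handy when you want to inspect the actual unmatched rows, not just count them.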

2.4.0.3 Merging potholes with census data

Now we can read in our census data. Again, we’ll be sure to treat the census tract number as a string to prevent any merging issues.

census = pd.read_csv("data/R12216099_SL140.csv", dtype={'Geo_FIPS': 'str'})
census.head()
Geo_FIPS Geo_GEOID Geo_NAME Geo_QName Geo_STUSAB Geo_SUMLEV Geo_GEOCOMP Geo_FILEID Geo_LOGRECNO Geo_US Geo_REGION Geo_DIVISION Geo_STATECE Geo_STATE Geo_COUNTY Geo_COUSUB Geo_PLACE Geo_PLACESE Geo_TRACT Geo_BLKGRP Geo_CONCIT Geo_AIANHH Geo_AIANHHFP Geo_AIHHTLI Geo_AITSCE Geo_AITS Geo_ANRC Geo_CBSA Geo_CSA Geo_METDIV Geo_MACC Geo_MEMI Geo_NECTA Geo_CNECTA Geo_NECTADIV Geo_UA Geo_UACP Geo_CDCURR Geo_SLDU Geo_SLDL Geo_VTD Geo_ZCTA3 Geo_ZCTA5 Geo_SUBMCD Geo_SDELM Geo_SDSEC Geo_SDUNI Geo_UR Geo_PCI Geo_TAZ Geo_UGA Geo_BTTR Geo_BTBG Geo_PUMA5 Geo_PUMA1 SE_A04001_001 SE_A04001_002 SE_A04001_003 SE_A04001_004 SE_A04001_005 SE_A04001_006 SE_A04001_007 SE_A04001_008 SE_A04001_009 SE_A04001_010 SE_A04001_011 SE_A04001_012 SE_A04001_013 SE_A04001_014 SE_A04001_015 SE_A04001_016 SE_A04001_017
55079000101 14000US55079000101 Census Tract 1.01, Milwaukee County, Wisconsin Census Tract 1.01, Milwaukee County, Wisconsin wi 140 0 ACSSF 4717 NaN NaN NaN NaN 55 79 NaN NaN NaN 101 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 5306 5015 1343 3567 13 70 0 16 6 291 37 0 0 0 0 254 0
55079000102 14000US55079000102 Census Tract 1.02, Milwaukee County, Wisconsin Census Tract 1.02, Milwaukee County, Wisconsin wi 140 0 ACSSF 4718 NaN NaN NaN NaN 55 79 NaN NaN NaN 102 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3648 3549 1034 2195 36 118 0 0 166 99 67 32 0 0 0 0 0
55079000201 14000US55079000201 Census Tract 2.01, Milwaukee County, Wisconsin Census Tract 2.01, Milwaukee County, Wisconsin wi 140 0 ACSSF 4719 NaN NaN NaN NaN 55 79 NaN NaN NaN 201 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4791 4019 1185 2657 0 102 0 0 75 772 120 0 0 0 0 613 39
55079000202 14000US55079000202 Census Tract 2.02, Milwaukee County, Wisconsin Census Tract 2.02, Milwaukee County, Wisconsin wi 140 0 ACSSF 4720 NaN NaN NaN NaN 55 79 NaN NaN NaN 202 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 6324 6148 2687 2810 87 390 12 0 162 176 113 0 0 0 0 63 0
55079000301 14000US55079000301 Census Tract 3.01, Milwaukee County, Wisconsin Census Tract 3.01, Milwaukee County, Wisconsin wi 140 0 ACSSF 4721 NaN NaN NaN NaN 55 79 NaN NaN NaN 301 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1339 1339 1005 220 0 90 0 10 14 0 0 0 0 0 0 0 0

Oh boy, how many columns is that?

census.shape
## (298, 72)

I count that as too many! Let’s drop all of the geographic columns we don’t need, which means everything between Geo_GEOID and Geo_PUMA1.

Geo_GEOID looks like it would match the census tract column in our potholes dataset — GEOID — but in reality Geo_FIPS is the column we’ll be merging on (Geo_GEOID is the longer, national version).

census = census.drop(columns=census.columns.to_series()["Geo_GEOID":"Geo_PUMA1"])
census.head()
Geo_FIPS SE_A04001_001 SE_A04001_002 SE_A04001_003 SE_A04001_004 SE_A04001_005 SE_A04001_006 SE_A04001_007 SE_A04001_008 SE_A04001_009 SE_A04001_010 SE_A04001_011 SE_A04001_012 SE_A04001_013 SE_A04001_014 SE_A04001_015 SE_A04001_016 SE_A04001_017
55079000101 5306 5015 1343 3567 13 70 0 16 6 291 37 0 0 0 0 254 0
55079000102 3648 3549 1034 2195 36 118 0 0 166 99 67 32 0 0 0 0 0
55079000201 4791 4019 1185 2657 0 102 0 0 75 772 120 0 0 0 0 613 39
55079000202 6324 6148 2687 2810 87 390 12 0 162 176 113 0 0 0 0 63 0
55079000301 1339 1339 1005 220 0 90 0 10 14 0 0 0 0 0 0 0 0
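That to_series() trick deserves a quick word: converting the columns to a Series lets us slice them by label, and label slices in pandas include both endpoints. A toy sketch with made-up column names:

```python
import pandas as pd

# An empty frame with invented columns, just to show the slicing
df = pd.DataFrame(columns=['keep_me', 'a', 'b', 'c', 'data'])

# Label slicing is inclusive on both ends, so 'a':'c' picks up
# 'a', 'b', and 'c' - everything between and including those labels
to_drop = df.columns.to_series()['a':'c']
trimmed = df.drop(columns=to_drop)
# trimmed.columns is now ['keep_me', 'data']
```

That inclusiveness is why the real code names Geo_GEOID and Geo_PUMA1 explicitly: both of those columns get dropped along with everything in between.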

We’ll keep all the numeric data for now because we haven’t really discussed the analysis yet.

Now we can merge, and add the census data to each row of our potholes dataset. This time the column names we’re merging on are different for each dataset - GEOID for the potholes, Geo_FIPS for the census data - so we’ll need to use left_on= and right_on=.

merged = potholes_tracts.merge(census, left_on='GEOID', right_on='Geo_FIPS', how='left')
merged.head()
EnterDt PrintDt ResolvDt address GEOID Geo_FIPS SE_A04001_001 SE_A04001_002 SE_A04001_003 SE_A04001_004 SE_A04001_005 SE_A04001_006 SE_A04001_007 SE_A04001_008 SE_A04001_009 SE_A04001_010 SE_A04001_011 SE_A04001_012 SE_A04001_013 SE_A04001_014 SE_A04001_015 SE_A04001_016 SE_A04001_017
0 2013-07-15 23:35 2013-07-16 05:46 2013-07-17 05:50 3839 N 10TH ST 55079004500 55079004500 3160 3053 76 2854 1 30 0 8 84 107 16 17 0 0 0 74 0
1 2013-07-15 20:05 2013-07-16 05:46 2013-07-24 16:58 4900 W MELVINA ST 55079003800 55079003800 2323 2276 205 2007 0 0 0 0 64 47 0 0 0 0 0 47 0
2 2013-07-15 20:00 2013-07-16 05:56 2013-07-25 14:42 2400 W WISCONSIN AV 55079014900 55079014900 1275 1113 514 538 0 50 0 0 11 162 148 3 0 0 0 11 0
3 2013-07-15 19:55 2013-07-16 05:46 2013-07-18 06:06 1800 W HAMPTON AV 55079002300 55079002300 4078 4063 179 3765 0 38 0 0 81 15 8 0 0 0 0 0 7
4 2013-07-15 19:50 2013-07-16 05:46 2013-08-02 06:08 4718 N 19TH ST 55079002300 55079002300 4078 4063 179 3765 0 38 0 0 81 15 8 0 0 0 0 0 7
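One side effect of merging with left_on=/right_on= is that both key columns survive the merge, which is why GEOID and Geo_FIPS both appear above. If the duplicate bothers you, it can be dropped afterwards - a sketch with invented values:

```python
import pandas as pd

# Minimal stand-ins for the two datasets being merged
left = pd.DataFrame({'GEOID': ['55079004500'],
                     'address': ['3839 N 10TH ST']})
right = pd.DataFrame({'Geo_FIPS': ['55079004500'],
                      'SE_A04001_001': [3160]})

merged = left.merge(right, left_on='GEOID', right_on='Geo_FIPS', how='left')

# GEOID and Geo_FIPS now hold the same values; drop the redundant one
merged = merged.drop(columns='Geo_FIPS')
```

Keeping both columns is harmless, though, and for now the extra one makes it easy to eyeball that the merge lined up correctly.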

Again, we’ve used how='left' so we can check which census tracts we don’t have data for. SE_A04001_001 is the column for “Total Population,” so we can use it to see how many rows didn’t have that information added in.

merged.SE_A04001_001.isna().value_counts()
## False    12820
## True        16
## Name: SE_A04001_001, dtype: int64

I’m impressed, that’s not very many at all!

Now that our dataset is ready for analysis, let’s save it and move on to the next step.

merged.to_csv("data/2013-complete.csv", index=False)