2.4 Merging our datasets
Let’s review the data that we have:
- A csv file of addresses and pothole report/fill times
- A csv file of addresses and geographic information, including census tract identifiers
- A csv file of census tract identifiers and race data
We’ll need to do a few merges in order to complete our finished dataset.
2.4.0.1 Our base dataset: potholes
Let’s start with our pothole dataset. Right after we read it in we’re going to filter for 2013. That’s pretty easy to do because pandas read the date in as text!
potholes = pd.read_csv("data/potholes-merged.csv")
potholes = potholes[potholes.EnterDt.str.startswith('2013')]
potholes.head()
EnterDt | PrintDt | ResolvDt | address | |
---|---|---|---|---|
38113 | 2013-07-15 23:35 | 2013-07-16 05:46 | 2013-07-17 05:50 | 3839 N 10TH ST |
38114 | 2013-07-15 20:05 | 2013-07-16 05:46 | 2013-07-24 16:58 | 4900 W MELVINA ST |
38115 | 2013-07-15 20:00 | 2013-07-16 05:56 | 2013-07-25 14:42 | 2400 W WISCONSIN AV |
38116 | 2013-07-15 19:55 | 2013-07-16 05:46 | 2013-07-18 06:06 | 1800 W HAMPTON AV |
38117 | 2013-07-15 19:50 | 2013-07-16 05:46 | 2013-08-02 06:08 | 4718 N 19TH ST |
2.4.0.2 Merging pothole addresses with tracts
To give these addresses census tract identifiers, we’ll join it to our geocoded tract dataset. As I mentioned before, we’re only interested in a couple of the columns - address and tract - so we’ll ignore the lat/lon, city, state, and all of that.
tracts = pd.read_csv("data/2013-addresses-tracts.csv",
dtype={'GEOID': 'str'},
usecols=['address', 'GEOID'])
tracts.head()
address | GEOID |
---|---|
8900 N 124TH ST | 55133200202 |
2200 N TERRACE AV | 55079186900 |
1000 N PROSPECT AV | 55079186900 |
1700 N PROSPECT AV | 55079186900 |
2298 N TERRACE AV | 55079186900 |
Any time you read in a geographic identifier - ZIP codes, etc - you’ll want to read them in as strings. Since they look like numbers, pandas would take a ZIP code like 06002
and turn it into the integer 6002
. To prevent any bad situations arising from something like that, we preemptively read in all of our geographic codes as strings.
Now we can join them together based on the address, giving each pothole report a census tract. Since they both have the same column name for the address - address
- we’ll be able to just use on='address'
.
EnterDt | PrintDt | ResolvDt | address | GEOID | |
---|---|---|---|---|---|
0 | 2013-07-15 23:35 | 2013-07-16 05:46 | 2013-07-17 05:50 | 3839 N 10TH ST | 55079004500 |
1 | 2013-07-15 20:05 | 2013-07-16 05:46 | 2013-07-24 16:58 | 4900 W MELVINA ST | 55079003800 |
2 | 2013-07-15 20:00 | 2013-07-16 05:56 | 2013-07-25 14:42 | 2400 W WISCONSIN AV | 55079014900 |
3 | 2013-07-15 19:55 | 2013-07-16 05:46 | 2013-07-18 06:06 | 1800 W HAMPTON AV | 55079002300 |
4 | 2013-07-15 19:50 | 2013-07-16 05:46 | 2013-08-02 06:08 | 4718 N 19TH ST | 55079002300 |
Usually when there’s no match, pandas drops the row from the merged dataset. We add how='left'
so that any row that isn’t in our geocoded dataset is still in there. It’s good to be able to check and see if you’re missing anything!
How many are missing census tracts?
## False 12822
## True 14
## Name: GEOID, dtype: int64
2.4.0.3 Merging potholes with census data
Now we can read in our census data. Again, we’ll be sure to treat the census tract number as a string to prevent any merging issues.
Geo_FIPS | Geo_GEOID | Geo_NAME | Geo_QName | Geo_STUSAB | Geo_SUMLEV | Geo_GEOCOMP | Geo_FILEID | Geo_LOGRECNO | Geo_US | Geo_REGION | Geo_DIVISION | Geo_STATECE | Geo_STATE | Geo_COUNTY | Geo_COUSUB | Geo_PLACE | Geo_PLACESE | Geo_TRACT | Geo_BLKGRP | Geo_CONCIT | Geo_AIANHH | Geo_AIANHHFP | Geo_AIHHTLI | Geo_AITSCE | Geo_AITS | Geo_ANRC | Geo_CBSA | Geo_CSA | Geo_METDIV | Geo_MACC | Geo_MEMI | Geo_NECTA | Geo_CNECTA | Geo_NECTADIV | Geo_UA | Geo_UACP | Geo_CDCURR | Geo_SLDU | Geo_SLDL | Geo_VTD | Geo_ZCTA3 | Geo_ZCTA5 | Geo_SUBMCD | Geo_SDELM | Geo_SDSEC | Geo_SDUNI | Geo_UR | Geo_PCI | Geo_TAZ | Geo_UGA | Geo_BTTR | Geo_BTBG | Geo_PUMA5 | Geo_PUMA1 | SE_A04001_001 | SE_A04001_002 | SE_A04001_003 | SE_A04001_004 | SE_A04001_005 | SE_A04001_006 | SE_A04001_007 | SE_A04001_008 | SE_A04001_009 | SE_A04001_010 | SE_A04001_011 | SE_A04001_012 | SE_A04001_013 | SE_A04001_014 | SE_A04001_015 | SE_A04001_016 | SE_A04001_017 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
55079000101 | 14000US55079000101 | Census Tract 1.01, Milwaukee County, Wisconsin | Census Tract 1.01, Milwaukee County, Wisconsin | wi | 140 | 0 | ACSSF | 4717 | NaN | NaN | NaN | NaN | 55 | 79 | NaN | NaN | NaN | 101 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5306 | 5015 | 1343 | 3567 | 13 | 70 | 0 | 16 | 6 | 291 | 37 | 0 | 0 | 0 | 0 | 254 | 0 |
55079000102 | 14000US55079000102 | Census Tract 1.02, Milwaukee County, Wisconsin | Census Tract 1.02, Milwaukee County, Wisconsin | wi | 140 | 0 | ACSSF | 4718 | NaN | NaN | NaN | NaN | 55 | 79 | NaN | NaN | NaN | 102 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3648 | 3549 | 1034 | 2195 | 36 | 118 | 0 | 0 | 166 | 99 | 67 | 32 | 0 | 0 | 0 | 0 | 0 |
55079000201 | 14000US55079000201 | Census Tract 2.01, Milwaukee County, Wisconsin | Census Tract 2.01, Milwaukee County, Wisconsin | wi | 140 | 0 | ACSSF | 4719 | NaN | NaN | NaN | NaN | 55 | 79 | NaN | NaN | NaN | 201 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4791 | 4019 | 1185 | 2657 | 0 | 102 | 0 | 0 | 75 | 772 | 120 | 0 | 0 | 0 | 0 | 613 | 39 |
55079000202 | 14000US55079000202 | Census Tract 2.02, Milwaukee County, Wisconsin | Census Tract 2.02, Milwaukee County, Wisconsin | wi | 140 | 0 | ACSSF | 4720 | NaN | NaN | NaN | NaN | 55 | 79 | NaN | NaN | NaN | 202 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6324 | 6148 | 2687 | 2810 | 87 | 390 | 12 | 0 | 162 | 176 | 113 | 0 | 0 | 0 | 0 | 63 | 0 |
55079000301 | 14000US55079000301 | Census Tract 3.01, Milwaukee County, Wisconsin | Census Tract 3.01, Milwaukee County, Wisconsin | wi | 140 | 0 | ACSSF | 4721 | NaN | NaN | NaN | NaN | 55 | 79 | NaN | NaN | NaN | 301 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1339 | 1339 | 1005 | 220 | 0 | 90 | 0 | 10 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Oh boy, how many columns is that?
## (298, 72)
I count that as too many! Let’s drop all of the geographic columns we don’t need, which means everything between Geo_GEOID
and Geo_PUMA1
.
Geo_GEOID
looks like it would match the census tract column in our potholes dataset — GEOID
— but in reality Geo_FIPS
is the column we’ll be merging on (Geo_GEOID
is the longer, national version).
Geo_FIPS | SE_A04001_001 | SE_A04001_002 | SE_A04001_003 | SE_A04001_004 | SE_A04001_005 | SE_A04001_006 | SE_A04001_007 | SE_A04001_008 | SE_A04001_009 | SE_A04001_010 | SE_A04001_011 | SE_A04001_012 | SE_A04001_013 | SE_A04001_014 | SE_A04001_015 | SE_A04001_016 | SE_A04001_017 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
55079000101 | 5306 | 5015 | 1343 | 3567 | 13 | 70 | 0 | 16 | 6 | 291 | 37 | 0 | 0 | 0 | 0 | 254 | 0 |
55079000102 | 3648 | 3549 | 1034 | 2195 | 36 | 118 | 0 | 0 | 166 | 99 | 67 | 32 | 0 | 0 | 0 | 0 | 0 |
55079000201 | 4791 | 4019 | 1185 | 2657 | 0 | 102 | 0 | 0 | 75 | 772 | 120 | 0 | 0 | 0 | 0 | 613 | 39 |
55079000202 | 6324 | 6148 | 2687 | 2810 | 87 | 390 | 12 | 0 | 162 | 176 | 113 | 0 | 0 | 0 | 0 | 63 | 0 |
55079000301 | 1339 | 1339 | 1005 | 220 | 0 | 90 | 0 | 10 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We’ll keep all the numeric data for now because we haven’t really discussed the analysis yet.
Now we can merge, and add the census data to each row of our potholes dataset. This time the column names we’re merging on are different for each dataset - GEOID
for the potholes, Geo_FIPS
for the census data - so we’ll need to use left_on=
and right_on=
.
merged = potholes_tracts.merge(census, left_on='GEOID', right_on='Geo_FIPS', how='left')
merged.head()
EnterDt | PrintDt | ResolvDt | address | GEOID | Geo_FIPS | SE_A04001_001 | SE_A04001_002 | SE_A04001_003 | SE_A04001_004 | SE_A04001_005 | SE_A04001_006 | SE_A04001_007 | SE_A04001_008 | SE_A04001_009 | SE_A04001_010 | SE_A04001_011 | SE_A04001_012 | SE_A04001_013 | SE_A04001_014 | SE_A04001_015 | SE_A04001_016 | SE_A04001_017 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-07-15 23:35 | 2013-07-16 05:46 | 2013-07-17 05:50 | 3839 N 10TH ST | 55079004500 | 55079004500 | 3160 | 3053 | 76 | 2854 | 1 | 30 | 0 | 8 | 84 | 107 | 16 | 17 | 0 | 0 | 0 | 74 | 0 |
1 | 2013-07-15 20:05 | 2013-07-16 05:46 | 2013-07-24 16:58 | 4900 W MELVINA ST | 55079003800 | 55079003800 | 2323 | 2276 | 205 | 2007 | 0 | 0 | 0 | 0 | 64 | 47 | 0 | 0 | 0 | 0 | 0 | 47 | 0 |
2 | 2013-07-15 20:00 | 2013-07-16 05:56 | 2013-07-25 14:42 | 2400 W WISCONSIN AV | 55079014900 | 55079014900 | 1275 | 1113 | 514 | 538 | 0 | 50 | 0 | 0 | 11 | 162 | 148 | 3 | 0 | 0 | 0 | 11 | 0 |
3 | 2013-07-15 19:55 | 2013-07-16 05:46 | 2013-07-18 06:06 | 1800 W HAMPTON AV | 55079002300 | 55079002300 | 4078 | 4063 | 179 | 3765 | 0 | 38 | 0 | 0 | 81 | 15 | 8 | 0 | 0 | 0 | 0 | 0 | 7 |
4 | 2013-07-15 19:50 | 2013-07-16 05:46 | 2013-08-02 06:08 | 4718 N 19TH ST | 55079002300 | 55079002300 | 4078 | 4063 | 179 | 3765 | 0 | 38 | 0 | 0 | 81 | 15 | 8 | 0 | 0 | 0 | 0 | 0 | 7 |
Again, we’ve done how='left'
so we can check on what census tracts we don’t have data for. SE_A04001_001
is the column for “Total Population,” so we can use it to see how many rows didn’t have that information added in.
## False 12820
## True 16
## Name: SE_A04001_001, dtype: int64
I’m impressed, that’s not very many at all!
Now that our dataset is ready for analysis, let’s save it and move on to the next step.