2.4 Merging our datasets

2.4.0.1 Our base dataset: potholes

Let’s start with our pothole dataset. Right after we read it in we’re going to filter for 2013. That’s pretty easy to do because pandas read the date in as text!

potholes = pd.read_csv("data/potholes-merged.csv")
potholes = potholes[potholes.EnterDt.str.startswith('2013')]
potholes.head()

	EnterDt	PrintDt	ResolvDt	address
38113	2013-07-15 23:35	2013-07-16 05:46	2013-07-17 05:50	3839 N 10TH ST
38114	2013-07-15 20:05	2013-07-16 05:46	2013-07-24 16:58	4900 W MELVINA ST
38115	2013-07-15 20:00	2013-07-16 05:56	2013-07-25 14:42	2400 W WISCONSIN AV
38116	2013-07-15 19:55	2013-07-16 05:46	2013-07-18 06:06	1800 W HAMPTON AV
38117	2013-07-15 19:50	2013-07-16 05:46	2013-08-02 06:08	4718 N 19TH ST

2.4.0.2 Merging pothole addresses with tracts

To give these addresses census tract identifiers, we’ll join it to our geocoded tract dataset. As I mentioned before, we’re only interested in a couple of the columns - address and tract - so we’ll ignore the lat/lon, city, state, and all of that.

tracts = pd.read_csv("data/2013-addresses-tracts.csv", 
                     dtype={'GEOID': 'str'},
                     usecols=['address', 'GEOID'])
tracts.head()

address	GEOID
8900 N 124TH ST	55133200202
2200 N TERRACE AV	55079186900
1000 N PROSPECT AV	55079186900
1700 N PROSPECT AV	55079186900
2298 N TERRACE AV	55079186900

Any time you read in a geographic identifier - ZIP codes, etc - you’ll want to read them in as strings. Since they look like numbers, pandas would take a ZIP code like 06002 and turn it into the integer 6002. To prevent any bad situations arising from something like that, we preemptively read in all of our geographic codes as strings.

Now we can join them together based on the address, giving each pothole report a census tract. Since they both have the same column name for the address - address - we’ll be able to just use on='address'.

potholes_tracts = potholes.merge(tracts, on='address', how='left')
potholes_tracts.head()

	EnterDt	PrintDt	ResolvDt	address	GEOID
0	2013-07-15 23:35	2013-07-16 05:46	2013-07-17 05:50	3839 N 10TH ST	55079004500
1	2013-07-15 20:05	2013-07-16 05:46	2013-07-24 16:58	4900 W MELVINA ST	55079003800
2	2013-07-15 20:00	2013-07-16 05:56	2013-07-25 14:42	2400 W WISCONSIN AV	55079014900
3	2013-07-15 19:55	2013-07-16 05:46	2013-07-18 06:06	1800 W HAMPTON AV	55079002300
4	2013-07-15 19:50	2013-07-16 05:46	2013-08-02 06:08	4718 N 19TH ST	55079002300

Usually when there’s no match, pandas drops the row from the merged dataset. We add how='left' so that any row that isn’t in our geocoded dataset is still in there. It’s good to be able to check and see if you’re missing anything!

How many are missing census tracts?

potholes_tracts.GEOID.isna().value_counts()

## False    12822
## True        14
## Name: GEOID, dtype: int64

2.4.0.3 Merging potholes with census data

Now we can read in our census data. Again, we’ll be sure to treat the census tract number as a string to prevent any merging issues.

census = pd.read_csv("data/R12216099_SL140.csv", dtype={'Geo_FIPS': 'str'})
census.head()

Geo_FIPS	Geo_GEOID	Geo_NAME	Geo_QName	Geo_STUSAB	Geo_SUMLEV	Geo_FILEID	Geo_LOGRECNO	Geo_US	Geo_REGION	Geo_DIVISION	Geo_STATECE	Geo_STATE	Geo_COUNTY	Geo_COUSUB	Geo_PLACE	Geo_PLACESE	Geo_TRACT	Geo_BLKGRP	Geo_CONCIT	Geo_AIANHH	Geo_AIANHHFP	Geo_AIHHTLI	Geo_AITSCE	Geo_AITS	Geo_ANRC	Geo_CBSA	Geo_CSA	Geo_METDIV	Geo_MACC	Geo_MEMI	Geo_NECTA	Geo_CNECTA	Geo_NECTADIV	Geo_UA	Geo_UACP	Geo_CDCURR	Geo_SLDU	Geo_SLDL	Geo_VTD	Geo_ZCTA3	Geo_ZCTA5	Geo_SUBMCD	Geo_SDELM	Geo_SDSEC	Geo_SDUNI	Geo_UR	Geo_PCI	Geo_TAZ	Geo_UGA	Geo_BTTR	Geo_BTBG	Geo_PUMA5	Geo_PUMA1	SE_A04001_001	SE_A04001_002	SE_A04001_003	SE_A04001_004	SE_A04001_005	SE_A04001_006	SE_A04001_007	SE_A04001_008	SE_A04001_009	SE_A04001_010	SE_A04001_011	SE_A04001_012	SE_A04001_016	SE_A04001_017
55079000101	14000US55079000101	Census Tract 1.01, Milwaukee County, Wisconsin	Census Tract 1.01, Milwaukee County, Wisconsin	wi	140	ACSSF	4717	NaN	NaN	NaN	NaN	55	79	NaN	NaN	NaN	101	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	5306	5015	1343	3567	13	70	0	16	6	291	37	0	254	0
55079000102	14000US55079000102	Census Tract 1.02, Milwaukee County, Wisconsin	Census Tract 1.02, Milwaukee County, Wisconsin	wi	140	ACSSF	4718	NaN	NaN	NaN	NaN	55	79	NaN	NaN	NaN	102	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3648	3549	1034	2195	36	118	0	0	166	99	67	32	0	0
55079000201	14000US55079000201	Census Tract 2.01, Milwaukee County, Wisconsin	Census Tract 2.01, Milwaukee County, Wisconsin	wi	140	ACSSF	4719	NaN	NaN	NaN	NaN	55	79	NaN	NaN	NaN	201	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4791	4019	1185	2657	0	102	0	0	75	772	120	0	613	39
55079000202	14000US55079000202	Census Tract 2.02, Milwaukee County, Wisconsin	Census Tract 2.02, Milwaukee County, Wisconsin	wi	140	ACSSF	4720	NaN	NaN	NaN	NaN	55	79	NaN	NaN	NaN	202	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	6324	6148	2687	2810	87	390	12	0	162	176	113	0	63	0
55079000301	14000US55079000301	Census Tract 3.01, Milwaukee County, Wisconsin	Census Tract 3.01, Milwaukee County, Wisconsin	wi	140	ACSSF	4721	NaN	NaN	NaN	NaN	55	79	NaN	NaN	NaN	301	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1339	1339	1005	220	0	90	0	10	14	0	0	0	0	0

Oh boy, how many columns is that?

census.shape

## (298, 72)

I count that as too many! Let’s drop all of the geographic columns we don’t need, which means everything between Geo_GEOID and Geo_PUMA1.

Geo_GEOID looks like it would match the census tract column in our potholes dataset — GEOID — but in reality Geo_FIPS is the column we’ll be merging on (Geo_GEOID is the longer, national version).

census = census.drop(columns=census.columns.to_series()["Geo_GEOID":"Geo_PUMA1"])
census.head()

Geo_FIPS	SE_A04001_001	SE_A04001_002	SE_A04001_003	SE_A04001_004	SE_A04001_005	SE_A04001_006	SE_A04001_007	SE_A04001_008	SE_A04001_009	SE_A04001_010	SE_A04001_011	SE_A04001_012	SE_A04001_016	SE_A04001_017
55079000101	5306	5015	1343	3567	13	70	0	16	6	291	37	0	254	0
55079000102	3648	3549	1034	2195	36	118	0	0	166	99	67	32	0	0
55079000201	4791	4019	1185	2657	0	102	0	0	75	772	120	0	613	39
55079000202	6324	6148	2687	2810	87	390	12	0	162	176	113	0	63	0
55079000301	1339	1339	1005	220	0	90	0	10	14	0	0	0	0	0

We’ll keep all the numeric data for now because we haven’t really discussed the analysis yet.

Now we can merge, and add the census data to each row of our potholes dataset. This time the column names we’re merging on are different for each dataset - GEOID for the potholes, Geo_FIPS for the census data - so we’ll need to use left_on= and right_on=.

merged = potholes_tracts.merge(census, left_on='GEOID', right_on='Geo_FIPS', how='left')
merged.head()

	EnterDt	PrintDt	ResolvDt	address	GEOID	Geo_FIPS	SE_A04001_001	SE_A04001_002	SE_A04001_003	SE_A04001_004	SE_A04001_005	SE_A04001_006	SE_A04001_008	SE_A04001_009	SE_A04001_010	SE_A04001_011	SE_A04001_012	SE_A04001_016	SE_A04001_017
0	2013-07-15 23:35	2013-07-16 05:46	2013-07-17 05:50	3839 N 10TH ST	55079004500	55079004500	3160	3053	76	2854	1	30	8	84	107	16	17	74	0
1	2013-07-15 20:05	2013-07-16 05:46	2013-07-24 16:58	4900 W MELVINA ST	55079003800	55079003800	2323	2276	205	2007	0	0	0	64	47	0	0	47	0
2	2013-07-15 20:00	2013-07-16 05:56	2013-07-25 14:42	2400 W WISCONSIN AV	55079014900	55079014900	1275	1113	514	538	0	50	0	11	162	148	3	11	0
3	2013-07-15 19:55	2013-07-16 05:46	2013-07-18 06:06	1800 W HAMPTON AV	55079002300	55079002300	4078	4063	179	3765	0	38	0	81	15	8	0	0	7
4	2013-07-15 19:50	2013-07-16 05:46	2013-08-02 06:08	4718 N 19TH ST	55079002300	55079002300	4078	4063	179	3765	0	38	0	81	15	8	0	0	7

Again, we’ve done how='left' so we can check on what census tracts we don’t have data for. SE_A04001_001 is the column for “Total Population,” so we can use it to see how many rows didn’t have that information added in.

merged.SE_A04001_001.isna().value_counts()

## False    12820
## True        16
## Name: SE_A04001_001, dtype: int64

I’m impressed, that’s not very many at all!

Now that our dataset is ready for analysis, let’s save it and move on to the next step.

merged.to_csv("data/2013-complete.csv", index=False)