3.5 Combining our datasets
These datasets are pretty simple to deal with. Since Social Explorer downloads have a million and one columns, though, we’ll need to adjust the options in pandas to cope.
3.5.1 Reading in life expectancy
The life expectancy data - US_A.CSV
- is a standard csv file.
Tract ID | STATE2KX | CNTY2KX | TRACT2KX | e(0) | se(e(0)) | Abridged life table flag |
---|---|---|---|---|---|---|
1001020100 | 1 | 1 | 20100 | 73.1 | 2.2348 | 3 |
1001020200 | 1 | 1 | 20200 | 76.9 | 3.3453 | 3 |
1001020400 | 1 | 1 | 20400 | 75.4 | 1.0216 | 3 |
1001020500 | 1 | 1 | 20500 | 79.4 | 1.1768 | 1 |
1001020600 | 1 | 1 | 20600 | 73.1 | 1.5519 | 3 |
The column names are weird, though, and it’s only after looking at the data dictionary - Record_Layout_CensusTract_Life_Expectancy.pdf
- that we understand e(0)
is going to be our life expectancy column. Let’s rename it!
Tract ID | STATE2KX | CNTY2KX | TRACT2KX | life_expectancy | se(e(0)) | Abridged life table flag |
---|---|---|---|---|---|---|
1001020100 | 1 | 1 | 20100 | 73.1 | 2.2348 | 3 |
1001020200 | 1 | 1 | 20200 | 76.9 | 3.3453 | 3 |
1001020400 | 1 | 1 | 20400 | 75.4 | 1.0216 | 3 |
1001020500 | 1 | 1 | 20500 | 79.4 | 1.1768 | 1 |
1001020600 | 1 | 1 | 20600 | 73.1 | 1.5519 | 3 |
And while we’re at it, might as well drop the columns we don’t need. Since we’re only using the census tract code - Tract ID - and the life expectancy, it’s slightly easier to select the columns we want instead of using .drop
on the ones we don’t.
Tract ID | life_expectancy |
---|---|
1001020100 | 73.1 |
1001020200 | 76.9 |
1001020400 | 75.4 |
1001020500 | 79.4 |
1001020600 | 73.1 |
3.5.2 Reading in unemployment
An irritating thing about Social Explorer data is that it will always give you an error if you read it into pandas. This is because it’s a Latin-1 encoded file, which more or less means is text that only likes American letters and doesn’t support emoji.
Pandas doesn’t know this unless you tell it, though. By default pandas assumes every file is text that likes all the world’s languages and supports emoji, which is a format called Unicode or UTF-8.
To avoid this error we need to tell pandas to use Latin-1 when opening the file.
Geo_FIPS | Geo_GEOID | Geo_NAME | Geo_QName | Geo_STUSAB | Geo_SUMLEV | Geo_GEOCOMP | Geo_FILEID | Geo_LOGRECNO | Geo_US | Geo_REGION | Geo_DIVISION | Geo_STATECE | Geo_STATE | Geo_COUNTY | Geo_COUSUB | Geo_PLACE | Geo_PLACESE | Geo_TRACT | Geo_BLKGRP | Geo_CONCIT | Geo_AIANHH | Geo_AIANHHFP | Geo_AIHHTLI | Geo_AITSCE | Geo_AITS | Geo_ANRC | Geo_CBSA | Geo_CSA | Geo_METDIV | Geo_MACC | Geo_MEMI | Geo_NECTA | Geo_CNECTA | Geo_NECTADIV | Geo_UA | Geo_UACP | Geo_CDCURR | Geo_SLDU | Geo_SLDL | Geo_VTD | Geo_ZCTA3 | Geo_ZCTA5 | Geo_SUBMCD | Geo_SDELM | Geo_SDSEC | Geo_SDUNI | Geo_UR | Geo_PCI | Geo_TAZ | Geo_UGA | Geo_BTTR | Geo_BTBG | Geo_PUMA5 | Geo_PUMA1 | ACS15_5yr_B23025001 | ACS15_5yr_B23025002 | ACS15_5yr_B23025003 | ACS15_5yr_B23025004 | ACS15_5yr_B23025005 | ACS15_5yr_B23025006 | ACS15_5yr_B23025007 | ACS15_5yr_B23025001s | ACS15_5yr_B23025002s | ACS15_5yr_B23025003s | ACS15_5yr_B23025004s | ACS15_5yr_B23025005s | ACS15_5yr_B23025006s | ACS15_5yr_B23025007s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1001020100 | 14000US01001020100 | Census Tract 201, Autauga County, Alabama | Census Tract 201, Autauga County, Alabama | al | 140 | 0 | ACSSF | 1760 | NaN | NaN | NaN | NaN | 1 | 1 | NaN | NaN | NaN | 20100 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1554 | 997 | 997 | 943 | 54 | 0 | 557 | 92.12121 | 85.45455 | 85.45455 | 83.63636 | 18.78788 | 6.666667 | 67.87879 |
1001020200 | 14000US01001020200 | Census Tract 202, Autauga County, Alabama | Census Tract 202, Autauga County, Alabama | al | 140 | 0 | ACSSF | 1761 | NaN | NaN | NaN | NaN | 1 | 1 | NaN | NaN | NaN | 20200 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1731 | 884 | 869 | 753 | 116 | 15 | 847 | 143.03030 | 115.15152 | 114.54545 | 107.27273 | 38.18182 | 14.545454 | 86.66667 |
1001020300 | 14000US01001020300 | Census Tract 203, Autauga County, Alabama | Census Tract 203, Autauga County, Alabama | al | 140 | 0 | ACSSF | 1762 | NaN | NaN | NaN | NaN | 1 | 1 | NaN | NaN | NaN | 20300 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2462 | 1472 | 1464 | 1373 | 91 | 8 | 990 | 169.09091 | 132.12121 | 134.54545 | 123.03030 | 31.51515 | 8.484849 | 120.60606 |
1001020400 | 14000US01001020400 | Census Tract 204, Autauga County, Alabama | Census Tract 204, Autauga County, Alabama | al | 140 | 0 | ACSSF | 1763 | NaN | NaN | NaN | NaN | 1 | 1 | NaN | NaN | NaN | 20400 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3424 | 2013 | 1998 | 1782 | 216 | 15 | 1411 | 197.57576 | 157.57576 | 161.81818 | 132.12121 | 58.78788 | 14.545454 | 127.87879 |
1001020500 | 14000US01001020500 | Census Tract 205, Autauga County, Alabama | Census Tract 205, Autauga County, Alabama | al | 140 | 0 | ACSSF | 1764 | NaN | NaN | NaN | NaN | 1 | 1 | NaN | NaN | NaN | 20500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8198 | 5461 | 5258 | 5037 | 221 | 203 | 2737 | 321.81818 | 339.39394 | 356.96970 | 369.09091 | 89.09091 | 103.030303 | 273.93939 |
While it did open, the result is mildly horrifying. How many columns is that?
## (74001, 69)
The first number is rows, the second one is columns - so, either way, a lot. Most of them are weird geographic identifiers - which we aren’t interested in - and then a handful of employment measures. If we look at our data dictionary we can find the codes for the unemployment columns we’re interested in:
ACS15_5yr_B23025001
Total populationACS15_5yr_B23025005
Number unemployed
We’re also interested in Geo_FIPS
, as it’s our census tract identifier which will let us connect to the life expectancy dataset. Let’s select only the columns we’re interested in to clean things up a bit:
unemployment = unemployment[['Geo_FIPS', 'ACS15_5yr_B23025005', 'ACS15_5yr_B23025001']]
unemployment.head()
Geo_FIPS | ACS15_5yr_B23025005 | ACS15_5yr_B23025001 |
---|---|---|
1001020100 | 54 | 1554 |
1001020200 | 116 | 1731 |
1001020300 | 91 | 2462 |
1001020400 | 216 | 3424 |
1001020500 | 221 | 8198 |
Also, these are the raw numbers of unemployed people. We’re instead interested in percent unemployment, which will allow us to be able to compare census tracts with varying numbers of people.
We’ll save the percent unemployment into a new column:
unemployment['unemployed_pct'] = unemployment.ACS15_5yr_B23025005 / unemployment.ACS15_5yr_B23025001 * 100
unemployment.head()
Geo_FIPS | ACS15_5yr_B23025005 | ACS15_5yr_B23025001 | unemployed_pct |
---|---|---|---|
1001020100 | 54 | 1554 | 3.474903 |
1001020200 | 116 | 1731 | 6.701329 |
1001020300 | 91 | 2462 | 3.696182 |
1001020400 | 216 | 3424 | 6.308411 |
1001020500 | 221 | 8198 | 2.695779 |
3.5.3 Final joining
Now that we have two datasets with matching census tract columns, we can easily join our datasets together. Both have different names for those columns, though:
Tract ID
for life expectancyGeo_FIPS
for unemployment
But it isn’t a problem! When we merge, we’ll just need to use left_on=
and right_on=
.
Tract ID | life_expectancy | Geo_FIPS | ACS15_5yr_B23025005 | ACS15_5yr_B23025001 | unemployed_pct | |
---|---|---|---|---|---|---|
0 | 1001020100 | 73.1 | 1001020100 | 54 | 1554 | 3.474903 |
1 | 1001020200 | 76.9 | 1001020200 | 116 | 1731 | 6.701329 |
2 | 1001020400 | 75.4 | 1001020400 | 216 | 3424 | 6.308411 |
3 | 1001020500 | 79.4 | 1001020500 | 221 | 8198 | 2.695779 |
4 | 1001020600 | 73.1 | 1001020600 | 190 | 2855 | 6.654991 |
It’s easy to remember left_on=
and right_on=
if you think about the actual position of the two dataframes. life_expect
is on the left, so left_on=
is about its column. Then unemployment
is on the right, so Geo_FIPS
gets attached to right_on=
.
Now that our data is cleaned and merged we can get started with our regression.