3.5 Combining our datasets

These datasets are pretty simple to deal with. Since Social Explorer downloads have a million and one columns, though, we’ll need to adjust the options in pandas to cope.

import pandas as pd
pd.set_option("display.max_columns", 200)

3.5.1 Reading in life expectancy

The life expectancy data - US_A.CSV - is a standard csv file.

life_expect = pd.read_csv("data/US_A.CSV")
life_expect.head()
Tract ID STATE2KX CNTY2KX TRACT2KX e(0) se(e(0)) Abridged life table flag
1001020100 1 1 20100 73.1 2.2348 3
1001020200 1 1 20200 76.9 3.3453 3
1001020400 1 1 20400 75.4 1.0216 3
1001020500 1 1 20500 79.4 1.1768 1
1001020600 1 1 20600 73.1 1.5519 3

The column names are weird, though, and it’s only after looking at the data dictionary - Record_Layout_CensusTract_Life_Expectancy.pdf - that we understand e(0) is going to be our life expectancy column. Let’s rename it!

life_expect = life_expect.rename(columns={'e(0)': 'life_expectancy'})
life_expect.head()
Tract ID STATE2KX CNTY2KX TRACT2KX life_expectancy se(e(0)) Abridged life table flag
1001020100 1 1 20100 73.1 2.2348 3
1001020200 1 1 20200 76.9 3.3453 3
1001020400 1 1 20400 75.4 1.0216 3
1001020500 1 1 20500 79.4 1.1768 1
1001020600 1 1 20600 73.1 1.5519 3

And while we’re at it, might as well drop the columns we don’t need. Since we’re only using the census tract code - Tract ID - and the life expectancy, it’s slightly easier to select the columns we want instead of using .drop on the ones we don’t.

life_expect = life_expect[['Tract ID', 'life_expectancy']]
life_expect.head()
Tract ID life_expectancy
1001020100 73.1
1001020200 76.9
1001020400 75.4
1001020500 79.4
1001020600 73.1

3.5.2 Reading in unemployment

An irritating thing about Social Explorer data is that it will always give you an error if you read it into pandas. This is because it’s a Latin-1 encoded file, which more or less means is text that only likes American letters and doesn’t support emoji.

Pandas doesn’t know this unless you tell it, though. By default pandas assumes every file is text that likes all the world’s languages and supports emoji, which is a format called Unicode or UTF-8.

To avoid this error we need to tell pandas to use Latin-1 when opening the file.

unemployment = pd.read_csv("data/R12221544_SL140.csv", encoding='latin-1')
unemployment.head()
Geo_FIPS Geo_GEOID Geo_NAME Geo_QName Geo_STUSAB Geo_SUMLEV Geo_GEOCOMP Geo_FILEID Geo_LOGRECNO Geo_US Geo_REGION Geo_DIVISION Geo_STATECE Geo_STATE Geo_COUNTY Geo_COUSUB Geo_PLACE Geo_PLACESE Geo_TRACT Geo_BLKGRP Geo_CONCIT Geo_AIANHH Geo_AIANHHFP Geo_AIHHTLI Geo_AITSCE Geo_AITS Geo_ANRC Geo_CBSA Geo_CSA Geo_METDIV Geo_MACC Geo_MEMI Geo_NECTA Geo_CNECTA Geo_NECTADIV Geo_UA Geo_UACP Geo_CDCURR Geo_SLDU Geo_SLDL Geo_VTD Geo_ZCTA3 Geo_ZCTA5 Geo_SUBMCD Geo_SDELM Geo_SDSEC Geo_SDUNI Geo_UR Geo_PCI Geo_TAZ Geo_UGA Geo_BTTR Geo_BTBG Geo_PUMA5 Geo_PUMA1 ACS15_5yr_B23025001 ACS15_5yr_B23025002 ACS15_5yr_B23025003 ACS15_5yr_B23025004 ACS15_5yr_B23025005 ACS15_5yr_B23025006 ACS15_5yr_B23025007 ACS15_5yr_B23025001s ACS15_5yr_B23025002s ACS15_5yr_B23025003s ACS15_5yr_B23025004s ACS15_5yr_B23025005s ACS15_5yr_B23025006s ACS15_5yr_B23025007s
1001020100 14000US01001020100 Census Tract 201, Autauga County, Alabama Census Tract 201, Autauga County, Alabama al 140 0 ACSSF 1760 NaN NaN NaN NaN 1 1 NaN NaN NaN 20100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1554 997 997 943 54 0 557 92.12121 85.45455 85.45455 83.63636 18.78788 6.666667 67.87879
1001020200 14000US01001020200 Census Tract 202, Autauga County, Alabama Census Tract 202, Autauga County, Alabama al 140 0 ACSSF 1761 NaN NaN NaN NaN 1 1 NaN NaN NaN 20200 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1731 884 869 753 116 15 847 143.03030 115.15152 114.54545 107.27273 38.18182 14.545454 86.66667
1001020300 14000US01001020300 Census Tract 203, Autauga County, Alabama Census Tract 203, Autauga County, Alabama al 140 0 ACSSF 1762 NaN NaN NaN NaN 1 1 NaN NaN NaN 20300 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2462 1472 1464 1373 91 8 990 169.09091 132.12121 134.54545 123.03030 31.51515 8.484849 120.60606
1001020400 14000US01001020400 Census Tract 204, Autauga County, Alabama Census Tract 204, Autauga County, Alabama al 140 0 ACSSF 1763 NaN NaN NaN NaN 1 1 NaN NaN NaN 20400 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3424 2013 1998 1782 216 15 1411 197.57576 157.57576 161.81818 132.12121 58.78788 14.545454 127.87879
1001020500 14000US01001020500 Census Tract 205, Autauga County, Alabama Census Tract 205, Autauga County, Alabama al 140 0 ACSSF 1764 NaN NaN NaN NaN 1 1 NaN NaN NaN 20500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 8198 5461 5258 5037 221 203 2737 321.81818 339.39394 356.96970 369.09091 89.09091 103.030303 273.93939

While it did open, the result is mildly horrifying. How many columns is that?

unemployment.shape
## (74001, 69)

The first number is rows, the second one is columns - so, either way, a lot. Most of them are weird geographic identifiers - which we aren’t interested in - and then a handful of employment measures. If we look at our data dictionary we can find the codes for the unemployment columns we’re interested in:

  • ACS15_5yr_B23025001 Total population
  • ACS15_5yr_B23025005 Number unemployed

We’re also interested in Geo_FIPS, as it’s our census tract identifier which will let us connect to the life expectancy dataset. Let’s select only the columns we’re interested in to clean things up a bit:

unemployment = unemployment[['Geo_FIPS', 'ACS15_5yr_B23025005', 'ACS15_5yr_B23025001']]
unemployment.head()
Geo_FIPS ACS15_5yr_B23025005 ACS15_5yr_B23025001
1001020100 54 1554
1001020200 116 1731
1001020300 91 2462
1001020400 216 3424
1001020500 221 8198

Also, these are the raw numbers of unemployed people. We’re instead interested in percent unemployment, which will allow us to be able to compare census tracts with varying numbers of people.

We’ll save the percent unemployment into a new column:

unemployment['unemployed_pct'] = unemployment.ACS15_5yr_B23025005 / unemployment.ACS15_5yr_B23025001 * 100
unemployment.head()
Geo_FIPS ACS15_5yr_B23025005 ACS15_5yr_B23025001 unemployed_pct
1001020100 54 1554 3.474903
1001020200 116 1731 6.701329
1001020300 91 2462 3.696182
1001020400 216 3424 6.308411
1001020500 221 8198 2.695779

3.5.3 Final joining

Now that we have two datasets with matching census tract columns, we can easily join our datasets together. Both have different names for those columns, though:

  • Tract ID for life expectancy
  • Geo_FIPS for unemployment

But it isn’t a problem! When we merge, we’ll just need to use left_on= and right_on=.

df = life_expect.merge(unemployment, left_on='Tract ID', right_on='Geo_FIPS')
df.head()
Tract ID life_expectancy Geo_FIPS ACS15_5yr_B23025005 ACS15_5yr_B23025001 unemployed_pct
0 1001020100 73.1 1001020100 54 1554 3.474903
1 1001020200 76.9 1001020200 116 1731 6.701329
2 1001020400 75.4 1001020400 216 3424 6.308411
3 1001020500 79.4 1001020500 221 8198 2.695779
4 1001020600 73.1 1001020600 190 2855 6.654991

It’s easy to remember left_on= and right_on= if you think about the actual position of the two dataframes. life_expect is on the left, so left_on= is about its column. Then unemployment is on the right, so Geo_FIPS gets attached to right_on=.

Now that our data is cleaned and merged we can get started with our regression.