3.5 Combining our datasets

3.5.1 Reading in life expectancy

The life expectancy data - US_A.CSV - is a standard csv file.

life_expect = pd.read_csv("data/US_A.CSV")
life_expect.head()

Tract ID	STATE2KX	CNTY2KX	TRACT2KX	e(0)	se(e(0))	Abridged life table flag
1001020100	1	1	20100	73.1	2.2348	3
1001020200	1	1	20200	76.9	3.3453	3
1001020400	1	1	20400	75.4	1.0216	3
1001020500	1	1	20500	79.4	1.1768	1
1001020600	1	1	20600	73.1	1.5519	3

The column names are weird, though, and it’s only after looking at the data dictionary - Record_Layout_CensusTract_Life_Expectancy.pdf - that we understand e(0) is going to be our life expectancy column. Let’s rename it!

life_expect = life_expect.rename(columns={'e(0)': 'life_expectancy'})
life_expect.head()

Tract ID	STATE2KX	CNTY2KX	TRACT2KX	life_expectancy	se(e(0))	Abridged life table flag
1001020100	1	1	20100	73.1	2.2348	3
1001020200	1	1	20200	76.9	3.3453	3
1001020400	1	1	20400	75.4	1.0216	3
1001020500	1	1	20500	79.4	1.1768	1
1001020600	1	1	20600	73.1	1.5519	3

And while we’re at it, might as well drop the columns we don’t need. Since we’re only using the census tract code - Tract ID - and the life expectancy, it’s slightly easier to select the columns we want instead of using .drop on the ones we don’t.

life_expect = life_expect[['Tract ID', 'life_expectancy']]
life_expect.head()

Tract ID	life_expectancy
1001020100	73.1
1001020200	76.9
1001020400	75.4
1001020500	79.4
1001020600	73.1

3.5.2 Reading in unemployment

An irritating thing about Social Explorer data is that it will always give you an error if you read it into pandas. This is because it’s a Latin-1 encoded file, which more or less means is text that only likes American letters and doesn’t support emoji.

Pandas doesn’t know this unless you tell it, though. By default pandas assumes every file is text that likes all the world’s languages and supports emoji, which is a format called Unicode or UTF-8.

To avoid this error we need to tell pandas to use Latin-1 when opening the file.

unemployment = pd.read_csv("data/R12221544_SL140.csv", encoding='latin-1')
unemployment.head()

Geo_FIPS	Geo_GEOID	Geo_NAME	Geo_QName	Geo_STUSAB	Geo_SUMLEV	Geo_FILEID	Geo_LOGRECNO	Geo_US	Geo_REGION	Geo_DIVISION	Geo_STATECE	Geo_STATE	Geo_COUNTY	Geo_COUSUB	Geo_PLACE	Geo_PLACESE	Geo_TRACT	Geo_BLKGRP	Geo_CONCIT	Geo_AIANHH	Geo_AIANHHFP	Geo_AIHHTLI	Geo_AITSCE	Geo_AITS	Geo_ANRC	Geo_CBSA	Geo_CSA	Geo_METDIV	Geo_MACC	Geo_MEMI	Geo_NECTA	Geo_CNECTA	Geo_NECTADIV	Geo_UA	Geo_UACP	Geo_CDCURR	Geo_SLDU	Geo_SLDL	Geo_VTD	Geo_ZCTA3	Geo_ZCTA5	Geo_SUBMCD	Geo_SDELM	Geo_SDSEC	Geo_SDUNI	Geo_UR	Geo_PCI	Geo_TAZ	Geo_UGA	Geo_BTTR	Geo_BTBG	Geo_PUMA5	Geo_PUMA1	ACS15_5yr_B23025001	ACS15_5yr_B23025002	ACS15_5yr_B23025003	ACS15_5yr_B23025004	ACS15_5yr_B23025005	ACS15_5yr_B23025006	ACS15_5yr_B23025007	ACS15_5yr_B23025001s	ACS15_5yr_B23025002s	ACS15_5yr_B23025003s	ACS15_5yr_B23025004s	ACS15_5yr_B23025005s	ACS15_5yr_B23025006s	ACS15_5yr_B23025007s
1001020100	14000US01001020100	Census Tract 201, Autauga County, Alabama	Census Tract 201, Autauga County, Alabama	al	140	ACSSF	1760	NaN	NaN	NaN	NaN	1	1	NaN	NaN	NaN	20100	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1554	997	997	943	54	0	557	92.12121	85.45455	85.45455	83.63636	18.78788	6.666667	67.87879
1001020200	14000US01001020200	Census Tract 202, Autauga County, Alabama	Census Tract 202, Autauga County, Alabama	al	140	ACSSF	1761	NaN	NaN	NaN	NaN	1	1	NaN	NaN	NaN	20200	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1731	884	869	753	116	15	847	143.03030	115.15152	114.54545	107.27273	38.18182	14.545454	86.66667
1001020300	14000US01001020300	Census Tract 203, Autauga County, Alabama	Census Tract 203, Autauga County, Alabama	al	140	ACSSF	1762	NaN	NaN	NaN	NaN	1	1	NaN	NaN	NaN	20300	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2462	1472	1464	1373	91	8	990	169.09091	132.12121	134.54545	123.03030	31.51515	8.484849	120.60606
1001020400	14000US01001020400	Census Tract 204, Autauga County, Alabama	Census Tract 204, Autauga County, Alabama	al	140	ACSSF	1763	NaN	NaN	NaN	NaN	1	1	NaN	NaN	NaN	20400	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3424	2013	1998	1782	216	15	1411	197.57576	157.57576	161.81818	132.12121	58.78788	14.545454	127.87879
1001020500	14000US01001020500	Census Tract 205, Autauga County, Alabama	Census Tract 205, Autauga County, Alabama	al	140	ACSSF	1764	NaN	NaN	NaN	NaN	1	1	NaN	NaN	NaN	20500	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	8198	5461	5258	5037	221	203	2737	321.81818	339.39394	356.96970	369.09091	89.09091	103.030303	273.93939

While it did open, the result is mildly horrifying. How many columns is that?

unemployment.shape

## (74001, 69)

The first number is rows, the second one is columns - so, either way, a lot. Most of them are weird geographic identifiers - which we aren’t interested in - and then a handful of employment measures. If we look at our data dictionary we can find the codes for the unemployment columns we’re interested in:

ACS15_5yr_B23025001 Total population
ACS15_5yr_B23025005 Number unemployed

We’re also interested in Geo_FIPS, as it’s our census tract identifier which will let us connect to the life expectancy dataset. Let’s select only the columns we’re interested in to clean things up a bit:

unemployment = unemployment[['Geo_FIPS', 'ACS15_5yr_B23025005', 'ACS15_5yr_B23025001']]
unemployment.head()

Geo_FIPS	ACS15_5yr_B23025005	ACS15_5yr_B23025001
1001020100	54	1554
1001020200	116	1731
1001020300	91	2462
1001020400	216	3424
1001020500	221	8198

Also, these are the raw numbers of unemployed people. We’re instead interested in percent unemployment, which will allow us to be able to compare census tracts with varying numbers of people.

We’ll save the percent unemployment into a new column:

unemployment['unemployed_pct'] = unemployment.ACS15_5yr_B23025005 / unemployment.ACS15_5yr_B23025001 * 100
unemployment.head()

Geo_FIPS	ACS15_5yr_B23025005	ACS15_5yr_B23025001	unemployed_pct
1001020100	54	1554	3.474903
1001020200	116	1731	6.701329
1001020300	91	2462	3.696182
1001020400	216	3424	6.308411
1001020500	221	8198	2.695779

3.5.3 Final joining

Now that we have two datasets with matching census tract columns, we can easily join our datasets together. Both have different names for those columns, though:

Tract ID for life expectancy
Geo_FIPS for unemployment

But it isn’t a problem! When we merge, we’ll just need to use left_on= and right_on=.

df = life_expect.merge(unemployment, left_on='Tract ID', right_on='Geo_FIPS')
df.head()

	Tract ID	life_expectancy	Geo_FIPS	ACS15_5yr_B23025005	ACS15_5yr_B23025001	unemployed_pct
0	1001020100	73.1	1001020100	54	1554	3.474903
1	1001020200	76.9	1001020200	116	1731	6.701329
2	1001020400	75.4	1001020400	216	3424	6.308411
3	1001020500	79.4	1001020500	221	8198	2.695779
4	1001020600	73.1	1001020600	190	2855	6.654991

It’s easy to remember left_on= and right_on= if you think about the actual position of the two dataframes. life_expect is on the left, so left_on= is about its column. Then unemployment is on the right, so Geo_FIPS gets attached to right_on=.

Now that our data is cleaned and merged we can get started with our regression.