The Associated Press and Life Expectancy#

Story: AP analysis: Unemployment, income affect life expectancy

Author: Nicky Forster, Associated Press

Topics: Census Data, Linear Regression

Datasets

  • R12221544_SL140.csv: ACS 2015 5-year, tract level, from Social Explorer
    • Table B23025: Employment Status
    • R12221544.txt is the data dictionary
  • R12221550_SL140.csv: ACS 2015 5-year, tract level, from Social Explorer
    • Table B23025: Employment Status
    • Table B06009: Educational Attainment
    • Table B03002: Race
    • Table B19013: Median income
    • Table C17002: Ratio of income to poverty level
    • R12221550.txt is the data dictionary
  • US_A.CSV: life expectancy by census tract, from USALEEP
    • Record_Layout_CensusTract_Life_Expectancy.pdf is data dictionary

What's the story?#

We're trying to figure out how the life expectancy in a census tract is related to other factors like unemployment, income, and others.

Imports#

import pandas as pd

pd.set_option("display.max_columns", 100)

Reading in our data#

Read in USA_A.CSV#

We're going to rename a few columns so they make a little more sense.

life_expec = pd.read_csv("data/US_A.CSV")
life_expec.columns = ['tract_id', 'STATE2KX','CNTY2KX', 'TRACT2KX', 'life_expectancy', 
                      'life_expectancy_std_err', 'flag']
life_expec.head()
tract_id STATE2KX CNTY2KX TRACT2KX life_expectancy life_expectancy_std_err flag
0 1001020100 1 1 20100 73.1 2.2348 3
1 1001020200 1 1 20200 76.9 3.3453 3
2 1001020400 1 1 20400 75.4 1.0216 3
3 1001020500 1 1 20500 79.4 1.1768 1
4 1001020600 1 1 20600 73.1 1.5519 3

Open R12221544_SL140.csv#

We'll keep the original names here - we'll just need to keep an eye on the codebook later.

columns = ['Geo_FIPS', 'ACS15_5yr_B23025001', 'ACS15_5yr_B23025002',
            'ACS15_5yr_B23025003', 'ACS15_5yr_B23025004', 'ACS15_5yr_B23025005', 
            'ACS15_5yr_B23025006', 'ACS15_5yr_B23025007']
employment = pd.read_csv("data/R12221544_SL140.csv", usecols=columns, encoding='latin-1')
employment.head()
Geo_FIPS ACS15_5yr_B23025001 ACS15_5yr_B23025002 ACS15_5yr_B23025003 ACS15_5yr_B23025004 ACS15_5yr_B23025005 ACS15_5yr_B23025006 ACS15_5yr_B23025007
0 1001020100 1554 997 997 943 54 0 557
1 1001020200 1731 884 869 753 116 15 847
2 1001020300 2462 1472 1464 1373 91 8 990
3 1001020400 3424 2013 1998 1782 216 15 1411
4 1001020500 8198 5461 5258 5037 221 203 2737

Create a new column for percent unemployment#

We'll be using the total population in the census tract as the baseline for employment.

employment['pct_unemployment'] = employment['ACS15_5yr_B23025005'] / employment['ACS15_5yr_B23025001'] * 100
employment.head()
Geo_FIPS ACS15_5yr_B23025001 ACS15_5yr_B23025002 ACS15_5yr_B23025003 ACS15_5yr_B23025004 ACS15_5yr_B23025005 ACS15_5yr_B23025006 ACS15_5yr_B23025007 pct_unemployment
0 1001020100 1554 997 997 943 54 0 557 3.474903
1 1001020200 1731 884 869 753 116 15 847 6.701329
2 1001020300 2462 1472 1464 1373 91 8 990 3.696182
3 1001020400 3424 2013 1998 1782 216 15 1411 6.308411
4 1001020500 8198 5461 5258 5037 221 203 2737 2.695779

Read in R12221550_SL140.csv#

It's also from the Census, and has many, many, many more columns with impossible names.

census = pd.read_csv("data/R12221550_SL140.csv", encoding='latin-1')
census.head()
Geo_FIPS Geo_GEOID Geo_NAME Geo_QName Geo_STUSAB Geo_SUMLEV Geo_GEOCOMP Geo_FILEID Geo_LOGRECNO Geo_US Geo_REGION Geo_DIVISION Geo_STATECE Geo_STATE Geo_COUNTY Geo_COUSUB Geo_PLACE Geo_PLACESE Geo_TRACT Geo_BLKGRP Geo_CONCIT Geo_AIANHH Geo_AIANHHFP Geo_AIHHTLI Geo_AITSCE Geo_AITS Geo_ANRC Geo_CBSA Geo_CSA Geo_METDIV Geo_MACC Geo_MEMI Geo_NECTA Geo_CNECTA Geo_NECTADIV Geo_UA Geo_UACP Geo_CDCURR Geo_SLDU Geo_SLDL Geo_VTD Geo_ZCTA3 Geo_ZCTA5 Geo_SUBMCD Geo_SDELM Geo_SDSEC Geo_SDUNI Geo_UR Geo_PCI Geo_TAZ ... ACS15_5yr_B06009013s ACS15_5yr_B06009014s ACS15_5yr_B06009015s ACS15_5yr_B06009016s ACS15_5yr_B06009017s ACS15_5yr_B06009018s ACS15_5yr_B06009019s ACS15_5yr_B06009020s ACS15_5yr_B06009021s ACS15_5yr_B06009022s ACS15_5yr_B06009023s ACS15_5yr_B06009024s ACS15_5yr_B06009025s ACS15_5yr_B06009026s ACS15_5yr_B06009027s ACS15_5yr_B06009028s ACS15_5yr_B06009029s ACS15_5yr_B06009030s ACS15_5yr_C17002001 ACS15_5yr_C17002002 ACS15_5yr_C17002003 ACS15_5yr_C17002004 ACS15_5yr_C17002005 ACS15_5yr_C17002006 ACS15_5yr_C17002007 ACS15_5yr_C17002008 ACS15_5yr_C17002001s ACS15_5yr_C17002002s ACS15_5yr_C17002003s ACS15_5yr_C17002004s ACS15_5yr_C17002005s ACS15_5yr_C17002006s ACS15_5yr_C17002007s ACS15_5yr_C17002008s ACS15_5yr_B19013001 ACS15_5yr_B19013001s ACS15_5yr_B23025001 ACS15_5yr_B23025002 ACS15_5yr_B23025003 ACS15_5yr_B23025004 ACS15_5yr_B23025005 ACS15_5yr_B23025006 ACS15_5yr_B23025007 ACS15_5yr_B23025001s ACS15_5yr_B23025002s ACS15_5yr_B23025003s ACS15_5yr_B23025004s ACS15_5yr_B23025005s ACS15_5yr_B23025006s ACS15_5yr_B23025007s
0 1001020100 14000US01001020100 Census Tract 201, Autauga County, Alabama Census Tract 201, Autauga County, Alabama al 140 0 ACSSF 1760 NaN NaN NaN NaN 1 1 NaN NaN NaN 20100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 67.878788 24.848485 51.515152 17.575758 23.636364 24.242424 18.787879 5.454545 6.666667 17.575758 6.666667 6.666667 18.181818 12.727273 4.242424 6.666667 6.666667 11.515152 1948 26 132 81 101 125 16 1467 123.030303 18.787879 60.606061 40.606061 58.181818 60.000000 10.909091 127.272727 61838.0 7212.121212 1554 997 997 943 54 0 557 92.121212 85.454545 85.454545 83.636364 18.787879 6.666667 67.878788
1 1001020200 14000US01001020200 Census Tract 202, Autauga County, Alabama Census Tract 202, Autauga County, Alabama al 140 0 ACSSF 1761 NaN NaN NaN NaN 1 1 NaN NaN NaN 20200 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 33.939394 22.424242 26.060606 20.000000 13.939394 9.696970 9.090909 6.666667 6.666667 9.090909 6.666667 6.666667 28.484848 18.181818 20.000000 6.666667 4.848485 6.666667 1983 185 320 232 58 34 25 1129 155.151515 110.909091 74.545455 88.484848 25.454545 18.181818 16.969697 144.848485 32303.0 8204.848485 1731 884 869 753 116 15 847 143.030303 115.151515 114.545455 107.272727 38.181818 14.545455 86.666667
2 1001020300 14000US01001020300 Census Tract 203, Autauga County, Alabama Census Tract 203, Autauga County, Alabama al 140 0 ACSSF 1762 NaN NaN NaN NaN 1 1 NaN NaN NaN 20300 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 112.121212 34.545455 59.393939 64.848485 33.939394 28.484848 20.000000 6.666667 6.666667 8.484848 17.575758 6.666667 26.666667 16.969697 13.939394 6.666667 6.666667 6.666667 2968 164 213 148 207 82 520 1634 244.848485 138.181818 70.303030 60.606061 78.181818 39.393939 189.090909 175.151515 44922.0 3411.515152 2462 1472 1464 1373 91 8 990 169.090909 132.121212 134.545455 123.030303 31.515152 8.484848 120.606061
3 1001020400 14000US01001020400 Census Tract 204, Autauga County, Alabama Census Tract 204, Autauga County, Alabama al 140 0 ACSSF 1763 NaN NaN NaN NaN 1 1 NaN NaN NaN 20400 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 107.272727 36.969697 89.090909 56.969697 38.787879 37.575758 17.575758 6.666667 6.060606 16.363636 6.666667 6.666667 58.181818 60.000000 17.575758 12.121212 6.666667 12.727273 4423 18 74 141 182 583 201 3224 298.787879 17.575758 41.818182 53.333333 58.181818 188.484848 140.000000 331.515152 54329.0 4244.242424 3424 2013 1998 1782 216 15 1411 197.575758 157.575758 161.818182 132.121212 58.787879 14.545455 127.878788
4 1001020500 14000US01001020500 Census Tract 205, Autauga County, Alabama Census Tract 205, Autauga County, Alabama al 140 0 ACSSF 1764 NaN NaN NaN NaN 1 1 NaN NaN NaN 20500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 283.636364 82.424242 120.606061 210.303030 152.121212 157.575758 58.787879 10.909091 14.545455 38.181818 26.666667 23.636364 93.939394 27.272727 69.696970 10.909091 29.090909 10.909091 10563 251 952 256 1064 289 89 7662 369.696970 94.545455 521.212121 113.333333 385.454545 162.424242 52.121212 641.818182 51965.0 4203.030303 8198 5461 5258 5037 221 203 2737 321.818182 339.393939 356.969697 369.090909 89.090909 103.030303 273.939394

5 rows × 189 columns

Feature engineering#

Instead of raw population numbers, we're curious about percentages. What percent of people are certain races? What percent of people have not finished high school?

We're also adjusting the median income to be tens of thousands, because it reads better when we're understanding our final regression output.

census_features = pd.DataFrame({
    'Geo_FIPS': census.Geo_FIPS,
    'pct_black': census.ACS15_5yr_B03002004 / census.ACS15_5yr_B03002001 * 100,
    'pct_white': census.ACS15_5yr_B03002003 / census.ACS15_5yr_B03002001 * 100,
    'pct_hispanic': census.ACS15_5yr_B03002012 / census.ACS15_5yr_B03002001 * 100,
    'pct_less_than_hs': census.ACS15_5yr_B06009002 / census.ACS15_5yr_B06009001 * 100,
    'pct_1_15_poverty': (census.ACS15_5yr_C17002004 + census.ACS15_5yr_C17002005) / census.ACS15_5yr_C17002001 * 100,
    'income_10k': census.ACS15_5yr_B19013001 / 10000,
})
census_features.head()
Geo_FIPS pct_black pct_white pct_hispanic pct_less_than_hs pct_1_15_poverty income_10k
0 1001020100 7.700205 87.422998 0.872690 14.802896 9.342916 6.1838
1 1001020200 53.293135 40.445269 0.788497 25.483178 14.624307 3.2303
2 1001020300 18.564690 74.528302 0.000000 10.655738 11.960916 4.4922
3 1001020400 3.662672 82.794483 10.490617 11.693687 7.302736 5.4329
4 1001020500 24.844374 68.456750 0.743287 4.445082 12.496450 5.1965

Merging the data#

Merge the dataframes together based on their census tract.

merged = life_expec.merge(employment, left_on='tract_id', right_on='Geo_FIPS')
merged = merged.merge(census_features, left_on='Geo_FIPS', right_on='Geo_FIPS')
merged.head()
tract_id STATE2KX CNTY2KX TRACT2KX life_expectancy life_expectancy_std_err flag Geo_FIPS ACS15_5yr_B23025001 ACS15_5yr_B23025002 ACS15_5yr_B23025003 ACS15_5yr_B23025004 ACS15_5yr_B23025005 ACS15_5yr_B23025006 ACS15_5yr_B23025007 pct_unemployment pct_black pct_white pct_hispanic pct_less_than_hs pct_1_15_poverty income_10k
0 1001020100 1 1 20100 73.1 2.2348 3 1001020100 1554 997 997 943 54 0 557 3.474903 7.700205 87.422998 0.872690 14.802896 9.342916 6.1838
1 1001020200 1 1 20200 76.9 3.3453 3 1001020200 1731 884 869 753 116 15 847 6.701329 53.293135 40.445269 0.788497 25.483178 14.624307 3.2303
2 1001020400 1 1 20400 75.4 1.0216 3 1001020400 3424 2013 1998 1782 216 15 1411 6.308411 3.662672 82.794483 10.490617 11.693687 7.302736 5.4329
3 1001020500 1 1 20500 79.4 1.1768 1 1001020500 8198 5461 5258 5037 221 203 2737 2.695779 24.844374 68.456750 0.743287 4.445082 12.496450 5.1965
4 1001020600 1 1 20600 73.1 1.5519 3 1001020600 2855 1802 1750 1560 190 52 1053 6.654991 11.918982 72.916126 13.061542 17.487267 10.854324 6.3092

Select our feature columns and remove missing data#

We're only interested in a few columns, so we'll keep those and discard the rest. Note that we're including our features as well as our target column, life_expectancy.

features = merged[['pct_black', 'pct_white', 'pct_hispanic', 'pct_less_than_hs', 'pct_1_15_poverty',
                   'income_10k', 'pct_unemployment', 'life_expectancy']].copy()
features.head()
pct_black pct_white pct_hispanic pct_less_than_hs pct_1_15_poverty income_10k pct_unemployment life_expectancy
0 7.700205 87.422998 0.872690 14.802896 9.342916 6.1838 3.474903 73.1
1 53.293135 40.445269 0.788497 25.483178 14.624307 3.2303 6.701329 76.9
2 3.662672 82.794483 10.490617 11.693687 7.302736 5.4329 6.308411 75.4
3 24.844374 68.456750 0.743287 4.445082 12.496450 5.1965 2.695779 79.4
4 11.918982 72.916126 13.061542 17.487267 10.854324 6.3092 6.654991 73.1

Check how many rows we have, then how many we have after removing missing data.

features.shape
(65662, 8)
features = features.dropna()
features.shape
(65656, 8)

Running the regression#

Using the statsmodels package, we'll run a linear regression to find the coefficient relating life expectancy and all of our feature columns from above. We're doing this in the dataframe method, as opposed to the formula method, which is covered in another notebook.

import statsmodels.api as sm

X = features.drop('life_expectancy', axis=1)
y = features.life_expectancy

model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
results.summary()
OLS Regression Results
Dep. Variable: life_expectancy R-squared: 0.490
Model: OLS Adj. R-squared: 0.490
Method: Least Squares F-statistic: 8997.
Date: Thu, 07 Nov 2019 Prob (F-statistic): 0.00
Time: 12:26:41 Log-Likelihood: -1.6208e+05
No. Observations: 65656 AIC: 3.242e+05
Df Residuals: 65648 BIC: 3.243e+05
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 81.2365 0.122 665.628 0.000 80.997 81.476
pct_black -0.0666 0.001 -56.960 0.000 -0.069 -0.064
pct_white -0.0386 0.001 -36.707 0.000 -0.041 -0.037
pct_hispanic 0.0131 0.001 10.298 0.000 0.011 0.016
pct_less_than_hs -0.0862 0.002 -48.979 0.000 -0.090 -0.083
pct_1_15_poverty -0.0596 0.003 -21.738 0.000 -0.065 -0.054
income_10k 0.4825 0.006 83.217 0.000 0.471 0.494
pct_unemployment -0.1490 0.004 -33.408 0.000 -0.158 -0.140
Omnibus: 2114.193 Durbin-Watson: 1.520
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4788.035
Skew: 0.183 Prob(JB): 0.00
Kurtosis: 4.271 Cond. No. 790.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Translate that into the form "every 1 percentage point change in unemployment translates to a Y change in life expectancy"

# Every 1 percentage point change in unemployment translates to a -0.15 change in life expectancy

Translate some of your coefficients into the form "every X percentage point change in unemployment translates to a Y change in life expectancy." Do this with numbers that are meaningful, and in a way that is easily understandable to your reader.

# A 1 percentage point increase in unemployment translates to a 0.15 year decrease in life expectancy
# A 10 percentage point increase in unemployment translates to a 1.5 year decrease in life expectancy

Do your numbers seem off? Things too big, or too small? Make sure your percentages are percentage points between 0 and 100, not fractions between 0 and 1.