The Associated Press and Life Expectancy#
Story: AP analysis: Unemployment, income affect life expectancy
Author: Nicky Forster, Associated Press
Topics: Census Data, Linear Regression
Datasets
- R12221544_SL140.csv: ACS 2015 5-year, tract level, from Social Explorer
  - Table B23025: Employment Status
- R12221544.txt is the data dictionary
- R12221550_SL140.csv: ACS 2015 5-year, tract level, from Social Explorer
  - Table B23025: Employment Status
  - Table B06009: Educational Attainment
  - Table B03002: Race
  - Table B19013: Median income
  - Table C17002: Ratio of income to poverty level
- R12221550.txt is the data dictionary
- US_A.CSV: life expectancy by census tract, from USALEEP
- Record_Layout_CensusTract_Life_Expectancy.pdf is the data dictionary
What's the story?#
We're trying to figure out how life expectancy in a census tract relates to factors such as unemployment, income, education, race, and poverty.
Imports#
import pandas as pd
pd.set_option("display.max_columns", 100)
life_expec = pd.read_csv("data/US_A.CSV")
life_expec.columns = ['tract_id', 'STATE2KX', 'CNTY2KX', 'TRACT2KX', 'life_expectancy',
                      'life_expectancy_std_err', 'flag']
life_expec.head()
Open R12221544_SL140.csv#
We'll keep the original names here - we'll just need to keep an eye on the codebook later.
columns = ['Geo_FIPS', 'ACS15_5yr_B23025001', 'ACS15_5yr_B23025002',
           'ACS15_5yr_B23025003', 'ACS15_5yr_B23025004', 'ACS15_5yr_B23025005',
           'ACS15_5yr_B23025006', 'ACS15_5yr_B23025007']
employment = pd.read_csv("data/R12221544_SL140.csv", usecols=columns, encoding='latin-1')
employment.head()
Create a new column for percent unemployment#
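We'll use the table's total, ACS15_5yr_B23025001, as the baseline for employment. Table B23025 covers the population 16 years and over, so this is the unemployed share of that whole group, not the conventional unemployment rate, which divides by the labor force.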
employment['pct_unemployment'] = employment['ACS15_5yr_B23025005'] / employment['ACS15_5yr_B23025001'] * 100
employment.head()
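To see how much the choice of baseline matters, here is a minimal sketch that computes the percentage both ways. The counts are made up for two hypothetical tracts, and reading ACS15_5yr_B23025003 as the civilian labor force is an assumption based on the standard B23025 layout, so check it against R12221544.txt before relying on it.

```python
import pandas as pd

# Made-up counts for two hypothetical tracts (illustration only, not real ACS data)
toy_employment = pd.DataFrame({
    "Geo_FIPS": [1, 2],
    "ACS15_5yr_B23025001": [1000, 2000],  # table total (the baseline used above)
    "ACS15_5yr_B23025003": [600, 1500],   # assumed: civilian labor force
    "ACS15_5yr_B23025005": [60, 75],      # unemployed
})

# Baseline used in this notebook: the table total
toy_employment["pct_unemployment"] = (
    toy_employment["ACS15_5yr_B23025005"] / toy_employment["ACS15_5yr_B23025001"] * 100
)

# Alternative baseline: the civilian labor force
toy_employment["pct_unemployment_labor_force"] = (
    toy_employment["ACS15_5yr_B23025005"] / toy_employment["ACS15_5yr_B23025003"] * 100
)

print(toy_employment[["Geo_FIPS", "pct_unemployment", "pct_unemployment_labor_force"]])
```

Either baseline can work in a regression, but the coefficient has to be described in terms of whichever one was used.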
Read in R12221550_SL140.csv#
It's also from the Census, and has many, many, many more columns with impossible names.
census = pd.read_csv("data/R12221550_SL140.csv", encoding='latin-1')
census.head()
Feature engineering#
Instead of raw population numbers, we're curious about percentages. What percent of people are certain races? What percent of people have not finished high school?
We're also rescaling median income to tens of thousands of dollars, so its coefficient in the final regression reads as the change per $10,000 rather than per $1.
census_features = pd.DataFrame({
    'Geo_FIPS': census.Geo_FIPS,
    'pct_black': census.ACS15_5yr_B03002004 / census.ACS15_5yr_B03002001 * 100,
    'pct_white': census.ACS15_5yr_B03002003 / census.ACS15_5yr_B03002001 * 100,
    'pct_hispanic': census.ACS15_5yr_B03002012 / census.ACS15_5yr_B03002001 * 100,
    'pct_less_than_hs': census.ACS15_5yr_B06009002 / census.ACS15_5yr_B06009001 * 100,
    'pct_1_15_poverty': (census.ACS15_5yr_C17002004 + census.ACS15_5yr_C17002005) / census.ACS15_5yr_C17002001 * 100,
    'income_10k': census.ACS15_5yr_B19013001 / 10000,
})
census_features.head()
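Tracts with a zero denominator deserve a second look when building percentages. In pandas, 0 / 0 gives NaN but a nonzero count divided by 0 gives inf, and inf is not removed by dropna(). The helper below is a hypothetical convenience, not part of the original analysis, and the toy counts are invented. It is a sketch of one way to make every zero-denominator tract come out as NaN.

```python
import pandas as pd

def pct(numerator, denominator):
    """Percentage (0-100) of numerator over denominator; NaN where the denominator is 0."""
    # .where() keeps positive denominators and replaces the rest with NaN
    safe_denominator = denominator.where(denominator > 0)
    return numerator / safe_denominator * 100

# Invented counts: the second tract has no population in this table
toy_census = pd.DataFrame({
    "Geo_FIPS": [1, 2],
    "ACS15_5yr_B06009001": [400, 0],
    "ACS15_5yr_B06009002": [100, 0],
})

toy_census["pct_less_than_hs"] = pct(
    toy_census["ACS15_5yr_B06009002"], toy_census["ACS15_5yr_B06009001"]
)
print(toy_census)
```

Rows that end up as NaN here are the ones that later disappear when missing data is dropped.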
Merging the data#
Merge the dataframes together based on their census tract.
merged = life_expec.merge(employment, left_on='tract_id', right_on='Geo_FIPS')
merged = merged.merge(census_features, on='Geo_FIPS')
merged.head()
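By default, merge performs an inner join, so tracts that appear in only one file are dropped without warning. As a rough check, an outer merge with indicator=True shows how many rows matched, and validate="one_to_one" raises an error if a tract ID is duplicated on either side. The frames below are tiny invented stand-ins for the real data.

```python
import pandas as pd

# Invented stand-ins for life_expec and employment
toy_life = pd.DataFrame({
    "tract_id": [1001, 1002, 1003],
    "life_expectancy": [78.1, 80.4, 75.9],
})
toy_emp = pd.DataFrame({
    "Geo_FIPS": [1002, 1003, 1004],
    "pct_unemployment": [4.0, 9.5, 6.2],
})

# Outer merge keeps unmatched rows; the _merge column says where each row came from
check = toy_life.merge(
    toy_emp,
    left_on="tract_id",
    right_on="Geo_FIPS",
    how="outer",
    indicator=True,
    validate="one_to_one",
)

match_counts = check["_merge"].value_counts().to_dict()
print(match_counts)
```

A large left_only or right_only count is often a sign that the ID columns were read with different types or formats.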
Select our feature columns and remove missing data#
We're only interested in a few columns, so we'll keep those and discard the rest. Note that we're including our features as well as our target column, life_expectancy.
features = merged[['pct_black', 'pct_white', 'pct_hispanic', 'pct_less_than_hs', 'pct_1_15_poverty',
                   'income_10k', 'pct_unemployment', 'life_expectancy']].copy()
features.head()
Check how many rows we have, then how many we have after removing missing data.
features.shape
features = features.dropna()
features.shape
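Comparing the two shapes shows how many rows were lost, but not which columns caused it. A small sketch of a per-column count, using invented values in place of the real features dataframe:

```python
import pandas as pd

# Invented values standing in for the real features dataframe
toy_features = pd.DataFrame({
    "pct_unemployment": [5.0, None, 7.5, 3.2],
    "income_10k": [4.1, 6.0, None, 5.5],
    "life_expectancy": [79.0, 81.2, 76.4, None],
})

# Missing values per column, then the row counts before and after dropping
na_counts = toy_features.isna().sum()
rows_before = len(toy_features)
rows_after = len(toy_features.dropna())

print(na_counts)
print(rows_before, "rows before,", rows_after, "rows after")
```

If one column accounts for most of the missing rows, it is worth asking whether those tracts differ systematically from the ones that remain.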
Running the regression#
Using the statsmodels package, we'll run a linear regression to find the coefficients relating life expectancy to each of our feature columns from above. We're doing this with the dataframe method, as opposed to the formula method, which is covered in another notebook.
import statsmodels.api as sm
X = features.drop('life_expectancy', axis=1)
y = features.life_expectancy
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
results.summary()
Translate that into the form "every 1 percentage point change in unemployment translates to a Y change in life expectancy."
# Every 1 percentage point change in unemployment translates to a -0.15 year change in life expectancy
Translate some of your coefficients into the form "every X percentage point change in unemployment translates to a Y change in life expectancy." Do this with numbers that are meaningful, and in a way that is easily understandable to your reader.
# A 1 percentage point increase in unemployment translates to a 0.15 year decrease in life expectancy
# A 10 percentage point increase in unemployment translates to a 1.5 year decrease in life expectancy
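Because the model is linear, scaling the change in unemployment scales the predicted change in life expectancy by the same factor. The helper below is hypothetical, written only to show the arithmetic behind the sentences above, using the -0.15 coefficient quoted there.

```python
def describe_effect(coef_per_point, change_in_points):
    """Turn a per-percentage-point coefficient into a reader-friendly sentence."""
    years = coef_per_point * change_in_points
    direction = "decrease" if years < 0 else "increase"
    return (
        f"A {change_in_points} percentage point increase in unemployment "
        f"translates to a {abs(years):.2f} year {direction} in life expectancy"
    )

unemployment_coef = -0.15  # years of life expectancy per percentage point, from the text above

sentences = [describe_effect(unemployment_coef, change) for change in (1, 5, 10)]
for sentence in sentences:
    print(sentence)
```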
Do your numbers seem off? Things too big, or too small? Make sure your percentages are percentage points between 0 and 100, not fractions between 0 and 1.
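To see why the scale matters, the sketch below fits the same straight line twice, once with unemployment in percentage points and once as a fraction. It uses numpy.polyfit on four invented points that follow an exact line, rather than statsmodels, simply to keep the example small.

```python
import numpy as np

# Four invented points on an exact line with a slope of -0.15 years per percentage point
unemployment_points = np.array([0.0, 10.0, 20.0, 30.0])
life_expectancy = 80 - 0.15 * unemployment_points

# Same data, with unemployment expressed as a fraction between 0 and 1
unemployment_fraction = unemployment_points / 100

# polyfit with degree 1 returns [slope, intercept]
slope_points, _ = np.polyfit(unemployment_points, life_expectancy, 1)
slope_fraction, _ = np.polyfit(unemployment_fraction, life_expectancy, 1)

print(slope_points)    # about -0.15: change per percentage point
print(slope_fraction)  # about -15: change per 100 percentage points
```

A coefficient 100 times larger than expected is a strong hint that a column is stored as a fraction.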