The Associated Press and Life Expectancy#

Story: AP analysis: Unemployment, income affect life expectancy

Author: Nicky Forster, Associated Press

Topics: Census Data, Linear Regression

Datasets

R12221544_SL140.csv: ACS 2015 5-year, tract level, from Social Explorer
- Table B23025: Employment Status
- R12221544.txt is the data dictionary
R12221550_SL140.csv: ACS 2015 5-year, tract level, from Social Explorer
- Table B23025: Employment Status
- Table B06009: Educational Attainment
- Table B03002: Race
- Table B19013: Median income
- Table C17002: Ratio of income to poverty level
- R12221550.txt is the data dictionary
US_A.CSV: life expectancy by census tract, from USALEEP
- Record_Layout_CensusTract_Life_Expectancy.pdf is the data dictionary

What's the story?#

We're trying to figure out how the life expectancy in a census tract is related to other factors like unemployment, income, and others.


Imports#

import pandas as pd

pd.set_option("display.max_columns", 100)

Reading in our data#

Read in `US_A.CSV`#

We're going to rename a few columns so they make a little more sense.

life_expec = pd.read_csv("data/US_A.CSV")
life_expec.columns = ['tract_id', 'STATE2KX','CNTY2KX', 'TRACT2KX', 'life_expectancy', 
                      'life_expectancy_std_err', 'flag']
life_expec.head()

	tract_id	STATE2KX	CNTY2KX	TRACT2KX	life_expectancy	life_expectancy_std_err	flag
0	1001020100	1	1	20100	73.1	2.2348	3
1	1001020200	1	1	20200	76.9	3.3453	3
2	1001020400	1	1	20400	75.4	1.0216	3
3	1001020500	1	1	20500	79.4	1.1768	1
4	1001020600	1	1	20600	73.1	1.5519	3
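Assigning to `.columns` renames by position, so it only works if the file's column order is what we expect. As a minimal sketch of an alternative, we can hand the new names to `read_csv` directly. The CSV text below is a toy stand-in built from the two rows shown above, and its header names (`col_a` and so on) are placeholders, not the real file's headers.

```python
import io
import pandas as pd

# Toy stand-in for US_A.CSV. The header row uses placeholder names,
# NOT the real file's column names.
raw = io.StringIO(
    "col_a,col_b,col_c,col_d,col_e,col_f,col_g\n"
    "1001020100,1,1,20100,73.1,2.2348,3\n"
    "1001020200,1,1,20200,76.9,3.3453,3\n"
)

new_names = ['tract_id', 'STATE2KX', 'CNTY2KX', 'TRACT2KX',
             'life_expectancy', 'life_expectancy_std_err', 'flag']

# header=0 tells pandas the first row is a header to discard,
# and names= supplies the replacements in the same order.
toy = pd.read_csv(raw, header=0, names=new_names)

# Cheap sanity check: we got exactly the columns we asked for.
assert list(toy.columns) == new_names
```

Either way the rename is positional, so it is worth a quick look at `head()` to confirm the values under each name make sense.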

Open `R12221544_SL140.csv`#

We'll keep the original names here - we'll just need to keep an eye on the codebook later.

columns = ['Geo_FIPS', 'ACS15_5yr_B23025001', 'ACS15_5yr_B23025002',
            'ACS15_5yr_B23025003', 'ACS15_5yr_B23025004', 'ACS15_5yr_B23025005', 
            'ACS15_5yr_B23025006', 'ACS15_5yr_B23025007']
employment = pd.read_csv("data/R12221544_SL140.csv", usecols=columns, encoding='latin-1')
employment.head()

	Geo_FIPS	ACS15_5yr_B23025001	ACS15_5yr_B23025002	ACS15_5yr_B23025003	ACS15_5yr_B23025004	ACS15_5yr_B23025005	ACS15_5yr_B23025006	ACS15_5yr_B23025007
0	1001020100	1554	997	997	943	54	0	557
1	1001020200	1731	884	869	753	116	15	847
2	1001020300	2462	1472	1464	1373	91	8	990
3	1001020400	3424	2013	1998	1782	216	15	1411
4	1001020500	8198	5461	5258	5037	221	203	2737

Create a new column for percent unemployment#

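We'll use the table's total (`ACS15_5yr_B23025001`) as the baseline for the unemployment percentage. In ACS table B23025 that total is the population aged 16 and over, not every resident of the tract, and it includes people who are not in the labor force.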

employment['pct_unemployment'] = employment['ACS15_5yr_B23025005'] / employment['ACS15_5yr_B23025001'] * 100
employment.head()

	Geo_FIPS	ACS15_5yr_B23025001	ACS15_5yr_B23025002	ACS15_5yr_B23025003	ACS15_5yr_B23025004	ACS15_5yr_B23025005	ACS15_5yr_B23025006	ACS15_5yr_B23025007	pct_unemployment
0	1001020100	1554	997	997	943	54	0	557	3.474903
1	1001020200	1731	884	869	753	116	15	847	6.701329
2	1001020300	2462	1472	1464	1373	91	8	990	3.696182
3	1001020400	3424	2013	1998	1782	216	15	1411	6.308411
4	1001020500	8198	5461	5258	5037	221	203	2737	2.695779
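The choice of denominator matters. As a rough sketch, here is the first tract computed two ways: against the table total, as in the notebook, and against the civilian labor force, which is how an unemployment rate is usually defined. This assumes `ACS15_5yr_B23025003` is the civilian labor force and `ACS15_5yr_B23025005` is the unemployed count; check the data dictionary before relying on that. The second column name, `pct_unemployment_labor_force`, is made up for this example.

```python
import pandas as pd

# First row of the employment table shown above.
toy = pd.DataFrame({
    'Geo_FIPS': [1001020100],
    'ACS15_5yr_B23025001': [1554],  # table total
    'ACS15_5yr_B23025003': [997],   # assumed: civilian labor force
    'ACS15_5yr_B23025005': [54],    # assumed: unemployed
})

# The notebook's version: unemployed as a share of the table total.
toy['pct_unemployment'] = (
    toy['ACS15_5yr_B23025005'] / toy['ACS15_5yr_B23025001'] * 100
)

# Hypothetical alternative: unemployed as a share of the civilian labor force.
toy['pct_unemployment_labor_force'] = (
    toy['ACS15_5yr_B23025005'] / toy['ACS15_5yr_B23025003'] * 100
)

print(toy[['pct_unemployment', 'pct_unemployment_labor_force']])
```

The two numbers differ noticeably for the same tract, so whichever baseline we pick should be stated plainly when we describe the result.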

Read in `R12221550_SL140.csv`#

It's also ACS data from the Census, but with many, many more columns (189 in all), most of them named with opaque table codes.

census = pd.read_csv("data/R12221550_SL140.csv", encoding='latin-1')
census.head()

	Geo_FIPS	Geo_GEOID	Geo_NAME	Geo_QName	Geo_STUSAB	Geo_SUMLEV	Geo_FILEID	Geo_LOGRECNO	Geo_US	Geo_REGION	Geo_DIVISION	Geo_STATECE	Geo_STATE	Geo_COUNTY	Geo_COUSUB	Geo_PLACE	Geo_PLACESE	Geo_TRACT	Geo_BLKGRP	Geo_CONCIT	Geo_AIANHH	Geo_AIANHHFP	Geo_AIHHTLI	Geo_AITSCE	Geo_AITS	Geo_ANRC	Geo_CBSA	Geo_CSA	Geo_METDIV	Geo_MACC	Geo_MEMI	Geo_NECTA	Geo_CNECTA	Geo_NECTADIV	Geo_UA	Geo_UACP	Geo_CDCURR	Geo_SLDU	Geo_SLDL	Geo_VTD	Geo_ZCTA3	Geo_ZCTA5	Geo_SUBMCD	Geo_SDELM	Geo_SDSEC	Geo_SDUNI	Geo_UR	Geo_PCI	Geo_TAZ	...	ACS15_5yr_B06009013s	ACS15_5yr_B06009014s	ACS15_5yr_B06009015s	ACS15_5yr_B06009016s	ACS15_5yr_B06009017s	ACS15_5yr_B06009018s	ACS15_5yr_B06009019s	ACS15_5yr_B06009020s	ACS15_5yr_B06009021s	ACS15_5yr_B06009022s	ACS15_5yr_B06009023s	ACS15_5yr_B06009024s	ACS15_5yr_B06009025s	ACS15_5yr_B06009026s	ACS15_5yr_B06009027s	ACS15_5yr_B06009028s	ACS15_5yr_B06009029s	ACS15_5yr_B06009030s	ACS15_5yr_C17002001	ACS15_5yr_C17002002	ACS15_5yr_C17002003	ACS15_5yr_C17002004	ACS15_5yr_C17002005	ACS15_5yr_C17002006	ACS15_5yr_C17002007	ACS15_5yr_C17002008	ACS15_5yr_C17002001s	ACS15_5yr_C17002002s	ACS15_5yr_C17002003s	ACS15_5yr_C17002004s	ACS15_5yr_C17002005s	ACS15_5yr_C17002006s	ACS15_5yr_C17002007s	ACS15_5yr_C17002008s	ACS15_5yr_B19013001	ACS15_5yr_B19013001s	ACS15_5yr_B23025001	ACS15_5yr_B23025002	ACS15_5yr_B23025003	ACS15_5yr_B23025004	ACS15_5yr_B23025005	ACS15_5yr_B23025006	ACS15_5yr_B23025007	ACS15_5yr_B23025001s	ACS15_5yr_B23025002s	ACS15_5yr_B23025003s	ACS15_5yr_B23025004s	ACS15_5yr_B23025005s	ACS15_5yr_B23025006s	ACS15_5yr_B23025007s
0	1001020100	14000US01001020100	Census Tract 201, Autauga County, Alabama	Census Tract 201, Autauga County, Alabama	al	140	ACSSF	1760	NaN	NaN	NaN	NaN	1	1	NaN	NaN	NaN	20100	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	67.878788	24.848485	51.515152	17.575758	23.636364	24.242424	18.787879	5.454545	6.666667	17.575758	6.666667	6.666667	18.181818	12.727273	4.242424	6.666667	6.666667	11.515152	1948	26	132	81	101	125	16	1467	123.030303	18.787879	60.606061	40.606061	58.181818	60.000000	10.909091	127.272727	61838.0	7212.121212	1554	997	997	943	54	0	557	92.121212	85.454545	85.454545	83.636364	18.787879	6.666667	67.878788
1	1001020200	14000US01001020200	Census Tract 202, Autauga County, Alabama	Census Tract 202, Autauga County, Alabama	al	140	ACSSF	1761	NaN	NaN	NaN	NaN	1	1	NaN	NaN	NaN	20200	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	33.939394	22.424242	26.060606	20.000000	13.939394	9.696970	9.090909	6.666667	6.666667	9.090909	6.666667	6.666667	28.484848	18.181818	20.000000	6.666667	4.848485	6.666667	1983	185	320	232	58	34	25	1129	155.151515	110.909091	74.545455	88.484848	25.454545	18.181818	16.969697	144.848485	32303.0	8204.848485	1731	884	869	753	116	15	847	143.030303	115.151515	114.545455	107.272727	38.181818	14.545455	86.666667
2	1001020300	14000US01001020300	Census Tract 203, Autauga County, Alabama	Census Tract 203, Autauga County, Alabama	al	140	ACSSF	1762	NaN	NaN	NaN	NaN	1	1	NaN	NaN	NaN	20300	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	112.121212	34.545455	59.393939	64.848485	33.939394	28.484848	20.000000	6.666667	6.666667	8.484848	17.575758	6.666667	26.666667	16.969697	13.939394	6.666667	6.666667	6.666667	2968	164	213	148	207	82	520	1634	244.848485	138.181818	70.303030	60.606061	78.181818	39.393939	189.090909	175.151515	44922.0	3411.515152	2462	1472	1464	1373	91	8	990	169.090909	132.121212	134.545455	123.030303	31.515152	8.484848	120.606061
3	1001020400	14000US01001020400	Census Tract 204, Autauga County, Alabama	Census Tract 204, Autauga County, Alabama	al	140	ACSSF	1763	NaN	NaN	NaN	NaN	1	1	NaN	NaN	NaN	20400	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	107.272727	36.969697	89.090909	56.969697	38.787879	37.575758	17.575758	6.666667	6.060606	16.363636	6.666667	6.666667	58.181818	60.000000	17.575758	12.121212	6.666667	12.727273	4423	18	74	141	182	583	201	3224	298.787879	17.575758	41.818182	53.333333	58.181818	188.484848	140.000000	331.515152	54329.0	4244.242424	3424	2013	1998	1782	216	15	1411	197.575758	157.575758	161.818182	132.121212	58.787879	14.545455	127.878788
4	1001020500	14000US01001020500	Census Tract 205, Autauga County, Alabama	Census Tract 205, Autauga County, Alabama	al	140	ACSSF	1764	NaN	NaN	NaN	NaN	1	1	NaN	NaN	NaN	20500	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	283.636364	82.424242	120.606061	210.303030	152.121212	157.575758	58.787879	10.909091	14.545455	38.181818	26.666667	23.636364	93.939394	27.272727	69.696970	10.909091	29.090909	10.909091	10563	251	952	256	1064	289	89	7662	369.696970	94.545455	521.212121	113.333333	385.454545	162.424242	52.121212	641.818182	51965.0	4203.030303	8198	5461	5258	5037	221	203	2737	321.818182	339.393939	356.969697	369.090909	89.090909	103.030303	273.939394

5 rows × 189 columns

Feature engineering#

Instead of raw population numbers, we're curious about percentages. What percent of people are certain races? What percent of people have not finished high school?

We're also rescaling median income into units of $10,000. That makes the final regression output easier to read: the income coefficient becomes the change in life expectancy per $10,000 of median income instead of per single dollar.

census_features = pd.DataFrame({
    'Geo_FIPS': census.Geo_FIPS,
    'pct_black': census.ACS15_5yr_B03002004 / census.ACS15_5yr_B03002001 * 100,
    'pct_white': census.ACS15_5yr_B03002003 / census.ACS15_5yr_B03002001 * 100,
    'pct_hispanic': census.ACS15_5yr_B03002012 / census.ACS15_5yr_B03002001 * 100,
    'pct_less_than_hs': census.ACS15_5yr_B06009002 / census.ACS15_5yr_B06009001 * 100,
    'pct_1_15_poverty': (census.ACS15_5yr_C17002004 + census.ACS15_5yr_C17002005) / census.ACS15_5yr_C17002001 * 100,
    'income_10k': census.ACS15_5yr_B19013001 / 10000,
})
census_features.head()

	Geo_FIPS	pct_black	pct_white	pct_hispanic	pct_less_than_hs	pct_1_15_poverty	income_10k
0	1001020100	7.700205	87.422998	0.872690	14.802896	9.342916	6.1838
1	1001020200	53.293135	40.445269	0.788497	25.483178	14.624307	3.2303
2	1001020300	18.564690	74.528302	0.000000	10.655738	11.960916	4.4922
3	1001020400	3.662672	82.794483	10.490617	11.693687	7.302736	5.4329
4	1001020500	24.844374	68.456750	0.743287	4.445082	12.496450	5.1965
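To make the arithmetic concrete, here is a small check of two features for the first tract, using the raw values visible in the wide table above (`C17002001` = 1948, `C17002004` = 81, `C17002005` = 101, `B19013001` = 61838). The variable names are just for this sketch.

```python
# Raw values for tract 1001020100, copied from the census table above.
poverty_total = 1948      # ACS15_5yr_C17002001
poverty_band_4 = 81       # ACS15_5yr_C17002004
poverty_band_5 = 101      # ACS15_5yr_C17002005
median_income = 61838.0   # ACS15_5yr_B19013001

# Same formulas as the census_features cell, applied to one row.
pct_1_15_poverty = (poverty_band_4 + poverty_band_5) / poverty_total * 100
income_10k = median_income / 10000

print(round(pct_1_15_poverty, 6), income_10k)
```

Both values agree with the first row of `census_features`, which is a quick way to confirm we picked the columns we meant to.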

Merging the data#

Merge the dataframes together based on their census tract.

merged = life_expec.merge(employment, left_on='tract_id', right_on='Geo_FIPS')
merged = merged.merge(census_features, left_on='Geo_FIPS', right_on='Geo_FIPS')
merged.head()

	tract_id	STATE2KX	CNTY2KX	TRACT2KX	life_expectancy	life_expectancy_std_err	flag	Geo_FIPS	ACS15_5yr_B23025001	ACS15_5yr_B23025002	ACS15_5yr_B23025003	ACS15_5yr_B23025004	ACS15_5yr_B23025005	ACS15_5yr_B23025006	ACS15_5yr_B23025007	pct_unemployment	pct_black	pct_white	pct_hispanic	pct_less_than_hs	pct_1_15_poverty	income_10k
0	1001020100	1	1	20100	73.1	2.2348	3	1001020100	1554	997	997	943	54	0	557	3.474903	7.700205	87.422998	0.872690	14.802896	9.342916	6.1838
1	1001020200	1	1	20200	76.9	3.3453	3	1001020200	1731	884	869	753	116	15	847	6.701329	53.293135	40.445269	0.788497	25.483178	14.624307	3.2303
2	1001020400	1	1	20400	75.4	1.0216	3	1001020400	3424	2013	1998	1782	216	15	1411	6.308411	3.662672	82.794483	10.490617	11.693687	7.302736	5.4329
3	1001020500	1	1	20500	79.4	1.1768	1	1001020500	8198	5461	5258	5037	221	203	2737	2.695779	24.844374	68.456750	0.743287	4.445082	12.496450	5.1965
4	1001020600	1	1	20600	73.1	1.5519	3	1001020600	2855	1802	1750	1560	190	52	1053	6.654991	11.918982	72.916126	13.061542	17.487267	10.854324	6.3092
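Notice that tract `1001020300` appears in the employment data but not in the merged result: `merge` does an inner join by default, so tracts missing from either side are dropped without any warning. A minimal sketch of how we might audit that, using toy frames built from the tract IDs shown above, is an outer merge with `indicator=True`.

```python
import pandas as pd

# Toy versions of the two tables, using tract IDs from the outputs above.
left = pd.DataFrame({
    'tract_id': [1001020100, 1001020200, 1001020400],
    'life_expectancy': [73.1, 76.9, 75.4],
})
right = pd.DataFrame({
    'Geo_FIPS': [1001020100, 1001020200, 1001020300, 1001020400],
    'pct_unemployment': [3.474903, 6.701329, 3.696182, 6.308411],
})

# Default (inner) merge: only tracts present in both tables survive.
inner = left.merge(right, left_on='tract_id', right_on='Geo_FIPS')

# Outer merge with an indicator column to see what the inner merge dropped.
audit = left.merge(right, left_on='tract_id', right_on='Geo_FIPS',
                   how='outer', indicator=True)
unmatched = audit[audit['_merge'] != 'both']

print(len(inner), "rows matched")
print(unmatched[['tract_id', 'Geo_FIPS', '_merge']])
```

Running the same audit on the real dataframes would show how many tracts lack a life expectancy estimate or lack census data.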

Select our feature columns and remove missing data#

We're only interested in a few columns, so we'll keep those and discard the rest. Note that we're including our features as well as our target column, life_expectancy.

features = merged[['pct_black', 'pct_white', 'pct_hispanic', 'pct_less_than_hs', 'pct_1_15_poverty',
                   'income_10k', 'pct_unemployment', 'life_expectancy']].copy()
features.head()

	pct_black	pct_white	pct_hispanic	pct_less_than_hs	pct_1_15_poverty	income_10k	pct_unemployment	life_expectancy
0	7.700205	87.422998	0.872690	14.802896	9.342916	6.1838	3.474903	73.1
1	53.293135	40.445269	0.788497	25.483178	14.624307	3.2303	6.701329	76.9
2	3.662672	82.794483	10.490617	11.693687	7.302736	5.4329	6.308411	75.4
3	24.844374	68.456750	0.743287	4.445082	12.496450	5.1965	2.695779	79.4
4	11.918982	72.916126	13.061542	17.487267	10.854324	6.3092	6.654991	73.1

Check how many rows we have, then how many we have after removing missing data.

features.shape

(65662, 8)

features = features.dropna()
features.shape

(65656, 8)
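We lost 6 rows (65662 down to 65656). Before dropping, it can be useful to see which columns the gaps are in. Here's a small sketch on made-up data, not the real dataframe, that counts missing values per column and the number of rows `dropna` would remove.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing income value, for illustration only.
toy = pd.DataFrame({
    'income_10k': [6.1838, np.nan, 4.4922],
    'pct_unemployment': [3.474903, 6.701329, 3.696182],
    'life_expectancy': [73.1, 76.9, 75.4],
})

# How many missing values does each column have?
missing_per_column = toy.isna().sum()

# How many rows disappear if we drop any row with a missing value?
rows_dropped = len(toy) - len(toy.dropna())

print(missing_per_column)
print(rows_dropped, "row(s) would be dropped")
```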

Running the regression#

Using the statsmodels package, we'll run a linear regression to find the coefficients relating life expectancy to each of our feature columns from above. We're doing this with the dataframe method, as opposed to the formula method, which is covered in another notebook. The dataframe method doesn't add an intercept on its own, which is why we wrap `X` in `sm.add_constant`.

import statsmodels.api as sm

X = features.drop('life_expectancy', axis=1)
y = features.life_expectancy

model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
results.summary()

OLS Regression Results
Dep. Variable:	life_expectancy	R-squared:	0.490
Model:	OLS	Adj. R-squared:	0.490
Method:	Least Squares	F-statistic:	8997.
Date:	Thu, 07 Nov 2019	Prob (F-statistic):	0.00
Time:	12:26:41	Log-Likelihood:	-1.6208e+05
No. Observations:	65656	AIC:	3.242e+05
Df Residuals:	65648	BIC:	3.243e+05
Df Model:	7
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	81.2365	0.122	665.628	0.000	80.997	81.476
pct_black	-0.0666	0.001	-56.960	0.000	-0.069	-0.064
pct_white	-0.0386	0.001	-36.707	0.000	-0.041	-0.037
pct_hispanic	0.0131	0.001	10.298	0.000	0.011	0.016
pct_less_than_hs	-0.0862	0.002	-48.979	0.000	-0.090	-0.083
pct_1_15_poverty	-0.0596	0.003	-21.738	0.000	-0.065	-0.054
income_10k	0.4825	0.006	83.217	0.000	0.471	0.494
pct_unemployment	-0.1490	0.004	-33.408	0.000	-0.158	-0.140

Omnibus:	2114.193	Durbin-Watson:	1.520
Prob(Omnibus):	0.000	Jarque-Bera (JB):	4788.035
Skew:	0.183	Prob(JB):	0.00
Kurtosis:	4.271	Cond. No.	790.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
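The `const` row in the output exists because of `sm.add_constant`, which adds a column of ones so the model can fit an intercept. To show the idea without statsmodels, here is a tiny sketch that uses plain NumPy least squares as a stand-in. The data is synthetic, generated from a known line, and is not related to the real tracts.

```python
import numpy as np

# Synthetic data from a known line: y = 80 - 0.15 * x (no noise).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = 80.0 - 0.15 * x

# Prepending a column of ones is the same idea as adding a constant:
# its coefficient becomes the intercept.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares via NumPy (a stand-in for sm.OLS here).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

print(intercept, slope)
```

With the column of ones, the fit recovers both the intercept and the slope of the line. Without it, the line would be forced through zero and the slope would be distorted.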

Translate that into the form "every 1 percentage point change in unemployment translates to a Y change in life expectancy"

# Every 1 percentage point increase in unemployment translates to a -0.15 year change in life expectancy

Translate some of your coefficients into the form "every X percentage point change in unemployment translates to a Y change in life expectancy." Do this with numbers that are meaningful, and in a way that is easily understandable to your reader.

# A 1 percentage point increase in unemployment translates to a 0.15 year decrease in life expectancy

# A 10 percentage point increase in unemployment translates to a 1.5 year decrease in life expectancy

Do your numbers seem off? Things too big, or too small? Make sure your percentages are percentage points between 0 and 100, not fractions between 0 and 1.
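The translations above are just the coefficient multiplied by the size of the change. A short sketch, using the `pct_unemployment` and `income_10k` coefficients from the regression table, also shows what happens to a coefficient if a percentage is stored as a 0 to 1 fraction by mistake. The helper name `effect` is made up for this example.

```python
# Coefficients copied from the regression summary above.
coef_unemployment = -0.1490   # years per percentage point of unemployment
coef_income_10k = 0.4825      # years per $10,000 of median income

def effect(coef, change):
    """Predicted change in life expectancy (years) for a given change in a feature."""
    return coef * change

ten_point_unemployment = effect(coef_unemployment, 10)  # 10 percentage points
income_up_20k = effect(coef_income_10k, 2)              # $20,000 = 2 units of $10k

# If unemployment had been stored as a fraction (0 to 1) instead of 0 to 100,
# one "unit" would be the whole range, so the coefficient would be 100x larger.
coef_if_fraction = coef_unemployment * 100

print(round(ten_point_unemployment, 2), round(income_up_20k, 3), round(coef_if_fraction, 1))
```

A coefficient around -15 for unemployment would be a strong hint that the column is on the 0 to 1 scale rather than in percentage points.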
