Linear regression with the Associated Press#

In this piece from the Associated Press, Nicky Forster combines data from the US Census Bureau and the CDC to see how life expectancy is related to factors like unemployment, income, and others. We'll be looking at how they were able to write sentences like this one:

"An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy, the AP found."

import pandas as pd
import numpy as np

# Keep decimal numbers to 4 decimal places
pd.set_option("display.float_format", '{:,.4f}'.format)
pd.set_option("display.max_columns", 100)

Reading in our data#

We'll start by reading in our datasets (as we must!).

Life expectancy data#

The first dataset is USALEEP, the U.S. Small-area Life Expectancy Estimates Project. It's from the National Center for Health Statistics at the CDC.

We're going to rename a few columns so they make a little more sense. The columns themselves aren't too complicated, mostly pointing out the census tracts and the life expectancy.

life_expec = pd.read_csv("data/US_A.CSV")
life_expec.columns = ['tract_id', 'STATE2KX','CNTY2KX', 'TRACT2KX', 'life_expectancy', 
                      'life_expectancy_std_err', 'flag']
life_expec.head()
tract_id STATE2KX CNTY2KX TRACT2KX life_expectancy life_expectancy_std_err flag
0 1001020100 1 1 20100 73.1000 2.2348 3
1 1001020200 1 1 20200 76.9000 3.3453 3
2 1001020400 1 1 20400 75.4000 1.0216 3
3 1001020500 1 1 20500 79.4000 1.1768 1
4 1001020600 1 1 20600 73.1000 1.5519 3

flag sounds a little mysterious and scary, but it's just whether the numbers were observed or predicted.
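
If you want to check for yourself, a quick value_counts will show how the tracts break down by flag code (this just counts the codes; see the USALEEP documentation for exactly what each one means):

# How many tracts fall into each flag category?
life_expec.flag.value_counts()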

Unemployment Data#

For unemployment, we'll be using data from the Census Bureau. You can get unemployment measures from a lot of places in the US government, so the Census Bureau actually publishes guidance on which one you should use. In our case we're going to use unemployment data from the American Community Survey, for the 5-year period nearest to our life expectancy dataset.

We'll be using column B23025005 from Table B23025, Employment Status for the Population 16 Years and Over. Take a look at the codebook for more details. We're only reading in a few columns to keep things looking clean.

columns = {
    'Geo_FIPS': 'Geo_FIPS',
    'ACS15_5yr_B23025001': 'total_pop',
    'ACS15_5yr_B23025005': 'unemployed'
}
employment = pd.read_csv("data/R12221544_SL140.csv", usecols=columns.keys(), encoding='latin-1')

employment = employment.rename(columns=columns)
employment.head()
Geo_FIPS total_pop unemployed
0 1001020100 1554 54
1 1001020200 1731 116
2 1001020300 2462 91
3 1001020400 3424 216
4 1001020500 8198 221

Since the important thing is percent unemployed, not just how many people are in an area, we'll need to do a little calculation.

Note that I'm multiplying by 100 here - if you have a little extra time, try running this notebook on your own with a 0-1.0 percentage instead of the 0-100 version.

employment['pct_unemployment'] = employment['unemployed'] / employment['total_pop'] * 100
employment.head()
Geo_FIPS total_pop unemployed pct_unemployment
0 1001020100 1554 54 3.4749
1 1001020200 1731 116 6.7013
2 1001020300 2462 91 3.6962
3 1001020400 3424 216 6.3084
4 1001020500 8198 221 2.6958
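
Following up on that note about multiplying by 100: if you do try the 0-1.0 version later, the only change is dropping the * 100 - something like this (a sketch; pick one version or the other, not both):

# 0-1.0 alternative: the same calculation without the * 100
employment['pct_unemployment'] = employment['unemployed'] / employment['total_pop']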

Demographic data#

The other dataset we're using is also from the Census Bureau's American Community Survey. It's a collection of many more columns with very, very long names and very, very mysterious codes. The tables include:

  • B03002: Hispanic or Latino Origin by Race
  • B06009: Place of Birth by Educational Attainment in the United States
  • C17002: Ratio of Income to Poverty Level
  • B19013: Median Household Income

Again, we're only picking a few columns to read in. Generally they're the raw count of certain populations, as well as the total population counted in each table.

columns = {
    'Geo_FIPS': 'Geo_FIPS',
    'ACS15_5yr_B03002001': 'race_table_total',
    'ACS15_5yr_B03002004': 'black',
    'ACS15_5yr_B03002003': 'white',
    'ACS15_5yr_B03002012': 'hispanic',
    'ACS15_5yr_B06009001': 'edu_table_total',
    'ACS15_5yr_B06009002': 'less_than_hs',
    'ACS15_5yr_C17002004': 'poverty_100_124',
    'ACS15_5yr_C17002005': 'poverty_125_149',
    'ACS15_5yr_C17002001': 'poverty_table_total',
    'ACS15_5yr_B19013001': 'income'
}        
census = pd.read_csv("data/R12221550_SL140.csv", usecols=columns.keys(), encoding='latin-1')
census = census.rename(columns=columns)
census.head()
Geo_FIPS race_table_total white black hispanic edu_table_total less_than_hs poverty_table_total poverty_100_124 poverty_125_149 income
0 1001020100 1948 1703 150 17 1,243.0000 184.0000 1948 81 101 61,838.0000
1 1001020200 2156 872 1149 17 1,397.0000 356.0000 1983 232 58 32,303.0000
2 1001020300 2968 2212 551 0 2,074.0000 221.0000 2968 148 207 44,922.0000
3 1001020400 4423 3662 162 464 2,899.0000 339.0000 4423 141 182 54,329.0000
4 1001020500 10763 7368 2674 80 6,974.0000 310.0000 10563 256 1064 51,965.0000

We'll again convert these raw population numbers into percentages - what percent of people are certain races? What percent of people have not finished high school?

census_features = pd.DataFrame({
    'Geo_FIPS': census.Geo_FIPS,
    'pct_black': census.black / census.race_table_total * 100,
    'pct_white': census.white / census.race_table_total * 100,
    'pct_hispanic': census.hispanic / census.race_table_total * 100,
    'pct_less_than_hs': census.less_than_hs / census.edu_table_total * 100,
    'pct_under_150_poverty': (census.poverty_100_124 + census.poverty_125_149) / census.poverty_table_total * 100,
    'income': census.income,
})
census_features.head()
Geo_FIPS pct_black pct_white pct_hispanic pct_less_than_hs pct_under_150_poverty income
0 1001020100 7.7002 87.4230 0.8727 14.8029 9.3429 61,838.0000
1 1001020200 53.2931 40.4453 0.7885 25.4832 14.6243 32,303.0000
2 1001020300 18.5647 74.5283 0.0000 10.6557 11.9609 44,922.0000
3 1001020400 3.6627 82.7945 10.4906 11.6937 7.3027 54,329.0000
4 1001020500 24.8444 68.4567 0.7433 4.4451 12.4964 51,965.0000

Merge the data#

To perform our regression, we need all of our data in a single dataframe. Fortunately these dataframes all keep track of census tracts with FIPS codes, unique geographic identifiers that we can rely on to merge our datasets.

merged = life_expec.merge(employment, left_on='tract_id', right_on='Geo_FIPS')
merged = merged.merge(census_features, left_on='Geo_FIPS', right_on='Geo_FIPS')
merged.head()
tract_id STATE2KX CNTY2KX TRACT2KX life_expectancy life_expectancy_std_err flag Geo_FIPS total_pop unemployed pct_unemployment pct_black pct_white pct_hispanic pct_less_than_hs pct_under_150_poverty income
0 1001020100 1 1 20100 73.1000 2.2348 3 1001020100 1554 54 3.4749 7.7002 87.4230 0.8727 14.8029 9.3429 61,838.0000
1 1001020200 1 1 20200 76.9000 3.3453 3 1001020200 1731 116 6.7013 53.2931 40.4453 0.7885 25.4832 14.6243 32,303.0000
2 1001020400 1 1 20400 75.4000 1.0216 3 1001020400 3424 216 6.3084 3.6627 82.7945 10.4906 11.6937 7.3027 54,329.0000
3 1001020500 1 1 20500 79.4000 1.1768 1 1001020500 8198 221 2.6958 24.8444 68.4567 0.7433 4.4451 12.4964 51,965.0000
4 1001020600 1 1 20600 73.1000 1.5519 3 1001020600 2855 190 6.6550 11.9190 72.9161 13.0615 17.4873 10.8543 63,092.0000
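
Before we run anything, it's worth a quick sanity check that the merge didn't silently drop too many tracts - merge defaults to an inner join, so only FIPS codes present in both datasets survive:

# Compare row counts before and after merging
len(life_expec), len(employment), len(merged)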

Running the regression#

Using the statsmodels package, we'll run a linear regression to find the relationship between life expectancy and our calculated columns. Note that we're using the formula method of writing a regression instead of the dataframes method, as I find it both more readable and more usable.

Python note: I'm also using a multiline string which is started and ended using three quotation marks. Multi-line strings and formulas are a match made in heaven!
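
For comparison, the dataframes method would look roughly like this - just a sketch, since we'll stick with formulas below. You build X and y by hand, and you have to add the intercept yourself:

import statsmodels.api as sm

# The "dataframes" version, for comparison only
X = merged[['pct_black', 'pct_white', 'pct_hispanic', 'pct_less_than_hs',
            'pct_under_150_poverty', 'income', 'pct_unemployment']]
X = sm.add_constant(X)  # formulas add an intercept automatically; here we must
y = merged['life_expectancy']
sm.OLS(y, X).fit().summary()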

import statsmodels.formula.api as smf

model = smf.ols("""
    life_expectancy ~ pct_black + pct_white + pct_hispanic + pct_less_than_hs
        + pct_under_150_poverty + income + pct_unemployment
""", data=merged)

results = model.fit()
results.summary()
OLS Regression Results
Dep. Variable: life_expectancy R-squared: 0.490
Model: OLS Adj. R-squared: 0.490
Method: Least Squares F-statistic: 8997.
Date: Thu, 12 Dec 2019 Prob (F-statistic): 0.00
Time: 21:50:03 Log-Likelihood: -1.6208e+05
No. Observations: 65656 AIC: 3.242e+05
Df Residuals: 65648 BIC: 3.243e+05
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 81.2365 0.122 665.628 0.000 80.997 81.476
pct_black -0.0666 0.001 -56.960 0.000 -0.069 -0.064
pct_white -0.0386 0.001 -36.707 0.000 -0.041 -0.037
pct_hispanic 0.0131 0.001 10.298 0.000 0.011 0.016
pct_less_than_hs -0.0862 0.002 -48.979 0.000 -0.090 -0.083
pct_under_150_poverty -0.0596 0.003 -21.738 0.000 -0.065 -0.054
income 4.825e-05 5.8e-07 83.217 0.000 4.71e-05 4.94e-05
pct_unemployment -0.1490 0.004 -33.408 0.000 -0.158 -0.140
Omnibus: 2114.193 Durbin-Watson: 1.520
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4788.035
Skew: 0.183 Prob(JB): 0.00
Kurtosis: 4.271 Cond. No. 7.06e+05


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.06e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

It's a lot to take in, so let's zero in on the important parts:

The coefficient for pct_unemployment is -0.15, which means "for every percentage point increase in unemployment, life expectancy drops 0.15 years." While this might be correct, it unfortunately sounds very stupid: nobody thinks in terms of 0.15 years.

Income's coefficient is rather unfortunate, too. "For every extra dollar in median income, life expectancy goes up 0.00004825 years." No thanks! Individual dollars don't really make sense here.

Instead, it'd be nice to say something like "for every ten-point increase in unemployment" or "for every $10,000 increase in income." If we were using the dataframes version of regression, we'd create new columns, maybe something like this:

merged['income_10k'] = merged.income / 10000
merged['unemployment_pct_10pts'] = merged.pct_unemployment / 10

Since we're using the formulas method, though, we can do the division right in the regression!

model = smf.ols("""
    life_expectancy ~ 
        pct_black 
        + pct_white 
        + pct_hispanic 
        + pct_less_than_hs 
        + pct_under_150_poverty 
        + np.divide(income, 10000)
        + np.divide(pct_unemployment, 10)
""", data=merged)

results = model.fit()
results.summary()
OLS Regression Results
Dep. Variable: life_expectancy R-squared: 0.490
Model: OLS Adj. R-squared: 0.490
Method: Least Squares F-statistic: 8997.
Date: Thu, 12 Dec 2019 Prob (F-statistic): 0.00
Time: 21:50:50 Log-Likelihood: -1.6208e+05
No. Observations: 65656 AIC: 3.242e+05
Df Residuals: 65648 BIC: 3.243e+05
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 81.2365 0.122 665.628 0.000 80.997 81.476
pct_black -0.0666 0.001 -56.960 0.000 -0.069 -0.064
pct_white -0.0386 0.001 -36.707 0.000 -0.041 -0.037
pct_hispanic 0.0131 0.001 10.298 0.000 0.011 0.016
pct_less_than_hs -0.0862 0.002 -48.979 0.000 -0.090 -0.083
pct_under_150_poverty -0.0596 0.003 -21.738 0.000 -0.065 -0.054
np.divide(income, 10000) 0.4825 0.006 83.217 0.000 0.471 0.494
np.divide(pct_unemployment, 10) -1.4900 0.045 -33.408 0.000 -1.577 -1.403
Omnibus: 2114.193 Durbin-Watson: 1.520
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4788.035
Skew: 0.183 Prob(JB): 0.00
Kurtosis: 4.271 Cond. No. 792.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

That's much, much nicer! Unemployment now has a -1.49 coefficient, which (surprise!) leads us right back to the original article:

An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy, the AP found.

Additionally, the negative coefficient for pct_less_than_hs is the source of this line:

A neighborhood where more adults failed to graduate high school had shorter predicted longevity.
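
If you'd rather pull these numbers out of the results programmatically instead of squinting at the summary table, results.params holds the coefficients, keyed by term name:

# Coefficients as a pandas Series - the np.divide terms keep their formula names
results.params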

Even though we accomplished this pretty quickly in very few lines of code, the actual process at the Associated Press was somewhat more complicated. They didn't just throw a bunch of columns in a regression!

Stats-wise, the AP did things like examine other columns that didn't meet a p-value threshold, and reason about collinearity between different variables. On the domain-expertise side, they included a poverty threshold that's actually above the poverty line, because of the way Medicaid provides health insurance to people in low-income households.
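
If you want a first pass at the collinearity question yourself, a correlation matrix of the features is a quick (if rough) place to start:

# How strongly are our predictors related to each other?
features = ['pct_black', 'pct_white', 'pct_hispanic', 'pct_less_than_hs',
            'pct_under_150_poverty', 'income', 'pct_unemployment']
merged[features].corr()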

Review#

In this exercise, we reproduced an article from the Associated Press that analyzed life expectancy and demographic information from the Census. They used a linear regression to find the relationship between how long people live and census tract qualities like unemployment, education, race, and income.

After running the regression once, we ran it a second time to get numbers that were more human and easier to use in a story, like a "1.5 year decrease in life expectancy" as opposed to a 0.15-year or 8-week decrease. We also used the formula version of a statsmodels linear regression to perform those calculations in the regression with np.divide.
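
One more formula trick worth knowing: the formula language (patsy) also lets you wrap arithmetic in I(), which does the same job as np.divide - a sketch with just two predictors to keep it short:

# I() means "evaluate this Python expression as-is" inside a formula
smf.ols("life_expectancy ~ I(income / 10000) + I(pct_unemployment / 10)",
        data=merged).fit().params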

Discussion topics#

Let's say we have the sentence "a 1 percentage point increase in unemployment translates to a 0.15 year decrease in life expectancy." How do we change this into something people can appreciate?

Translate some of your coefficients into the form "every X percentage point change in unemployment translates to a Y change in life expectancy." Do this with numbers that are meaningful, and in a way that is easily understandable to your reader.

What is the difference between using percentage points (0 to 100) vs fractions (0 to 1) when doing a regression analysis?

Income, education, and race are all related, which can cause problems related to multicollinearity in our analysis. How can we avoid this?

We should consult someone who "knows something" before publishing this story. Where would we go to find someone who understands statistics and the subject matter?