Linear regression with the Associated Press#
In this piece from the Associated Press, Nicky Forster combines from the US Census Bureau and the CDC to see how life expectancy is related to actors like unemployment, income, and others. We'll be looking at how they can write sentences like this one:
"An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy, the AP found."
import pandas as pd
import numpy as np
# Keep decimal numbers to 4 decimal places
pd.set_option("display.float_format", '{:,.4f}'.format)
pd.set_option("display.max_columns", 100)
Reading in our data#
We'll start by reading in our datasets (as we must!).
Life expectancy data#
The first dataset is USALEEP, the U.S. Small-area Life Expectancy Estimates Project. It's from the National Center for Health Statistics at the CDC.
We're going to rename a few columns so they make a little more sense. The columns themselves aren't too complicated, mostly pointing out the census tracts and the life expectancy.
life_expec = pd.read_csv("data/US_A.CSV")
life_expec.columns = ['tract_id', 'STATE2KX','CNTY2KX', 'TRACT2KX', 'life_expectancy',
'life_expectancy_std_err', 'flag']
life_expec.head()
flag
sounds a little mysterious and scary, but it's just whether the numbers were observed or predicted.
Unemployment Data#
For unemployment, we'll be using data from the Census. You can get unemployment measures from a lot of places in the US government, so the Census Bureau actually publishes data on which ones you should use. In our case we're going to collect unemployment data from the American Community Survey, for the 5-year period nearest to our life expectancy dataset.
We'll be using Table B23025005, Employment Status for the Population 16 Years and Over. Take a look at the codebook for more details. We're only reading in a few columns to keep things looking clean.
columns = {
'Geo_FIPS': 'Geo_FIPS',
'ACS15_5yr_B23025001': 'total_pop',
'ACS15_5yr_B23025005': 'unemployed'
}
employment = pd.read_csv("data/R12221544_SL140.csv", usecols=columns.keys(), encoding='latin-1')
employment = employment.rename(columns=columns)
employment.head()
Since the important thing is percent unemployed, not just how many people are in an area, we'll need to do a little calculation.
Note that I'm multiplying by 100 here - if you have a little extra time, try running this notebook on your own with a 0-1.0 percentage instead of the 0-100 version.
employment['pct_unemployment'] = employment['unemployed'] / employment['total_pop'] * 100
employment.head()
Demographic data#
The other dataset we're using is also from the Census Bureau's American Community Survey. It's a collection of many more columns with very, very long names and very, very mysterious codes. The tables include:
- B03002: Hispanic or Latino Origin by Race
- B06009: Place of Birth by Educational Attainment in the United States
- C17002: Ratio of Income to Poverty Level
- B19013: Median Household Income
Again, we're only picking a few columns to read in. Generally they're the raw count of certain populations, as well as the total population counted of each table.
columns = {
'Geo_FIPS': 'Geo_FIPS',
'ACS15_5yr_B03002001': 'race_table_total',
'ACS15_5yr_B03002004': 'black',
'ACS15_5yr_B03002003': 'white',
'ACS15_5yr_B03002012': 'hispanic',
'ACS15_5yr_B06009001': 'edu_table_total',
'ACS15_5yr_B06009002': 'less_than_hs',
'ACS15_5yr_C17002004': 'poverty_100_124',
'ACS15_5yr_C17002005': 'poverty_125_149',
'ACS15_5yr_C17002001': 'poverty_table_total',
'ACS15_5yr_B19013001': 'income'
}
census = pd.read_csv("data/R12221550_SL140.csv", usecols=columns.keys(), encoding='latin-1')
census = census.rename(columns=columns)
census.head()
We'll again convert these raw population numbers into percentages - what percent of people are certain races? What percent of people have not finished high school?
census_features = pd.DataFrame({
'Geo_FIPS': census.Geo_FIPS,
'pct_black': census.black / census.race_table_total * 100,
'pct_white': census.white / census.race_table_total * 100,
'pct_hispanic': census.hispanic / census.race_table_total * 100,
'pct_less_than_hs': census.less_than_hs / census.edu_table_total * 100,
'pct_under_150_poverty': (census.poverty_100_124 + census.poverty_125_149) / census.poverty_table_total * 100,
'income': census.income,
})
census_features.head()
Merge the data#
To perform our regression, we need all of our data in a single dataframe. Fortunately these dataframes all keep track of census tracks with FIPS codes, unique geographic codes that we can rely on to merge our datasets.
merged = life_expec.merge(employment, left_on='tract_id', right_on='Geo_FIPS')
merged = merged.merge(census_features, left_on='Geo_FIPS', right_on='Geo_FIPS')
merged.head()
Running the regression#
Using the statsmodels
package, we'll run a linear regression to find the relationship between life expectancy and our calculated columns. Note that we're using the formula method of writing a regression instead of the dataframes method. I find it both more readable and more usable than the dataframes method.
Python note: I'm also using a multiline string which is started and ended using three quotation marks. Multi-line strings and formulas are a match made in heaven!
import statsmodels.formula.api as smf
model = smf.ols("""
life_expectancy ~ pct_black + pct_white + pct_hispanic + pct_less_than_hs
+ pct_under_150_poverty + income + pct_unemployment
""", data=merged)
results = model.fit()
results.summary()
It's a lot to take in, so I'll zero us in on the important parts:
The coefficient for pct_unemployment
is -0.15, which means "for every percentage point increase in unemployment, life expectancy drops 0.15 years." While this might be correct it unfortunately sounds very stupid.
Income's coefficient is rather unfortunate, too. "For every extra dollar in median income, life expectancy goes up 0.00004825 years." No thanks! Individual dollars don't really make sense here.
Instead, it'd be nice to say something like "for every ten point increase of unemployment," or "for every 10k increase in income." If we were using the dataframes version of regression, we'd create new columns, maybe something like this:
merged['income_10k'] = merged.income / 10000
merged['unemployment_pct_10pts'] = merged.pct_unemployment / 10
Since we're using the formulas method, though, we can do the division right in the regression!
model = smf.ols("""
life_expectancy ~
pct_black
+ pct_white
+ pct_hispanic
+ pct_less_than_hs
+ pct_under_150_poverty
+ np.divide(income, 10000)
+ np.divide(pct_unemployment, 10)
""", data=merged)
results = model.fit()
results.summary()
That's much much nicer! Unemployment now has a -1.49 coefficient, which (surprise!) leads us right back to the original article:
An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy, the AP found.
Additionally, the negative coefficient for pct_less_than_hs
is the source of this line:
A neighborhood where more adults failed to graduate high school had shorter predicted longevity.
Even though we accomplished this pretty quickly in very few lines of code, the actual process at the Associated Press was somewhat more complicated. They didn't just throw a bunch of columns in a regression!
Stats-wise, the AP did things like examined other columns that didn't meet a p-value threshold, and reasoned about collinearity between variables different. For domain expertise, they included a threshold for poverty that's actually above the poverty line. This is due to the way that Medicaid provides health insurance to people in low-income households.
Review#
In this exercise, we reproduced an article from the Associated Press that analyzed life expectancy and demographic information from the census. They used a linear regression to find the relationship between census tract qualities like unemployment, education, race, and income and how long people live.
After running the regression once, we ran it a second time to get numbers that were more human and easier to use in a story, like a "1.5 year decrease in life expectancy" as opposed to a 0.15-year or 8-week decrease. We also used the formula version of a statsmodels linear regression to perform those calculations in the regression with np.divide
.
Discussion topics#
Let's say we have the sentence "a 1 percentage point increase in unemployment translates to a 0.15 year decrease in life expectancy." How do we change this into something people can appreciate?
Translate some of your coefficients into the form "every X percentage point change in unemployment translates to a Y change in life expectancy." Do this with numbers that are meaningful, and in a way that is easily understandable to your reader.
What is the difference between using percentage points (0 to 100) vs fractions (0 to 1) when doing a regression analysis?
Income, education, and race are all related, which can cause problems related to multicollinearity in our analysis. How can we avoid this?
We should consult someone who "knows something" before publishing this story. Where would we go to find someone who understands statistics and the subject matter?