5.1 Performing the multivariate regression
Remember how linear regression hates missing data? Nothing’s changed since last chapter! Before we perform our regression, we’ll need to drop any rows missing data, whether it’s race data, unemployment data, poverty data, or anything else.
## (65662, 10)
## (65656, 10)
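The two shapes above are the row and column counts before and after dropping incomplete rows. As a minimal sketch of that step, assuming a pandas DataFrame and using a tiny made-up table in place of the real census-tract data (the values below are hypothetical), it might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the chapter's DataFrame
df = pd.DataFrame({
    "life_expectancy": [78.1, np.nan, 80.3, 75.9],
    "unemployed_pct": [5.2, 7.1, np.nan, 9.8],
    "median_income_10k": [5.5, 4.1, 7.2, 3.0],
})

print(df.shape)   # rows and columns before dropping
df = df.dropna()  # drop every row that has at least one missing value
print(df.shape)   # rows and columns after dropping
```

Calling dropna with no arguments removes a row if any column is missing, which is the behavior we want before handing the data to a regression.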
Now we can perform our regression. This time, instead of picking the columns we want to include, we’re just going to use .drop to remove the columns we don’t want. First we’ll look at the result in isolation before we do the regression.
import statsmodels.api as sm
X = df.drop(columns=['Tract ID', 'Geo_FIPS', 'life_expectancy'])
X = sm.add_constant(X)
X.head()
Note that we drop life_expectancy because it’s what we’re predicting. That makes it our y value, not our X. And even though we have all these extra columns this time, we still need to add the constant: without it, linear regression would force predicted life expectancy to be zero whenever all of our columns were zero.
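To see why the constant matters, here is a small sketch on made-up numbers. It uses numpy's least-squares solver as a stand-in for sm.OLS, so the data and variable names are assumptions for illustration only. The column of ones plays the role of what sm.add_constant adds.

```python
import numpy as np

# Made-up data that follows y = 80 - 0.5 * x exactly
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = 80.0 - 0.5 * x

# Without a constant column, the fit is forced through the origin
X_no_const = x.reshape(-1, 1)
slope_only = np.linalg.lstsq(X_no_const, y, rcond=None)[0][0]

# With a column of ones, the fit is free to pick an intercept
X_const = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X_const, y, rcond=None)[0]

print("no constant:   slope =", slope_only)
print("with constant: intercept =", intercept, "slope =", slope)
```

Without the constant, the line has to pass through zero, so the slope comes out badly wrong (it even has the wrong sign here). With the constant, the fit recovers the intercept and slope that generated the data.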
Now let’s do our multivariate regression.
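import statsmodels.api as sm

# Everything except the ID columns and the outcome is a predictor
X = df.drop(columns=['Tract ID', 'Geo_FIPS', 'life_expectancy'])
# The outcome we are predicting
y = df['life_expectancy']
# Add the intercept column
X = sm.add_constant(X)

model = sm.OLS(y, X)
results = model.fit()
results.summary()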
## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: life_expectancy R-squared: 0.490
## Model: OLS Adj. R-squared: 0.490
## Method: Least Squares F-statistic: 8997.
## Date: Tue, 14 Jan 2020 Prob (F-statistic): 0.00
## Time: 13:29:58 Log-Likelihood: -1.6208e+05
## No. Observations: 65656 AIC: 3.242e+05
## Df Residuals: 65648 BIC: 3.243e+05
## Df Model: 7
## Covariance Type: nonrobust
## =======================================================================================
## coef std err t P>|t| [0.025 0.975]
## ---------------------------------------------------------------------------------------
## const 81.2365 0.122 665.628 0.000 80.997 81.476
## ritp_100_149_pct -0.0596 0.003 -21.738 0.000 -0.065 -0.054
## black_pct -0.0666 0.001 -56.960 0.000 -0.069 -0.064
## white_pct -0.0386 0.001 -36.707 0.000 -0.041 -0.037
## hisp_pct 0.0131 0.001 10.298 0.000 0.011 0.016
## unemployed_pct -0.1490 0.004 -33.408 0.000 -0.158 -0.140
## ea_less_than_hs_pct -0.0862 0.002 -48.979 0.000 -0.090 -0.083
## median_income_10k 0.4825 0.006 83.217 0.000 0.471 0.494
## ==============================================================================
## Omnibus: 2114.193 Durbin-Watson: 1.520
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 4788.035
## Skew: 0.183 Prob(JB): 0.00
## Kurtosis: 4.271 Cond. No. 790.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """
So many numbers! First, notice that the coefficient for unemployed_pct changed from -0.5214 to -0.1490. Instead of losing half a year of life expectancy for a 1 percentage point increase in unemployment, you now lose about 0.15 years (a little under two months). Why did this happen?
When you do a multivariate regression, the regression is considering all of the factors at the same time. So now instead of thinking that unemployment is the only factor going into life expectancy, the regression is also weighing racial breakdown, income, and the other fields.
The new regression then discovered that some of the variation that the first regression said was due to unemployment is actually better explained by those other fields, so the coefficient for unemployment moved closer to zero.
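The same effect can be reproduced on synthetic data. The sketch below is an illustration, not the AP analysis: the numbers are invented, income and unemployment are made to be correlated on purpose, and numpy's least-squares solver stands in for sm.OLS.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Invented data: areas with higher income tend to have lower unemployment
income = rng.normal(5.0, 1.5, n)
unemployed = 10.0 - 1.0 * income + rng.normal(0.0, 1.0, n)
# Life expectancy depends on BOTH fields (true unemployment effect: -0.15)
life = 78.0 - 0.15 * unemployed + 0.5 * income + rng.normal(0.0, 1.0, n)

def ols(columns, y):
    """Least-squares fit with an intercept; returns [const, coef1, coef2, ...]."""
    X = np.column_stack([np.ones(len(y))] + columns)
    return np.linalg.lstsq(X, y, rcond=None)[0]

single = ols([unemployed], life)         # unemployment only
multi = ols([unemployed, income], life)  # unemployment plus income

print("unemployment coefficient, alone:      ", single[1])
print("unemployment coefficient, with income:", multi[1])
```

When unemployment is the only column, it soaks up credit for the income effect too, so its coefficient is further from zero. Once income is in the regression, the unemployment coefficient lands near the value used to generate the data.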
Let’s take a look at all of the coefficients and see what we get.
| variable | coefficient | meaning |
|---|---|---|
| const | 81.2365 | If everything else is 0, life expectancy will be about 81 years |
| ritp_100_149_pct | -0.0596 | For every 1 percentage point increase in people just above the poverty line, life expectancy goes down 0.06 years (~3 weeks) |
| black_pct | -0.0666 | For every 1 percentage point increase in the black population, life expectancy goes down 0.07 years (~3.5 weeks) |
| white_pct | -0.0386 | For every 1 percentage point increase in the white population, life expectancy goes down 0.04 years (~2 weeks) |
| hisp_pct | 0.0131 | For every 1 percentage point increase in the Hispanic population, life expectancy goes up 0.01 years (~5 days) |
| unemployed_pct | -0.1490 | For every 1 percentage point increase in the unemployed population, life expectancy goes down 0.15 years (~8 weeks) |
| ea_less_than_hs_pct | -0.0862 | For every 1 percentage point increase in people with less than a high school education, life expectancy goes down 0.09 years (~4.5 weeks) |
| median_income_10k | 0.4825 | For every additional $10,000 in median income in an area, life expectancy goes up 0.48 years (~6 months) |
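The weeks, days, and months in the meaning column are just the coefficients converted out of years. A rough sketch of that arithmetic, with the coefficients copied by hand from the regression summary (the helper names are made up for this example):

```python
# Coefficients copied from the regression summary above, in years per unit
coefficients = {
    "ritp_100_149_pct": -0.0596,
    "black_pct": -0.0666,
    "white_pct": -0.0386,
    "hisp_pct": 0.0131,
    "unemployed_pct": -0.1490,
    "ea_less_than_hs_pct": -0.0862,
    "median_income_10k": 0.4825,
}

def years_to_weeks(years):
    return years * 52

def years_to_days(years):
    return years * 365

def years_to_months(years):
    return years * 12

for name, coef in coefficients.items():
    print(
        f"{name}: {coef:+.4f} years = "
        f"{years_to_weeks(coef):+.1f} weeks = "
        f"{years_to_days(coef):+.1f} days"
    )
```

These are approximations (52 weeks and 365 days per year), which is fine for turning a coefficient into something a reader can picture.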
So now we can find that line from the final published piece:
An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy, the AP found. A neighborhood where more adults failed to graduate high school had shorter predicted longevity.
While the regression was dealing with one percentage point of unemployment, saying “one percentage point loses roughly 8 weeks of life expectancy” is not nearly as impactful. If we multiply both sides by ten, we get something much more publishable: 1 percentage point becomes 10 percentage points, and −0.15 years becomes −1.5 years. And then we’re published in the AP!
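Here is that scaling as a quick sketch. The coefficient and its 95% confidence interval are copied from the unemployed_pct row of the summary; the variable names are just for illustration. Scaling the interval along with the coefficient shows the range behind the published "roughly a year and a half."

```python
# Values copied from the unemployed_pct row of the regression summary
coef = -0.1490               # years of life expectancy per 1 percentage point
ci_low, ci_high = -0.158, -0.140

change = 10                  # percentage points of unemployment

effect = coef * change
effect_low = ci_low * change
effect_high = ci_high * change

print(f"{change} points of unemployment: {effect:.2f} years")
print(f"95% interval: {effect_low:.2f} to {effect_high:.2f} years")
```

Because the regression is linear, multiplying the change by ten multiplies the predicted effect (and its interval) by ten as well.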