5.1 Performing the multivariate regression

Remember how linear regression hates missing data? Nothing’s changed since last chapter! Before we perform our regression, we’ll need to drop any rows missing data, whether it’s race data, unemployment data, poverty data, or anything else.

# Make note of our original dataframe size
df.shape

## (65662, 10)

# Drop rows with missing data, compare size
df = df.dropna()
df.shape

## (65656, 10)
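If you want to know where the missing values are before dropping them, a quick per-column count can help. The sketch below uses a small made-up dataframe (`toy` and its values are placeholders, not the real dataset) to show the idea.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; these values are invented for illustration
toy = pd.DataFrame({
    'life_expectancy': [78.1, 80.4, np.nan, 75.9],
    'unemployed_pct': [5.2, np.nan, 7.7, 10.1],
    'median_income_10k': [4.8, 6.1, 3.9, 2.7],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value, then compare sizes
cleaned = toy.dropna()
print(len(toy) - len(cleaned), "rows dropped")
```

Running the same `.isna().sum()` on the real `df` before calling `.dropna()` would show which columns are responsible for the dropped rows.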

Now we can perform our regression. This time, instead of picking the columns we want to include, we’ll use .drop to remove the ones we don’t want. First we’ll look at the result on its own before we run the regression.

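import statsmodels.api as sm

# Keep everything except the ID columns and the value we're predicting
X = df.drop(columns=['Tract ID', 'Geo_FIPS', 'life_expectancy'])
X = sm.add_constant(X)
X.head()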

Note that we drop life_expectancy because it’s what we’re predicting. That makes it our y value, not part of our X. And even though we have all these extra columns this time, we still need to add the constant: without it, the regression would be forced to predict a life expectancy of zero whenever every one of our columns was zero.
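To see why the constant matters, here is a minimal sketch on simulated data (the numbers are invented, and it uses plain NumPy least squares rather than statsmodels). `sm.add_constant` adds a column of ones; the second fit below builds that column by hand.

```python
import numpy as np

# Simulated data: y starts around 80 and drops as x rises (made-up numbers)
rng = np.random.default_rng(0)
x = rng.uniform(0, 20, size=500)
y = 80 - 0.5 * x + rng.normal(0, 1, size=500)

# Fit WITHOUT a constant: the only column is x, so the line must pass through zero
X_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)
pred_at_zero_no_const = float(slope_only[0] * 0)

# Fit WITH a constant: a column of ones lets the regression learn a starting point
X_const = np.column_stack([np.ones_like(x), x])
coefs, *_ = np.linalg.lstsq(X_const, y, rcond=None)
intercept, slope = coefs

print("Prediction at x=0 without constant:", pred_at_zero_no_const)
print("Intercept with constant:", intercept)
print("Slope with constant:", slope)
```

Without the column of ones, the prediction at zero is stuck at zero no matter what the data says. With it, the intercept is free to land near the true starting value.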

Now let’s do our multivariate regression.

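import statsmodels.api as sm

# X is every predictor column; y is the value we're predicting
X = df.drop(columns=['Tract ID', 'Geo_FIPS', 'life_expectancy'])
y = df['life_expectancy']

# Add the constant, then build and fit the model
X = sm.add_constant(X)
model = sm.OLS(y, X)

results = model.fit()
results.summary()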

## <class 'statsmodels.iolib.summary.Summary'>
## """
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:        life_expectancy   R-squared:                       0.490
## Model:                            OLS   Adj. R-squared:                  0.490
## Method:                 Least Squares   F-statistic:                     8997.
## Date:                Tue, 14 Jan 2020   Prob (F-statistic):               0.00
## Time:                        13:29:58   Log-Likelihood:            -1.6208e+05
## No. Observations:               65656   AIC:                         3.242e+05
## Df Residuals:                   65648   BIC:                         3.243e+05
## Df Model:                           7                                         
## Covariance Type:            nonrobust                                         
## =======================================================================================
##                           coef    std err          t      P>|t|      [0.025      0.975]
## ---------------------------------------------------------------------------------------
## const                  81.2365      0.122    665.628      0.000      80.997      81.476
## ritp_100_149_pct       -0.0596      0.003    -21.738      0.000      -0.065      -0.054
## black_pct              -0.0666      0.001    -56.960      0.000      -0.069      -0.064
## white_pct              -0.0386      0.001    -36.707      0.000      -0.041      -0.037
## hisp_pct                0.0131      0.001     10.298      0.000       0.011       0.016
## unemployed_pct         -0.1490      0.004    -33.408      0.000      -0.158      -0.140
## ea_less_than_hs_pct    -0.0862      0.002    -48.979      0.000      -0.090      -0.083
## median_income_10k       0.4825      0.006     83.217      0.000       0.471       0.494
## ==============================================================================
## Omnibus:                     2114.193   Durbin-Watson:                   1.520
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4788.035
## Skew:                           0.183   Prob(JB):                         0.00
## Kurtosis:                       4.271   Cond. No.                         790.
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """

So many numbers! First, notice that the coefficient for unemployed_pct changed from -0.5214 to -0.1490. Instead of losing half a year of life expectancy for each 1 percentage point increase in unemployment, you now lose about 0.15 years (a little under two months). Why did this happen?

When you do a multivariate regression, the regression is considering all of the factors at the same time. So now instead of thinking that unemployment is the only factor going into life expectancy, the regression is also weighing racial breakdown, income, and the other fields.

The new regression then discovered that some of the variation that the first regression said was due to unemployment is actually better explained by those other fields, so the coefficient for unemployment moved closer to zero.
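Here is a rough simulation of that effect. It is only a sketch: the data is invented, the variable names are placeholders, and it uses NumPy least squares instead of statsmodels. Unemployment and income are built to be correlated, and both feed into life expectancy.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Made-up data: lower income tends to come with higher unemployment
income = rng.normal(0, 1, size=n)
unemployment = -0.8 * income + rng.normal(0, 1, size=n)

# The "true" effect of unemployment here is -0.15; income has its own effect
life = 80 - 0.15 * unemployment + 0.5 * income + rng.normal(0, 1, size=n)

def ols(columns, y):
    """Least-squares fit with a constant; returns [intercept, coef1, coef2, ...]."""
    X = np.column_stack([np.ones(len(y))] + columns)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

# Regression with unemployment alone
single_slope = ols([unemployment], life)[1]

# Regression with unemployment AND income
multi_slope = ols([unemployment, income], life)[1]

print("Unemployment coefficient, alone:      ", single_slope)
print("Unemployment coefficient, with income:", multi_slope)
```

On its own, unemployment soaks up some of the effect that really belongs to income, so its coefficient looks larger. Once income is in the model, the unemployment coefficient moves back toward the value the data was built with.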

Let’s take a look at all of the coefficients and see what we get.

| variable | coefficient | meaning |
|---|---|---|
| const | 81.2365 | If everything else is 0, life expectancy will be about 81 years |
| ritp_100_149_pct | -0.0596 | For every 1 percentage point increase in people just above the poverty line, life expectancy goes down 0.06 years (~3 weeks) |
| black_pct | -0.0666 | For every 1 percentage point increase in the black population, life expectancy goes down 0.07 years (~3.5 weeks) |
| white_pct | -0.0386 | For every 1 percentage point increase in the white population, life expectancy goes down 0.04 years (~2 weeks) |
| hisp_pct | 0.0131 | For every 1 percentage point increase in the Hispanic population, life expectancy goes up 0.01 years (~5 days) |
| unemployed_pct | -0.1490 | For every 1 percentage point increase in the unemployed population, life expectancy goes down 0.15 years (~8 weeks) |
| ea_less_than_hs_pct | -0.0862 | For every 1 percentage point increase in people with less than a high school education, life expectancy goes down 0.09 years (~4.5 weeks) |
| median_income_10k | 0.4825 | For every additional $10,000 in median income in an area, life expectancy goes up 0.48 years (~6 months) |
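The "weeks" figures come from converting fractions of a year. A small sketch of that arithmetic is below; the coefficients are copied from the regression summary, while the dictionary and the `WEEKS_PER_YEAR` constant are just conveniences for this example.

```python
# Average number of weeks in a year
WEEKS_PER_YEAR = 365.25 / 7

# Coefficients copied from the regression summary (the constant is left out)
coefficients = {
    'ritp_100_149_pct': -0.0596,
    'black_pct': -0.0666,
    'white_pct': -0.0386,
    'hisp_pct': 0.0131,
    'unemployed_pct': -0.1490,
    'ea_less_than_hs_pct': -0.0862,
    'median_income_10k': 0.4825,
}

# Convert each coefficient from years into weeks of life expectancy
weeks = {name: coef * WEEKS_PER_YEAR for name, coef in coefficients.items()}

for name, value in weeks.items():
    print(f"{name}: {coefficients[name]:+.4f} years is about {value:+.1f} weeks")
```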

So now we can find that line from the final published piece:

An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy, the AP found. A neighborhood where more adults failed to graduate high school had shorter predicted longevity.

While the regression was dealing with one percentage point of unemployment, saying “one percentage point loses roughly 8 weeks of life expectancy” is not nearly as impactful. If we multiply both sides by ten, we get something much more publishable: 1 percentage point becomes 10 percentage points, and −0.15 years becomes −1.5 years. And then we’re published in the AP!
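The scaling works because a linear regression's predicted change is just the coefficient times the change in the predictor. A tiny sketch, with a helper function (`predicted_change`) that is made up for this example:

```python
# Coefficient for unemployed_pct, copied from the regression summary
unemployed_coef = -0.1490

def predicted_change(coef, change_in_predictor):
    """Predicted change in the outcome for a given change in one predictor."""
    return coef * change_in_predictor

# A 1 percentage point change versus a 10 percentage point change
one_point = predicted_change(unemployed_coef, 1)
ten_points = predicted_change(unemployed_coef, 10)

print(f"1 point:   {one_point:.2f} years")
print(f"10 points: {ten_points:.2f} years")
```

This only describes what the model predicts when everything else is held constant; it is the same coefficient either way, just reported at a more readable scale.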