4.1 Linear Regression

Before we do our regression, we want to make sure we don’t have any missing data. No one likes missing data, but linear regression dislikes it most of all. If we’re missing numbers for unemployment or life expectancy for any of our census tracts, our regression just plain won’t work!

# Check how many rows we have to start with
df.shape
## (65662, 6)
# Drop everything missing, see how many are left
df = df.dropna()
df.shape
## (65662, 6)
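
Our two shapes match, so nothing was dropped. If they hadn't matched, it would help to know which columns were responsible. The sketch below uses a small made-up dataframe (the name toy and all of its values are invented for illustration, not taken from the census data) to show one way to count missing values per column, and to drop rows only when a column the regression needs is missing.

```python
import numpy as np
import pandas as pd

# Toy data with invented values, standing in for the real census dataframe
toy = pd.DataFrame({
    "unemployed_pct": [3.5, np.nan, 6.3, 2.7],
    "life_expectancy": [73.1, 76.9, np.nan, 79.4],
    "notes": [None, "a", "b", "c"],  # a column the regression won't use
})

# Count the missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop rows only when one of the regression columns is missing
cleaned = toy.dropna(subset=["unemployed_pct", "life_expectancy"])
print(cleaned.shape)
```

Passing subset keeps rows that are only missing something the regression never looks at, which a plain dropna() would throw away.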

Once we’re sure we have our columns and have ditched any missing data, we’re free to run our regression.

When you run a regression you have two variables - an X and a y (and yes, they’re usually capitalized like that). Speaking simply, X is the cause and y is the result - in this case, we’re claiming something similar to “unemployment causes a change in life expectancy,” so X is unemployment and y is life expectancy.

Let’s start by pulling out our X, our unemployment percentage.

import statsmodels.api as sm

X = df[['unemployed_pct']]
X.head()
##    unemployed_pct
## 0        3.474903
## 1        6.701329
## 2        6.308411
## 3        2.695779
## 4        6.654991

Linear regressions always have a single y, but X can be made up of multiple columns - for example, later we’ll look at how unemployment and income affect life expectancy. That’s why we used double brackets up above and ended up with a dataframe, which can hold multiple columns.

When we pull out y below, you’ll notice it’s a series instead, just one single column of values. Unlike our inputs, our outcome is only ever one number per row.

y = df.life_expectancy
y.head()
## 0    73.1
## 1    76.9
## 2    75.4
## 3    79.4
## 4    73.1
## Name: life_expectancy, dtype: float64
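
The dataframe-versus-series difference comes down to how we ask pandas for the column. Here's a minimal sketch on a made-up dataframe (toy, X_toy and y_toy are invented names, and the numbers are for illustration only): a list of column names gives back a dataframe, while a single column name gives back a series.

```python
import pandas as pd

# Toy data with invented values
toy = pd.DataFrame({
    "unemployed_pct": [3.5, 6.7, 6.3],
    "life_expectancy": [73.1, 76.9, 75.4],
})

X_toy = toy[["unemployed_pct"]]   # list of column names -> DataFrame
y_toy = toy["life_expectancy"]    # single column name -> Series

print(type(X_toy).__name__, X_toy.shape)  # DataFrame (3, 1)
print(type(y_toy).__name__, y_toy.shape)  # Series (3,)
```

Writing toy["life_expectancy"] does the same thing as the toy.life_expectancy style used above; the bracket version also works when a column name has spaces in it.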

Now that we have our X and y, we can run the actual regression. We’ll be using statsmodels, one of the popular Python packages for doing statistical analysis (another being scikit-learn). I’m going to move setting X and y into the same code block, too, just so we can see it all at once.

# Create our X and y
X = df[['unemployed_pct']]
y = df.life_expectancy

# Add a constant column so the model can fit an intercept (the const row in the results)
X = sm.add_constant(X)
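# Build an ordinary least squares model: outcome first, inputs second
model = sm.OLS(y, X)

# Fit the model to the data and show the results table
results = model.fit()
results.summary()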
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:        life_expectancy   R-squared:                       0.169
## Model:                            OLS   Adj. R-squared:                  0.169
## Method:                 Least Squares   F-statistic:                 1.336e+04
## Date:                Tue, 14 Jan 2020   Prob (F-statistic):               0.00
## Time:                        13:29:54   Log-Likelihood:            -1.7810e+05
## No. Observations:               65662   AIC:                         3.562e+05
## Df Residuals:                   65660   BIC:                         3.562e+05
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ==================================================================================
##                      coef    std err          t      P>|t|      [0.025      0.975]
## ----------------------------------------------------------------------------------
## const             81.1377      0.028   2856.410      0.000      81.082      81.193
## unemployed_pct    -0.5214      0.005   -115.595      0.000      -0.530      -0.513
## ==============================================================================
## Omnibus:                      616.108   Durbin-Watson:                   1.117
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):              807.895
## Skew:                          -0.146   Prob(JB):                    3.70e-176
## Kurtosis:                       3.459   Cond. No.                         12.8
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """

The code that learns from our data is called the model. After the model has been fit to every row of our data, it gives us its best estimate of the relationship between unemployed_pct and life_expectancy. This relationship then shows up in the coef column of the results.

Under coef it lists unemployed_pct as -0.5214… but what’s that mean?