4.1 Linear Regression
Before we do our regression, we want to make sure we don’t have any missing data. No one likes missing data, but linear regression dislikes it most of all. If we’re missing numbers for unemployment or life expectancy for any of our census tracts, our regression just plain won’t work!
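The exact code isn't reproduced here, but a minimal sketch of that check might look like this, assuming our census-tract data from the previous section is already loaded into a pandas dataframe called `df`:

```python
# Check how many rows and columns we have, drop any rows with
# missing values, then check again to see what survived
print(df.shape)
df = df.dropna()
print(df.shape)
```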
## (65662, 6)
## (65662, 6)
Once we’re sure we have our columns and have ditched any missing data, we’re free to run our regression.
When you run a regression you have two variables: an `X` and a `y` (and yes, they're usually capitalized like that). Speaking simply, `X` is the cause and `y` is the result. In this case, we're claiming something similar to "unemployment causes a change in life expectancy," so `X` is unemployment and `y` is life expectancy.

Let's start by pulling out our `X`, our unemployment percentage.
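Pulling out `X` probably looks something like the sketch below - the double brackets are what keep it as a dataframe instead of a single column:

```python
# Select our input: unemployment percentage for each tract.
# Double brackets return a dataframe (which can hold more columns later)
X = df[['unemployed_pct']]
X.head()
```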
## unemployed_pct
## 0 3.474903
## 1 6.701329
## 2 6.308411
## 3 2.695779
## 4 6.654991
Linear regressions always have exactly one `y`, but can have multiple variables in `X` - for example, later we'll look at how unemployment and income affect life expectancy. That's why up above we end up with a dataframe, which can hold multiple possible columns.

When we pull out `y` below, you'll notice it's a series instead: just one single column of values. Unlike our inputs, our outcome is only ever one number.
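Here's roughly what pulling out `y` looks like, again assuming the dataframe is called `df`:

```python
# Select our output: life expectancy, as a single series
y = df.life_expectancy
y.head()
```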
## 0 73.1
## 1 76.9
## 2 75.4
## 3 79.4
## 4 73.1
## Name: life_expectancy, dtype: float64
Now that we have our `X` and `y`, we can run the actual regression. We'll be using `statsmodels`, one of the popular Python packages for doing statistical analysis (another being scikit-learn). I'm going to move setting `X` and `y` into the same code block, too, just so we can see it all at once.
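The original block isn't reproduced here, but with statsmodels it would look roughly like the following. Note that `sm.OLS` doesn't add an intercept on its own, which is why we wrap `X` in `sm.add_constant` - that's where the `const` row in the results comes from.

```python
import statsmodels.api as sm

# Our input (unemployment) and output (life expectancy)
X = df[['unemployed_pct']]
y = df.life_expectancy

# Fit an ordinary least squares regression, adding a constant
# column so the model can learn an intercept
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
results.summary()
```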
## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: life_expectancy R-squared: 0.169
## Model: OLS Adj. R-squared: 0.169
## Method: Least Squares F-statistic: 1.336e+04
## Date: Tue, 14 Jan 2020 Prob (F-statistic): 0.00
## Time: 13:29:54 Log-Likelihood: -1.7810e+05
## No. Observations: 65662 AIC: 3.562e+05
## Df Residuals: 65660 BIC: 3.562e+05
## Df Model: 1
## Covariance Type: nonrobust
## ==================================================================================
## coef std err t P>|t| [0.025 0.975]
## ----------------------------------------------------------------------------------
## const 81.1377 0.028 2856.410 0.000 81.082 81.193
## unemployed_pct -0.5214 0.005 -115.595 0.000 -0.530 -0.513
## ==============================================================================
## Omnibus: 616.108 Durbin-Watson: 1.117
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 807.895
## Skew: -0.146 Prob(JB): 3.70e-176
## Kurtosis: 3.459 Cond. No. 12.8
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """
The code that learns from our data is called the model. After the model analyzes each row of our data, it decides the relationship between `unemployed_pct` and `life_expectancy`. This relationship then shows up under the `coef` section.

Under `coef` it lists `unemployed_pct` as `-0.5214`… but what does that mean?
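We'll dig into that next, but as a quick preview, the two numbers under `coef` describe a line: predicted life expectancy is the constant plus the unemployment coefficient times the unemployment percentage. A rough back-of-the-envelope check, using a made-up tract with 10% unemployment:

```python
# The fitted line: life expectancy = const + coef * unemployed_pct
const = 81.1377
coef = -0.5214

# A hypothetical tract where 10% of people are unemployed
print(const + coef * 10)   # about 75.9 years
```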