2.6 Linear Regression
2.6.1 Correlation
Before we start getting fancy with a regression, there’s a quick check called correlation that we can do to see if two variables are related. Correlation gives you the general idea of “if one number goes up, does another number go up, too?”
x |
---|
-0.098948 |
x |
---|
0.098948 |
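As a rough sketch of how a correlation check might be run with pandas: the toy numbers below are made up purely for illustration and are not the pothole data, and the name toy is an assumption.

```python
import pandas as pd

# Made-up values for illustration only; this is NOT the pothole dataset.
toy = pd.DataFrame({
    "pct_minority": [0.10, 0.25, 0.40, 0.60, 0.85],
    "wait_days": [5.0, 7.5, 6.0, 9.0, 10.5],
})

# Pearson correlation ranges from -1 to 1:
#   near  1 -> the two columns tend to rise together
#   near -1 -> one tends to fall as the other rises
#   near  0 -> little or no linear relationship
r = toy["pct_minority"].corr(toy["wait_days"])
print(round(r, 3))
```

The order of the two columns doesn’t matter: correlating wait_days with pct_minority gives the same number.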
TODO
2.6.2 Linear Regression
TODO
Journalistically, linear regression allows us to make statements like “for every X percent increase in minorities in an area, pothole wait times will go up Y days”.
Always always always always check your regressions with an expert. You (probably) aren’t a mathematician, and there are a lot of ‘gotchas’ that you can come across when you’re dealing with statistics.
In our code, we’ll be asking what the effect of X is on y. No matter what stats package you use, these variable names will generally be the same! In this case, we want to know the effect of pct_minority on wait_days, so X is going to be pct_minority and y is going to be wait_days.
In this case, we’re using the statsmodels package for our regression, because it has a real nice-looking output. You run a linear regression with statsmodels like this:
import statsmodels.api as sm
X = df[['pct_minority']]
X = sm.add_constant(X)
y = df.wait_days
model = sm.OLS(y, X)
result = model.fit()
result.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: wait_days R-squared: 0.010
## Model: OLS Adj. R-squared: 0.010
## Method: Least Squares F-statistic: 126.4
## Date: Tue, 14 Jan 2020 Prob (F-statistic): 3.49e-29
## Time: 13:30:15 Log-Likelihood: -50049.
## No. Observations: 12783 AIC: 1.001e+05
## Df Residuals: 12781 BIC: 1.001e+05
## Df Model: 1
## Covariance Type: nonrobust
## ================================================================================
## coef std err t P>|t| [0.025 0.975]
## --------------------------------------------------------------------------------
## const 6.0386 0.247 24.489 0.000 5.555 6.522
## pct_minority 3.9611 0.352 11.242 0.000 3.270 4.652
## ==============================================================================
## Omnibus: 6385.090 Durbin-Watson: 1.411
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 37480.668
## Skew: 2.405 Prob(JB): 0.00
## Kurtosis: 9.873 Cond. No. 4.68
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """
The machine that learns from our data is called the model. After the model analyzes each row of our data, it decides the relationship between pct_minority and wait_days, which shows up under the coef section.
Under coef it lists pct_minority as 3.9611… but what’s that mean?
2.6.2.1 Understanding the coefficient
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 6.0386 0.247 24.489 0.000 5.555 6.522
pct_minority 3.9611 0.352 11.242 0.000 3.270 4.652
The coefficient - coef - is what goes in our sentence: “for every increase of 1 in pct_minority, pothole wait times will go up Y days”. In this case, the coefficient is 3.9611, so our sentence goes something like this:
For every increase of 1 in pct_minority, the number of days you wait is increased by 3.9611. (We’d hopefully round it up to around 4, because no one cares about those extra digits.)
Now, in this case pct_minority goes from 0-1, with 0 being 0% minorities and 1 being 100% minorities. As a result, “an increase of 1” from our sentence actually means increasing the share of minorities from 0% to 100%. That doesn’t really make sense, so you might do a little division to break it down into smaller units:
- The output: 1 point increase in pct_minority, an additional 4 days
- divided by 2: 0.5 increase in pct_minority, an additional 2 days
- divided by 4: 0.25 increase in pct_minority, an additional 1 day
As a result: if you have two areas, one with a pct_minority of 0.37 and one with a pct_minority of 0.62 (a 0.25 difference), you can expect pothole fixing to take an extra 1 day in the second area.
Very important note: this doesn’t mean a 25% increase in pct_minority (which would be 0.37 + 0.09 = 0.46), it means an actual increase of 0.25 (which would be 0.37 + 0.25 = 0.62).
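To make the arithmetic concrete, here is a small sketch that plugs the two numbers from the coef column into the straight-line formula a linear regression fits (prediction = const + coef × pct_minority). The function name predicted_wait is hypothetical, chosen just for this illustration.

```python
CONST = 6.0386  # "const" row of the summary table
COEF = 3.9611   # "pct_minority" row, with pct_minority on a 0-1 scale


def predicted_wait(pct_minority):
    """Predicted wait in days for a given pct_minority (0-1 scale)."""
    return CONST + COEF * pct_minority


area_a = predicted_wait(0.37)
area_b = predicted_wait(0.62)

# The constant cancels out, so the gap is just 0.25 * COEF.
extra_days = area_b - area_a
print(round(extra_days, 2))  # 0.99, i.e. roughly one extra day

# A 25% *relative* increase is a different thing entirely:
relative = 0.37 * 1.25
print(round(relative, 2))  # 0.46, not 0.62
```

Treat the predictions themselves with caution: they only describe the line the model drew, not what will happen at any particular address.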
2.6.2.2 Adjusting our units
While we can change those numbers around in our heads - 0.25 is 25%, 0.5 is 50%, etc - some people might find that kind of tough to think about. Even though 0-1 can work as a percent, you might have an easier time if we do our analysis with actual percentages.
To use “real” percentages, we can just multiply by 100.
| | address | GEOID | Geo_FIPS | pct_white | pct_minority | wait_days |
---|---|---|---|---|---|---|
0 | 3839 N 10TH ST | 55079004500 | 55079004500 | 2.405063 | 97.59494 | 1.250000 |
1 | 4900 W MELVINA ST | 55079003800 | 55079003800 | 8.824796 | 91.17520 | 8.833333 |
2 | 2400 W WISCONSIN AV | 55079014900 | 55079014900 | 40.313725 | 59.68627 | 9.750000 |
3 | 1800 W HAMPTON AV | 55079002300 | 55079002300 | 4.389407 | 95.61059 | 2.416667 |
4 | 4718 N 19TH ST | 55079002300 | 55079002300 | 4.389407 | 95.61059 | 17.416667 |
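The multiply-by-100 step might look something like the sketch below. The column names come from the table above, but the toy frame is an assumption: its two rows are back-calculated from the displayed values and only stand in for the real dataframe.

```python
import pandas as pd

# Stand-in rows on the original 0-1 scale (back-calculated from the
# table above; the real dataframe has many more rows and columns).
toy = pd.DataFrame({
    "pct_white": [0.02405063, 0.40313725],
    "pct_minority": [0.9759494, 0.5968627],
    "wait_days": [1.25, 9.75],
})

# Convert fractions to "real" percentages. wait_days is left alone.
for col in ["pct_white", "pct_minority"]:
    toy[col] = toy[col] * 100

print(toy)
```

Run this only once per dataframe: running it a second time would multiply by 100 again and give you values in the thousands.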
Now that we’ve adjusted our numbers, let’s try out the regression one more time:
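import statsmodels.api as sm
X = df[['pct_minority']]  # the predictor, kept as a one-column dataframe
X = sm.add_constant(X)  # adds the const column for the intercept
y = df.wait_days  # the outcome we want to explain
model = sm.OLS(y, X)  # ordinary least squares; note that y comes first
result = model.fit()
result.summary()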
## <class 'statsmodels.iolib.summary.Summary'>
## """
## OLS Regression Results
## ==============================================================================
## Dep. Variable: wait_days R-squared: 0.010
## Model: OLS Adj. R-squared: 0.010
## Method: Least Squares F-statistic: 126.4
## Date: Tue, 14 Jan 2020 Prob (F-statistic): 3.49e-29
## Time: 13:30:15 Log-Likelihood: -50049.
## No. Observations: 12783 AIC: 1.001e+05
## Df Residuals: 12781 BIC: 1.001e+05
## Df Model: 1
## Covariance Type: nonrobust
## ================================================================================
## coef std err t P>|t| [0.025 0.975]
## --------------------------------------------------------------------------------
## const 6.0386 0.247 24.489 0.000 5.555 6.522
## pct_minority 0.0396 0.004 11.242 0.000 0.033 0.047
## ==============================================================================
## Omnibus: 6385.090 Durbin-Watson: 1.411
## Prob(Omnibus): 0.000 Jarque-Bera (JB): 37480.668
## Skew: 2.405 Prob(JB): 0.00
## Kurtosis: 9.873 Cond. No. 161.
## ==============================================================================
##
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """
The numbers are much smaller, but you might find them easier to deal with.
- The output: 1 percentage point increase in minorities, an additional 0.04 days
- multiplied by 10: 10 percentage point increase in minorities, an additional 0.4 days
- multiplied by 25: 25 percentage point increase in minorities, an additional 1 day
You’re welcome to multiply, divide, or otherwise rescale your variables before you run the actual regression. Just think about what your final sentence might be and aim for units that fit it.
Again, if you have two areas, one with a pct_minority of 37% and one with a pct_minority of 62%, you can expect pothole fixing to take an extra 1 day in the second area. And yes, this is not a 25% increase from 37% to 46%; it is an increase of 25 percentage points, 37% + 25% = 62%.
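If you want to convince yourself that rescaling only changes the units and not the story, a quick sketch like this one can help. It uses synthetic data and NumPy’s polyfit as a lightweight stand-in for statsmodels; both are assumptions for the sake of illustration and are not part of the pothole analysis.

```python
import numpy as np

# Synthetic data: a baseline of 6 days plus 4 days per unit of x, with noise.
rng = np.random.default_rng(0)
x_fraction = rng.uniform(0, 1, size=200)  # 0-1 scale
y = 6 + 4 * x_fraction + rng.normal(0, 1, size=200)

# Fit a straight line twice: once on 0-1 and once on 0-100.
# polyfit returns the highest-degree term first: (slope, intercept).
slope_frac, intercept_frac = np.polyfit(x_fraction, y, deg=1)
slope_pct, intercept_pct = np.polyfit(x_fraction * 100, y, deg=1)

print(slope_frac, slope_pct * 100)    # same slope once units are matched
print(intercept_frac, intercept_pct)  # the intercept does not move
```

Multiplying the predictor by 100 divides its coefficient by 100 and leaves the constant alone, which matches the two summary tables: 3.9611 became 0.0396 and const stayed at 6.0386.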
2.6.2.3 Meaning of const
Under coef there’s another coefficient we’ve been ignoring named const.
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 6.0386 0.247 24.489 0.000 5.555 6.522
pct_minority 0.0396 0.004 11.242 0.000 0.033 0.047
The basic idea is that linear regression loves the number zero. By default, linear regression in statsmodels assumes that if you have a pct_minority of zero, wait_days will also be zero.
Since that’s not true at all, you always need to add in this constant. That’s what the weird .add_constant thing was when we were building our model:
X = sm.add_constant(X)
It means “Hey, model! Zero pct_minority doesn’t mean zero wait_days. Thanks!” And as a result, the model comes up with a const of 6.0386 - the number of days you’ll wait if pct_minority is zero.
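To see why the constant matters, here is a small sketch on made-up, perfectly linear data. It uses NumPy’s least-squares solver instead of statsmodels (a swapped-in tool, chosen only to keep the example small); the column of ones it builds by hand plays the same role as the column that add_constant adds.

```python
import numpy as np

# Made-up data that follows wait = 6 + 4 * x exactly.
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = 6.0 + 4.0 * x

# Without a constant, the line is forced through (0, 0).
X_no_const = x.reshape(-1, 1)
(slope_only,), *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# With a column of ones, the model can learn a baseline.
X_with_const = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(X_with_const, y, rcond=None)

print(slope_only)        # about 12: badly inflated to make up for the missing baseline
print(intercept, slope)  # about 6 and 4: the values the data was built from
```

Forcing the line through zero pushes all of the baseline wait into the slope, which is why leaving out the constant can make a relationship look far stronger than it is.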