2.6 Linear Regression

2.6.1 Correlation

Before we get fancy with a regression, there’s a quick check we can do to see whether two variables are related: correlation. Correlation gives you the general idea of “if one number goes up, does another number go up, too?”

df['pct_white'].corr(df['wait_days'])
## -0.098948
df['pct_minority'].corr(df['wait_days'])
## 0.098948
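
The two correlations above are the same size with opposite signs. That is what you would expect if pct_white and pct_minority always add up to 1, which is an assumption here rather than something the output proves. Below is a minimal sketch of the idea in plain Python, using made-up numbers instead of the pothole data; the pearson helper is written out by hand for illustration and is not part of pandas.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, made up for illustration (not the pothole data)
pct_minority = [0.10, 0.35, 0.50, 0.80, 0.95]
pct_white = [1 - p for p in pct_minority]  # assumption: the two shares sum to 1
wait_days = [5.0, 9.0, 4.0, 12.0, 8.0]

r_minority = pearson(pct_minority, wait_days)
r_white = pearson(pct_white, wait_days)

# Same magnitude, opposite sign
print(round(r_minority, 4), round(r_white, 4))
```

Because one column is just 1 minus the other, checking either one tells you everything the other would.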

TODO

2.6.2 Linear Regression

TODO

Journalistically, linear regression allows us to make statements like “for every X percent increase in minorities in an area, pothole wait times will go up Y days”.

Always always always always check your regressions with an expert. You (probably) aren’t a mathematician, and there are a lot of ‘gotchas’ that you can come across when you’re dealing with statistics.

In our code, we’ll be asking what the effect of X is on y. No matter what stats package you use, these variable names will generally be the same! In this case, we want to know the effect of pct_minority on wait_days, so X is going to be pct_minority and y is going to be wait_days.

In this case, we’re using the statsmodels package for our regression, because it has a real nice-looking output. You run a linear regression with statsmodels like this:

import statsmodels.api as sm
X = df[['pct_minority']]
X = sm.add_constant(X)
y = df.wait_days

model = sm.OLS(y, X)
result = model.fit()
result.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:              wait_days   R-squared:                       0.010
## Model:                            OLS   Adj. R-squared:                  0.010
## Method:                 Least Squares   F-statistic:                     126.4
## Date:                Tue, 14 Jan 2020   Prob (F-statistic):           3.49e-29
## Time:                        13:30:15   Log-Likelihood:                -50049.
## No. Observations:               12783   AIC:                         1.001e+05
## Df Residuals:                   12781   BIC:                         1.001e+05
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ================================================================================
##                    coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------
## const            6.0386      0.247     24.489      0.000       5.555       6.522
## pct_minority     3.9611      0.352     11.242      0.000       3.270       4.652
## ==============================================================================
## Omnibus:                     6385.090   Durbin-Watson:                   1.411
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):            37480.668
## Skew:                           2.405   Prob(JB):                         0.00
## Kurtosis:                       9.873   Cond. No.                         4.68
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """

The machine that learns from our data is called the model. After the model analyzes each row of our data, it estimates the relationship between pct_minority and wait_days, which shows up in the coef column.

Under coef it lists pct_minority as 3.9611… but what’s that mean?

2.6.2.1 Understanding the coefficient

                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            6.0386      0.247     24.489      0.000       5.555       6.522
pct_minority     3.9611      0.352     11.242      0.000       3.270       4.652

The coefficient - coef - is what goes in our sentence: “for every increase of 1 in pct_minority, pothole wait times will go up Y days”. In this case, the coefficient is 3.9611, so our sentence goes something like this:

For every increase of 1 in pct_minority, the number of days you wait increases by 3.9611. (We’d hopefully round it to around 4, because no one cares about those extra digits.)

Now, in this case pct_minority goes from 0 to 1, with 0 being 0% minorities and 1 being 100% minorities. As a result, “an increase of 1” from our sentence actually means going from an area with 0% minorities to one with 100% minorities. That comparison isn’t very practical, so you might do a little division to break it down into smaller units:

  • The output: 1 point increase in pct_minority, an additional 4 days
  • divided by 2: 0.5 increase in pct_minority, an additional 2 days
  • divided by 4: 0.25 increase in pct_minority, an additional 1 day

As a result: if you have two areas, one with a pct_minority of 0.37 and one with a pct_minority of 0.62 — a 0.25 difference — you can expect pothole fixing to take an extra 1 day in the second area.

Very important note: this doesn’t mean a 25% increase in pct_minority (which would be 0.37 + 0.09 = 0.46), it means an actual increase of 0.25 (which would be 0.37 + 0.25 = 0.62).
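
If the difference between the two kinds of increase feels slippery, it can help to run the arithmetic. This is a small sketch that only uses the 3.9611 coefficient from the output above and the 0.37 starting value from the example; the variable names are made up for illustration.

```python
coef = 3.9611  # days per 1.0 increase in pct_minority, from the output above

start = 0.37

# Percentage-POINT increase: add 0.25 to the share
points_up = start + 0.25    # 0.62
# PERCENT increase: grow the share by a quarter of itself
percent_up = start * 1.25   # 0.4625

extra_days_points = coef * (points_up - start)
extra_days_percent = coef * (percent_up - start)

print(round(extra_days_points, 2))   # 0.99
print(round(extra_days_percent, 2))  # 0.37
```

The "extra 1 day" sentence only matches the first number, so that is the comparison to describe when you write it up.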

2.6.2.2 Adjusting our units

While we can change those numbers around in our heads - 0.25 is 25%, 0.5 is 50%, etc. - some people might find that kind of tough to think about. Even though 0-1 can work as a percent, you might have an easier time if we do our analysis with actual percentages.

To use “real” percentages, we can just multiply by 100.

df['pct_minority'] = df.pct_minority * 100
df['pct_white'] = df.pct_white * 100
df.head()
               address        GEOID     Geo_FIPS  pct_white  pct_minority  wait_days
0       3839 N 10TH ST  55079004500  55079004500   2.405063      97.59494   1.250000
1    4900 W MELVINA ST  55079003800  55079003800   8.824796      91.17520   8.833333
2  2400 W WISCONSIN AV  55079014900  55079014900  40.313725      59.68627   9.750000
3    1800 W HAMPTON AV  55079002300  55079002300   4.389407      95.61059   2.416667
4       4718 N 19TH ST  55079002300  55079002300   4.389407      95.61059  17.416667

Now that we’ve adjusted our numbers, let’s try out the regression one more time:

import statsmodels.api as sm
X = df[['pct_minority']]
X = sm.add_constant(X)
y = df.wait_days

model = sm.OLS(y, X)
result = model.fit()
result.summary()
## <class 'statsmodels.iolib.summary.Summary'>
## """
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:              wait_days   R-squared:                       0.010
## Model:                            OLS   Adj. R-squared:                  0.010
## Method:                 Least Squares   F-statistic:                     126.4
## Date:                Tue, 14 Jan 2020   Prob (F-statistic):           3.49e-29
## Time:                        13:30:15   Log-Likelihood:                -50049.
## No. Observations:               12783   AIC:                         1.001e+05
## Df Residuals:                   12781   BIC:                         1.001e+05
## Df Model:                           1                                         
## Covariance Type:            nonrobust                                         
## ================================================================================
##                    coef    std err          t      P>|t|      [0.025      0.975]
## --------------------------------------------------------------------------------
## const            6.0386      0.247     24.489      0.000       5.555       6.522
## pct_minority     0.0396      0.004     11.242      0.000       0.033       0.047
## ==============================================================================
## Omnibus:                     6385.090   Durbin-Watson:                   1.411
## Prob(Omnibus):                  0.000   Jarque-Bera (JB):            37480.668
## Skew:                           2.405   Prob(JB):                         0.00
## Kurtosis:                       9.873   Cond. No.                         161.
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## """

The numbers are much smaller, but you might find them easier to deal with.

  • The output: 1 percentage point increase in minorities, an additional 0.04 days
  • multiplied by 10: 10 percentage point increase in minorities, an additional 0.4 days
  • multiplied by 25: 25 percentage point increase in minorities, an additional 1 day

You’re welcome to multiply, divide, or otherwise rescale your variables before you run the regression. Just think about what your final sentence might be and pick units that fit it.

Again, if you have two areas, one with a pct_minority of 37% and one with a pct_minority of 62%, you can expect pothole fixing to take an extra 1 day in the second area. And yes, this is not a 25% increase of 37% to 46%, this is an increase of 25 percentage points, 37% + 25% = 62%.
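
Notice that between the two regressions the pct_minority coefficient went from 3.9611 to 0.0396 while const stayed at 6.0386. Here is a minimal sketch of why, using a hand-written one-variable least squares fit on made-up numbers. The fit_line helper and the data are hypothetical and only stand in for what statsmodels does with the real data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical values, made up for illustration (not the pothole data)
share = [0.10, 0.35, 0.50, 0.80, 0.95]   # 0-1 scale
percent = [s * 100 for s in share]       # 0-100 scale
wait_days = [5.0, 9.0, 4.0, 12.0, 8.0]

const_share, coef_share = fit_line(share, wait_days)
const_percent, coef_percent = fit_line(percent, wait_days)

# Multiplying the predictor by 100 divides the slope by 100
print(coef_share / coef_percent)
# The intercept is the same on both scales
print(const_share, const_percent)
```

Rescaling changes the units of the coefficient, not the story: the fitted line and everything you can say about it are the same either way.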

2.6.2.3 Meaning of const

Under coef there’s another coefficient we’ve been ignoring named const.

                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const            6.0386      0.247     24.489      0.000       5.555       6.522
pct_minority     0.0396      0.004     11.242      0.000       0.033       0.047

The basic idea is that linear regression loves the number zero. By default, linear regression in statsmodels assumes that if you have a pct_minority of zero, wait_days will also be zero.

Since that’s not true at all, you always need to add in this constant. That’s what the weird .add_constant thing was when we were building our model:

X = df[['pct_minority']]
X = sm.add_constant(X)

It means “Hey, model! Zero pct_minority doesn’t mean zero wait_days. Thanks!” And as a result, the model comes up with a const of 6.0386 - the number of days you’ll wait if pct_minority is zero.
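
Put together, const and the pct_minority coefficient describe a straight line. As a rough sketch, you can plug the two numbers from the output above into that line yourself. The predicted_wait_days function is a made-up helper for illustration, and pct_minority is assumed to be on the 0-100 scale from the second regression.

```python
CONST = 6.0386  # const from the output above
COEF = 0.0396   # pct_minority coefficient, with pct_minority on a 0-100 scale

def predicted_wait_days(pct_minority):
    """Point on the fitted line: const + coef * pct_minority."""
    return CONST + COEF * pct_minority

print(predicted_wait_days(0))    # 6.0386
print(predicted_wait_days(37))
print(predicted_wait_days(62))
```

Keep in mind these are points on the fitted line, not promises about any single pothole: with an R-squared of 0.010, individual wait times vary a lot around it.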