# Regression snippets

Python data science coding reference from investigate.ai

## Linear Regression

### Using formulas

We're using statsmodels to perform our regression. They have a nice formula-based method that looks a lot like R.

Here we're using a basketball player's height and weight to predict the number of points scored, using ordinary least-squares regression. There's a lot more detail on formulas here.

```python
import statsmodels.formula.api as smf

# regression for points as relates to height and weight
model = smf.ols('points ~ height + weight', data=df)
results = model.fit()

results.summary()
```
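Once the model is fit, you don't have to read everything off the summary table: `results` exposes each piece individually. A minimal sketch, using a small made-up dataframe (the numbers are invented, just for demonstration):

```python
import pandas as pd
import statsmodels.formula.api as smf

# a tiny made-up dataset, just for demonstration
df = pd.DataFrame({
    'height': [72, 75, 78, 80, 74, 77],
    'weight': [180, 200, 220, 240, 190, 210],
    'points': [10, 14, 18, 22, 12, 16],
})

results = smf.ols('points ~ height + weight', data=df).fit()

# individual pieces of the fit, instead of the whole summary table
coefficients = results.params   # a pandas Series, indexed by term name
pvalues = results.pvalues       # p-value for each term
r_squared = results.rsquared    # overall goodness of fit
```

This is handy when you're fitting many models in a loop and only want to save a couple of numbers from each.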

### Using dataframes

To perform a regression using a whole dataframe, you need two variables:

- `X` - your features/inputs
- `y` - the outputs you're predicting

Drop the column you're predicting when creating `X`, and then save that column as `y`. You probably want `sm.add_constant` unless your output is zero when all your inputs are zero.

I almost never use the dataframe version; the formula version (see above) is much nicer.

In this case, `df` is a dataframe with columns `height`, `weight` and `points`. You're predicting `points` using `height` and `weight`.

```python
import statsmodels.api as sm

# predicting points using all other columns
X = df.drop('points', axis=1)
y = df.points

# add_constant is automatic in the formula version,
# but for dataframes we need to do it manually
X = sm.add_constant(X)

model = sm.OLS(y, X)
results = model.fit()

results.summary()
```

### Prediction residuals

The residual or error is the difference between the actual value and the predicted value. After you've fit a model, you can just pull the residual from `results.resid` and send it right back into your dataframe.

In this case you'll get the number of points above or below the predicted score. If someone scored 12 points but it was predicted that they'd score 14, the residual would be -2.

```python
import statsmodels.formula.api as smf

# regression for points as relates to height and weight
model = smf.ols('points ~ height + weight', data=df)
results = model.fit()

# store residual in new column of original dataframe
df['residual'] = results.resid
```
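With the residual stored as a column, sorting it surfaces the rows that most over- or under-performed their predictions. A sketch on made-up data, with one player who scores far above expectation (since the model has an intercept, the residuals should sum to roughly zero):

```python
import pandas as pd
import statsmodels.formula.api as smf

# made-up data, with one player who scores far above expectation
df = pd.DataFrame({
    'height': [72, 75, 78, 80, 74, 77],
    'weight': [180, 200, 220, 240, 190, 210],
    'points': [10, 14, 18, 40, 12, 16],
})

results = smf.ols('points ~ height + weight', data=df).fit()
df['residual'] = results.resid

# biggest over-performers first
overperformers = df.sort_values('residual', ascending=False)
```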

### Error standard deviation

Instead of judging the residual by its raw amount, it might be useful to see how many standard deviations away from the mean it is. That's useful for finding rows that performed much better or worse than predicted.

If someone scored 12 points but it was predicted that they'd score 14, the residual would be -2, but that might be only -0.5 standard deviations from the mean. If you're interested in finding cheaters, you might look for people with an error over 3 standard deviations.

```python
import numpy as np
import statsmodels.formula.api as smf

model = smf.ols('points ~ height + weight', data=df)
results = model.fit()

# store residual standard deviation in
# new column of original dataframe
df['error_std_dev'] = results.resid / np.sqrt(results.mse_resid)
```
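Once the standardized error is a column, flagging extreme rows is a single comparison. A sketch on made-up data, using the 3-standard-deviation threshold mentioned above (with only a handful of rows, nothing may actually clear it):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# made-up data, just for demonstration
df = pd.DataFrame({
    'height': [72, 75, 78, 80, 74, 77],
    'weight': [180, 200, 220, 240, 190, 210],
    'points': [10, 14, 18, 40, 12, 16],
})

results = smf.ols('points ~ height + weight', data=df).fit()
df['error_std_dev'] = results.resid / np.sqrt(results.mse_resid)

# rows that scored suspiciously far above their prediction
suspicious = df[df.error_std_dev > 3]
```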

## Logistic Regression

### Using formulas

Use an applicant's SAT score and GPA to predict acceptance to a college. There's a lot more detail on formulas here.

```python
import statsmodels.formula.api as smf

# regression for acceptance as relates to SAT score and gpa
model = smf.logit('acceptance ~ sat_score + gpa', data=df)
results = model.fit()

results.summary()
```
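After fitting, `results.predict` turns new rows into predicted probabilities of acceptance. A sketch with made-up admissions data (the column names of the new rows just need to match the formula):

```python
import pandas as pd
import statsmodels.formula.api as smf

# made-up admissions data, just for demonstration
df = pd.DataFrame({
    'sat_score': [1100, 1200, 1250, 1300, 1350, 1400, 1450, 1500],
    'gpa': [2.8, 3.4, 3.0, 3.6, 3.2, 2.9, 3.8, 3.5],
    'acceptance': [0, 0, 1, 0, 1, 0, 1, 1],
})

results = smf.logit('acceptance ~ sat_score + gpa', data=df).fit()

# predicted probability of acceptance for two new applicants
new_applicants = pd.DataFrame({'sat_score': [1250, 1480], 'gpa': [3.2, 3.9]})
probabilities = results.predict(new_applicants)
```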

### Using dataframes

To perform a regression using a whole dataframe, you drop the column you're predicting when creating `X` (your features), and then save that column as `y` (predicted value). You probably want `sm.add_constant` unless your output is zero when all your inputs are zero.

I've given up on the dataframe version in favor of the formula version; it's much nicer.

In this case, `df` is a dataframe with columns `sat_score`, `gpa` and `acceptance`. You're predicting `acceptance` using `sat_score` and `gpa`.

```python
import statsmodels.api as sm

# predicting acceptance using all other columns
X = df.drop('acceptance', axis=1)
y = df.acceptance

# add_constant is automatic in the formula version,
# but for dataframes we need to do it manually
X = sm.add_constant(X)

model = sm.Logit(y, X)
results = model.fit()

results.summary()
```

### Odds ratios

We'll convert the log odds ratio to an odds ratio using `np.exp`, and carry along the p-value because we're feeling extravagant.

```python
import numpy as np
import pandas as pd

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues.values,
    'name': results.params.index
})
coefs
```
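Reading an odds ratio: subtract 1 and multiply by 100 to get the percent change in the odds for each one-unit increase in the input. A quick illustration with a made-up coefficient:

```python
import numpy as np

# hypothetical log odds ratio, as pulled from a fitted model's results.params
log_odds = 0.4

odds_ratio = np.exp(log_odds)        # ~1.49
pct_change = (odds_ratio - 1) * 100  # ~49% increase in the odds
```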

## Formula tricks

### Using categories

Find what role height, weight, and position play in determining how many points a player will score. The `position` column will be treated as a category, with each position being judged separately.

```python
import statsmodels.formula.api as smf

model = smf.ols(
    'points ~ height + weight + C(position)',
    data=df)

results = model.fit()
results.summary()
```

### Using categories + reference

The `position` column will be treated as a category, with each position being judged separately. By providing a reference category, each other position will be judged in relation to that category. For example, "Centers score twice as many points as point guards."

```python
import statsmodels.formula.api as smf

model = smf.ols(
    'points ~ height + weight + C(position, Treatment("Point Guard"))',
    data=df)

results = model.fit()
results.summary()
```

### Math calculations

If `height` were in inches and we'd prefer to see feet in our regression, we can use `np.divide` to divide height by 12. No need to create a new column!

We can also convert weight to be in increments of 10 pounds instead of single pounds. This is useful in situations like using median income as an input: a single dollar probably has little effect, but you can easily `np.divide(income, 10000)` to look at \$10,000 increases.

```python
import numpy as np
import statsmodels.formula.api as smf

model = smf.ols(
    'points ~ np.divide(height, 12) + np.divide(weight, 10)',
    data=df)

results = model.fit()
results.summary()
```
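Rescaling inside the formula only rescales the coefficient: measured per foot, the height coefficient comes out exactly 12 times the per-inch value, and the fit is otherwise identical. A quick check on made-up data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# made-up data, just for demonstration
df = pd.DataFrame({
    'height': [72, 75, 78, 80, 74, 77],
    'weight': [180, 200, 220, 240, 190, 210],
    'points': [10, 14, 18, 22, 12, 16],
})

per_inch = smf.ols('points ~ height + weight', data=df).fit()
per_foot = smf.ols('points ~ np.divide(height, 12) + weight', data=df).fit()

# same model, but the height coefficient is 12x larger
per_inch.params['height'] * 12
per_foot.params['np.divide(height, 12)']
```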