Multivariable logistic regression and reference categories#

This is a continuation of the introduction to logistic regression. Once we've got the basics down, we can start to have some real fun. In this section we'll examine having multiple inputs to our regression, along with dealing with categorical data.

import pandas as pd
import statsmodels.formula.api as smf
import numpy as np

Multivariable regression#

As a gentle reminder: we're lazy lovers of knitting and scarves, and we're using statistical analysis to figure out which scarves we're bound to finish.

Last time we looked at how the length of a scarf affects whether we complete it or not. We decided on logistic regression because the output is a category (completed/not completed), and found that every additional inch we're supposed to knit lowers our chance of finishing the scarf.

This time we've added a couple more columns of data, since there might be more at work than just scarf length.

df = pd.DataFrame([
    { 'length_in': 55, 'large_gauge': 1, 'color': 'orange', 'completed': 1 },
    { 'length_in': 55, 'large_gauge': 0, 'color': 'orange', 'completed': 1 },
    { 'length_in': 55, 'large_gauge': 0, 'color': 'brown', 'completed': 1 },
    { 'length_in': 60, 'large_gauge': 0, 'color': 'brown', 'completed': 1 },
    { 'length_in': 60, 'large_gauge': 0, 'color': 'grey', 'completed': 0 },
    { 'length_in': 70, 'large_gauge': 0, 'color': 'grey', 'completed': 1 },
    { 'length_in': 70, 'large_gauge': 0, 'color': 'orange', 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 1, 'color': 'grey', 'completed': 1 },
    { 'length_in': 82, 'large_gauge': 0, 'color': 'brown', 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 0, 'color': 'orange', 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 1, 'color': 'brown', 'completed': 0 },
])
df
length_in large_gauge color completed
0 55 1 orange 1
1 55 0 orange 1
2 55 0 brown 1
3 60 0 brown 1
4 60 0 grey 0
5 70 0 grey 1
6 70 0 orange 0
7 82 1 grey 1
8 82 0 brown 0
9 82 0 orange 0
10 82 1 brown 0

Our columns are as follows:

  • length_in, the length of the potential scarf in inches
  • large_gauge, whether we're using big fat needles to do our knitting (1=used large gauge needles)
  • color, the color of our yarn
  • completed, whether the scarf was completed or not (1=completed)
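Before any regression, a quick gut check doesn't hurt. As a rough sketch (no model yet, just raw averages), we can group by needle gauge and average the completed column:

```python
import pandas as pd

# Same scarf data as above, in column form
df = pd.DataFrame({
    'length_in':   [55, 55, 55, 60, 60, 70, 70, 82, 82, 82, 82],
    'large_gauge': [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    'color': ['orange', 'orange', 'brown', 'brown', 'grey', 'grey',
              'orange', 'grey', 'brown', 'orange', 'brown'],
    'completed':   [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Raw completion rate for each needle gauge - plain averages, no controls
rates = df.groupby('large_gauge')['completed'].mean()
print(rates)
```

Large-gauge scarves get finished a bit more often in the raw averages, but this ignores scarf length entirely, which is exactly the kind of tangle the regressions in this section are built to undo.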

Our original regression#

Previously we ran a regression relating the length of the scarf and whether the scarf was completed. Let's run that same regression again, just as a quick reminder.

model = smf.logit("completed ~ length_in", data=df)
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.531806
         Iterations 5
Logit Regression Results
Dep. Variable: completed No. Observations: 11
Model: Logit Df Residuals: 9
Method: MLE Df Model: 1
Date: Mon, 30 Dec 2019 Pseudo R-squ.: 0.2282
Time: 13:38:21 Log-Likelihood: -5.8499
converged: True LL-Null: -7.5791
Covariance Type: nonrobust LLR p-value: 0.06293
coef std err z P>|z| [0.025 0.975]
Intercept 7.8531 4.736 1.658 0.097 -1.429 17.135
length_in -0.1112 0.067 -1.649 0.099 -0.243 0.021

And then we'll also check out the odds ratio for the sake of completeness.

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'name': results.params.index
})
coefs
coef odds ratio name
0 7.853131 2573.780516 Intercept
1 -0.111171 0.894786 length_in
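One thing worth remembering about odds ratios: they compound multiplicatively, not additively. As a quick sketch using the length_in coefficient from the summary above, ten extra inches multiply our odds by 0.89 ten times over:

```python
import numpy as np

length_coef = -0.111171          # length_in coefficient from the summary above
per_inch = np.exp(length_coef)   # odds ratio for one extra inch, about 0.89

# Each inch multiplies the odds, so ten inches is the ratio to the tenth power
ten_inches = per_inch ** 10
print(per_inch, ten_inches)
```

Ten more inches cut our odds to roughly a third of what they were, not "ten times a 10% drop."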

Adding another variable#

This time we're going to add our new columns to the mix. We'll start by adding whether we used large-gauge needles when knitting the scarf. Large gauge needles typically leave large gaps between your stitches, allowing you to knit more area more quickly.

model = smf.logit("completed ~ length_in + large_gauge", data=df)
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.449028
         Iterations 7
Logit Regression Results
Dep. Variable: completed No. Observations: 11
Model: Logit Df Residuals: 8
Method: MLE Df Model: 2
Date: Mon, 30 Dec 2019 Pseudo R-squ.: 0.3483
Time: 13:38:28 Log-Likelihood: -4.9393
converged: True LL-Null: -7.5791
Covariance Type: nonrobust LLR p-value: 0.07138
coef std err z P>|z| [0.025 0.975]
Intercept 12.0850 7.615 1.587 0.113 -2.840 27.010
length_in -0.1833 0.117 -1.573 0.116 -0.412 0.045
large_gauge 2.9609 2.589 1.144 0.253 -2.113 8.035

Easy-peasy. You'll see that we have a new row down in the coefficients section, handcrafted just for our new large_gauge column! Right now we're specifically interested in the coefficient, which explains how using a large gauge knitting needle is related to our completion rate.

Since it's LOGistic regression, the coefficients are currently LOGarithms (log odds, specifically). To turn them into odds ratios we'll need to use np.exp to reverse the logarithm with an exponent.

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'name': results.params.index
})
coefs
coef odds ratio name
0 12.085035 177200.102739 Intercept
1 -0.183318 0.832504 length_in
2 2.960890 19.315158 large_gauge

Let's translate these odds ratios into human being language:

  • For scarf length, our odds of completion drop about 17% for each inch we add (0.83x)
  • If we use large gauge needles, though, our odds are increased 19x!

With an increase like that, it seems like large-gauge needles are a superpower! (if you've never used them: yes, they are.)
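If a 19x odds boost is hard to picture, we can translate it into probabilities by refitting the model and predicting two hypothetical scarves - same 70-inch length, different needles. This is just a sketch using results.predict; the exact numbers fall out of the fit above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same scarf data as above, in column form
df = pd.DataFrame({
    'length_in':   [55, 55, 55, 60, 60, 70, 70, 82, 82, 82, 82],
    'large_gauge': [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    'color': ['orange', 'orange', 'brown', 'brown', 'grey', 'grey',
              'orange', 'grey', 'brown', 'orange', 'brown'],
    'completed':   [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

model = smf.logit("completed ~ length_in + large_gauge", data=df)
results = model.fit(disp=0)  # disp=0 silences the convergence printout

# Two hypothetical 70-inch scarves: one small gauge, one large gauge
scenarios = pd.DataFrame({'length_in': [70, 70], 'large_gauge': [0, 1]})
preds = results.predict(scenarios)
print(preds)
```

For a 70-inch scarf, large-gauge needles take us from roughly a one-in-three chance of finishing to around nine-in-ten. That's what a 19x odds ratio looks like in probability terms.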

When we talk about multivariable regression, this is the source of phrases like "everything else being equal," or "controlling for other variables." In this case, we're judging the performance of large gauge needles while controlling for the length of the scarf. For every inch we add, we're seeing the effect of that inch, everything else being equal (where "everything else" here means "are we using large gauge needles?").

When we look at the effect of a single feature, every other variable we include in the regression is being balanced out. But only those variables: we aren't controlling for the color of the scarf, or what month of the year it is, or whether we had a cold when we started it.

Note: We're ignoring p values for now; we'll address them in the next chapter on model evaluation.

Categorical variables#

So far we've looked at two sorts of variables:

  • length_in, a numeric variable
  • completed and large_gauge, both boolean variables (1/0, yes/no, true/false)

The other column we have here is color. It is unique in that it's a string, not a number. As a result, it gets special treatment.

If we want to add color to our regression, we'll need to explicitly tell statsmodels that the column is a category.

model = smf.logit("completed ~ length_in + large_gauge + C(color)", data=df)
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.424906
         Iterations 7
Logit Regression Results
Dep. Variable: completed No. Observations: 11
Model: Logit Df Residuals: 6
Method: MLE Df Model: 4
Date: Fri, 13 Dec 2019 Pseudo R-squ.: 0.3833
Time: 19:08:11 Log-Likelihood: -4.6740
converged: True LL-Null: -7.5791
Covariance Type: nonrobust LLR p-value: 0.2138
coef std err z P>|z| [0.025 0.975]
Intercept 12.5839 7.995 1.574 0.115 -3.086 28.254
C(color)[T.grey] 1.0113 1.906 0.531 0.596 -2.725 4.748
C(color)[T.orange] -0.4594 2.257 -0.204 0.839 -4.884 3.965
length_in -0.1944 0.126 -1.540 0.124 -0.442 0.053
large_gauge 2.8814 2.845 1.013 0.311 -2.694 8.457

Instead of one new row for "color," we now have two new, very oddly named features: C(color)[T.grey] and C(color)[T.orange].

When you're dealing with categorical data, each value in the category gets broken out into its own feature. The logistic regression doesn't say "color has an effect like this on completion" - instead, it says "the color orange has a certain effect" and "the color grey has a certain effect" and so on.

We can tell that grey helps things along and orange works against completion since grey's coefficient is positive and orange's is negative, but let's compute the odds ratios anyway.

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'name': results.params.index
})
coefs
coef odds ratio name
0 12.583941 291834.271157 Intercept
1 1.011347 2.749302 C(color)[T.grey]
2 -0.459412 0.631655 C(color)[T.orange]
3 -0.194425 0.823308 length_in
4 2.881375 17.838786 large_gauge

Switching to grey gives us a 2.7x improvement in our odds, while orange penalizes our odds of completion by 0.63x.

But... wait a second - how many colors did we have?

df.color.value_counts()
orange    4
brown     4
grey      3
Name: color, dtype: int64

We have results for grey and orange, but... where's our result for brown?

Reference categories#

In this case, our grey and orange odds ratios are in comparison to brown. Grey gives us a 2.7x improvement in our odds compared to using brown. Orange penalizes our odds of completion by 0.63x, compared to using brown.

This is called the reference category, and it will come up almost every time you have a categorical variable. The reference category should typically be the most common category, as you get to compare less common things to whatever is thought of as "normal." For some reason, though, statsmodels defaults to picking the first in alphabetical order.

Thanks for keeping it weird, statsmodels!
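If you want to see that alphabetical-first convention in action without fitting anything, pandas offers a rough parallel (statsmodels actually uses patsy for this, not get_dummies): pd.get_dummies also orders values alphabetically, and its drop_first option drops the first one as the implicit baseline - brown, in our case.

```python
import pandas as pd

# The color column from our scarf data
colors = pd.Series(['orange', 'orange', 'brown', 'brown', 'grey', 'grey',
                    'orange', 'grey', 'brown', 'orange', 'brown'])

# One 0/1 indicator per color, alphabetically ordered; drop_first=True
# removes 'brown', which becomes the reference category
dummies = pd.get_dummies(colors, drop_first=True)
print(dummies.columns.tolist())
```

Only grey and orange get columns, just like only grey and orange got rows in the regression summary: brown's effect is folded into the intercept.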

Setting the reference category in statsmodels#

Let's say orange is our favorite color, and we love love love to knit with it. If we're running our regression, it doesn't make sense to compare grey's completion rate to brown's completion rate: I want to know everything in comparison to orange!

Luckily, we can tell statsmodels exactly which value we want to have as our reference.

model = smf.logit("completed ~ length_in + large_gauge + C(color, Treatment('orange'))", data=df)
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.424906
         Iterations 7
Logit Regression Results
Dep. Variable: completed No. Observations: 11
Model: Logit Df Residuals: 6
Method: MLE Df Model: 4
Date: Mon, 30 Dec 2019 Pseudo R-squ.: 0.3833
Time: 13:56:44 Log-Likelihood: -4.6740
converged: True LL-Null: -7.5791
Covariance Type: nonrobust LLR p-value: 0.2138
coef std err z P>|z| [0.025 0.975]
Intercept 12.1245 8.094 1.498 0.134 -3.740 27.989
C(color, Treatment('orange'))[T.brown] 0.4594 2.257 0.204 0.839 -3.965 4.884
C(color, Treatment('orange'))[T.grey] 1.4708 2.289 0.643 0.520 -3.015 5.957
length_in -0.1944 0.126 -1.540 0.124 -0.442 0.053
large_gauge 2.8814 2.845 1.013 0.311 -2.694 8.457

Despite our features being a little longer and uglier - C(color, Treatment('orange'))[T.brown] is a mouthful - we can now explain each color in reference to orange. Let's look at our updated odds ratios:

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'name': results.params.index
})
coefs
coef odds ratio name
0 12.124529 184338.541216 Intercept
1 0.459412 1.583143 C(color, Treatment('orange'))[T.brown]
2 1.470759 4.352538 C(color, Treatment('orange'))[T.grey]
3 -0.194425 0.823308 length_in
4 2.881375 17.838786 large_gauge

Do we really love orange that much? Because if we used grey, our odds of finishing would be over four times better.

While we've done a lot of work figuring out how to build models and organize our features, we don't yet know if our model is any good. Our next step will be evaluating our models and our features to see whether our findings are accurate.

Review#

In this section, we learned how to perform multivariable logistic regression by adding additional features as our regression's input variables.

We also saw how to encode categorical variables when writing formula-style regression. Along with using C() to convert a string to a statsmodels-friendly category, we also learned how to use Treatment to create reference categories.

Discussion topics#

TODO