Multivariable logistic regression and reference categories#
This is a continuation of the introduction to logistic regression. Once we've got the basics down, we can start to have some real fun. In this section we'll look at giving our regression multiple inputs, along with how to deal with categorical data.
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
Multivariable regression#
As a gentle reminder: we're lazy lovers of knitting and scarves, and trying to do statistical analyses to see which scarves we're bound to finish.
Last time we were looking at how the length of a scarf affects whether we complete a scarf or not. We decided on logistic regression because the output is a category (completed/not completed), and found out that every additional inch we're supposed to knit lowers our chance of finishing the scarf.
This time we've added a couple more columns of data, since there might be more at work than just scarf length.
df = pd.DataFrame([
{ 'length_in': 55, 'large_gauge': 1, 'color': 'orange', 'completed': 1 },
{ 'length_in': 55, 'large_gauge': 0, 'color': 'orange', 'completed': 1 },
{ 'length_in': 55, 'large_gauge': 0, 'color': 'brown', 'completed': 1 },
{ 'length_in': 60, 'large_gauge': 0, 'color': 'brown', 'completed': 1 },
{ 'length_in': 60, 'large_gauge': 0, 'color': 'grey', 'completed': 0 },
{ 'length_in': 70, 'large_gauge': 0, 'color': 'grey', 'completed': 1 },
{ 'length_in': 70, 'large_gauge': 0, 'color': 'orange', 'completed': 0 },
{ 'length_in': 82, 'large_gauge': 1, 'color': 'grey', 'completed': 1 },
{ 'length_in': 82, 'large_gauge': 0, 'color': 'brown', 'completed': 0 },
{ 'length_in': 82, 'large_gauge': 0, 'color': 'orange', 'completed': 0 },
{ 'length_in': 82, 'large_gauge': 1, 'color': 'brown', 'completed': 0 },
])
df
Our columns are as follows:
- length_in, the length of the potential scarf in inches
- large_gauge, whether we're using big fat needles to do our knitting (1=used large gauge needles)
- color, the color of our yarn
- completed, whether the scarf was completed or not (1=completed)
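Before we get to modeling, it can help to glance at how completion breaks down across the new columns. This is just an optional peek at the data, not part of the original analysis:
# Count completed vs. abandoned scarves for each needle gauge and each color
print(pd.crosstab(df.large_gauge, df.completed))
print(pd.crosstab(df.color, df.completed))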
Our original regression#
Previously we ran a regression relating the length of the scarf and whether the scarf was completed. Let's run that same regression again, just as a quick reminder.
model = smf.logit("completed ~ length_in", data=df)
results = model.fit()
results.summary()
And then we'll also check out the odds ratio for the sake of completeness.
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
Adding another variable#
This time we're going to add our new columns to the mix. We'll start by adding whether we used large-gauge needles when knitting the scarf. Large gauge needles typically leave large gaps between your stitches, allowing you to knit more area more quickly.
model = smf.logit("completed ~ length_in + large_gauge", data=df)
results = model.fit()
results.summary()
Easy-peasy. You'll see that we have a new row down in the features section, handcrafted just for our new large_gauge column! Right now we're specifically interested in the coefficient, which explains how using a large gauge knitting needle is related to our completion rate.
Since it's LOGistic regression, the coefficients are currently LOGarithms. To turn them into odds ratios we'll need to use np.exp to reverse the logarithm with an exponent.
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
Let's translate these odds ratios into human being language:
- For scarf length, our odds of completion drop about 17% for each inch we add (0.83x)
- If we use large gauge needles, though, our odds are increased 19x!
With an increase like that, it seems like large-gauge needles are a superpower! (if you've never used them: yes, they are.)
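If you'd rather see those percentage changes directly instead of doing the math in your head, you can subtract 1 from each odds ratio. A quick sketch using the coefficients we already have:
# Turn each coefficient into an odds ratio, then into a percent change in the odds:
# 0.83 becomes roughly -17% per inch, 19 becomes roughly a +1,800% jump for large gauge
odds_ratios = np.exp(results.params)
pd.DataFrame({
    'odds ratio': odds_ratios,
    'pct change in odds': (odds_ratios - 1) * 100
})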
When we talk about multivariable regression, this is the source of phrases like "everything else being equal," or "controlling for other variables." In this case, we're judging the performance of large gauge needles controlling for the length of a scarf. For every inch we add, we're seeing the effect of that inch, everything else being equal (if "everything else" is "are we using large gauge needles?").
When we look at the effect of a single feature, every other variable we include in the regression is being balanced out. But only the variables actually in the regression: we aren't controlling for the color of the scarf, or what month of the year it is, or whether we had a cold when we started it.
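One concrete way to see "everything else being equal" in action is to ask the model for predictions on two hypothetical scarves that are identical except for the needle size. This is just a sketch, and the 70-inch length is an arbitrary choice:
# Two imaginary 70-inch scarves that differ only in needle gauge
new_scarves = pd.DataFrame([
    { 'length_in': 70, 'large_gauge': 0 },
    { 'length_in': 70, 'large_gauge': 1 },
])
# Predicted probability of completion for each one
results.predict(new_scarves)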
Note: We're ignoring p-values for now; we'll address them when we get to the next chapter on model evaluation.
Categorical variables#
So far we've looked at two sorts of variables:
- length_in, a numeric value
- completed and large_gauge, both boolean categories (1/0, yes/no, true/false)
The other column we have here is color. It's unique in that it's a string, not a number, and as a result it gets special treatment. If we want to add color to our regression, we'll need to explicitly tell statsmodels that the column is a category.
model = smf.logit("completed ~ length_in + large_gauge + C(color)", data=df)
results = model.fit()
results.summary()
Instead of one new row for "color," we now have two new, very oddly named features: C(color)[T.grey] and C(color)[T.orange].
When you're dealing with categorical data, each value in the category gets broken out into a different feature. The logistic regression doesn't say "color has an effect like this on completion" - instead, it says "the color orange has a certain effect" and "the color grey has a certain effect" and so on.
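If you'd like to see what those broken-out columns look like, pandas can do a similar breakdown directly with get_dummies. This is just for illustration; statsmodels handles the encoding for us inside the formula:
# Each color becomes its own 0/1 column - roughly what C(color) builds behind the scenes
pd.get_dummies(df.color)
You'll notice pandas gives us a column for every color, while the regression output only shows two of the three - we'll see why in a moment.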
We can tell that grey helps things along and orange works against completion since grey's coefficient is positive and orange's is negative, but let's compute the odds ratios anyway.
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
Switching to grey gives us a 2.7x improvement in our odds, while orange penalizes our odds of completion by 0.64x.
But... wait a second - how many colors did we have?
df.color.value_counts()
We have results for grey and orange, but... where's our result for brown?
Reference categories#
In this case, our grey and orange odds ratios are in comparison to brown. Grey gives us a 2.7x improvement in our odds compared to using brown. Orange penalizes our odds of completion by 0.64x, compared to using brown.
This is called the reference category, and it will come up almost every time you have a categorical variable. The reference category should typically be the most common category, as you get to compare less common things to whatever is thought of as "normal." For some reason, though, statsmodels defaults to picking the first in alphabetical order.
Thanks for keeping it weird, statsmodels!
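If you wanted to follow the "most common category" advice, you could look up which color shows up most often in our data. This is just a quick check; we'll see how to hand that choice to statsmodels below:
# The most frequent color in our data - a natural candidate for the reference category
df.color.value_counts().idxmax()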
Setting the reference category in statsmodels#
Let's say orange is our favorite color, and we love love love to knit with it. If we're running our regression, it doesn't make sense to compare grey's completion rate to brown's completion rate: we want to know everything in comparison to orange!
Luckily, we can tell statsmodels exactly which value we want to have as our reference.
model = smf.logit("completed ~ length_in + large_gauge + C(color, Treatment('orange'))", data=df)
results = model.fit()
results.summary()
Despite our features being a little longer and uglier - C(color, Treatment('orange'))[T.brown] is a mouthful - we can now explain each color in reference to orange. Let's look at our updated odds ratios:
coefs = pd.DataFrame({
'coef': results.params.values,
'odds ratio': np.exp(results.params.values),
'name': results.params.index
})
coefs
Do we really love orange that much? Because if we used grey our odds of finishing would be four times better.
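That "four times" is consistent with the earlier regression, too: since grey and orange were both measured against brown, dividing their odds ratios gives grey relative to orange. A quick sanity check with the rounded numbers from above:
# Odds ratios relative to brown, from the earlier regression (rounded)
grey_vs_brown = 2.7
orange_vs_brown = 0.64
# Both share the same reference, so the ratio compares grey directly to orange
grey_vs_brown / orange_vs_brown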
While we've done a lot of work in figuring out how to build models and organize our features, we don't yet know if our model is any good. Our next step will be evaluating our models and our features to see whether our findings are accurate.
Review#
In this section, we learned how to perform multivariable logistic regression by adding additional features as our regression's input variables.
We also saw how to encode categorical variables when writing formula-style regressions. Along with using C() to convert a string column into a statsmodels-friendly category, we also learned how to use Treatment to choose the reference category.
Discussion topics#
TODO