Logistic Regression Quickstart#

Already know what's what with logistic regression, and just need to know how to tackle it in Python? We're here for you! If not, continue on to the next section.

We're going to ignore the nuance of what we're doing in this notebook; it's really just for people who need to see the process.

Pandas for our data#

As is typical, we'll be using pandas dataframes for the data.

import pandas as pd
import numpy as np

df = pd.DataFrame([
    { 'length_in': 55, 'completed': 1 },
    { 'length_in': 55, 'completed': 1 },
    { 'length_in': 55, 'completed': 1 },
    { 'length_in': 60, 'completed': 1 },
    { 'length_in': 60, 'completed': 0 },
    { 'length_in': 70, 'completed': 1 },
    { 'length_in': 70, 'completed': 0 },
    { 'length_in': 82, 'completed': 1 },
    { 'length_in': 82, 'completed': 0 },
    { 'length_in': 82, 'completed': 0 },
    { 'length_in': 82, 'completed': 0 },
])
df
length_in completed
0 55 1
1 55 1
2 55 1
3 60 1
4 60 0
5 70 1
6 70 0
7 82 1
8 82 0
9 82 0
10 82 0
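Before modeling, it can help to eyeball how completion varies with length. A quick cross-tab of the same data (a sketch; the modeling below doesn't depend on it) shows the longer scarves getting abandoned more often:

```python
import pandas as pd

df = pd.DataFrame({
    'length_in': [55, 55, 55, 60, 60, 70, 70, 82, 82, 82, 82],
    'completed': [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Rows are scarf lengths, columns are completed (0/1),
# cells are counts of scarves
ct = pd.crosstab(df.length_in, df.completed)
ct
```

All three 55-inch scarves were finished, while three of the four 82-inch ones were abandoned.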

Performing a regression#

The statsmodels package is your best friend when it comes to regression. In theory you can do it using other techniques or libraries, but statsmodels is just so simple.

For the regression below, I'm using the formula method of describing the regression. If that makes you grumpy, check the regression reference page for more details.

import statsmodels.formula.api as smf

model = smf.logit("completed ~ length_in", data=df)
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.531806
         Iterations 5
                           Logit Regression Results
==============================================================================
Dep. Variable:              completed   No. Observations:                   11
Model:                          Logit   Df Residuals:                        9
Method:                           MLE   Df Model:                            1
Date:                Tue, 10 Dec 2019   Pseudo R-squ.:                  0.2282
Time:                        18:14:57   Log-Likelihood:                -5.8499
converged:                       True   LL-Null:                       -7.5791
Covariance Type:            nonrobust   LLR p-value:                   0.06293
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.8531      4.736      1.658      0.097      -1.429      17.135
length_in     -0.1112      0.067     -1.649      0.099      -0.243       0.021
==============================================================================

Converting coefficient to odds#

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'name': results.params.index
})
coefs
coef odds ratio pvalue name
Intercept 7.853131 2573.780516 0.097279 Intercept
length_in -0.111171 0.894786 0.099062 length_in

For each additional inch I add to a scarf, my odds of finishing are about 89% of what they were before (a.k.a. lowered by about 11%).
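Because odds ratios multiply, the effect of several inches is the per-inch odds ratio raised to that power. A small sketch using the length_in coefficient copied from the table above:

```python
import numpy as np

coef = -0.111171           # length_in coefficient from the fit above
odds_ratio = np.exp(coef)  # per-inch odds ratio, roughly 0.89

# Odds ratios multiply, so ten extra inches is the per-inch
# ratio to the tenth power: roughly 0.33, i.e. the odds of
# finishing drop to about a third
ten_inch_ratio = odds_ratio ** 10
ten_inch_ratio
```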

Making predictions#

X_unknown = pd.DataFrame([
    { 'length_in': 20 },
    { 'length_in': 55 },
    { 'length_in': 80 },
    { 'length_in': 100 }
])

X_unknown['prediction'] = results.predict(X_unknown)
X_unknown
length_in prediction
0 20 0.996423
1 55 0.850526
2 80 0.261047
3 100 0.036829
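Note that .predict gives you probabilities, not yes/no answers. If you want a hard finished-or-not call, you can threshold the probabilities yourself; 0.5 here is just an illustrative cutoff, not anything built into statsmodels:

```python
import pandas as pd

# Predicted probabilities copied from results.predict() above
X_unknown = pd.DataFrame({
    'length_in': [20, 55, 80, 100],
    'prediction': [0.996423, 0.850526, 0.261047, 0.036829],
})

# Probabilities at or above 0.5 become a predicted "completed"
X_unknown['predicted_class'] = (X_unknown['prediction'] >= 0.5).astype(int)
X_unknown
```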

Multivariable regression#

Multivariable regression is easy-peasy. We're going to add the size of our needles to our dataset. Larger needles make work go faster, so lazy people like me are more likely to finish.

df = pd.DataFrame([
    { 'length_in': 55, 'large_gauge': 1, 'completed': 1 },
    { 'length_in': 55, 'large_gauge': 0, 'completed': 1 },
    { 'length_in': 55, 'large_gauge': 0, 'completed': 1 },
    { 'length_in': 60, 'large_gauge': 0, 'completed': 1 },
    { 'length_in': 60, 'large_gauge': 0, 'completed': 0 },
    { 'length_in': 70, 'large_gauge': 0, 'completed': 1 },
    { 'length_in': 70, 'large_gauge': 0, 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 1, 'completed': 1 },
    { 'length_in': 82, 'large_gauge': 0, 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 0, 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 1, 'completed': 0 },
])
df
length_in large_gauge completed
0 55 1 1
1 55 0 1
2 55 0 1
3 60 0 1
4 60 0 0
5 70 0 1
6 70 0 0
7 82 1 1
8 82 0 0
9 82 0 0
10 82 1 0
model = smf.logit("completed ~ length_in + large_gauge", data=df)
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.449028
         Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:              completed   No. Observations:                   11
Model:                          Logit   Df Residuals:                        8
Method:                           MLE   Df Model:                            2
Date:                Tue, 10 Dec 2019   Pseudo R-squ.:                  0.3483
Time:                        18:15:05   Log-Likelihood:                -4.9393
converged:                       True   LL-Null:                       -7.5791
Covariance Type:            nonrobust   LLR p-value:                   0.07138
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      12.0850      7.615      1.587      0.113      -2.840      27.010
length_in      -0.1833      0.117     -1.573      0.116      -0.412       0.045
large_gauge     2.9609      2.589      1.144      0.253      -2.113       8.035
===============================================================================

Converting coefficient to odds ratio#

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'name': results.params.index
})
coefs
coef odds ratio pvalue name
Intercept 12.085035 177200.102739 0.112516 Intercept
length_in -0.183318 0.832504 0.115759 length_in
large_gauge 2.960890 19.315158 0.252771 large_gauge

Using large gauge needles multiplies your odds of finishing a project by about 19! (Mind the p-value of 0.25, though: with only eleven scarves we can't be very confident in that number.)

import math

# Switching from small to large gauge needles
# is equivalent to removing how many inches?
# log base (1 / per-inch odds ratio) of the large_gauge odds ratio
math.log(19.315158, 1 / 0.832504)
# about 16 inches
X_unknown = pd.DataFrame([
    { 'length_in': 60, 'large_gauge': 1 },
    { 'length_in': 60, 'large_gauge': 0 },
    { 'length_in': 70, 'large_gauge': 1 },
    { 'length_in': 70, 'large_gauge': 0 },
])

X_unknown['prediction'] = results.predict(X_unknown)
X_unknown
length_in large_gauge prediction
0 60 1 0.982823
1 60 0 0.747624
2 70 1 0.901472
3 70 0 0.321432
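As a sanity check, the large_gauge odds ratio can be recovered from these predictions: at a fixed length, the odds with large needles divided by the odds without should land near the ~19 from the coefficients table. A sketch using the 60-inch predictions copied from above:

```python
# Predicted probabilities at 60 inches, from the table above
p_large = 0.982823   # large_gauge = 1
p_small = 0.747624   # large_gauge = 0

# Convert each probability to odds, then take the ratio
odds_large = p_large / (1 - p_large)
odds_small = p_small / (1 - p_small)
ratio = odds_large / odds_small
ratio   # roughly 19.3, matching exp(2.9609)
```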

There you go!

If you'd like more details, you can continue on in this section. If you'd just like the how-to-do-an-exact-thing explanations, check out the regression reference page.