Logistic Regression Quickstart#

Already know what's what with logistic regression, and just need to know how to tackle it in Python? We're here for you! If not, continue on to the next section.

We're going to ignore the nuance of what we're doing in this notebook; it's really just for people who need to see the process.

Pandas for our data#

As is typical, we'll be using pandas dataframes for the data.

import pandas as pd
import numpy as np

df = pd.DataFrame([
    { 'length_in': 55, 'completed': 1 },
    { 'length_in': 55, 'completed': 1 },
    { 'length_in': 55, 'completed': 1 },
    { 'length_in': 60, 'completed': 1 },
    { 'length_in': 60, 'completed': 0 },
    { 'length_in': 70, 'completed': 1 },
    { 'length_in': 70, 'completed': 0 },
    { 'length_in': 82, 'completed': 1 },
    { 'length_in': 82, 'completed': 0 },
    { 'length_in': 82, 'completed': 0 },
    { 'length_in': 82, 'completed': 0 },
])
df
length_in completed
0 55 1
1 55 1
2 55 1
3 60 1
4 60 0
5 70 1
6 70 0
7 82 1
8 82 0
9 82 0
10 82 0
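Before modeling, it can help to eyeball how completion varies with length. A quick cross-tab of the same data (a sketch; the modeling below doesn't depend on it) shows the longer scarves getting abandoned more often:

```python
import pandas as pd

df = pd.DataFrame({
    'length_in': [55, 55, 55, 60, 60, 70, 70, 82, 82, 82, 82],
    'completed': [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Rows are scarf lengths, columns are completed (0/1),
# cells are counts of scarves
ct = pd.crosstab(df.length_in, df.completed)
ct
```

All three 55-inch scarves were finished, while three of the four 82-inch ones were abandoned.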

Performing a regression#

The statsmodels package is your best friend when it comes to regression. In theory you can do it using other techniques or libraries, but statsmodels is just so simple.

For the regression below, I'm using the formula method of describing the regression. If that makes you grumpy, check the regression reference page for more details.

import statsmodels.formula.api as smf

model = smf.logit("completed ~ length_in", data=df)
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.531806
         Iterations 5
                           Logit Regression Results
==============================================================================
Dep. Variable:              completed   No. Observations:                   11
Model:                          Logit   Df Residuals:                        9
Method:                           MLE   Df Model:                            1
Date:                Tue, 10 Dec 2019   Pseudo R-squ.:                  0.2282
Time:                        18:14:57   Log-Likelihood:                -5.8499
converged:                       True   LL-Null:                       -7.5791
Covariance Type:            nonrobust   LLR p-value:                   0.06293
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      7.8531      4.736      1.658      0.097      -1.429      17.135
length_in     -0.1112      0.067     -1.649      0.099      -0.243       0.021
==============================================================================

Converting coefficient to odds#

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'name': results.params.index
})
coefs
coef odds ratio pvalue name
Intercept 7.853131 2573.780516 0.097279 Intercept
length_in -0.111171 0.894786 0.099062 length_in

For each additional inch I add to a scarf, my odds of finishing are about 89% of what they were before (a.k.a. lowered by about 11%).
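Because odds ratios multiply, the effect of several inches is the per-inch odds ratio raised to that power. A small sketch using the length_in coefficient copied from the table above:

```python
import numpy as np

coef = -0.111171           # length_in coefficient from the fit above
odds_ratio = np.exp(coef)  # per-inch odds ratio, roughly 0.89

# Odds ratios multiply, so ten extra inches is the per-inch
# ratio to the tenth power: roughly 0.33, i.e. the odds of
# finishing drop to about a third
ten_inch_ratio = odds_ratio ** 10
ten_inch_ratio
```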

Making predictions#

X_unknown = pd.DataFrame([
    { 'length_in': 20 },
    { 'length_in': 55 },
    { 'length_in': 80 },
    { 'length_in': 100 }
])

X_unknown['prediction'] = results.predict(X_unknown)
X_unknown
length_in prediction
0 20 0.996423
1 55 0.850526
2 80 0.261047
3 100 0.036829
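Note that .predict gives you probabilities, not yes/no answers. If you want a hard finished-or-not call, you can threshold the probabilities yourself; 0.5 here is just an illustrative cutoff, not anything built into statsmodels:

```python
import pandas as pd

# Predicted probabilities copied from results.predict() above
X_unknown = pd.DataFrame({
    'length_in': [20, 55, 80, 100],
    'prediction': [0.996423, 0.850526, 0.261047, 0.036829],
})

# Probabilities at or above 0.5 become a predicted "completed"
X_unknown['predicted_class'] = (X_unknown['prediction'] >= 0.5).astype(int)
X_unknown
```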

Multivariable regression#

Multivariable regression is easy-peasy. We're going to add the size of our needles to our dataset. Larger needles make work go faster, so lazy people like me are more likely to finish.

df = pd.DataFrame([
    { 'length_in': 55, 'large_gauge': 1, 'completed': 1 },
    { 'length_in': 55, 'large_gauge': 0, 'completed': 1 },
    { 'length_in': 55, 'large_gauge': 0, 'completed': 1 },
    { 'length_in': 60, 'large_gauge': 0, 'completed': 1 },
    { 'length_in': 60, 'large_gauge': 0, 'completed': 0 },
    { 'length_in': 70, 'large_gauge': 0, 'completed': 1 },
    { 'length_in': 70, 'large_gauge': 0, 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 1, 'completed': 1 },
    { 'length_in': 82, 'large_gauge': 0, 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 0, 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 1, 'completed': 0 },
])
df
length_in large_gauge completed
0 55 1 1
1 55 0 1
2 55 0 1
3 60 0 1
4 60 0 0
5 70 0 1
6 70 0 0
7 82 1 1
8 82 0 0
9 82 0 0
10 82 1 0
model = smf.logit("completed ~ length_in + large_gauge", data=df)
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.449028
         Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:              completed   No. Observations:                   11
Model:                          Logit   Df Residuals:                        8
Method:                           MLE   Df Model:                            2
Date:                Tue, 10 Dec 2019   Pseudo R-squ.:                  0.3483
Time:                        18:15:05   Log-Likelihood:                -4.9393
converged:                       True   LL-Null:                       -7.5791
Covariance Type:            nonrobust   LLR p-value:                   0.07138
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      12.0850      7.615      1.587      0.113      -2.840      27.010
length_in      -0.1833      0.117     -1.573      0.116      -0.412       0.045
large_gauge     2.9609      2.589      1.144      0.253      -2.113       8.035
===============================================================================

Converting coefficient to odds ratio#

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'pvalue': results.pvalues,
    'name': results.params.index
})
coefs
coef odds ratio pvalue name
Intercept 12.085035 177200.102739 0.112516 Intercept
length_in -0.183318 0.832504 0.115759 length_in
large_gauge 2.960890 19.315158 0.252771 large_gauge

Using large gauge needles multiplies your odds of finishing a project by about 19! (Mind the p-value of 0.25, though: with only eleven scarves we can't be very confident in that number.)

import math

# Switching from small to large gauge needles
# is equivalent to removing how many inches?
# log base (1 / per-inch odds ratio) of the large_gauge odds ratio
math.log(19.315158, 1 / 0.832504)
# about 16 inches
X_unknown = pd.DataFrame([
    { 'length_in': 60, 'large_gauge': 1 },
    { 'length_in': 60, 'large_gauge': 0 },
    { 'length_in': 70, 'large_gauge': 1 },
    { 'length_in': 70, 'large_gauge': 0 },
])

X_unknown['prediction'] = results.predict(X_unknown)
X_unknown
length_in large_gauge prediction
0 60 1 0.982823
1 60 0 0.747624
2 70 1 0.901472
3 70 0 0.321432
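As a sanity check, the large_gauge odds ratio can be recovered from these predictions: at a fixed length, the odds with large needles divided by the odds without should land near the ~19 from the coefficients table. A sketch using the 60-inch predictions copied from above:

```python
# Predicted probabilities at 60 inches, from the table above
p_large = 0.982823   # large_gauge = 1
p_small = 0.747624   # large_gauge = 0

# Convert each probability to odds, then take the ratio
odds_large = p_large / (1 - p_large)
odds_small = p_small / (1 - p_small)
ratio = odds_large / odds_small
ratio   # roughly 19.3, matching exp(2.9609)
```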

There you go!

If you'd like more details, you can continue on in this section. If you'd just like the how-to-do-an-exact-thing explanations, check out the regression reference page.