Linear Regression Quickstart#

Already know how linear regression works and just need to see how to run one in Python? We're here for you! If not, continue on to the next section.

We're going to skip the nuance of what we're doing in this notebook; it's really just for people who need to see the process.

Pandas for our data#

As is typical, we'll be using pandas dataframes for the data.

import pandas as pd

df = pd.DataFrame([
    { 'sold': 0, 'revenue': 0 },
    { 'sold': 4, 'revenue': 8 },
    { 'sold': 16, 'revenue': 32 },
])
df
   sold  revenue
0     0        0
1     4        8
2    16       32

Performing a regression#

The statsmodels package is your best friend when it comes to regression. In theory you can do it using other techniques or libraries, but statsmodels is just so simple.

For the regression below, I'm using the formula method of describing the regression. If that makes you grumpy, check the regression reference page for more details.

import statsmodels.formula.api as smf

model = smf.ols("revenue ~ sold", data=df)
results = model.fit()
results.summary()
OLS Regression Results
Dep. Variable:     revenue           R-squared:           1.000
Model:             OLS               Adj. R-squared:      1.000
Method:            Least Squares     F-statistic:         9.502e+30
Date:              Sun, 08 Dec 2019  Prob (F-statistic):  2.07e-16
Time:              10:14:18          Log-Likelihood:      94.907
No. Observations:  3                 AIC:                 -185.8
Df Residuals:      1                 BIC:                 -187.6
Df Model:          1
Covariance Type:   nonrobust

                 coef   std err         t  P>|t|     [0.025    0.975]
Intercept  -2.665e-15  6.18e-15    -0.431  0.741  -8.12e-14  7.58e-14
sold           2.0000  6.49e-16  3.08e+15  0.000      2.000     2.000

Omnibus:         nan   Durbin-Watson:    1.149
Prob(Omnibus):   nan   Jarque-Bera (JB): 0.471
Skew:         -0.616   Prob(JB):         0.790
Kurtosis:      1.500   Cond. No.         13.4

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

For each unit sold, revenue goes up by 2. That's about it.
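If you'd rather pull the numbers out of the results programmatically instead of reading the summary table, `results.params` holds the coefficients as a pandas Series keyed by term name. A minimal sketch, re-fitting the same tiny dataset:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame([
    { 'sold': 0, 'revenue': 0 },
    { 'sold': 4, 'revenue': 8 },
    { 'sold': 16, 'revenue': 32 },
])

results = smf.ols("revenue ~ sold", data=df).fit()

# Coefficients live in results.params, indexed by term name
print(results.params['sold'])   # slope: revenue per unit sold, 2.0
print(results.rsquared)         # goodness of fit, 1.0 here
```

Handy when you're running lots of regressions and don't want to eyeball a summary for each one.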

Multivariable regression#

Multivariable regression is easy-peasy. Let's add a couple more columns to our dataset: tips, and the total charge_amount.

import pandas as pd

df = pd.DataFrame([
    { 'sold': 0, 'revenue': 0, 'tips': 0, 'charge_amount': 0 },
    { 'sold': 4, 'revenue': 8, 'tips': 1, 'charge_amount': 9 },
    { 'sold': 16, 'revenue': 32, 'tips': 2, 'charge_amount': 34 },
])
df
   sold  revenue  tips  charge_amount
0     0        0     0              0
1     4        8     1              9
2    16       32     2             34
import statsmodels.formula.api as smf

model = smf.ols("charge_amount ~ sold + tips", data=df)
results = model.fit()
results.summary()
OLS Regression Results
Dep. Variable:     charge_amount     R-squared:           1.000
Model:             OLS               Adj. R-squared:      nan
Method:            Least Squares     F-statistic:         0.000
Date:              Sun, 08 Dec 2019  Prob (F-statistic):  nan
Time:              10:14:20          Log-Likelihood:      89.745
No. Observations:  3                 AIC:                 -173.5
Df Residuals:      0                 BIC:                 -176.2
Df Model:          2
Covariance Type:   nonrobust

                 coef  std err   t  P>|t|  [0.025  0.975]
Intercept  -1.685e-15      inf  -0    nan     nan     nan
sold           2.0000      inf   0    nan     nan     nan
tips           1.0000      inf   0    nan     nan     nan

Omnibus:         nan   Durbin-Watson:    0.922
Prob(Omnibus):   nan   Jarque-Bera (JB): 0.520
Skew:         -0.691   Prob(JB):         0.771
Kurtosis:      1.500   Cond. No.         44.0

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Each unit sold adds 2 to the charge amount, and each unit of tips adds 1. There you go!
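Once you have a fitted model, `results.predict` will apply it to new rows: pass it a dataframe with the same column names you used in the formula. A quick sketch with made-up new observations:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame([
    { 'sold': 0, 'revenue': 0, 'tips': 0, 'charge_amount': 0 },
    { 'sold': 4, 'revenue': 8, 'tips': 1, 'charge_amount': 9 },
    { 'sold': 16, 'revenue': 32, 'tips': 2, 'charge_amount': 34 },
])

results = smf.ols("charge_amount ~ sold + tips", data=df).fit()

# Hypothetical new data: 10 items sold, 3 in tips
new_rows = pd.DataFrame([{ 'sold': 10, 'tips': 3 }])
predicted = results.predict(new_rows)
print(predicted)  # should be about 2 * 10 + 1 * 3 = 23
```

Only the columns named in the formula need to be present in the new dataframe.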

If you'd like more details, you can continue on in this section. If you'd just like the how-to-do-an-exact-thing explanations, check out the regression reference page.