Linear Regression for Human Beings
Let's try to explain some linear regression concepts without formulas or official definitions or anything like that!
Introduction
We sell coffee, and it costs $2. Since we're 💼💰Important Business People💰💼, we might have some big questions about finance, such as:
- If we sell zero coffees, how much money do we make?
- If we sell four coffees, how much money do we make?
- If we sell sixteen coffees, how much money do we make?
Since coffee costs $2, we can just multiply it out.
☕ | ✖ | 💵 |
---|---|---|
We sell zero coffees | 0 × $2 per coffee | We make $0 |
We sell four coffees | 4 × $2 per coffee | We make $8 |
We sell sixteen coffees | 16 × $2 per coffee | We make $32 |
Since coffee costs $2, in these situations we would make $0, $8, and $32. Easy-peasy!
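If we'd rather let Python do the multiplying, here's a minimal sketch of the same arithmetic. The names `PRICE_PER_COFFEE` and `revenue` are made up for this example and aren't part of any library.

```python
# Price of one coffee in dollars, taken from the example above
PRICE_PER_COFFEE = 2


def revenue(coffees_sold):
    """Money made from selling a given number of coffees."""
    return coffees_sold * PRICE_PER_COFFEE


# The three situations from the table
for sold in [0, 4, 16]:
    print(f"We sell {sold} coffees, we make ${revenue(sold)}")
```

We know the price and the number of coffees, and we calculate the money. Keep that direction in mind for what comes next.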
Linear regression, if we're going to skip over the specifics, is the opposite of what we just did. If we take the same example but twist it around a little bit, we can see how regression works.
Let's say we know this:
- We sold zero coffees, and made $0
- We sold four coffees, and made $8
- We sold sixteen coffees, and made $32
Linear regression is when we ask ourselves, how much does coffee cost? Let's draw that same table again, but adjusted for our new question.
☕ | ✖ | 💵 |
---|---|---|
We sell zero coffees | 0 × ??? per coffee | We make $0 |
We sell four coffees | 4 × ??? per coffee | We make $8 |
We sell sixteen coffees | 16 × ??? per coffee | We make $32 |
Maybe we can even figure it out in our heads: coffee costs $2! Easy, right? That's it. We're done!! That's linear regression!!
Kind of, sort of, more or less, anyway. Let's move on to see how linear regression works in Python code.
import pandas as pd
df = pd.DataFrame([
{ 'sold': 0, 'revenue': 0 },
{ 'sold': 4, 'revenue': 8 },
{ 'sold': 16, 'revenue': 32 },
])
df
Our very tiny dataset has two columns:
- Number of coffees sold
- Amount of revenue from selling those coffees
We want to ask a simple question using this data: if we sold this many coffees and made this much money, how much does a coffee cost? Up above we learned that this kind of question is linear regression.
To perform our linear regression, we're going to use a library called statsmodels, which conveniently (?) has two different ways of writing the code.
Formula style
One way to calculate how much the coffee costs is writing a formula. It seems to be a less popular way of doing regressions in statsmodels, but it's so nice and perfect that we're going to look at it first.
import statsmodels.formula.api as smf
# What effect does the number of coffees sold have on our revenue?
model = smf.ols(formula='revenue ~ sold', data=df)
results = model.fit()
It doesn't print anything out, but that's okay: we'll figure out how to look at the results in a second!
Dataframe style
The other style of using statsmodels for linear regression uses pandas dataframes directly instead of writing out a formula. It can be a little more complicated looking, but it's very popular! It must be the default technique people learn when they pick up statsmodels.
import statsmodels.api as sm
# What effect does the number of coffees sold have on our revenue?
X = df[['sold']]  # inputs: the double brackets keep this a dataframe
y = df.revenue    # output: the number we're trying to explain
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
Don't worry about `sm.add_constant(X)`; we'll talk about it later.
Note: To be specific, the kind of regression we're using is called ordinary least squares regression, which is why we're using `smf.ols` and `sm.OLS`. Statsmodels supports other types, too.
Examining our results
No matter which method we use to calculate how much coffee costs, we end up with a variable called `results`. We'll use this variable to see the answer.
If we only want the most basic of results, we can write something like this:
results.params
The `2.000000e+00` next to `sold` means for every coffee sold, we make $2! If we want to get technical, it really means "for every increase of 1 in `sold`, our `revenue` will increase by 2."
While it's definitely useful, it unfortunately doesn't look very fancy. We like ✨fancy things✨, so we'll run this code instead:
results.summary()
Nice, right? But it's a lot of information, so let's take a closer look at some of the bits and pieces.
Reading our summary
The fancy style has the same results as `results.params`: try to find `sold` and `2.0000` hiding on the left a ways down.
We can put this into words like this:
- For every one more "sold" we have, we get two more "revenue"
- For every one point increase in sold, we'll have a two point increase in revenue
I know we weren't supposed to get technical, but just so you know: the `2.0000` is called the coefficient. The coefficient for `sold` is how much `revenue` will change if `sold` goes up by one.
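One way to see "goes up by one" in action is to ask the fitted model for two predictions that are one coffee apart. This is only a sketch: the values 10 and 11 are picked arbitrarily, and `new_sales`, `predicted`, and `difference` are names invented for the example.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same tiny dataset as above
df = pd.DataFrame([
    {'sold': 0, 'revenue': 0},
    {'sold': 4, 'revenue': 8},
    {'sold': 16, 'revenue': 32},
])
results = smf.ols(formula='revenue ~ sold', data=df).fit()

# Hypothetical question: what does the model predict for 10 coffees vs. 11?
new_sales = pd.DataFrame({'sold': [10, 11]})
predicted = results.predict(new_sales)

# One extra coffee sold should change the prediction by the coefficient
difference = predicted.iloc[1] - predicted.iloc[0]
print(round(difference, 2))
print(round(results.params['sold'], 2))
```

The gap between the two predictions should match the coefficient for `sold`.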
`sold` isn't our only coefficient, though! There's also the `const` one right above it, which is `-2.665e-15`.
Explaining the intercept
`const` basically means "how much money we've made if we've sold zero coffees." It's called `Intercept` when you use the formula-style regression, even though it'll be the exact same number.
In this case, `const` is `-2.665e-15`. The `e-15` part means "move the decimal point 15 places to the left to see what the number really is." That means when we sell zero coffees, we make `-0.000000000000002665` dollars. That's basically zero, right?
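If counting zeros by hand isn't your thing, Python can write the number out the long way. This is a small sketch using only the standard library. The `1e-9` tolerance (a billionth of a dollar) is an arbitrary choice for the example, and the not-quite-zero value is most likely just a leftover of how computers round decimal numbers.

```python
import math

# The value reported for const above
const = -2.665e-15

# Write it out with 18 digits after the decimal point instead of e-notation
written_out = f"{const:.18f}"
print(written_out)

# Is it within a billionth of a dollar of zero? (tolerance chosen arbitrarily)
is_basically_zero = math.isclose(const, 0, abs_tol=1e-9)
print(is_basically_zero)
```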
We need the constant in our regression because sometimes it isn't zero coffees making zero dollars. What if instead we were talking about scores on a test based on hours of studying?
More studying would (hopefully) give us a higher score, but if we studied for zero hours we (hopefully) wouldn't score a zero on the test. If `const` were 70, that would mean even if you study for zero hours, you're predicted to get a 70 on the test.
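Here's what that might look like with numbers. The data below is completely made up for illustration (the `study` dataframe and its `hours` and `score` columns don't come from anywhere real), and it's rigged so the scores start at 70 and go up 5 points per hour.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up numbers for illustration only: hours studied and test score
study = pd.DataFrame({
    'hours': [0, 1, 2, 3],
    'score': [70, 75, 80, 85],
})

# What effect does the number of hours studied have on our score?
study_results = smf.ols(formula='score ~ hours', data=study).fit()

baseline = study_results.params['Intercept']  # predicted score at zero hours
per_hour = study_results.params['hours']      # extra points per hour of studying
print(round(baseline, 2), round(per_hour, 2))
```

With this pretend data, the intercept is the "I didn't study at all" score and the coefficient for `hours` is what each extra hour buys us.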
Formula-style regression automatically adds a constant, but the dataframe version requires you to use `sm.add_constant(X)`:
model = sm.OLS(y, sm.add_constant(X))
If you use `model = sm.OLS(y, X)` instead, the regression would insist that studying for zero hours deserves a zero. Ouch!
Why does the formula technique do it the friendly way by default, but the dataframe version makes us take an extra step? No clue. Maybe regression just seemed too easy without something like that to trip us up. Yet another reason to stick with the formula version!
Review
OK, so what did we just learn?
Sometimes we know how many cups of coffee we sold and how much each coffee costs, and we want to know how much money we made.
☕ | ✖ | 💵 |
---|---|---|
We sell zero coffees | 0 × $2 per coffee | We make ??? |
We sell four coffees | 4 × $2 per coffee | We make ??? |
We sell sixteen coffees | 16 × $2 per coffee | We make ??? |
That is not linear regression. That is, I don't know, normal math?
Linear regression is when we know how much money we made and how many coffees we sold, but not how much a coffee costs.
☕ | ✖ | 💵 |
---|---|---|
We sell zero coffees | 0 × ??? per coffee | We make $0 |
We sell four coffees | 4 × ??? per coffee | We make $8 |
We sell sixteen coffees | 16 × ??? per coffee | We make $32 |
If we want to risk sounding halfway technical, linear regression is a question of "how do the inputs affect the number that comes out at the end."