\n",
"

\n",
"

"
],
"text/plain": [
" sold revenue\n",
"0 0 0\n",
"1 4 8\n",
"2 16 32"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.DataFrame([\n",
" { 'sold': 0, 'revenue': 0 },\n",
" { 'sold': 4, 'revenue': 8 },\n",
" { 'sold': 16, 'revenue': 32 },\n",
"])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our very tiny dataset has two columns:\n",
"\n",
"* Number of coffees sold\n",
"* Amount of revenue from selling those coffees\n",
"\n",
"We want to ask a simple question using this data: **if we sold this many coffees and made this much money, how much does a coffee cost?** Up above we learned that this kind of question can be answered with **linear regression**.\n",
"\n",
"To perform our linear regression, we're going to use a library called [statsmodels](https://www.statsmodels.org), which conveniently (?) has two different ways of writing the code.\n",
"\n",
"### Formula style\n",
"\n",
"One way to calculate how much the coffee costs is [writing a formula](https://www.statsmodels.org/stable/example_formulas.html). It seems to be a less popular way of doing regressions in statsmodels, but _it's so nice and perfect_ that we're going to look at it first."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import statsmodels.formula.api as smf\n",
"\n",
"# What effect does the number of coffees sold have on our revenue?\n",
"model = smf.ols(formula='revenue ~ sold', data=df)\n",
"results = model.fit()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It doesn't print anything out, but that's okay: we'll figure out how to look at the results in a second!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataframe style\n",
"\n",
"The other style of using statsmodels for linear regression uses pandas dataframes **directly** instead of writing out a formula. It can look a little more complicated, but it's very popular! It seems to be the technique most people learn first when they pick up statsmodels."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import statsmodels.api as sm\n",
"\n",
"# What effect does the number of coffees sold have on our revenue?\n",
"X = df[['sold']]\n",
"y = df.revenue\n",
"\n",
"model = sm.OLS(y, sm.add_constant(X))\n",
"results = model.fit()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Don't worry about `sm.add_constant(X)` for now; we'll talk about it later.\n",
"\n",
"> Note: To be specific, the kind of regression we're using is called **ordinary least squares** regression, which is why we're using `smf.ols` and `sm.OLS`. Statsmodels supports [other types](https://www.statsmodels.org/stable/regression.html), too."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examining our results\n",
"\n",
"No matter which method we use to calculate how much a coffee costs, we end up with a variable called `results`. **We'll use this variable to see the answer.**\n",
"\n",
"If we _only_ want the most basic of results, we can write something like this:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"const -2.664535e-15\n",
"sold 2.000000e+00\n",
"dtype: float64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results.params"
]
},
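{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both styles fit the same model, so they should give us the same numbers. Here's a quick sketch to check that, assuming the same tiny dataset as above. It re-creates `df` so it can run on its own, and `formula_results`, `dataframe_results` and `same_answer` are names made up for this example. The one difference to watch for: formula style labels the intercept `Intercept`, while dataframe style labels it `const`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import statsmodels.api as sm\n",
"import statsmodels.formula.api as smf\n",
"\n",
"# Same tiny dataset as above, rebuilt so this cell runs on its own\n",
"df = pd.DataFrame({'sold': [0, 4, 16], 'revenue': [0, 8, 32]})\n",
"\n",
"# Formula style\n",
"formula_results = smf.ols(formula='revenue ~ sold', data=df).fit()\n",
"\n",
"# Dataframe style\n",
"dataframe_results = sm.OLS(df['revenue'], sm.add_constant(df[['sold']])).fit()\n",
"\n",
"# Compare the raw numbers, ignoring the different labels for the intercept\n",
"same_answer = np.allclose(formula_results.params.values, dataframe_results.params.values)\n",
"same_answer"
]
},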
{
"cell_type": "markdown",
"metadata": {},
"source": [
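"The `2.000000e+00` next to `sold` is just scientific notation for 2, and it means that for every coffee sold, we make $2! If we want to get technical, it really means \"for every increase of 1 in `sold`, our `revenue` will increase by 2.\" The `const` value of `-2.664535e-15` is so close to zero that it's really just floating-point rounding error.\n",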
"\n",
"While it's definitely useful, it unfortunately doesn't look very fancy. We like \u2728\ud83c\udf1f\ud83d\udc8e \ud835\udcbb\ud835\udcb6\ud835\udcc3\ud835\udcb8\ud835\udcce \ud835\udcc9\ud835\udcbd\ud835\udcbe\ud835\udcc3\ud835\udc54\ud835\udcc8 \ud83d\udc8e\ud83c\udf1f\u2728, so we'll run this code instead:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
| Dep. Variable: | revenue | R-squared: | 1.000 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 1.000 |
| Method: | Least Squares | F-statistic: | 9.502e+30 |
| Date: | Sat, 07 Dec 2019 | Prob (F-statistic): | 2.07e-16 |
| Time: | 13:32:47 | Log-Likelihood: | 94.907 |
| No. Observations: | 3 | AIC: | -185.8 |
| Df Residuals: | 1 | BIC: | -187.6 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -2.665e-15 | 6.18e-15 | -0.431 | 0.741 | -8.12e-14 | 7.58e-14 |
| sold | 2.0000 | 6.49e-16 | 3.08e+15 | 0.000 | 2.000 | 2.000 |

| Omnibus: | nan | Durbin-Watson: | 1.149 |
|---|---|---|---|
| Prob(Omnibus): | nan | Jarque-Bera (JB): | 0.471 |
| Skew: | -0.616 | Prob(JB): | 0.790 |
| Kurtosis: | 1.500 | Cond. No. | 13.4 |

Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "