{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear Regression for Human Beings\n", "\n", "Let's try to explain some linear regression concepts without formulas or official definitions or anything like that!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<p class=\"reading-options\">\n <a class=\"btn\" href=\"/regression/linear-regression\">\n <i class=\"fa fa-sm fa-book\"></i>\n Read online\n </a>\n <a class=\"btn\" href=\"/regression/notebooks/Linear Regression.ipynb\">\n <i class=\"fa fa-sm fa-download\"></i>\n Download notebook\n </a>\n <a class=\"btn\" href=\"https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/regression/notebooks/Linear Regression.ipynb\" target=\"_new\">\n <i class=\"fa fa-sm fa-laptop\"></i>\n Interactive version\n </a>\n</p>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "**We sell coffee, and it costs $2.** Since we're \ud83d\udcbc\ud83d\udcb0Important Business People\ud83d\udcb0\ud83d\udcbc, we might have some big questions about finance, such as:\n", "\n", "* If we sell **zero coffees**, how much money do we make?\n", "* If we sell **four coffees**, how much money do we make?\n", "* If we sell **sixteen coffees**, how much money do we make?\n", "\n", "Since coffee costs $2, we can just multiply it out.\n", "\n", "|\u2615|\u2716|\ud83d\udcb5|\n", "|---|---|---|\n", "|We sell zero coffees|`0 \u00d7 $2 per coffee`|We make $0|\n", "|We sell four coffees|`4 \u00d7 $2 per coffee`|We make $8|\n", "|We sell sixteen coffees|`16 \u00d7 $2 per coffee`|We make $32|\n", "\n", "Since coffee costs $2, in these situations we would make $0, $8, and $32. 
Easy-peasy!\n", "\n", "Linear regression, if we're going to skip over the specifics, is the **opposite of what we just did.** If we take the same example but twist it around a little bit, we can see how regression works.\n", "\n", "**Let's say we know this:**\n", "\n", "* We sold **zero coffees**, and made $0\n", "* We sold **four coffees**, and made $8\n", "* We sold **sixteen coffees**, and made $32\n", "\n", "Linear regression is when we ask ourselves, **how much does coffee cost?** Let's draw that same table again, but adjusted for our new question.\n", "\n", "|\u2615|\u2716|\ud83d\udcb5|\n", "|---|---|---|\n", "|We sell zero coffees|`0 \u00d7 ??? per coffee`|We make $0|\n", "|We sell four coffees|`4 \u00d7 ??? per coffee`|We make $8|\n", "|We sell sixteen coffees|`16 \u00d7 ??? per coffee`|We make $32|\n", "\n", "Maybe we can even figure it out in our heads: **coffee costs $2!** Easy, right? That's it. We're done!! That's linear regression!!\n", "\n", "Kind of, sort of, more or less, anyway. Let's move on to see how **linear regression works in Python code.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Performing a linear regression\n", "\n", "We'll start off with our data. To wrangle our data we're going to use [pandas](https://pandas.pydata.org/), a super-popular Python library for doing data-y things." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>sold</th>\n", " <th>revenue</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>4</td>\n", " <td>8</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>16</td>\n", " <td>32</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " sold revenue\n", "0 0 0\n", "1 4 8\n", "2 16 32" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.DataFrame([\n", " { 'sold': 0, 'revenue': 0 },\n", " { 'sold': 4, 'revenue': 8 },\n", " { 'sold': 16, 'revenue': 32 },\n", "])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our very tiny dataset has two columns:\n", "\n", "* Number of coffees sold\n", "* Amount of revenue from selling those coffees\n", "\n", "We want to ask a simple question using this data: **if we sold this many coffees and made this much money, how much does a coffee cost?** Up above we learned that this kind of question is **linear regression**.\n", "\n", "To perform our linear regression, we're going to use a library called [statsmodels](https://www.statsmodels.org), which conveniently (?) has two different ways of writing the code.\n", "\n", "### Formula style\n", "\n", "One way to calculate how much the coffee costs is [writing a formula](https://www.statsmodels.org/stable/example_formulas.html). 
It seems to be a less popular way of doing regressions in statsmodels, but _it's so nice and perfect_ that we're going to look at it first." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import statsmodels.formula.api as smf\n", "\n", "# What effect does the number of coffees sold have on our revenue?\n", "model = smf.ols(formula='revenue ~ sold', data=df)\n", "results = model.fit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It doesn't print anything out, but that's okay: we'll figure out how to look at the results in a second!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataframe style\n", "\n", "The other style of using statsmodels for linear regression uses pandas dataframes **directly** instead of writing out a formula. It can be a little more complicated looking, but it's very popular! It must be the default technique people learn when they pick up statsmodels." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import statsmodels.api as sm\n", "\n", "# What effect does the number of coffees sold have on our revenue?\n", "X = df[['sold']]\n", "y = df.revenue\n", "\n", "model = sm.OLS(y, sm.add_constant(X))\n", "results = model.fit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Don't worry about `sm.add_constant(X)`, we'll talk about it later.\n", "\n", "> Note: To be specific, the kind of regression we're using is called **ordinary least squares** regression, which is why we're using `smf.ols` and `sm.OLS`. Statsmodels supports [other types](https://www.statsmodels.org/stable/regression.html), too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examining our results\n", "\n", "No matter which method we use to calculate how much coffee costs, we end up with a variable called `results`. 
**We'll use this variable to see the answer.**\n", "\n", "If we _only_ want the most basic of results, we can write something like this:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "const -2.664535e-15\n", "sold 2.000000e+00\n", "dtype: float64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.params" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `2.000000e+00` next to `sold` means for every coffee sold, we make $2! If we want to get technical, it really means \"for every increase of 1 in `sold`, our `revenue` will increase by 2.\"\n", "\n", "While it's definitely useful, it unfortunately doesn't look very fancy. We like \u2728\ud83c\udf1f\ud83d\udc8e \ud835\udcbb\ud835\udcb6\ud835\udcc3\ud835\udcb8\ud835\udcce \ud835\udcc9\ud835\udcbd\ud835\udcbe\ud835\udcc3\ud835\udc54\ud835\udcc8 \ud83d\udc8e\ud83c\udf1f\u2728, so we'll run this code instead:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<table class=\"simpletable\">\n", "<caption>OLS Regression Results</caption>\n", "<tr>\n", " <th>Dep. Variable:</th> <td>revenue</td> <th> R-squared: </th> <td> 1.000</td> \n", "</tr>\n", "<tr>\n", " <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 1.000</td> \n", "</tr>\n", "<tr>\n", " <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td>9.502e+30</td>\n", "</tr>\n", "<tr>\n", " <th>Date:</th> <td>Sat, 07 Dec 2019</td> <th> Prob (F-statistic):</th> <td>2.07e-16</td> \n", "</tr>\n", "<tr>\n", " <th>Time:</th> <td>13:32:47</td> <th> Log-Likelihood: </th> <td> 94.907</td> \n", "</tr>\n", "<tr>\n", " <th>No. 
Observations:</th> <td> 3</td> <th> AIC: </th> <td> -185.8</td> \n", "</tr>\n", "<tr>\n", " <th>Df Residuals:</th> <td> 1</td> <th> BIC: </th> <td> -187.6</td> \n", "</tr>\n", "<tr>\n", " <th>Df Model:</th> <td> 1</td> <th> </th> <td> </td> \n", "</tr>\n", "<tr>\n", " <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n", "</tr>\n", "</table>\n", "<table class=\"simpletable\">\n", "<tr>\n", " <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[0.025</th> <th>0.975]</th> \n", "</tr>\n", "<tr>\n", " <th>const</th> <td>-2.665e-15</td> <td> 6.18e-15</td> <td> -0.431</td> <td> 0.741</td> <td>-8.12e-14</td> <td> 7.58e-14</td>\n", "</tr>\n", "<tr>\n", " <th>sold</th> <td> 2.0000</td> <td> 6.49e-16</td> <td> 3.08e+15</td> <td> 0.000</td> <td> 2.000</td> <td> 2.000</td>\n", "</tr>\n", "</table>\n", "<table class=\"simpletable\">\n", "<tr>\n", " <th>Omnibus:</th> <td> nan</td> <th> Durbin-Watson: </th> <td> 1.149</td>\n", "</tr>\n", "<tr>\n", " <th>Prob(Omnibus):</th> <td> nan</td> <th> Jarque-Bera (JB): </th> <td> 0.471</td>\n", "</tr>\n", "<tr>\n", " <th>Skew:</th> <td>-0.616</td> <th> Prob(JB): </th> <td> 0.790</td>\n", "</tr>\n", "<tr>\n", " <th>Kurtosis:</th> <td> 1.500</td> <th> Cond. No. </th> <td> 13.4</td>\n", "</tr>\n", "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "<class 'statsmodels.iolib.summary.Summary'>\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: revenue R-squared: 1.000\n", "Model: OLS Adj. R-squared: 1.000\n", "Method: Least Squares F-statistic: 9.502e+30\n", "Date: Sat, 07 Dec 2019 Prob (F-statistic): 2.07e-16\n", "Time: 13:32:47 Log-Likelihood: 94.907\n", "No. 
Observations: 3 AIC: -185.8\n", "Df Residuals: 1 BIC: -187.6\n", "Df Model: 1 \n", "Covariance Type: nonrobust \n", "==============================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "------------------------------------------------------------------------------\n", "const -2.665e-15 6.18e-15 -0.431 0.741 -8.12e-14 7.58e-14\n", "sold 2.0000 6.49e-16 3.08e+15 0.000 2.000 2.000\n", "==============================================================================\n", "Omnibus: nan Durbin-Watson: 1.149\n", "Prob(Omnibus): nan Jarque-Bera (JB): 0.471\n", "Skew: -0.616 Prob(JB): 0.790\n", "Kurtosis: 1.500 Cond. No. 13.4\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice, right? But it's a **lot of information,** so let's take a closer look at some of the bits and pieces.\n", "\n", "### Reading our summary\n", "\n", "The fancy style has the same results as `results.params` - try to find `sold` and `2.0000` hiding on the left a ways down." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<img src=\"\">\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can put this into words like this:\n", "\n", "* For every one more \"sold\" we have, we get two more \"revenue\"\n", "* For every one point increase in sold, we'll have a two point increase in revenue\n", "\n", "I know we weren't supposed to get technical, but just so you know: the `2.0000` is called the **coefficient**. 
The coefficient for `sold` is how much `revenue` will change if `sold` goes up by one.\n", "\n", "**`sold` isn't our only coefficient, though!** There's also the `const` one right above it, which is `-2.665e-15`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<img src=\"\">" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explaining the intercept\n", "\n", "`const` basically means \"how much money we've made if we've sold **zero coffees**.\" It's called `Intercept` when you use the formula-style regression, even though it'll be the exact same number.\n", "\n", "In this case, `const` is `-2.665e-15`. The `e-15` part means \"move the decimal point 15 places to the left to see what the number really is.\" That means when we sell zero coffees, we make `-0.000000000000002665` dollars. That's basically zero, right?\n", "\n", "We need the constant in our regression because sometimes it isn't zero coffees making zero dollars. What if instead we were talking about scores on a test based on hours of studying?\n", "\n", "More studying would (hopefully) give us a higher score, but if we **studied for zero hours** we (hopefully) wouldn't score a zero on the test. If `const` were 70, that would mean even if you study for zero hours, you're **predicted to get a 70 on the test.**\n", "\n", "Formula-style regression automatically adds a constant, but the dataframe version requires you to use `sm.add_constant(X)`:\n", "\n", "> model = sm.OLS(y, sm.add_constant(X))\n", "\n", "If you use `model = sm.OLS(y, X)` instead, the regression would insist that studying for zero hours deserves a zero. Ouch!\n", "\n", "Why does the formula technique do it the friendly way by default, but the dataframe version make us take an extra step? _No clue._ Maybe regression just seemed _too easy_ without something like that to trip us up. Yet another reason to stick with the formula version!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review\n", "\n", "OK, so what did we just learn?\n", "\n", "Sometimes we know how many cups of coffee we sold and how much each coffee costs, and we want to know how much money we made.\n", "\n", "|\u2615|\u2716|\ud83d\udcb5|\n", "|---|---|---|\n", "|We sell zero coffees|`0 \u00d7 $2 per coffee`|We make ???|\n", "|We sell four coffees|`4 \u00d7 $2 per coffee`|We make ???|\n", "|We sell sixteen coffees|`16 \u00d7 $2 per coffee`|We make ???|\n", "\n", "That is **not** linear regression. That is, I don't know, normal math?\n", "\n", "Linear regression is when **we know how much money we made and how many coffees we sold, but not how much each coffee costs.**\n", "\n", "|\u2615|\u2716|\ud83d\udcb5|\n", "|---|---|---|\n", "|We sell zero coffees|`0 \u00d7 ??? per coffee`|We make $0|\n", "|We sell four coffees|`4 \u00d7 ??? per coffee`|We make $8|\n", "|We sell sixteen coffees|`16 \u00d7 ??? per coffee`|We make $32|\n", "\n", "If we want to risk sounding halfway technical, linear regression is a question of \"how do the inputs affect the number that comes out at the end.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }