{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Linear regression with the Associated Press\n",
                "\n",
                "In [this piece from the Associated Press](https://apnews.com/66ac44186b6249709501f07a7eab36da), Nicky Forster combines from the US Census Bureau and the CDC to see how life expectancy is related to actors like unemployment, income, and others. We'll be looking at how they can write sentences like this one: \n",
                "\n",
                "> \"An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy, the AP found.\""
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "<p class=\"reading-options\">\n  <a class=\"btn\" href=\"/ap-regression-unemployment/simple-regression-with-census-data-statsmodels-with-formulas\">\n    <i class=\"fa fa-sm fa-book\"></i>\n    Read online\n  </a>\n  <a class=\"btn\" href=\"/ap-regression-unemployment/notebooks/Simple regression with census data (statsmodels with formulas).ipynb\">\n    <i class=\"fa fa-sm fa-download\"></i>\n    Download notebook\n  </a>\n  <a class=\"btn\" href=\"https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/ap-regression-unemployment/notebooks/Simple regression with census data (statsmodels with formulas).ipynb\" target=\"_new\">\n    <i class=\"fa fa-sm fa-laptop\"></i>\n    Interactive version\n  </a>\n</p>"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Prep work: Downloading necessary files\n",
                "Before we get started, we need to download all of the data we'll be using.\n",
                "* **R12221544_SL140.csv:** Employment status by census tract - ACS 2015 5-year, tract level, table B23025\n",
                "* **R12221544.txt:** Data dictionary for R12221544_SL140.csv - Description of columns and tables\n",
                "* **R12221550_SL140.csv:** Demographic data by census tract - ACS 2015 5-year, tract level, tables B23025, B06009, B03002, B19013, C17002\n",
                "* **R12221550.txt:** Data dictionary for R12221550_SL140.csv - Description of columns and tables\n",
                "* **US_A.CSV:** USALEEP national data - estimates of life expectancy at birth for most of the census tracts in the USA\n",
                "* **Record_Layout_CensusTract_Life_Expectancy.pdf:** USALEEP data dictionary - None\n"
            ]
        },
        {
            "cell_type": "code",
            "metadata": {},
            "source": [
                "# Make data directory if it doesn't exist\n",
                "!mkdir -p data\n",
                "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/R12221544_SL140.csv.zip -P data\n",
                "!unzip -n -d data data/R12221544_SL140.csv.zip\n",
                "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/R12221544.txt -P data\n",
                "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/R12221550_SL140.csv.zip -P data\n",
                "!unzip -n -d data data/R12221550_SL140.csv.zip\n",
                "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/R12221550.txt -P data\n",
                "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/US_A.CSV -P data\n",
                "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/Record_Layout_CensusTract_Life_Expectancy.pdf -P data"
            ],
            "outputs": [],
            "execution_count": null
        },
        {
            "cell_type": "code",
            "execution_count": 48,
            "metadata": {},
            "outputs": [],
            "source": [
                "import pandas as pd\n",
                "import numpy as np\n",
                "\n",
                "# Keep decimal numbers to 4 decimal places\n",
                "pd.set_option(\"display.float_format\", '{:,.4f}'.format)\n",
                "pd.set_option(\"display.max_columns\", 100)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Reading in our data\n",
                "\n",
                "We'll start by reading in our datasets (as we must!).\n",
                "\n",
                "### Life expectancy data\n",
                "\n",
                "The first dataset is [USALEEP](https://www.cdc.gov/nchs/nvss/usaleep/usaleep.html), the U.S. Small-area Life Expectancy Estimates Project. It's from the National Center for Health Statistics at the CDC.\n",
                "\n",
                "We're going to rename a few columns so they make a little more sense. The columns themselves aren't too complicated, mostly pointing out the census tracts and the life expectancy."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 60,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>tract_id</th>\n",
                            "      <th>STATE2KX</th>\n",
                            "      <th>CNTY2KX</th>\n",
                            "      <th>TRACT2KX</th>\n",
                            "      <th>life_expectancy</th>\n",
                            "      <th>life_expectancy_std_err</th>\n",
                            "      <th>flag</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>1001020100</td>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>20100</td>\n",
                            "      <td>73.1000</td>\n",
                            "      <td>2.2348</td>\n",
                            "      <td>3</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>1001020200</td>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>20200</td>\n",
                            "      <td>76.9000</td>\n",
                            "      <td>3.3453</td>\n",
                            "      <td>3</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>1001020400</td>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>20400</td>\n",
                            "      <td>75.4000</td>\n",
                            "      <td>1.0216</td>\n",
                            "      <td>3</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>1001020500</td>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>20500</td>\n",
                            "      <td>79.4000</td>\n",
                            "      <td>1.1768</td>\n",
                            "      <td>1</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>1001020600</td>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>20600</td>\n",
                            "      <td>73.1000</td>\n",
                            "      <td>1.5519</td>\n",
                            "      <td>3</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "     tract_id  STATE2KX  CNTY2KX  TRACT2KX  life_expectancy  \\\n",
                            "0  1001020100         1        1     20100          73.1000   \n",
                            "1  1001020200         1        1     20200          76.9000   \n",
                            "2  1001020400         1        1     20400          75.4000   \n",
                            "3  1001020500         1        1     20500          79.4000   \n",
                            "4  1001020600         1        1     20600          73.1000   \n",
                            "\n",
                            "   life_expectancy_std_err  flag  \n",
                            "0                   2.2348     3  \n",
                            "1                   3.3453     3  \n",
                            "2                   1.0216     3  \n",
                            "3                   1.1768     1  \n",
                            "4                   1.5519     3  "
                        ]
                    },
                    "execution_count": 60,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "life_expec = pd.read_csv(\"data/US_A.CSV\")\n",
                "life_expec.columns = ['tract_id', 'STATE2KX','CNTY2KX', 'TRACT2KX', 'life_expectancy', \n",
                "                      'life_expectancy_std_err', 'flag']\n",
                "life_expec.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "`flag` sounds a little mysterious and scary, but it's just whether the numbers were observed or predicted."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Unemployment Data\n",
                "\n",
                "For unemployment, we'll be using data from the Census. You can get unemployment measures from a lot of places in the US government, so the Census Bureau actually [publishes data on which ones you should use](https://www.census.gov/topics/employment/labor-force/guidance.html). In our case we're going to collect unemployment data from the American Community Survery, for the 5-year period nearest to our life expectancy dataset.\n",
                "\n",
                "We'll be using Table B23025005, Employment Status for the Population 16 Years and Over. Take a look at [the codebook](https://www.socialexplorer.com/data/ACS2015_5yr/metadata/?ds=ACS15_5yr&var=B23025005) for more details. We're only reading in a few columns to keep things looking clean."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 63,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>Geo_FIPS</th>\n",
                            "      <th>total_pop</th>\n",
                            "      <th>unemployed</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>1001020100</td>\n",
                            "      <td>1554</td>\n",
                            "      <td>54</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>1001020200</td>\n",
                            "      <td>1731</td>\n",
                            "      <td>116</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>1001020300</td>\n",
                            "      <td>2462</td>\n",
                            "      <td>91</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>1001020400</td>\n",
                            "      <td>3424</td>\n",
                            "      <td>216</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>1001020500</td>\n",
                            "      <td>8198</td>\n",
                            "      <td>221</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "     Geo_FIPS  total_pop  unemployed\n",
                            "0  1001020100       1554          54\n",
                            "1  1001020200       1731         116\n",
                            "2  1001020300       2462          91\n",
                            "3  1001020400       3424         216\n",
                            "4  1001020500       8198         221"
                        ]
                    },
                    "execution_count": 63,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "columns = {\n",
                "    'Geo_FIPS': 'Geo_FIPS',\n",
                "    'ACS15_5yr_B23025001': 'total_pop',\n",
                "    'ACS15_5yr_B23025005': 'unemployed'\n",
                "}\n",
                "employment = pd.read_csv(\"data/R12221544_SL140.csv\", usecols=columns.keys(), encoding='latin-1')\n",
                "\n",
                "employment = employment.rename(columns=columns)\n",
                "employment.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Since the important thing is _percent_ unemployed, not just how many people are in an area, we'll need to do a little calculation.\n",
                "\n",
                "> **Note that I'm multiplying by 100 here** - if you have a litle extra time, try running this notebook on your own with a 0-1.0 percentage instead of the 0-100 version."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 66,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>Geo_FIPS</th>\n",
                            "      <th>total_pop</th>\n",
                            "      <th>unemployed</th>\n",
                            "      <th>pct_unemployment</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>1001020100</td>\n",
                            "      <td>1554</td>\n",
                            "      <td>54</td>\n",
                            "      <td>3.4749</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>1001020200</td>\n",
                            "      <td>1731</td>\n",
                            "      <td>116</td>\n",
                            "      <td>6.7013</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>1001020300</td>\n",
                            "      <td>2462</td>\n",
                            "      <td>91</td>\n",
                            "      <td>3.6962</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>1001020400</td>\n",
                            "      <td>3424</td>\n",
                            "      <td>216</td>\n",
                            "      <td>6.3084</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>1001020500</td>\n",
                            "      <td>8198</td>\n",
                            "      <td>221</td>\n",
                            "      <td>2.6958</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "     Geo_FIPS  total_pop  unemployed  pct_unemployment\n",
                            "0  1001020100       1554          54            3.4749\n",
                            "1  1001020200       1731         116            6.7013\n",
                            "2  1001020300       2462          91            3.6962\n",
                            "3  1001020400       3424         216            6.3084\n",
                            "4  1001020500       8198         221            2.6958"
                        ]
                    },
                    "execution_count": 66,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "employment['pct_unemployment'] = employment['unemployed'] / employment['total_pop'] * 100\n",
                "employment.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "### Demographic data\n",
                "\n",
                "The other dataset we're using is also from the Census Bureau's American Community Survey. It's a collection of many more columns with very, very long names and very, very mysterious codes. The tables include:\n",
                "\n",
                "* [B03002](https://censusreporter.org/tables/B03002/): Hispanic or Latino Origin by Race\n",
                "* [B06009](https://censusreporter.org/tables/B06009/): Place of Birth by Educational Attainment in the United States\n",
                "* [C17002](https://censusreporter.org/tables/C17002/): Ratio of Income to Poverty Level\n",
                "* [B19013](https://censusreporter.org/tables/B19013/): Median Household Income\n",
                "\n",
                "Again, we're only picking a few columns to read in. Generally they're the raw count of certain populations, as well as the total population counted of each table."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 67,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>Geo_FIPS</th>\n",
                            "      <th>race_table_total</th>\n",
                            "      <th>white</th>\n",
                            "      <th>black</th>\n",
                            "      <th>hispanic</th>\n",
                            "      <th>edu_table_total</th>\n",
                            "      <th>less_than_hs</th>\n",
                            "      <th>poverty_table_total</th>\n",
                            "      <th>poverty_100_124</th>\n",
                            "      <th>poverty_125_149</th>\n",
                            "      <th>income</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>1001020100</td>\n",
                            "      <td>1948</td>\n",
                            "      <td>1703</td>\n",
                            "      <td>150</td>\n",
                            "      <td>17</td>\n",
                            "      <td>1,243.0000</td>\n",
                            "      <td>184.0000</td>\n",
                            "      <td>1948</td>\n",
                            "      <td>81</td>\n",
                            "      <td>101</td>\n",
                            "      <td>61,838.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>1001020200</td>\n",
                            "      <td>2156</td>\n",
                            "      <td>872</td>\n",
                            "      <td>1149</td>\n",
                            "      <td>17</td>\n",
                            "      <td>1,397.0000</td>\n",
                            "      <td>356.0000</td>\n",
                            "      <td>1983</td>\n",
                            "      <td>232</td>\n",
                            "      <td>58</td>\n",
                            "      <td>32,303.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>1001020300</td>\n",
                            "      <td>2968</td>\n",
                            "      <td>2212</td>\n",
                            "      <td>551</td>\n",
                            "      <td>0</td>\n",
                            "      <td>2,074.0000</td>\n",
                            "      <td>221.0000</td>\n",
                            "      <td>2968</td>\n",
                            "      <td>148</td>\n",
                            "      <td>207</td>\n",
                            "      <td>44,922.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>1001020400</td>\n",
                            "      <td>4423</td>\n",
                            "      <td>3662</td>\n",
                            "      <td>162</td>\n",
                            "      <td>464</td>\n",
                            "      <td>2,899.0000</td>\n",
                            "      <td>339.0000</td>\n",
                            "      <td>4423</td>\n",
                            "      <td>141</td>\n",
                            "      <td>182</td>\n",
                            "      <td>54,329.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>1001020500</td>\n",
                            "      <td>10763</td>\n",
                            "      <td>7368</td>\n",
                            "      <td>2674</td>\n",
                            "      <td>80</td>\n",
                            "      <td>6,974.0000</td>\n",
                            "      <td>310.0000</td>\n",
                            "      <td>10563</td>\n",
                            "      <td>256</td>\n",
                            "      <td>1064</td>\n",
                            "      <td>51,965.0000</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "     Geo_FIPS  race_table_total  white  black  hispanic  edu_table_total  \\\n",
                            "0  1001020100              1948   1703    150        17       1,243.0000   \n",
                            "1  1001020200              2156    872   1149        17       1,397.0000   \n",
                            "2  1001020300              2968   2212    551         0       2,074.0000   \n",
                            "3  1001020400              4423   3662    162       464       2,899.0000   \n",
                            "4  1001020500             10763   7368   2674        80       6,974.0000   \n",
                            "\n",
                            "   less_than_hs  poverty_table_total  poverty_100_124  poverty_125_149  \\\n",
                            "0      184.0000                 1948               81              101   \n",
                            "1      356.0000                 1983              232               58   \n",
                            "2      221.0000                 2968              148              207   \n",
                            "3      339.0000                 4423              141              182   \n",
                            "4      310.0000                10563              256             1064   \n",
                            "\n",
                            "       income  \n",
                            "0 61,838.0000  \n",
                            "1 32,303.0000  \n",
                            "2 44,922.0000  \n",
                            "3 54,329.0000  \n",
                            "4 51,965.0000  "
                        ]
                    },
                    "execution_count": 67,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "columns = {\n",
                "    'Geo_FIPS': 'Geo_FIPS',\n",
                "    'ACS15_5yr_B03002001': 'race_table_total',\n",
                "    'ACS15_5yr_B03002004': 'black',\n",
                "    'ACS15_5yr_B03002003': 'white',\n",
                "    'ACS15_5yr_B03002012': 'hispanic',\n",
                "    'ACS15_5yr_B06009001': 'edu_table_total',\n",
                "    'ACS15_5yr_B06009002': 'less_than_hs',\n",
                "    'ACS15_5yr_C17002004': 'poverty_100_124',\n",
                "    'ACS15_5yr_C17002005': 'poverty_125_149',\n",
                "    'ACS15_5yr_C17002001': 'poverty_table_total',\n",
                "    'ACS15_5yr_B19013001': 'income'\n",
                "}        \n",
                "census = pd.read_csv(\"data/R12221550_SL140.csv\", usecols=columns.keys(), encoding='latin-1')\n",
                "census = census.rename(columns=columns)\n",
                "census.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "We'll again convert these **raw population numbers into percentages** - what percent of people are certain races? What percent of people have not finished high school?"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 68,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>Geo_FIPS</th>\n",
                            "      <th>pct_black</th>\n",
                            "      <th>pct_white</th>\n",
                            "      <th>pct_hispanic</th>\n",
                            "      <th>pct_less_than_hs</th>\n",
                            "      <th>pct_under_150_poverty</th>\n",
                            "      <th>income</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>1001020100</td>\n",
                            "      <td>7.7002</td>\n",
                            "      <td>87.4230</td>\n",
                            "      <td>0.8727</td>\n",
                            "      <td>14.8029</td>\n",
                            "      <td>9.3429</td>\n",
                            "      <td>61,838.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>1001020200</td>\n",
                            "      <td>53.2931</td>\n",
                            "      <td>40.4453</td>\n",
                            "      <td>0.7885</td>\n",
                            "      <td>25.4832</td>\n",
                            "      <td>14.6243</td>\n",
                            "      <td>32,303.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>1001020300</td>\n",
                            "      <td>18.5647</td>\n",
                            "      <td>74.5283</td>\n",
                            "      <td>0.0000</td>\n",
                            "      <td>10.6557</td>\n",
                            "      <td>11.9609</td>\n",
                            "      <td>44,922.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>1001020400</td>\n",
                            "      <td>3.6627</td>\n",
                            "      <td>82.7945</td>\n",
                            "      <td>10.4906</td>\n",
                            "      <td>11.6937</td>\n",
                            "      <td>7.3027</td>\n",
                            "      <td>54,329.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>1001020500</td>\n",
                            "      <td>24.8444</td>\n",
                            "      <td>68.4567</td>\n",
                            "      <td>0.7433</td>\n",
                            "      <td>4.4451</td>\n",
                            "      <td>12.4964</td>\n",
                            "      <td>51,965.0000</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "     Geo_FIPS  pct_black  pct_white  pct_hispanic  pct_less_than_hs  \\\n",
                            "0  1001020100     7.7002    87.4230        0.8727           14.8029   \n",
                            "1  1001020200    53.2931    40.4453        0.7885           25.4832   \n",
                            "2  1001020300    18.5647    74.5283        0.0000           10.6557   \n",
                            "3  1001020400     3.6627    82.7945       10.4906           11.6937   \n",
                            "4  1001020500    24.8444    68.4567        0.7433            4.4451   \n",
                            "\n",
                            "   pct_under_150_poverty      income  \n",
                            "0                 9.3429 61,838.0000  \n",
                            "1                14.6243 32,303.0000  \n",
                            "2                11.9609 44,922.0000  \n",
                            "3                 7.3027 54,329.0000  \n",
                            "4                12.4964 51,965.0000  "
                        ]
                    },
                    "execution_count": 68,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "census_features = pd.DataFrame({\n",
                "    'Geo_FIPS': census.Geo_FIPS,\n",
                "    'pct_black': census.black / census.race_table_total * 100,\n",
                "    'pct_white': census.white / census.race_table_total * 100,\n",
                "    'pct_hispanic': census.hispanic / census.race_table_total * 100,\n",
                "    'pct_less_than_hs': census.less_than_hs / census.edu_table_total * 100,\n",
                "    'pct_under_150_poverty': (census.poverty_100_124 + census.poverty_125_149) / census.poverty_table_total * 100,\n",
                "    'income': census.income,\n",
                "})\n",
                "census_features.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Merge the data\n",
                "\n",
                "To perform our regression, we need all of our data in **a single dataframe**. Fortunately these dataframes all keep track of census tracks with [FIPS codes](https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt), unique geographic codes that we can rely on to merge our datasets."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 72,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>tract_id</th>\n",
                            "      <th>STATE2KX</th>\n",
                            "      <th>CNTY2KX</th>\n",
                            "      <th>TRACT2KX</th>\n",
                            "      <th>life_expectancy</th>\n",
                            "      <th>life_expectancy_std_err</th>\n",
                            "      <th>flag</th>\n",
                            "      <th>Geo_FIPS</th>\n",
                            "      <th>total_pop</th>\n",
                            "      <th>unemployed</th>\n",
                            "      <th>pct_unemployment</th>\n",
                            "      <th>pct_black</th>\n",
                            "      <th>pct_white</th>\n",
                            "      <th>pct_hispanic</th>\n",
                            "      <th>pct_less_than_hs</th>\n",
                            "      <th>pct_under_150_poverty</th>\n",
                            "      <th>income</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>0</th>\n",
                            "      <td>1001020100</td>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>20100</td>\n",
                            "      <td>73.1000</td>\n",
                            "      <td>2.2348</td>\n",
                            "      <td>3</td>\n",
                            "      <td>1001020100</td>\n",
                            "      <td>1554</td>\n",
                            "      <td>54</td>\n",
                            "      <td>3.4749</td>\n",
                            "      <td>7.7002</td>\n",
                            "      <td>87.4230</td>\n",
                            "      <td>0.8727</td>\n",
                            "      <td>14.8029</td>\n",
                            "      <td>9.3429</td>\n",
                            "      <td>61,838.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1</th>\n",
                            "      <td>1001020200</td>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>20200</td>\n",
                            "      <td>76.9000</td>\n",
                            "      <td>3.3453</td>\n",
                            "      <td>3</td>\n",
                            "      <td>1001020200</td>\n",
                            "      <td>1731</td>\n",
                            "      <td>116</td>\n",
                            "      <td>6.7013</td>\n",
                            "      <td>53.2931</td>\n",
                            "      <td>40.4453</td>\n",
                            "      <td>0.7885</td>\n",
                            "      <td>25.4832</td>\n",
                            "      <td>14.6243</td>\n",
                            "      <td>32,303.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>2</th>\n",
                            "      <td>1001020400</td>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>20400</td>\n",
                            "      <td>75.4000</td>\n",
                            "      <td>1.0216</td>\n",
                            "      <td>3</td>\n",
                            "      <td>1001020400</td>\n",
                            "      <td>3424</td>\n",
                            "      <td>216</td>\n",
                            "      <td>6.3084</td>\n",
                            "      <td>3.6627</td>\n",
                            "      <td>82.7945</td>\n",
                            "      <td>10.4906</td>\n",
                            "      <td>11.6937</td>\n",
                            "      <td>7.3027</td>\n",
                            "      <td>54,329.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>3</th>\n",
                            "      <td>1001020500</td>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>20500</td>\n",
                            "      <td>79.4000</td>\n",
                            "      <td>1.1768</td>\n",
                            "      <td>1</td>\n",
                            "      <td>1001020500</td>\n",
                            "      <td>8198</td>\n",
                            "      <td>221</td>\n",
                            "      <td>2.6958</td>\n",
                            "      <td>24.8444</td>\n",
                            "      <td>68.4567</td>\n",
                            "      <td>0.7433</td>\n",
                            "      <td>4.4451</td>\n",
                            "      <td>12.4964</td>\n",
                            "      <td>51,965.0000</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>4</th>\n",
                            "      <td>1001020600</td>\n",
                            "      <td>1</td>\n",
                            "      <td>1</td>\n",
                            "      <td>20600</td>\n",
                            "      <td>73.1000</td>\n",
                            "      <td>1.5519</td>\n",
                            "      <td>3</td>\n",
                            "      <td>1001020600</td>\n",
                            "      <td>2855</td>\n",
                            "      <td>190</td>\n",
                            "      <td>6.6550</td>\n",
                            "      <td>11.9190</td>\n",
                            "      <td>72.9161</td>\n",
                            "      <td>13.0615</td>\n",
                            "      <td>17.4873</td>\n",
                            "      <td>10.8543</td>\n",
                            "      <td>63,092.0000</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "     tract_id  STATE2KX  CNTY2KX  TRACT2KX  life_expectancy  \\\n",
                            "0  1001020100         1        1     20100          73.1000   \n",
                            "1  1001020200         1        1     20200          76.9000   \n",
                            "2  1001020400         1        1     20400          75.4000   \n",
                            "3  1001020500         1        1     20500          79.4000   \n",
                            "4  1001020600         1        1     20600          73.1000   \n",
                            "\n",
                            "   life_expectancy_std_err  flag    Geo_FIPS  total_pop  unemployed  \\\n",
                            "0                   2.2348     3  1001020100       1554          54   \n",
                            "1                   3.3453     3  1001020200       1731         116   \n",
                            "2                   1.0216     3  1001020400       3424         216   \n",
                            "3                   1.1768     1  1001020500       8198         221   \n",
                            "4                   1.5519     3  1001020600       2855         190   \n",
                            "\n",
                            "   pct_unemployment  pct_black  pct_white  pct_hispanic  pct_less_than_hs  \\\n",
                            "0            3.4749     7.7002    87.4230        0.8727           14.8029   \n",
                            "1            6.7013    53.2931    40.4453        0.7885           25.4832   \n",
                            "2            6.3084     3.6627    82.7945       10.4906           11.6937   \n",
                            "3            2.6958    24.8444    68.4567        0.7433            4.4451   \n",
                            "4            6.6550    11.9190    72.9161       13.0615           17.4873   \n",
                            "\n",
                            "   pct_under_150_poverty      income  \n",
                            "0                 9.3429 61,838.0000  \n",
                            "1                14.6243 32,303.0000  \n",
                            "2                 7.3027 54,329.0000  \n",
                            "3                12.4964 51,965.0000  \n",
                            "4                10.8543 63,092.0000  "
                        ]
                    },
                    "execution_count": 72,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "merged = life_expec.merge(employment, left_on='tract_id', right_on='Geo_FIPS')\n",
                "merged = merged.merge(census_features, left_on='Geo_FIPS', right_on='Geo_FIPS')\n",
                "merged.head()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Running the regression\n",
                "\n",
                "Using the `statsmodels` package, we'll run a linear regression to find the relationship between **life expectancy** and our **calculated columns**. Note that we're using the **formula method** of writing a regression instead of the dataframes method. I find it both more readable and more useable than the dataframes method.\n",
                "\n",
                "> Python note: I'm also using a [multiline string](https://stackoverflow.com/posts/10660443/revisions) which is started and ended using three quotation marks. Multi-line strings and formulas are a match made in heaven!"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 74,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<table class=\"simpletable\">\n",
                            "<caption>OLS Regression Results</caption>\n",
                            "<tr>\n",
                            "  <th>Dep. Variable:</th>     <td>life_expectancy</td> <th>  R-squared:         </th>  <td>   0.490</td>  \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th>  <td>   0.490</td>  \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th>  <td>   8997.</td>  \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Date:</th>             <td>Thu, 12 Dec 2019</td> <th>  Prob (F-statistic):</th>   <td>  0.00</td>   \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Time:</th>                 <td>21:50:03</td>     <th>  Log-Likelihood:    </th> <td>-1.6208e+05</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>No. Observations:</th>      <td> 65656</td>      <th>  AIC:               </th>  <td>3.242e+05</td> \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Df Residuals:</th>          <td> 65648</td>      <th>  BIC:               </th>  <td>3.243e+05</td> \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Df Model:</th>              <td>     7</td>      <th>                     </th>      <td> </td>     \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>      <td> </td>     \n",
                            "</tr>\n",
                            "</table>\n",
                            "<table class=\"simpletable\">\n",
                            "<tr>\n",
                            "            <td></td>               <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Intercept</th>             <td>   81.2365</td> <td>    0.122</td> <td>  665.628</td> <td> 0.000</td> <td>   80.997</td> <td>   81.476</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>pct_black</th>             <td>   -0.0666</td> <td>    0.001</td> <td>  -56.960</td> <td> 0.000</td> <td>   -0.069</td> <td>   -0.064</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>pct_white</th>             <td>   -0.0386</td> <td>    0.001</td> <td>  -36.707</td> <td> 0.000</td> <td>   -0.041</td> <td>   -0.037</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>pct_hispanic</th>          <td>    0.0131</td> <td>    0.001</td> <td>   10.298</td> <td> 0.000</td> <td>    0.011</td> <td>    0.016</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>pct_less_than_hs</th>      <td>   -0.0862</td> <td>    0.002</td> <td>  -48.979</td> <td> 0.000</td> <td>   -0.090</td> <td>   -0.083</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>pct_under_150_poverty</th> <td>   -0.0596</td> <td>    0.003</td> <td>  -21.738</td> <td> 0.000</td> <td>   -0.065</td> <td>   -0.054</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>income</th>                <td> 4.825e-05</td> <td>  5.8e-07</td> <td>   83.217</td> <td> 0.000</td> <td> 4.71e-05</td> <td> 4.94e-05</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>pct_unemployment</th>      <td>   -0.1490</td> <td>    0.004</td> <td>  -33.408</td> <td> 0.000</td> <td>   -0.158</td> <td>   -0.140</td>\n",
                            "</tr>\n",
                            "</table>\n",
                            "<table class=\"simpletable\">\n",
                            "<tr>\n",
                            "  <th>Omnibus:</th>       <td>2114.193</td> <th>  Durbin-Watson:     </th> <td>   1.520</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Prob(Omnibus):</th>  <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td>4788.035</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Skew:</th>           <td> 0.183</td>  <th>  Prob(JB):          </th> <td>    0.00</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Kurtosis:</th>       <td> 4.271</td>  <th>  Cond. No.          </th> <td>7.06e+05</td>\n",
                            "</tr>\n",
                            "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 7.06e+05. This might indicate that there are<br/>strong multicollinearity or other numerical problems."
                        ],
                        "text/plain": [
                            "<class 'statsmodels.iolib.summary.Summary'>\n",
                            "\"\"\"\n",
                            "                            OLS Regression Results                            \n",
                            "==============================================================================\n",
                            "Dep. Variable:        life_expectancy   R-squared:                       0.490\n",
                            "Model:                            OLS   Adj. R-squared:                  0.490\n",
                            "Method:                 Least Squares   F-statistic:                     8997.\n",
                            "Date:                Thu, 12 Dec 2019   Prob (F-statistic):               0.00\n",
                            "Time:                        21:50:03   Log-Likelihood:            -1.6208e+05\n",
                            "No. Observations:               65656   AIC:                         3.242e+05\n",
                            "Df Residuals:                   65648   BIC:                         3.243e+05\n",
                            "Df Model:                           7                                         \n",
                            "Covariance Type:            nonrobust                                         \n",
                            "=========================================================================================\n",
                            "                            coef    std err          t      P>|t|      [0.025      0.975]\n",
                            "-----------------------------------------------------------------------------------------\n",
                            "Intercept                81.2365      0.122    665.628      0.000      80.997      81.476\n",
                            "pct_black                -0.0666      0.001    -56.960      0.000      -0.069      -0.064\n",
                            "pct_white                -0.0386      0.001    -36.707      0.000      -0.041      -0.037\n",
                            "pct_hispanic              0.0131      0.001     10.298      0.000       0.011       0.016\n",
                            "pct_less_than_hs         -0.0862      0.002    -48.979      0.000      -0.090      -0.083\n",
                            "pct_under_150_poverty    -0.0596      0.003    -21.738      0.000      -0.065      -0.054\n",
                            "income                 4.825e-05    5.8e-07     83.217      0.000    4.71e-05    4.94e-05\n",
                            "pct_unemployment         -0.1490      0.004    -33.408      0.000      -0.158      -0.140\n",
                            "==============================================================================\n",
                            "Omnibus:                     2114.193   Durbin-Watson:                   1.520\n",
                            "Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4788.035\n",
                            "Skew:                           0.183   Prob(JB):                         0.00\n",
                            "Kurtosis:                       4.271   Cond. No.                     7.06e+05\n",
                            "==============================================================================\n",
                            "\n",
                            "Warnings:\n",
                            "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
                            "[2] The condition number is large, 7.06e+05. This might indicate that there are\n",
                            "strong multicollinearity or other numerical problems.\n",
                            "\"\"\""
                        ]
                    },
                    "execution_count": 74,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "import statsmodels.formula.api as smf\n",
                "\n",
                "model = smf.ols(\"\"\"\n",
                "    life_expectancy ~ pct_black + pct_white + pct_hispanic + pct_less_than_hs\n",
                "        + pct_under_150_poverty + income + pct_unemployment\n",
                "\"\"\", data=merged)\n",
                "\n",
                "results = model.fit()\n",
                "results.summary()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "It's a lot to take in, so I'll zero us in on the important parts:\n",
                "    \n",
                "<img src=\"\">"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "The coefficient for `pct_unemployment` is -0.15, which means **\"for every percentage point increase in unemployment, life expectancy drops 0.15 years.\"** While this might be _correct_ it unfortunately sounds very stupid.\n",
                "\n",
                "Income's coefficient is rather unfortunate, too. \"For every extra dollar in median income, life expectancy goes up 0.00004825 years.\" No thanks! Individual dollars don't really make sense here.\n",
                "\n",
                "Instead, it'd be nice to say something like **\"for every ten point increase of unemployment,\"** or **\"for every 10k increase in income.\"** If we were using the dataframes version of regression, we'd create new columns, maybe something like this:\n",
                "\n",
                "```python\n",
                "merged['income_10k'] = merged.income / 10000\n",
                "merged['unemployment_pct_10pts'] = merged.pct_unemployment / 10\n",
                "```\n",
                "\n",
                "Since we're using the formulas method, though, we can do the division right in the regression!"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 75,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<table class=\"simpletable\">\n",
                            "<caption>OLS Regression Results</caption>\n",
                            "<tr>\n",
                            "  <th>Dep. Variable:</th>     <td>life_expectancy</td> <th>  R-squared:         </th>  <td>   0.490</td>  \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Model:</th>                   <td>OLS</td>       <th>  Adj. R-squared:    </th>  <td>   0.490</td>  \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Method:</th>             <td>Least Squares</td>  <th>  F-statistic:       </th>  <td>   8997.</td>  \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Date:</th>             <td>Thu, 12 Dec 2019</td> <th>  Prob (F-statistic):</th>   <td>  0.00</td>   \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Time:</th>                 <td>21:50:50</td>     <th>  Log-Likelihood:    </th> <td>-1.6208e+05</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>No. Observations:</th>      <td> 65656</td>      <th>  AIC:               </th>  <td>3.242e+05</td> \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Df Residuals:</th>          <td> 65648</td>      <th>  BIC:               </th>  <td>3.243e+05</td> \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Df Model:</th>              <td>     7</td>      <th>                     </th>      <td> </td>     \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Covariance Type:</th>      <td>nonrobust</td>    <th>                     </th>      <td> </td>     \n",
                            "</tr>\n",
                            "</table>\n",
                            "<table class=\"simpletable\">\n",
                            "<tr>\n",
                            "                 <td></td>                    <th>coef</th>     <th>std err</th>      <th>t</th>      <th>P>|t|</th>  <th>[0.025</th>    <th>0.975]</th>  \n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Intercept</th>                       <td>   81.2365</td> <td>    0.122</td> <td>  665.628</td> <td> 0.000</td> <td>   80.997</td> <td>   81.476</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>pct_black</th>                       <td>   -0.0666</td> <td>    0.001</td> <td>  -56.960</td> <td> 0.000</td> <td>   -0.069</td> <td>   -0.064</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>pct_white</th>                       <td>   -0.0386</td> <td>    0.001</td> <td>  -36.707</td> <td> 0.000</td> <td>   -0.041</td> <td>   -0.037</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>pct_hispanic</th>                    <td>    0.0131</td> <td>    0.001</td> <td>   10.298</td> <td> 0.000</td> <td>    0.011</td> <td>    0.016</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>pct_less_than_hs</th>                <td>   -0.0862</td> <td>    0.002</td> <td>  -48.979</td> <td> 0.000</td> <td>   -0.090</td> <td>   -0.083</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>pct_under_150_poverty</th>           <td>   -0.0596</td> <td>    0.003</td> <td>  -21.738</td> <td> 0.000</td> <td>   -0.065</td> <td>   -0.054</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>np.divide(income, 10000)</th>        <td>    0.4825</td> <td>    0.006</td> <td>   83.217</td> <td> 0.000</td> <td>    0.471</td> <td>    0.494</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>np.divide(pct_unemployment, 10)</th> <td>   -1.4900</td> <td>    0.045</td> <td>  -33.408</td> <td> 0.000</td> <td>   -1.577</td> <td>   -1.403</td>\n",
                            "</tr>\n",
                            "</table>\n",
                            "<table class=\"simpletable\">\n",
                            "<tr>\n",
                            "  <th>Omnibus:</th>       <td>2114.193</td> <th>  Durbin-Watson:     </th> <td>   1.520</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Prob(Omnibus):</th>  <td> 0.000</td>  <th>  Jarque-Bera (JB):  </th> <td>4788.035</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Skew:</th>           <td> 0.183</td>  <th>  Prob(JB):          </th> <td>    0.00</td>\n",
                            "</tr>\n",
                            "<tr>\n",
                            "  <th>Kurtosis:</th>       <td> 4.271</td>  <th>  Cond. No.          </th> <td>    792.</td>\n",
                            "</tr>\n",
                            "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified."
                        ],
                        "text/plain": [
                            "<class 'statsmodels.iolib.summary.Summary'>\n",
                            "\"\"\"\n",
                            "                            OLS Regression Results                            \n",
                            "==============================================================================\n",
                            "Dep. Variable:        life_expectancy   R-squared:                       0.490\n",
                            "Model:                            OLS   Adj. R-squared:                  0.490\n",
                            "Method:                 Least Squares   F-statistic:                     8997.\n",
                            "Date:                Thu, 12 Dec 2019   Prob (F-statistic):               0.00\n",
                            "Time:                        21:50:50   Log-Likelihood:            -1.6208e+05\n",
                            "No. Observations:               65656   AIC:                         3.242e+05\n",
                            "Df Residuals:                   65648   BIC:                         3.243e+05\n",
                            "Df Model:                           7                                         \n",
                            "Covariance Type:            nonrobust                                         \n",
                            "===================================================================================================\n",
                            "                                      coef    std err          t      P>|t|      [0.025      0.975]\n",
                            "---------------------------------------------------------------------------------------------------\n",
                            "Intercept                          81.2365      0.122    665.628      0.000      80.997      81.476\n",
                            "pct_black                          -0.0666      0.001    -56.960      0.000      -0.069      -0.064\n",
                            "pct_white                          -0.0386      0.001    -36.707      0.000      -0.041      -0.037\n",
                            "pct_hispanic                        0.0131      0.001     10.298      0.000       0.011       0.016\n",
                            "pct_less_than_hs                   -0.0862      0.002    -48.979      0.000      -0.090      -0.083\n",
                            "pct_under_150_poverty              -0.0596      0.003    -21.738      0.000      -0.065      -0.054\n",
                            "np.divide(income, 10000)            0.4825      0.006     83.217      0.000       0.471       0.494\n",
                            "np.divide(pct_unemployment, 10)    -1.4900      0.045    -33.408      0.000      -1.577      -1.403\n",
                            "==============================================================================\n",
                            "Omnibus:                     2114.193   Durbin-Watson:                   1.520\n",
                            "Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4788.035\n",
                            "Skew:                           0.183   Prob(JB):                         0.00\n",
                            "Kurtosis:                       4.271   Cond. No.                         792.\n",
                            "==============================================================================\n",
                            "\n",
                            "Warnings:\n",
                            "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n",
                            "\"\"\""
                        ]
                    },
                    "execution_count": 75,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "model = smf.ols(\"\"\"\n",
                "    life_expectancy ~ \n",
                "        pct_black \n",
                "        + pct_white \n",
                "        + pct_hispanic \n",
                "        + pct_less_than_hs \n",
                "        + pct_under_150_poverty \n",
                "        + np.divide(income, 10000)\n",
                "        + np.divide(pct_unemployment, 10)\n",
                "\"\"\", data=merged)\n",
                "\n",
                "results = model.fit()\n",
                "results.summary()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "**That's much much nicer!** Unemployment now has a -1.49 coefficient, which (surprise!) leads us right back to the [original article](https://apnews.com/66ac44186b6249709501f07a7eab36da):\n",
                "\n",
                "> An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy, the AP found. \n",
                "\n",
                "Additionally, the negative coefficient for `pct_less_than_hs` is the source of this line:\n",
                "\n",
                "> A neighborhood where more adults failed to graduate high school had shorter predicted longevity.\n",
                "\n",
                "Even though we accomplished this pretty quickly in very few lines of code, the actual process at the Associated Press was somewhat more complicated. They didn't just throw a bunch of columns in a regression!\n",
                "\n",
                "Stats-wise, the AP did things like examined other columns that didn't meet a p-value threshold, and reasoned about [collinearity](http://www.stat.tamu.edu/~hart/652/collinear.pdf) between variables different. For domain expertise, they included a threshold for poverty that's actually _above_ the poverty line. This is due to the way that Medicaid provides health insurance to people in low-income households.\n",
                "\n",
                "## Review\n",
                "\n",
                "In this exercise, we reproduced [an article from the Associated Press](https://apnews.com/66ac44186b6249709501f07a7eab36da) that analyzed life expectancy and demographic information from the census. They used a linear regression to find the relationship between census tract qualities like unemployment, education, race, and income and how long people live.\n",
                "\n",
                "After running the regression once, we ran it a second time to get numbers that were **more human and easier to use in a story**, like a \"1.5 year decrease in life expectancy\" as opposed to a 0.15-year or 8-week decrease. We also used the **formula version** of a statsmodels linear regression to perform those calculations in the regression with `np.divide`."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Discussion topics\n",
                "\n",
                "Let's say we have the sentence \"a 1 percentage point increase in unemployment translates to a 0.15 year decrease in life expectancy.\" How do we change this into something people can appreciate?\n",
                "\n",
                "Translate some of your coefficients into the form **\"every X percentage point change in unemployment translates to a Y change in life expectancy.\"** Do this with numbers that are meaningful, and in a way that is easily understandable to your reader.\n",
                "\n",
                "What is the difference between using percentage points (0 to 100) vs fractions (0 to 1) when doing a regression analysis?\n",
                "\n",
                "Income, education, and race are all related, which can cause problems related to [multicollinearity](https://www.statisticssolutions.com/multicollinearity/) in our analysis. How can we avoid this?\n",
                "\n",
                "We should consult someone who \"knows something\" before publishing this story. Where would we go to find someone who understands statistics and the subject matter?"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": []
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.6.8"
        },
        "toc": {
            "base_numbering": 1,
            "nav_menu": {},
            "number_sections": true,
            "sideBar": true,
            "skip_h1_title": false,
            "title_cell": "Table of Contents",
            "title_sidebar": "Contents",
            "toc_cell": false,
            "toc_position": {},
            "toc_section_display": true,
            "toc_window_display": false
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}