{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Linear regression with the Associated Press\n", "\n", "In [this piece from the Associated Press](https://apnews.com/66ac44186b6249709501f07a7eab36da), Nicky Forster combines from the US Census Bureau and the CDC to see how life expectancy is related to actors like unemployment, income, and others. We'll be looking at how they can write sentences like this one: \n", "\n", "> \"An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy, the AP found.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<p class=\"reading-options\">\n <a class=\"btn\" href=\"/ap-regression-unemployment/simple-regression-with-census-data-statsmodels-with-formulas\">\n <i class=\"fa fa-sm fa-book\"></i>\n Read online\n </a>\n <a class=\"btn\" href=\"/ap-regression-unemployment/notebooks/Simple regression with census data (statsmodels with formulas).ipynb\">\n <i class=\"fa fa-sm fa-download\"></i>\n Download notebook\n </a>\n <a class=\"btn\" href=\"https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/ap-regression-unemployment/notebooks/Simple regression with census data (statsmodels with formulas).ipynb\" target=\"_new\">\n <i class=\"fa fa-sm fa-laptop\"></i>\n Interactive version\n </a>\n</p>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prep work: Downloading necessary files\n", "Before we get started, we need to download all of the data we'll be using.\n", "* **R12221544_SL140.csv:** Employment status by census tract - ACS 2015 5-year, tract level, table B23025\n", "* **R12221544.txt:** Data dictionary for R12221544_SL140.csv - Description of columns and tables\n", "* **R12221550_SL140.csv:** Demographic data by census tract - ACS 2015 5-year, tract level, tables B23025, B06009, B03002, B19013, C17002\n", "* **R12221550.txt:** Data dictionary for R12221550_SL140.csv - Description of columns and tables\n", "* **US_A.CSV:** USALEEP national data - estimates of life expectancy at birth for most of the census tracts in the USA\n", "* **Record_Layout_CensusTract_Life_Expectancy.pdf:** USALEEP data dictionary - None\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Make data directory if it doesn't exist\n", "!mkdir -p data\n", "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/R12221544_SL140.csv.zip -P data\n", "!unzip -n -d data data/R12221544_SL140.csv.zip\n", "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/R12221544.txt -P data\n", "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/R12221550_SL140.csv.zip -P data\n", "!unzip -n -d data data/R12221550_SL140.csv.zip\n", "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/R12221550.txt -P data\n", "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/US_A.CSV -P data\n", "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/ap-regression-unemployment/data/Record_Layout_CensusTract_Life_Expectancy.pdf -P data" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "# Keep decimal numbers to 4 decimal places\n", "pd.set_option(\"display.float_format\", '{:,.4f}'.format)\n", "pd.set_option(\"display.max_columns\", 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading in our data\n", "\n", "We'll start by reading in our datasets (as we must!).\n", "\n", "### Life expectancy data\n", "\n", "The first dataset is [USALEEP](https://www.cdc.gov/nchs/nvss/usaleep/usaleep.html), the U.S. Small-area Life Expectancy Estimates Project. It's from the National Center for Health Statistics at the CDC.\n", "\n", "We're going to rename a few columns so they make a little more sense. The columns themselves aren't too complicated, mostly pointing out the census tracts and the life expectancy." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>tract_id</th>\n", " <th>STATE2KX</th>\n", " <th>CNTY2KX</th>\n", " <th>TRACT2KX</th>\n", " <th>life_expectancy</th>\n", " <th>life_expectancy_std_err</th>\n", " <th>flag</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1001020100</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>20100</td>\n", " <td>73.1000</td>\n", " <td>2.2348</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1001020200</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>20200</td>\n", " <td>76.9000</td>\n", " <td>3.3453</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1001020400</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>20400</td>\n", " <td>75.4000</td>\n", " <td>1.0216</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1001020500</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>20500</td>\n", " <td>79.4000</td>\n", " <td>1.1768</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1001020600</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>20600</td>\n", " <td>73.1000</td>\n", " <td>1.5519</td>\n", " <td>3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " tract_id STATE2KX CNTY2KX TRACT2KX life_expectancy \\\n", "0 1001020100 1 1 20100 73.1000 \n", "1 1001020200 1 1 20200 76.9000 \n", "2 1001020400 1 1 20400 75.4000 \n", "3 1001020500 1 1 20500 79.4000 \n", "4 1001020600 1 1 20600 73.1000 \n", "\n", " life_expectancy_std_err flag \n", "0 2.2348 3 \n", "1 3.3453 3 \n", "2 1.0216 3 \n", "3 1.1768 1 \n", "4 1.5519 3 " ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "life_expec = pd.read_csv(\"data/US_A.CSV\")\n", "life_expec.columns = ['tract_id', 'STATE2KX','CNTY2KX', 'TRACT2KX', 'life_expectancy', \n", " 'life_expectancy_std_err', 'flag']\n", "life_expec.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`flag` sounds a little mysterious and scary, but it's just whether the numbers were observed or predicted." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unemployment Data\n", "\n", "For unemployment, we'll be using data from the Census. You can get unemployment measures from a lot of places in the US government, so the Census Bureau actually [publishes data on which ones you should use](https://www.census.gov/topics/employment/labor-force/guidance.html). In our case we're going to collect unemployment data from the American Community Survery, for the 5-year period nearest to our life expectancy dataset.\n", "\n", "We'll be using Table B23025005, Employment Status for the Population 16 Years and Over. Take a look at [the codebook](https://www.socialexplorer.com/data/ACS2015_5yr/metadata/?ds=ACS15_5yr&var=B23025005) for more details. We're only reading in a few columns to keep things looking clean." ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Geo_FIPS</th>\n", " <th>total_pop</th>\n", " <th>unemployed</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1001020100</td>\n", " <td>1554</td>\n", " <td>54</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1001020200</td>\n", " <td>1731</td>\n", " <td>116</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1001020300</td>\n", " <td>2462</td>\n", " <td>91</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1001020400</td>\n", " <td>3424</td>\n", " <td>216</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1001020500</td>\n", " <td>8198</td>\n", " <td>221</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Geo_FIPS total_pop unemployed\n", "0 1001020100 1554 54\n", "1 1001020200 1731 116\n", "2 1001020300 2462 91\n", "3 1001020400 3424 216\n", "4 1001020500 8198 221" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = {\n", " 'Geo_FIPS': 'Geo_FIPS',\n", " 'ACS15_5yr_B23025001': 'total_pop',\n", " 'ACS15_5yr_B23025005': 'unemployed'\n", "}\n", "employment = pd.read_csv(\"data/R12221544_SL140.csv\", usecols=columns.keys(), encoding='latin-1')\n", "\n", "employment = employment.rename(columns=columns)\n", "employment.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the important thing is _percent_ unemployed, not just how many people are in an area, we'll need to do a little calculation.\n", "\n", "> **Note that I'm multiplying by 100 here** - if you have a litle extra time, try running this notebook on your own with a 0-1.0 percentage instead of the 0-100 version." ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Geo_FIPS</th>\n", " <th>total_pop</th>\n", " <th>unemployed</th>\n", " <th>pct_unemployment</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1001020100</td>\n", " <td>1554</td>\n", " <td>54</td>\n", " <td>3.4749</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1001020200</td>\n", " <td>1731</td>\n", " <td>116</td>\n", " <td>6.7013</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1001020300</td>\n", " <td>2462</td>\n", " <td>91</td>\n", " <td>3.6962</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1001020400</td>\n", " <td>3424</td>\n", " <td>216</td>\n", " <td>6.3084</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1001020500</td>\n", " <td>8198</td>\n", " <td>221</td>\n", " <td>2.6958</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Geo_FIPS total_pop unemployed pct_unemployment\n", "0 1001020100 1554 54 3.4749\n", "1 1001020200 1731 116 6.7013\n", "2 1001020300 2462 91 3.6962\n", "3 1001020400 3424 216 6.3084\n", "4 1001020500 8198 221 2.6958" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "employment['pct_unemployment'] = employment['unemployed'] / employment['total_pop'] * 100\n", "employment.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Demographic data\n", "\n", "The other dataset we're using is also from the Census Bureau's American Community Survey. It's a collection of many more columns with very, very long names and very, very mysterious codes. The tables include:\n", "\n", "* [B03002](https://censusreporter.org/tables/B03002/): Hispanic or Latino Origin by Race\n", "* [B06009](https://censusreporter.org/tables/B06009/): Place of Birth by Educational Attainment in the United States\n", "* [C17002](https://censusreporter.org/tables/C17002/): Ratio of Income to Poverty Level\n", "* [B19013](https://censusreporter.org/tables/B19013/): Median Household Income\n", "\n", "Again, we're only picking a few columns to read in. Generally they're the raw count of certain populations, as well as the total population counted of each table." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Geo_FIPS</th>\n", " <th>race_table_total</th>\n", " <th>white</th>\n", " <th>black</th>\n", " <th>hispanic</th>\n", " <th>edu_table_total</th>\n", " <th>less_than_hs</th>\n", " <th>poverty_table_total</th>\n", " <th>poverty_100_124</th>\n", " <th>poverty_125_149</th>\n", " <th>income</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1001020100</td>\n", " <td>1948</td>\n", " <td>1703</td>\n", " <td>150</td>\n", " <td>17</td>\n", " <td>1,243.0000</td>\n", " <td>184.0000</td>\n", " <td>1948</td>\n", " <td>81</td>\n", " <td>101</td>\n", " <td>61,838.0000</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1001020200</td>\n", " <td>2156</td>\n", " <td>872</td>\n", " <td>1149</td>\n", " <td>17</td>\n", " <td>1,397.0000</td>\n", " <td>356.0000</td>\n", " <td>1983</td>\n", " <td>232</td>\n", " <td>58</td>\n", " <td>32,303.0000</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1001020300</td>\n", " <td>2968</td>\n", " <td>2212</td>\n", " <td>551</td>\n", " <td>0</td>\n", " <td>2,074.0000</td>\n", " <td>221.0000</td>\n", " <td>2968</td>\n", " <td>148</td>\n", " <td>207</td>\n", " <td>44,922.0000</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1001020400</td>\n", " <td>4423</td>\n", " <td>3662</td>\n", " <td>162</td>\n", " <td>464</td>\n", " <td>2,899.0000</td>\n", " <td>339.0000</td>\n", " <td>4423</td>\n", " <td>141</td>\n", " <td>182</td>\n", " <td>54,329.0000</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1001020500</td>\n", " <td>10763</td>\n", " <td>7368</td>\n", " <td>2674</td>\n", " <td>80</td>\n", " <td>6,974.0000</td>\n", " <td>310.0000</td>\n", " <td>10563</td>\n", " <td>256</td>\n", " <td>1064</td>\n", " <td>51,965.0000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Geo_FIPS race_table_total white black hispanic edu_table_total \\\n", "0 1001020100 1948 1703 150 17 1,243.0000 \n", "1 1001020200 2156 872 1149 17 1,397.0000 \n", "2 1001020300 2968 2212 551 0 2,074.0000 \n", "3 1001020400 4423 3662 162 464 2,899.0000 \n", "4 1001020500 10763 7368 2674 80 6,974.0000 \n", "\n", " less_than_hs poverty_table_total poverty_100_124 poverty_125_149 \\\n", "0 184.0000 1948 81 101 \n", "1 356.0000 1983 232 58 \n", "2 221.0000 2968 148 207 \n", "3 339.0000 4423 141 182 \n", "4 310.0000 10563 256 1064 \n", "\n", " income \n", "0 61,838.0000 \n", "1 32,303.0000 \n", "2 44,922.0000 \n", "3 54,329.0000 \n", "4 51,965.0000 " ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = {\n", " 'Geo_FIPS': 'Geo_FIPS',\n", " 'ACS15_5yr_B03002001': 'race_table_total',\n", " 'ACS15_5yr_B03002004': 'black',\n", " 'ACS15_5yr_B03002003': 'white',\n", " 'ACS15_5yr_B03002012': 'hispanic',\n", " 'ACS15_5yr_B06009001': 'edu_table_total',\n", " 'ACS15_5yr_B06009002': 'less_than_hs',\n", " 'ACS15_5yr_C17002004': 'poverty_100_124',\n", " 'ACS15_5yr_C17002005': 'poverty_125_149',\n", " 'ACS15_5yr_C17002001': 'poverty_table_total',\n", " 'ACS15_5yr_B19013001': 'income'\n", "} \n", "census = pd.read_csv(\"data/R12221550_SL140.csv\", usecols=columns.keys(), encoding='latin-1')\n", "census = census.rename(columns=columns)\n", "census.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll again convert these **raw population numbers into percentages** - what percent of people are certain races? What percent of people have not finished high school?" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Geo_FIPS</th>\n", " <th>pct_black</th>\n", " <th>pct_white</th>\n", " <th>pct_hispanic</th>\n", " <th>pct_less_than_hs</th>\n", " <th>pct_under_150_poverty</th>\n", " <th>income</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1001020100</td>\n", " <td>7.7002</td>\n", " <td>87.4230</td>\n", " <td>0.8727</td>\n", " <td>14.8029</td>\n", " <td>9.3429</td>\n", " <td>61,838.0000</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1001020200</td>\n", " <td>53.2931</td>\n", " <td>40.4453</td>\n", " <td>0.7885</td>\n", " <td>25.4832</td>\n", " <td>14.6243</td>\n", " <td>32,303.0000</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1001020300</td>\n", " <td>18.5647</td>\n", " <td>74.5283</td>\n", " <td>0.0000</td>\n", " <td>10.6557</td>\n", " <td>11.9609</td>\n", " <td>44,922.0000</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1001020400</td>\n", " <td>3.6627</td>\n", " <td>82.7945</td>\n", " <td>10.4906</td>\n", " <td>11.6937</td>\n", " <td>7.3027</td>\n", " <td>54,329.0000</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1001020500</td>\n", " <td>24.8444</td>\n", " <td>68.4567</td>\n", " <td>0.7433</td>\n", " <td>4.4451</td>\n", " <td>12.4964</td>\n", " <td>51,965.0000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Geo_FIPS pct_black pct_white pct_hispanic pct_less_than_hs \\\n", "0 1001020100 7.7002 87.4230 0.8727 14.8029 \n", "1 1001020200 53.2931 40.4453 0.7885 25.4832 \n", "2 1001020300 18.5647 74.5283 0.0000 10.6557 \n", "3 1001020400 3.6627 82.7945 10.4906 11.6937 \n", "4 1001020500 24.8444 68.4567 0.7433 4.4451 \n", "\n", " pct_under_150_poverty income \n", "0 9.3429 61,838.0000 \n", "1 14.6243 32,303.0000 \n", "2 11.9609 44,922.0000 \n", "3 7.3027 54,329.0000 \n", "4 12.4964 51,965.0000 " ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census_features = pd.DataFrame({\n", " 'Geo_FIPS': census.Geo_FIPS,\n", " 'pct_black': census.black / census.race_table_total * 100,\n", " 'pct_white': census.white / census.race_table_total * 100,\n", " 'pct_hispanic': census.hispanic / census.race_table_total * 100,\n", " 'pct_less_than_hs': census.less_than_hs / census.edu_table_total * 100,\n", " 'pct_under_150_poverty': (census.poverty_100_124 + census.poverty_125_149) / census.poverty_table_total * 100,\n", " 'income': census.income,\n", "})\n", "census_features.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Merge the data\n", "\n", "To perform our regression, we need all of our data in **a single dataframe**. Fortunately these dataframes all keep track of census tracks with [FIPS codes](https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt), unique geographic codes that we can rely on to merge our datasets." ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>tract_id</th>\n", " <th>STATE2KX</th>\n", " <th>CNTY2KX</th>\n", " <th>TRACT2KX</th>\n", " <th>life_expectancy</th>\n", " <th>life_expectancy_std_err</th>\n", " <th>flag</th>\n", " <th>Geo_FIPS</th>\n", " <th>total_pop</th>\n", " <th>unemployed</th>\n", " <th>pct_unemployment</th>\n", " <th>pct_black</th>\n", " <th>pct_white</th>\n", " <th>pct_hispanic</th>\n", " <th>pct_less_than_hs</th>\n", " <th>pct_under_150_poverty</th>\n", " <th>income</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1001020100</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>20100</td>\n", " <td>73.1000</td>\n", " <td>2.2348</td>\n", " <td>3</td>\n", " <td>1001020100</td>\n", " <td>1554</td>\n", " <td>54</td>\n", " <td>3.4749</td>\n", " <td>7.7002</td>\n", " <td>87.4230</td>\n", " <td>0.8727</td>\n", " <td>14.8029</td>\n", " <td>9.3429</td>\n", " <td>61,838.0000</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>1001020200</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>20200</td>\n", " <td>76.9000</td>\n", " <td>3.3453</td>\n", " <td>3</td>\n", " <td>1001020200</td>\n", " <td>1731</td>\n", " <td>116</td>\n", " <td>6.7013</td>\n", " <td>53.2931</td>\n", " <td>40.4453</td>\n", " <td>0.7885</td>\n", " <td>25.4832</td>\n", " <td>14.6243</td>\n", " <td>32,303.0000</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>1001020400</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>20400</td>\n", " <td>75.4000</td>\n", " <td>1.0216</td>\n", " <td>3</td>\n", " <td>1001020400</td>\n", " <td>3424</td>\n", " <td>216</td>\n", " <td>6.3084</td>\n", " <td>3.6627</td>\n", " <td>82.7945</td>\n", " <td>10.4906</td>\n", " <td>11.6937</td>\n", " <td>7.3027</td>\n", " <td>54,329.0000</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>1001020500</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>20500</td>\n", " <td>79.4000</td>\n", " <td>1.1768</td>\n", " <td>1</td>\n", " <td>1001020500</td>\n", " <td>8198</td>\n", " <td>221</td>\n", " <td>2.6958</td>\n", " <td>24.8444</td>\n", " <td>68.4567</td>\n", " <td>0.7433</td>\n", " <td>4.4451</td>\n", " <td>12.4964</td>\n", " <td>51,965.0000</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>1001020600</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>20600</td>\n", " <td>73.1000</td>\n", " <td>1.5519</td>\n", " <td>3</td>\n", " <td>1001020600</td>\n", " <td>2855</td>\n", " <td>190</td>\n", " <td>6.6550</td>\n", " <td>11.9190</td>\n", " <td>72.9161</td>\n", " <td>13.0615</td>\n", " <td>17.4873</td>\n", " <td>10.8543</td>\n", " <td>63,092.0000</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " tract_id STATE2KX CNTY2KX TRACT2KX life_expectancy \\\n", "0 1001020100 1 1 20100 73.1000 \n", "1 1001020200 1 1 20200 76.9000 \n", "2 1001020400 1 1 20400 75.4000 \n", "3 1001020500 1 1 20500 79.4000 \n", "4 1001020600 1 1 20600 73.1000 \n", "\n", " life_expectancy_std_err flag Geo_FIPS total_pop unemployed \\\n", "0 2.2348 3 1001020100 1554 54 \n", "1 3.3453 3 1001020200 1731 116 \n", "2 1.0216 3 1001020400 3424 216 \n", "3 1.1768 1 1001020500 8198 221 \n", "4 1.5519 3 1001020600 2855 190 \n", "\n", " pct_unemployment pct_black pct_white pct_hispanic pct_less_than_hs \\\n", "0 3.4749 7.7002 87.4230 0.8727 14.8029 \n", "1 6.7013 53.2931 40.4453 0.7885 25.4832 \n", "2 6.3084 3.6627 82.7945 10.4906 11.6937 \n", "3 2.6958 24.8444 68.4567 0.7433 4.4451 \n", "4 6.6550 11.9190 72.9161 13.0615 17.4873 \n", "\n", " pct_under_150_poverty income \n", "0 9.3429 61,838.0000 \n", "1 14.6243 32,303.0000 \n", "2 7.3027 54,329.0000 \n", "3 12.4964 51,965.0000 \n", "4 10.8543 63,092.0000 " ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "merged = life_expec.merge(employment, left_on='tract_id', right_on='Geo_FIPS')\n", "merged = merged.merge(census_features, left_on='Geo_FIPS', right_on='Geo_FIPS')\n", "merged.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running the regression\n", "\n", "Using the `statsmodels` package, we'll run a linear regression to find the relationship between **life expectancy** and our **calculated columns**. Note that we're using the **formula method** of writing a regression instead of the dataframes method. I find it both more readable and more useable than the dataframes method.\n", "\n", "> Python note: I'm also using a [multiline string](https://stackoverflow.com/posts/10660443/revisions) which is started and ended using three quotation marks. Multi-line strings and formulas are a match made in heaven!" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<table class=\"simpletable\">\n", "<caption>OLS Regression Results</caption>\n", "<tr>\n", " <th>Dep. Variable:</th> <td>life_expectancy</td> <th> R-squared: </th> <td> 0.490</td> \n", "</tr>\n", "<tr>\n", " <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 0.490</td> \n", "</tr>\n", "<tr>\n", " <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td> 8997.</td> \n", "</tr>\n", "<tr>\n", " <th>Date:</th> <td>Thu, 12 Dec 2019</td> <th> Prob (F-statistic):</th> <td> 0.00</td> \n", "</tr>\n", "<tr>\n", " <th>Time:</th> <td>21:50:03</td> <th> Log-Likelihood: </th> <td>-1.6208e+05</td>\n", "</tr>\n", "<tr>\n", " <th>No. Observations:</th> <td> 65656</td> <th> AIC: </th> <td>3.242e+05</td> \n", "</tr>\n", "<tr>\n", " <th>Df Residuals:</th> <td> 65648</td> <th> BIC: </th> <td>3.243e+05</td> \n", "</tr>\n", "<tr>\n", " <th>Df Model:</th> <td> 7</td> <th> </th> <td> </td> \n", "</tr>\n", "<tr>\n", " <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n", "</tr>\n", "</table>\n", "<table class=\"simpletable\">\n", "<tr>\n", " <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[0.025</th> <th>0.975]</th> \n", "</tr>\n", "<tr>\n", " <th>Intercept</th> <td> 81.2365</td> <td> 0.122</td> <td> 665.628</td> <td> 0.000</td> <td> 80.997</td> <td> 81.476</td>\n", "</tr>\n", "<tr>\n", " <th>pct_black</th> <td> -0.0666</td> <td> 0.001</td> <td> -56.960</td> <td> 0.000</td> <td> -0.069</td> <td> -0.064</td>\n", "</tr>\n", "<tr>\n", " <th>pct_white</th> <td> -0.0386</td> <td> 0.001</td> <td> -36.707</td> <td> 0.000</td> <td> -0.041</td> <td> -0.037</td>\n", "</tr>\n", "<tr>\n", " <th>pct_hispanic</th> <td> 0.0131</td> <td> 0.001</td> <td> 10.298</td> <td> 0.000</td> <td> 0.011</td> <td> 0.016</td>\n", "</tr>\n", "<tr>\n", " <th>pct_less_than_hs</th> <td> -0.0862</td> <td> 0.002</td> <td> -48.979</td> <td> 0.000</td> <td> -0.090</td> <td> -0.083</td>\n", "</tr>\n", "<tr>\n", " <th>pct_under_150_poverty</th> <td> -0.0596</td> <td> 0.003</td> <td> -21.738</td> <td> 0.000</td> <td> -0.065</td> <td> -0.054</td>\n", "</tr>\n", "<tr>\n", " <th>income</th> <td> 4.825e-05</td> <td> 5.8e-07</td> <td> 83.217</td> <td> 0.000</td> <td> 4.71e-05</td> <td> 4.94e-05</td>\n", "</tr>\n", "<tr>\n", " <th>pct_unemployment</th> <td> -0.1490</td> <td> 0.004</td> <td> -33.408</td> <td> 0.000</td> <td> -0.158</td> <td> -0.140</td>\n", "</tr>\n", "</table>\n", "<table class=\"simpletable\">\n", "<tr>\n", " <th>Omnibus:</th> <td>2114.193</td> <th> Durbin-Watson: </th> <td> 1.520</td>\n", "</tr>\n", "<tr>\n", " <th>Prob(Omnibus):</th> <td> 0.000</td> <th> Jarque-Bera (JB): </th> <td>4788.035</td>\n", "</tr>\n", "<tr>\n", " <th>Skew:</th> <td> 0.183</td> <th> Prob(JB): </th> <td> 0.00</td>\n", "</tr>\n", "<tr>\n", " <th>Kurtosis:</th> <td> 4.271</td> <th> Cond. No. </th> <td>7.06e+05</td>\n", "</tr>\n", "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.<br/>[2] The condition number is large, 7.06e+05. This might indicate that there are<br/>strong multicollinearity or other numerical problems." ], "text/plain": [ "<class 'statsmodels.iolib.summary.Summary'>\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: life_expectancy R-squared: 0.490\n", "Model: OLS Adj. R-squared: 0.490\n", "Method: Least Squares F-statistic: 8997.\n", "Date: Thu, 12 Dec 2019 Prob (F-statistic): 0.00\n", "Time: 21:50:03 Log-Likelihood: -1.6208e+05\n", "No. Observations: 65656 AIC: 3.242e+05\n", "Df Residuals: 65648 BIC: 3.243e+05\n", "Df Model: 7 \n", "Covariance Type: nonrobust \n", "=========================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "-----------------------------------------------------------------------------------------\n", "Intercept 81.2365 0.122 665.628 0.000 80.997 81.476\n", "pct_black -0.0666 0.001 -56.960 0.000 -0.069 -0.064\n", "pct_white -0.0386 0.001 -36.707 0.000 -0.041 -0.037\n", "pct_hispanic 0.0131 0.001 10.298 0.000 0.011 0.016\n", "pct_less_than_hs -0.0862 0.002 -48.979 0.000 -0.090 -0.083\n", "pct_under_150_poverty -0.0596 0.003 -21.738 0.000 -0.065 -0.054\n", "income 4.825e-05 5.8e-07 83.217 0.000 4.71e-05 4.94e-05\n", "pct_unemployment -0.1490 0.004 -33.408 0.000 -0.158 -0.140\n", "==============================================================================\n", "Omnibus: 2114.193 Durbin-Watson: 1.520\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 4788.035\n", "Skew: 0.183 Prob(JB): 0.00\n", "Kurtosis: 4.271 Cond. No. 7.06e+05\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "[2] The condition number is large, 7.06e+05. This might indicate that there are\n", "strong multicollinearity or other numerical problems.\n", "\"\"\"" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import statsmodels.formula.api as smf\n", "\n", "model = smf.ols(\"\"\"\n", " life_expectancy ~ pct_black + pct_white + pct_hispanic + pct_less_than_hs\n", " + pct_under_150_poverty + income + pct_unemployment\n", "\"\"\", data=merged)\n", "\n", "results = model.fit()\n", "results.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's a lot to take in, so I'll zero us in on the important parts:\n", " \n", "<img src=\"\">" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The coefficient for `pct_unemployment` is -0.15, which means **\"for every percentage point increase in unemployment, life expectancy drops 0.15 years.\"** While this might be _correct_ it unfortunately sounds very stupid.\n", "\n", "Income's coefficient is rather unfortunate, too. \"For every extra dollar in median income, life expectancy goes up 0.00004825 years.\" No thanks! Individual dollars don't really make sense here.\n", "\n", "Instead, it'd be nice to say something like **\"for every ten point increase of unemployment,\"** or **\"for every 10k increase in income.\"** If we were using the dataframes version of regression, we'd create new columns, maybe something like this:\n", "\n", "```python\n", "merged['income_10k'] = merged.income / 10000\n", "merged['unemployment_pct_10pts'] = merged.pct_unemployment / 10\n", "```\n", "\n", "Since we're using the formulas method, though, we can do the division right in the regression!" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<table class=\"simpletable\">\n", "<caption>OLS Regression Results</caption>\n", "<tr>\n", " <th>Dep. Variable:</th> <td>life_expectancy</td> <th> R-squared: </th> <td> 0.490</td> \n", "</tr>\n", "<tr>\n", " <th>Model:</th> <td>OLS</td> <th> Adj. R-squared: </th> <td> 0.490</td> \n", "</tr>\n", "<tr>\n", " <th>Method:</th> <td>Least Squares</td> <th> F-statistic: </th> <td> 8997.</td> \n", "</tr>\n", "<tr>\n", " <th>Date:</th> <td>Thu, 12 Dec 2019</td> <th> Prob (F-statistic):</th> <td> 0.00</td> \n", "</tr>\n", "<tr>\n", " <th>Time:</th> <td>21:50:50</td> <th> Log-Likelihood: </th> <td>-1.6208e+05</td>\n", "</tr>\n", "<tr>\n", " <th>No. Observations:</th> <td> 65656</td> <th> AIC: </th> <td>3.242e+05</td> \n", "</tr>\n", "<tr>\n", " <th>Df Residuals:</th> <td> 65648</td> <th> BIC: </th> <td>3.243e+05</td> \n", "</tr>\n", "<tr>\n", " <th>Df Model:</th> <td> 7</td> <th> </th> <td> </td> \n", "</tr>\n", "<tr>\n", " <th>Covariance Type:</th> <td>nonrobust</td> <th> </th> <td> </td> \n", "</tr>\n", "</table>\n", "<table class=\"simpletable\">\n", "<tr>\n", " <td></td> <th>coef</th> <th>std err</th> <th>t</th> <th>P>|t|</th> <th>[0.025</th> <th>0.975]</th> \n", "</tr>\n", "<tr>\n", " <th>Intercept</th> <td> 81.2365</td> <td> 0.122</td> <td> 665.628</td> <td> 0.000</td> <td> 80.997</td> <td> 81.476</td>\n", "</tr>\n", "<tr>\n", " <th>pct_black</th> <td> -0.0666</td> <td> 0.001</td> <td> -56.960</td> <td> 0.000</td> <td> -0.069</td> <td> -0.064</td>\n", "</tr>\n", "<tr>\n", " <th>pct_white</th> <td> -0.0386</td> <td> 0.001</td> <td> -36.707</td> <td> 0.000</td> <td> -0.041</td> <td> -0.037</td>\n", "</tr>\n", "<tr>\n", " <th>pct_hispanic</th> <td> 0.0131</td> <td> 0.001</td> <td> 10.298</td> <td> 0.000</td> <td> 0.011</td> <td> 0.016</td>\n", "</tr>\n", "<tr>\n", " <th>pct_less_than_hs</th> <td> -0.0862</td> <td> 0.002</td> <td> -48.979</td> <td> 0.000</td> <td> -0.090</td> <td> -0.083</td>\n", "</tr>\n", "<tr>\n", " <th>pct_under_150_poverty</th> <td> -0.0596</td> <td> 0.003</td> <td> -21.738</td> <td> 0.000</td> <td> -0.065</td> <td> -0.054</td>\n", "</tr>\n", "<tr>\n", " <th>np.divide(income, 10000)</th> <td> 0.4825</td> <td> 0.006</td> <td> 83.217</td> <td> 0.000</td> <td> 0.471</td> <td> 0.494</td>\n", "</tr>\n", "<tr>\n", " <th>np.divide(pct_unemployment, 10)</th> <td> -1.4900</td> <td> 0.045</td> <td> -33.408</td> <td> 0.000</td> <td> -1.577</td> <td> -1.403</td>\n", "</tr>\n", "</table>\n", "<table class=\"simpletable\">\n", "<tr>\n", " <th>Omnibus:</th> <td>2114.193</td> <th> Durbin-Watson: </th> <td> 1.520</td>\n", "</tr>\n", "<tr>\n", " <th>Prob(Omnibus):</th> <td> 0.000</td> <th> Jarque-Bera (JB): </th> <td>4788.035</td>\n", "</tr>\n", "<tr>\n", " <th>Skew:</th> <td> 0.183</td> <th> Prob(JB): </th> <td> 0.00</td>\n", "</tr>\n", "<tr>\n", " <th>Kurtosis:</th> <td> 4.271</td> <th> Cond. No. </th> <td> 792.</td>\n", "</tr>\n", "</table><br/><br/>Warnings:<br/>[1] Standard Errors assume that the covariance matrix of the errors is correctly specified." ], "text/plain": [ "<class 'statsmodels.iolib.summary.Summary'>\n", "\"\"\"\n", " OLS Regression Results \n", "==============================================================================\n", "Dep. Variable: life_expectancy R-squared: 0.490\n", "Model: OLS Adj. R-squared: 0.490\n", "Method: Least Squares F-statistic: 8997.\n", "Date: Thu, 12 Dec 2019 Prob (F-statistic): 0.00\n", "Time: 21:50:50 Log-Likelihood: -1.6208e+05\n", "No. Observations: 65656 AIC: 3.242e+05\n", "Df Residuals: 65648 BIC: 3.243e+05\n", "Df Model: 7 \n", "Covariance Type: nonrobust \n", "===================================================================================================\n", " coef std err t P>|t| [0.025 0.975]\n", "---------------------------------------------------------------------------------------------------\n", "Intercept 81.2365 0.122 665.628 0.000 80.997 81.476\n", "pct_black -0.0666 0.001 -56.960 0.000 -0.069 -0.064\n", "pct_white -0.0386 0.001 -36.707 0.000 -0.041 -0.037\n", "pct_hispanic 0.0131 0.001 10.298 0.000 0.011 0.016\n", "pct_less_than_hs -0.0862 0.002 -48.979 0.000 -0.090 -0.083\n", "pct_under_150_poverty -0.0596 0.003 -21.738 0.000 -0.065 -0.054\n", "np.divide(income, 10000) 0.4825 0.006 83.217 0.000 0.471 0.494\n", "np.divide(pct_unemployment, 10) -1.4900 0.045 -33.408 0.000 -1.577 -1.403\n", "==============================================================================\n", "Omnibus: 2114.193 Durbin-Watson: 1.520\n", "Prob(Omnibus): 0.000 Jarque-Bera (JB): 4788.035\n", "Skew: 0.183 Prob(JB): 0.00\n", "Kurtosis: 4.271 Cond. No. 792.\n", "==============================================================================\n", "\n", "Warnings:\n", "[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.\n", "\"\"\"" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = smf.ols(\"\"\"\n", " life_expectancy ~ \n", " pct_black \n", " + pct_white \n", " + pct_hispanic \n", " + pct_less_than_hs \n", " + pct_under_150_poverty \n", " + np.divide(income, 10000)\n", " + np.divide(pct_unemployment, 10)\n", "\"\"\", data=merged)\n", "\n", "results = model.fit()\n", "results.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**That's much much nicer!** Unemployment now has a -1.49 coefficient, which (surprise!) leads us right back to the [original article](https://apnews.com/66ac44186b6249709501f07a7eab36da):\n", "\n", "> An increase of 10 percentage points in the unemployment rate in a neighborhood translated to a loss of roughly a year and a half of life expectancy, the AP found. \n", "\n", "Additionally, the negative coefficient for `pct_less_than_hs` is the source of this line:\n", "\n", "> A neighborhood where more adults failed to graduate high school had shorter predicted longevity.\n", "\n", "Even though we accomplished this pretty quickly in very few lines of code, the actual process at the Associated Press was somewhat more complicated. They didn't just throw a bunch of columns in a regression!\n", "\n", "Stats-wise, the AP did things like examined other columns that didn't meet a p-value threshold, and reasoned about [collinearity](http://www.stat.tamu.edu/~hart/652/collinear.pdf) between variables different. For domain expertise, they included a threshold for poverty that's actually _above_ the poverty line. This is due to the way that Medicaid provides health insurance to people in low-income households.\n", "\n", "## Review\n", "\n", "In this exercise, we reproduced [an article from the Associated Press](https://apnews.com/66ac44186b6249709501f07a7eab36da) that analyzed life expectancy and demographic information from the census. They used a linear regression to find the relationship between census tract qualities like unemployment, education, race, and income and how long people live.\n", "\n", "After running the regression once, we ran it a second time to get numbers that were **more human and easier to use in a story**, like a \"1.5 year decrease in life expectancy\" as opposed to a 0.15-year or 8-week decrease. We also used the **formula version** of a statsmodels linear regression to perform those calculations in the regression with `np.divide`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discussion topics\n", "\n", "Let's say we have the sentence \"a 1 percentage point increase in unemployment translates to a 0.15 year decrease in life expectancy.\" How do we change this into something people can appreciate?\n", "\n", "Translate some of your coefficients into the form **\"every X percentage point change in unemployment translates to a Y change in life expectancy.\"** Do this with numbers that are meaningful, and in a way that is easily understandable to your reader.\n", "\n", "What is the difference between using percentage points (0 to 100) vs fractions (0 to 1) when doing a regression analysis?\n", "\n", "Income, education, and race are all related, which can cause problems related to [multicollinearity](https://www.statisticssolutions.com/multicollinearity/) in our analysis. How can we avoid this?\n", "\n", "We should consult someone who \"knows something\" before publishing this story. Where would we go to find someone who understands statistics and the subject matter?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }