{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Finding surveillance planes using random forests\n",
"\n",
"**The story:**\n",
"\n",
"- https://www.buzzfeednews.com/article/peteraldhous/spies-in-the-skies\n",
"- https://www.buzzfeednews.com/article/peteraldhous/hidden-spy-planes\n",
" \n",
"This story, done by Peter Aldhous at Buzzfeed News, involved training a machine learning algorithm to recognize government surveillance planes based on what their flight patterns look like.\n",
"\n",
"**Topics:** Random Forests\n",
"\n",
"**Datasets**\n",
"\n",
"* **feds.csv:** Transponder codes of planes operated by the federal government\n",
"* **planes_features.csv:** various features describing each plane's flight patterns\n",
"* **train.csv:** a labeled dataset of transponder codes and whether each plane is a surveillance plane or not\n",
" - The `label` column was originally `class`, but I renamed it because pandas freaks out a bit with a column named `class`\n",
" - This was created by Buzzfeed `feds.csv`\n",
"* **data dictionary:** You can find the data dictionary published with their analysis [here](https://buzzfeednews.github.io/2016-04-federal-surveillance-planes/analysis.html)\n",
"* **a few other files**\n",
"\n",
"## What's the goal?\n",
"\n",
"The FBI and Department of Homeland Security operate many planes that are not directly labeled as belonging to the government. If we can uncover these planes, we have a better idea of the surveillance activities they are undertaking."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Prep work: Downloading necessary files\n",
"Before we get started, we need to download all of the data we'll be using.\n",
"* **planes_features.csv:** BuzzFeed plane features - as provided by BuzzFeed\n",
"* **train.csv:** BuzzFeed labeled plane data - as provided by BuzzFeed\n",
"* **feds.csv:** BuzzFeed federal planes list - as provided by BuzzFeed\n"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Make data directory if it doesn't exist\n",
"!mkdir -p data\n",
"!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/buzzfeed-spy-planes/data/planes_features.csv -P data\n",
"!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/buzzfeed-spy-planes/data/train.csv -P data\n",
"!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/buzzfeed-spy-planes/data/feds.csv -P data"
],
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports\n",
"\n",
"Also set a large number of maximum columns."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"pd.set_option(\"display.max_columns\", 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Read in our data\n",
"\n",
"Almost all classification problems start with a set of labeled features. In this case, the features are in one CSV file and the labels are in another. **Read both files in and merge them on `adshex`, the transpoder code.**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>adshex</th>\n",
" <th>duration1</th>\n",
" <th>duration2</th>\n",
" <th>duration3</th>\n",
" <th>duration4</th>\n",
" <th>duration5</th>\n",
" <th>boxes1</th>\n",
" <th>boxes2</th>\n",
" <th>boxes3</th>\n",
" <th>boxes4</th>\n",
" <th>boxes5</th>\n",
" <th>speed1</th>\n",
" <th>speed2</th>\n",
" <th>speed3</th>\n",
" <th>speed4</th>\n",
" <th>speed5</th>\n",
" <th>altitude1</th>\n",
" <th>altitude2</th>\n",
" <th>altitude3</th>\n",
" <th>altitude4</th>\n",
" <th>altitude5</th>\n",
" <th>steer1</th>\n",
" <th>steer2</th>\n",
" <th>steer3</th>\n",
" <th>steer4</th>\n",
" <th>steer5</th>\n",
" <th>steer6</th>\n",
" <th>steer7</th>\n",
" <th>steer8</th>\n",
" <th>flights</th>\n",
" <th>squawk_1</th>\n",
" <th>observations</th>\n",
" <th>type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A</td>\n",
" <td>0.120253</td>\n",
" <td>0.075949</td>\n",
" <td>0.183544</td>\n",
" <td>0.335443</td>\n",
" <td>0.284810</td>\n",
" <td>0.088608</td>\n",
" <td>0.044304</td>\n",
" <td>0.069620</td>\n",
" <td>0.120253</td>\n",
" <td>0.677215</td>\n",
" <td>0.021824</td>\n",
" <td>0.020550</td>\n",
" <td>0.062330</td>\n",
" <td>0.100713</td>\n",
" <td>0.794582</td>\n",
" <td>0.042374</td>\n",
" <td>0.060971</td>\n",
" <td>0.066831</td>\n",
" <td>0.106403</td>\n",
" <td>0.723421</td>\n",
" <td>0.020211</td>\n",
" <td>0.048913</td>\n",
" <td>0.270550</td>\n",
" <td>0.344090</td>\n",
" <td>0.097317</td>\n",
" <td>0.186651</td>\n",
" <td>0.011379</td>\n",
" <td>0.009426</td>\n",
" <td>158</td>\n",
" <td>0</td>\n",
" <td>11776</td>\n",
" <td>GRND</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A00000</td>\n",
" <td>0.211735</td>\n",
" <td>0.155612</td>\n",
" <td>0.181122</td>\n",
" <td>0.198980</td>\n",
" <td>0.252551</td>\n",
" <td>0.204082</td>\n",
" <td>0.183673</td>\n",
" <td>0.168367</td>\n",
" <td>0.173469</td>\n",
" <td>0.267857</td>\n",
" <td>0.107348</td>\n",
" <td>0.143410</td>\n",
" <td>0.208139</td>\n",
" <td>0.177013</td>\n",
" <td>0.364090</td>\n",
" <td>0.177318</td>\n",
" <td>0.114457</td>\n",
" <td>0.129648</td>\n",
" <td>0.197694</td>\n",
" <td>0.380882</td>\n",
" <td>0.034976</td>\n",
" <td>0.048127</td>\n",
" <td>0.240732</td>\n",
" <td>0.356314</td>\n",
" <td>0.116116</td>\n",
" <td>0.159325</td>\n",
" <td>0.012828</td>\n",
" <td>0.013628</td>\n",
" <td>392</td>\n",
" <td>0</td>\n",
" <td>52465</td>\n",
" <td>TBM7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A00002</td>\n",
" <td>0.517241</td>\n",
" <td>0.103448</td>\n",
" <td>0.103448</td>\n",
" <td>0.103448</td>\n",
" <td>0.172414</td>\n",
" <td>0.862069</td>\n",
" <td>0.137931</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.990792</td>\n",
" <td>0.000921</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.008287</td>\n",
" <td>0.599448</td>\n",
" <td>0.400552</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.105893</td>\n",
" <td>0.090239</td>\n",
" <td>0.174954</td>\n",
" <td>0.244015</td>\n",
" <td>0.034070</td>\n",
" <td>0.202578</td>\n",
" <td>0.021179</td>\n",
" <td>0.068140</td>\n",
" <td>29</td>\n",
" <td>0</td>\n",
" <td>1086</td>\n",
" <td>SHIP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A00008</td>\n",
" <td>0.125000</td>\n",
" <td>0.041667</td>\n",
" <td>0.208333</td>\n",
" <td>0.166667</td>\n",
" <td>0.458333</td>\n",
" <td>0.125000</td>\n",
" <td>0.083333</td>\n",
" <td>0.125000</td>\n",
" <td>0.166667</td>\n",
" <td>0.500000</td>\n",
" <td>0.187960</td>\n",
" <td>0.278952</td>\n",
" <td>0.221048</td>\n",
" <td>0.190257</td>\n",
" <td>0.121783</td>\n",
" <td>0.014706</td>\n",
" <td>0.053309</td>\n",
" <td>0.149816</td>\n",
" <td>0.279871</td>\n",
" <td>0.502298</td>\n",
" <td>0.029871</td>\n",
" <td>0.044118</td>\n",
" <td>0.202665</td>\n",
" <td>0.380515</td>\n",
" <td>0.094669</td>\n",
" <td>0.182904</td>\n",
" <td>0.014706</td>\n",
" <td>0.020221</td>\n",
" <td>24</td>\n",
" <td>0</td>\n",
" <td>2176</td>\n",
" <td>PA46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A0001E</td>\n",
" <td>0.100000</td>\n",
" <td>0.200000</td>\n",
" <td>0.200000</td>\n",
" <td>0.400000</td>\n",
" <td>0.100000</td>\n",
" <td>0.100000</td>\n",
" <td>0.000000</td>\n",
" <td>0.100000</td>\n",
" <td>0.400000</td>\n",
" <td>0.400000</td>\n",
" <td>0.007937</td>\n",
" <td>0.026984</td>\n",
" <td>0.084127</td>\n",
" <td>0.179365</td>\n",
" <td>0.701587</td>\n",
" <td>0.041270</td>\n",
" <td>0.085714</td>\n",
" <td>0.039683</td>\n",
" <td>0.111111</td>\n",
" <td>0.722222</td>\n",
" <td>0.019048</td>\n",
" <td>0.049206</td>\n",
" <td>0.249206</td>\n",
" <td>0.326984</td>\n",
" <td>0.112698</td>\n",
" <td>0.206349</td>\n",
" <td>0.012698</td>\n",
" <td>0.011111</td>\n",
" <td>10</td>\n",
" <td>1135</td>\n",
" <td>630</td>\n",
" <td>C56X</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" adshex duration1 duration2 duration3 duration4 duration5 boxes1 \\\n",
"0 A 0.120253 0.075949 0.183544 0.335443 0.284810 0.088608 \n",
"1 A00000 0.211735 0.155612 0.181122 0.198980 0.252551 0.204082 \n",
"2 A00002 0.517241 0.103448 0.103448 0.103448 0.172414 0.862069 \n",
"3 A00008 0.125000 0.041667 0.208333 0.166667 0.458333 0.125000 \n",
"4 A0001E 0.100000 0.200000 0.200000 0.400000 0.100000 0.100000 \n",
"\n",
" boxes2 boxes3 boxes4 boxes5 speed1 speed2 speed3 \\\n",
"0 0.044304 0.069620 0.120253 0.677215 0.021824 0.020550 0.062330 \n",
"1 0.183673 0.168367 0.173469 0.267857 0.107348 0.143410 0.208139 \n",
"2 0.137931 0.000000 0.000000 0.000000 0.990792 0.000921 0.000000 \n",
"3 0.083333 0.125000 0.166667 0.500000 0.187960 0.278952 0.221048 \n",
"4 0.000000 0.100000 0.400000 0.400000 0.007937 0.026984 0.084127 \n",
"\n",
" speed4 speed5 altitude1 altitude2 altitude3 altitude4 altitude5 \\\n",
"0 0.100713 0.794582 0.042374 0.060971 0.066831 0.106403 0.723421 \n",
"1 0.177013 0.364090 0.177318 0.114457 0.129648 0.197694 0.380882 \n",
"2 0.000000 0.008287 0.599448 0.400552 0.000000 0.000000 0.000000 \n",
"3 0.190257 0.121783 0.014706 0.053309 0.149816 0.279871 0.502298 \n",
"4 0.179365 0.701587 0.041270 0.085714 0.039683 0.111111 0.722222 \n",
"\n",
" steer1 steer2 steer3 steer4 steer5 steer6 steer7 \\\n",
"0 0.020211 0.048913 0.270550 0.344090 0.097317 0.186651 0.011379 \n",
"1 0.034976 0.048127 0.240732 0.356314 0.116116 0.159325 0.012828 \n",
"2 0.105893 0.090239 0.174954 0.244015 0.034070 0.202578 0.021179 \n",
"3 0.029871 0.044118 0.202665 0.380515 0.094669 0.182904 0.014706 \n",
"4 0.019048 0.049206 0.249206 0.326984 0.112698 0.206349 0.012698 \n",
"\n",
" steer8 flights squawk_1 observations type \n",
"0 0.009426 158 0 11776 GRND \n",
"1 0.013628 392 0 52465 TBM7 \n",
"2 0.068140 29 0 1086 SHIP \n",
"3 0.020221 24 0 2176 PA46 \n",
"4 0.011111 10 1135 630 C56X "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read in your features\n",
"features = pd.read_csv(\"data/planes_features.csv\")\n",
"features.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>adshex</th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A00C4B</td>\n",
" <td>surveil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A0AB21</td>\n",
" <td>surveil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A0AE77</td>\n",
" <td>surveil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A0AE7C</td>\n",
" <td>surveil</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A0C462</td>\n",
" <td>surveil</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" adshex label\n",
"0 A00C4B surveil\n",
"1 A0AB21 surveil\n",
"2 A0AE77 surveil\n",
"3 A0AE7C surveil\n",
"4 A0C462 surveil"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Read in your labels\n",
"labeled = pd.read_csv(\"data/train.csv\").rename(columns={'class': 'label'})\n",
"labeled.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>adshex</th>\n",
" <th>label</th>\n",
" <th>duration1</th>\n",
" <th>duration2</th>\n",
" <th>duration3</th>\n",
" <th>duration4</th>\n",
" <th>duration5</th>\n",
" <th>boxes1</th>\n",
" <th>boxes2</th>\n",
" <th>boxes3</th>\n",
" <th>boxes4</th>\n",
" <th>boxes5</th>\n",
" <th>speed1</th>\n",
" <th>speed2</th>\n",
" <th>speed3</th>\n",
" <th>speed4</th>\n",
" <th>speed5</th>\n",
" <th>altitude1</th>\n",
" <th>altitude2</th>\n",
" <th>altitude3</th>\n",
" <th>altitude4</th>\n",
" <th>altitude5</th>\n",
" <th>steer1</th>\n",
" <th>steer2</th>\n",
" <th>steer3</th>\n",
" <th>steer4</th>\n",
" <th>steer5</th>\n",
" <th>steer6</th>\n",
" <th>steer7</th>\n",
" <th>steer8</th>\n",
" <th>flights</th>\n",
" <th>squawk_1</th>\n",
" <th>observations</th>\n",
" <th>type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A00C4B</td>\n",
" <td>surveil</td>\n",
" <td>0.450000</td>\n",
" <td>0.125000</td>\n",
" <td>0.025000</td>\n",
" <td>0.025000</td>\n",
" <td>0.375000</td>\n",
" <td>0.475000</td>\n",
" <td>0.250000</td>\n",
" <td>0.250000</td>\n",
" <td>0.025000</td>\n",
" <td>0.000000</td>\n",
" <td>0.337128</td>\n",
" <td>0.408286</td>\n",
" <td>0.185431</td>\n",
" <td>0.053026</td>\n",
" <td>0.016129</td>\n",
" <td>0.010226</td>\n",
" <td>0.168564</td>\n",
" <td>0.793274</td>\n",
" <td>0.027936</td>\n",
" <td>0.000000</td>\n",
" <td>0.151697</td>\n",
" <td>0.203774</td>\n",
" <td>0.303922</td>\n",
" <td>0.154544</td>\n",
" <td>0.033312</td>\n",
" <td>0.088024</td>\n",
" <td>0.010858</td>\n",
" <td>0.010753</td>\n",
" <td>40</td>\n",
" <td>4414</td>\n",
" <td>9486</td>\n",
" <td>C182</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A0AB21</td>\n",
" <td>surveil</td>\n",
" <td>0.523810</td>\n",
" <td>0.000000</td>\n",
" <td>0.047619</td>\n",
" <td>0.095238</td>\n",
" <td>0.333333</td>\n",
" <td>0.714286</td>\n",
" <td>0.095238</td>\n",
" <td>0.047619</td>\n",
" <td>0.142857</td>\n",
" <td>0.000000</td>\n",
" <td>0.703329</td>\n",
" <td>0.144543</td>\n",
" <td>0.114201</td>\n",
" <td>0.026549</td>\n",
" <td>0.011378</td>\n",
" <td>0.007164</td>\n",
" <td>0.580700</td>\n",
" <td>0.374210</td>\n",
" <td>0.037927</td>\n",
" <td>0.000000</td>\n",
" <td>0.141593</td>\n",
" <td>0.152550</td>\n",
" <td>0.166456</td>\n",
" <td>0.309313</td>\n",
" <td>0.008007</td>\n",
" <td>0.078382</td>\n",
" <td>0.021492</td>\n",
" <td>0.064054</td>\n",
" <td>21</td>\n",
" <td>4414</td>\n",
" <td>2373</td>\n",
" <td>C182</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A0AE77</td>\n",
" <td>surveil</td>\n",
" <td>0.262295</td>\n",
" <td>0.196721</td>\n",
" <td>0.081967</td>\n",
" <td>0.114754</td>\n",
" <td>0.344262</td>\n",
" <td>0.639344</td>\n",
" <td>0.295082</td>\n",
" <td>0.032787</td>\n",
" <td>0.032787</td>\n",
" <td>0.000000</td>\n",
" <td>0.703037</td>\n",
" <td>0.181262</td>\n",
" <td>0.066502</td>\n",
" <td>0.030956</td>\n",
" <td>0.018244</td>\n",
" <td>0.000000</td>\n",
" <td>0.000118</td>\n",
" <td>0.034134</td>\n",
" <td>0.923376</td>\n",
" <td>0.042373</td>\n",
" <td>0.121234</td>\n",
" <td>0.256709</td>\n",
" <td>0.279779</td>\n",
" <td>0.209981</td>\n",
" <td>0.009416</td>\n",
" <td>0.037900</td>\n",
" <td>0.011064</td>\n",
" <td>0.027778</td>\n",
" <td>61</td>\n",
" <td>4414</td>\n",
" <td>8496</td>\n",
" <td>T206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A0AE7C</td>\n",
" <td>surveil</td>\n",
" <td>0.521739</td>\n",
" <td>0.086957</td>\n",
" <td>0.043478</td>\n",
" <td>0.043478</td>\n",
" <td>0.304348</td>\n",
" <td>0.565217</td>\n",
" <td>0.043478</td>\n",
" <td>0.260870</td>\n",
" <td>0.000000</td>\n",
" <td>0.130435</td>\n",
" <td>0.129674</td>\n",
" <td>0.291088</td>\n",
" <td>0.384954</td>\n",
" <td>0.098159</td>\n",
" <td>0.096126</td>\n",
" <td>0.000000</td>\n",
" <td>0.004631</td>\n",
" <td>0.200723</td>\n",
" <td>0.722806</td>\n",
" <td>0.071840</td>\n",
" <td>0.159494</td>\n",
" <td>0.256636</td>\n",
" <td>0.238111</td>\n",
" <td>0.168305</td>\n",
" <td>0.023043</td>\n",
" <td>0.086073</td>\n",
" <td>0.014007</td>\n",
" <td>0.014797</td>\n",
" <td>23</td>\n",
" <td>4415</td>\n",
" <td>8853</td>\n",
" <td>T206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A0C462</td>\n",
" <td>surveil</td>\n",
" <td>0.250000</td>\n",
" <td>0.083333</td>\n",
" <td>0.500000</td>\n",
" <td>0.083333</td>\n",
" <td>0.083333</td>\n",
" <td>0.208333</td>\n",
" <td>0.041667</td>\n",
" <td>0.041667</td>\n",
" <td>0.500000</td>\n",
" <td>0.208333</td>\n",
" <td>0.040691</td>\n",
" <td>0.002466</td>\n",
" <td>0.041924</td>\n",
" <td>0.170160</td>\n",
" <td>0.744760</td>\n",
" <td>0.011097</td>\n",
" <td>0.007398</td>\n",
" <td>0.023428</td>\n",
" <td>0.090012</td>\n",
" <td>0.868064</td>\n",
" <td>0.019729</td>\n",
" <td>0.020962</td>\n",
" <td>0.199753</td>\n",
" <td>0.478422</td>\n",
" <td>0.119605</td>\n",
" <td>0.118372</td>\n",
" <td>0.006165</td>\n",
" <td>0.011097</td>\n",
" <td>24</td>\n",
" <td>1731</td>\n",
" <td>811</td>\n",
" <td>P8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" adshex label duration1 duration2 duration3 duration4 duration5 \\\n",
"0 A00C4B surveil 0.450000 0.125000 0.025000 0.025000 0.375000 \n",
"1 A0AB21 surveil 0.523810 0.000000 0.047619 0.095238 0.333333 \n",
"2 A0AE77 surveil 0.262295 0.196721 0.081967 0.114754 0.344262 \n",
"3 A0AE7C surveil 0.521739 0.086957 0.043478 0.043478 0.304348 \n",
"4 A0C462 surveil 0.250000 0.083333 0.500000 0.083333 0.083333 \n",
"\n",
" boxes1 boxes2 boxes3 boxes4 boxes5 speed1 speed2 \\\n",
"0 0.475000 0.250000 0.250000 0.025000 0.000000 0.337128 0.408286 \n",
"1 0.714286 0.095238 0.047619 0.142857 0.000000 0.703329 0.144543 \n",
"2 0.639344 0.295082 0.032787 0.032787 0.000000 0.703037 0.181262 \n",
"3 0.565217 0.043478 0.260870 0.000000 0.130435 0.129674 0.291088 \n",
"4 0.208333 0.041667 0.041667 0.500000 0.208333 0.040691 0.002466 \n",
"\n",
" speed3 speed4 speed5 altitude1 altitude2 altitude3 altitude4 \\\n",
"0 0.185431 0.053026 0.016129 0.010226 0.168564 0.793274 0.027936 \n",
"1 0.114201 0.026549 0.011378 0.007164 0.580700 0.374210 0.037927 \n",
"2 0.066502 0.030956 0.018244 0.000000 0.000118 0.034134 0.923376 \n",
"3 0.384954 0.098159 0.096126 0.000000 0.004631 0.200723 0.722806 \n",
"4 0.041924 0.170160 0.744760 0.011097 0.007398 0.023428 0.090012 \n",
"\n",
" altitude5 steer1 steer2 steer3 steer4 steer5 steer6 \\\n",
"0 0.000000 0.151697 0.203774 0.303922 0.154544 0.033312 0.088024 \n",
"1 0.000000 0.141593 0.152550 0.166456 0.309313 0.008007 0.078382 \n",
"2 0.042373 0.121234 0.256709 0.279779 0.209981 0.009416 0.037900 \n",
"3 0.071840 0.159494 0.256636 0.238111 0.168305 0.023043 0.086073 \n",
"4 0.868064 0.019729 0.020962 0.199753 0.478422 0.119605 0.118372 \n",
"\n",
" steer7 steer8 flights squawk_1 observations type \n",
"0 0.010858 0.010753 40 4414 9486 C182 \n",
"1 0.021492 0.064054 21 4414 2373 C182 \n",
"2 0.011064 0.027778 61 4414 8496 T206 \n",
"3 0.014007 0.014797 23 4415 8853 T206 \n",
"4 0.006165 0.011097 24 1731 811 P8 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = labeled.merge(features, on='adshex')\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### No wait, merge them again!\n",
"\n",
"We have features for about 20,000 planes and labels for about 600 planes. When you merge, the planes you have features for but not labels for will disappear.\n",
"\n",
"We want to keep those in the dataframe so we can play detective with them later, and try to find surveillance planes using the features. When you merge, you should use `how='left'` or `how='right'` to keep unmatched columns from the left (or right) dataframe."
]
},
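{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `how` changes, here is a minimal sketch on made-up data. The `toy_labels` and `toy_features` dataframes and their values are invented for illustration; only the `adshex` and `label` column names come from the real files.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: two labeled planes, three planes with features\n",
"toy_labels = pd.DataFrame({'adshex': ['A1', 'A2'], 'label': ['surveil', 'other']})\n",
"toy_features = pd.DataFrame({'adshex': ['A1', 'A2', 'A3'], 'flights': [40, 21, 10]})\n",
"\n",
"# Default (inner) merge: only planes present in both dataframes survive\n",
"inner = toy_labels.merge(toy_features, on='adshex')\n",
"\n",
"# Right merge: every row of toy_features is kept, and a missing label becomes NaN\n",
"right = toy_labels.merge(toy_features, on='adshex', how='right')\n",
"\n",
"print(len(inner), len(right))\n",
"print(right)\n",
"```\n",
"\n",
"The unlabeled plane `A3` only shows up in the right merge, with `NaN` in its `label` column."
]
},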
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"df = labeled.merge(features, on='adshex', how='right')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Confirm you have 19,799 rows and 34 columns."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(19799, 34)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cleaning up our data\n",
"\n",
"## Number-izing our labels\n",
"\n",
"Each row is a plane, and it's marked as either a surveillance plane or not. How many do we have in each category?"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"other 500\n",
"surveil 97\n",
"Name: label, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.label.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How do you feel about that split?\n",
"\n",
"**Prepare this column for machine learning.** What's wrong with it as `\"surveil\"` and `\"other\"`? Add a new column that we can use for classification."
]
},
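{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to judge the split is to look at shares instead of raw counts. The sketch below rebuilds the counts printed above (500 `other`, 97 `surveil`) in a stand-in Series called `toy_split`, a hypothetical name, rather than touching the real dataframe.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Stand-in for the labeled rows, using the counts shown above\n",
"toy_split = pd.Series(['other'] * 500 + ['surveil'] * 97)\n",
"\n",
"# normalize=True returns each category's share instead of its count\n",
"shares = toy_split.value_counts(normalize=True)\n",
"print(shares.round(3))\n",
"```\n",
"\n",
"Roughly 84% of the labeled planes are `other`, so a classifier that always guessed `other` would already be right about 84% of the time. Keep that baseline in mind when judging any accuracy score."
]
},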
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>adshex</th>\n",
" <th>label</th>\n",
" <th>duration1</th>\n",
" <th>duration2</th>\n",
" <th>duration3</th>\n",
" <th>duration4</th>\n",
" <th>duration5</th>\n",
" <th>boxes1</th>\n",
" <th>boxes2</th>\n",
" <th>boxes3</th>\n",
" <th>boxes4</th>\n",
" <th>boxes5</th>\n",
" <th>speed1</th>\n",
" <th>speed2</th>\n",
" <th>speed3</th>\n",
" <th>speed4</th>\n",
" <th>speed5</th>\n",
" <th>altitude1</th>\n",
" <th>altitude2</th>\n",
" <th>altitude3</th>\n",
" <th>altitude4</th>\n",
" <th>altitude5</th>\n",
" <th>steer1</th>\n",
" <th>steer2</th>\n",
" <th>steer3</th>\n",
" <th>steer4</th>\n",
" <th>steer5</th>\n",
" <th>steer6</th>\n",
" <th>steer7</th>\n",
" <th>steer8</th>\n",
" <th>flights</th>\n",
" <th>squawk_1</th>\n",
" <th>observations</th>\n",
" <th>type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A00C4B</td>\n",
" <td>1.0</td>\n",
" <td>0.450000</td>\n",
" <td>0.125000</td>\n",
" <td>0.025000</td>\n",
" <td>0.025000</td>\n",
" <td>0.375000</td>\n",
" <td>0.475000</td>\n",
" <td>0.250000</td>\n",
" <td>0.250000</td>\n",
" <td>0.025000</td>\n",
" <td>0.000000</td>\n",
" <td>0.337128</td>\n",
" <td>0.408286</td>\n",
" <td>0.185431</td>\n",
" <td>0.053026</td>\n",
" <td>0.016129</td>\n",
" <td>0.010226</td>\n",
" <td>0.168564</td>\n",
" <td>0.793274</td>\n",
" <td>0.027936</td>\n",
" <td>0.000000</td>\n",
" <td>0.151697</td>\n",
" <td>0.203774</td>\n",
" <td>0.303922</td>\n",
" <td>0.154544</td>\n",
" <td>0.033312</td>\n",
" <td>0.088024</td>\n",
" <td>0.010858</td>\n",
" <td>0.010753</td>\n",
" <td>40</td>\n",
" <td>4414</td>\n",
" <td>9486</td>\n",
" <td>C182</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A0AB21</td>\n",
" <td>1.0</td>\n",
" <td>0.523810</td>\n",
" <td>0.000000</td>\n",
" <td>0.047619</td>\n",
" <td>0.095238</td>\n",
" <td>0.333333</td>\n",
" <td>0.714286</td>\n",
" <td>0.095238</td>\n",
" <td>0.047619</td>\n",
" <td>0.142857</td>\n",
" <td>0.000000</td>\n",
" <td>0.703329</td>\n",
" <td>0.144543</td>\n",
" <td>0.114201</td>\n",
" <td>0.026549</td>\n",
" <td>0.011378</td>\n",
" <td>0.007164</td>\n",
" <td>0.580700</td>\n",
" <td>0.374210</td>\n",
" <td>0.037927</td>\n",
" <td>0.000000</td>\n",
" <td>0.141593</td>\n",
" <td>0.152550</td>\n",
" <td>0.166456</td>\n",
" <td>0.309313</td>\n",
" <td>0.008007</td>\n",
" <td>0.078382</td>\n",
" <td>0.021492</td>\n",
" <td>0.064054</td>\n",
" <td>21</td>\n",
" <td>4414</td>\n",
" <td>2373</td>\n",
" <td>C182</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>A0AE77</td>\n",
" <td>1.0</td>\n",
" <td>0.262295</td>\n",
" <td>0.196721</td>\n",
" <td>0.081967</td>\n",
" <td>0.114754</td>\n",
" <td>0.344262</td>\n",
" <td>0.639344</td>\n",
" <td>0.295082</td>\n",
" <td>0.032787</td>\n",
" <td>0.032787</td>\n",
" <td>0.000000</td>\n",
" <td>0.703037</td>\n",
" <td>0.181262</td>\n",
" <td>0.066502</td>\n",
" <td>0.030956</td>\n",
" <td>0.018244</td>\n",
" <td>0.000000</td>\n",
" <td>0.000118</td>\n",
" <td>0.034134</td>\n",
" <td>0.923376</td>\n",
" <td>0.042373</td>\n",
" <td>0.121234</td>\n",
" <td>0.256709</td>\n",
" <td>0.279779</td>\n",
" <td>0.209981</td>\n",
" <td>0.009416</td>\n",
" <td>0.037900</td>\n",
" <td>0.011064</td>\n",
" <td>0.027778</td>\n",
" <td>61</td>\n",
" <td>4414</td>\n",
" <td>8496</td>\n",
" <td>T206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>A0AE7C</td>\n",
" <td>1.0</td>\n",
" <td>0.521739</td>\n",
" <td>0.086957</td>\n",
" <td>0.043478</td>\n",
" <td>0.043478</td>\n",
" <td>0.304348</td>\n",
" <td>0.565217</td>\n",
" <td>0.043478</td>\n",
" <td>0.260870</td>\n",
" <td>0.000000</td>\n",
" <td>0.130435</td>\n",
" <td>0.129674</td>\n",
" <td>0.291088</td>\n",
" <td>0.384954</td>\n",
" <td>0.098159</td>\n",
" <td>0.096126</td>\n",
" <td>0.000000</td>\n",
" <td>0.004631</td>\n",
" <td>0.200723</td>\n",
" <td>0.722806</td>\n",
" <td>0.071840</td>\n",
" <td>0.159494</td>\n",
" <td>0.256636</td>\n",
" <td>0.238111</td>\n",
" <td>0.168305</td>\n",
" <td>0.023043</td>\n",
" <td>0.086073</td>\n",
" <td>0.014007</td>\n",
" <td>0.014797</td>\n",
" <td>23</td>\n",
" <td>4415</td>\n",
" <td>8853</td>\n",
" <td>T206</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>A0C462</td>\n",
" <td>1.0</td>\n",
" <td>0.250000</td>\n",
" <td>0.083333</td>\n",
" <td>0.500000</td>\n",
" <td>0.083333</td>\n",
" <td>0.083333</td>\n",
" <td>0.208333</td>\n",
" <td>0.041667</td>\n",
" <td>0.041667</td>\n",
" <td>0.500000</td>\n",
" <td>0.208333</td>\n",
" <td>0.040691</td>\n",
" <td>0.002466</td>\n",
" <td>0.041924</td>\n",
" <td>0.170160</td>\n",
" <td>0.744760</td>\n",
" <td>0.011097</td>\n",
" <td>0.007398</td>\n",
" <td>0.023428</td>\n",
" <td>0.090012</td>\n",
" <td>0.868064</td>\n",
" <td>0.019729</td>\n",
" <td>0.020962</td>\n",
" <td>0.199753</td>\n",
" <td>0.478422</td>\n",
" <td>0.119605</td>\n",
" <td>0.118372</td>\n",
" <td>0.006165</td>\n",
" <td>0.011097</td>\n",
" <td>24</td>\n",
" <td>1731</td>\n",
" <td>811</td>\n",
" <td>P8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" adshex label duration1 duration2 duration3 duration4 duration5 \\\n",
"0 A00C4B 1.0 0.450000 0.125000 0.025000 0.025000 0.375000 \n",
"1 A0AB21 1.0 0.523810 0.000000 0.047619 0.095238 0.333333 \n",
"2 A0AE77 1.0 0.262295 0.196721 0.081967 0.114754 0.344262 \n",
"3 A0AE7C 1.0 0.521739 0.086957 0.043478 0.043478 0.304348 \n",
"4 A0C462 1.0 0.250000 0.083333 0.500000 0.083333 0.083333 \n",
"\n",
" boxes1 boxes2 boxes3 boxes4 boxes5 speed1 speed2 \\\n",
"0 0.475000 0.250000 0.250000 0.025000 0.000000 0.337128 0.408286 \n",
"1 0.714286 0.095238 0.047619 0.142857 0.000000 0.703329 0.144543 \n",
"2 0.639344 0.295082 0.032787 0.032787 0.000000 0.703037 0.181262 \n",
"3 0.565217 0.043478 0.260870 0.000000 0.130435 0.129674 0.291088 \n",
"4 0.208333 0.041667 0.041667 0.500000 0.208333 0.040691 0.002466 \n",
"\n",
" speed3 speed4 speed5 altitude1 altitude2 altitude3 altitude4 \\\n",
"0 0.185431 0.053026 0.016129 0.010226 0.168564 0.793274 0.027936 \n",
"1 0.114201 0.026549 0.011378 0.007164 0.580700 0.374210 0.037927 \n",
"2 0.066502 0.030956 0.018244 0.000000 0.000118 0.034134 0.923376 \n",
"3 0.384954 0.098159 0.096126 0.000000 0.004631 0.200723 0.722806 \n",
"4 0.041924 0.170160 0.744760 0.011097 0.007398 0.023428 0.090012 \n",
"\n",
" altitude5 steer1 steer2 steer3 steer4 steer5 steer6 \\\n",
"0 0.000000 0.151697 0.203774 0.303922 0.154544 0.033312 0.088024 \n",
"1 0.000000 0.141593 0.152550 0.166456 0.309313 0.008007 0.078382 \n",
"2 0.042373 0.121234 0.256709 0.279779 0.209981 0.009416 0.037900 \n",
"3 0.071840 0.159494 0.256636 0.238111 0.168305 0.023043 0.086073 \n",
"4 0.868064 0.019729 0.020962 0.199753 0.478422 0.119605 0.118372 \n",
"\n",
" steer7 steer8 flights squawk_1 observations type \n",
"0 0.010858 0.010753 40 4414 9486 C182 \n",
"1 0.021492 0.064054 21 4414 2373 C182 \n",
"2 0.011064 0.027778 61 4414 8496 T206 \n",
"3 0.014007 0.014797 23 4415 8853 T206 \n",
"4 0.006165 0.011097 24 1731 811 P8 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Replace label with numbers\n",
"df['label'] = df.label.replace({\n",
" 'surveil': 1,\n",
" 'other': 0\n",
"})\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Categorical variables\n",
"\n",
"Do we have any variables that count as categories? Yes, we do! ...but how many different categories does it have?\n",
"\n",
"* **Tip:** You can use `.unique()` or `.value_counts()` to count unique items, depending on what you're looking for"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"unknown 2528\n",
"C172 1014\n",
"SR22 799\n",
"BE36 699\n",
"C182 693\n",
" ... \n",
"MS76 1\n",
"FGT 1\n",
"SC7 1\n",
"E35L 1\n",
"M20J 1\n",
"Name: type, Length: 455, dtype: int64"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.type.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of those types of plane only have one appearance, which means they wouldn't be very helpful identifiers in the final analysis. For example, if I only see one GLF5 and it's a surveillance plane, does that mean the next one I see is probably a surveillance plane? With such a small sample size, I have no idea!\n",
"\n",
"We have a few options\n",
"\n",
"1. Create a very large set of dummy variables out of all 133 types of planes\n",
"2. Create `0`/`1` columns for common plane types and ignore the less common ones - C182, T206, SR22\n",
"3. Interview someone who knows something about planes and put these into a few broader categories\n",
"4. Keep them as one column, just turn them into numbers - it doesn't make sense in terms of order, but if one or two plane types are very indicative of a surveillance plane the forest might pick it up\n",
"\n",
"Oddly enough, **the last one is a common approach.** Let's use it!\n",
"\n",
"If you want to convert a list of categories into numbers, an easy way is to use the `Categorical` data type."
]
},
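{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, options 1 and 2 could look something like the sketch below. The `toy` dataframe is made up, and the `is_` column prefix is just a naming choice for this example.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy column of plane types (type names borrowed from the data above)\n",
"toy = pd.DataFrame({'type': ['C182', 'T206', 'C182', 'P8', 'SR22']})\n",
"\n",
"# Option 1: one dummy column per type, so 4 new columns for 4 distinct types\n",
"dummies = pd.get_dummies(toy['type'])\n",
"print(dummies.shape)\n",
"\n",
"# Option 2: 0/1 columns for a few common types only; rarer types get all zeros\n",
"for common in ['C182', 'T206', 'SR22']:\n",
"    toy['is_' + common] = (toy['type'] == common).astype(int)\n",
"print(toy)\n",
"```\n",
"\n",
"With hundreds of distinct types, option 1 would add hundreds of mostly empty columns, which is part of why a single coded column is attractive here."
]
},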
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 C182\n",
"1 C182\n",
"2 T206\n",
"3 T206\n",
"4 P8\n",
"Name: type, dtype: category\n",
"Categories (455, object): [208, A109, A119, A139, ..., WW24, XL2, ZZZZ, unknown]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.type = df.type.astype('category')\n",
"df.type.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like a normal bunch of strings, but pandas is secretly using a number for each one! You can find the number with `.cat.codes`.\n",
"\n",
"**Use `df.type.cat.codes` to make a new columns called `type_code`.** "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>type</th>\n",
" <th>type_code</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>C182</td>\n",
" <td>91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>C182</td>\n",
" <td>91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>T206</td>\n",
" <td>417</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>T206</td>\n",
" <td>417</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>P8</td>\n",
" <td>337</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>BE20</td>\n",
" <td>48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>BE20</td>\n",
" <td>48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>BE20</td>\n",
" <td>48</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>C182</td>\n",
" <td>91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>C182</td>\n",
" <td>91</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" type type_code\n",
"0 C182 91\n",
"1 C182 91\n",
"2 T206 417\n",
"3 T206 417\n",
"4 P8 337\n",
"5 BE20 48\n",
"6 BE20 48\n",
"7 BE20 48\n",
"8 C182 91\n",
"9 C182 91"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['type_code'] = df.type.cat.codes\n",
"df[['type', 'type_code']].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use `type_code` for machine learning since sklearn needs a number, and `type` for reading since we like text."
]
},
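{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see where those numbers come from, here is a minimal sketch on a made-up series (the `toy` name and its values are hypothetical, not part of the dataset). When pandas infers categories from strings it sorts them first, so the codes follow alphabetical order rather than anything meaningful about the planes.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not from the real dataset\n",
"toy = pd.Series(['C182', 'T206', 'BE20', 'C182']).astype('category')\n",
"\n",
"# Categories are inferred in sorted order: BE20, C182, T206\n",
"categories = list(toy.cat.categories)\n",
"\n",
"# Each value is replaced by the position of its category in that sorted list\n",
"codes = list(toy.cat.codes)\n",
"\n",
"print(categories)\n",
"print(codes)\n",
"```\n",
"\n",
"That is why `BE20` gets a lower code than `C182`, and `C182` a lower code than `T206`, in the table above."
]
},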
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building our classifier\n",
"\n",
"When we're about to classify, we usually just drop our target column to build our inputs and outputs:\n",
"\n",
"```python\n",
"X = train_df.drop(column='column_you_are_predicting')\n",
"y = train_df.column_you_are_predicting\n",
"```\n",
"\n",
"This time is a little different. First, we have unlabeled data in there! Use `.dropna()` to filter your training data so we only have labeled data.\n",
"\n",
"Confirm `train_df` has 597 rows and 35 columns."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(597, 35)"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df = df.dropna()\n",
"train_df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also have a few extra columns that we aren't using for classification (like the text version of the type column and the transponder code). It's fine to drop multiple columns here that you aren't using, just a little bit messier. You also have to make sure you're dropping all the right ones.\n",
"\n",
"Do a `.head()` to double-check all of the columns you need to drop when creating your `X`."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>adshex</th>\n",
" <th>label</th>\n",
" <th>duration1</th>\n",
" <th>duration2</th>\n",
" <th>duration3</th>\n",
" <th>duration4</th>\n",
" <th>duration5</th>\n",
" <th>boxes1</th>\n",
" <th>boxes2</th>\n",
" <th>boxes3</th>\n",
" <th>boxes4</th>\n",
" <th>boxes5</th>\n",
" <th>speed1</th>\n",
" <th>speed2</th>\n",
" <th>speed3</th>\n",
" <th>speed4</th>\n",
" <th>speed5</th>\n",
" <th>altitude1</th>\n",
" <th>altitude2</th>\n",
" <th>altitude3</th>\n",
" <th>altitude4</th>\n",
" <th>altitude5</th>\n",
" <th>steer1</th>\n",
" <th>steer2</th>\n",
" <th>steer3</th>\n",
" <th>steer4</th>\n",
" <th>steer5</th>\n",
" <th>steer6</th>\n",
" <th>steer7</th>\n",
" <th>steer8</th>\n",
" <th>flights</th>\n",
" <th>squawk_1</th>\n",
" <th>observations</th>\n",
" <th>type</th>\n",
" <th>type_code</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>A00C4B</td>\n",
" <td>1.0</td>\n",
" <td>0.45000</td>\n",
" <td>0.125</td>\n",
" <td>0.025000</td>\n",
" <td>0.025000</td>\n",
" <td>0.375000</td>\n",
" <td>0.475000</td>\n",
" <td>0.250000</td>\n",
" <td>0.250000</td>\n",
" <td>0.025000</td>\n",
" <td>0.0</td>\n",
" <td>0.337128</td>\n",
" <td>0.408286</td>\n",
" <td>0.185431</td>\n",
" <td>0.053026</td>\n",
" <td>0.016129</td>\n",
" <td>0.010226</td>\n",
" <td>0.168564</td>\n",
" <td>0.793274</td>\n",
" <td>0.027936</td>\n",
" <td>0.0</td>\n",
" <td>0.151697</td>\n",
" <td>0.203774</td>\n",
" <td>0.303922</td>\n",
" <td>0.154544</td>\n",
" <td>0.033312</td>\n",
" <td>0.088024</td>\n",
" <td>0.010858</td>\n",
" <td>0.010753</td>\n",
" <td>40</td>\n",
" <td>4414</td>\n",
" <td>9486</td>\n",
" <td>C182</td>\n",
" <td>91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>A0AB21</td>\n",
" <td>1.0</td>\n",
" <td>0.52381</td>\n",
" <td>0.000</td>\n",
" <td>0.047619</td>\n",
" <td>0.095238</td>\n",
" <td>0.333333</td>\n",
" <td>0.714286</td>\n",
" <td>0.095238</td>\n",
" <td>0.047619</td>\n",
" <td>0.142857</td>\n",
" <td>0.0</td>\n",
" <td>0.703329</td>\n",
" <td>0.144543</td>\n",
" <td>0.114201</td>\n",
" <td>0.026549</td>\n",
" <td>0.011378</td>\n",
" <td>0.007164</td>\n",
" <td>0.580700</td>\n",
" <td>0.374210</td>\n",
" <td>0.037927</td>\n",
" <td>0.0</td>\n",
" <td>0.141593</td>\n",
" <td>0.152550</td>\n",
" <td>0.166456</td>\n",
" <td>0.309313</td>\n",
" <td>0.008007</td>\n",
" <td>0.078382</td>\n",
" <td>0.021492</td>\n",
" <td>0.064054</td>\n",
" <td>21</td>\n",
" <td>4414</td>\n",
" <td>2373</td>\n",
" <td>C182</td>\n",
" <td>91</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" adshex label duration1 duration2 duration3 duration4 duration5 \\\n",
"0 A00C4B 1.0 0.45000 0.125 0.025000 0.025000 0.375000 \n",
"1 A0AB21 1.0 0.52381 0.000 0.047619 0.095238 0.333333 \n",
"\n",
" boxes1 boxes2 boxes3 boxes4 boxes5 speed1 speed2 \\\n",
"0 0.475000 0.250000 0.250000 0.025000 0.0 0.337128 0.408286 \n",
"1 0.714286 0.095238 0.047619 0.142857 0.0 0.703329 0.144543 \n",
"\n",
" speed3 speed4 speed5 altitude1 altitude2 altitude3 altitude4 \\\n",
"0 0.185431 0.053026 0.016129 0.010226 0.168564 0.793274 0.027936 \n",
"1 0.114201 0.026549 0.011378 0.007164 0.580700 0.374210 0.037927 \n",
"\n",
" altitude5 steer1 steer2 steer3 steer4 steer5 steer6 \\\n",
"0 0.0 0.151697 0.203774 0.303922 0.154544 0.033312 0.088024 \n",
"1 0.0 0.141593 0.152550 0.166456 0.309313 0.008007 0.078382 \n",
"\n",
" steer7 steer8 flights squawk_1 observations type type_code \n",
"0 0.010858 0.010753 40 4414 9486 C182 91 \n",
"1 0.021492 0.064054 21 4414 2373 C182 91 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create your `X` and `y`.\n",
"\n",
"When you do `train_df.drop`, you'll want to remove more than just your `0`/`1` surveillance label. What other columns do you not want to use as input? Maybe some categories you converted into codes?"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"X = train_df.drop(columns=['adshex', 'type', 'label'])\n",
"y = train_df.label"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Triple-check that `X` is a list of numeric features and and `y` is a numeric label."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>duration1</th>\n",
" <th>duration2</th>\n",
" <th>duration3</th>\n",
" <th>duration4</th>\n",
" <th>duration5</th>\n",
" <th>boxes1</th>\n",
" <th>boxes2</th>\n",
" <th>boxes3</th>\n",
" <th>boxes4</th>\n",
" <th>boxes5</th>\n",
" <th>speed1</th>\n",
" <th>speed2</th>\n",
" <th>speed3</th>\n",
" <th>speed4</th>\n",
" <th>speed5</th>\n",
" <th>altitude1</th>\n",
" <th>altitude2</th>\n",
" <th>altitude3</th>\n",
" <th>altitude4</th>\n",
" <th>altitude5</th>\n",
" <th>steer1</th>\n",
" <th>steer2</th>\n",
" <th>steer3</th>\n",
" <th>steer4</th>\n",
" <th>steer5</th>\n",
" <th>steer6</th>\n",
" <th>steer7</th>\n",
" <th>steer8</th>\n",
" <th>flights</th>\n",
" <th>squawk_1</th>\n",
" <th>observations</th>\n",
" <th>type_code</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.45000</td>\n",
" <td>0.125</td>\n",
" <td>0.025000</td>\n",
" <td>0.025000</td>\n",
" <td>0.375000</td>\n",
" <td>0.475000</td>\n",
" <td>0.250000</td>\n",
" <td>0.250000</td>\n",
" <td>0.025000</td>\n",
" <td>0.0</td>\n",
" <td>0.337128</td>\n",
" <td>0.408286</td>\n",
" <td>0.185431</td>\n",
" <td>0.053026</td>\n",
" <td>0.016129</td>\n",
" <td>0.010226</td>\n",
" <td>0.168564</td>\n",
" <td>0.793274</td>\n",
" <td>0.027936</td>\n",
" <td>0.0</td>\n",
" <td>0.151697</td>\n",
" <td>0.203774</td>\n",
" <td>0.303922</td>\n",
" <td>0.154544</td>\n",
" <td>0.033312</td>\n",
" <td>0.088024</td>\n",
" <td>0.010858</td>\n",
" <td>0.010753</td>\n",
" <td>40</td>\n",
" <td>4414</td>\n",
" <td>9486</td>\n",
" <td>91</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.52381</td>\n",
" <td>0.000</td>\n",
" <td>0.047619</td>\n",
" <td>0.095238</td>\n",
" <td>0.333333</td>\n",
" <td>0.714286</td>\n",
" <td>0.095238</td>\n",
" <td>0.047619</td>\n",
" <td>0.142857</td>\n",
" <td>0.0</td>\n",
" <td>0.703329</td>\n",
" <td>0.144543</td>\n",
" <td>0.114201</td>\n",
" <td>0.026549</td>\n",
" <td>0.011378</td>\n",
" <td>0.007164</td>\n",
" <td>0.580700</td>\n",
" <td>0.374210</td>\n",
" <td>0.037927</td>\n",
" <td>0.0</td>\n",
" <td>0.141593</td>\n",
" <td>0.152550</td>\n",
" <td>0.166456</td>\n",
" <td>0.309313</td>\n",
" <td>0.008007</td>\n",
" <td>0.078382</td>\n",
" <td>0.021492</td>\n",
" <td>0.064054</td>\n",
" <td>21</td>\n",
" <td>4414</td>\n",
" <td>2373</td>\n",
" <td>91</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" duration1 duration2 duration3 duration4 duration5 boxes1 boxes2 \\\n",
"0 0.45000 0.125 0.025000 0.025000 0.375000 0.475000 0.250000 \n",
"1 0.52381 0.000 0.047619 0.095238 0.333333 0.714286 0.095238 \n",
"\n",
" boxes3 boxes4 boxes5 speed1 speed2 speed3 speed4 \\\n",
"0 0.250000 0.025000 0.0 0.337128 0.408286 0.185431 0.053026 \n",
"1 0.047619 0.142857 0.0 0.703329 0.144543 0.114201 0.026549 \n",
"\n",
" speed5 altitude1 altitude2 altitude3 altitude4 altitude5 steer1 \\\n",
"0 0.016129 0.010226 0.168564 0.793274 0.027936 0.0 0.151697 \n",
"1 0.011378 0.007164 0.580700 0.374210 0.037927 0.0 0.141593 \n",
"\n",
" steer2 steer3 steer4 steer5 steer6 steer7 steer8 \\\n",
"0 0.203774 0.303922 0.154544 0.033312 0.088024 0.010858 0.010753 \n",
"1 0.152550 0.166456 0.309313 0.008007 0.078382 0.021492 0.064054 \n",
"\n",
" flights squawk_1 observations type_code \n",
"0 40 4414 9486 91 \n",
"1 21 4414 2373 91 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head(2)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1.0\n",
"1 1.0\n",
"Name: label, dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split into test and train datasets\n",
"\n",
"We could be nice and lazy and use all our data for training, but it just isn't right! Taking a test using the exact same questions you studied is just cheating. Split your data into test and train.\n",
"\n",
"* **Tip:** Don't do this manually! There's a method for it in sklearn"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Classify using a logistic classifier\n",
"\n",
"## Train your classifier\n",
"\n",
"Build a `LogisticRegression` and fit it to your data, making sure you're training using only `X_train` and `y_train`.\n",
"\n",
"* **Tip:** You'll want to give `LogisticRegression` an extra argument of `max_iter=4000` - it means \"work a little harder than you expect,\" because otherwise it won't find an answer (by default it only has a `max_iter` of 100)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=1000000000.0, class_weight=None, dual=False,\n",
" fit_intercept=True, intercept_scaling=1, l1_ratio=None,\n",
" max_iter=4000, multi_class='warn', n_jobs=None, penalty='l2',\n",
" random_state=None, solver='lbfgs', tol=0.0001, verbose=0,\n",
" warm_start=False)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)\n",
"\n",
"clf.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examine the coefficients\n",
"\n",
"What does it mean? What features is the classifier using? Do you care about the odds ratio? **What is even the point of this `LogisticRegression` thing?**"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>feature</th>\n",
" <th>coefficient (log odds ratio)</th>\n",
" <th>odds ratio</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>speed1</td>\n",
" <td>0.622477</td>\n",
" <td>1.863538</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>steer2</td>\n",
" <td>0.507835</td>\n",
" <td>1.661689</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>boxes1</td>\n",
" <td>0.403334</td>\n",
" <td>1.496807</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>boxes2</td>\n",
" <td>0.339208</td>\n",
" <td>1.403836</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>steer1</td>\n",
" <td>0.304670</td>\n",
" <td>1.356177</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>altitude3</td>\n",
" <td>0.251857</td>\n",
" <td>1.286412</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>duration1</td>\n",
" <td>0.114182</td>\n",
" <td>1.120956</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>steer8</td>\n",
" <td>0.002105</td>\n",
" <td>1.002107</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>squawk_1</td>\n",
" <td>0.000745</td>\n",
" <td>1.000745</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31</th>\n",
" <td>type_code</td>\n",
" <td>0.000180</td>\n",
" <td>1.000180</td>\n",
" </tr>\n",
" <tr>\n",
" <th>30</th>\n",
" <td>observations</td>\n",
" <td>0.000014</td>\n",
" <td>1.000014</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>flights</td>\n",
" <td>-0.003415</td>\n",
" <td>0.996590</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>steer7</td>\n",
" <td>-0.013651</td>\n",
" <td>0.986441</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>altitude4</td>\n",
" <td>-0.022505</td>\n",
" <td>0.977746</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>speed2</td>\n",
" <td>-0.090921</td>\n",
" <td>0.913090</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>duration2</td>\n",
" <td>-0.117767</td>\n",
" <td>0.888903</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>boxes3</td>\n",
" <td>-0.391191</td>\n",
" <td>0.676251</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>steer3</td>\n",
" <td>-0.397919</td>\n",
" <td>0.671716</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>altitude2</td>\n",
" <td>-0.401974</td>\n",
" <td>0.668998</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>steer5</td>\n",
" <td>-0.460506</td>\n",
" <td>0.630964</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>steer6</td>\n",
" <td>-0.523748</td>\n",
" <td>0.592297</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>speed3</td>\n",
" <td>-0.554705</td>\n",
" <td>0.574241</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>duration5</td>\n",
" <td>-0.557450</td>\n",
" <td>0.572667</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>duration3</td>\n",
" <td>-0.638612</td>\n",
" <td>0.528025</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>altitude1</td>\n",
" <td>-0.692721</td>\n",
" <td>0.500213</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>duration4</td>\n",
" <td>-0.823567</td>\n",
" <td>0.438863</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>boxes4</td>\n",
" <td>-0.826607</td>\n",
" <td>0.437531</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>speed4</td>\n",
" <td>-0.974281</td>\n",
" <td>0.377463</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>speed5</td>\n",
" <td>-1.025784</td>\n",
" <td>0.358515</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>altitude5</td>\n",
" <td>-1.157871</td>\n",
" <td>0.314154</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>steer4</td>\n",
" <td>-1.477424</td>\n",
" <td>0.228225</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>boxes5</td>\n",
" <td>-1.547959</td>\n",
" <td>0.212682</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" feature coefficient (log odds ratio) odds ratio\n",
"10 speed1 0.622477 1.863538\n",
"21 steer2 0.507835 1.661689\n",
"5 boxes1 0.403334 1.496807\n",
"6 boxes2 0.339208 1.403836\n",
"20 steer1 0.304670 1.356177\n",
"17 altitude3 0.251857 1.286412\n",
"0 duration1 0.114182 1.120956\n",
"27 steer8 0.002105 1.002107\n",
"29 squawk_1 0.000745 1.000745\n",
"31 type_code 0.000180 1.000180\n",
"30 observations 0.000014 1.000014\n",
"28 flights -0.003415 0.996590\n",
"26 steer7 -0.013651 0.986441\n",
"18 altitude4 -0.022505 0.977746\n",
"11 speed2 -0.090921 0.913090\n",
"1 duration2 -0.117767 0.888903\n",
"7 boxes3 -0.391191 0.676251\n",
"22 steer3 -0.397919 0.671716\n",
"16 altitude2 -0.401974 0.668998\n",
"24 steer5 -0.460506 0.630964\n",
"25 steer6 -0.523748 0.592297\n",
"12 speed3 -0.554705 0.574241\n",
"4 duration5 -0.557450 0.572667\n",
"2 duration3 -0.638612 0.528025\n",
"15 altitude1 -0.692721 0.500213\n",
"3 duration4 -0.823567 0.438863\n",
"8 boxes4 -0.826607 0.437531\n",
"13 speed4 -0.974281 0.377463\n",
"14 speed5 -1.025784 0.358515\n",
"19 altitude5 -1.157871 0.314154\n",
"23 steer4 -1.477424 0.228225\n",
"9 boxes5 -1.547959 0.212682"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"feature_names = X.columns\n",
"coefficients = clf.coef_[0]\n",
"\n",
"pd.DataFrame({\n",
" 'feature': feature_names,\n",
" 'coefficient (log odds ratio)': coefficients,\n",
" 'odds ratio': np.exp(coefficients)\n",
"}).sort_values(by='odds ratio', ascending=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we don't care about the odds ratio, using the `eli5` package can shrink our code by a lot (and give us color!)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <style>\n",
" table.eli5-weights tr:hover {\n",
" filter: brightness(85%);\n",
" }\n",
"</style>\n",
"\n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
" \n",
"\n",
" \n",
"\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" <p style=\"margin-bottom: 0.5em; margin-top: 0em\">\n",
" <b>\n",
" \n",
" y=1.0\n",
" \n",
"</b>\n",
"\n",
"top features\n",
" </p>\n",
" \n",
" <table class=\"eli5-weights\"\n",
" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto; margin-bottom: 2em;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" \n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\" title=\"Feature weights. Note that weights do not account for feature value scales, so if feature values have different scales, features with highest weights might not be the most important.\">\n",
" Weight<sup>?</sup>\n",
" </th>\n",
" \n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" \n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 91.24%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.622\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" speed1\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.40%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.508\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer2\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.53%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" +0.403\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes1\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 93.53%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 8 more positive …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
"\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.67%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 5 more negative …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.67%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.391\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes3\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.59%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.398\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer3\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 93.55%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.402\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" altitude2\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 92.90%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.461\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer5\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 92.23%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.524\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer6\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.92%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.555\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" speed3\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.89%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.557\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" duration5\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 91.08%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.639\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" duration3\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 90.56%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.693\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" altitude1\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 89.34%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.824\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" duration4\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 89.31%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.827\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes4\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 88.01%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -0.974\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" speed4\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 87.57%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.026\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" speed5\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 86.47%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.158\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" altitude5\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 83.95%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.477\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer4\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 83.42%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -1.548\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes5\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 80.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" -2.023\n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" <BIAS>\n",
" </td>\n",
" \n",
"</tr>\n",
" \n",
"\n",
" </tbody>\n",
" </table>\n",
"\n",
" \n",
" \n",
"\n",
" \n",
"\n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
"\n",
"\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import eli5\n",
"\n",
"feature_names = list(X.columns)\n",
"\n",
"# Use this line instead for wonderful warnings about the results\n",
"# eli5.show_weights(clf, feature_names=feature_names, show=eli5.formatters.fields.ALL)\n",
"eli5.show_weights(clf, feature_names=feature_names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How well does our classifier perform?\n",
"\n",
"Let's take a look at the confusion matrix to see how well this classifier finds surveillance planes. Make sure you're using `y_test` and `X_test`, not the full dataset."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Predicted not surveil</th>\n",
" <th>Predicted surveil</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Is not surveil</th>\n",
" <td>120</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Is surveil</th>\n",
" <td>12</td>\n",
" <td>14</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Predicted not surveil Predicted surveil\n",
"Is not surveil 120 4\n",
"Is surveil 12 14"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"\n",
"y_true = y_test\n",
"y_pred = clf.predict(X_test)\n",
"matrix = confusion_matrix(y_true, y_pred)\n",
"\n",
"label_names = pd.Series(['not surveil', 'surveil'])\n",
"pd.DataFrame(matrix,\n",
" columns='Predicted ' + label_names,\n",
" index='Is ' + label_names)"
]
},
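{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to read the confusion matrix above is to boil it down to two numbers. Here is a short sketch with the counts copied from that output (yours will differ, since the split is random): recall is the share of real surveillance planes we caught, and precision is the share of predicted surveillance planes that were real.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Counts copied from the confusion matrix above: rows are truth, columns are predictions\n",
"matrix = np.array([[120, 4],\n",
"                   [12, 14]])\n",
"\n",
"true_neg, false_pos = matrix[0]\n",
"false_neg, true_pos = matrix[1]\n",
"\n",
"# Of the 26 real surveillance planes, how many did we find?\n",
"recall = true_pos / (true_pos + false_neg)\n",
"\n",
"# Of the 18 planes we flagged, how many were really surveillance planes?\n",
"precision = true_pos / (true_pos + false_pos)\n",
"\n",
"print(f'recall: {recall:.2f}, precision: {precision:.2f}')\n",
"```\n",
"\n",
"With these counts the classifier finds 14 of the 26 known surveillance planes, so it misses almost half of them."
]
},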
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Classify using a decision tree\n",
"\n",
"Now we'll use a decision tree. This is how you make one:\n",
"\n",
"```python\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"clf = DecisionTreeClassifier()\n",
"```\n",
"\n",
"But it's up to you to teach it what spy planes look like using your training data.\n",
"\n",
"If we use `max_depth=` to limit the depth of the tree, it will help us visualize it. For example, `max_depth=5` will only allow the tree to make five decisions.\n",
"\n",
"Make a decision tree and fit it to your data. Use a `max_depth=` of something between 2 to 5."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,\n",
" max_features=None, max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, presort=False,\n",
" random_state=None, splitter='best')"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"clf = DecisionTreeClassifier(max_depth=5)\n",
"clf.fit(X_train, y_train)"
]
},
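{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `max_depth=` actually limits, here is a minimal sketch on synthetic data. The `make_classification` dataset and the `demo_` names are assumptions for illustration, not the spy plane data. It fits trees with different depth limits and reads the depth and node count from each fitted tree's `tree_` attribute.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# Synthetic stand-in data, NOT the planes dataset\n",
"demo_X, demo_y = make_classification(n_samples=500, n_features=10, random_state=0)\n",
"\n",
"for depth in [2, 5, None]:\n",
"    demo_clf = DecisionTreeClassifier(max_depth=depth, random_state=0)\n",
"    demo_clf.fit(demo_X, demo_y)\n",
"    # tree_.max_depth is the depth the fitted tree actually reached,\n",
"    # tree_.node_count is the total number of nodes (splits plus leaves)\n",
"    print(depth, demo_clf.tree_.max_depth, demo_clf.tree_.node_count)\n",
"```\n",
"\n",
"With a limit of 5, no path from the top of the tree to a leaf asks more than five questions, so the tree has at most 32 leaves and stays small enough to draw."
]
},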
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What are the important features?\n",
"\n",
"We'll use slighyl different code for a decision tree, as it likes to draw big pictures if we don't stop it. The code looks like this:\n",
"\n",
"```python\n",
"import eli5\n",
"\n",
"feature_names=list(X.columns)\n",
"eli5.show_weights(clf, feature_names=feature_names, show=['description', 'feature_importances'])\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <style>\n",
" table.eli5-weights tr:hover {\n",
" filter: brightness(85%);\n",
" }\n",
"</style>\n",
"\n",
"\n",
"\n",
" \n",
"\n",
" \n",
" \n",
" <pre>\n",
"Decision tree feature importances; values are numbers 0 <= x <= 1;\n",
"all values sum to 1.\n",
"</pre>\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
" <table class=\"eli5-weights eli5-feature-importances\" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">Weight</th>\n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 80.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.6489\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer2\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 92.94%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.1465\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" squawk_1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 96.50%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0538\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" altitude1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 96.86%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0461\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" duration4\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 98.11%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0223\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" speed2\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 98.19%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0210\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes3\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 98.21%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0206\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" duration1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 98.70%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0131\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" altitude2\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 98.78%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0120\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 98.80%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0117\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes4\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 99.43%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0040\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" altitude5\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 100.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes2\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 100.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" flights\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 100.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer5\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 100.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" speed1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 100.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" duration5\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 100.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 100.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes5\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 100.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" duration3\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 100.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" observations\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" \n",
" <tr style=\"background-color: hsl(0, 100.00%, 100.00%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 12 more …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" </tbody>\n",
"</table>\n",
" \n",
"\n",
" \n",
"\n",
"\n",
"\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import eli5\n",
"\n",
"feature_names=list(X.columns)\n",
"eli5.show_weights(clf, feature_names=feature_names, show=['description', 'feature_importances'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Understanding the output\n",
"\n",
"**Why is the feature importance difference than for logistic regression?**\n",
"\n",
"Also, if you don't specify a `max_depth`, that's a LOT of zeroes! It doesn't even use most of the features! **Why not?**"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"# Because it's a different algorithm\n",
"# Because the features aren't important"
]
},
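{
"cell_type": "markdown",
"metadata": {},
"source": [
"The zeroes are easier to believe once you check which features a tree actually splits on. This is a minimal sketch on synthetic data (again, the dataset and the `demo_` names are assumptions for illustration, not the planes data). It compares the features used in splits with the ones that get a nonzero importance.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# Synthetic stand-in data with 20 features, NOT the planes dataset\n",
"demo_X, demo_y = make_classification(n_samples=500, n_features=20,\n",
"                                     n_informative=3, random_state=0)\n",
"demo_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(demo_X, demo_y)\n",
"\n",
"# tree_.feature holds the feature index used at each node;\n",
"# leaves are marked with a negative value, so we filter them out\n",
"split_features = demo_clf.tree_.feature\n",
"used = np.unique(split_features[split_features >= 0])\n",
"importances = demo_clf.feature_importances_\n",
"\n",
"print('features used in splits:', used)\n",
"print('features with nonzero importance:', np.flatnonzero(importances))\n",
"print('importances sum to:', importances.sum())\n",
"```\n",
"\n",
"A tree of depth 3 has at most seven splits, so at most seven of the twenty features can get any importance at all. Every other feature is left at zero."
]
},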
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How well does the tree perform?\n",
"\n",
"Display another confusion matrix with your new classifier."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Predicted not surveil</th>\n",
" <th>Predicted surveil</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Is not surveil</th>\n",
" <td>120</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Is surveil</th>\n",
" <td>2</td>\n",
" <td>24</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Predicted not surveil Predicted surveil\n",
"Is not surveil 120 4\n",
"Is surveil 2 24"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"\n",
"y_true = y_test\n",
"y_pred = clf.predict(X_test)\n",
"matrix = confusion_matrix(y_true, y_pred)\n",
"\n",
"label_names = pd.Series(['not surveil', 'surveil'])\n",
"pd.DataFrame(matrix,\n",
" columns='Predicted ' + label_names,\n",
" index='Is ' + label_names)"
]
},
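{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compare the two confusion matrices with numbers instead of by eye, you can work out recall and precision for the surveillance class. This sketch just copies the counts shown in this notebook's outputs (yours will differ if your train/test split differs). The dictionary labels are only for illustration.\n",
"\n",
"```python\n",
"# Counts copied from the two confusion matrices in this notebook\n",
"# (rows: actual class, columns: predicted class)\n",
"matrices = {\n",
"    'earlier classifier': [[120, 4], [12, 14]],\n",
"    'decision tree': [[120, 4], [2, 24]],\n",
"}\n",
"\n",
"for name, ((tn, fp), (fn, tp)) in matrices.items():\n",
"    recall = tp / (tp + fn)     # share of real surveillance planes we caught\n",
"    precision = tp / (tp + fp)  # share of flagged planes that really are surveillance\n",
"    print(f'{name}: recall={recall:.2f}, precision={precision:.2f}')\n",
"```\n",
"\n",
"With these counts the tree catches 24 of the 26 surveillance planes in the test set, while the earlier classifier catches 14. Both raise the same four false alarms."
]
},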
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize the tree\n",
"\n",
"You can use `eli5` to visualize the decision tree itself! It usually takes up too much space, but since it's a special occasion we'll let it go."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <style>\n",
" table.eli5-weights tr:hover {\n",
" filter: brightness(85%);\n",
" }\n",
"</style>\n",
"\n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
" \n",
" <br>\n",
" <pre><svg width=\"1436pt\" height=\"642pt\"\n",
" viewBox=\"0.00 0.00 1436.13 642.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 638)\">\n",
"<title>Tree</title>\n",
"<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-638 1432.1338,-638 1432.1338,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>0</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"897.2007,-634 749.933,-634 749.933,-556 897.2007,-556 897.2007,-634\"/>\n",
"<text text-anchor=\"middle\" x=\"823.5669\" y=\"-618.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">steer2 <= 0.111</text>\n",
"<text text-anchor=\"middle\" x=\"823.5669\" y=\"-604.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.267</text>\n",
"<text text-anchor=\"middle\" x=\"823.5669\" y=\"-590.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 100.0%</text>\n",
"<text text-anchor=\"middle\" x=\"823.5669\" y=\"-576.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.841, 0.159]</text>\n",
"<text text-anchor=\"middle\" x=\"823.5669\" y=\"-562.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>1</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"738.2007,-520 590.933,-520 590.933,-442 738.2007,-442 738.2007,-520\"/>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-504.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">squawk_1 <= 4380.5</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-490.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.093</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-476.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 87.2%</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-462.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.951, 0.049]</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-448.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 0->1 -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>0->1</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M768.8481,-555.7677C755.5322,-546.2204 741.1857,-535.9342 727.5215,-526.1373\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"729.2835,-523.0939 719.1171,-520.1115 725.2046,-528.7828 729.2835,-523.0939\"/>\n",
"<text text-anchor=\"middle\" x=\"723.1831\" y=\"-540.5786\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">True</text>\n",
"</g>\n",
"<!-- 20 -->\n",
"<g id=\"node21\" class=\"node\">\n",
"<title>20</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"1045.6048,-520 905.529,-520 905.529,-442 1045.6048,-442 1045.6048,-520\"/>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-504.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">duration4 <= 0.207</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-490.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.16</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-476.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 12.8%</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-462.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.088, 0.912]</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-448.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 0->20 -->\n",
"<g id=\"edge20\" class=\"edge\">\n",
"<title>0->20</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M875.8767,-555.7677C888.6063,-546.2204 902.3213,-535.9342 915.3839,-526.1373\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"917.5183,-528.9115 923.4183,-520.1115 913.3183,-523.3115 917.5183,-528.9115\"/>\n",
"<text text-anchor=\"middle\" x=\"919.8828\" y=\"-540.6602\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">False</text>\n",
"</g>\n",
"<!-- 2 -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>2</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"568.2007,-406 420.933,-406 420.933,-328 568.2007,-328 568.2007,-406\"/>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-390.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">duration1 <= 0.371</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-376.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.045</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-362.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 77.4%</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-348.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.977, 0.023]</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-334.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 1->2 -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>1->2</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M606.0625,-441.7677C591.6911,-432.1303 576.1968,-421.7401 561.4636,-411.8601\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"563.1458,-408.7741 552.891,-406.1115 559.2471,-414.5879 563.1458,-408.7741\"/>\n",
"</g>\n",
"<!-- 13 -->\n",
"<g id=\"node14\" class=\"node\">\n",
"<title>13</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"738.2007,-406 590.933,-406 590.933,-328 738.2007,-328 738.2007,-406\"/>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-390.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">squawk_1 <= 4465.5</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-376.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.375</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-362.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 9.8%</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-348.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.75, 0.25]</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-334.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 1->13 -->\n",
"<g id=\"edge13\" class=\"edge\">\n",
"<title>1->13</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M664.5669,-441.7677C664.5669,-433.6172 664.5669,-424.9283 664.5669,-416.4649\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"668.067,-416.3046 664.5669,-406.3046 661.067,-416.3047 668.067,-416.3046\"/>\n",
"</g>\n",
"<!-- 3 -->\n",
"<g id=\"node4\" class=\"node\">\n",
"<title>3</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"389.2007,-292 241.933,-292 241.933,-214 389.2007,-214 389.2007,-292\"/>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-276.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">speed2 <= 0.003</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-262.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.006</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-248.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 69.4%</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-234.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.997, 0.003]</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-220.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 2->3 -->\n",
"<g id=\"edge3\" class=\"edge\">\n",
"<title>2->3</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M433.2157,-327.9272C417.8452,-318.1381 401.2415,-307.5637 385.4887,-297.5312\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"387.1528,-294.4415 376.8379,-292.0218 383.3925,-300.3458 387.1528,-294.4415\"/>\n",
"</g>\n",
"<!-- 8 -->\n",
"<g id=\"node9\" class=\"node\">\n",
"<title>8</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"568.2007,-292 420.933,-292 420.933,-214 568.2007,-214 568.2007,-292\"/>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-276.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">altitude1 <= 0.122</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-262.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.313</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-248.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 8.1%</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-234.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.806, 0.194]</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-220.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 2->8 -->\n",
"<g id=\"edge8\" class=\"edge\">\n",
"<title>2->8</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M494.5669,-327.7677C494.5669,-319.6172 494.5669,-310.9283 494.5669,-302.4649\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"498.067,-302.3046 494.5669,-292.3046 491.067,-302.3047 498.067,-302.3046\"/>\n",
"</g>\n",
"<!-- 4 -->\n",
"<g id=\"node5\" class=\"node\">\n",
"<title>4</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"224.2007,-178 76.933,-178 76.933,-100 224.2007,-100 224.2007,-178\"/>\n",
"<text text-anchor=\"middle\" x=\"150.5669\" y=\"-162.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">boxes4 <= 0.379</text>\n",
"<text text-anchor=\"middle\" x=\"150.5669\" y=\"-148.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.444</text>\n",
"<text text-anchor=\"middle\" x=\"150.5669\" y=\"-134.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 0.7%</text>\n",
"<text text-anchor=\"middle\" x=\"150.5669\" y=\"-120.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.667, 0.333]</text>\n",
"<text text-anchor=\"middle\" x=\"150.5669\" y=\"-106.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 3->4 -->\n",
"<g id=\"edge4\" class=\"edge\">\n",
"<title>3->4</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M258.7833,-213.7677C244.8345,-204.1303 229.796,-193.7401 215.496,-183.8601\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"217.3924,-180.9163 207.1756,-178.1115 213.4134,-186.6754 217.3924,-180.9163\"/>\n",
"</g>\n",
"<!-- 7 -->\n",
"<g id=\"node8\" class=\"node\">\n",
"<title>7</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"389.2007,-171 241.933,-171 241.933,-107 389.2007,-107 389.2007,-171\"/>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-155.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-141.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 68.7%</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-127.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1.0, 0.0]</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-113.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 3->7 -->\n",
"<g id=\"edge7\" class=\"edge\">\n",
"<title>3->7</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M315.5669,-213.7677C315.5669,-203.3338 315.5669,-192.0174 315.5669,-181.4215\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"319.067,-181.1252 315.5669,-171.1252 312.067,-181.1252 319.067,-181.1252\"/>\n",
"</g>\n",
"<!-- 5 -->\n",
"<g id=\"node6\" class=\"node\">\n",
"<title>5</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"147.2007,-64 -.067,-64 -.067,0 147.2007,0 147.2007,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"73.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"73.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 0.4%</text>\n",
"<text text-anchor=\"middle\" x=\"73.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1.0, 0.0]</text>\n",
"<text text-anchor=\"middle\" x=\"73.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 4->5 -->\n",
"<g id=\"edge5\" class=\"edge\">\n",
"<title>4->5</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M122.3322,-99.7647C115.957,-90.9057 109.1767,-81.4838 102.7629,-72.571\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"105.433,-70.2893 96.751,-64.2169 99.7512,-74.3781 105.433,-70.2893\"/>\n",
"</g>\n",
"<!-- 6 -->\n",
"<g id=\"node7\" class=\"node\">\n",
"<title>6</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"290.3113,-64 164.8225,-64 164.8225,0 290.3113,0 290.3113,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"227.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"227.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 0.2%</text>\n",
"<text text-anchor=\"middle\" x=\"227.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.0, 1.0]</text>\n",
"<text text-anchor=\"middle\" x=\"227.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 4->6 -->\n",
"<g id=\"edge6\" class=\"edge\">\n",
"<title>4->6</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M178.8016,-99.7647C185.1768,-90.9057 191.9571,-81.4838 198.3709,-72.571\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"201.3826,-74.3781 204.3828,-64.2169 195.7008,-70.2893 201.3826,-74.3781\"/>\n",
"</g>\n",
"<!-- 9 -->\n",
"<g id=\"node10\" class=\"node\">\n",
"<title>9</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"554.2007,-178 406.933,-178 406.933,-100 554.2007,-100 554.2007,-178\"/>\n",
"<text text-anchor=\"middle\" x=\"480.5669\" y=\"-162.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">altitude1 <= 0.046</text>\n",
"<text text-anchor=\"middle\" x=\"480.5669\" y=\"-148.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.444</text>\n",
"<text text-anchor=\"middle\" x=\"480.5669\" y=\"-134.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 4.7%</text>\n",
"<text text-anchor=\"middle\" x=\"480.5669\" y=\"-120.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.667, 0.333]</text>\n",
"<text text-anchor=\"middle\" x=\"480.5669\" y=\"-106.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 8->9 -->\n",
"<g id=\"edge9\" class=\"edge\">\n",
"<title>8->9</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M489.7489,-213.7677C488.748,-205.6172 487.6809,-196.9283 486.6415,-188.4649\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"490.0867,-187.8034 485.3938,-178.3046 483.1389,-188.6567 490.0867,-187.8034\"/>\n",
"</g>\n",
"<!-- 12 -->\n",
"<g id=\"node13\" class=\"node\">\n",
"<title>12</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"719.2007,-171 571.933,-171 571.933,-107 719.2007,-107 719.2007,-171\"/>\n",
"<text text-anchor=\"middle\" x=\"645.5669\" y=\"-155.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"645.5669\" y=\"-141.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 3.4%</text>\n",
"<text text-anchor=\"middle\" x=\"645.5669\" y=\"-127.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1.0, 0.0]</text>\n",
"<text text-anchor=\"middle\" x=\"645.5669\" y=\"-113.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 8->12 -->\n",
"<g id=\"edge12\" class=\"edge\">\n",
"<title>8->12</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M546.5325,-213.7677C562.2243,-201.9209 579.4231,-188.9364 595.0209,-177.1606\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"597.143,-179.9439 603.0151,-171.1252 592.9253,-174.3572 597.143,-179.9439\"/>\n",
"</g>\n",
"<!-- 10 -->\n",
"<g id=\"node11\" class=\"node\">\n",
"<title>10</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"505.2007,-64 357.933,-64 357.933,0 505.2007,0 505.2007,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"431.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.231</text>\n",
"<text text-anchor=\"middle\" x=\"431.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 3.4%</text>\n",
"<text text-anchor=\"middle\" x=\"431.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.867, 0.133]</text>\n",
"<text text-anchor=\"middle\" x=\"431.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 9->10 -->\n",
"<g id=\"edge10\" class=\"edge\">\n",
"<title>9->10</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M462.5993,-99.7647C458.6679,-91.1797 454.4943,-82.066 450.5255,-73.3994\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"453.6663,-71.8516 446.3204,-64.2169 447.3019,-74.7662 453.6663,-71.8516\"/>\n",
"</g>\n",
"<!-- 11 -->\n",
"<g id=\"node12\" class=\"node\">\n",
"<title>11</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"663.6048,-64 523.529,-64 523.529,0 663.6048,0 663.6048,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"593.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.278</text>\n",
"<text text-anchor=\"middle\" x=\"593.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 1.3%</text>\n",
"<text text-anchor=\"middle\" x=\"593.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.167, 0.833]</text>\n",
"<text text-anchor=\"middle\" x=\"593.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 9->11 -->\n",
"<g id=\"edge11\" class=\"edge\">\n",
"<title>9->11</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M522.0023,-99.7647C531.8403,-90.4491 542.3357,-80.5109 552.1719,-71.197\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"554.6887,-73.634 559.5435,-64.2169 549.8757,-68.5511 554.6887,-73.634\"/>\n",
"</g>\n",
"<!-- 14 -->\n",
"<g id=\"node15\" class=\"node\">\n",
"<title>14</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"719.3113,-285 593.8225,-285 593.8225,-221 719.3113,-221 719.3113,-285\"/>\n",
"<text text-anchor=\"middle\" x=\"656.5669\" y=\"-269.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"656.5669\" y=\"-255.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 2.0%</text>\n",
"<text text-anchor=\"middle\" x=\"656.5669\" y=\"-241.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.0, 1.0]</text>\n",
"<text text-anchor=\"middle\" x=\"656.5669\" y=\"-227.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 13->14 -->\n",
"<g id=\"edge14\" class=\"edge\">\n",
"<title>13->14</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M661.8137,-327.7677C661.0815,-317.3338 660.2874,-306.0174 659.5438,-295.4215\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"663.0128,-294.8556 658.8213,-285.1252 656.03,-295.3457 663.0128,-294.8556\"/>\n",
"</g>\n",
"<!-- 15 -->\n",
"<g id=\"node16\" class=\"node\">\n",
"<title>15</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"884.2007,-292 736.933,-292 736.933,-214 884.2007,-214 884.2007,-292\"/>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-276.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">steer1 <= 0.027</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-262.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.108</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-248.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 7.8%</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-234.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.943, 0.057]</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-220.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 13->15 -->\n",
"<g id=\"edge15\" class=\"edge\">\n",
"<title>13->15</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M714.8118,-327.7677C726.856,-318.3633 739.8184,-308.242 752.1957,-298.5775\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"754.5015,-301.2177 760.2294,-292.3046 750.1934,-295.7004 754.5015,-301.2177\"/>\n",
"</g>\n",
"<!-- 16 -->\n",
"<g id=\"node17\" class=\"node\">\n",
"<title>16</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"884.2007,-171 736.933,-171 736.933,-107 884.2007,-107 884.2007,-171\"/>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-155.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-141.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 6.7%</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-127.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1.0, 0.0]</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-113.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 15->16 -->\n",
"<g id=\"edge16\" class=\"edge\">\n",
"<title>15->16</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M810.5669,-213.7677C810.5669,-203.3338 810.5669,-192.0174 810.5669,-181.4215\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"814.067,-181.1252 810.5669,-171.1252 807.067,-181.1252 814.067,-181.1252\"/>\n",
"</g>\n",
"<!-- 17 -->\n",
"<g id=\"node18\" class=\"node\">\n",
"<title>17</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"1049.2007,-178 901.933,-178 901.933,-100 1049.2007,-100 1049.2007,-178\"/>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-162.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">boxes3 <= 0.113</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-148.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.48</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-134.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 1.1%</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-120.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.6, 0.4]</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-106.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 15->17 -->\n",
"<g id=\"edge17\" class=\"edge\">\n",
"<title>15->17</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M867.3505,-213.7677C881.2993,-204.1303 896.3378,-193.7401 910.6378,-183.8601\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"912.7204,-186.6754 918.9582,-178.1115 908.7414,-180.9163 912.7204,-186.6754\"/>\n",
"</g>\n",
"<!-- 18 -->\n",
"<g id=\"node19\" class=\"node\">\n",
"<title>18</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"937.2007,-64 789.933,-64 789.933,0 937.2007,0 937.2007,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"863.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"863.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 0.7%</text>\n",
"<text text-anchor=\"middle\" x=\"863.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1.0, 0.0]</text>\n",
"<text text-anchor=\"middle\" x=\"863.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 17->18 -->\n",
"<g id=\"edge18\" class=\"edge\">\n",
"<title>17->18</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M934.4982,-99.7647C924.7472,-90.4491 914.3447,-80.5109 904.5955,-71.197\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"906.9376,-68.594 897.2892,-64.2169 902.1021,-73.6555 906.9376,-68.594\"/>\n",
"</g>\n",
"<!-- 19 -->\n",
"<g id=\"node20\" class=\"node\">\n",
"<title>19</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"1080.3113,-64 954.8225,-64 954.8225,0 1080.3113,0 1080.3113,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"1017.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"1017.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 0.4%</text>\n",
"<text text-anchor=\"middle\" x=\"1017.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.0, 1.0]</text>\n",
"<text text-anchor=\"middle\" x=\"1017.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 17->19 -->\n",
"<g id=\"edge19\" class=\"edge\">\n",
"<title>17->19</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M990.9677,-99.7647C994.3016,-91.271 997.8387,-82.2599 1001.208,-73.6762\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1004.5251,-74.8043 1004.921,-64.2169 998.0091,-72.2466 1004.5251,-74.8043\"/>\n",
"</g>\n",
"<!-- 21 -->\n",
"<g id=\"node22\" class=\"node\">\n",
"<title>21</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"1045.6048,-406 905.529,-406 905.529,-328 1045.6048,-328 1045.6048,-406\"/>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-390.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">speed2 <= 0.013</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-376.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.071</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-362.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 12.1%</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-348.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.037, 0.963]</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-334.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 20->21 -->\n",
"<g id=\"edge21\" class=\"edge\">\n",
"<title>20->21</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M975.5669,-441.7677C975.5669,-433.6172 975.5669,-424.9283 975.5669,-416.4649\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"979.067,-416.3046 975.5669,-406.3046 972.067,-416.3047 979.067,-416.3046\"/>\n",
"</g>\n",
"<!-- 28 -->\n",
"<g id=\"node29\" class=\"node\">\n",
"<title>28</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"1211.2007,-399 1063.933,-399 1063.933,-335 1211.2007,-335 1211.2007,-399\"/>\n",
"<text text-anchor=\"middle\" x=\"1137.5669\" y=\"-383.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"1137.5669\" y=\"-369.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 0.7%</text>\n",
"<text text-anchor=\"middle\" x=\"1137.5669\" y=\"-355.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1.0, 0.0]</text>\n",
"<text text-anchor=\"middle\" x=\"1137.5669\" y=\"-341.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 20->28 -->\n",
"<g id=\"edge28\" class=\"edge\">\n",
"<title>20->28</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1031.3181,-441.7677C1048.153,-429.9209 1066.6047,-416.9364 1083.3387,-405.1606\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1085.7515,-407.7425 1091.9153,-399.1252 1081.723,-402.0178 1085.7515,-407.7425\"/>\n",
"</g>\n",
"<!-- 22 -->\n",
"<g id=\"node23\" class=\"node\">\n",
"<title>22</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"1049.2007,-285 901.933,-285 901.933,-221 1049.2007,-221 1049.2007,-285\"/>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-269.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-255.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 0.2%</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-241.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1.0, 0.0]</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-227.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 21->22 -->\n",
"<g id=\"edge22\" class=\"edge\">\n",
"<title>21->22</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M975.5669,-327.7677C975.5669,-317.3338 975.5669,-306.0174 975.5669,-295.4215\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"979.067,-295.1252 975.5669,-285.1252 972.067,-295.1252 979.067,-295.1252\"/>\n",
"</g>\n",
"<!-- 23 -->\n",
"<g id=\"node24\" class=\"node\">\n",
"<title>23</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"1207.6048,-292 1067.529,-292 1067.529,-214 1207.6048,-214 1207.6048,-292\"/>\n",
"<text text-anchor=\"middle\" x=\"1137.5669\" y=\"-276.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">altitude5 <= 0.261</text>\n",
"<text text-anchor=\"middle\" x=\"1137.5669\" y=\"-262.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.037</text>\n",
"<text text-anchor=\"middle\" x=\"1137.5669\" y=\"-248.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 11.9%</text>\n",
"<text text-anchor=\"middle\" x=\"1137.5669\" y=\"-234.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.019, 0.981]</text>\n",
"<text text-anchor=\"middle\" x=\"1137.5669\" y=\"-220.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 21->23 -->\n",
"<g id=\"edge23\" class=\"edge\">\n",
"<title>21->23</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1031.3181,-327.7677C1044.8853,-318.2204 1059.5025,-307.9342 1073.4245,-298.1373\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1075.8236,-300.7288 1081.9875,-292.1115 1071.7951,-295.0041 1075.8236,-300.7288\"/>\n",
"</g>\n",
"<!-- 24 -->\n",
"<g id=\"node25\" class=\"node\">\n",
"<title>24</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"1196.3113,-171 1070.8225,-171 1070.8225,-107 1196.3113,-107 1196.3113,-171\"/>\n",
"<text text-anchor=\"middle\" x=\"1133.5669\" y=\"-155.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"1133.5669\" y=\"-141.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 11.0%</text>\n",
"<text text-anchor=\"middle\" x=\"1133.5669\" y=\"-127.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.0, 1.0]</text>\n",
"<text text-anchor=\"middle\" x=\"1133.5669\" y=\"-113.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 23->24 -->\n",
"<g id=\"edge24\" class=\"edge\">\n",
"<title>23->24</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1136.1903,-213.7677C1135.8242,-203.3338 1135.4272,-192.0174 1135.0554,-181.4215\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1138.5427,-180.9963 1134.6941,-171.1252 1131.547,-181.2418 1138.5427,-180.9963\"/>\n",
"</g>\n",
"<!-- 25 -->\n",
"<g id=\"node26\" class=\"node\">\n",
"<title>25</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"1340.6048,-178 1214.5289,-178 1214.5289,-100 1340.6048,-100 1340.6048,-178\"/>\n",
"<text text-anchor=\"middle\" x=\"1277.5669\" y=\"-162.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">altitude2 <= 0.01</text>\n",
"<text text-anchor=\"middle\" x=\"1277.5669\" y=\"-148.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.375</text>\n",
"<text text-anchor=\"middle\" x=\"1277.5669\" y=\"-134.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 0.9%</text>\n",
"<text text-anchor=\"middle\" x=\"1277.5669\" y=\"-120.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.25, 0.75]</text>\n",
"<text text-anchor=\"middle\" x=\"1277.5669\" y=\"-106.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 23->25 -->\n",
"<g id=\"edge25\" class=\"edge\">\n",
"<title>23->25</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1185.747,-213.7677C1197.1862,-204.4529 1209.4892,-194.4347 1221.2553,-184.8538\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1223.7537,-187.333 1229.298,-178.3046 1219.3336,-181.9049 1223.7537,-187.333\"/>\n",
"</g>\n",
"<!-- 26 -->\n",
"<g id=\"node27\" class=\"node\">\n",
"<title>26</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"1263.3113,-64 1137.8225,-64 1137.8225,0 1263.3113,0 1263.3113,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"1200.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"1200.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 0.7%</text>\n",
"<text text-anchor=\"middle\" x=\"1200.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0.0, 1.0]</text>\n",
"<text text-anchor=\"middle\" x=\"1200.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 25->26 -->\n",
"<g id=\"edge26\" class=\"edge\">\n",
"<title>25->26</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1249.3322,-99.7647C1242.957,-90.9057 1236.1767,-81.4838 1229.7629,-72.571\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1232.433,-70.2893 1223.751,-64.2169 1226.7512,-74.3781 1232.433,-70.2893\"/>\n",
"</g>\n",
"<!-- 27 -->\n",
"<g id=\"node28\" class=\"node\">\n",
"<title>27</title>\n",
"<polygon fill=\"none\" stroke=\"#000000\" points=\"1428.2007,-64 1280.933,-64 1280.933,0 1428.2007,0 1428.2007,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"1354.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"1354.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 0.2%</text>\n",
"<text text-anchor=\"middle\" x=\"1354.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1.0, 0.0]</text>\n",
"<text text-anchor=\"middle\" x=\"1354.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 25->27 -->\n",
"<g id=\"edge27\" class=\"edge\">\n",
"<title>25->27</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1305.8016,-99.7647C1312.1768,-90.9057 1318.9571,-81.4838 1325.3709,-72.571\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1328.3826,-74.3781 1331.3828,-64.2169 1322.7008,-70.2893 1328.3826,-74.3781\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n",
"</pre>\n",
" \n",
"\n",
"\n",
"\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_names=list(X.columns)\n",
"label_names = ['not surveillance', 'surveillance']\n",
"eli5.show_weights(clf, feature_names=feature_names, target_names=label_names, show=['decision_tree'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd like your graph to have colors colors, or to not use eli5, you can do it the old-fashioned way. You might need to `brew install graphviz` and `pip install graphviz`.\n",
"\n",
"```python\n",
"from sklearn import tree\n",
"import graphviz\n",
"\n",
"label_names = ['not surveillance', 'surveillance']\n",
"feature_names = X.columns\n",
"\n",
"dot_data = tree.export_graphviz(clf,\n",
" feature_names=feature_names, \n",
" filled=True,\n",
" class_names=label_names) \n",
"graph = graphviz.Source(dot_data) \n",
"graph\n",
"```\n",
"\n",
"* **Tip:** You'll probably need to scroll sideways a bit"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.40.1 (20161225.0304)\n",
" -->\n",
"<!-- Title: Tree Pages: 1 -->\n",
"<svg width=\"1432pt\" height=\"642pt\"\n",
" viewBox=\"0.00 0.00 1432.13 642.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 638)\">\n",
"<title>Tree</title>\n",
"<polygon fill=\"#ffffff\" stroke=\"transparent\" points=\"-4,4 -4,-638 1428.1338,-638 1428.1338,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\">\n",
"<title>0</title>\n",
"<polygon fill=\"#ea995e\" stroke=\"#000000\" points=\"897.2007,-634 749.933,-634 749.933,-556 897.2007,-556 897.2007,-634\"/>\n",
"<text text-anchor=\"middle\" x=\"823.5669\" y=\"-618.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">steer2 <= 0.111</text>\n",
"<text text-anchor=\"middle\" x=\"823.5669\" y=\"-604.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.267</text>\n",
"<text text-anchor=\"middle\" x=\"823.5669\" y=\"-590.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 447</text>\n",
"<text text-anchor=\"middle\" x=\"823.5669\" y=\"-576.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [376, 71]</text>\n",
"<text text-anchor=\"middle\" x=\"823.5669\" y=\"-562.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\">\n",
"<title>1</title>\n",
"<polygon fill=\"#e68743\" stroke=\"#000000\" points=\"738.2007,-520 590.933,-520 590.933,-442 738.2007,-442 738.2007,-520\"/>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-504.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">squawk_1 <= 4380.5</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-490.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.093</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-476.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 390</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-462.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [371, 19]</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-448.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 0->1 -->\n",
"<g id=\"edge1\" class=\"edge\">\n",
"<title>0->1</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M768.8481,-555.7677C755.5322,-546.2204 741.1857,-535.9342 727.5215,-526.1373\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"729.2835,-523.0939 719.1171,-520.1115 725.2046,-528.7828 729.2835,-523.0939\"/>\n",
"<text text-anchor=\"middle\" x=\"723.1831\" y=\"-540.5786\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">True</text>\n",
"</g>\n",
"<!-- 20 -->\n",
"<g id=\"node21\" class=\"node\">\n",
"<title>20</title>\n",
"<polygon fill=\"#4ca6e8\" stroke=\"#000000\" points=\"1038.3113,-520 912.8225,-520 912.8225,-442 1038.3113,-442 1038.3113,-520\"/>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-504.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">duration4 <= 0.207</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-490.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.16</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-476.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 57</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-462.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [5, 52]</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-448.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 0->20 -->\n",
"<g id=\"edge20\" class=\"edge\">\n",
"<title>0->20</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M875.8767,-555.7677C888.6063,-546.2204 902.3213,-535.9342 915.3839,-526.1373\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"917.5183,-528.9115 923.4183,-520.1115 913.3183,-523.3115 917.5183,-528.9115\"/>\n",
"<text text-anchor=\"middle\" x=\"919.8828\" y=\"-540.6602\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">False</text>\n",
"</g>\n",
"<!-- 2 -->\n",
"<g id=\"node3\" class=\"node\">\n",
"<title>2</title>\n",
"<polygon fill=\"#e6843e\" stroke=\"#000000\" points=\"568.2007,-406 420.933,-406 420.933,-328 568.2007,-328 568.2007,-406\"/>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-390.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">duration1 <= 0.371</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-376.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.045</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-362.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 346</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-348.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [338, 8]</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-334.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 1->2 -->\n",
"<g id=\"edge2\" class=\"edge\">\n",
"<title>1->2</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M606.0625,-441.7677C591.6911,-432.1303 576.1968,-421.7401 561.4636,-411.8601\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"563.1458,-408.7741 552.891,-406.1115 559.2471,-414.5879 563.1458,-408.7741\"/>\n",
"</g>\n",
"<!-- 13 -->\n",
"<g id=\"node14\" class=\"node\">\n",
"<title>13</title>\n",
"<polygon fill=\"#eeab7b\" stroke=\"#000000\" points=\"738.2007,-406 590.933,-406 590.933,-328 738.2007,-328 738.2007,-406\"/>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-390.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">squawk_1 <= 4465.5</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-376.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.375</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-362.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 44</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-348.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [33, 11]</text>\n",
"<text text-anchor=\"middle\" x=\"664.5669\" y=\"-334.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 1->13 -->\n",
"<g id=\"edge13\" class=\"edge\">\n",
"<title>1->13</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M664.5669,-441.7677C664.5669,-433.6172 664.5669,-424.9283 664.5669,-416.4649\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"668.067,-416.3046 664.5669,-406.3046 661.067,-416.3047 668.067,-416.3046\"/>\n",
"</g>\n",
"<!-- 3 -->\n",
"<g id=\"node4\" class=\"node\">\n",
"<title>3</title>\n",
"<polygon fill=\"#e5813a\" stroke=\"#000000\" points=\"389.2007,-292 241.933,-292 241.933,-214 389.2007,-214 389.2007,-292\"/>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-276.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">speed2 <= 0.003</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-262.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.006</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-248.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 310</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-234.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [309, 1]</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-220.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 2->3 -->\n",
"<g id=\"edge3\" class=\"edge\">\n",
"<title>2->3</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M433.2157,-327.9272C417.8452,-318.1381 401.2415,-307.5637 385.4887,-297.5312\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"387.1528,-294.4415 376.8379,-292.0218 383.3925,-300.3458 387.1528,-294.4415\"/>\n",
"</g>\n",
"<!-- 8 -->\n",
"<g id=\"node9\" class=\"node\">\n",
"<title>8</title>\n",
"<polygon fill=\"#eb9f69\" stroke=\"#000000\" points=\"568.2007,-292 420.933,-292 420.933,-214 568.2007,-214 568.2007,-292\"/>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-276.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">altitude1 <= 0.122</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-262.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.313</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-248.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 36</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-234.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [29, 7]</text>\n",
"<text text-anchor=\"middle\" x=\"494.5669\" y=\"-220.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 2->8 -->\n",
"<g id=\"edge8\" class=\"edge\">\n",
"<title>2->8</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M494.5669,-327.7677C494.5669,-319.6172 494.5669,-310.9283 494.5669,-302.4649\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"498.067,-302.3046 494.5669,-292.3046 491.067,-302.3047 498.067,-302.3046\"/>\n",
"</g>\n",
"<!-- 4 -->\n",
"<g id=\"node5\" class=\"node\">\n",
"<title>4</title>\n",
"<polygon fill=\"#f2c09c\" stroke=\"#000000\" points=\"224.2007,-178 76.933,-178 76.933,-100 224.2007,-100 224.2007,-178\"/>\n",
"<text text-anchor=\"middle\" x=\"150.5669\" y=\"-162.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">boxes4 <= 0.379</text>\n",
"<text text-anchor=\"middle\" x=\"150.5669\" y=\"-148.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.444</text>\n",
"<text text-anchor=\"middle\" x=\"150.5669\" y=\"-134.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 3</text>\n",
"<text text-anchor=\"middle\" x=\"150.5669\" y=\"-120.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [2, 1]</text>\n",
"<text text-anchor=\"middle\" x=\"150.5669\" y=\"-106.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 3->4 -->\n",
"<g id=\"edge4\" class=\"edge\">\n",
"<title>3->4</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M258.7833,-213.7677C244.8345,-204.1303 229.796,-193.7401 215.496,-183.8601\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"217.3924,-180.9163 207.1756,-178.1115 213.4134,-186.6754 217.3924,-180.9163\"/>\n",
"</g>\n",
"<!-- 7 -->\n",
"<g id=\"node8\" class=\"node\">\n",
"<title>7</title>\n",
"<polygon fill=\"#e58139\" stroke=\"#000000\" points=\"389.2007,-171 241.933,-171 241.933,-107 389.2007,-107 389.2007,-171\"/>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-155.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-141.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 307</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-127.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [307, 0]</text>\n",
"<text text-anchor=\"middle\" x=\"315.5669\" y=\"-113.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 3->7 -->\n",
"<g id=\"edge7\" class=\"edge\">\n",
"<title>3->7</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M315.5669,-213.7677C315.5669,-203.3338 315.5669,-192.0174 315.5669,-181.4215\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"319.067,-181.1252 315.5669,-171.1252 312.067,-181.1252 319.067,-181.1252\"/>\n",
"</g>\n",
"<!-- 5 -->\n",
"<g id=\"node6\" class=\"node\">\n",
"<title>5</title>\n",
"<polygon fill=\"#e58139\" stroke=\"#000000\" points=\"147.2007,-64 -.067,-64 -.067,0 147.2007,0 147.2007,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"73.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"73.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 2</text>\n",
"<text text-anchor=\"middle\" x=\"73.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [2, 0]</text>\n",
"<text text-anchor=\"middle\" x=\"73.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 4->5 -->\n",
"<g id=\"edge5\" class=\"edge\">\n",
"<title>4->5</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M122.3322,-99.7647C115.957,-90.9057 109.1767,-81.4838 102.7629,-72.571\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"105.433,-70.2893 96.751,-64.2169 99.7512,-74.3781 105.433,-70.2893\"/>\n",
"</g>\n",
"<!-- 6 -->\n",
"<g id=\"node7\" class=\"node\">\n",
"<title>6</title>\n",
"<polygon fill=\"#399de5\" stroke=\"#000000\" points=\"290.3113,-64 164.8225,-64 164.8225,0 290.3113,0 290.3113,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"227.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"227.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 1</text>\n",
"<text text-anchor=\"middle\" x=\"227.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0, 1]</text>\n",
"<text text-anchor=\"middle\" x=\"227.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 4->6 -->\n",
"<g id=\"edge6\" class=\"edge\">\n",
"<title>4->6</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M178.8016,-99.7647C185.1768,-90.9057 191.9571,-81.4838 198.3709,-72.571\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"201.3826,-74.3781 204.3828,-64.2169 195.7008,-70.2893 201.3826,-74.3781\"/>\n",
"</g>\n",
"<!-- 9 -->\n",
"<g id=\"node10\" class=\"node\">\n",
"<title>9</title>\n",
"<polygon fill=\"#f2c09c\" stroke=\"#000000\" points=\"554.2007,-178 406.933,-178 406.933,-100 554.2007,-100 554.2007,-178\"/>\n",
"<text text-anchor=\"middle\" x=\"480.5669\" y=\"-162.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">altitude1 <= 0.046</text>\n",
"<text text-anchor=\"middle\" x=\"480.5669\" y=\"-148.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.444</text>\n",
"<text text-anchor=\"middle\" x=\"480.5669\" y=\"-134.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 21</text>\n",
"<text text-anchor=\"middle\" x=\"480.5669\" y=\"-120.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [14, 7]</text>\n",
"<text text-anchor=\"middle\" x=\"480.5669\" y=\"-106.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 8->9 -->\n",
"<g id=\"edge9\" class=\"edge\">\n",
"<title>8->9</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M489.7489,-213.7677C488.748,-205.6172 487.6809,-196.9283 486.6415,-188.4649\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"490.0867,-187.8034 485.3938,-178.3046 483.1389,-188.6567 490.0867,-187.8034\"/>\n",
"</g>\n",
"<!-- 12 -->\n",
"<g id=\"node13\" class=\"node\">\n",
"<title>12</title>\n",
"<polygon fill=\"#e58139\" stroke=\"#000000\" points=\"719.2007,-171 571.933,-171 571.933,-107 719.2007,-107 719.2007,-171\"/>\n",
"<text text-anchor=\"middle\" x=\"645.5669\" y=\"-155.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"645.5669\" y=\"-141.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 15</text>\n",
"<text text-anchor=\"middle\" x=\"645.5669\" y=\"-127.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [15, 0]</text>\n",
"<text text-anchor=\"middle\" x=\"645.5669\" y=\"-113.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 8->12 -->\n",
"<g id=\"edge12\" class=\"edge\">\n",
"<title>8->12</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M546.5325,-213.7677C562.2243,-201.9209 579.4231,-188.9364 595.0209,-177.1606\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"597.143,-179.9439 603.0151,-171.1252 592.9253,-174.3572 597.143,-179.9439\"/>\n",
"</g>\n",
"<!-- 10 -->\n",
"<g id=\"node11\" class=\"node\">\n",
"<title>10</title>\n",
"<polygon fill=\"#e99457\" stroke=\"#000000\" points=\"505.2007,-64 357.933,-64 357.933,0 505.2007,0 505.2007,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"431.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.231</text>\n",
"<text text-anchor=\"middle\" x=\"431.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 15</text>\n",
"<text text-anchor=\"middle\" x=\"431.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [13, 2]</text>\n",
"<text text-anchor=\"middle\" x=\"431.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 9->10 -->\n",
"<g id=\"edge10\" class=\"edge\">\n",
"<title>9->10</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M462.5993,-99.7647C458.6679,-91.1797 454.4943,-82.066 450.5255,-73.3994\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"453.6663,-71.8516 446.3204,-64.2169 447.3019,-74.7662 453.6663,-71.8516\"/>\n",
"</g>\n",
"<!-- 11 -->\n",
"<g id=\"node12\" class=\"node\">\n",
"<title>11</title>\n",
"<polygon fill=\"#61b1ea\" stroke=\"#000000\" points=\"648.3113,-64 522.8225,-64 522.8225,0 648.3113,0 648.3113,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"585.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.278</text>\n",
"<text text-anchor=\"middle\" x=\"585.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 6</text>\n",
"<text text-anchor=\"middle\" x=\"585.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1, 5]</text>\n",
"<text text-anchor=\"middle\" x=\"585.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 9->11 -->\n",
"<g id=\"edge11\" class=\"edge\">\n",
"<title>9->11</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M519.0688,-99.7647C528.1207,-90.5404 537.7715,-80.7057 546.8336,-71.4711\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"549.4462,-73.8058 553.9522,-64.2169 544.45,-68.9029 549.4462,-73.8058\"/>\n",
"</g>\n",
"<!-- 14 -->\n",
"<g id=\"node15\" class=\"node\">\n",
"<title>14</title>\n",
"<polygon fill=\"#399de5\" stroke=\"#000000\" points=\"719.3113,-285 593.8225,-285 593.8225,-221 719.3113,-221 719.3113,-285\"/>\n",
"<text text-anchor=\"middle\" x=\"656.5669\" y=\"-269.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"656.5669\" y=\"-255.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 9</text>\n",
"<text text-anchor=\"middle\" x=\"656.5669\" y=\"-241.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0, 9]</text>\n",
"<text text-anchor=\"middle\" x=\"656.5669\" y=\"-227.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 13->14 -->\n",
"<g id=\"edge14\" class=\"edge\">\n",
"<title>13->14</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M661.8137,-327.7677C661.0815,-317.3338 660.2874,-306.0174 659.5438,-295.4215\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"663.0128,-294.8556 658.8213,-285.1252 656.03,-295.3457 663.0128,-294.8556\"/>\n",
"</g>\n",
"<!-- 15 -->\n",
"<g id=\"node16\" class=\"node\">\n",
"<title>15</title>\n",
"<polygon fill=\"#e78945\" stroke=\"#000000\" points=\"884.2007,-292 736.933,-292 736.933,-214 884.2007,-214 884.2007,-292\"/>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-276.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">steer1 <= 0.027</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-262.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.108</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-248.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 35</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-234.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [33, 2]</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-220.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 13->15 -->\n",
"<g id=\"edge15\" class=\"edge\">\n",
"<title>13->15</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M714.8118,-327.7677C726.856,-318.3633 739.8184,-308.242 752.1957,-298.5775\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"754.5015,-301.2177 760.2294,-292.3046 750.1934,-295.7004 754.5015,-301.2177\"/>\n",
"</g>\n",
"<!-- 16 -->\n",
"<g id=\"node17\" class=\"node\">\n",
"<title>16</title>\n",
"<polygon fill=\"#e58139\" stroke=\"#000000\" points=\"884.2007,-171 736.933,-171 736.933,-107 884.2007,-107 884.2007,-171\"/>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-155.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-141.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 30</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-127.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [30, 0]</text>\n",
"<text text-anchor=\"middle\" x=\"810.5669\" y=\"-113.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 15->16 -->\n",
"<g id=\"edge16\" class=\"edge\">\n",
"<title>15->16</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M810.5669,-213.7677C810.5669,-203.3338 810.5669,-192.0174 810.5669,-181.4215\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"814.067,-181.1252 810.5669,-171.1252 807.067,-181.1252 814.067,-181.1252\"/>\n",
"</g>\n",
"<!-- 17 -->\n",
"<g id=\"node18\" class=\"node\">\n",
"<title>17</title>\n",
"<polygon fill=\"#f6d5bd\" stroke=\"#000000\" points=\"1049.2007,-178 901.933,-178 901.933,-100 1049.2007,-100 1049.2007,-178\"/>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-162.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">boxes3 <= 0.113</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-148.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.48</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-134.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 5</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-120.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [3, 2]</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-106.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 15->17 -->\n",
"<g id=\"edge17\" class=\"edge\">\n",
"<title>15->17</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M867.3505,-213.7677C881.2993,-204.1303 896.3378,-193.7401 910.6378,-183.8601\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"912.7204,-186.6754 918.9582,-178.1115 908.7414,-180.9163 912.7204,-186.6754\"/>\n",
"</g>\n",
"<!-- 18 -->\n",
"<g id=\"node19\" class=\"node\">\n",
"<title>18</title>\n",
"<polygon fill=\"#e58139\" stroke=\"#000000\" points=\"933.2007,-64 785.933,-64 785.933,0 933.2007,0 933.2007,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"859.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"859.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 3</text>\n",
"<text text-anchor=\"middle\" x=\"859.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [3, 0]</text>\n",
"<text text-anchor=\"middle\" x=\"859.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 17->18 -->\n",
"<g id=\"edge18\" class=\"edge\">\n",
"<title>17->18</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M933.0315,-99.7647C922.9322,-90.4491 912.1582,-80.5109 902.0608,-71.197\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"904.2172,-68.4244 894.4936,-64.2169 899.471,-73.5697 904.2172,-68.4244\"/>\n",
"</g>\n",
"<!-- 19 -->\n",
"<g id=\"node20\" class=\"node\">\n",
"<title>19</title>\n",
"<polygon fill=\"#399de5\" stroke=\"#000000\" points=\"1076.3113,-64 950.8225,-64 950.8225,0 1076.3113,0 1076.3113,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"1013.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"1013.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 2</text>\n",
"<text text-anchor=\"middle\" x=\"1013.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0, 2]</text>\n",
"<text text-anchor=\"middle\" x=\"1013.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 17->19 -->\n",
"<g id=\"edge19\" class=\"edge\">\n",
"<title>17->19</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M989.5009,-99.7647C992.5174,-91.271 995.7176,-82.2599 998.766,-73.6762\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1002.0769,-74.8116 1002.1254,-64.2169 995.4805,-72.4689 1002.0769,-74.8116\"/>\n",
"</g>\n",
"<!-- 21 -->\n",
"<g id=\"node22\" class=\"node\">\n",
"<title>21</title>\n",
"<polygon fill=\"#41a1e6\" stroke=\"#000000\" points=\"1038.3113,-406 912.8225,-406 912.8225,-328 1038.3113,-328 1038.3113,-406\"/>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-390.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">speed2 <= 0.013</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-376.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.071</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-362.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 54</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-348.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [2, 52]</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-334.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 20->21 -->\n",
"<g id=\"edge21\" class=\"edge\">\n",
"<title>20->21</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M975.5669,-441.7677C975.5669,-433.6172 975.5669,-424.9283 975.5669,-416.4649\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"979.067,-416.3046 975.5669,-406.3046 972.067,-416.3047 979.067,-416.3046\"/>\n",
"</g>\n",
"<!-- 28 -->\n",
"<g id=\"node29\" class=\"node\">\n",
"<title>28</title>\n",
"<polygon fill=\"#e58139\" stroke=\"#000000\" points=\"1203.2007,-399 1055.933,-399 1055.933,-335 1203.2007,-335 1203.2007,-399\"/>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-383.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-369.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 3</text>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-355.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [3, 0]</text>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-341.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 20->28 -->\n",
"<g id=\"edge28\" class=\"edge\">\n",
"<title>20->28</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1028.565,-441.7677C1044.5685,-429.9209 1062.109,-416.9364 1078.0166,-405.1606\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1080.2147,-407.8881 1086.1697,-399.1252 1076.0498,-402.2619 1080.2147,-407.8881\"/>\n",
"</g>\n",
"<!-- 22 -->\n",
"<g id=\"node23\" class=\"node\">\n",
"<title>22</title>\n",
"<polygon fill=\"#e58139\" stroke=\"#000000\" points=\"1049.2007,-285 901.933,-285 901.933,-221 1049.2007,-221 1049.2007,-285\"/>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-269.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-255.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 1</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-241.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1, 0]</text>\n",
"<text text-anchor=\"middle\" x=\"975.5669\" y=\"-227.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 21->22 -->\n",
"<g id=\"edge22\" class=\"edge\">\n",
"<title>21->22</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M975.5669,-327.7677C975.5669,-317.3338 975.5669,-306.0174 975.5669,-295.4215\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"979.067,-295.1252 975.5669,-285.1252 972.067,-295.1252 979.067,-295.1252\"/>\n",
"</g>\n",
"<!-- 23 -->\n",
"<g id=\"node24\" class=\"node\">\n",
"<title>23</title>\n",
"<polygon fill=\"#3d9fe6\" stroke=\"#000000\" points=\"1192.3113,-292 1066.8225,-292 1066.8225,-214 1192.3113,-214 1192.3113,-292\"/>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-276.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">altitude5 <= 0.261</text>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-262.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.037</text>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-248.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 53</text>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-234.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1, 52]</text>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-220.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 21->23 -->\n",
"<g id=\"edge23\" class=\"edge\">\n",
"<title>21->23</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1028.565,-327.7677C1041.4621,-318.2204 1055.3575,-307.9342 1068.592,-298.1373\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1070.7771,-300.8744 1076.7321,-292.1115 1066.6122,-295.2482 1070.7771,-300.8744\"/>\n",
"</g>\n",
"<!-- 24 -->\n",
"<g id=\"node25\" class=\"node\">\n",
"<title>24</title>\n",
"<polygon fill=\"#399de5\" stroke=\"#000000\" points=\"1192.3113,-171 1066.8225,-171 1066.8225,-107 1192.3113,-107 1192.3113,-171\"/>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-155.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-141.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 49</text>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-127.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0, 49]</text>\n",
"<text text-anchor=\"middle\" x=\"1129.5669\" y=\"-113.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 23->24 -->\n",
"<g id=\"edge24\" class=\"edge\">\n",
"<title>23->24</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1129.5669,-213.7677C1129.5669,-203.3338 1129.5669,-192.0174 1129.5669,-181.4215\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1133.067,-181.1252 1129.5669,-171.1252 1126.067,-181.1252 1133.067,-181.1252\"/>\n",
"</g>\n",
"<!-- 25 -->\n",
"<g id=\"node26\" class=\"node\">\n",
"<title>25</title>\n",
"<polygon fill=\"#7bbeee\" stroke=\"#000000\" points=\"1336.3113,-178 1210.8225,-178 1210.8225,-100 1336.3113,-100 1336.3113,-178\"/>\n",
"<text text-anchor=\"middle\" x=\"1273.5669\" y=\"-162.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">altitude2 <= 0.01</text>\n",
"<text text-anchor=\"middle\" x=\"1273.5669\" y=\"-148.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.375</text>\n",
"<text text-anchor=\"middle\" x=\"1273.5669\" y=\"-134.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 4</text>\n",
"<text text-anchor=\"middle\" x=\"1273.5669\" y=\"-120.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1, 3]</text>\n",
"<text text-anchor=\"middle\" x=\"1273.5669\" y=\"-106.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 23->25 -->\n",
"<g id=\"edge25\" class=\"edge\">\n",
"<title>23->25</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1179.1235,-213.7677C1191.0027,-204.3633 1203.7875,-194.242 1215.9953,-184.5775\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1218.2509,-187.2559 1223.9189,-178.3046 1213.906,-181.7675 1218.2509,-187.2559\"/>\n",
"</g>\n",
"<!-- 26 -->\n",
"<g id=\"node27\" class=\"node\">\n",
"<title>26</title>\n",
"<polygon fill=\"#399de5\" stroke=\"#000000\" points=\"1259.3113,-64 1133.8225,-64 1133.8225,0 1259.3113,0 1259.3113,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"1196.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"1196.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 3</text>\n",
"<text text-anchor=\"middle\" x=\"1196.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [0, 3]</text>\n",
"<text text-anchor=\"middle\" x=\"1196.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = surveillance</text>\n",
"</g>\n",
"<!-- 25->26 -->\n",
"<g id=\"edge26\" class=\"edge\">\n",
"<title>25->26</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1245.3322,-99.7647C1238.957,-90.9057 1232.1767,-81.4838 1225.7629,-72.571\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1228.433,-70.2893 1219.751,-64.2169 1222.7512,-74.3781 1228.433,-70.2893\"/>\n",
"</g>\n",
"<!-- 27 -->\n",
"<g id=\"node28\" class=\"node\">\n",
"<title>27</title>\n",
"<polygon fill=\"#e58139\" stroke=\"#000000\" points=\"1424.2007,-64 1276.933,-64 1276.933,0 1424.2007,0 1424.2007,-64\"/>\n",
"<text text-anchor=\"middle\" x=\"1350.5669\" y=\"-48.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">gini = 0.0</text>\n",
"<text text-anchor=\"middle\" x=\"1350.5669\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">samples = 1</text>\n",
"<text text-anchor=\"middle\" x=\"1350.5669\" y=\"-20.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">value = [1, 0]</text>\n",
"<text text-anchor=\"middle\" x=\"1350.5669\" y=\"-6.8\" font-family=\"Times,serif\" font-size=\"14.00\" fill=\"#000000\">class = not surveillance</text>\n",
"</g>\n",
"<!-- 25->27 -->\n",
"<g id=\"edge27\" class=\"edge\">\n",
"<title>25->27</title>\n",
"<path fill=\"none\" stroke=\"#000000\" d=\"M1301.8016,-99.7647C1308.1768,-90.9057 1314.9571,-81.4838 1321.3709,-72.571\"/>\n",
"<polygon fill=\"#000000\" stroke=\"#000000\" points=\"1324.3826,-74.3781 1327.3828,-64.2169 1318.7008,-70.2893 1324.3826,-74.3781\"/>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<graphviz.files.Source at 0x10b3f86a0>"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn import tree\n",
"import graphviz\n",
"\n",
"label_names = ['not surveillance', 'surveillance']\n",
"feature_names = X.columns\n",
"\n",
"dot_data = tree.export_graphviz(clf,\n",
" feature_names=feature_names, \n",
" filled=True,\n",
" class_names=label_names) \n",
"graph = graphviz.Source(dot_data) \n",
"graph"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# One more classifier: Random forest\n",
"\n",
"## Build and train your classifier\n",
"\n",
"We can build a random forest classifier like this:\n",
"\n",
"```python\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"clf = RandomForestClassifier()\n",
"```\n",
"\n",
"But you're in charge of fitting it to your training data!\n",
"\n",
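"* **Tip:** You can also set `max_depth` here, but you won't be able to visualize the whole forest as a single tree the way you could with the decision tree.\n",
"* **Tip:** Set `n_estimators` to 100 (older versions of scikit-learn default to 10). More trees voting together usually makes for a more stable classifier."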
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
" max_depth=5, max_features='auto', max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=100,\n",
" n_jobs=None, oob_score=False, random_state=None,\n",
" verbose=0, warm_start=False)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"clf = RandomForestClassifier(n_estimators=100, max_depth=5)\n",
"clf.fit(X_train, y_train)"
]
},
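{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setting `max_depth` does not stop you from looking inside the forest one tree at a time. The sketch below is an illustration, not part of the original analysis: it uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, and the `feature_0`-style column names are made up. It pulls one fitted tree out of `estimators_` and exports it with the same `export_graphviz` function used for the decision tree.\n",
"\n",
"```python\n",
"from sklearn import tree\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# Synthetic stand-in for the plane features (assumption, not the real data)\n",
"X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)\n",
"demo_names = [f'feature_{i}' for i in range(X_demo.shape[1])]\n",
"\n",
"forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)\n",
"forest.fit(X_demo, y_demo)\n",
"\n",
"# A fitted forest keeps its individual decision trees in estimators_\n",
"first_tree = forest.estimators_[0]\n",
"\n",
"# out_file=None makes export_graphviz return the DOT source as a string\n",
"dot_data = tree.export_graphviz(first_tree,\n",
"                                out_file=None,\n",
"                                feature_names=demo_names,\n",
"                                filled=True,\n",
"                                class_names=['not surveillance', 'surveillance'])\n",
"print(dot_data[:60])\n",
"```\n",
"\n",
"In a notebook, passing `dot_data` to `graphviz.Source` should render the tree. Any single tree is only one vote out of 100, so treat it as a peek inside the forest and not as a summary of it."
]
},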
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What are the important features?"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
" <style>\n",
" table.eli5-weights tr:hover {\n",
" filter: brightness(85%);\n",
" }\n",
"</style>\n",
"\n",
"\n",
"\n",
" \n",
"\n",
" \n",
" \n",
" <pre>\n",
"Random forest feature importances; values are numbers 0 <= x <= 1;\n",
"all values sum to 1.\n",
"</pre>\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
"\n",
" \n",
" <table class=\"eli5-weights eli5-feature-importances\" style=\"border-collapse: collapse; border: none; margin-top: 0em; table-layout: auto;\">\n",
" <thead>\n",
" <tr style=\"border: none;\">\n",
" <th style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">Weight</th>\n",
" <th style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">Feature</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 80.00%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.1812\n",
" \n",
" ± 0.5010\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer2\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 81.95%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.1566\n",
" \n",
" ± 0.4234\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 84.35%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.1277\n",
" \n",
" ± 0.3478\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" squawk_1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 87.42%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0935\n",
" \n",
" ± 0.3463\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer5\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.29%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0303\n",
" \n",
" ± 0.1279\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer6\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.42%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0293\n",
" \n",
" ± 0.1362\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" speed1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.66%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0274\n",
" \n",
" ± 0.1078\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer4\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.77%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0266\n",
" \n",
" ± 0.1001\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" altitude1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 94.81%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0264\n",
" \n",
" ± 0.1079\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" altitude3\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 95.32%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0228\n",
" \n",
" ± 0.0785\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 95.44%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0219\n",
" \n",
" ± 0.0915\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" duration5\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 95.74%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0199\n",
" \n",
" ± 0.0866\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" speed4\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 95.76%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0198\n",
" \n",
" ± 0.0819\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" duration4\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 96.16%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0172\n",
" \n",
" ± 0.0924\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes5\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 96.24%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0167\n",
" \n",
" ± 0.0502\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" duration1\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 96.37%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0158\n",
" \n",
" ± 0.0725\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" boxes2\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 96.55%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0147\n",
" \n",
" ± 0.0670\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" type_code\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 96.63%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0143\n",
" \n",
" ± 0.0502\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" speed2\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 96.67%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0140\n",
" \n",
" ± 0.0632\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" observations\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 96.77%); border: none;\">\n",
" <td style=\"padding: 0 1em 0 0.5em; text-align: right; border: none;\">\n",
" 0.0134\n",
" \n",
" ± 0.0595\n",
" \n",
" </td>\n",
" <td style=\"padding: 0 0.5em 0 0.5em; text-align: left; border: none;\">\n",
" steer8\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" \n",
" <tr style=\"background-color: hsl(120, 100.00%, 96.77%); border: none;\">\n",
" <td colspan=\"2\" style=\"padding: 0 0.5em 0 0.5em; text-align: center; border: none; white-space: nowrap;\">\n",
" <i>… 12 more …</i>\n",
" </td>\n",
" </tr>\n",
" \n",
" \n",
" </tbody>\n",
"</table>\n",
" \n",
"\n",
" \n",
"\n",
"\n",
"\n"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_names = list(X.columns)\n",
"eli5.show_weights(clf, feature_names=feature_names, show=['description', 'feature_importances'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Understanding the output\n",
"\n",
"What is a random forest, and **why is the feature importance different than for the decision tree?** Isn't a random forest just like a decision tree or something?"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# It's a lot of decision trees working together, each built from a random sample of rows and features, so it even ends up using less useful features"
]
},
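{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the difference for yourself, here is a minimal sketch that fits one decision tree and one random forest on the same data and counts how many features each one gives any importance to. It uses synthetic data from `make_classification` as a stand-in (an assumption, it is not the planes dataset), and the `_demo` names are made up for this example.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# Synthetic stand-in data, NOT the planes dataset\n",
"X_demo, y_demo = make_classification(n_samples=500, n_features=8,\n",
"                                     n_informative=3, random_state=0)\n",
"\n",
"tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_demo, y_demo)\n",
"forest = RandomForestClassifier(n_estimators=100, max_depth=5,\n",
"                                random_state=0).fit(X_demo, y_demo)\n",
"\n",
"# Count the features each model assigns a non-zero importance\n",
"print('tree uses  ', (tree.feature_importances_ > 0).sum(), 'features')\n",
"print('forest uses', (forest.feature_importances_ > 0).sum(), 'features')\n",
"```\n",
"\n",
"A single tree only scores the features it happened to split on, while the forest averages importance across all of its trees, so the importance usually ends up spread over more features."
]
},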
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How well does it perform?"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Predicted not surveil</th>\n",
" <th>Predicted surveil</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Is not surveil</th>\n",
" <td>124</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Is surveil</th>\n",
" <td>5</td>\n",
" <td>21</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Predicted not surveil Predicted surveil\n",
"Is not surveil 124 0\n",
"Is surveil 5 21"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"\n",
"y_true = y_test\n",
"y_pred = clf.predict(X_test)\n",
"matrix = confusion_matrix(y_true, y_pred)\n",
"\n",
"label_names = pd.Series(['not surveil', 'surveil'])\n",
"pd.DataFrame(matrix,\n",
" columns='Predicted ' + label_names,\n",
" index='Is ' + label_names)"
]
},
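{
"cell_type": "markdown",
"metadata": {},
"source": [
"A confusion matrix is easier to judge once you boil it down to a few rates. The sketch below is plain arithmetic on the counts copied from the output above (124, 0, 5, 21). If you re-run the notebook your counts may differ, since the classifier was created without a fixed `random_state`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Counts copied from the confusion matrix output above\n",
"matrix = np.array([[124, 0],\n",
"                   [5, 21]])\n",
"tn, fp, fn, tp = matrix.ravel()\n",
"\n",
"accuracy = (tp + tn) / matrix.sum()   # 145 / 150\n",
"precision = tp / (tp + fp)            # 21 / 21\n",
"recall = tp / (tp + fn)               # 21 / 26\n",
"\n",
"print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}')\n",
"```\n",
"\n",
"With these counts, every plane flagged as surveillance really was one (precision 21/21), but 5 of the 26 known surveillance planes were missed (recall 21/26, about 81%)."
]
},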
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How confident do you feel in the model?"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"# Very confident: no false positives, although it missed 5 of the 26 surveillance planes in the test set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Actually finding spy planes\n",
"\n",
"Now let's try to actually find our spy planes.\n",
"\n",
"## Retrain our model\n",
"\n",
"When we did the test/train split, we trained our model on only a subset of our data so we could test with the rest. Now that we're working in the \"real world\" we want to re-train it using not just the `_train` data, but **everything we have labels for.**"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
" max_depth=5, max_features='auto', max_leaf_nodes=None,\n",
" min_impurity_decrease=0.0, min_impurity_split=None,\n",
" min_samples_leaf=1, min_samples_split=2,\n",
" min_weight_fraction_leaf=0.0, n_estimators=100,\n",
" n_jobs=None, oob_score=False, random_state=None,\n",
" verbose=0, warm_start=False)"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filter for planes we want to predict\n",
"\n",
"We have a dataframe of features that includes three types of planes:\n",
"\n",
"* Those that are labeled as surveillance planes\n",
"* Those that are labeled as not surveillance\n",
"* Those that aren't labeled\n",
"\n",
"Which do we want predictions for? **Filter a new dataframe that's just those.**\n",
"\n",
"* **Tip:** Scroll up to see where you created your `train_df`, it's the opposite!"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"real_df = df[df.label.isna()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many planes do you have in that list? **Confirm it's about 19,200.**"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(19202, 35)"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"real_df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predicting \n",
"\n",
"Build your `X` - remember you need to drop a few columns - and use that to make a prediction for each plane.\n",
"\n",
"**Assign the prediction into the `predicted` column**.\n",
"\n",
"* **Tip:** Scroll up to see where you created your features for training, it's similar\n",
"* **Tip:** pandas will yell at us about setting values on copies of a slice but it's fine"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"X = real_df.drop(columns=['label', 'adshex', 'type'])\n",
"real_df['predicted'] = clf.predict(X)"
]
},
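{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you'd rather not have pandas yell at you, one common approach is to call `.copy()` when you filter, so the new dataframe is independent of the original. This is a minimal sketch on a tiny made-up dataframe (`demo` and `unlabeled` are hypothetical names, not part of the notebook above).\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Tiny made-up frame standing in for df\n",
"demo = pd.DataFrame({'adshex': ['A1', 'A2', 'A3'],\n",
"                     'label': [1.0, np.nan, np.nan]})\n",
"\n",
"# .copy() makes the filtered frame its own object, not a slice of demo\n",
"unlabeled = demo[demo.label.isna()].copy()\n",
"\n",
"# Assigning a new column now happens on the copy only\n",
"unlabeled['predicted'] = 0\n",
"\n",
"print(unlabeled)\n",
"```\n",
"\n",
"The original `demo` is left untouched, which is usually what you want when the filtered rows are getting their own prediction columns."
]
},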
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How many planes did it predict to be surveillance planes?\n",
"\n",
"It should be roughly around 70-80 planes."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(70, 36)"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"real_df[real_df.predicted == 1].shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## But... what about those other ones? The ones that are just below the threshold?\n",
"\n",
"The cutoff for a prediction of `1` is 50%, but since we have a lot of time we're interested in investigating the top 200. To get the probability for each row, use `clf.predict_proba` instead of `clf.predict`. To get the predicted probability for the `1` category, add `[:,1]` to the end of the call:\n",
"\n",
"```python\n",
"clf.predict_proba(***your features***)[:,1]\n",
"```\n",
"\n",
"**Create a new column called `predicted_prob` that is the chance that the plane is a surveillance plane.**\n",
"\n",
"* **Tip:** You dropped three columns when using `clf.predict`, but if you drop the same three you'll get an error now. There's now an extra column that you'll need to drop! What is it?"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>adshex</th>\n",
" <th>label</th>\n",
" <th>duration1</th>\n",
" <th>duration2</th>\n",
" <th>duration3</th>\n",
" <th>duration4</th>\n",
" <th>duration5</th>\n",
" <th>boxes1</th>\n",
" <th>boxes2</th>\n",
" <th>boxes3</th>\n",
" <th>boxes4</th>\n",
" <th>boxes5</th>\n",
" <th>speed1</th>\n",
" <th>speed2</th>\n",
" <th>speed3</th>\n",
" <th>speed4</th>\n",
" <th>speed5</th>\n",
" <th>altitude1</th>\n",
" <th>altitude2</th>\n",
" <th>altitude3</th>\n",
" <th>altitude4</th>\n",
" <th>altitude5</th>\n",
" <th>steer1</th>\n",
" <th>steer2</th>\n",
" <th>steer3</th>\n",
" <th>steer4</th>\n",
" <th>steer5</th>\n",
" <th>steer6</th>\n",
" <th>steer7</th>\n",
" <th>steer8</th>\n",
" <th>flights</th>\n",
" <th>squawk_1</th>\n",
" <th>observations</th>\n",
" <th>type</th>\n",
" <th>type_code</th>\n",
" <th>predicted</th>\n",
" <th>predicted_prob</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>597</th>\n",
" <td>A</td>\n",
" <td>NaN</td>\n",
" <td>0.120253</td>\n",
" <td>0.075949</td>\n",
" <td>0.183544</td>\n",
" <td>0.335443</td>\n",
" <td>0.284810</td>\n",
" <td>0.088608</td>\n",
" <td>0.044304</td>\n",
" <td>0.069620</td>\n",
" <td>0.120253</td>\n",
" <td>0.677215</td>\n",
" <td>0.021824</td>\n",
" <td>0.020550</td>\n",
" <td>0.062330</td>\n",
" <td>0.100713</td>\n",
" <td>0.794582</td>\n",
" <td>0.042374</td>\n",
" <td>0.060971</td>\n",
" <td>0.066831</td>\n",
" <td>0.106403</td>\n",
" <td>0.723421</td>\n",
" <td>0.020211</td>\n",
" <td>0.048913</td>\n",
" <td>0.270550</td>\n",
" <td>0.344090</td>\n",
" <td>0.097317</td>\n",
" <td>0.186651</td>\n",
" <td>0.011379</td>\n",
" <td>0.009426</td>\n",
" <td>158</td>\n",
" <td>0</td>\n",
" <td>11776</td>\n",
" <td>GRND</td>\n",
" <td>248</td>\n",
" <td>0.0</td>\n",
" <td>0.003261</td>\n",
" </tr>\n",
" <tr>\n",
" <th>598</th>\n",
" <td>A00000</td>\n",
" <td>NaN</td>\n",
" <td>0.211735</td>\n",
" <td>0.155612</td>\n",
" <td>0.181122</td>\n",
" <td>0.198980</td>\n",
" <td>0.252551</td>\n",
" <td>0.204082</td>\n",
" <td>0.183673</td>\n",
" <td>0.168367</td>\n",
" <td>0.173469</td>\n",
" <td>0.267857</td>\n",
" <td>0.107348</td>\n",
" <td>0.143410</td>\n",
" <td>0.208139</td>\n",
" <td>0.177013</td>\n",
" <td>0.364090</td>\n",
" <td>0.177318</td>\n",
" <td>0.114457</td>\n",
" <td>0.129648</td>\n",
" <td>0.197694</td>\n",
" <td>0.380882</td>\n",
" <td>0.034976</td>\n",
" <td>0.048127</td>\n",
" <td>0.240732</td>\n",
" <td>0.356314</td>\n",
" <td>0.116116</td>\n",
" <td>0.159325</td>\n",
" <td>0.012828</td>\n",
" <td>0.013628</td>\n",
" <td>392</td>\n",
" <td>0</td>\n",
" <td>52465</td>\n",
" <td>TBM7</td>\n",
" <td>431</td>\n",
" <td>0.0</td>\n",
" <td>0.011371</td>\n",
" </tr>\n",
" <tr>\n",
" <th>599</th>\n",
" <td>A00008</td>\n",
" <td>NaN</td>\n",
" <td>0.125000</td>\n",
" <td>0.041667</td>\n",
" <td>0.208333</td>\n",
" <td>0.166667</td>\n",
" <td>0.458333</td>\n",
" <td>0.125000</td>\n",
" <td>0.083333</td>\n",
" <td>0.125000</td>\n",
" <td>0.166667</td>\n",
" <td>0.500000</td>\n",
" <td>0.187960</td>\n",
" <td>0.278952</td>\n",
" <td>0.221048</td>\n",
" <td>0.190257</td>\n",
" <td>0.121783</td>\n",
" <td>0.014706</td>\n",
" <td>0.053309</td>\n",
" <td>0.149816</td>\n",
" <td>0.279871</td>\n",
" <td>0.502298</td>\n",
" <td>0.029871</td>\n",
" <td>0.044118</td>\n",
" <td>0.202665</td>\n",
" <td>0.380515</td>\n",
" <td>0.094669</td>\n",
" <td>0.182904</td>\n",
" <td>0.014706</td>\n",
" <td>0.020221</td>\n",
" <td>24</td>\n",
" <td>0</td>\n",
" <td>2176</td>\n",
" <td>PA46</td>\n",
" <td>350</td>\n",
" <td>0.0</td>\n",
" <td>0.008143</td>\n",
" </tr>\n",
" <tr>\n",
" <th>600</th>\n",
" <td>A0001E</td>\n",
" <td>NaN</td>\n",
" <td>0.100000</td>\n",
" <td>0.200000</td>\n",
" <td>0.200000</td>\n",
" <td>0.400000</td>\n",
" <td>0.100000</td>\n",
" <td>0.100000</td>\n",
" <td>0.000000</td>\n",
" <td>0.100000</td>\n",
" <td>0.400000</td>\n",
" <td>0.400000</td>\n",
" <td>0.007937</td>\n",
" <td>0.026984</td>\n",
" <td>0.084127</td>\n",
" <td>0.179365</td>\n",
" <td>0.701587</td>\n",
" <td>0.041270</td>\n",
" <td>0.085714</td>\n",
" <td>0.039683</td>\n",
" <td>0.111111</td>\n",
" <td>0.722222</td>\n",
" <td>0.019048</td>\n",
" <td>0.049206</td>\n",
" <td>0.249206</td>\n",
" <td>0.326984</td>\n",
" <td>0.112698</td>\n",
" <td>0.206349</td>\n",
" <td>0.012698</td>\n",
" <td>0.011111</td>\n",
" <td>10</td>\n",
" <td>1135</td>\n",
" <td>630</td>\n",
" <td>C56X</td>\n",
" <td>126</td>\n",
" <td>0.0</td>\n",
" <td>0.010685</td>\n",
" </tr>\n",
" <tr>\n",
" <th>601</th>\n",
" <td>A0002B</td>\n",
" <td>NaN</td>\n",
" <td>0.166667</td>\n",
" <td>0.166667</td>\n",
" <td>0.000000</td>\n",
" <td>0.666667</td>\n",
" <td>0.000000</td>\n",
" <td>0.333333</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.666667</td>\n",
" <td>0.000000</td>\n",
" <td>0.767405</td>\n",
" <td>0.191456</td>\n",
" <td>0.023734</td>\n",
" <td>0.017405</td>\n",
" <td>0.000000</td>\n",
" <td>0.150316</td>\n",
" <td>0.113924</td>\n",
" <td>0.178797</td>\n",
" <td>0.534810</td>\n",
" <td>0.022152</td>\n",
" <td>0.001582</td>\n",
" <td>0.009494</td>\n",
" <td>0.281646</td>\n",
" <td>0.416139</td>\n",
" <td>0.112342</td>\n",
" <td>0.169304</td>\n",
" <td>0.001582</td>\n",
" <td>0.001582</td>\n",
" <td>6</td>\n",
" <td>2356</td>\n",
" <td>632</td>\n",
" <td>C82S</td>\n",
" <td>133</td>\n",
" <td>0.0</td>\n",
" <td>0.049944</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" adshex label duration1 duration2 duration3 duration4 duration5 \\\n",
"597 A NaN 0.120253 0.075949 0.183544 0.335443 0.284810 \n",
"598 A00000 NaN 0.211735 0.155612 0.181122 0.198980 0.252551 \n",
"599 A00008 NaN 0.125000 0.041667 0.208333 0.166667 0.458333 \n",
"600 A0001E NaN 0.100000 0.200000 0.200000 0.400000 0.100000 \n",
"601 A0002B NaN 0.166667 0.166667 0.000000 0.666667 0.000000 \n",
"\n",
" boxes1 boxes2 boxes3 boxes4 boxes5 speed1 speed2 \\\n",
"597 0.088608 0.044304 0.069620 0.120253 0.677215 0.021824 0.020550 \n",
"598 0.204082 0.183673 0.168367 0.173469 0.267857 0.107348 0.143410 \n",
"599 0.125000 0.083333 0.125000 0.166667 0.500000 0.187960 0.278952 \n",
"600 0.100000 0.000000 0.100000 0.400000 0.400000 0.007937 0.026984 \n",
"601 0.333333 0.000000 0.000000 0.666667 0.000000 0.767405 0.191456 \n",
"\n",
" speed3 speed4 speed5 altitude1 altitude2 altitude3 altitude4 \\\n",
"597 0.062330 0.100713 0.794582 0.042374 0.060971 0.066831 0.106403 \n",
"598 0.208139 0.177013 0.364090 0.177318 0.114457 0.129648 0.197694 \n",
"599 0.221048 0.190257 0.121783 0.014706 0.053309 0.149816 0.279871 \n",
"600 0.084127 0.179365 0.701587 0.041270 0.085714 0.039683 0.111111 \n",
"601 0.023734 0.017405 0.000000 0.150316 0.113924 0.178797 0.534810 \n",
"\n",
" altitude5 steer1 steer2 steer3 steer4 steer5 steer6 \\\n",
"597 0.723421 0.020211 0.048913 0.270550 0.344090 0.097317 0.186651 \n",
"598 0.380882 0.034976 0.048127 0.240732 0.356314 0.116116 0.159325 \n",
"599 0.502298 0.029871 0.044118 0.202665 0.380515 0.094669 0.182904 \n",
"600 0.722222 0.019048 0.049206 0.249206 0.326984 0.112698 0.206349 \n",
"601 0.022152 0.001582 0.009494 0.281646 0.416139 0.112342 0.169304 \n",
"\n",
" steer7 steer8 flights squawk_1 observations type type_code \\\n",
"597 0.011379 0.009426 158 0 11776 GRND 248 \n",
"598 0.012828 0.013628 392 0 52465 TBM7 431 \n",
"599 0.014706 0.020221 24 0 2176 PA46 350 \n",
"600 0.012698 0.011111 10 1135 630 C56X 126 \n",
"601 0.001582 0.001582 6 2356 632 C82S 133 \n",
"\n",
" predicted predicted_prob \n",
"597 0.0 0.003261 \n",
"598 0.0 0.011371 \n",
"599 0.0 0.008143 \n",
"600 0.0 0.010685 \n",
"601 0.0 0.049944 "
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Predict the probability it's in the class represented by '1'\n",
"real_df['predicted_prob'] = clf.predict_proba(real_df.drop(columns=['label', 'adshex', 'type', 'predicted']))[:,1]\n",
"real_df.head()"
]
},
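{
"cell_type": "markdown",
"metadata": {},
"source": [
"Why `[:,1]`? `predict_proba` returns one column per class, in the order given by the classifier's `classes_` attribute. The sketch below shows that on synthetic data (an assumption, it is not the planes dataset, and `model` and the `_demo` names are made up), and also shows how lowering the cutoff changes how many rows get flagged.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# Synthetic stand-in data, NOT the planes dataset\n",
"X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)\n",
"model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)\n",
"\n",
"# Column order of predict_proba matches classes_\n",
"print(model.classes_)\n",
"\n",
"# One row per sample, one column per class, each row sums to 1\n",
"probs = model.predict_proba(X_demo)\n",
"print(probs[:3])\n",
"\n",
"# A lower cutoff flags at least as many rows as a higher one\n",
"print((probs[:, 1] > 0.5).sum(), (probs[:, 1] > 0.3).sum())\n",
"```\n",
"\n",
"Sorting by the probability column, like we do next, sidesteps the cutoff question entirely: you just work down the list for as long as you have time."
]
},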
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get the top 200 predictions\n",
"\n",
"Take a look at what the probabilities look like, showing the top 200 planes that are **most likely to be surveillance planes.**\n",
"\n",
"Then save them to a file for later research."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>adshex</th>\n",
" <th>label</th>\n",
" <th>duration1</th>\n",
" <th>duration2</th>\n",
" <th>duration3</th>\n",
" <th>duration4</th>\n",
" <th>duration5</th>\n",
" <th>boxes1</th>\n",
" <th>boxes2</th>\n",
" <th>boxes3</th>\n",
" <th>boxes4</th>\n",
" <th>boxes5</th>\n",
" <th>speed1</th>\n",
" <th>speed2</th>\n",
" <th>speed3</th>\n",
" <th>speed4</th>\n",
" <th>speed5</th>\n",
" <th>altitude1</th>\n",
" <th>altitude2</th>\n",
" <th>altitude3</th>\n",
" <th>altitude4</th>\n",
" <th>altitude5</th>\n",
" <th>steer1</th>\n",
" <th>steer2</th>\n",
" <th>steer3</th>\n",
" <th>steer4</th>\n",
" <th>steer5</th>\n",
" <th>steer6</th>\n",
" <th>steer7</th>\n",
" <th>steer8</th>\n",
" <th>flights</th>\n",
" <th>squawk_1</th>\n",
" <th>observations</th>\n",
" <th>type</th>\n",
" <th>type_code</th>\n",
" <th>predicted</th>\n",
" <th>predicted_prob</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>12275</th>\n",
" <td>A7D925</td>\n",
" <td>NaN</td>\n",
" <td>0.121212</td>\n",
" <td>0.141414</td>\n",
" <td>0.070707</td>\n",
" <td>0.070707</td>\n",
" <td>0.595960</td>\n",
" <td>0.212121</td>\n",
" <td>0.515152</td>\n",
" <td>0.242424</td>\n",
" <td>0.030303</td>\n",
" <td>0.000000</td>\n",
" <td>0.271168</td>\n",
" <td>0.494554</td>\n",
" <td>0.212671</td>\n",
" <td>0.016859</td>\n",
" <td>0.004747</td>\n",
" <td>0.018678</td>\n",
" <td>0.065840</td>\n",
" <td>0.345793</td>\n",
" <td>0.568557</td>\n",
" <td>0.001131</td>\n",
" <td>0.166840</td>\n",
" <td>0.315047</td>\n",
" <td>0.301537</td>\n",
" <td>0.096653</td>\n",
" <td>0.015661</td>\n",
" <td>0.047095</td>\n",
" <td>0.004015</td>\n",
" <td>0.009250</td>\n",
" <td>99</td>\n",
" <td>230</td>\n",
" <td>45079</td>\n",
" <td>T206</td>\n",
" <td>417</td>\n",
" <td>1.0</td>\n",
" <td>0.919753</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2828</th>\n",
" <td>A144AF</td>\n",
" <td>NaN</td>\n",
" <td>0.328358</td>\n",
" <td>0.134328</td>\n",
" <td>0.074627</td>\n",
" <td>0.029851</td>\n",
" <td>0.432836</td>\n",
" <td>0.492537</td>\n",
" <td>0.328358</td>\n",
" <td>0.164179</td>\n",
" <td>0.000000</td>\n",
" <td>0.014925</td>\n",
" <td>0.134059</td>\n",
" <td>0.274446</td>\n",
" <td>0.197484</td>\n",
" <td>0.148554</td>\n",
" <td>0.245457</td>\n",
" <td>0.001251</td>\n",
" <td>0.005371</td>\n",
" <td>0.008167</td>\n",
" <td>0.053271</td>\n",
" <td>0.931940</td>\n",
" <td>0.152969</td>\n",
" <td>0.248841</td>\n",
" <td>0.266132</td>\n",
" <td>0.175116</td>\n",
" <td>0.010448</td>\n",
" <td>0.064013</td>\n",
" <td>0.014495</td>\n",
" <td>0.018247</td>\n",
" <td>67</td>\n",
" <td>5103</td>\n",
" <td>13591</td>\n",
" <td>unknown</td>\n",
" <td>454</td>\n",
" <td>1.0</td>\n",
" <td>0.896639</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2720</th>\n",
" <td>A13098</td>\n",
" <td>NaN</td>\n",
" <td>0.166667</td>\n",
" <td>0.166667</td>\n",
" <td>0.166667</td>\n",
" <td>0.083333</td>\n",
" <td>0.416667</td>\n",
" <td>0.250000</td>\n",
" <td>0.583333</td>\n",
" <td>0.166667</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.866572</td>\n",
" <td>0.071664</td>\n",
" <td>0.035361</td>\n",
" <td>0.020745</td>\n",
" <td>0.005658</td>\n",
" <td>0.053748</td>\n",
" <td>0.123055</td>\n",
" <td>0.665724</td>\n",
" <td>0.157473</td>\n",
" <td>0.000000</td>\n",
" <td>0.151344</td>\n",
" <td>0.176803</td>\n",
" <td>0.181047</td>\n",
" <td>0.300802</td>\n",
" <td>0.019331</td>\n",
" <td>0.085809</td>\n",
" <td>0.010372</td>\n",
" <td>0.028289</td>\n",
" <td>12</td>\n",
" <td>4415</td>\n",
" <td>2121</td>\n",
" <td>unknown</td>\n",
" <td>454</td>\n",
" <td>1.0</td>\n",
" <td>0.896194</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8466</th>\n",
" <td>A4FB3C</td>\n",
" <td>NaN</td>\n",
" <td>0.416667</td>\n",
" <td>0.125000</td>\n",
" <td>0.083333</td>\n",
" <td>0.041667</td>\n",
" <td>0.333333</td>\n",
" <td>0.458333</td>\n",
" <td>0.458333</td>\n",
" <td>0.083333</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.562937</td>\n",
" <td>0.226224</td>\n",
" <td>0.138811</td>\n",
" <td>0.056294</td>\n",
" <td>0.015734</td>\n",
" <td>0.000000</td>\n",
" <td>0.009091</td>\n",
" <td>0.039860</td>\n",
" <td>0.866434</td>\n",
" <td>0.084615</td>\n",
" <td>0.144406</td>\n",
" <td>0.226923</td>\n",
" <td>0.222378</td>\n",
" <td>0.268182</td>\n",
" <td>0.013986</td>\n",
" <td>0.062587</td>\n",
" <td>0.004196</td>\n",
" <td>0.017483</td>\n",
" <td>24</td>\n",
" <td>5310</td>\n",
" <td>2860</td>\n",
" <td>P210</td>\n",
" <td>322</td>\n",
" <td>1.0</td>\n",
" <td>0.890482</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9204</th>\n",
" <td>A565E6</td>\n",
" <td>NaN</td>\n",
" <td>0.333333</td>\n",
" <td>0.200000</td>\n",
" <td>0.066667</td>\n",
" <td>0.000000</td>\n",
" <td>0.400000</td>\n",
" <td>0.600000</td>\n",
" <td>0.266667</td>\n",
" <td>0.133333</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.058968</td>\n",
" <td>0.120803</td>\n",
" <td>0.190418</td>\n",
" <td>0.151515</td>\n",
" <td>0.478296</td>\n",
" <td>0.000819</td>\n",
" <td>0.001229</td>\n",
" <td>0.017199</td>\n",
" <td>0.014742</td>\n",
" <td>0.966011</td>\n",
" <td>0.106880</td>\n",
" <td>0.240377</td>\n",
" <td>0.303440</td>\n",
" <td>0.207617</td>\n",
" <td>0.008190</td>\n",
" <td>0.064701</td>\n",
" <td>0.014333</td>\n",
" <td>0.017199</td>\n",
" <td>15</td>\n",
" <td>5106</td>\n",
" <td>2442</td>\n",
" <td>unknown</td>\n",
" <td>454</td>\n",
" <td>1.0</td>\n",
" <td>0.889338</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15256</th>\n",
" <td>AA3DAF</td>\n",
" <td>NaN</td>\n",
" <td>0.087719</td>\n",
" <td>0.192982</td>\n",
" <td>0.280702</td>\n",
" <td>0.175439</td>\n",
" <td>0.263158</td>\n",
" <td>0.087719</td>\n",
" <td>0.526316</td>\n",
" <td>0.245614</td>\n",
" <td>0.140351</td>\n",
" <td>0.000000</td>\n",
" <td>0.256432</td>\n",
" <td>0.545244</td>\n",
" <td>0.195710</td>\n",
" <td>0.001842</td>\n",
" <td>0.000772</td>\n",
" <td>0.005823</td>\n",
" <td>0.000000</td>\n",
" <td>0.114491</td>\n",
" <td>0.879033</td>\n",
" <td>0.000654</td>\n",
" <td>0.135702</td>\n",
" <td>0.091854</td>\n",
" <td>0.285426</td>\n",
" <td>0.106530</td>\n",
" <td>0.086448</td>\n",
" <td>0.211514</td>\n",
" <td>0.020914</td>\n",
" <td>0.011467</td>\n",
" <td>57</td>\n",
" <td>362</td>\n",
" <td>16831</td>\n",
" <td>C182</td>\n",
" <td>91</td>\n",
" <td>0.0</td>\n",
" <td>0.323519</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14100</th>\n",
" <td>A95959</td>\n",
" <td>NaN</td>\n",
" <td>0.279817</td>\n",
" <td>0.619266</td>\n",
" <td>0.077982</td>\n",
" <td>0.018349</td>\n",
" <td>0.004587</td>\n",
" <td>0.160550</td>\n",
" <td>0.532110</td>\n",
" <td>0.293578</td>\n",
" <td>0.013761</td>\n",
" <td>0.000000</td>\n",
" <td>0.185068</td>\n",
" <td>0.140729</td>\n",
" <td>0.144585</td>\n",
" <td>0.149317</td>\n",
" <td>0.380301</td>\n",
" <td>0.000175</td>\n",
" <td>0.075535</td>\n",
" <td>0.594988</td>\n",
" <td>0.328952</td>\n",
" <td>0.000351</td>\n",
" <td>0.124956</td>\n",
" <td>0.090256</td>\n",
" <td>0.117946</td>\n",
" <td>0.293200</td>\n",
" <td>0.011917</td>\n",
" <td>0.170172</td>\n",
" <td>0.060813</td>\n",
" <td>0.075359</td>\n",
" <td>218</td>\n",
" <td>1200</td>\n",
" <td>5706</td>\n",
" <td>C208</td>\n",
" <td>97</td>\n",
" <td>0.0</td>\n",
" <td>0.323506</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19765</th>\n",
" <td>ADFF65</td>\n",
" <td>NaN</td>\n",
" <td>0.200000</td>\n",
" <td>0.000000</td>\n",
" <td>0.300000</td>\n",
" <td>0.500000</td>\n",
" <td>0.000000</td>\n",
" <td>0.100000</td>\n",
" <td>0.100000</td>\n",
" <td>0.500000</td>\n",
" <td>0.100000</td>\n",
" <td>0.200000</td>\n",
" <td>0.016369</td>\n",
" <td>0.000000</td>\n",
" <td>0.002976</td>\n",
" <td>0.025298</td>\n",
" <td>0.955357</td>\n",
" <td>0.000000</td>\n",
" <td>0.034226</td>\n",
" <td>0.014881</td>\n",
" <td>0.074405</td>\n",
" <td>0.876488</td>\n",
" <td>0.098214</td>\n",
" <td>0.092262</td>\n",
" <td>0.159226</td>\n",
" <td>0.245536</td>\n",
" <td>0.032738</td>\n",
" <td>0.218750</td>\n",
" <td>0.043155</td>\n",
" <td>0.059524</td>\n",
" <td>10</td>\n",
" <td>4552</td>\n",
" <td>672</td>\n",
" <td>unknown</td>\n",
" <td>454</td>\n",
" <td>0.0</td>\n",
" <td>0.321905</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2629</th>\n",
" <td>A11FB5</td>\n",
" <td>NaN</td>\n",
" <td>0.090909</td>\n",
" <td>0.272727</td>\n",
" <td>0.090909</td>\n",
" <td>0.363636</td>\n",
" <td>0.181818</td>\n",
" <td>0.181818</td>\n",
" <td>0.454545</td>\n",
" <td>0.181818</td>\n",
" <td>0.000000</td>\n",
" <td>0.181818</td>\n",
" <td>0.071307</td>\n",
" <td>0.691002</td>\n",
" <td>0.230900</td>\n",
" <td>0.006791</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.044143</td>\n",
" <td>0.723260</td>\n",
" <td>0.232598</td>\n",
" <td>0.112054</td>\n",
" <td>0.101868</td>\n",
" <td>0.190153</td>\n",
" <td>0.134126</td>\n",
" <td>0.028862</td>\n",
" <td>0.317487</td>\n",
" <td>0.039049</td>\n",
" <td>0.028862</td>\n",
" <td>11</td>\n",
" <td>0</td>\n",
" <td>589</td>\n",
" <td>C82R</td>\n",
" <td>132</td>\n",
" <td>0.0</td>\n",
" <td>0.321645</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1355</th>\n",
" <td>A0519F</td>\n",
" <td>NaN</td>\n",
" <td>0.187500</td>\n",
" <td>0.156250</td>\n",
" <td>0.125000</td>\n",
" <td>0.500000</td>\n",
" <td>0.031250</td>\n",
" <td>0.093750</td>\n",
" <td>0.156250</td>\n",
" <td>0.062500</td>\n",
" <td>0.125000</td>\n",
" <td>0.562500</td>\n",
" <td>0.049751</td>\n",
" <td>0.072554</td>\n",
" <td>0.153814</td>\n",
" <td>0.171642</td>\n",
" <td>0.552239</td>\n",
" <td>0.046434</td>\n",
" <td>0.071310</td>\n",
" <td>0.115257</td>\n",
" <td>0.198176</td>\n",
" <td>0.568823</td>\n",
" <td>0.041874</td>\n",
" <td>0.080431</td>\n",
" <td>0.250415</td>\n",
" <td>0.250829</td>\n",
" <td>0.055970</td>\n",
" <td>0.252488</td>\n",
" <td>0.024046</td>\n",
" <td>0.016998</td>\n",
" <td>32</td>\n",
" <td>4610</td>\n",
" <td>2412</td>\n",
" <td>C501</td>\n",
" <td>119</td>\n",
" <td>0.0</td>\n",
" <td>0.321207</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>200 rows \u00d7 37 columns</p>\n",
"</div>"
],
"text/plain": [
" adshex label duration1 duration2 duration3 duration4 duration5 \\\n",
"12275 A7D925 NaN 0.121212 0.141414 0.070707 0.070707 0.595960 \n",
"2828 A144AF NaN 0.328358 0.134328 0.074627 0.029851 0.432836 \n",
"2720 A13098 NaN 0.166667 0.166667 0.166667 0.083333 0.416667 \n",
"8466 A4FB3C NaN 0.416667 0.125000 0.083333 0.041667 0.333333 \n",
"9204 A565E6 NaN 0.333333 0.200000 0.066667 0.000000 0.400000 \n",
"... ... ... ... ... ... ... ... \n",
"15256 AA3DAF NaN 0.087719 0.192982 0.280702 0.175439 0.263158 \n",
"14100 A95959 NaN 0.279817 0.619266 0.077982 0.018349 0.004587 \n",
"19765 ADFF65 NaN 0.200000 0.000000 0.300000 0.500000 0.000000 \n",
"2629 A11FB5 NaN 0.090909 0.272727 0.090909 0.363636 0.181818 \n",
"1355 A0519F NaN 0.187500 0.156250 0.125000 0.500000 0.031250 \n",
"\n",
" boxes1 boxes2 boxes3 boxes4 boxes5 speed1 speed2 \\\n",
"12275 0.212121 0.515152 0.242424 0.030303 0.000000 0.271168 0.494554 \n",
"2828 0.492537 0.328358 0.164179 0.000000 0.014925 0.134059 0.274446 \n",
"2720 0.250000 0.583333 0.166667 0.000000 0.000000 0.866572 0.071664 \n",
"8466 0.458333 0.458333 0.083333 0.000000 0.000000 0.562937 0.226224 \n",
"9204 0.600000 0.266667 0.133333 0.000000 0.000000 0.058968 0.120803 \n",
"... ... ... ... ... ... ... ... \n",
"15256 0.087719 0.526316 0.245614 0.140351 0.000000 0.256432 0.545244 \n",
"14100 0.160550 0.532110 0.293578 0.013761 0.000000 0.185068 0.140729 \n",
"19765 0.100000 0.100000 0.500000 0.100000 0.200000 0.016369 0.000000 \n",
"2629 0.181818 0.454545 0.181818 0.000000 0.181818 0.071307 0.691002 \n",
"1355 0.093750 0.156250 0.062500 0.125000 0.562500 0.049751 0.072554 \n",
"\n",
" speed3 speed4 speed5 altitude1 altitude2 altitude3 \\\n",
"12275 0.212671 0.016859 0.004747 0.018678 0.065840 0.345793 \n",
"2828 0.197484 0.148554 0.245457 0.001251 0.005371 0.008167 \n",
"2720 0.035361 0.020745 0.005658 0.053748 0.123055 0.665724 \n",
"8466 0.138811 0.056294 0.015734 0.000000 0.009091 0.039860 \n",
"9204 0.190418 0.151515 0.478296 0.000819 0.001229 0.017199 \n",
"... ... ... ... ... ... ... \n",
"15256 0.195710 0.001842 0.000772 0.005823 0.000000 0.114491 \n",
"14100 0.144585 0.149317 0.380301 0.000175 0.075535 0.594988 \n",
"19765 0.002976 0.025298 0.955357 0.000000 0.034226 0.014881 \n",
"2629 0.230900 0.006791 0.000000 0.000000 0.000000 0.044143 \n",
"1355 0.153814 0.171642 0.552239 0.046434 0.071310 0.115257 \n",
"\n",
" altitude4 altitude5 steer1 steer2 steer3 steer4 steer5 \\\n",
"12275 0.568557 0.001131 0.166840 0.315047 0.301537 0.096653 0.015661 \n",
"2828 0.053271 0.931940 0.152969 0.248841 0.266132 0.175116 0.010448 \n",
"2720 0.157473 0.000000 0.151344 0.176803 0.181047 0.300802 0.019331 \n",
"8466 0.866434 0.084615 0.144406 0.226923 0.222378 0.268182 0.013986 \n",
"9204 0.014742 0.966011 0.106880 0.240377 0.303440 0.207617 0.008190 \n",
"... ... ... ... ... ... ... ... \n",
"15256 0.879033 0.000654 0.135702 0.091854 0.285426 0.106530 0.086448 \n",
"14100 0.328952 0.000351 0.124956 0.090256 0.117946 0.293200 0.011917 \n",
"19765 0.074405 0.876488 0.098214 0.092262 0.159226 0.245536 0.032738 \n",
"2629 0.723260 0.232598 0.112054 0.101868 0.190153 0.134126 0.028862 \n",
"1355 0.198176 0.568823 0.041874 0.080431 0.250415 0.250829 0.055970 \n",
"\n",
" steer6 steer7 steer8 flights squawk_1 observations type \\\n",
"12275 0.047095 0.004015 0.009250 99 230 45079 T206 \n",
"2828 0.064013 0.014495 0.018247 67 5103 13591 unknown \n",
"2720 0.085809 0.010372 0.028289 12 4415 2121 unknown \n",
"8466 0.062587 0.004196 0.017483 24 5310 2860 P210 \n",
"9204 0.064701 0.014333 0.017199 15 5106 2442 unknown \n",
"... ... ... ... ... ... ... ... \n",
"15256 0.211514 0.020914 0.011467 57 362 16831 C182 \n",
"14100 0.170172 0.060813 0.075359 218 1200 5706 C208 \n",
"19765 0.218750 0.043155 0.059524 10 4552 672 unknown \n",
"2629 0.317487 0.039049 0.028862 11 0 589 C82R \n",
"1355 0.252488 0.024046 0.016998 32 4610 2412 C501 \n",
"\n",
" type_code predicted predicted_prob \n",
"12275 417 1.0 0.919753 \n",
"2828 454 1.0 0.896639 \n",
"2720 454 1.0 0.896194 \n",
"8466 322 1.0 0.890482 \n",
"9204 454 1.0 0.889338 \n",
"... ... ... ... \n",
"15256 91 0.0 0.323519 \n",
"14100 97 0.0 0.323506 \n",
"19765 454 0.0 0.321905 \n",
"2629 132 0.0 0.321645 \n",
"1355 119 0.0 0.321207 \n",
"\n",
"[200 rows x 37 columns]"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top_predictions = real_df.sort_values(by='predicted_prob', ascending=False).head(200)\n",
"top_predictions"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"top_predictions.to_csv(\"planes-to-research.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Questions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 1\n",
"\n",
"What kind of machine learning are we doing here, and why are we doing it?"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# Classification (or supervised learning) because we have labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 2\n",
"\n",
"What are a few different ways you can deal with categorical data? Think about how we dealt with race in the reveal regression compared to how we dealt with type in this dataset."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"# You can one-hot encode them if you only have a few categories\n",
"# You can convert them to numeric codes if you have a lot of categories"
]
},
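{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of both approaches side by side using pandas. The aircraft types are made up for illustration, and using `.cat.codes` for the numeric version is an assumption about one reasonable way to do it, not necessarily how `type_code` was built in this dataset.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up aircraft types, standing in for the `type` column\n",
"types = pd.DataFrame({'type': ['C182', 'T206', 'C182', 'unknown']})\n",
"\n",
"# One-hot encoding: one 0/1 column per category\n",
"one_hot = pd.get_dummies(types['type'], prefix='type')\n",
"\n",
"# Category codes: a single integer column, one number per category\n",
"codes = types['type'].astype('category').cat.codes\n",
"\n",
"print(one_hot)\n",
"print(codes.tolist())\n",
"```\n",
"\n",
"One-hot encoding adds a column for every category, which gets unwieldy with hundreds of aircraft types. Codes keep it to one column, which tree-based models can still split on, even though the numbers themselves have no real order."
]
},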
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 3\n",
"\n",
"Every time we ran a machine learning algorithm on our dataset, we looked at feature importance.\n",
"\n",
"* When might it be important to explain what our model found important?\n",
"* When might it not be important?"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"# If we're trying to understand what's going wrong or why it is/isn't working well\n",
"# It's more important if we're presenting this to the public"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 4\n",
"\n",
"Using words and not column names, describe what the machine learning algorithm found to be important when identifying surveillance planes."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"# Flying slowly and turning constantly (circling) instead of flying straight"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 5\n",
"\n",
"Why did we use a train/test split when it would have been more effective to give our model all of the data from the start?"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
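"# We shouldn't test the model on data it has already seen during training:\n",
"# held-out test data shows how well it generalizes to planes it hasn't seen"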
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 6\n",
"\n",
"Why did we use a random forest instead of a decision tree or logistic regression? Was there something about the data?"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"# Because it did a better job!!!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 7\n",
"\n",
"Why did we use probability instead of just looking for planes with a predicted value of 1? It seems like we should have just trusted the algorithm, right?"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"# The 0/1 prediction is just an arbitrary cutoff at 50% probability.\n",
"# We're fine going lower because it gives us more planes to research"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 8\n",
"\n",
"What if our random forest or input dataset were flawed? What would be the repercussions?"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"# We'd be investigating a bunch of planes that didn't need to be investigated"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 9\n",
"\n",
"The government could claim that we're threatening national security by publishing this paper as well as this code: now anyone could look for planes that are surveilling them. What do you think?"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"# Up to you!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 10\n",
"\n",
"We're using data from the past, but you can get real-time flight data from many services. Can you think of any uses for this algorithm using real-time instead of historical data?"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"# Maybe spotting, as it happens, when something unusual is going on police-wise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 11\n",
"\n",
"This isn't a question, but if you look at `candidates.csv` and `candidates-annotates.csv` you can see how Buzzfeed did their research after finding a list of suspicious planes."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"# k"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}