{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Finding faulty airbag complaints using a very simple keyword search with logistic regression\n", "\n", "**The story:**\n", "- https://www.nytimes.com/2014/09/12/business/air-bag-flaw-long-known-led-to-recalls.html\n", "- https://www.nytimes.com/2014/11/07/business/airbag-maker-takata-is-said-to-have-conducted-secret-tests.html\n", "- https://www.nytimes.com/interactive/2015/06/22/business/international/takata-airbag-recall-list.html\n", "- https://www.nytimes.com/2016/08/27/business/takata-airbag-recall-crisis.html\n", "\n", "This story, done by The New York Times, investigates the content in complaints made to National Highway Traffic Safety Administration (NHTSA) by customers who had bad experiences with Takata airbags in their cars. Eventually, car companies had to recall airbags made by the airbag supplier that promised a cheaper alternative. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<p class=\"reading-options\">\n <a class=\"btn\" href=\"/nyt-takata-airbags/airbag-classifier-search-binary\">\n <i class=\"fa fa-sm fa-book\"></i>\n Read online\n </a>\n <a class=\"btn\" href=\"/nyt-takata-airbags/notebooks/Airbag classifier search (Binary).ipynb\">\n <i class=\"fa fa-sm fa-download\"></i>\n Download notebook\n </a>\n <a class=\"btn\" href=\"https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/nyt-takata-airbags/notebooks/Airbag classifier search (Binary).ipynb\" target=\"_new\">\n <i class=\"fa fa-sm fa-laptop\"></i>\n Interactive version\n </a>\n</p>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prep work: Downloading necessary files\n", "Before we get started, we need to download all of the data we'll be using.\n", "* **CMPL.txt:** complaint codebook - Codebook for the complaints datast\n", "* **FLAT_CMPL.txt:** vehicle-related complaints - 1995-current from the National Highway Traffic Safety Administration\n", "* **sampled-unlabeled.csv:** unlabeled complaints - a sample of vehicle complaints, not labeled\n", "* **sampled-labeled.csv:** labeled complaints - a sample of vehicle complaints, labeled with being suspicious or not\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Make data directory if it doesn't exist\n", "!mkdir -p data\n", "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/nyt-takata-airbags/data/CMPL.txt -P data\n", "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/nyt-takata-airbags/data/FLAT_CMPL.txt.zip -P data\n", "!unzip -n -d data data/FLAT_CMPL.txt.zip\n", "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/nyt-takata-airbags/data/sampled-unlabeled.csv -P data\n", "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/nyt-takata-airbags/data/sampled-labeled.csv -P data" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Author:** Daeil Kim did a more complex version of this particular analysis - [presentation here](https://www.slideshare.net/mortardata/daeil-kim-at-the-nyc-data-science-meetup)\n", "\n", "**Topics:** Logistic Classifier\n", "\n", "**Datasets**\n", "\n", "* **FLAT_CMPL.txt:** Vehicle-related complaints from 1995-current from the [National Highway Traffic Safety Administration](https://www-odi.nhtsa.dot.gov/downloads/)\n", "* **CMPL.txt:** data dictionary for the above\n", "* **sampled-unlabeled.csv:** a sample of vehicle complaints, not labeled\n", "* **sampled-labeled.csv:** a sample of 
vehicle complaints, labeled with being suspicious or not\n", "\n", "## What's the goal?\n", "\n", "It's too much work to read twenty years of vehicle comments to find the ones related to dangerous airbags! Because we're lazy, we want the computer to do this for us. We're going to read a subset, mark each one as \"suspicious\" or \"not suspicious,\" then use that information to train the computer to read the rest and recognize which comments are suspicious and which are not suspicious.\n", "\n", "This is a **classification** problem, because we want the computer to recognize which ones are suspicious and which are not." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Our code\n", "\n", "## Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# Allow us to display 100 columns at a time, and 100 characters in each column (instead of ...)\n", "pd.set_option(\"display.max_columns\", 100)\n", "pd.set_option(\"display.max_colwidth\", 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Read in our data\n", "\n", "The dataset in `FLAT_CMPL.txt` doesn't have column headers, so we're going to use this long long list of headers that we stole from `CMPL.txt` to read it in.\n", "\n", "It's kind of a complicated dataset with a few errors here or there, so we're passing in a *lot* of options to `pd.read_csv`. In the end it's just a big big dataframe, though." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>CMPLID</th>\n", " <th>ODINO</th>\n", " <th>MFR_NAME</th>\n", " <th>MAKETXT</th>\n", " <th>MODELTXT</th>\n", " <th>YEARTXT</th>\n", " <th>CRASH</th>\n", " <th>FAILDATE</th>\n", " <th>FIRE</th>\n", " <th>INJURED</th>\n", " <th>DEATHS</th>\n", " <th>COMPDESC</th>\n", " <th>CITY</th>\n", " <th>STATE</th>\n", " <th>VIN</th>\n", " <th>DATEA</th>\n", " <th>LDATE</th>\n", " <th>MILES</th>\n", " <th>OCCURENCES</th>\n", " <th>CDESCR</th>\n", " <th>CMPL_TYPE</th>\n", " <th>POLICE_RPT_YN</th>\n", " <th>PURCH_DT</th>\n", " <th>ORIG_OWNER_YN</th>\n", " <th>ANTI_BRAKES_YN</th>\n", " <th>CRUISE_CONT_YN</th>\n", " <th>NUM_CYLS</th>\n", " <th>DRIVE_TRAIN</th>\n", " <th>FUEL_SYS</th>\n", " <th>FUEL_TYPE</th>\n", " <th>TRANS_TYPE</th>\n", " <th>VEH_SPEED</th>\n", " <th>DOT</th>\n", " <th>TIRE_SIZE</th>\n", " <th>LOC_OF_TIRE</th>\n", " <th>TIRE_FAIL_TYPE</th>\n", " <th>ORIG_EQUIP_YN</th>\n", " <th>MANUF_DT</th>\n", " <th>SEAT_TYPE</th>\n", " <th>RESTRAINT_TYPE</th>\n", " <th>DEALER_NAME</th>\n", " <th>DEALER_TEL</th>\n", " <th>DEALER_CITY</th>\n", " <th>DEALER_STATE</th>\n", " <th>DEALER_ZIP</th>\n", " <th>PROD_TYPE</th>\n", " <th>REPAIRED_YN</th>\n", " <th>MEDICAL_ATTN</th>\n", " <th>VEHICLES_TOWED_YN</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>958173</td>\n", " <td>Ford Motor Company</td>\n", " <td>LINCOLN</td>\n", " <td>TOWN CAR</td>\n", " <td>1994</td>\n", " <td>Y</td>\n", " <td>19941222</td>\n", " <td>N</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>SERVICE BRAKES, HYDRAULIC:PEDALS AND LINKAGES</td>\n", " <td>HIGH 
LAND PA</td>\n", " <td>MI</td>\n", " <td>1LNLM82W8RY</td>\n", " <td>19950103</td>\n", " <td>19950103</td>\n", " <td>NaN</td>\n", " <td>1</td>\n", " <td>BRAKE PEDAL PUSH ROD RETAINER WAS NOT PROPERLY INSTALLED, CAUSING BRAKES TO FAIL, RESULTING IN A...</td>\n", " <td>EVOQ</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>V</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>958146</td>\n", " <td>General Motors LLC</td>\n", " <td>GMC</td>\n", " <td>SONOMA</td>\n", " <td>1995</td>\n", " <td>NaN</td>\n", " <td>19941215</td>\n", " <td>N</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS</td>\n", " <td>MOBILE</td>\n", " <td>AL</td>\n", " <td>1GTCS19W3S8</td>\n", " <td>19950103</td>\n", " <td>19950103</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>VEHICLE STALLS AT HIGH SPEED, RESULTING IN LOSS OF STEERING AND BRAKING ABILITY. TT</td>\n", " <td>EVOQ</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>V</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>958127</td>\n", " <td>Ford Motor Company</td>\n", " <td>FORD</td>\n", " <td>RANGER</td>\n", " <td>1994</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>N</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>ENGINE AND ENGINE COOLING:EXHAUST SYSTEM</td>\n", " <td>N. LAUDERDAL</td>\n", " <td>FL</td>\n", " <td>NaN</td>\n", " <td>19950103</td>\n", " <td>19950103</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>EXHAUST SYSTEM FAILS; PLEASE DESCRIBE DETAILS. 
TT</td>\n", " <td>EVOQ</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>V</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>958170</td>\n", " <td>Ford Motor Company</td>\n", " <td>MERCURY</td>\n", " <td>COUGAR</td>\n", " <td>1995</td>\n", " <td>NaN</td>\n", " <td>19950101</td>\n", " <td>N</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS</td>\n", " <td>CORRAL SPRIN</td>\n", " <td>FL</td>\n", " <td>1MELM62W5SH</td>\n", " <td>19950103</td>\n", " <td>19950103</td>\n", " <td>NaN</td>\n", " <td>1</td>\n", " <td>BRAKING SYSTEM FAILURE WITHOUT ABS BRAKES. TT</td>\n", " <td>EVOQ</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>V</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>958149</td>\n", " <td>Nissan North America, Inc.</td>\n", " <td>NISSAN</td>\n", " <td>MAXIMA</td>\n", " <td>1987</td>\n", " <td>NaN</td>\n", " <td>19941223</td>\n", " <td>N</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>VISIBILITY:SUN ROOF ASSEMBLY</td>\n", " <td>COLUMBUS</td>\n", " <td>OH</td>\n", " <td>JN1HU11P3HX</td>\n", " <td>19950103</td>\n", " <td>19950103</td>\n", " <td>NaN</td>\n", " <td>1</td>\n", " <td>VEHICLES SUN ROOF GLASS FLEW OFF WHILE DRIVING. TT</td>\n", " <td>EVOQ</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>V</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " CMPLID ODINO MFR_NAME MAKETXT MODELTXT YEARTXT CRASH \\\n", "0 1 958173 Ford Motor Company LINCOLN TOWN CAR 1994 Y \n", "1 2 958146 General Motors LLC GMC SONOMA 1995 NaN \n", "2 3 958127 Ford Motor Company FORD RANGER 1994 NaN \n", "3 4 958170 Ford Motor Company MERCURY COUGAR 1995 NaN \n", "4 5 958149 Nissan North America, Inc. 
NISSAN MAXIMA 1987 NaN \n", "\n", " FAILDATE FIRE INJURED DEATHS \\\n", "0 19941222 N 0 0 \n", "1 19941215 N 0 0 \n", "2 NaN N 0 0 \n", "3 19950101 N 0 0 \n", "4 19941223 N 0 0 \n", "\n", " COMPDESC CITY STATE \\\n", "0 SERVICE BRAKES, HYDRAULIC:PEDALS AND LINKAGES HIGH LAND PA MI \n", "1 SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS MOBILE AL \n", "2 ENGINE AND ENGINE COOLING:EXHAUST SYSTEM N. LAUDERDAL FL \n", "3 SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS CORRAL SPRIN FL \n", "4 VISIBILITY:SUN ROOF ASSEMBLY COLUMBUS OH \n", "\n", " VIN DATEA LDATE MILES OCCURENCES \\\n", "0 1LNLM82W8RY 19950103 19950103 NaN 1 \n", "1 1GTCS19W3S8 19950103 19950103 NaN NaN \n", "2 NaN 19950103 19950103 NaN NaN \n", "3 1MELM62W5SH 19950103 19950103 NaN 1 \n", "4 JN1HU11P3HX 19950103 19950103 NaN 1 \n", "\n", " CDESCR \\\n", "0 BRAKE PEDAL PUSH ROD RETAINER WAS NOT PROPERLY INSTALLED, CAUSING BRAKES TO FAIL, RESULTING IN A... \n", "1 VEHICLE STALLS AT HIGH SPEED, RESULTING IN LOSS OF STEERING AND BRAKING ABILITY. TT \n", "2 EXHAUST SYSTEM FAILS; PLEASE DESCRIBE DETAILS. TT \n", "3 BRAKING SYSTEM FAILURE WITHOUT ABS BRAKES. TT \n", "4 VEHICLES SUN ROOF GLASS FLEW OFF WHILE DRIVING. TT \n", "\n", " CMPL_TYPE POLICE_RPT_YN PURCH_DT ORIG_OWNER_YN ANTI_BRAKES_YN \\\n", "0 EVOQ NaN NaN NaN NaN \n", "1 EVOQ NaN NaN NaN NaN \n", "2 EVOQ NaN NaN NaN NaN \n", "3 EVOQ NaN NaN NaN NaN \n", "4 EVOQ NaN NaN NaN NaN \n", "\n", " CRUISE_CONT_YN NUM_CYLS DRIVE_TRAIN FUEL_SYS FUEL_TYPE TRANS_TYPE VEH_SPEED \\\n", "0 NaN NaN NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN NaN NaN NaN \n", "\n", " DOT TIRE_SIZE LOC_OF_TIRE TIRE_FAIL_TYPE ORIG_EQUIP_YN MANUF_DT SEAT_TYPE \\\n", "0 NaN NaN NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN NaN NaN NaN \n", "\n", " RESTRAINT_TYPE DEALER_NAME DEALER_TEL DEALER_CITY DEALER_STATE DEALER_ZIP \\\n", "0 NaN NaN NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN NaN NaN \n", "\n", " PROD_TYPE REPAIRED_YN MEDICAL_ATTN VEHICLES_TOWED_YN \n", "0 V NaN NaN NaN \n", "1 V NaN NaN NaN \n", "2 V NaN NaN NaN \n", "3 V NaN NaN NaN \n", "4 V NaN NaN NaN " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "column_names = ['CMPLID', 'ODINO', 'MFR_NAME', 'MAKETXT', 'MODELTXT', \n", " 'YEARTXT', 'CRASH', 'FAILDATE', 'FIRE', 'INJURED', \n", " 'DEATHS', 'COMPDESC', 'CITY', 'STATE', 'VIN', 'DATEA', \n", " 'LDATE', 'MILES', 'OCCURENCES', 'CDESCR', 'CMPL_TYPE', \n", " 'POLICE_RPT_YN', 'PURCH_DT', 'ORIG_OWNER_YN', 'ANTI_BRAKES_YN', \n", " 'CRUISE_CONT_YN', 'NUM_CYLS', 'DRIVE_TRAIN', 'FUEL_SYS', 'FUEL_TYPE', \n", " 'TRANS_TYPE', 'VEH_SPEED', 'DOT', 'TIRE_SIZE', 'LOC_OF_TIRE', \n", " 'TIRE_FAIL_TYPE', 'ORIG_EQUIP_YN', 'MANUF_DT', 'SEAT_TYPE', \n", " 'RESTRAINT_TYPE', 'DEALER_NAME', 'DEALER_TEL', 'DEALER_CITY', \n", " 'DEALER_STATE', 'DEALER_ZIP', 'PROD_TYPE', 'REPAIRED_YN', \n", " 'MEDICAL_ATTN', 'VEHICLES_TOWED_YN']\n", "\n", "df = pd.read_csv(\"data/FLAT_CMPL.txt\",\n", " sep='\\t',\n", " dtype='str',\n", " header=None,\n", " error_bad_lines=False,\n", " encoding='latin-1',\n", " names=column_names)\n", "\n", "# We're only interested in pre-2015\n", "df = df[df.DATEA < '2015']\n", "\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many rows and 
columns are in this dataset?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1144207, 49)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## But wait, we don't even need that yet\n", "\n", "Oof, that's a lot of columns!\n", "\n", "When you're dealing with machine learning, one of the first things you'll need to think about is what columns are important to you. An important thing about this dataset is **it doesn't include whether the complaint is about faulty airbags or not.**\n", "\n", "We can't teach our classifier what a suspicious comment looks like if we don't have a list of suspicious complaints, right? Luckily, we have another dataset of labeled complaints!\n", "\n", "**Read in `sampled-labeled.csv`**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>is_suspicious</th>\n", " <th>CDESCR</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.0</td>\n", " <td>ALTHOUGH I LOVED THE CAR OVERALL AT THE TIME I DECIDED TO OWN, , MY DREAM CAR CADILLAC CTS HAS T...</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.0</td>\n", " <td>CONSUMER SHUT SLIDING DOOR WHEN ALL POWER LOCKS ON ALL DOORS LOCKED BY ITSELF, TRAPPING INFANT I...</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.0</td>\n", " <td>DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.0</td>\n", " <td>TL* THE CONTACT OWNS A 2009 NISSAN ALTIMA. THE CONTACT STATED THAT THE START BUTTON FOR THE IGNI...</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.0</td>\n", " <td>THE FRONT MIDDLE SEAT DOESN'T LOCK IN PLACE. *AK</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " is_suspicious \\\n", "0 0.0 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 0.0 \n", "\n", " CDESCR \n", "0 ALTHOUGH I LOVED THE CAR OVERALL AT THE TIME I DECIDED TO OWN, , MY DREAM CAR CADILLAC CTS HAS T... \n", "1 CONSUMER SHUT SLIDING DOOR WHEN ALL POWER LOCKS ON ALL DOORS LOCKED BY ITSELF, TRAPPING INFANT I... \n", "2 DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT \n", "3 TL* THE CONTACT OWNS A 2009 NISSAN ALTIMA. THE CONTACT STATED THAT THE START BUTTON FOR THE IGNI... \n", "4 THE FRONT MIDDLE SEAT DOESN'T LOCK IN PLACE. *AK " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled = pd.read_csv(\"data/sampled-labeled.csv\")\n", "labeled.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to use this dataset to **train our classifier about what a suspicious complaint looks like.** Once our classifier is trained we'll be able to use it to predict whether each complaint in that original (big big big) dataset is suspicious or not.\n", "\n", "We made this dataset through hard work, reading comments, and marking them as `0` (not suspicious) or `1` (suspicious). 
For example, this complaint isn\u2019t suspicious because it\u2019s about an air bag _not_ deploying:\n", "\n", "```\n", "DURING AN ACCIDENT AIR BAG'S DID NOT DEPLOY. DEALER HAS BEEN CONTACTED. *AK \n", "```\n", "\n", "This next one isn\u2019t suspicious either, because it isn\u2019t even about airbags!\n", "\n", "```\n", "DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT\n", "```\n", "\n", "But if something involves explosions or shrapnel happens, it\u2019s probably worth marking as suspicious:\n", "\n", "```I WAS DRIVEN IN A SCHOOL ZONE STREET AND THE LIGHTS OF AIRBAG ON AND APROX. 2 MINUTES THE AIR BAGS EXPLODED IN MY FACE, THE DRIVE AND PASSENGERS SIDE, THEN I STOPPED THE JEEP, IT SMELL LIKE SOMETHING IS BURNING AND HOT, I DID NOT SEE FIRE. *TR\n", "```\n", "\n", "So we went down the file in Excel, one by one, reading comments, marking them as 0 or 1.\n", "\n", "**How many are in each category?**" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0 150\n", "1.0 15\n", "Name: is_suspicious, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled.is_suspicious.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "150 non-suspicious and 15 suspicious is a pretty terrible ratio, but we're remarkably lazy and not very many of the comments are actually suspicious.\n", "\n", "Now that we've read a few, let's train our classifier\n", "\n", "## Creating features\n", "\n", "When you're working on machine learning, you need to feed the algorithm a bunch of inputs so it can make its decision. These are called **features**.\n", "\n", "There's a problem: computers only like features to be numbers, but every complaint is **just a bunch of text**, a.k.a. \"unstructured data.\" How can we turn all of this unstructured data into something a computer can understand?\n", "\n", "While there are fancier (and more effective!) ways to do what we're about to do, the simple start below is going to provide a foundation for later work.\n", "\n", "To teach our computer how to find suspicious complaints, we first need to think about how we find those complaints as human beings. By reading, right? So let's teach the computer how to read, and what to look for.\n", "\n", "### Designing our features\n", "\n", "Let's take a look at what the airbag issue is, according [Consumer Reports](https://www.consumerreports.org/car-recalls-defects/takata-airbag-recall-everything-you-need-to-know/):\n", "\n", "> Vehicles made by 19 different automakers have been recalled to replace frontal airbags on the driver\u2019s side or passenger\u2019s side, or both in what NHTSA has called \"the largest and most complex safety recall in U.S. history.\" The airbags, made by major parts supplier Takata, were mostly installed in cars from model year 2002 through 2015. Some of those airbags could deploy explosively, injuring or even killing car occupants. \n", "> \n", "> At the heart of the problem is the airbag\u2019s inflator, a metal cartridge loaded with propellant wafers, which in some cases has ignited with explosive force. 
If the inflator housing ruptures in a crash, metal shards from the airbag can be sprayed throughout the passenger cabin\u2014a potentially disastrous outcome from a supposedly life-saving device.\n", "\n", "If we're going through a list of vehicle complaints, it isn't too hard for us to figure out which complaints we might want to investigate further. If the complaint's about seatbelts or rear-view mirrors, we probably don't care about it. If the word \"airbag\" shows up in the description, though, we're going to start paying attention.\n", "\n", "We aren't interested in all complaints with the word \"airbag,\" though. Since we're worried about exploding airbags, something like \"the airbag did not deploy\" would get our attention because of the word \"airbag,\" but then we could ignore it once we saw the airbag just didn't work.\n", "\n", "### Selecting our features\n", "\n", "Since we just read a long long list of airbag complaints, we can probably brainstorm some words or phrases that might make a comment interesting or not interesting. A quick start might be these few:\n", "\n", "* airbag\n", "* air bag\n", "* failed\n", "* did not deploy\n", "* violent\n", "* explode\n", "* shrapnel\n", "\n", "These **features** are the things that the machine learning algorithm is going to look for when it's reading. There are lots of words in each complaint, but these are the only ones we'll tell the classifier to pay attention to!\n", "\n", "### Building our features dataframe\n", "\n", "Now we're going to convert each sentence into a list of numbers. It will be a new dataframe, where there's a `1` if the word is in the complaint and a `0` if it isn't.\n", "\n", "To determine if a word is in `CDESCR`, we can use `.str.contains`.\n", "\n", "**See if each row has the word `AIRBAG` in it.**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 False\n", "14 False\n", "15 False\n", "16 False\n", "17 False\n", "18 False\n", "19 True\n", "20 False\n", "21 False\n", "22 False\n", "23 False\n", "24 False\n", "25 False\n", "26 False\n", "27 False\n", "28 False\n", "29 False\n", " ... \n", "320 False\n", "321 False\n", "322 True\n", "323 False\n", "324 False\n", "325 False\n", "326 False\n", "327 False\n", "328 False\n", "329 False\n", "330 True\n", "331 True\n", "332 False\n", "333 True\n", "334 True\n", "335 True\n", "336 False\n", "337 True\n", "338 False\n", "339 True\n", "340 False\n", "341 True\n", "342 False\n", "343 True\n", "344 True\n", "345 False\n", "346 False\n", "347 False\n", "348 False\n", "349 False\n", "Name: CDESCR, Length: 350, dtype: bool" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled.CDESCR.str.contains(\"AIRBAG\", na=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Computers can't use `True` and `False`, though, we need numbers. 
We'll need to use `.astype(int)` to turn them into intgers, with `0` for `False` and `1` for `True`.\n", "\n", "**Give me a `1` for every row that contains \"AIRBAG\" and a `0` fo every row that does not.**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", "5 0\n", "6 0\n", "7 0\n", "8 0\n", "9 0\n", "10 0\n", "11 0\n", "12 0\n", "13 0\n", "14 0\n", "15 0\n", "16 0\n", "17 0\n", "18 0\n", "19 1\n", "20 0\n", "21 0\n", "22 0\n", "23 0\n", "24 0\n", "25 0\n", "26 0\n", "27 0\n", "28 0\n", "29 0\n", " ..\n", "320 0\n", "321 0\n", "322 1\n", "323 0\n", "324 0\n", "325 0\n", "326 0\n", "327 0\n", "328 0\n", "329 0\n", "330 1\n", "331 1\n", "332 0\n", "333 1\n", "334 1\n", "335 1\n", "336 0\n", "337 1\n", "338 0\n", "339 1\n", "340 0\n", "341 1\n", "342 0\n", "343 1\n", "344 1\n", "345 0\n", "346 0\n", "347 0\n", "348 0\n", "349 0\n", "Name: CDESCR, Length: 350, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled.CDESCR.str.contains(\"AIRBAG\", na=False).astype(int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How many `0` values and how many `1` values do we have?**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 205\n", "1 145\n", "Name: CDESCR, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled.CDESCR.str.contains(\"AIRBAG\", na=False).astype(int).value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, so about 200 don't have `AIRBAG` mentioned and about 150 do. That's a decent balance, I guess!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to make a new dataframe with a row for each complaint. Each word will have a column, and we'll have `0` or `1` as to whether the word is in there or not.\n", "\n", "* airbag\n", "* air bag\n", "* failed\n", "* did not deploy\n", "* violent\n", "* explode\n", "* shrapnel\n", "\n", "Along with the words, we'll **also save the `is_suspicious` label** to keep everything in the same place.\n", "\n", "I've started the dataset with the label and the word **airbag**, you'll need to add in the rest of them." 
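] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(An optional aside: if typing seven nearly identical `.str.contains` lines feels repetitive, here's a sketch that builds the same columns with a loop. It assumes the same `labeled` dataframe and the same seven words, and ends up with an identical dataframe - it's a shortcut, not a requirement.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A hedged sketch: build the same feature columns with a loop\n", "# Assumes the `labeled` dataframe from above and the seven words we brainstormed\n", "words = ['AIRBAG', 'AIR BAG', 'FAILED', 'DID NOT DEPLOY', 'VIOLENT', 'EXPLODE', 'SHRAPNEL']\n", "\n", "train_df = pd.DataFrame({'is_suspicious': labeled.is_suspicious})\n", "for word in words:\n", "    # 1 if the word shows up in the complaint, 0 if it doesn't\n", "    train_df[word.lower()] = labeled.CDESCR.str.contains(word, na=False).astype(int)\n", "\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Either way you get the same result. The cell below spells out each column by hand so you can see exactly what's happening."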
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>is_suspicious</th>\n", " <th>airbag</th>\n", " <th>air bag</th>\n", " <th>failed</th>\n", " <th>did not deploy</th>\n", " <th>violent</th>\n", " <th>explode</th>\n", " <th>shrapnel</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " is_suspicious airbag air bag failed did not deploy violent explode \\\n", "0 0.0 0 0 0 0 0 0 \n", "1 0.0 0 0 0 0 0 0 \n", "2 0.0 0 0 0 0 0 0 \n", "3 0.0 0 0 0 0 0 0 \n", "4 0.0 0 0 0 0 0 0 \n", "\n", " shrapnel \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df = pd.DataFrame({\n", " 'is_suspicious': labeled.is_suspicious,\n", " 'airbag': labeled.CDESCR.str.contains(\"AIRBAG\", na=False).astype(int),\n", " 'air bag': labeled.CDESCR.str.contains(\"AIR BAG\", na=False).astype(int),\n", " 'failed': labeled.CDESCR.str.contains(\"FAILED\", na=False).astype(int),\n", " 'did not deploy': labeled.CDESCR.str.contains(\"DID NOT DEPLOY\", na=False).astype(int),\n", " 'violent': labeled.CDESCR.str.contains(\"VIOLENT\", na=False).astype(int),\n", " 'explode': labeled.CDESCR.str.contains(\"EXPLODE\", na=False).astype(int),\n", " 'shrapnel': labeled.CDESCR.str.contains(\"SHRAPNEL\", na=False).astype(int),\n", "})\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check how many rows and columns your dataframe has. You'll want to make sure it has **8 columns**, and they should all be numbers." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(350, 8)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification\n", "\n", "The kind of problem we're dealing with here is called a **classification problem**. That's because we have two different classes of complaints:\n", "\n", "* Complaints that are suspicious\n", "* Complaints that are not suspicious\n", "\n", "And the machine's job is to classify new complaints in one of those two categories. 
Before we put it on the job, though, we need to **train it**.\n", "\n", "Before we start with that, though, let's see how many suspicious and non-suspicious comments are in our training set." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0 150\n", "1.0 15\n", "Name: is_suspicious, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.is_suspicious.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wait a second, I thought we had 350 rows? Where are the rest?\n", "\n", "* **Tip:** Try adding `dropna=False` to your `.value_counts()`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NaN 185\n", "0.0 150\n", "1.0 15\n", "Name: is_suspicious, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.is_suspicious.value_counts(dropna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yup, it looks like we're missing a LOT of labels. Classifiers hate missing data - both missing labels _and_ missing features - so we might as well remove any row that's missing any data.\n", "\n", "* **Tip:** If you use `.dropna()`, it will drop any rows that have `NaN` in them." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "train_df = train_df.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After dropping the missing rows, double-check that your dataframe is the size you expect." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(165, 8)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating our classifier\n", "\n", "Just like with linear regression, we call our classifier a **model**. It **models** the relationship between the inputs and the outputs.\n", "\n", "The classifier we're using is a special one that uses **logistic regression** under the hood, but that doesn't matter very much right now. Just know that it's a classifier!\n", "\n", "### Separating our features and labels\n", "\n", "We need to feed our classifier two things\n", "\n", "1. The features\n", "2. The labels\n", "\n", "Take a look at the first five rows of `train_df`." 
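] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Quick aside before we look at the data again: what does it mean that our classifier uses **logistic regression** under the hood? Roughly, it computes a weighted sum of the features (each word gets a coefficient) and squashes that sum through the logistic function to get a probability between 0 and 1. Here's a tiny sketch with completely made-up numbers, just to show the shape of the idea:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def sigmoid(z):\n", "    # squashes any number into the 0-1 range\n", "    return 1 / (1 + np.exp(-z))\n", "\n", "# A made-up weighted sum: an intercept plus one coefficient times one feature\n", "# These numbers are invented for illustration, not taken from our classifier\n", "z = -2.0 + 1.4 * 1\n", "sigmoid(z)  # about 0.35, i.e. roughly a 35% chance of being suspicious" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Training is just the classifier hunting for the intercept and coefficients that make those probabilities line up with our 0/1 labels as well as possible. Okay, aside over - here are the first five rows of `train_df`:"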
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>is_suspicious</th>\n", " <th>airbag</th>\n", " <th>air bag</th>\n", " <th>failed</th>\n", " <th>did not deploy</th>\n", " <th>violent</th>\n", " <th>explode</th>\n", " <th>shrapnel</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " is_suspicious airbag air bag failed did not deploy violent explode \\\n", "0 0.0 0 0 0 0 0 0 \n", "1 0.0 0 0 0 0 0 0 \n", "2 0.0 0 0 0 0 0 0 \n", "3 0.0 0 0 0 0 0 0 \n", "4 0.0 0 0 0 0 0 0 \n", "\n", " shrapnel \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`is_suspicious` is our label, and all of the othe columns are our features. We'll call the label `y` and the features `X`, because that's what everyone else does.\n", "\n", "The typical way of doing it is below (many people might use `axis=1` instead of `columns=`, but I like how explicit `columns=` is!)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Note that .drop doesn't drop the column permanently, it only drops the column to save it into `X`\n", "X = train_df.drop(columns=['is_suspicious'])\n", "y = train_df.is_suspicious" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Take a look at `X` and `y` to make sure they look like a list of features and a list of labels. You can use `.head()` on both of them, no problem." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>airbag</th>\n", " <th>air bag</th>\n", " <th>failed</th>\n", " <th>did not deploy</th>\n", " <th>violent</th>\n", " <th>explode</th>\n", " <th>shrapnel</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " airbag air bag failed did not deploy violent explode shrapnel\n", "0 0 0 0 0 0 0 0\n", "1 0 0 0 0 0 0 0\n", "2 0 0 0 0 0 0 0\n", "3 0 0 0 0 0 0 0\n", "4 0 0 0 0 0 0 0" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.head()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 0.0\n", "1 0.0\n", "2 0.0\n", "3 0.0\n", "4 0.0\n", "Name: is_suspicious, dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Building our classifier\n", "\n", "One we have our features and our labels, we can create a classifier.\n", "\n", "I'm actually going to move the `X=` and `y=` down into this section because it's nice to keep it all in one cell." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1000000000.0, class_weight=None, dual=False,\n", " fit_intercept=True, intercept_scaling=1, l1_ratio=None,\n", " max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',\n", " random_state=None, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "# Every column EXCEPT whether it's suspicious\n", "X = train_df.drop(columns='is_suspicious')\n", "# label is suspicious 0/1\n", "y = train_df.is_suspicious\n", "\n", "# Build a new classifier\n", "# C=1e9 is a magic secret I don't want to talk about\n", "# If we don't say solver='lbfgs' it complains that it's the new default\n", "clf = LogisticRegression(C=1e9, solver='lbfgs')\n", "\n", "# Teach the classifier about the complaints we read\n", "clf.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, that... 
seems to have done nothing.\n", "\n", "When we do linear regression, it prints out a bunch of stuff for us. It's nice! When we train a classifier, **it's up to us to use the classifier.**\n", "\n", "## Interpreting our classifier\n", "\n", "### Feature importance\n", "\n", "So the classifier did some reading. Hooray! We gave it all sorts of columns (each was a different word)... which columns did it think were important?" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>feature</th>\n", " <th>coefficient</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>4</th>\n", " <td>violent</td>\n", " <td>32.318434</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>explode</td>\n", " <td>1.819453</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>airbag</td>\n", " <td>1.404580</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>air bag</td>\n", " <td>0.812616</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>shrapnel</td>\n", " <td>-12.096964</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>failed</td>\n", " <td>-16.743779</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>did not deploy</td>\n", " <td>-21.108236</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " feature coefficient\n", "4 violent 32.318434\n", "5 explode 1.819453\n", "0 airbag 1.404580\n", "1 air bag 0.812616\n", "6 shrapnel -12.096964\n", "2 failed -16.743779\n", "3 did not deploy -21.108236" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The words we were looking for,\n", "# X were our features, X.columns is the column names\n", "feature_names = X.columns\n", "\n", "# Coefficients! Remember this from linear regression?\n", "coefficients = clf.coef_[0]\n", "\n", "pd.DataFrame({\n", " 'feature': feature_names,\n", " 'coefficient': coefficients\n", "}).sort_values(by='coefficient', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A higher number for a coefficient means \"this word makes me think it's suspicious, a.k.a. `1`\" and a lower number means \"this word makes me think it was not suspicious, a.k.a. `0`.\"\n", "\n", "Is there anything you found surprising about these results? Why do you think that might have happened?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Predicting with our classifier\n", "\n", "The point of a classifier is to classify documents it hasn't seen before, to read them and put them into the appropriate category. Before we can do this, we need to **extract features from our original dataframe**, the one that doesn't have labels.\n", "\n", "We'll do this the **same way** we did with our set of labeled data. Build a new dataframe that asks whether each complaint has the appropriate word:\n", "\n", "* airbag\n", "* air bag\n", "* failed\n", "* did not deploy\n", "* violent\n", "* explode\n", "* shrapnel\n", "\n", "I've started you off with one check for the word **airbag**." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>airbag</th>\n", " <th>air bag</th>\n", " <th>failed</th>\n", " <th>did not deploy</th>\n", " <th>violent</th>\n", " <th>explode</th>\n", " <th>shrapnel</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " airbag air bag failed did not deploy violent explode shrapnel\n", "0 0 0 0 0 0 0 0\n", "1 0 0 0 0 0 0 0\n", "2 0 0 0 0 0 0 0\n", "3 0 0 0 0 0 0 0\n", "4 0 0 0 0 0 0 0" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features = pd.DataFrame({\n", " 'airbag': df.CDESCR.str.contains(\"AIRBAG\", na=False).astype(int),\n", " 'airbag': df.CDESCR.str.contains(\"AIRBAG\", na=False).astype(int),\n", " 'air bag': df.CDESCR.str.contains(\"AIR BAG\", na=False).astype(int),\n", " 'failed': df.CDESCR.str.contains(\"FAILED\", na=False).astype(int),\n", " 'did not deploy': df.CDESCR.str.contains(\"DID NOT DEPLOY\", na=False).astype(int),\n", " 'violent': df.CDESCR.str.contains(\"VIOLENT\", na=False).astype(int),\n", " 'explode': df.CDESCR.str.contains(\"EXPLODE\", na=False).astype(int),\n", " 'shrapnel': df.CDESCR.str.contains(\"SHRAPNEL\", na=False).astype(int),\n", "})\n", "features.head()" ] }, { "cell_type": "code", "execution_count": 188, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "airbag 35613\n", "air bag 56358\n", "failed 129117\n", "did not deploy 16685\n", "violent 9994\n", "explode 6638\n", "shrapnel 160\n", "dtype: int64" ] }, "execution_count": 188, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features.sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataframe should have 7 columns, **none of which are `is_suspicious`**. It's unlabeled, remember? We aren't sure whether they're suspicious complaints or not.\n", "\n", "Confirm that real quick." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can add a new column, the classifier's guess about whether it's suspicious or not. To make the classifier guess, we use `.predict`. We just feed our features to the classifier and there we go!" 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0., 0., 0., ..., 0., 0., 0.])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.predict(features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's make a copy of `features` and give it a new column called `predicted`. That way if we need to use features again we won't have messed it up by adding new columns." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "features_with_prediction = features.copy()\n", "features_with_prediction['predicted'] = clf.predict(features)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first five." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features_with_prediction.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretty boring, right? No words in there, all predicted as `0`, not fun at all. Let's try filtering to see **the first ten where the prediction was `1`**." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>airbag</th>\n", " <th>air bag</th>\n", " <th>failed</th>\n", " <th>did not deploy</th>\n", " <th>violent</th>\n", " <th>explode</th>\n", " <th>shrapnel</th>\n", " <th>predicted</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>56</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>1217</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>1868</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>2035</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>2936</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>2960</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>3949</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>3952</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>4129</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " 
<td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>5362</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " airbag air bag failed did not deploy violent explode shrapnel \\\n", "56 0 0 0 0 1 0 0 \n", "1217 1 0 0 0 1 0 0 \n", "1868 0 0 0 0 1 0 0 \n", "2035 0 0 0 0 1 0 0 \n", "2936 0 0 1 0 1 0 0 \n", "2960 0 0 0 0 1 0 0 \n", "3949 0 0 1 0 1 0 0 \n", "3952 0 0 1 0 1 0 0 \n", "4129 0 0 0 0 1 0 0 \n", "5362 0 0 0 0 1 0 0 \n", "\n", " predicted \n", "56 1.0 \n", "1217 1.0 \n", "1868 1.0 \n", "2035 1.0 \n", "2936 1.0 \n", "2960 1.0 \n", "3949 1.0 \n", "3952 1.0 \n", "4129 1.0 \n", "5362 1.0 " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features_with_prediction[features_with_prediction.predicted == 1].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see most of the ones marked as suspicious include the words \"airbag\" and \"violent,\" and none of them include \"failed\" or \"did not deploy.\" That all makes sense, but what about all of the ones that include the word \"violent\" but not \"airbag\" or \"air bag?\" None of those should be good!\n", "\n", "While we could just filter it to only include ones with the word \"airabg\" in it, we probably need a way to **test the quality of our classifier**.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Testing our classifier\n", "\n", "When we look at the results of our classifier, we know some of them are wrong - complaints shouldn't be suspicious if they don't have airbags in them! But it would be nice to have an **automated process** to give us an idea of how well our classifier does.\n", "\n", "The problem is **we can't test our classifier on this unlabeled data**, because it doesn't know what's right and what's wrong. Instead, we have to test on the **labeled data** we trained our classifier on.\n", "\n", "One technique would be having our classifier compare the actual labels on our training data to what it would predict those labels to be." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9212121212121213" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Look at our training data, predict the labels,\n", "# then compare the labels to the actual labels\n", "clf.score(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Incredible, over 90% accuracy! ...that's good, right? **Well, not really.** There are two major reason why this isn't impressive!\n", "\n", "### Test-train split\n", "\n", "One big problem with our classifier is that we're testing it on **data it's already seen**. While it's cool to have a study sheet for a test, it doesn't quite seem fair if the **study sheet is exactly the same as the test**.\n", "\n", "Instead, we should try to reproduce what the real world is like - trainig it on one set of data, and testing it on *similar* data... but similar data we already know the labels for!\n", "\n", "To make this happen we use something called **train/test split**, where instead of using the _entire_ dataset for training, we only use _most_ of it - the default is 80% for training and 20% for testing. The code on the line below automatically splits the dataset into two groups, one for training and a smaller one for testing." 
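] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(One optional wrinkle, since we only have 15 suspicious complaints: a plain random split can leave very few of them in the test set. Passing `stratify=y` keeps the suspicious/not-suspicious ratio the same in both halves, and `random_state` makes the split reproducible. Here's a hedged sketch of that variation - it is not what we'll actually run below:)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Optional variation: keep the 0/1 ratio identical in the train and test halves,\n", "# and fix the random seed so the split is the same every time we run it\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)\n", "\n", "y_train.value_counts(), y_test.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For now we'll stick with the plain split in the cell below, but stratifying is worth remembering any time one category is much rarer than the other."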
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To try to understand what's going on, take a look at `X_train`, `X_test`, `y_train` and `y_test`, along with their sizes." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>airbag</th>\n", " <th>air bag</th>\n", " <th>failed</th>\n", " <th>did not deploy</th>\n", " <th>violent</th>\n", " <th>explode</th>\n", " <th>shrapnel</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>12</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>321</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>296</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>348</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>246</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " airbag air bag failed did not deploy violent explode shrapnel\n", "12 0 0 0 0 0 0 0\n", "321 0 1 0 1 0 0 0\n", "296 1 1 0 0 0 0 0\n", "348 0 1 0 0 0 0 0\n", "246 1 0 0 0 0 0 0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.head()" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>airbag</th>\n", " <th>air bag</th>\n", " <th>failed</th>\n", " <th>did not deploy</th>\n", " <th>violent</th>\n", " <th>explode</th>\n", " <th>shrapnel</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>84</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>240</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>127</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " 
<td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>23</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " airbag air bag failed did not deploy violent explode shrapnel\n", "84 1 1 0 0 0 0 0\n", "240 1 0 0 1 0 0 0\n", "127 0 1 0 1 0 0 0\n", "0 0 0 0 0 0 0 0\n", "23 0 0 0 0 0 0 0" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_test.head()" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(123, 7)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(42, 7)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_test.shape" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "12 0.0\n", "321 0.0\n", "296 0.0\n", "348 0.0\n", "246 0.0\n", "Name: is_suspicious, dtype: float64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train.head()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "84 0.0\n", "240 0.0\n", "127 0.0\n", "0 0.0\n", "23 0.0\n", "Name: is_suspicious, dtype: float64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test.head()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(123,)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train.shape" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(42,)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both the `X_` and the `y_` variables look just about exactly the same, the only difference is that `_train` contains a lot more than `_test`, and there are no repeats between the two.\n", "\n", "Now when we give the model a test, it hasn't seen the answers already!\n", "\n", "* Use `clf.fit` to train on the training sample\n", "* Use `clf.score` to score on the testing sample" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8809523809523809" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(X_train, y_train)\n", "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This part is fun, because there's a chance *it will get even better!* Weird, right? We'll talk about why that might have happened a little later.\n", "\n", "There are other ways to improve this further, but for now we have a larger problem to tackle." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The confusion matrix\n", "\n", "Our accuracy is looking great, hovering somewhere in the 90's. Feeling good, right? 
**Unfortunately, things aren't actually that rosy.**\n", "\n", "Let's take a look at how many suspicious and how many non-suspicious ones we have in our labeled dataset (for the millionth time, yes)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0 150\n", "1.0 15\n", "Name: is_suspicious, dtype: int64" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled.is_suspicious.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a lot more non-suspicious ones as compared to suspicious, right? Let's say we were classifying, and we *always* guessed \"not suspicious\". Since there are so few suspicious ones, we wouldn't get very many wrong, and our accuracy would be really high!\n", "\n", "> If we have 99 non-suspicious and 1 suspicious, if we always guess \"non-suspicious\" we'd have 99% accuracy.\n", "\n", "Even though our accuracy would look great, the result would be super boring. Since zero of our complaints would have been marked as suspicious, we wouldn't have anything to read or research. **It'd be much nicer if we could identify the difference between getting one category right compared to the other.**\n", "\n", "And hey, that's easy! We use this thing called a **confusion matrix**. It looks like this:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[150, 0],\n", " [ 14, 1]])" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "y_true = y\n", "y_pred = clf.predict(X)\n", "\n", "confusion_matrix(y_true, y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...which is pretty terrible-looking, right? It's hard as heck to understand! 
Let's try to spice it up a little bit and make it a little nicer to read:\n" ] },
{ "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Predicted not suspicious</th>\n", " <th>Predicted suspicious</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Is not suspicious</th>\n", " <td>150</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>Is suspicious</th>\n", " <td>14</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Predicted not suspicious Predicted suspicious\n", "Is not suspicious 150 0\n", "Is suspicious 14 1" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "\n", "# Save the true label, but also save the predicted label\n", "y_true = y\n", "y_pred = clf.predict(X)\n", "# We could also use just the test dataset\n", "# y_true = y_test\n", "# y_pred = clf.predict(X_test)\n", "\n", "matrix = confusion_matrix(y_true, y_pred)\n", "\n", "# But then make it look nice\n", "label_names = pd.Series(['not suspicious', 'suspicious'])\n", "pd.DataFrame(matrix,\n", " columns='Predicted ' + label_names,\n", " index='Is ' + label_names)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "So now we can see what's going on a little bit better. According to the confusion matrix, when using our original dataset (your numbers might be a little different):\n", "\n", "* We correctly predicted all 150 of the not-suspicious complaints\n", "* We only correctly predicted 1 of the 15 suspicious ones.\n", "\n", "Even though that gives us a really high score, **it's pretty useless**.\n", "\n", "## Thinking about what your outputs mean\n", "\n", "While we could spend a lot of time working on the math behind all of this and the technical ins and outs, I think a more useful thing for journalists to do - when analyzing both their own algorithms and other people's - is to think about **what incorrect outputs mean**.\n", "\n", "In this case, we're trying to predict whether we should investigate a given complaint. That basically means the computer takes a look and says, \"Hey human being, you should go look at this one.\"\n", "\n", "As a result, every complaint that's _incorrectly_ flagged as suspicious is a little extra reading for a human, but every suspicious complaint that _isn't_ flagged means we'll never think to look at that complaint.\n", "\n", "**Do you think it's better to incorrectly flag non-suspicious complaints as suspicious, or to incorrectly flag suspicious complaints as non-suspicious?**\n", "\n", "What are the upsides/downsides of each, and which side is more important to you?" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Classifier Probability\n", "\n", "When we use `clf.predict`, we only get a `0` or a `1`. That's kind of a fakeout, though, as under the hood there is actually something a (little) more complicated going on. 
Since we only have two categories, each row gets a score between 0% and 100% - how confident the classifier is that the row belongs to the \"suspicious\" category. If it's over 50%, it goes into that category!\n", "\n", "We can see this with `clf.predict_proba`." ] },
{ "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "X_with_predictions = X.copy()" ] },
{ "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>airbag</th>\n", " <th>air bag</th>\n", " <th>failed</th>\n", " <th>did not deploy</th>\n", " <th>violent</th>\n", " <th>explode</th>\n", " <th>shrapnel</th>\n", " <th>predicted</th>\n", " <th>probability</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.055643</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.055643</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.055643</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.055643</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.055643</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " airbag air bag failed did not deploy violent explode shrapnel \\\n", "0 0 0 0 0 0 0 0 \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "\n", " predicted probability \n", "0 0.0 0.055643 \n", "1 0.0 0.055643 \n", "2 0.0 0.055643 \n", "3 0.0 0.055643 \n", "4 0.0 0.055643 " ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_with_predictions['predicted'] = clf.predict(X)\n", "# [:,1] is the probability it belongs in the '1' category\n", "X_with_predictions['probability'] = clf.predict_proba(X)[:,1]\n", "X_with_predictions.head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Now we can be a little more discriminating - instead of just seeing the final above-or-below-50% classification, we can see exactly how confident the classifier was when it assigned each complaint to one category or the other. Try sorting by probability and showing the top 20, putting the higher probability at the top."
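, "\n", "\n", "One aside before the sorted list below: nothing forces us to keep the default 50% cutoff, either. As a rough sketch (the 25% threshold here is an arbitrary, purely illustrative choice), we could flag anything the classifier gives at least a 25% probability of being suspicious and hand that longer list to a human:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: flag everything with a predicted probability of 25% or more,\n", "# instead of the default 50% cutoff. The threshold is an illustrative choice.\n", "lower_threshold = 0.25\n", "X_with_predictions['flagged_at_25pct'] = (\n", "    X_with_predictions.probability >= lower_threshold\n", ").astype(int)\n", "X_with_predictions.flagged_at_25pct.value_counts()"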
] }, { "cell_type": "code", "execution_count": 52, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>airbag</th>\n", " <th>air bag</th>\n", " <th>failed</th>\n", " <th>did not deploy</th>\n", " <th>violent</th>\n", " <th>explode</th>\n", " <th>shrapnel</th>\n", " <th>predicted</th>\n", " <th>probability</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>303</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " <td>0.641107</td>\n", " </tr>\n", " <tr>\n", " <th>334</th>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.358830</td>\n", " </tr>\n", " <tr>\n", " <th>84</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.321545</td>\n", " </tr>\n", " <tr>\n", " <th>59</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.321545</td>\n", " </tr>\n", " <tr>\n", " <th>254</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.321545</td>\n", " </tr>\n", " <tr>\n", " <th>290</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.321545</td>\n", " </tr>\n", " <tr>\n", " <th>252</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.321545</td>\n", " </tr>\n", " <tr>\n", " <th>296</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.321545</td>\n", " </tr>\n", " <tr>\n", " <th>81</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.321545</td>\n", " </tr>\n", " <tr>\n", " <th>339</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.321545</td>\n", " </tr>\n", " <tr>\n", " <th>337</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.321545</td>\n", " </tr>\n", " <tr>\n", " <th>55</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.321545</td>\n", " </tr>\n", " <tr>\n", " <th>316</th>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.321545</td>\n", " </tr>\n", " <tr>\n", " <th>224</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " 
<td>0</td>\n", " <td>0.0</td>\n", " <td>0.158299</td>\n", " </tr>\n", " <tr>\n", " <th>223</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.158299</td>\n", " </tr>\n", " <tr>\n", " <th>57</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.158299</td>\n", " </tr>\n", " <tr>\n", " <th>349</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.158299</td>\n", " </tr>\n", " <tr>\n", " <th>342</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.158299</td>\n", " </tr>\n", " <tr>\n", " <th>323</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.158299</td>\n", " </tr>\n", " <tr>\n", " <th>348</th>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.158299</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " airbag air bag failed did not deploy violent explode shrapnel \\\n", "303 1 1 0 0 0 1 0 \n", "334 1 0 0 0 0 1 0 \n", "84 1 1 0 0 0 0 0 \n", "59 1 1 0 0 0 0 0 \n", "254 1 1 0 0 0 0 0 \n", "290 1 1 0 0 0 0 0 \n", "252 1 1 0 0 0 0 0 \n", "296 1 1 0 0 0 0 0 \n", "81 1 1 0 0 0 0 0 \n", "339 1 1 0 0 0 0 0 \n", "337 1 1 0 0 0 0 0 \n", "55 1 1 0 0 0 0 0 \n", "316 1 1 0 0 0 0 0 \n", "224 0 1 0 0 0 0 0 \n", "223 0 1 0 0 0 0 0 \n", "57 0 1 0 0 0 0 0 \n", "349 0 1 0 0 0 0 0 \n", "342 0 1 0 0 0 0 0 \n", "323 0 1 0 0 0 0 0 \n", "348 0 1 0 0 0 0 0 \n", "\n", " predicted probability \n", "303 1.0 0.641107 \n", "334 0.0 0.358830 \n", "84 0.0 0.321545 \n", "59 0.0 0.321545 \n", "254 0.0 0.321545 \n", "290 0.0 0.321545 \n", "252 0.0 0.321545 \n", "296 0.0 0.321545 \n", "81 0.0 0.321545 \n", "339 0.0 0.321545 \n", "337 0.0 0.321545 \n", "55 0.0 0.321545 \n", "316 0.0 0.321545 \n", "224 0.0 0.158299 \n", "223 0.0 0.158299 \n", "57 0.0 0.158299 \n", "349 0.0 0.158299 \n", "342 0.0 0.158299 \n", "323 0.0 0.158299 \n", "348 0.0 0.158299 " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_with_predictions.sort_values(by='probability', ascending=False).head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's improve our model\n", "\n", "Right now our model isn't very good. It doesn't seem to require the word \"airbag\" to be in it (maybe because we count \"airbag\" and \"air bag\" as separate words?) and doesn't include that many features. Can you think of ways to improve our model, and maybe try a few out?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Imports\n", "\n", "We'll just do this all over again." ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import confusion_matrix\n", "pd.set_option(\"display.max_colwidth\", 500)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Read in our labeled data\n", "\n", "Right now we're only dropping ones have missing labels. 
Why do we have so many missing labels? Are there other options for ones we could include/not include?" ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(165, 2)" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read in our data, drop those that are missing labels\n", "labeled = pd.read_csv(\"data/sampled-labeled.csv\")\n", "labeled = labeled.dropna()\n", "labeled.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create our X and y\n", "\n", "Are there other words you might look for? Any words you might remove?" ] }, { "cell_type": "code", "execution_count": 177, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>is_suspicious</th>\n", " <th>airbag</th>\n", " <th>air bag</th>\n", " <th>failed</th>\n", " <th>did not deploy</th>\n", " <th>violent</th>\n", " <th>explode</th>\n", " <th>shrapnel</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " is_suspicious airbag air bag failed did not deploy violent explode \\\n", "0 0.0 0 0 0 0 0 0 \n", "1 0.0 0 0 0 0 0 0 \n", "2 0.0 0 0 0 0 0 0 \n", "3 0.0 0 0 0 0 0 0 \n", "4 0.0 0 0 0 0 0 0 \n", "\n", " shrapnel \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df = pd.DataFrame({\n", " 'is_suspicious': labeled.is_suspicious,\n", " 'airbag': labeled.CDESCR.str.contains(\"AIRBAG\", na=False).astype(int),\n", " 'air bag': labeled.CDESCR.str.contains(\"AIR BAG\", na=False).astype(int),\n", " 'failed': labeled.CDESCR.str.contains(\"FAILED\", na=False).astype(int),\n", " 'did not deploy': labeled.CDESCR.str.contains(\"DID NOT DEPLOY\", na=False).astype(int),\n", " 'violent': labeled.CDESCR.str.contains(\"VIOLENT\", na=False).astype(int),\n", " 'explode': labeled.CDESCR.str.contains(\"EXPLODE\", na=False).astype(int),\n", " 'shrapnel': labeled.CDESCR.str.contains(\"SHRAPNEL\", na=False).astype(int),\n", "})\n", "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split into train and test\n", "\n", "Does giving the model more (or less) to train with change anything?" 
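, "\n", "\n", "One aside worth trying (this sketch is an addition, not part of the original walkthrough): with only a handful of suspicious complaints, a purely random split can leave almost none of them in the test set. `train_test_split` accepts a `stratify` argument that keeps the suspicious/non-suspicious proportions roughly the same in both halves:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: a stratified split keeps the (tiny) share of suspicious complaints\n", "# roughly equal in the training and testing halves. Uses train_df from above;\n", "# the _s suffix keeps these separate from the variables the next cell creates.\n", "X_s = train_df.drop(columns='is_suspicious')\n", "y_s = train_df.is_suspicious\n", "\n", "X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(\n", "    X_s, y_s, test_size=0.2, random_state=42, stratify=y_s)\n", "\n", "y_train_s.value_counts(), y_test_s.value_counts()"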
] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [], "source": [ "X = train_df.drop(columns='is_suspicious')\n", "y = train_df.is_suspicious\n", "\n", "# With test_size=0.3, we'll train on 70% and test on 30%\n", "# random_state=42 means it isn't actually random, it will always give you the same split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create and train our classifier\n", "\n", "You... don't know any other classifiers. But hey, you could always look some up, I guess!" ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1000000000.0, class_weight=None, dual=False,\n", " fit_intercept=True, intercept_scaling=1, l1_ratio=None,\n", " max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',\n", " random_state=None, solver='lbfgs', tol=0.0001, verbose=0,\n", " warm_start=False)" ] }, "execution_count": 179, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf = LogisticRegression(C=1e9, solver='lbfgs')\n", "\n", "clf.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check the important words\n", "\n", "Are the selected words pushing your results in the direction you think they should?" ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>feature</th>\n", " <th>coefficient</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>4</th>\n", " <td>violent</td>\n", " <td>45.690384</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>explode</td>\n", " <td>1.600866</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>air bag</td>\n", " <td>1.283316</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>airbag</td>\n", " <td>0.689808</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>shrapnel</td>\n", " <td>-11.551542</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>failed</td>\n", " <td>-23.881028</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>did not deploy</td>\n", " <td>-34.018183</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " feature coefficient\n", "4 violent 45.690384\n", "5 explode 1.600866\n", "1 air bag 1.283316\n", "0 airbag 0.689808\n", "6 shrapnel -11.551542\n", "2 failed -23.881028\n", "3 did not deploy -34.018183" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_names = X_train.columns\n", "# Coefficients! Remember this from linear regression?\n", "coefficients = clf.coef_[0]\n", "\n", "pd.DataFrame({\n", " 'feature': feature_names,\n", " 'coefficient': coefficients\n", "}).sort_values(by='coefficient', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Test our classifier\n", "\n", "We'll do a simple `.score` (which we know isn't very useful) along with a confusion matrix (which is harder to understand, but less useful). 
How do we feel about the results according to both?\n", "\n", "**Normally I'd only use the confusion matrix on `X_test`/`y_test`, but we do such a bad job that I feel like we should look at it all.**" ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8484848484848485" ] }, "execution_count": 181, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Predicted not suspicious</th>\n", " <th>Predicted suspicious</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>Is not suspicious</th>\n", " <td>150</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>Is suspicious</th>\n", " <td>13</td>\n", " <td>2</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Predicted not suspicious Predicted suspicious\n", "Is not suspicious 150 0\n", "Is suspicious 13 2" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_true = y\n", "y_pred = clf.predict(X)\n", "# y_true = y_test\n", "# y_pred = clf.predict(X_test)\n", "\n", "matrix = confusion_matrix(y_true, y_pred)\n", "\n", "label_names = pd.Series(['not suspicious', 'suspicious'])\n", "pd.DataFrame(matrix,\n", " columns='Predicted ' + label_names,\n", " index='Is ' + label_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**If you keep running this and running this, it's going to be different each time.** " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examining the results" ] }, { "cell_type": "code", "execution_count": 184, "metadata": {}, "outputs": [], "source": [ "train_df_with_predictions = train_df.copy()\n", "train_df_with_predictions['predicted'] = clf.predict(train_df.drop(columns='is_suspicious'))\n", "train_df_with_predictions['predicted_prob'] = clf.predict_proba(train_df.drop(columns='is_suspicious'))[:,1]\n", "train_df_with_predictions['sentence'] = labeled.CDESCR" ] }, { "cell_type": "code", "execution_count": 187, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>is_suspicious</th>\n", " <th>airbag</th>\n", " <th>air bag</th>\n", " <th>failed</th>\n", " <th>did not deploy</th>\n", " <th>violent</th>\n", " <th>explode</th>\n", " <th>shrapnel</th>\n", " <th>predicted</th>\n", " <th>predicted_prob</th>\n", " <th>sentence</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>294</th>\n", " <td>1.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " <td>1.000000</td>\n", " <td>DROVE THE CAR ABOUT 20 YARDS, THEN PLACED IT IN PARK TO ALLOW THE 
REAR VAN DOORS TO OPEN FOR OUR CHILDREN. WHEN THE KIDS GOT IN, I PLACED THE SHIFT LEVER IN DRIVE. IMMEDIATELY, BOTH THE DRIVER AND PASSENGER AIRBAGS DEPLOYED VIOLENTLY. I WAS NOT IN MOTION, WAS NOT STRUCK BY ANY OTHER VEHICLE OR OBJECT, AND MY FOOT WAS ON THE BRAKE. AN OFF-DUTY POLICE OFFICER WAS PARKED RIGHT BEHIND ME SAW THIS AND CAME TO HELP. HE NOTED THAT THE AIRBAG FIRING MECHANISMS CONTINUED TO FIRE. \"I'VE SEEN A LO...</td>\n", " </tr>\n", " <tr>\n", " <th>303</th>\n", " <td>1.0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>1.0</td>\n", " <td>0.655119</td>\n", " <td>I WAS DRIVEN IN A SCHOOL ZONE STREET AND THE LIGHTS OF AIRBAG ON AND APROX. 2 MINUTES THE AIR BAGS EXPLODED IN MY FACE, THE DRIVE AND PASSENGERS SIDE, THEN I STOPPED THE JEEP, IT SMELL LIKE SOMETHING IS BURNING AND HOT, I DID NOT SEE FIRE. *TR</td>\n", " </tr>\n", " <tr>\n", " <th>334</th>\n", " <td>0.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.344863</td>\n", " <td>SINGLE-CAR ACCIDENT; ROLLOVER, 2008 KIA RONDO DECLARED A TOTAL LOSS BY INSURANCE CO.; SAFETY FEATURES INCLUDED ELECTRONIC STABILITY CONTROL; 6 AIRBAGS INCLUDING SIDE/HEAD CURTAIN AND NOT ONE DEPLOYED - I SUFFER FROM APPROX 6\" LESION WITH PARTIAL SKULL SHOWING (16 STAPLES) TO LEFT SIDE OF HEAD FROM THE SIDEROOF SLAMMED INTO ME; I WAS WEARING A SEATBELT AND IT SAVED MY LIFE - GLASS EXPLODED EVERYWHERE, I HAVE SEVERE WHIPLASH AND CONCUSSION. *TR</td>\n", " </tr>\n", " <tr>\n", " <th>84</th>\n", " <td>0.0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.277029</td>\n", " <td>THE SEBRING HIT THE CAR IN FRONT OF IT. LEFT FRONT OF SEBRING HITTING THE RIGHT REAR BUMPER OF CAR IN FRONT OF IT. IMMEDIATELY STOPPING THE SEBRING AND THEN HAVING IT ROLL BACKWARDS INTO A DITCH. TOTALLY THE SEBRING. NO AIR BAGS DEPLOYED. INJURIES SUSTAINED BECAUSE OF AIRBAGS NOT DEPLOYING. WEATHER OUTSIDE WAS BELOW ZERO. *TR</td>\n", " </tr>\n", " <tr>\n", " <th>55</th>\n", " <td>0.0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.277029</td>\n", " <td>2005 NISSAN MURANO AIR BAG SENSOR LIGHT CONTINUED TO BLINK. CONSUMER WANTS TO KNOW IF THIS IS A SAFETY ISSUE. *NJ A DIAGNOSTICS DETERMINED THAT THE CONTACT SPIRAL IN THE STEERING COLUMN HAD AN OPEN CIRCUIT AND NEEDED TO BE REPLACED OR THE DRIVER'S SIDE AIRBAG WOULD NOT DEPLOY. *JB</td>\n", " </tr>\n", " <tr>\n", " <th>337</th>\n", " <td>0.0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.277029</td>\n", " <td>2008 JEEP WRANGLER RHD POSTAL VEHICLES USED BY RURAL CARRIERS FOR USPS. THOSE OF US WHO PURCHASED THE VEHICLES ARE HAVING ISSUES WITH AIRBAG MALFUNCTION INDICATOR LIGHT AND NO HORN USE OR INTERMITTENT USAGE. ONE ITEM OF INTEREST POSSIBLE CAUSE IS THE SPRINGCLOCK IN THE STEERING COLUMN AS COMPONENT THAT OPERATES BOTH HORN AND AIR BAG CIRCUITS. THIS ITEM HAS BEEN A PROBLEM IN YEARS PAST WITH JEEP. 
CONCERNED WITH A DEPLOYMENT OF FAULTY AIR BAG ISSUE MOSTLY, TRYING TO GET MY VEHICLE IN FOR A...</td>\n", " </tr>\n", " <tr>\n", " <th>339</th>\n", " <td>0.0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.277029</td>\n", " <td>SERVICE AIR BAG CHECK ENGINE LIGHT ON WHEN STARTING 2006 SILVERADO; GM DEALERS DIAGNOSED AIRBAG SENSOR MALFUNCTIONING REQUIRES REPLACEMENT. THIS SAFETY ISSUE HAS BEEN REPORTED IN 2009 AT EDMUNDS.COM BY OTHERS. CALLED, EMAILED AND TWITTED GM SINCE 14 MAY 2014. AFTER 13 PHONE TAGS WITH 4 GM CUSTOMER CARE SPECIALISTS AND 2 DEALER SERVICE REPS (INCLUDE MANAGER) AND 5 DAYS, A VERBAL WORD FROM GM \"YOUR VEHICLE IS WAY BELONG GM'S RESPONSIBILITY\"...WE ARE 'TICKED\" AROUND BY GM AND THE DEALER WITH ...</td>\n", " </tr>\n", " <tr>\n", " <th>59</th>\n", " <td>0.0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.277029</td>\n", " <td>COMMON DEFECT: AIR BAG WARNING LIGHT IS ON. SUSPECTED PASSENGER SEAT SENSOR FAILURE. I'VE HAD 2 BMW'S FROM THIS SERIES - BOTH WITH THE SAME ISSUE - YET NO RECALL FROM BMW - WHY? ISN'T THIS A SAFETY SYSTEM? I SEE AIRBAG RECALLS ON CARS DATING BACK TO THIS ERA. WHY NONE FOR BMW? ALSO - ODOMETER RIBBON CABLE FAILURE CAUSING PIXELS TO FAIL. NO RECALL - PROBLEM EXISTS ACROSS THE SERIES FROM 1996-2001 AND BEYOND. NO FIX - NO RECALL - NO REPAIRS OFFERED. WHY? *TR</td>\n", " </tr>\n", " <tr>\n", " <th>296</th>\n", " <td>0.0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.277029</td>\n", " <td>WE BOUGHT THIS VEHICLE IN JUNE 2013. WE HAD IT ALMOST A MONTH WHEN THE STEERING LOCKED UP ON ME. THE POWER STEERING PRESSURE LINE BLEW OUT. WE TOOK IT TO THE DEALER THAT WE BOUGHT IT FROM AND THEY FIXED THE LINE. THREE DAYS LATER THE STEERING WENT AGAIN WE GOT IT TO THE DEALERSHIP AND THEY WERE GOING TO CHARGE US $300.00. I GOT UPSET BUT THE MOST THEY WOULD DO FOR ME WAS SPLIT THE COST.THEN A FEW MONTHS LATER I STARTED TO SMELL GAS. I DIDN'T SEE ANY THING ON THE GROUND SO I LEFT IT GO. THEN...</td>\n", " </tr>\n", " <tr>\n", " <th>81</th>\n", " <td>0.0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0.0</td>\n", " <td>0.277029</td>\n", " <td>2007 HYUNDAI SONATA. CONSUMER WRITES IN REGARDS TO VEHICLE AIRBAG ISSUES. *SMD THE CONSUMER STATED THE AIR BAG LIGHT ILLUMINATED. THE CONSUMER HAD AN ISSUE WITH THE AIR BAG LIGHT ILLUMINATING IN OCTOBER 2012, WHERE THE DEALER REPLACED THE SEAT BELT BUCKLE ASSEMBLY. 
*JB</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " is_suspicious airbag air bag failed did not deploy violent explode \\\n", "294 1.0 1 0 0 0 1 0 \n", "303 1.0 1 1 0 0 0 1 \n", "334 0.0 1 0 0 0 0 1 \n", "84 0.0 1 1 0 0 0 0 \n", "55 0.0 1 1 0 0 0 0 \n", "337 0.0 1 1 0 0 0 0 \n", "339 0.0 1 1 0 0 0 0 \n", "59 0.0 1 1 0 0 0 0 \n", "296 0.0 1 1 0 0 0 0 \n", "81 0.0 1 1 0 0 0 0 \n", "\n", " shrapnel predicted predicted_prob \\\n", "294 0 1.0 1.000000 \n", "303 0 1.0 0.655119 \n", "334 0 0.0 0.344863 \n", "84 0 0.0 0.277029 \n", "55 0 0.0 0.277029 \n", "337 0 0.0 0.277029 \n", "339 0 0.0 0.277029 \n", "59 0 0.0 0.277029 \n", "296 0 0.0 0.277029 \n", "81 0 0.0 0.277029 \n", "\n", " sentence \n", "294 DROVE THE CAR ABOUT 20 YARDS, THEN PLACED IT IN PARK TO ALLOW THE REAR VAN DOORS TO OPEN FOR OUR CHILDREN. WHEN THE KIDS GOT IN, I PLACED THE SHIFT LEVER IN DRIVE. IMMEDIATELY, BOTH THE DRIVER AND PASSENGER AIRBAGS DEPLOYED VIOLENTLY. I WAS NOT IN MOTION, WAS NOT STRUCK BY ANY OTHER VEHICLE OR OBJECT, AND MY FOOT WAS ON THE BRAKE. AN OFF-DUTY POLICE OFFICER WAS PARKED RIGHT BEHIND ME SAW THIS AND CAME TO HELP. HE NOTED THAT THE AIRBAG FIRING MECHANISMS CONTINUED TO FIRE. \"I'VE SEEN A LO... \n", "303 I WAS DRIVEN IN A SCHOOL ZONE STREET AND THE LIGHTS OF AIRBAG ON AND APROX. 2 MINUTES THE AIR BAGS EXPLODED IN MY FACE, THE DRIVE AND PASSENGERS SIDE, THEN I STOPPED THE JEEP, IT SMELL LIKE SOMETHING IS BURNING AND HOT, I DID NOT SEE FIRE. *TR \n", "334 SINGLE-CAR ACCIDENT; ROLLOVER, 2008 KIA RONDO DECLARED A TOTAL LOSS BY INSURANCE CO.; SAFETY FEATURES INCLUDED ELECTRONIC STABILITY CONTROL; 6 AIRBAGS INCLUDING SIDE/HEAD CURTAIN AND NOT ONE DEPLOYED - I SUFFER FROM APPROX 6\" LESION WITH PARTIAL SKULL SHOWING (16 STAPLES) TO LEFT SIDE OF HEAD FROM THE SIDEROOF SLAMMED INTO ME; I WAS WEARING A SEATBELT AND IT SAVED MY LIFE - GLASS EXPLODED EVERYWHERE, I HAVE SEVERE WHIPLASH AND CONCUSSION. *TR \n", "84 THE SEBRING HIT THE CAR IN FRONT OF IT. LEFT FRONT OF SEBRING HITTING THE RIGHT REAR BUMPER OF CAR IN FRONT OF IT. IMMEDIATELY STOPPING THE SEBRING AND THEN HAVING IT ROLL BACKWARDS INTO A DITCH. TOTALLY THE SEBRING. NO AIR BAGS DEPLOYED. INJURIES SUSTAINED BECAUSE OF AIRBAGS NOT DEPLOYING. WEATHER OUTSIDE WAS BELOW ZERO. *TR \n", "55 2005 NISSAN MURANO AIR BAG SENSOR LIGHT CONTINUED TO BLINK. CONSUMER WANTS TO KNOW IF THIS IS A SAFETY ISSUE. *NJ A DIAGNOSTICS DETERMINED THAT THE CONTACT SPIRAL IN THE STEERING COLUMN HAD AN OPEN CIRCUIT AND NEEDED TO BE REPLACED OR THE DRIVER'S SIDE AIRBAG WOULD NOT DEPLOY. *JB \n", "337 2008 JEEP WRANGLER RHD POSTAL VEHICLES USED BY RURAL CARRIERS FOR USPS. THOSE OF US WHO PURCHASED THE VEHICLES ARE HAVING ISSUES WITH AIRBAG MALFUNCTION INDICATOR LIGHT AND NO HORN USE OR INTERMITTENT USAGE. ONE ITEM OF INTEREST POSSIBLE CAUSE IS THE SPRINGCLOCK IN THE STEERING COLUMN AS COMPONENT THAT OPERATES BOTH HORN AND AIR BAG CIRCUITS. THIS ITEM HAS BEEN A PROBLEM IN YEARS PAST WITH JEEP. CONCERNED WITH A DEPLOYMENT OF FAULTY AIR BAG ISSUE MOSTLY, TRYING TO GET MY VEHICLE IN FOR A... \n", "339 SERVICE AIR BAG CHECK ENGINE LIGHT ON WHEN STARTING 2006 SILVERADO; GM DEALERS DIAGNOSED AIRBAG SENSOR MALFUNCTIONING REQUIRES REPLACEMENT. THIS SAFETY ISSUE HAS BEEN REPORTED IN 2009 AT EDMUNDS.COM BY OTHERS. CALLED, EMAILED AND TWITTED GM SINCE 14 MAY 2014. 
AFTER 13 PHONE TAGS WITH 4 GM CUSTOMER CARE SPECIALISTS AND 2 DEALER SERVICE REPS (INCLUDE MANAGER) AND 5 DAYS, A VERBAL WORD FROM GM \"YOUR VEHICLE IS WAY BELONG GM'S RESPONSIBILITY\"...WE ARE 'TICKED\" AROUND BY GM AND THE DEALER WITH ... \n", "59 COMMON DEFECT: AIR BAG WARNING LIGHT IS ON. SUSPECTED PASSENGER SEAT SENSOR FAILURE. I'VE HAD 2 BMW'S FROM THIS SERIES - BOTH WITH THE SAME ISSUE - YET NO RECALL FROM BMW - WHY? ISN'T THIS A SAFETY SYSTEM? I SEE AIRBAG RECALLS ON CARS DATING BACK TO THIS ERA. WHY NONE FOR BMW? ALSO - ODOMETER RIBBON CABLE FAILURE CAUSING PIXELS TO FAIL. NO RECALL - PROBLEM EXISTS ACROSS THE SERIES FROM 1996-2001 AND BEYOND. NO FIX - NO RECALL - NO REPAIRS OFFERED. WHY? *TR \n", "296 WE BOUGHT THIS VEHICLE IN JUNE 2013. WE HAD IT ALMOST A MONTH WHEN THE STEERING LOCKED UP ON ME. THE POWER STEERING PRESSURE LINE BLEW OUT. WE TOOK IT TO THE DEALER THAT WE BOUGHT IT FROM AND THEY FIXED THE LINE. THREE DAYS LATER THE STEERING WENT AGAIN WE GOT IT TO THE DEALERSHIP AND THEY WERE GOING TO CHARGE US $300.00. I GOT UPSET BUT THE MOST THEY WOULD DO FOR ME WAS SPLIT THE COST.THEN A FEW MONTHS LATER I STARTED TO SMELL GAS. I DIDN'T SEE ANY THING ON THE GROUND SO I LEFT IT GO. THEN... \n", "81 2007 HYUNDAI SONATA. CONSUMER WRITES IN REGARDS TO VEHICLE AIRBAG ISSUES. *SMD THE CONSUMER STATED THE AIR BAG LIGHT ILLUMINATED. THE CONSUMER HAD AN ISSUE WITH THE AIR BAG LIGHT ILLUMINATING IN OCTOBER 2012, WHERE THE DEALER REPLACED THE SEAT BELT BUCKLE ASSEMBLY. *JB " ] }, "execution_count": 187, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df_with_predictions.sort_values(by='predicted_prob', ascending=False).head(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How are we going to fix this?\n", "\n", "Even if you can't successfully make your classifier perform any better, try to think about what you feel like could make it better." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review\n", "\n", "We have far too many complaints from the National Highway Traffic Safety Administration to read ourselves, so we're hoping to convince a computer to mark the ones we'll be interested in. **We hand-labeled a random sample as suspicious or not** and used this smaller dataset as a source of training material for our machine learning algorithm.\n", "\n", "We picked a few words we thought might be indicative of malfunctioning air bags, and added new columns to our dataset as to whether each complain has the word or not. We then **train** our algorithm, where it learns how these **features** are related to the **label** of suspicious or not. In this case we used a **logistic regression classifier**.\n", "\n", "It did not do a very good job, and we thought about reasons why." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discussion topics\n", "\n", "We have a few options: try to flag more examples, try a different machine learning algorithm, just give up and have someone read all of the complaints manually.\n", "\n", "Do you think our algorithm might perform better if we had a better split between suspicious and non-suspicious complaints?\n", "\n", "What do you think takes longer: learning to use machine learning, or searching for \"airbag\" and manually marking complaints as suspicious or not?" 
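, "\n", "\n", "If you want to experiment with the class-balance question above, one low-effort sketch is scikit-learn's `class_weight='balanced'` option, which makes the handful of suspicious examples count as heavily as the pile of non-suspicious ones during training. This is an addition to the walkthrough, your numbers will differ, and it's no substitute for labeling more complaints:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import confusion_matrix\n", "\n", "# Sketch: re-fit with class_weight='balanced' so the handful of suspicious\n", "# complaints counts for as much as the non-suspicious pile during training.\n", "balanced_clf = LogisticRegression(C=1e9, solver='lbfgs', class_weight='balanced')\n", "balanced_clf.fit(X_train, y_train)\n", "\n", "matrix_balanced = confusion_matrix(y, balanced_clf.predict(X))\n", "pd.DataFrame(matrix_balanced,\n", "             columns='Predicted ' + label_names,\n", "             index='Is ' + label_names)"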
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }