Finding faulty airbag complaints using a very simple keyword search with logistic regression#

The story:

This story, done by The New York Times, investigates complaints made to the National Highway Traffic Safety Administration (NHTSA) by customers who had bad experiences with Takata airbags in their cars. Eventually, car companies had to recall airbags made by Takata, the supplier that had promised a cheaper alternative.

Author: Daeil Kim did a more complex version of this particular analysis - presentation here

Topics: Logistic Classifier

Datasets

  • FLAT_CMPL.txt: Vehicle-related complaints from 1995-current from the National Highway Traffic Safety Administration
  • CMPL.txt: data dictionary for the above
  • sampled-unlabeled.csv: a sample of vehicle complaints, not labeled
  • sampled-labeled.csv: a sample of vehicle complaints, labeled with being suspicious or not

What's the goal?#

It's too much work to read twenty years of vehicle comments to find the ones related to dangerous airbags! Because we're lazy, we want the computer to do this for us. We're going to read a subset, mark each one as "suspicious" or "not suspicious," then use that information to train the computer to read the rest and recognize which comments are suspicious and which are not suspicious.

This is a classification problem, because we want the computer to recognize which ones are suspicious and which are not.

Our code#

Setup#

import pandas as pd

# Allow us to display 100 columns at a time, and 100 characters in each column (instead of ...)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)

Read in our data#

The dataset in FLAT_CMPL.txt doesn't have column headers, so we're going to use this long long list of headers that we stole from CMPL.txt to read it in.

It's kind of a complicated dataset with a few errors here or there, so we're passing in a lot of options to pd.read_csv. In the end it's just a big big dataframe, though.

column_names = ['CMPLID', 'ODINO', 'MFR_NAME', 'MAKETXT', 'MODELTXT', 
                'YEARTXT', 'CRASH', 'FAILDATE', 'FIRE', 'INJURED', 
                'DEATHS', 'COMPDESC', 'CITY', 'STATE', 'VIN', 'DATEA', 
                'LDATE', 'MILES', 'OCCURENCES', 'CDESCR', 'CMPL_TYPE', 
                'POLICE_RPT_YN', 'PURCH_DT', 'ORIG_OWNER_YN', 'ANTI_BRAKES_YN', 
                'CRUISE_CONT_YN', 'NUM_CYLS', 'DRIVE_TRAIN', 'FUEL_SYS', 'FUEL_TYPE', 
                'TRANS_TYPE', 'VEH_SPEED', 'DOT', 'TIRE_SIZE', 'LOC_OF_TIRE', 
                'TIRE_FAIL_TYPE', 'ORIG_EQUIP_YN', 'MANUF_DT', 'SEAT_TYPE', 
                'RESTRAINT_TYPE', 'DEALER_NAME', 'DEALER_TEL', 'DEALER_CITY', 
                'DEALER_STATE', 'DEALER_ZIP', 'PROD_TYPE', 'REPAIRED_YN', 
                'MEDICAL_ATTN', 'VEHICLES_TOWED_YN']

df = pd.read_csv("data/FLAT_CMPL.txt",
                 sep='\t',
                 dtype='str',
                 header=None,
                 error_bad_lines=False,  # skip the occasional malformed row (newer pandas spells this on_bad_lines='skip')
                 encoding='latin-1',
                 names=column_names)

# We're only interested in pre-2015
df = df[df.DATEA < '2015']

df.head()
CMPLID ODINO MFR_NAME MAKETXT MODELTXT YEARTXT CRASH FAILDATE FIRE INJURED DEATHS COMPDESC CITY STATE VIN DATEA LDATE MILES OCCURENCES CDESCR CMPL_TYPE POLICE_RPT_YN PURCH_DT ORIG_OWNER_YN ANTI_BRAKES_YN CRUISE_CONT_YN NUM_CYLS DRIVE_TRAIN FUEL_SYS FUEL_TYPE TRANS_TYPE VEH_SPEED DOT TIRE_SIZE LOC_OF_TIRE TIRE_FAIL_TYPE ORIG_EQUIP_YN MANUF_DT SEAT_TYPE RESTRAINT_TYPE DEALER_NAME DEALER_TEL DEALER_CITY DEALER_STATE DEALER_ZIP PROD_TYPE REPAIRED_YN MEDICAL_ATTN VEHICLES_TOWED_YN
0 1 958173 Ford Motor Company LINCOLN TOWN CAR 1994 Y 19941222 N 0 0 SERVICE BRAKES, HYDRAULIC:PEDALS AND LINKAGES HIGH LAND PA MI 1LNLM82W8RY 19950103 19950103 NaN 1 BRAKE PEDAL PUSH ROD RETAINER WAS NOT PROPERLY INSTALLED, CAUSING BRAKES TO FAIL, RESULTING IN A... EVOQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN V NaN NaN NaN
1 2 958146 General Motors LLC GMC SONOMA 1995 NaN 19941215 N 0 0 SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS MOBILE AL 1GTCS19W3S8 19950103 19950103 NaN NaN VEHICLE STALLS AT HIGH SPEED, RESULTING IN LOSS OF STEERING AND BRAKING ABILITY. TT EVOQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN V NaN NaN NaN
2 3 958127 Ford Motor Company FORD RANGER 1994 NaN NaN N 0 0 ENGINE AND ENGINE COOLING:EXHAUST SYSTEM N. LAUDERDAL FL NaN 19950103 19950103 NaN NaN EXHAUST SYSTEM FAILS; PLEASE DESCRIBE DETAILS. TT EVOQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN V NaN NaN NaN
3 4 958170 Ford Motor Company MERCURY COUGAR 1995 NaN 19950101 N 0 0 SERVICE BRAKES, HYDRAULIC:FOUNDATION COMPONENTS CORRAL SPRIN FL 1MELM62W5SH 19950103 19950103 NaN 1 BRAKING SYSTEM FAILURE WITHOUT ABS BRAKES. TT EVOQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN V NaN NaN NaN
4 5 958149 Nissan North America, Inc. NISSAN MAXIMA 1987 NaN 19941223 N 0 0 VISIBILITY:SUN ROOF ASSEMBLY COLUMBUS OH JN1HU11P3HX 19950103 19950103 NaN 1 VEHICLES SUN ROOF GLASS FLEW OFF WHILE DRIVING. TT EVOQ NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN V NaN NaN NaN

How many rows and columns are in this dataset?

df.shape
(1144207, 49)

But wait, we don't even need that yet#

Oof, that's a lot of columns!

When you're dealing with machine learning, one of the first things you'll need to think about is what columns are important to you. An important thing about this dataset is it doesn't include whether the complaint is about faulty airbags or not.

We can't teach our classifier what a suspicious comment looks like if we don't have a list of suspicious complaints, right? Luckily, we have another dataset of labeled complaints!

Read in sampled-labeled.csv

labeled = pd.read_csv("data/sampled-labeled.csv")
labeled.head()
is_suspicious CDESCR
0 0.0 ALTHOUGH I LOVED THE CAR OVERALL AT THE TIME I DECIDED TO OWN, , MY DREAM CAR CADILLAC CTS HAS T...
1 0.0 CONSUMER SHUT SLIDING DOOR WHEN ALL POWER LOCKS ON ALL DOORS LOCKED BY ITSELF, TRAPPING INFANT I...
2 0.0 DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT
3 0.0 TL* THE CONTACT OWNS A 2009 NISSAN ALTIMA. THE CONTACT STATED THAT THE START BUTTON FOR THE IGNI...
4 0.0 THE FRONT MIDDLE SEAT DOESN'T LOCK IN PLACE. *AK

We're going to use this dataset to train our classifier about what a suspicious complaint looks like. Once our classifier is trained we'll be able to use it to predict whether each complaint in that original (big big big) dataset is suspicious or not.

We made this dataset through hard work, reading comments, and marking them as 0 (not suspicious) or 1 (suspicious). For example, this complaint isn’t suspicious because it’s about an air bag not deploying:

DURING AN  ACCIDENT  AIR BAG'S DID NOT DEPLOY.  DEALER HAS BEEN CONTACTED.  *AK

This next one isn’t suspicious either, because it isn’t even about airbags!

DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS.  TT

But if something involves explosions or shrapnel, it’s probably worth marking as suspicious:

I WAS DRIVEN IN A SCHOOL ZONE STREET AND THE LIGHTS OF AIRBAG ON AND APROX. 2 MINUTES THE AIR BAGS EXPLODED IN MY FACE, THE DRIVE AND PASSENGERS SIDE, THEN I STOPPED THE JEEP, IT SMELL LIKE SOMETHING IS BURNING AND HOT, I DID NOT SEE FIRE. *TR

So we went down the file in Excel, one by one, reading comments, marking them as 0 or 1.

How many are in each category?

labeled.is_suspicious.value_counts()
0.0    150
1.0     15
Name: is_suspicious, dtype: int64

150 non-suspicious and 15 suspicious is a pretty terrible ratio, but we're remarkably lazy and not very many of the comments are actually suspicious.

Now that we've read a few, let's train our classifier

Creating features#

When you're working on machine learning, you need to feed the algorithm a bunch of inputs so it can make its decision. These are called features.

There's a problem: computers only like features to be numbers, but every complaint is just a bunch of text, a.k.a. "unstructured data." How can we turn all of this unstructured data into something a computer can understand?

While there are fancier (and more effective!) ways to do what we're about to do, the simple start below is going to provide a foundation for later work.
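One of those fancier ways, just so you've seen it: scikit-learn's CountVectorizer builds a column for every single word automatically, instead of making us hand-pick them. A quick sketch we won't actually use in this notebook:

from sklearn.feature_extraction.text import CountVectorizer

# One row per complaint, one column per word, counted automatically
# (we fill in missing complaint text so the vectorizer doesn't choke)
vectorizer = CountVectorizer()
word_counts = vectorizer.fit_transform(labeled.CDESCR.fillna(''))

For now, though, hand-picked words will keep things much easier to understand.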

To teach our computer how to find suspicious complaints, we first need to think about how we find those complaints as human beings. By reading, right? So let's teach the computer how to read, and what to look for.

Designing our features#

Let's take a look at what the airbag issue is, according to Consumer Reports:

Vehicles made by 19 different automakers have been recalled to replace frontal airbags on the driver’s side or passenger’s side, or both in what NHTSA has called "the largest and most complex safety recall in U.S. history." The airbags, made by major parts supplier Takata, were mostly installed in cars from model year 2002 through 2015. Some of those airbags could deploy explosively, injuring or even killing car occupants.

At the heart of the problem is the airbag’s inflator, a metal cartridge loaded with propellant wafers, which in some cases has ignited with explosive force. If the inflator housing ruptures in a crash, metal shards from the airbag can be sprayed throughout the passenger cabin—a potentially disastrous outcome from a supposedly life-saving device.

If we're going through a list of vehicle complaints, it isn't too hard for us to figure out which complaints we might want to investigate further. If the complaint's about seatbelts or rear-view mirrors, we probably don't care about it. If the word "airbag" shows up in the description, though, we're going to start paying attention.

We aren't interested in all complaints with the word "airbag," though. Since we're worried about exploding airbags, something like "the airbag did not deploy" would get our attention because of the word "airbag," but then we could ignore it once we saw the airbag just didn't work.

Selecting our features#

Since we just read a long long list of airbag complaints, we can probably brainstorm some words or phrases that might make a comment interesting or not interesting. A quick start might be these few:

  • airbag
  • air bag
  • failed
  • did not deploy
  • violent
  • explode
  • shrapnel

These features are the things that the machine learning algorithm is going to look for when it's reading. There are lots of words in each complaint, but these are the only ones we'll tell the classifier to pay attention to!

Building our features dataframe#

Now we're going to convert each sentence into a list of numbers. It will be a new dataframe, where there's a 1 if the word is in the complaint and a 0 if it isn't.

To determine if a word is in CDESCR, we can use .str.contains.

See if each row has the word AIRBAG in it.

labeled.CDESCR.str.contains("AIRBAG", na=False)
0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19      True
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
320    False
321    False
322     True
323    False
324    False
325    False
326    False
327    False
328    False
329    False
330     True
331     True
332    False
333     True
334     True
335     True
336    False
337     True
338    False
339     True
340    False
341     True
342    False
343     True
344     True
345    False
346    False
347    False
348    False
349    False
Name: CDESCR, Length: 350, dtype: bool

Computers can't use True and False, though - we need numbers. We'll use .astype(int) to turn them into integers: 0 for False and 1 for True.

Give me a 1 for every row that contains "AIRBAG" and a 0 for every row that does not.

labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int)
0      0
1      0
2      0
3      0
4      0
5      0
6      0
7      0
8      0
9      0
10     0
11     0
12     0
13     0
14     0
15     0
16     0
17     0
18     0
19     1
20     0
21     0
22     0
23     0
24     0
25     0
26     0
27     0
28     0
29     0
      ..
320    0
321    0
322    1
323    0
324    0
325    0
326    0
327    0
328    0
329    0
330    1
331    1
332    0
333    1
334    1
335    1
336    0
337    1
338    0
339    1
340    0
341    1
342    0
343    1
344    1
345    0
346    0
347    0
348    0
349    0
Name: CDESCR, Length: 350, dtype: int64

How many 0 values and how many 1 values do we have?

labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int).value_counts()
0    205
1    145
Name: CDESCR, dtype: int64

Okay, so about 200 don't have AIRBAG mentioned and about 150 do. That's a decent balance, I guess!

Now we need to make a new dataframe with a row for each complaint. Each word will have a column, and we'll have 0 or 1 as to whether the word is in there or not.

  • airbag
  • air bag
  • failed
  • did not deploy
  • violent
  • explode
  • shrapnel

Along with the words, we'll also save the is_suspicious label to keep everything in the same place.

I've started the dataset with the label and the word airbag; you'll need to add in the rest of them.

train_df = pd.DataFrame({
    'is_suspicious': labeled.is_suspicious,
    'airbag': labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': labeled.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': labeled.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': labeled.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': labeled.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': labeled.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': labeled.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
train_df.head()
is_suspicious airbag air bag failed did not deploy violent explode shrapnel
0 0.0 0 0 0 0 0 0 0
1 0.0 0 0 0 0 0 0 0
2 0.0 0 0 0 0 0 0 0
3 0.0 0 0 0 0 0 0 0
4 0.0 0 0 0 0 0 0 0

Check how many rows and columns your dataframe has. You'll want to make sure it has 8 columns, and they should all be numbers.

train_df.shape
(350, 8)

Classification#

The kind of problem we're dealing with here is called a classification problem. That's because we have two different classes of complaints:

  • Complaints that are suspicious
  • Complaints that are not suspicious

And the machine's job is to classify new complaints in one of those two categories. Before we put it on the job, though, we need to train it.

Before we start with that, though, let's see how many suspicious and non-suspicious comments are in our training set.

train_df.is_suspicious.value_counts()
0.0    150
1.0     15
Name: is_suspicious, dtype: int64

Wait a second, I thought we had 350 rows? Where are the rest?

  • Tip: Try adding dropna=False to your .value_counts().
train_df.is_suspicious.value_counts(dropna=False)
NaN    185
0.0    150
1.0     15
Name: is_suspicious, dtype: int64

Whoops, it looks like we're missing a LOT of labels. Classifiers hate missing data - both missing labels and missing features - so we might as well remove any row that's missing any data.

  • Tip: If you use .dropna(), it will drop any rows that have NaN in them.
train_df = train_df.dropna()

After dropping the missing rows, double-check that your dataframe is the size you expect.

train_df.shape
(165, 8)

Creating our classifier#

Just like with linear regression, we call our classifier a model. It models the relationship between the inputs and the outputs.

The classifier we're using is a special one that uses logistic regression under the hood, but that doesn't matter very much right now. Just know that it's a classifier!

Separating our features and labels#

We need to feed our classifier two things:

  1. The features
  2. The labels

Take a look at the first five rows of train_df.

train_df.head()
is_suspicious airbag air bag failed did not deploy violent explode shrapnel
0 0.0 0 0 0 0 0 0 0
1 0.0 0 0 0 0 0 0 0
2 0.0 0 0 0 0 0 0 0
3 0.0 0 0 0 0 0 0 0
4 0.0 0 0 0 0 0 0 0

is_suspicious is our label, and all of the other columns are our features. We'll call the label y and the features X, because that's what everyone else does.

The typical way of doing it is below (many people might use axis=1 instead of columns=, but I like how explicit columns= is!).

# Note that .drop doesn't remove the column from train_df permanently -
# it returns a new dataframe without the column, which we save into `X`
X = train_df.drop(columns=['is_suspicious'])
y = train_df.is_suspicious
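
For reference, the axis=1 version would look like this - identical result, just less descriptive:

# axis=1 means "drop along the columns axis" instead of dropping rows
X = train_df.drop('is_suspicious', axis=1)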

Take a look at X and y to make sure they look like a list of features and a list of labels. You can use .head() on both of them, no problem.

X.head()
airbag air bag failed did not deploy violent explode shrapnel
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
y.head()
0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: is_suspicious, dtype: float64

Building our classifier#

Once we have our features and our labels, we can create a classifier.

I'm actually going to move the X= and y= down into this section because it's nice to keep it all in one cell.

from sklearn.linear_model import LogisticRegression

# Every column EXCEPT whether it's suspicious
X = train_df.drop(columns='is_suspicious')
# label is suspicious 0/1
y = train_df.is_suspicious

# Build a new classifier
# C=1e9 basically turns off regularization - a magic secret we won't get into yet
# If we don't say solver='lbfgs' it complains that it's the new default
clf = LogisticRegression(C=1e9, solver='lbfgs')

# Teach the classifier about the complaints we read
clf.fit(X, y)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Okay, that... seems to have done nothing.

When we do linear regression, it prints out a bunch of stuff for us. It's nice! When we train a classifier, it's up to us to use the classifier.

Interpreting our classifier#

Feature importance#

So the classifier did some reading. Hooray! We gave it all sorts of columns (each was a different word)... which columns did it think were important?

# The words we were looking for:
# X was our features, so X.columns gives us the column names
feature_names = X.columns

# Coefficients! Remember this from linear regression?
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values(by='coefficient', ascending=False)
feature coefficient
4 violent 32.318434
5 explode 1.819453
0 airbag 1.404580
1 air bag 0.812616
6 shrapnel -12.096964
2 failed -16.743779
3 did not deploy -21.108236

A higher number for a coefficient means "this word makes me think it's suspicious, a.k.a. 1" and a lower number means "this word makes me think it was not suspicious, a.k.a. 0."
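
If the raw coefficients feel too abstract, one common trick is to exponentiate them into odds ratios: e to the coefficient is how much the odds of "suspicious" get multiplied every time the word shows up. A quick sketch:

import numpy as np

# exp(coefficient) = how much the odds of being suspicious multiply
# when the word appears in a complaint
pd.DataFrame({
    'feature': feature_names,
    'odds_ratio': np.exp(coefficients)
}).sort_values(by='odds_ratio', ascending=False)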

Is there anything you found surprising about these results? Why do you think that might have happened?

 

Predicting with our classifier#

The point of a classifier is to classify documents it hasn't seen before, to read them and put them into the appropriate category. Before we can do this, we need to extract features from our original dataframe, the one that doesn't have labels.

We'll do this the same way we did with our set of labeled data. Build a new dataframe that asks whether each complaint has the appropriate word:

  • airbag
  • air bag
  • failed
  • did not deploy
  • violent
  • explode
  • shrapnel

I've started you off with one check for the word airbag.

features = pd.DataFrame({
    'airbag': df.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': df.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': df.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': df.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': df.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': df.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': df.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
features.head()
airbag air bag failed did not deploy violent explode shrapnel
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
features.sum()
airbag             35613
air bag            56358
failed            129117
did not deploy     16685
violent             9994
explode             6638
shrapnel             160
dtype: int64

This dataframe should have 7 columns, none of which are is_suspicious. It's unlabeled, remember? We aren't sure whether they're suspicious complaints or not.

Confirm that real quick.
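
If you aren't sure where to start, a check might look something like this:

# Should be 7 columns (and over a million rows), with no is_suspicious
features.shape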

 

Now we can add a new column, the classifier's guess about whether it's suspicious or not. To make the classifier guess, we use .predict. We just feed our features to the classifier and there we go!

clf.predict(features)
array([0., 0., 0., ..., 0., 0., 0.])

Let's make a copy of features and give it a new column called predicted. That way if we need to use features again we won't have messed it up by adding new columns.

features_with_prediction = features.copy()
features_with_prediction['predicted'] = clf.predict(features)
 

Let's look at the first five.

features_with_prediction.head()

Pretty boring, right? No words in there, all predicted as 0, not fun at all. Let's try filtering to see the first ten where the prediction was 1.

features_with_prediction[features_with_prediction.predicted == 1].head(10)
airbag air bag failed did not deploy violent explode shrapnel predicted
56 0 0 0 0 1 0 0 1.0
1217 1 0 0 0 1 0 0 1.0
1868 0 0 0 0 1 0 0 1.0
2035 0 0 0 0 1 0 0 1.0
2936 0 0 1 0 1 0 0 1.0
2960 0 0 0 0 1 0 0 1.0
3949 0 0 1 0 1 0 0 1.0
3952 0 0 1 0 1 0 0 1.0
4129 0 0 0 0 1 0 0 1.0
5362 0 0 0 0 1 0 0 1.0

We can see that every one of the ones marked as suspicious includes the word "violent," and none of them include "did not deploy." That partly makes sense, but what about all of the ones that include the word "violent" but not "airbag" or "air bag"? None of those should be flagged!

While we could just filter it to only include ones with the word "airbag" in it, we probably need a way to test the quality of our classifier.

Testing our classifier#

When we look at the results of our classifier, we know some of them are wrong - complaints shouldn't be flagged as suspicious if they don't even mention airbags! But it would be nice to have an automated process that gives us an idea of how well our classifier does.

The problem is we can't test our classifier on this unlabeled data, because we don't know what the right answers are. Instead, we have to test on the labeled data we trained our classifier on.

One technique is to have our classifier predict labels for the training data, then compare those predictions to the actual labels.

# Look at our training data, predict the labels,
# then compare the labels to the actual labels
clf.score(X, y)
0.9212121212121213

Incredible, over 90% accuracy! ...that's good, right? Well, not really. There are two major reasons why this isn't impressive!
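
Before we dig into why, it's worth knowing that .score here is just plain accuracy - the fraction of predictions that match the actual labels. You can compute the very same number by hand if you ever doubt it:

# Accuracy by hand: what fraction of predictions match the actual labels?
(clf.predict(X) == y).mean()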

Test-train split#

One big problem with our classifier is that we're testing it on data it's already seen. While it's cool to have a study sheet for a test, it doesn't quite seem fair if the study sheet is exactly the same as the test.

Instead, we should try to reproduce what the real world is like - training it on one set of data, and testing it on similar data... but similar data we already know the labels for!

To make this happen we use something called train/test split, where instead of using the entire dataset for training, we only use most of it - by default, 75% goes to training and 25% to testing. The code below automatically splits the dataset into two groups, one for training and a smaller one for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

To try to understand what's going on, take a look at X_train, X_test, y_train and y_test, along with their sizes.

X_train.head()
airbag air bag failed did not deploy violent explode shrapnel
12 0 0 0 0 0 0 0
321 0 1 0 1 0 0 0
296 1 1 0 0 0 0 0
348 0 1 0 0 0 0 0
246 1 0 0 0 0 0 0
X_test.head()
airbag air bag failed did not deploy violent explode shrapnel
84 1 1 0 0 0 0 0
240 1 0 0 1 0 0 0
127 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0
23 0 0 0 0 0 0 0
X_train.shape
(123, 7)
X_test.shape
(42, 7)
y_train.head()
12     0.0
321    0.0
296    0.0
348    0.0
246    0.0
Name: is_suspicious, dtype: float64
y_test.head()
84     0.0
240    0.0
127    0.0
0      0.0
23     0.0
Name: is_suspicious, dtype: float64
y_train.shape
(123,)
y_test.shape
(42,)

Both the X_ and the y_ variables look just about exactly the same; the only difference is that _train contains a lot more than _test, and there are no repeats between the two.
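
If you want to convince yourself there really are no repeats, a quick sanity check is to intersect the two indexes:

# Should come back empty - no row lives in both training and testing sets
X_train.index.intersection(X_test.index)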

Now when we give the model a test, it hasn't seen the answers already!

  • Use clf.fit to train on the training sample
  • Use clf.score to score on the testing sample
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
0.8809523809523809

This part is fun, because there's a chance it will get even better! Weird, right? We'll talk about why that might have happened a little later.

There are other ways to improve this further, but for now we have a larger problem to tackle.

The confusion matrix#

Our accuracy is looking great, hovering somewhere around 90%. Feeling good, right? Unfortunately, things aren't actually that rosy.

Let's take a look at how many suspicious and how many non-suspicious ones we have in our labeled dataset (for the millionth time, yes)

labeled.is_suspicious.value_counts()
0.0    150
1.0     15
Name: is_suspicious, dtype: int64

We have a lot more non-suspicious ones as compared to suspicious, right? Let's say we were classifying, and we always guessed "not suspicious". Since there are so few suspicious ones, we wouldn't get very many wrong, and our accuracy would be really high!

If we have 99 non-suspicious and 1 suspicious, if we always guess "non-suspicious" we'd have 99% accuracy.
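
That lazy strategy is easy to check against our own labels, too - always guessing 0 scores exactly the share of 0 labels in the data:

# Accuracy of a classifier that always guesses "not suspicious"
(y == 0).mean()

That works out to about 0.91, barely different from the 0.92 our actual classifier scored.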

Even though our accuracy would look great, the result would be super boring. Since zero of our complaints would have been marked as suspicious, we wouldn't have anything to read or research. It'd be much nicer if we could identify the difference between getting one category right compared to the other.

And hey, that's easy! We use this thing called a confusion matrix. It looks like this:

from sklearn.metrics import confusion_matrix

y_true = y
y_pred = clf.predict(X)

confusion_matrix(y_true, y_pred)
array([[150,   0],
       [ 14,   1]])

...which is pretty terrible-looking, right? It's hard as heck to understand! Let's try to spice it up a little bit and make it a little nicer to read:

from sklearn.metrics import confusion_matrix

# Save the true label, but also save the predicted label
y_true = y
y_pred = clf.predict(X)
# We could also use just the test dataset
# y_true = y_test
# y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)

# But then make it look nice
label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted not suspicious Predicted suspicious
Is not suspicious 150 0
Is suspicious 14 1

So now we can see what's going on a little bit better. According to the confusion matrix, when using our original dataset (your numbers might be a little different):

  • We correctly predicted 150 of the 150 not-suspicious complaints
  • We only correctly predicted 1 of the 15 suspicious ones.

Even though that gives us a really high score, it's pretty useless.
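
If you'd rather have those per-category numbers computed for you, scikit-learn's classification_report does the same bookkeeping - the recall line for "suspicious" tells you what fraction of truly suspicious complaints we actually caught:

from sklearn.metrics import classification_report

# recall for "suspicious" = fraction of truly suspicious complaints we caught
print(classification_report(y_true, y_pred,
                            target_names=['not suspicious', 'suspicious']))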

Thinking about what your outputs mean#

While we could spend a lot of time working on the math behind all of this and the technical ins and outs, I think a more useful thing for journalists to do - when both analyzing their own algorithms as well as other people's algorithms - is to think about what incorrect outputs mean.

In this case, we're trying to predict whether we should investigate a given complaint. That basically means the computer takes a look and says, "hey human being, you should go check this one out."

As a result, every complaint that's incorrectly flagged as suspicious just means a little extra reading for a human being, but every suspicious complaint that's incorrectly flagged as non-suspicious means we'll never think to look at it at all.

Do you think it's better to incorrectly flag non-suspicious complaints as suspicious, or to incorrectly flag suspicious complaints as non-suspicious?

What are the upsides/downsides of each, and which side is more important to you?

 

Classifier Probability#

When we use clf.predict, we only get a 0 or a 1. That's kind of a fakeout, though, as under the hood there's actually something (a little) more complicated going on. Since we only have two categories, each row is given a score between 0% and 100% for how strongly it belongs to a category. If it's over 50%, it goes into that category!

We can see this with clf.predict_proba.

X_with_predictions = X.copy()
X_with_predictions['predicted'] = clf.predict(X)
# [:,1] is the probability it belongs in the '1' category
X_with_predictions['probability'] = clf.predict_proba(X)[:,1]
X_with_predictions.head()
airbag air bag failed did not deploy violent explode shrapnel predicted probability
0 0 0 0 0 0 0 0 0.0 0.055643
1 0 0 0 0 0 0 0 0.0 0.055643
2 0 0 0 0 0 0 0 0.0 0.055643
3 0 0 0 0 0 0 0 0.0 0.055643
4 0 0 0 0 0 0 0 0.0 0.055643

Now we can be a little more discriminating - instead of just seeing the final above-or-below-50% classification, we can see exactly how confident the classifier was when it assigned each row to one category or another. Try sorting by probability and showing the top 20, putting the higher probabilities at the top.

X_with_predictions.sort_values(by='probability', ascending=False).head(20)
airbag air bag failed did not deploy violent explode shrapnel predicted probability
303 1 1 0 0 0 1 0 1.0 0.641107
334 1 0 0 0 0 1 0 0.0 0.358830
84 1 1 0 0 0 0 0 0.0 0.321545
59 1 1 0 0 0 0 0 0.0 0.321545
254 1 1 0 0 0 0 0 0.0 0.321545
290 1 1 0 0 0 0 0 0.0 0.321545
252 1 1 0 0 0 0 0 0.0 0.321545
296 1 1 0 0 0 0 0 0.0 0.321545
81 1 1 0 0 0 0 0 0.0 0.321545
339 1 1 0 0 0 0 0 0.0 0.321545
337 1 1 0 0 0 0 0 0.0 0.321545
55 1 1 0 0 0 0 0 0.0 0.321545
316 1 1 0 0 0 0 0 0.0 0.321545
224 0 1 0 0 0 0 0 0.0 0.158299
223 0 1 0 0 0 0 0 0.0 0.158299
57 0 1 0 0 0 0 0 0.0 0.158299
349 0 1 0 0 0 0 0 0.0 0.158299
342 0 1 0 0 0 0 0 0.0 0.158299
323 0 1 0 0 0 0 0 0.0 0.158299
348 0 1 0 0 0 0 0 0.0 0.158299
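
One nice thing those probabilities buy us: we aren't stuck with the 50% cutoff. If missing suspicious complaints worries us more than doing some extra reading, we could flag anything above, say, 25% - the 0.25 below is an arbitrary threshold we just made up, not anything official:

# Flag anything with more than a 25% chance of being suspicious
# (0.25 is our own arbitrary pick - lower it to catch more, read more)
X_with_predictions['flagged'] = (X_with_predictions.probability > 0.25).astype(int)
X_with_predictions.flagged.value_counts()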

Let's improve our model#

Right now our model isn't very good. It doesn't seem to require the word "airbag" to be in it (maybe because we count "airbag" and "air bag" as separate words?) and doesn't include that many features. Can you think of ways to improve our model, and maybe try a few out?
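
One small idea to get you started: since .str.contains understands regular expressions, you could collapse the two airbag spellings into a single feature - a sketch:

# "AIR ?BAG" matches both "AIRBAG" and "AIR BAG" - one feature instead of two
labeled.CDESCR.str.contains("AIR ?BAG", na=False).astype(int)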

Imports#

We'll just do this all over again.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
pd.set_option("display.max_colwidth", 500)

Read in our labeled data#

Right now we're only dropping the ones that have missing labels. Why do we have so many missing labels? Are there other options for ones we could include/not include?

# Read in our data, drop those that are missing labels
labeled = pd.read_csv("data/sampled-labeled.csv")
labeled = labeled.dropna()
labeled.shape
(165, 2)

Create our X and y#

Are there other words you might look for? Any words you might remove?

train_df = pd.DataFrame({
    'is_suspicious': labeled.is_suspicious,
    'airbag': labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': labeled.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': labeled.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': labeled.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': labeled.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': labeled.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': labeled.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
train_df.head()
is_suspicious airbag air bag failed did not deploy violent explode shrapnel
0 0.0 0 0 0 0 0 0 0
1 0.0 0 0 0 0 0 0 0
2 0.0 0 0 0 0 0 0 0
3 0.0 0 0 0 0 0 0 0
4 0.0 0 0 0 0 0 0 0

Split into train and test#

Does giving the model more (or less) to train with change anything?

X = train_df.drop(columns='is_suspicious')
y = train_df.is_suspicious

# With test_size=0.2, we'll train on 80% and test on 20%
# random_state=42 means it isn't actually random, it will always give you the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Create and train our classifier#

You... don't know any other classifiers. But hey, you could always look some up, I guess!

clf = LogisticRegression(C=1e9, solver='lbfgs')

clf.fit(X_train, y_train)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Check the important words#

Are the selected words pushing your results in the direction you think they should?

feature_names = X_train.columns
# Coefficients! Remember this from linear regression?
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values(by='coefficient', ascending=False)
feature coefficient
4 violent 45.690384
5 explode 1.600866
1 air bag 1.283316
0 airbag 0.689808
6 shrapnel -11.551542
2 failed -23.881028
3 did not deploy -34.018183

Test our classifier#

We'll do a simple .score (which we know isn't very useful) along with a confusion matrix (which is harder to understand, but more useful). How do we feel about the results according to both?

Normally I'd only use the confusion matrix on X_test/y_test, but we do such a bad job that I feel like we should look at it all.

clf.score(X_test, y_test)
0.8484848484848485
y_true = y
y_pred = clf.predict(X)
# y_true = y_test
# y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted not suspicious Predicted suspicious
Is not suspicious 150 0
Is suspicious 13 2

If you re-ran this without random_state=42 pinning the split, these numbers would come out a little different each time.

Examining the results#

train_df_with_predictions = train_df.copy()
train_df_with_predictions['predicted'] = clf.predict(train_df.drop(columns='is_suspicious'))
train_df_with_predictions['predicted_prob'] = clf.predict_proba(train_df.drop(columns='is_suspicious'))[:,1]
train_df_with_predictions['sentence'] = labeled.CDESCR
train_df_with_predictions.sort_values(by='predicted_prob', ascending=False).head(10)
is_suspicious airbag air bag failed did not deploy violent explode shrapnel predicted predicted_prob sentence
294 1.0 1 0 0 0 1 0 0 1.0 1.000000 DROVE THE CAR ABOUT 20 YARDS, THEN PLACED IT IN PARK TO ALLOW THE REAR VAN DOORS TO OPEN FOR OUR CHILDREN. WHEN THE KIDS GOT IN, I PLACED THE SHIFT LEVER IN DRIVE. IMMEDIATELY, BOTH THE DRIVER AND PASSENGER AIRBAGS DEPLOYED VIOLENTLY. I WAS NOT IN MOTION, WAS NOT STRUCK BY ANY OTHER VEHICLE OR OBJECT, AND MY FOOT WAS ON THE BRAKE. AN OFF-DUTY POLICE OFFICER WAS PARKED RIGHT BEHIND ME SAW THIS AND CAME TO HELP. HE NOTED THAT THE AIRBAG FIRING MECHANISMS CONTINUED TO FIRE. "I'VE SEEN A LO...
303 1.0 1 1 0 0 0 1 0 1.0 0.655119 I WAS DRIVEN IN A SCHOOL ZONE STREET AND THE LIGHTS OF AIRBAG ON AND APROX. 2 MINUTES THE AIR BAGS EXPLODED IN MY FACE, THE DRIVE AND PASSENGERS SIDE, THEN I STOPPED THE JEEP, IT SMELL LIKE SOMETHING IS BURNING AND HOT, I DID NOT SEE FIRE. *TR
334 0.0 1 0 0 0 0 1 0 0.0 0.344863 SINGLE-CAR ACCIDENT; ROLLOVER, 2008 KIA RONDO DECLARED A TOTAL LOSS BY INSURANCE CO.; SAFETY FEATURES INCLUDED ELECTRONIC STABILITY CONTROL; 6 AIRBAGS INCLUDING SIDE/HEAD CURTAIN AND NOT ONE DEPLOYED - I SUFFER FROM APPROX 6" LESION WITH PARTIAL SKULL SHOWING (16 STAPLES) TO LEFT SIDE OF HEAD FROM THE SIDEROOF SLAMMED INTO ME; I WAS WEARING A SEATBELT AND IT SAVED MY LIFE - GLASS EXPLODED EVERYWHERE, I HAVE SEVERE WHIPLASH AND CONCUSSION. *TR
84 0.0 1 1 0 0 0 0 0 0.0 0.277029 THE SEBRING HIT THE CAR IN FRONT OF IT. LEFT FRONT OF SEBRING HITTING THE RIGHT REAR BUMPER OF CAR IN FRONT OF IT. IMMEDIATELY STOPPING THE SEBRING AND THEN HAVING IT ROLL BACKWARDS INTO A DITCH. TOTALLY THE SEBRING. NO AIR BAGS DEPLOYED. INJURIES SUSTAINED BECAUSE OF AIRBAGS NOT DEPLOYING. WEATHER OUTSIDE WAS BELOW ZERO. *TR
55 0.0 1 1 0 0 0 0 0 0.0 0.277029 2005 NISSAN MURANO AIR BAG SENSOR LIGHT CONTINUED TO BLINK. CONSUMER WANTS TO KNOW IF THIS IS A SAFETY ISSUE. *NJ A DIAGNOSTICS DETERMINED THAT THE CONTACT SPIRAL IN THE STEERING COLUMN HAD AN OPEN CIRCUIT AND NEEDED TO BE REPLACED OR THE DRIVER'S SIDE AIRBAG WOULD NOT DEPLOY. *JB
337 0.0 1 1 0 0 0 0 0 0.0 0.277029 2008 JEEP WRANGLER RHD POSTAL VEHICLES USED BY RURAL CARRIERS FOR USPS. THOSE OF US WHO PURCHASED THE VEHICLES ARE HAVING ISSUES WITH AIRBAG MALFUNCTION INDICATOR LIGHT AND NO HORN USE OR INTERMITTENT USAGE. ONE ITEM OF INTEREST POSSIBLE CAUSE IS THE SPRINGCLOCK IN THE STEERING COLUMN AS COMPONENT THAT OPERATES BOTH HORN AND AIR BAG CIRCUITS. THIS ITEM HAS BEEN A PROBLEM IN YEARS PAST WITH JEEP. CONCERNED WITH A DEPLOYMENT OF FAULTY AIR BAG ISSUE MOSTLY, TRYING TO GET MY VEHICLE IN FOR A...
339 0.0 1 1 0 0 0 0 0 0.0 0.277029 SERVICE AIR BAG CHECK ENGINE LIGHT ON WHEN STARTING 2006 SILVERADO; GM DEALERS DIAGNOSED AIRBAG SENSOR MALFUNCTIONING REQUIRES REPLACEMENT. THIS SAFETY ISSUE HAS BEEN REPORTED IN 2009 AT EDMUNDS.COM BY OTHERS. CALLED, EMAILED AND TWITTED GM SINCE 14 MAY 2014. AFTER 13 PHONE TAGS WITH 4 GM CUSTOMER CARE SPECIALISTS AND 2 DEALER SERVICE REPS (INCLUDE MANAGER) AND 5 DAYS, A VERBAL WORD FROM GM "YOUR VEHICLE IS WAY BELONG GM'S RESPONSIBILITY"...WE ARE 'TICKED" AROUND BY GM AND THE DEALER WITH ...
59 0.0 1 1 0 0 0 0 0 0.0 0.277029 COMMON DEFECT: AIR BAG WARNING LIGHT IS ON. SUSPECTED PASSENGER SEAT SENSOR FAILURE. I'VE HAD 2 BMW'S FROM THIS SERIES - BOTH WITH THE SAME ISSUE - YET NO RECALL FROM BMW - WHY? ISN'T THIS A SAFETY SYSTEM? I SEE AIRBAG RECALLS ON CARS DATING BACK TO THIS ERA. WHY NONE FOR BMW? ALSO - ODOMETER RIBBON CABLE FAILURE CAUSING PIXELS TO FAIL. NO RECALL - PROBLEM EXISTS ACROSS THE SERIES FROM 1996-2001 AND BEYOND. NO FIX - NO RECALL - NO REPAIRS OFFERED. WHY? *TR
296 0.0 1 1 0 0 0 0 0 0.0 0.277029 WE BOUGHT THIS VEHICLE IN JUNE 2013. WE HAD IT ALMOST A MONTH WHEN THE STEERING LOCKED UP ON ME. THE POWER STEERING PRESSURE LINE BLEW OUT. WE TOOK IT TO THE DEALER THAT WE BOUGHT IT FROM AND THEY FIXED THE LINE. THREE DAYS LATER THE STEERING WENT AGAIN WE GOT IT TO THE DEALERSHIP AND THEY WERE GOING TO CHARGE US $300.00. I GOT UPSET BUT THE MOST THEY WOULD DO FOR ME WAS SPLIT THE COST.THEN A FEW MONTHS LATER I STARTED TO SMELL GAS. I DIDN'T SEE ANY THING ON THE GROUND SO I LEFT IT GO. THEN...
81 0.0 1 1 0 0 0 0 0 0.0 0.277029 2007 HYUNDAI SONATA. CONSUMER WRITES IN REGARDS TO VEHICLE AIRBAG ISSUES. *SMD THE CONSUMER STATED THE AIR BAG LIGHT ILLUMINATED. THE CONSUMER HAD AN ISSUE WITH THE AIR BAG LIGHT ILLUMINATING IN OCTOBER 2012, WHERE THE DEALER REPLACED THE SEAT BELT BUCKLE ASSEMBLY. *JB
 

How are we going to fix this?#

Even if you can't successfully make your classifier perform any better, try to think about what you feel like could make it better.

 

Review#

We have far too many complaints from the National Highway Traffic Safety Administration to read ourselves, so we're hoping to convince a computer to mark the ones we'll be interested in. We hand-labeled a random sample as suspicious or not and used this smaller dataset as a source of training material for our machine learning algorithm.

We picked a few words we thought might be indicative of malfunctioning airbags, and added new columns to our dataset marking whether each complaint contains the word or not. We then trained our algorithm, letting it learn how these features relate to the suspicious/not-suspicious label. In this case we used a logistic regression classifier.

It did not do a very good job, and we thought about reasons why.

Discussion topics#

We have a few options: try to flag more examples, try a different machine learning algorithm, or just give up and have someone read all of the complaints manually.

Do you think our algorithm might perform better if we had a better split between suspicious and non-suspicious complaints?
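
One thing to experiment with there: train_test_split takes a stratify option that keeps the suspicious/non-suspicious ratio the same in both halves, so the test set can't end up with zero suspicious complaints by bad luck. A sketch:

# stratify=y preserves the 150:15 label ratio in both the training
# and testing sets (it doesn't fix the imbalance, just spreads it evenly)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)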

What do you think takes longer: learning to use machine learning, or searching for "airbag" and manually marking complaints as suspicious or not?