Finding faulty airbags in a sea of consumer complaints with a decision tree#

The story:

This story, from The New York Times, investigates complaints made to the National Highway Traffic Safety Administration (NHTSA) by customers who had bad experiences with Takata airbags in their cars. Eventually, car companies had to recall airbags made by Takata, the supplier that had promised a cheaper alternative.

Author: Daeil Kim, who did a more complex version of this particular analysis (presentation here)

Topics: Decision Trees, Random Forests

Datasets

  • sampled-labeled.csv: a sample of vehicle complaints, labeled as suspicious or not

What's the goal?#

It's too much work to read twenty years of vehicle complaints to find the ones related to dangerous airbags! Because we're lazy, we want the computer to do it for us. We did this before with a classifier based on logistic regression; now we're going to try a different one.

Our code#

Setup#

import pandas as pd

# Allow us to display 100 columns at a time, and 100 characters in each column (instead of ...)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)

Read in our labeled data#

We aren't going to use the unlabeled dataset this time; we're only going to look at how our classifier works. We'll start by reading in the complaints that have labels attached to them.

Read in sampled-labeled.csv and check how many suspicious/not suspicious complaints we have.

labeled = pd.read_csv("data/sampled-labeled.csv")
labeled.head()
is_suspicious CDESCR
0 0.0 ALTHOUGH I LOVED THE CAR OVERALL AT THE TIME I DECIDED TO OWN, , MY DREAM CAR CADILLAC CTS HAS T...
1 0.0 CONSUMER SHUT SLIDING DOOR WHEN ALL POWER LOCKS ON ALL DOORS LOCKED BY ITSELF, TRAPPING INFANT I...
2 0.0 DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT
3 0.0 TL* THE CONTACT OWNS A 2009 NISSAN ALTIMA. THE CONTACT STATED THAT THE START BUTTON FOR THE IGNI...
4 0.0 THE FRONT MIDDLE SEAT DOESN'T LOCK IN PLACE. *AK
labeled.is_suspicious.value_counts()
0.0    150
1.0     15
Name: is_suspicious, dtype: int64

150 non-suspicious and 15 suspicious is a pretty terrible ratio, but we're remarkably lazy, and not many of the complaints are actually suspicious.

Now that we've read a few, let's train our classifier.

Creating features#

Selecting our features and building a features dataframe#

Last time, we thought of some words or phrases that might signal whether a complaint is interesting or not. We came up with this list:

  • airbag
  • air bag
  • failed
  • did not deploy
  • violent
  • explode
  • shrapnel

These features are the things that the machine learning algorithm is going to look for when it's reading. There are lots of words in each complaint, but these are the only ones we'll tell the classifier to pay attention to!

To determine whether a word is in CDESCR, we can use .str.contains. Note that .str.contains is case-sensitive by default, which works here because the complaints are in all caps, and na=False treats missing descriptions as non-matches. Because classifiers only like numbers, we also need .astype(int) to change True/False to 1/0.

train_df = pd.DataFrame({
    'is_suspicious': labeled.is_suspicious,
    'airbag': labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': labeled.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': labeled.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': labeled.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': labeled.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': labeled.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': labeled.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
train_df.head()
is_suspicious airbag air bag failed did not deploy violent explode shrapnel
0 0.0 0 0 0 0 0 0 0
1 0.0 0 0 0 0 0 0 0
2 0.0 0 0 0 0 0 0 0
3 0.0 0 0 0 0 0 0 0
4 0.0 0 0 0 0 0 0 0
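
As an aside: if typing .str.contains seven times feels repetitive, you can build the same dataframe in a loop. This is just a sketch of an alternative; the result is the same train_df.

# A sketch: build the same features in a loop instead of by hand
words = ['AIRBAG', 'AIR BAG', 'FAILED', 'DID NOT DEPLOY',
         'VIOLENT', 'EXPLODE', 'SHRAPNEL']

features = {'is_suspicious': labeled.is_suspicious}
for word in words:
    features[word.lower()] = labeled.CDESCR.str.contains(word, na=False).astype(int)

train_df = pd.DataFrame(features)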

Let's see how big our dataset is, and then use .dropna() to remove any rows that are missing data (not all of the sampled complaints were labeled, so is_suspicious is missing for many of them).

train_df.shape
(350, 8)
train_df = train_df.dropna()
train_df.shape
(165, 8)

Creating our classifier#

Any time you're building a classifier, doing regression, or doing most anything with machine learning, you're using a model: something that models the relationship between the inputs and the outputs.

Classification with Logistic Regression#

Last time we used a classifier based on logistic regression. First we split our data into X (our features) and y (our labels), then trained the classifier on them.

from sklearn.linear_model import LogisticRegression

X = train_df.drop(columns='is_suspicious')
y = train_df.is_suspicious

clf = LogisticRegression(C=1e9, solver='lbfgs')

clf.fit(X, y)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

After we built our classifier, we tested it and found it didn't work very well.

from sklearn.metrics import confusion_matrix

y_true = y
y_pred = clf.predict(X)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted not suspicious Predicted suspicious
Is not suspicious 150 0
Is suspicious 13 2
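
If you'd rather see that as numbers, scikit-learn's classification_report gives per-class precision and recall. (This is an optional extra, not part of the original walkthrough.)

from sklearn.metrics import classification_report

# Recall for the 'suspicious' class will be awful: we only found 2 of 15
print(classification_report(y_true, y_pred,
                            target_names=['not suspicious', 'suspicious']))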

To understand a logistic regression classifier, we looked at the coefficients and the odds ratios.

import numpy as np

feature_names = X.columns
coefficients = clf.coef_[0]

pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)
feature coefficient (log odds ratio) odds ratio
4 violent 41.423096 9.768364e+17
5 explode 1.269048 3.557500e+00
1 air bag 1.268123 3.554200e+00
0 airbag 0.945612 2.574400e+00
2 failed -27.175214 0.000000e+00
3 did not deploy -37.906428 0.000000e+00
6 shrapnel -13.204894 0.000000e+00

Classification with Decision Trees#

We can also use a classifier called a decision tree. All you need to do is add one new import and change the line where you create your classifier.

#from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X = train_df.drop(columns='is_suspicious')
y = train_df.is_suspicious

#clf = LogisticRegression(C=1e9, solver='lbfgs')
clf = DecisionTreeClassifier()

clf.fit(X, y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

The confusion matrix code looks exactly the same.

from sklearn.metrics import confusion_matrix

y_true = y
y_pred = clf.predict(X)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted not suspicious Predicted suspicious
Is not suspicious 150 0
Is suspicious 13 2

With a decision tree, using the classifier is the same, but the code to understand the classifier is a bit different. Instead of coefficients, we're going to look at feature importances.

import eli5

label_names = ['not suspicious', 'suspicious']
feature_names = list(X.columns)

eli5.show_weights(clf,
                  feature_names=feature_names,
                  target_names=label_names,
                  show=['feature_importances', 'description'])
Weight Feature
0.3440 airbag
0.2026 violent
0.1529 air bag
0.1445 explode
0.1205 did not deploy
0.0266 failed
0.0091 shrapnel
Decision tree feature importances; values are numbers 0 <= x <= 1;
all values sum to 1.
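
If you don't have eli5 installed, the same numbers live directly on the fitted classifier as feature_importances_, a standard scikit-learn attribute. A quick sketch:

# The raw importances, straight from scikit-learn (no eli5 required)
pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)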

The most fun part of using a decision tree is visualizing it.

from sklearn import tree
import graphviz

label_names = ['not suspicious', 'suspicious']
feature_names = X.columns

dot_data = tree.export_graphviz(clf,
                    feature_names=feature_names,  
                    filled=True,
                    class_names=label_names)  
graph = graphviz.Source(dot_data)  
graph
[Decision tree diagram: the root splits on violent <= 0.5, then did not deploy, air bag, airbag, explode, failed, and shrapnel further down. Each node shows its gini impurity, sample count, class counts, and majority class; only two leaves end up classified as suspicious.]

You can also see the tree with eli5; I just suppressed it before (with the show= argument) because I thought we could use a little color.

feature_names = list(X.columns)

eli5.show_weights(clf,
                  feature_names=feature_names,
                  target_names=label_names)
Weight Feature
0.3440 airbag
0.2026 violent
0.1529 air bag
0.1445 explode
0.1205 did not deploy
0.0266 failed
0.0091 shrapnel



[eli5 renders the same tree, with the same splits, but with sample counts shown as percentages and class counts as proportions.]

And the best part is: almost everything you can do with a logistic regression classifier you can do with a decision tree. Most of the time you can just change your classifier to see if it does better.
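
For example, every scikit-learn classifier shares the same .fit/.predict interface, so you can loop over a few and compare. (This loop is a sketch of our own, not from the original story.)

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Swapping models is just a matter of changing one line
for clf in [LogisticRegression(C=1e9, solver='lbfgs'), DecisionTreeClassifier()]:
    clf.fit(X, y)
    print(type(clf).__name__, (clf.predict(X) == y).mean())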

Decision trees also have a lot of simple options. For example, max_depth controls how many levels deep the tree is allowed to grow, which keeps it small enough to read.

from sklearn.tree import DecisionTreeClassifier

X = train_df.drop(columns='is_suspicious')
y = train_df.is_suspicious

clf = DecisionTreeClassifier(max_depth=2)

clf.fit(X, y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
from sklearn import tree
import graphviz

label_names = ['not suspicious', 'suspicious']
feature_names = X.columns

dot_data = tree.export_graphviz(clf,
                    feature_names=feature_names,  
                    filled=True,
                    class_names=label_names)  
graph = graphviz.Source(dot_data)  
graph
[Decision tree diagram, max_depth=2: the root splits on violent <= 0.5, then did not deploy <= 0.5; the single complaint containing VIOLENT is the only leaf classified as suspicious.]

A random forest is usually even better#

Although in this case our inputs are terrible, so it's still not very good. Garbage in, garbage out.

We'll change our classifier to be

clf = RandomForestClassifier(n_estimators=100)

and it will use 100 decision trees to make a forest.

from sklearn.ensemble import RandomForestClassifier

X = train_df.drop(columns='is_suspicious')
y = train_df.is_suspicious

clf = RandomForestClassifier(n_estimators=100)

clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
from sklearn.metrics import confusion_matrix

y_true = y
y_pred = clf.predict(X)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted not suspicious Predicted suspicious
Is not suspicious 150 0
Is suspicious 13 2
feature_names = list(X.columns)

eli5.show_weights(clf, feature_names=feature_names, show=eli5.formatters.fields.ALL)

Explained as: feature importances

Random forest feature importances; values are numbers 0 <= x <= 1;
all values sum to 1.
Weight Feature
0.2394 ± 0.3216 did not deploy
0.2223 ± 0.3465 airbag
0.2101 ± 0.4087 air bag
0.1543 ± 0.2972 violent
0.1314 ± 0.2854 explode
0.0369 ± 0.0527 failed
0.0057 ± 0.0190 shrapnel
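
A random forest is literally a collection of decision trees, and after fitting, scikit-learn exposes them as clf.estimators_. As a sketch, you can pull one out and draw it the same way we drew the single tree above:

from sklearn import tree
import graphviz

# Each tree in the forest is a plain DecisionTreeClassifier, trained on
# a random resample of the data, so each one looks a little different
one_tree = clf.estimators_[0]
dot_data = tree.export_graphviz(one_tree,
                    feature_names=list(X.columns),
                    filled=True,
                    class_names=['not suspicious', 'suspicious'])
graphviz.Source(dot_data)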

Review#

In our previous two attempts to tackle the Takata airbag investigation, we used a logistic regression classifier. This time we tried two new types, a decision tree and a random forest. Random forests often perform slightly better, although with results this close it could just be chance.

Either way, the predictions were still quite poor: we only caught 2 of the 15 suspicious complaints.
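
One caveat worth flagging: every score above came from predicting on the same rows we trained on, which flatters the classifier. A more honest check holds some data out; here's a sketch using scikit-learn's cross_val_score (our own addition, not part of the original analysis).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Train on 4/5 of the data, score on the remaining 1/5, five times over
clf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())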

Discussion topics#

What's wrong here? Why does nothing work for us, even though we keep throwing more machine learning tools at it?