Finding faulty airbags in a sea of consumer complaints with a decision tree#
The story:
- https://www.nytimes.com/2014/09/12/business/air-bag-flaw-long-known-led-to-recalls.html
- https://www.nytimes.com/2014/11/07/business/airbag-maker-takata-is-said-to-have-conducted-secret-tests.html
- https://www.nytimes.com/interactive/2015/06/22/business/international/takata-airbag-recall-list.html
- https://www.nytimes.com/2016/08/27/business/takata-airbag-recall-crisis.html
This story, reported by The New York Times, investigates the content of complaints made to the National Highway Traffic Safety Administration (NHTSA) by customers who had bad experiences with Takata airbags in their cars. Eventually, car companies had to recall airbags made by the supplier that had promised a cheaper alternative.
Author: Daeil Kim did a more complex version of this particular analysis - presentation here
Topics: Decision Trees, Random Forests
Datasets
- sampled-labeled.csv: a sample of vehicle complaints, labeled as suspicious or not
What's the goal?#
It was too much work to read twenty years of vehicle complaints to find the ones related to dangerous airbags! Because we're lazy, we wanted the computer to do this for us. We did this before with a classifier that used logistic regression; now we're going to try a different one.
import pandas as pd
# Allow us to display 100 columns at a time, and 100 characters in each column (instead of ...)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)
Read in our labeled data#
We aren't going to be using the unlabeled dataset this time; we're only going to look at how our classifier works. We'll start by reading in our complaints that have labels attached to them.
Read in sampled-labeled.csv and check how many suspicious/not suspicious complaints we have.
labeled = pd.read_csv("data/sampled-labeled.csv")
labeled.head()
labeled.is_suspicious.value_counts()
150 non-suspicious to 15 suspicious is a pretty terrible ratio, but we're remarkably lazy and not very many of the complaints are actually suspicious.
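If you'd rather see that imbalance as percentages, value_counts can normalize for you:
# Same counts as above, but as fractions of all labeled complaints
labeled.is_suspicious.value_counts(normalize=True)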
Now that we've read a few, let's train our classifier.
Creating features#
Selecting our features and building a features dataframe#
Last time, we thought of some words or phrases that might make a complaint interesting or not interesting. We came up with this list:
- airbag
- air bag
- failed
- did not deploy
- violent
- explode
- shrapnel
These features are the things that the machine learning algorithm is going to look for when it's reading. There are lots of words in each complaint, but these are the only ones we'll tell the classifier to pay attention to!
To determine if a word is in CDESCR, we can use .str.contains. Because computers only like numbers, though, we need to use .astype(int) to change it from True/False to 1/0.
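Before we build the real thing, here's a quick sketch of that conversion on a made-up example (the text below is invented, not from the dataset):
# .str.contains gives True/False, and .astype(int) turns that into 1/0
sample = pd.Series(["THE AIRBAG EXPLODED VIOLENTLY", "BRAKES FAILED ON HIGHWAY"])
print(sample.str.contains("AIRBAG"))              # True, False
print(sample.str.contains("AIRBAG").astype(int))  # 1, 0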
# One 0/1 column per keyword: does the complaint text mention it or not?
train_df = pd.DataFrame({
    'is_suspicious': labeled.is_suspicious,
    'airbag': labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': labeled.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': labeled.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': labeled.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': labeled.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': labeled.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': labeled.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
train_df.head()
Let's see how big our dataset is, and then remove any rows that are missing data (not all of them are labeled).
train_df.shape
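We can also peek at where the missing values are before dropping anything; .isna().sum() counts them per column (only is_suspicious can be missing here, since the keyword columns were built with na=False):
# Count missing values in each column
train_df.isna().sum()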
train_df = train_df.dropna()
train_df.shape
Creating our classifier#
Any time you're building a classifier, running a regression, or doing most anything with machine learning, you're using a model. It models the relationship between the inputs and the outputs.
Classification with Logistic Regression#
Last time we used a classifier based on Logistic Regression. First we split into X (our features) and y (our labels), and trained the classifier on them.
from sklearn.linear_model import LogisticRegression
# X is every feature column, y is the label we're trying to predict
X = train_df.drop(columns='is_suspicious')
y = train_df.is_suspicious

# A very large C effectively turns off regularization
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)
After we built our classifier, we tested it and found it didn't work very well.
from sklearn.metrics import confusion_matrix
y_true = y
y_pred = clf.predict(X)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
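If you'd rather have summary numbers than a table, scikit-learn can also compute precision and recall from the same predictions (a quick aside, not something we'll rely on below):
from sklearn.metrics import classification_report

# Precision, recall and F1 for each class, from the same y_true/y_pred
print(classification_report(y_true, y_pred, target_names=['not suspicious', 'suspicious']))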
To understand a logistic regression classifier, we looked at the coefficients and the odds ratios.
import numpy as np
feature_names = X.columns
coefficients = clf.coef_[0]
pd.DataFrame({
    'feature': feature_names,
    'coefficient (log odds ratio)': coefficients,
    'odds ratio': np.exp(coefficients).round(4)
}).sort_values(by='odds ratio', ascending=False)
Classification with Decision Trees#
We can also use a classifier called a decision tree. All you need to do is add one new import and change the line where you create your classifier.
#from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
X = train_df.drop(columns='is_suspicious')
y = train_df.is_suspicious
#clf = LogisticRegression(C=1e9, solver='lbfgs')
clf = DecisionTreeClassifier()
clf.fit(X, y)
The confusion matrix code looks exactly the same.
from sklearn.metrics import confusion_matrix
y_true = y
y_pred = clf.predict(X)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
When using a decision tree, using the classifier is the same, but the code to understand the classifier is a bit different. Instead of coefficients, we're going to look at feature importance.
import eli5
label_names = ['not suspicious', 'suspicious']
feature_names = list(X.columns)
eli5.show_weights(clf,
                  feature_names=feature_names,
                  target_names=label_names,
                  show=['feature_importances', 'description'])
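Under the hood, eli5 is reading the classifier's feature_importances_ attribute, so you can also grab those numbers straight from scikit-learn if you'd like:
# The same importances, taken directly from the fitted classifier
pd.DataFrame({
    'feature': X.columns,
    'importance': clf.feature_importances_
}).sort_values(by='importance', ascending=False)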
The most fun part of using a decision tree is visualizing it.
from sklearn import tree
import graphviz
label_names = ['not suspicious', 'suspicious']
feature_names = X.columns
dot_data = tree.export_graphviz(clf,
                                feature_names=feature_names,
                                filled=True,
                                class_names=label_names)
graph = graphviz.Source(dot_data)
graph
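If you want to keep the chart for your story, graphviz can save it to disk as well (PNG is just one of the formats it supports):
# Save the tree diagram as tree.png (graphviz also writes the .dot source file)
graph.render("tree", format="png")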
You can also see the tree with eli5; I just suppressed it above because I thought we could use a little color.
feature_names = list(X.columns)
eli5.show_weights(clf,
                  feature_names=feature_names,
                  target_names=label_names)
And the best part is: almost everything you can do with a logistic regression classifier you can do with a decision tree. Most of the time you can just change your classifier to see if it does better.
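Both classifiers share the same .predict and .predict_proba methods, for example, so everything downstream of clf.fit stays the same:
# The same interface works for LogisticRegression and DecisionTreeClassifier
clf.predict(X.head())        # hard 0/1 predictions for the first five rows
clf.predict_proba(X.head())  # predicted probability for each class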
Decision trees also have a lot of simple options. One of them, max_depth, limits how many levels of questions the tree is allowed to ask.
from sklearn.tree import DecisionTreeClassifier
X = train_df.drop(columns='is_suspicious')
y = train_df.is_suspicious
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, y)
from sklearn import tree
import graphviz
label_names = ['not suspicious', 'suspicious']
feature_names = X.columns
dot_data = tree.export_graphviz(clf,
                                feature_names=feature_names,
                                filled=True,
                                class_names=label_names)
graph = graphviz.Source(dot_data)
graph
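max_depth isn't the only knob, either. Other scikit-learn options work the same way; for instance, min_samples_leaf stops the tree from building rules around just a handful of complaints (a quick sketch):
# Require every leaf of the tree to contain at least five complaints
clf = DecisionTreeClassifier(max_depth=2, min_samples_leaf=5)
clf.fit(X, y)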
A random forest is usually even better#
In this case, though, our inputs are terrible, so it's still not very good. Garbage in, garbage out.
We'll change our classifier to be clf = RandomForestClassifier(n_estimators=100) and it will use 100 decision trees to make a forest.
from sklearn.ensemble import RandomForestClassifier
X = train_df.drop(columns='is_suspicious')
y = train_df.is_suspicious
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)
from sklearn.metrics import confusion_matrix
y_true = y
y_pred = clf.predict(X)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
feature_names = list(X.columns)
eli5.show_weights(clf, feature_names=feature_names, show=eli5.formatters.fields.ALL)
Review#
In our previous two attempts to tackle the Takata airbag investigation, we used a logistic regression classifier. This time we tried a decision tree and a random forest, and the random forest performed slightly better (although it could have just been chance).
Despite this slight improvement, its predictions were still very far off.
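One way to check whether an improvement is real or just chance is cross-validation, which scores the classifier on complaints it wasn't trained on. A minimal sketch with scikit-learn (five folds is an arbitrary choice here):
from sklearn.model_selection import cross_val_score

# Average accuracy across five train/test splits, instead of scoring on training data
scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)
print(scores.mean())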
Discussion topics#
What's wrong here? Why does nothing work for us, even though we keep throwing more machine learning tools at it?