Finding faulty airbag complaints using a very simple keyword search with logistic regression#
The story:
- https://www.nytimes.com/2014/09/12/business/air-bag-flaw-long-known-led-to-recalls.html
- https://www.nytimes.com/2014/11/07/business/airbag-maker-takata-is-said-to-have-conducted-secret-tests.html
- https://www.nytimes.com/interactive/2015/06/22/business/international/takata-airbag-recall-list.html
- https://www.nytimes.com/2016/08/27/business/takata-airbag-recall-crisis.html
This story, done by The New York Times, investigates the content of complaints made to the National Highway Traffic Safety Administration (NHTSA) by customers who had bad experiences with Takata airbags in their cars. Eventually, car companies had to recall airbags made by the airbag supplier that promised a cheaper alternative.
Author: Daeil Kim did a more complex version of this particular analysis - presentation here
Topics: Logistic Classifier
Datasets
- FLAT_CMPL.txt: Vehicle-related complaints from 1995-current from the National Highway Traffic Safety Administration
- CMPL.txt: data dictionary for the above
- sampled-unlabeled.csv: a sample of vehicle complaints, not labeled
- sampled-labeled.csv: a sample of vehicle complaints, labeled as suspicious or not
What's the goal?#
It's too much work to read twenty years of vehicle comments to find the ones related to dangerous airbags! Because we're lazy, we want the computer to do this for us. We're going to read a subset, mark each one as "suspicious" or "not suspicious," then use that information to train the computer to read the rest and recognize which comments are suspicious and which are not suspicious.
This is a classification problem, because we want the computer to recognize which ones are suspicious and which are not.
import pandas as pd
# Allow us to display 100 columns at a time, and 100 characters in each column (instead of ...)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)
Read in our data#
The dataset in FLAT_CMPL.txt doesn't have column headers, so we're going to use this long long list of headers that we stole from CMPL.txt to read it in.
It's kind of a complicated dataset with a few errors here or there, so we're passing in a lot of options to pd.read_csv. In the end it's just a big big dataframe, though.
column_names = ['CMPLID', 'ODINO', 'MFR_NAME', 'MAKETXT', 'MODELTXT',
'YEARTXT', 'CRASH', 'FAILDATE', 'FIRE', 'INJURED',
'DEATHS', 'COMPDESC', 'CITY', 'STATE', 'VIN', 'DATEA',
'LDATE', 'MILES', 'OCCURENCES', 'CDESCR', 'CMPL_TYPE',
'POLICE_RPT_YN', 'PURCH_DT', 'ORIG_OWNER_YN', 'ANTI_BRAKES_YN',
'CRUISE_CONT_YN', 'NUM_CYLS', 'DRIVE_TRAIN', 'FUEL_SYS', 'FUEL_TYPE',
'TRANS_TYPE', 'VEH_SPEED', 'DOT', 'TIRE_SIZE', 'LOC_OF_TIRE',
'TIRE_FAIL_TYPE', 'ORIG_EQUIP_YN', 'MANUF_DT', 'SEAT_TYPE',
'RESTRAINT_TYPE', 'DEALER_NAME', 'DEALER_TEL', 'DEALER_CITY',
'DEALER_STATE', 'DEALER_ZIP', 'PROD_TYPE', 'REPAIRED_YN',
'MEDICAL_ATTN', 'VEHICLES_TOWED_YN']
df = pd.read_csv("data/FLAT_CMPL.txt",
sep='\t',
dtype='str',
header=None,
error_bad_lines=False,
encoding='latin-1',
names=column_names)
# We're only interested in pre-2015
df = df[df.DATEA < '2015']
df.head()
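A quick side note on that filter: it works because DATEA (the date the complaint was added) appears to be stored as a YYYYMMDD string like "20141215", so comparing it to the string '2015' keeps everything from 2014 and earlier. Here's a minimal sketch of an equivalent filter using real dates, assuming that format - you don't need to run it, it's just to show what the string comparison is doing.
# A sketch, assuming DATEA is a YYYYMMDD string like "20141215"
# (the string comparison above does the same thing)
dates = pd.to_datetime(df.DATEA, format='%Y%m%d', errors='coerce')
pre_2015 = df[dates.dt.year < 2015]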
How many rows and columns are in this dataset?
df.shape
But wait, we don't even need that yet#
Oof, that's a lot of columns!
When you're dealing with machine learning, one of the first things you'll need to think about is which columns are important to you. An important thing about this dataset is that it doesn't include whether the complaint is about faulty airbags or not.
We can't teach our classifier what a suspicious comment looks like if we don't have a list of suspicious complaints, right? Luckily, we have another dataset of labeled complaints!
Read in sampled-labeled.csv
labeled = pd.read_csv("data/sampled-labeled.csv")
labeled.head()
We're going to use this dataset to train our classifier about what a suspicious complaint looks like. Once our classifier is trained we'll be able to use it to predict whether each complaint in that original (big big big) dataset is suspicious or not.
We made this dataset through hard work, reading comments, and marking them as 0 (not suspicious) or 1 (suspicious). For example, this complaint isn't suspicious because it's about an air bag not deploying:
DURING AN ACCIDENT AIR BAG'S DID NOT DEPLOY. DEALER HAS BEEN CONTACTED. *AK
This next one isn’t suspicious either, because it isn’t even about airbags!
DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT
But if a complaint involves explosions or shrapnel, it's probably worth marking as suspicious:
I WAS DRIVEN IN A SCHOOL ZONE STREET AND THE LIGHTS OF AIRBAG ON AND APROX. 2 MINUTES THE AIR BAGS EXPLODED IN MY FACE, THE DRIVE AND PASSENGERS SIDE, THEN I STOPPED THE JEEP, IT SMELL LIKE SOMETHING IS BURNING AND HOT, I DID NOT SEE FIRE. *TR
So we went down the file in Excel, one by one, reading comments, marking them as 0 or 1.
How many are in each category?
labeled.is_suspicious.value_counts()
150 non-suspicious and 15 suspicious is a pretty terrible ratio, but we're remarkably lazy and not very many of the comments are actually suspicious.
Now that we've read a few, let's train our classifier
Creating features#
When you're working on machine learning, you need to feed the algorithm a bunch of inputs so it can make its decision. These are called features.
There's a problem: computers only like features to be numbers, but every complaint is just a bunch of text, a.k.a. "unstructured data." How can we turn all of this unstructured data into something a computer can understand?
While there are fancier (and more effective!) ways to do what we're about to do, the simple start below is going to provide a foundation for later work.
To teach our computer how to find suspicious complaints, we first need to think about how we find those complaints as human beings. By reading, right? So let's teach the computer how to read, and what to look for.
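Before we build the real thing with pandas, here's the general idea in plain Python. This is just a sketch to show the concept: for each complaint, we check whether each keyword shows up, and record a 1 or a 0.
# A plain-Python sketch of the idea (the real version uses pandas, below)
keywords = ['AIRBAG', 'EXPLODE', 'SHRAPNEL']
complaint = "THE AIRBAG EXPLODED AND SPRAYED METAL SHARDS"

# 1 if the keyword shows up in the complaint, 0 if it doesn't
features = {word: int(word in complaint) for word in keywords}
print(features)
# {'AIRBAG': 1, 'EXPLODE': 1, 'SHRAPNEL': 0}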
Designing our features#
Let's take a look at what the airbag issue is, according to Consumer Reports:
Vehicles made by 19 different automakers have been recalled to replace frontal airbags on the driver’s side or passenger’s side, or both in what NHTSA has called "the largest and most complex safety recall in U.S. history." The airbags, made by major parts supplier Takata, were mostly installed in cars from model year 2002 through 2015. Some of those airbags could deploy explosively, injuring or even killing car occupants.
At the heart of the problem is the airbag’s inflator, a metal cartridge loaded with propellant wafers, which in some cases has ignited with explosive force. If the inflator housing ruptures in a crash, metal shards from the airbag can be sprayed throughout the passenger cabin—a potentially disastrous outcome from a supposedly life-saving device.
If we're going through a list of vehicle complaints, it isn't too hard for us to figure out which complaints we might want to investigate further. If the complaint's about seatbelts or rear-view mirrors, we probably don't care about it. If the word "airbag" shows up in the description, though, we're going to start paying attention.
We aren't interested in all complaints with the word "airbag," though. Since we're worried about exploding airbags, something like "the airbag did not deploy" would get our attention because of the word "airbag," but then we could ignore it once we saw the airbag just didn't work.
Selecting our features#
Since we just read a long long list of airbag complaints, we can probably brainstorm some words or phrases that might make a comment interesting or not interesting. A quick start might be these few:
- airbag
- air bag
- failed
- did not deploy
- violent
- explode
- shrapnel
These features are the things that the machine learning algorithm is going to look for when it's reading. There are lots of words in each complaint, but these are the only ones we'll tell the classifier to pay attention to!
Building our features dataframe#
Now we're going to convert each sentence into a list of numbers. It will be a new dataframe, where there's a 1 if the word is in the complaint and a 0 if it isn't.
To determine if a word is in CDESCR, we can use .str.contains.
See if each row has the word AIRBAG in it.
labeled.CDESCR.str.contains("AIRBAG", na=False)
Computers can't use True and False, though; we need numbers. We'll need to use .astype(int) to turn them into integers, with 0 for False and 1 for True.
Give me a 1 for every row that contains "AIRBAG" and a 0 for every row that does not.
labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int)
How many 0 values and how many 1 values do we have?
labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int).value_counts()
Okay, so about 200 don't have AIRBAG mentioned and about 150 do. That's a decent balance, I guess!
Now we need to make a new dataframe with a row for each complaint. Each word will have a column, and we'll have a 0 or 1 as to whether the word is in there or not.
- airbag
- air bag
- failed
- did not deploy
- violent
- explode
- shrapnel
Along with the words, we'll also save the is_suspicious label to keep everything in the same place.
I've started the dataset with the label and the word airbag; you'll need to add in the rest of them.
train_df = pd.DataFrame({
'is_suspicious': labeled.is_suspicious,
'airbag': labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int),
'air bag': labeled.CDESCR.str.contains("AIR BAG", na=False).astype(int),
'failed': labeled.CDESCR.str.contains("FAILED", na=False).astype(int),
'did not deploy': labeled.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
'violent': labeled.CDESCR.str.contains("VIOLENT", na=False).astype(int),
'explode': labeled.CDESCR.str.contains("EXPLODE", na=False).astype(int),
'shrapnel': labeled.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
train_df.head()
Check how many rows and columns your dataframe has. You'll want to make sure it has 8 columns, and they should all be numbers.
train_df.shape
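If typing out all of those .str.contains lines felt repetitive, here's a sketch of a more compact way to build the same dataframe with a loop. It uses the same words and produces the same columns, just with is_suspicious tacked on at the end instead of the start.
# A sketch: build the same feature dataframe with a loop instead of copy-paste
words = ['AIRBAG', 'AIR BAG', 'FAILED', 'DID NOT DEPLOY', 'VIOLENT', 'EXPLODE', 'SHRAPNEL']

train_df = pd.DataFrame({
    word.lower(): labeled.CDESCR.str.contains(word, na=False).astype(int)
    for word in words
})
# Add the label column at the end
train_df['is_suspicious'] = labeled.is_suspicious
train_df.head()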
Classification#
The kind of problem we're dealing with here is called a classification problem. That's because we have two different classes of complaints:
- Complaints that are suspicious
- Complaints that are not suspicious
And the machine's job is to classify new complaints in one of those two categories. Before we put it on the job, though, we need to train it.
Before we start with that, though, let's see how many suspicious and non-suspicious comments are in our training set.
train_df.is_suspicious.value_counts()
Wait a second, I thought we had 350 rows? Where are the rest?
- Tip: Try adding dropna=False to your .value_counts().
train_df.is_suspicious.value_counts(dropna=False)
Yep, it looks like we're missing a LOT of labels. Classifiers hate missing data - both missing labels and missing features - so we might as well remove any row that's missing any data.
- Tip: If you use .dropna(), it will drop any rows that have NaN in them.
train_df = train_df.dropna()
After dropping the missing rows, double-check that your dataframe is the size you expect.
train_df.shape
Creating our classifier#
Just like with linear regression, we call our classifier a model. It models the relationship between the inputs and the outputs.
The classifier we're using is a special one that uses logistic regression under the hood, but that doesn't matter very much right now. Just know that it's a classifier!
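If you're curious what "logistic regression under the hood" means: the classifier computes a weighted sum of the features (just like linear regression does) and then squashes that score through a sigmoid so it comes out between 0 and 1, which gets read as a probability. Here's a rough sketch with made-up numbers, not something you need to run.
import numpy as np

# A rough sketch of what logistic regression does under the hood
def sigmoid(score):
    # Squash any number into the 0-1 range
    return 1 / (1 + np.exp(-score))

# Made-up coefficients for two features, plus an intercept (for illustration only)
coef_airbag, coef_explode, intercept = 1.5, 2.0, -3.0

# A complaint that mentions both "airbag" and "explode"
score = intercept + coef_airbag * 1 + coef_explode * 1
print(sigmoid(score))  # about 0.62, which is over 50%, so it would be called suspicious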
Separating our features and labels#
We need to feed our classifier two things:
- The features
- The labels
Take a look at the first five rows of train_df.
train_df.head()
is_suspicious is our label, and all of the other columns are our features. We'll call the label y and the features X, because that's what everyone else does.
The typical way of doing it is below (many people might use axis=1 instead of columns=, but I like how explicit columns= is!)
# Note that .drop doesn't remove the column from train_df permanently, it just hands back a copy without it, which we save into `X`
X = train_df.drop(columns=['is_suspicious'])
y = train_df.is_suspicious
Take a look at X and y to make sure they look like a list of features and a list of labels. You can use .head() on both of them, no problem.
X.head()
y.head()
Building our classifier#
Once we have our features and our labels, we can create a classifier.
I'm actually going to move the X= and y= down into this section because it's nice to keep it all in one cell.
from sklearn.linear_model import LogisticRegression
# Every column EXCEPT whether it's suspicious
X = train_df.drop(columns='is_suspicious')
# label is suspicious 0/1
y = train_df.is_suspicious
# Build a new classifier
# C=1e9 is a magic secret I don't want to talk about
# If we don't say solver='lbfgs' it complains that it's the new default
clf = LogisticRegression(C=1e9, solver='lbfgs')
# Teach the classifier about the complaints we read
clf.fit(X, y)
Okay, that... seems to have done nothing.
When we do linear regression, it prints out a bunch of stuff for us. It's nice! When we train a classifier, nothing gets printed out; it's up to us to go ask the classifier questions.
Interpreting our classifier#
Feature importance#
So the classifier did some reading. Hooray! We gave it all sorts of columns (each was a different word)... which columns did it think were important?
# The words we were looking for:
# X was our features, so X.columns is the column names
feature_names = X.columns
# Coefficients! Remember this from linear regression?
coefficients = clf.coef_[0]
pd.DataFrame({
'feature': feature_names,
'coefficient': coefficients
}).sort_values(by='coefficient', ascending=False)
A higher number for a coefficient means "this word makes me think it's suspicious, a.k.a. 1" and a lower number means "this word makes me think it was not suspicious, a.k.a. 0."
Is there anything you found surprising about these results? Why do you think that might have happened?
Predicting with our classifier#
The point of a classifier is to classify documents it hasn't seen before, to read them and put them into the appropriate category. Before we can do this, we need to extract features from our original dataframe, the one that doesn't have labels.
We'll do this the same way we did with our set of labeled data. Build a new dataframe that asks whether each complaint has the appropriate word:
- airbag
- air bag
- failed
- did not deploy
- violent
- explode
- shrapnel
I've started you off with one check for the word airbag.
features = pd.DataFrame({
'airbag': df.CDESCR.str.contains("AIRBAG", na=False).astype(int),
'air bag': df.CDESCR.str.contains("AIR BAG", na=False).astype(int),
'failed': df.CDESCR.str.contains("FAILED", na=False).astype(int),
'did not deploy': df.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
'violent': df.CDESCR.str.contains("VIOLENT", na=False).astype(int),
'explode': df.CDESCR.str.contains("EXPLODE", na=False).astype(int),
'shrapnel': df.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
features.head()
features.sum()
This dataframe should have 7 columns, none of which are is_suspicious. It's unlabeled, remember? We aren't sure whether they're suspicious complaints or not.
Confirm that real quick.
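One quick way to confirm it, just checking the shape (the second number should be 7):
features.shape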
Now we can add a new column, the classifier's guess about whether it's suspicious or not. To make the classifier guess, we use .predict. We just feed our features to the classifier and there we go!
clf.predict(features)
Let's make a copy of features and give it a new column called predicted. That way if we need to use features again we won't have messed it up by adding new columns.
features_with_prediction = features.copy()
features_with_prediction['predicted'] = clf.predict(features)
Let's look at the first five.
features_with_prediction.head()
Pretty boring, right? No words in there, all predicted as 0, not fun at all. Let's try filtering to see the first ten where the prediction was 1.
features_with_prediction[features_with_prediction.predicted == 1].head(10)
We can see most of the ones marked as suspicious include the words "airbag" and "violent," and none of them include "failed" or "did not deploy." That all makes sense, but what about all of the ones that include the word "violent" but not "airbag" or "air bag?" None of those should be good!
While we could just filter it to only include ones with the word "airbag" in it, we probably need a way to test the quality of our classifier.
Testing our classifier#
When we look at the results of our classifier, we know some of them are wrong - complaints shouldn't be suspicious if they don't have airbags in them! But it would be nice to have an automated process to give us an idea of how well our classifier does.
The problem is we can't test our classifier on this unlabeled data, because we don't know which answers would be right and which would be wrong. Instead, we have to test on the labeled data we trained our classifier on.
One technique is to have our classifier predict labels for our training data, then compare those predictions to the actual labels.
# Look at our training data, predict the labels,
# then compare the labels to the actual labels
clf.score(X, y)
Incredible, over 90% accuracy! ...that's good, right? Well, not really. There are two major reasons why this isn't impressive!
Test-train split#
One big problem with our classifier is that we're testing it on data it's already seen. While it's cool to have a study sheet for a test, it doesn't quite seem fair if the study sheet is exactly the same as the test.
Instead, we should try to reproduce what the real world is like - training it on one set of data, and testing it on similar data... but similar data we already know the labels for!
To make this happen we use something called train/test split, where instead of using the entire dataset for training, we only use most of it - the default is 75% for training and 25% for testing. The code below automatically splits the dataset into two groups, one for training and a smaller one for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
To try to understand what's going on, take a look at X_train, X_test, y_train and y_test, along with their sizes.
X_train.head()
X_test.head()
X_train.shape
X_test.shape
y_train.head()
y_test.head()
y_train.shape
y_test.shape
Both the X_ and the y_ variables look just about exactly the same; the only difference is that _train contains a lot more rows than _test, and there are no repeats between the two.
Now when we give the model a test, it hasn't seen the answers already!
- Use clf.fit to train on the training sample
- Use clf.score to score on the testing sample
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
This part is fun, because there's a chance it will get even better! Weird, right? We'll talk about why that might have happened a little later.
There are other ways to improve this further, but for now we have a larger problem to tackle.
The confusion matrix#
Our accuracy is looking great, hovering somewhere in the 90's. Feeling good, right? Unfortunately, things aren't actually that rosy.
Let's take a look at how many suspicious and how many non-suspicious ones we have in our labeled dataset (for the millionth time, yes)
labeled.is_suspicious.value_counts()
We have a lot more non-suspicious ones as compared to suspicious, right? Let's say we were classifying, and we always guessed "not suspicious". Since there are so few suspicious ones, we wouldn't get very many wrong, and our accuracy would be really high!
If we have 99 non-suspicious and 1 suspicious, if we always guess "non-suspicious" we'd have 99% accuracy.
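If you want to see that "always guess the most common label" idea in actual code, scikit-learn has a DummyClassifier that does exactly that. This is just a sketch to make the point, not part of our real pipeline.
from sklearn.dummy import DummyClassifier

# A "classifier" that always predicts the most common label (not suspicious)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)

# It never flags a single suspicious complaint, but the accuracy still looks fine
baseline.score(X_test, y_test)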
Even though our accuracy would look great, the result would be super boring. Since zero of our complaints would have been marked as suspicious, we wouldn't have anything to read or research. It'd be much nicer if we could identify the difference between getting one category right compared to the other.
And hey, that's easy! We use this thing called a confusion matrix. It looks like this:
from sklearn.metrics import confusion_matrix
y_true = y
y_pred = clf.predict(X)
confusion_matrix(y_true, y_pred)
...which is pretty terrible-looking, right? It's hard as heck to understand! Let's try to spice it up a little bit and make it a little nicer to read:
from sklearn.metrics import confusion_matrix
# Save the true label, but also save the predicted label
y_true = y
y_pred = clf.predict(X)
# We could also use just the test dataset
# y_true = y_test
# y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
# But then make it look nice
label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names)
So now we can see what's going on a little bit better. According to the confusion matrix, when using our original dataset (your numbers might be a little different):
- We correctly predicted 149 of 150 not-suspicious
- We only correctly predicted 2 of 15 suspicious ones.
Even though that gives us a really high score, it's pretty useless.
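If you want to turn those counts into a per-category number, you can do the arithmetic straight off the confusion matrix. Here's a quick sketch using the matrix variable we just built (row 0 is "not suspicious", row 1 is "suspicious").
# A sketch of per-category accuracy, computed straight from the confusion matrix
not_suspicious_right = matrix[0][0] / matrix[0].sum()
suspicious_right = matrix[1][1] / matrix[1].sum()

print("Got", round(not_suspicious_right * 100), "% of not-suspicious complaints right")
print("Got", round(suspicious_right * 100), "% of suspicious complaints right")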
Thinking about what your outputs mean#
While we could spend a lot of time working on the math behind all of this and the technical ins and outs, I think a more useful thing for journalists to do - when analyzing both their own algorithms and other people's - is to think about what incorrect outputs mean.
In this case, we're trying to predict whether we should investigate a given complaint. That basically means the computer takes a look and says, "hey human being, you should go look at this one."
As a result, every complaint that's incorrectly flagged as suspicious just means a little extra reading for a human, but every suspicious complaint that goes unflagged means we'll never think to look at that complaint.
Do you think it's better to incorrectly flag non-suspicious complaints as suspicious, or to incorrectly flag suspicious complaints as non-suspicious?
What are the upsides/downsides of each, and which side is more important to you?
Classifier Probability#
When we use clf.predict, we only get a 0 or a 1. That's kind of a fakeout, though, as under the hood there's actually something a little more complicated going on. Since we only have two categories, each row is given a score between 0-100% as to whether it belongs to a category. If it's over 50% it goes into that category!
We can see this with clf.predict_proba.
X_with_predictions = X.copy()
X_with_predictions['predicted'] = clf.predict(X)
# [:,1] is the probability it belongs in the '1' category
X_with_predictions['probability'] = clf.predict_proba(X)[:,1]
X_with_predictions.head()
Now we can be a little more discriminating: instead of just seeing the final classification (above or below 50%), we can see exactly how confident the classifier was when it assigned each complaint to one category or another. Try sorting by probability and showing the top 20, putting the higher probabilities at the top.
X_with_predictions.sort_values(by='probability', ascending=False).head(20)
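Since we have the raw probabilities, we also don't have to live with the 50% cutoff. Here's a sketch of lowering the bar so more complaints get flagged for a human to read - the 0.3 threshold is just a made-up example, not a magic number.
# A sketch: flag anything with at least a 30% chance of being suspicious.
# The 0.3 cutoff is arbitrary - pick whatever tradeoff you're comfortable with.
maybe_suspicious = X_with_predictions[X_with_predictions.probability >= 0.3]
maybe_suspicious.sort_values(by='probability', ascending=False).head(10)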
Let's improve our model#
Right now our model isn't very good. It doesn't seem to require the word "airbag" to be in it (maybe because we count "airbag" and "air bag" as separate words?) and doesn't include that many features. Can you think of ways to improve our model, and maybe try a few out?
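One concrete thing to try, hinted at in that parenthetical: treat "airbag" and "air bag" as one feature instead of two. Here's a sketch of what that could look like - mentions_airbag is just a made-up column name, and the regex matches either spelling.
# A sketch: one combined feature that matches either "AIRBAG" or "AIR BAG".
# "mentions_airbag" is a made-up name; the regex "AIR ?BAG" allows an optional space.
train_df['mentions_airbag'] = labeled.CDESCR.str.contains("AIR ?BAG", na=False, regex=True).astype(int)
# (you'd then need to rebuild X and y and re-fit the classifier to actually use it)
train_df.head()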
Imports#
We'll just do this all over again.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
pd.set_option("display.max_colwidth", 500)
Read in our labeled data#
Right now we're only dropping the ones that have missing labels. Why do we have so many missing labels? Are there other options for ones we could include/not include?
# Read in our data, drop those that are missing labels
labeled = pd.read_csv("data/sampled-labeled.csv")
labeled = labeled.dropna()
labeled.shape
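One alternative, if you only want to drop rows where the label itself is missing (and keep rows that have a NaN in some other column): .dropna() takes a subset argument. A sketch:
# A sketch: only drop rows where is_suspicious itself is missing,
# instead of dropping rows with a NaN anywhere
labeled = pd.read_csv("data/sampled-labeled.csv")
labeled = labeled.dropna(subset=['is_suspicious'])
labeled.shape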
Create our X and y#
Are there other words you might look for? Any words you might remove?
train_df = pd.DataFrame({
'is_suspicious': labeled.is_suspicious,
'airbag': labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int),
'air bag': labeled.CDESCR.str.contains("AIR BAG", na=False).astype(int),
'failed': labeled.CDESCR.str.contains("FAILED", na=False).astype(int),
'did not deploy': labeled.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
'violent': labeled.CDESCR.str.contains("VIOLENT", na=False).astype(int),
'explode': labeled.CDESCR.str.contains("EXPLODE", na=False).astype(int),
'shrapnel': labeled.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
train_df.head()
Split into train and test#
Does giving the model more (or less) to train with change anything?
X = train_df.drop(columns='is_suspicious')
y = train_df.is_suspicious
# With test_size=0.2, we'll train on 80% and test on 20%
# random_state=42 means it isn't actually random, it will always give you the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Create and train our classifier#
You... don't know any other classifiers. But hey, you could always look some up, I guess!
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X_train, y_train)
Check the important words#
Are the selected words pushing your results in the direction you think they should?
feature_names = X_train.columns
# Coefficients! Remember this from linear regression?
coefficients = clf.coef_[0]
pd.DataFrame({
'feature': feature_names,
'coefficient': coefficients
}).sort_values(by='coefficient', ascending=False)
Test our classifier#
We'll do a simple .score (which we know isn't very useful) along with a confusion matrix (which is harder to understand, but more useful). How do we feel about the results according to both?
Normally I'd only use the confusion matrix on X_test/y_test, but we do such a bad job that I feel like we should look at it all.
clf.score(X_test, y_test)
y_true = y
y_pred = clf.predict(X)
# y_true = y_test
# y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names)
If you remove the random_state=42 and keep running this and running this, it's going to be a little different each time.
Examining the results#
train_df_with_predictions = train_df.copy()
train_df_with_predictions['predicted'] = clf.predict(train_df.drop(columns='is_suspicious'))
train_df_with_predictions['predicted_prob'] = clf.predict_proba(train_df.drop(columns='is_suspicious'))[:,1]
train_df_with_predictions['sentence'] = labeled.CDESCR
train_df_with_predictions.sort_values(by='predicted_prob', ascending=False).head(10)
How are we going to fix this?#
Even if you can't successfully make your classifier perform any better, try to think about what you feel like could make it better.
Review#
We have far too many complaints from the National Highway Traffic Safety Administration to read ourselves, so we're hoping to convince a computer to mark the ones we'll be interested in. We hand-labeled a random sample as suspicious or not and used this smaller dataset as a source of training material for our machine learning algorithm.
We picked a few words we thought might be indicative of malfunctioning air bags, and added new columns to our dataset as to whether each complaint has the word or not. We then trained our algorithm, where it learned how these features are related to the label of suspicious or not. In this case we used a logistic regression classifier.
It did not do a very good job, and we thought about reasons why.
Discussion topics#
We have a few options: try to flag more examples, try a different machine learning algorithm, or just give up and have someone read all of the complaints manually.
Do you think our algorithm might perform better if we had a better split between suspicious and non-suspicious complaints?
What do you think takes longer: learning to use machine learning, or searching for "airbag" and manually marking complaints as suspicious or not?