5.4 Testing our classifier

When we look at the results of our classifier, we know some of them are wrong - a complaint shouldn't be flagged as suspicious if it doesn't even mention airbags! But it would be nice to have an automated process to give us an idea of how well our classifier does.

We test a classifier just like our teachers test us in class: we'll show our classifier rows we know the answer to, and see if it gets them right. The problem is we can't test it on our unlabeled data, because we don't know which predictions are right and which are wrong. Instead, we have to test on the labeled data, where we do know the answers.

One technique is to have our classifier predict labels for our training data, then compare those predictions to the actual labels (suspicious, not suspicious).

# Predict labels for our training data,
# then compare the predictions to the actual labels
clf.score(X, y)
## 0.9212121212121213

Incredible, 92% accuracy! …that’s good, right? Well, not really. There are two major reasons why this isn’t as impressive as it sounds:

  • We’re testing it on data it’s already seen
  • The vast majority of our samples are not suspicious

5.4.1 Test-train split

The biggest problem with our classifier is that we’re testing it on data it’s already seen. While it’s cool to have a study sheet for a quiz, it doesn’t quite seem fair if the study sheet is exactly the same as the test.

Instead, we should try to reproduce what the real world is like - training it on one set of data, and testing it on similar data… but similar data we already know the labels for! It’s like how a teacher gives sample quizzes that are similar to - but not the same as - the real one.

To make this happen we use something called train/test split, where instead of using the entire dataset for training, we only use most of it - maybe 75% or so. The code below automatically splits the dataset into two groups: a larger one for training and a smaller one for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

That way when we give the model a test, it hasn’t seen the answers already!
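One more step before grading: the classifier should be trained on only the training split, otherwise we’re right back to handing it the answer sheet. A minimal sketch, assuming clf is the same scikit-learn classifier we built earlier in the chapter:

# Re-train the classifier using only the training split,
# so the test rows stay unseen until scoring time
clf.fit(X_train, y_train)

Keep in mind that train_test_split shuffles the rows randomly, so your exact score below may differ a little from run to run.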

clf.score(X_test, y_test)
## 0.9047619047619048

Not bad, not bad. There are other ways to improve this further, but for now we have a larger problem to tackle.

5.4.2 The confusion matrix

Our accuracy is looking great, hovering somewhere in the 90s. Feeling good, right? Unfortunately, things aren’t actually that rosy.

Let’s take a look at how many suspicious and how many non-suspicious complaints we have:

labeled_df.is_suspicious.value_counts()
## 0.0    150
## 1.0     15
## Name: is_suspicious, dtype: int64

We have a lot more non-suspicious complaints than suspicious ones, right? Let’s say we were classifying, and we always guessed “not suspicious”. Since there are so few suspicious ones, we wouldn’t get very many wrong, and our accuracy would be really high!

If we had 99 non-suspicious complaints and 1 suspicious one, always guessing “not suspicious” would give us 99% accuracy.
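To see how hollow that accuracy number can be, scikit-learn even ships a DummyClassifier that always guesses the most common label. A quick sketch for illustration - this isn’t part of our actual analysis, just a baseline to compare against:

from sklearn.dummy import DummyClassifier

# A "classifier" that ignores the features and always predicts
# the most common label it saw in the training data
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline.score(X_test, y_test)

With 150 not-suspicious complaints and only 15 suspicious ones, that do-nothing baseline already scores around 91%, which is why accuracy alone doesn’t tell us much here.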

Even though our accuracy would look great, the result would be super boring. Since none of our complaints would have been marked as suspicious, we wouldn’t have anything to read or research. It’d be much nicer if we could measure how well we do on each category separately.

And hey, that’s easy! We use this thing called a confusion matrix. It looks like this:

from sklearn.metrics import confusion_matrix

y_true = y
y_pred = clf.predict(X)

confusion_matrix(y_true, y_pred)
## array([[150,   0],
##        [ 13,   2]])

…which is pretty terrible-looking, right? It’s hard as heck to understand! Let’s try to spice it up a little bit and make it a little nicer to read:

from sklearn.metrics import confusion_matrix

# Save the true label, but also save the predicted label
y_true = y
y_pred = clf.predict(X)
# We could also use just the test dataset
# y_true = y_test
# y_pred = clf.predict(X_test)

# Build the confusion matrix
matrix = confusion_matrix(y_true, y_pred)

# But then make it look nice
label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
##                    Predicted not suspicious  Predicted suspicious
## Is not suspicious                       150                     0
## Is suspicious                            13                     2

So now we can see what’s going on a little bit better. According to the confusion matrix, looking at the full labeled dataset:

  • We correctly predicted 150 of the 150 not-suspicious complaints
  • We only correctly predicted 2 of the 15 suspicious ones.

Not nearly as good as we’d hoped.
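If you’d rather see those per-category numbers as rates instead of raw counts, you can pull them straight out of the matrix variable we built above - a small sketch, just dividing each row’s correct predictions by that row’s total:

# How often each true category is predicted correctly:
# the diagonal holds the correct predictions, and each row
# sums to the total number of complaints in that category
matrix.diagonal() / matrix.sum(axis=1)
## array([1.        , 0.13333333])

Catching only about 13% of the suspicious complaints is the real story that the 90-something-percent accuracy number was hiding.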