4.4 Evaluation metrics

When you’ve put together a machine learning algorithm, you need some way to test its performance. The measurement you use to do that is called an evaluation metric. It seems like it should be as easy as a teacher scoring a test, but it gets more complicated pretty quickly.

4.4.1 Accuracy

The most basic way we can judge its performance is to ask: what percent did you predict correctly? We’ll check by comparing the right answers to what the classifier predicted.

from sklearn.metrics import accuracy_score

# Compare the true answers to what the classifier predicted for the test set
accuracy_score(y_test, y_pred)
## 0.7401055408970977

Around 74 percent, not so bad!

Unfortunately, there’s a little issue with accuracy that makes it almost useless as a metric.

Important question: How often is a request denied, and how often is it accepted?

df.successful.value_counts()
## 0    6345
## 1    2749
## Name: successful, dtype: int64

6,345 denied requests, 2,749 requests fulfilled. If we take one more step and convert that to percentages, our crisis might be a little clearer.

df.successful.value_counts(normalize=True)
## 0    0.697713
## 1    0.302287
## Name: successful, dtype: float64

Yes, we have around 70% denied. So what? We didn’t think it was a bad split before - 30% success rate, not so tough.

Here’s the problem: if our classifier just guessed “it’s gonna get rejected” every single time, we’d be 70% accurate!

Even though we’d be throwing out every single successful request, it wouldn’t matter. If we’re using accuracy as our evaluation metric, it doesn’t matter which requests we get right and which we get wrong - all that counts is the total number of correct predictions.
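If you want to see that for yourself, here’s a quick sketch: build a fake set of predictions that’s nothing but denials (0 in our data) and score it with the same accuracy_score we used above.

# A fake "prediction" that says every single request will be denied
always_denied = [0] * len(y_test)

# Accuracy only counts matches, so this lazy guess still scores around 70%
accuracy_score(y_test, always_denied)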

4.4.2 Dummy classifier

Scikit-learn can do this kind of guessing for you, too, using a hilarious classifier called a DummyClassifier. It isn’t a real classifier - you just tell it to always guess the most popular thing! The code works just like a ‘normal’ classifier, which I find incredibly amusing.

from sklearn.dummy import DummyClassifier

# 'most_frequent' means it always predicts whichever class shows up most often in the training data
dummy = DummyClassifier(strategy='most_frequent', random_state=42)
dummy.fit(X_train, y_train)
## DummyClassifier(constant=None, random_state=42, strategy='most_frequent')

So let’s say we use the dummy classifier. How does it do just guessing the most popular thing, no machine learning in sight?

accuracy_score(y_test, dummy.predict(X_test))
## 0.689094107299912

Just like we predicted, right around 70%. Not feeling so good about our 74%-ish performance now, are we?

4.4.3 Confusion matrix

It turns out that what we’re interested in isn’t just “did you get it right?” What we’re interested in is looking at acceptances and denials separately and making sure we didn’t just throw everything into one bucket or the other.

To look at how we performed on different aspects of our classification task, we can use a confusion matrix. Let’s see how our k-nearest neighbors classifier performed.

from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = knn.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

# Label the rows and columns so we can tell what's actual and what's predicted
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
                 Predicted unsuccessful  Predicted successful
Is unsuccessful                    1509                    58
Is successful                       533                   174

A confusion matrix can put into context where we’re making our mistakes.

And here’s the confusion matrix for the dummy classifier.

y_true = y_test
y_pred = dummy.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
                 Predicted unsuccessful  Predicted successful
Is unsuccessful                    1567                     0
Is successful                       707                      0

Unlike the accuracy score, we have a clear distinction between the two! It’s easy to see that the dummy classifier doesn’t predict anything as successful, while the k-nearest neighbors classifier is a bit more mixed.

A confusion matrix is a great way to see how your classifier performs across your classes separately. Successful, unsuccessful, all broken out. It’s really really easy to see how different they are when viewed this way.
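Sometimes percentages are easier to compare than raw counts. Here’s a sketch that converts each row of the matrix variable from the code above into fractions of that class (at this point matrix holds the dummy classifier’s results, but the same trick works for the k-nearest neighbors one).

# Divide each row by the total number of requests actually in that class,
# turning the counts into "what fraction went where"
pd.DataFrame(matrix / matrix.sum(axis=1, keepdims=True),
     columns='Predicted ' + label_names,
     index='Is ' + label_names)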

Each one of those boxes has a name.

term             meaning                                  where
True Positive    is successful, predicted successful      bottom right
False Negative   is successful, predicted unsuccessful    bottom left
False Positive   is unsuccessful, predicted successful    top right
True Negative    is unsuccessful, predicted unsuccessful  top left
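If you’d rather have those four numbers as plain variables instead of reading them off a grid, scikit-learn’s .ravel() will unpack a two-class confusion matrix for you. A quick sketch, reusing the k-nearest neighbors predictions from above:

# For a two-class matrix, .ravel() flattens it in this order:
# true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_test, knn.predict(X_test)).ravel()

print('Actually unsuccessful, predicted unsuccessful (true negatives):', tn)
print('Actually unsuccessful, predicted successful (false positives):', fp)
print('Actually successful, predicted unsuccessful (false negatives):', fn)
print('Actually successful, predicted successful (true positives):', tp)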

If we’re being frank, though: I really dislike using those names. I can barely remember which is which, and I think using sterile names like that makes you forget what you’re actually working on.

For example, “minimize false negatives.” In normal words, you might say something like “be really sure someone’s request is going to be rejected before you tell them that, because if you’re wrong they might not submit a request that would have worked.” I think keeping the language related to the topic helps us understand what we’re really doing, and what impact we’re really having.

But either way, our k-nearest neighbors classifier seems to be working pretty well. Or at least, it does a bit better than always guessing ‘denied.’

4.4.4 Explainability

Hand-in-hand with performance is the idea of explainability. Why did our algorithm give the result it did? Imagine how uncomfortable it would be to deal with a person who could never explain the reasoning behind their decisions!

One problem with the K-Nearest Neighbors algorithm is that it’s kind of difficult to explain what’s going on, or why we received a certain result. In theory it’s easy - “we’re taking the 20 most similar FOIA requests and seeing whether they were fulfilled or not” - but it’s difficult to point out exactly which columns are the important ones, or to give feedback on how you might improve your FOIA request.
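The closest thing to an explanation you can squeeze out of k-nearest neighbors is to peek at which neighbors it used. Here’s a sketch, assuming X_train and X_test are pandas dataframes and knn is the fitted classifier from earlier:

# Ask the trained classifier which training-set requests it thinks
# are most similar to the first request in the test set
distances, neighbor_indices = knn.kneighbors(X_test[:1])

# Pull those rows out of the training data to see what "similar" looks like here
X_train.iloc[neighbor_indices[0]]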

Maybe it’s time to look at some alternatives? As we mentioned before, there are more classification algorithms than just K-Nearest Neighbors. A few examples are logistic regression classifiers, decision trees, and random forests.

While there’s plenty of math and plenty of tests for figuring out which kind of algorithm should suit your dataset best, at the end of the day the only thing that really matters is which one actually works best. To figure that out… you just try all of the different algorithms and compare the results!
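Here’s a rough sketch of what that comparison might look like, assuming the same X_train, X_test, y_train, and y_test splits we’ve been using. The settings are mostly defaults, and some of these models might want extra preprocessing, so treat it as a starting point rather than a recipe.

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# A few candidate classifiers, mostly with default settings
candidates = {
    'k-nearest neighbors': KNeighborsClassifier(),
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(),
    'random forest': RandomForestClassifier(),
}

# Train each one on the same data and score it on the same test set
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))

And since we just spent a whole section complaining about accuracy, you’d also want to look at each one’s confusion matrix the same way we did above, not just the single score.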