Trying out different classifiers#
When the Los Angeles Times was using machine learning to detect serious assaults that the LAPD had downgraded into simple assaults, they used a combination of two different classification algorithms to find suspicious reports. In the spirit of completeness, let's take a look at how several classifiers perform in the task.
Imports and setup#
First we'll set up a few options to make everything display correctly. It's mostly because these assault descriptions can be quite long, and the default is to truncate text after a few words.
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 300)
%matplotlib inline
Repeat our analysis#
First we'll repeat the majority of our processing and analysis from the first notebook, then we'll get into the critique.
Read in our data#
Our dataset is going to be a database of crimes committed between 2008 and 2012. It will start off with two columns:
- `CCDESC`, what criminal code was violated
- `DO_NARRATIVE`, a short text description of what happened
We're going to use this description to see if we can separate serious cases of assault from non-serious ones.
We won't spend much time explaining the vectorizing and classifier-building steps in this notebook - that's covered in the first notebook. Instead, we're going to focus on how several different classifiers compare, and on the possible shortcomings of our analysis, both conceptually and technically.
# Read in our dataset
df = pd.read_csv("data/2008-2012.csv")
# Only use reports classified as types of assault
df = df[df.CCDESC.str.contains("ASSAULT")].copy()
# Classify as serious or non-serious
df['serious'] = df.CCDESC.str.contains("AGGRAVATED") | df.CCDESC.str.contains("DEADLY")
df['serious'] = df['serious'].astype(int)
# Downgrade 15% from aggravated to simple assault
serious_subset = df[df.serious == 1].sample(frac=0.15)
df['downgraded'] = 0
df.loc[serious_subset.index, 'downgraded'] = 1
df.loc[serious_subset.index, 'serious'] = 0
# Take a sample of 50,000
df = df.sample(n=50000)
# Examine the first few
df.head()
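Before vectorizing, it can't hurt to peek at how unbalanced the labels are after the simulated downgrading. This quick check isn't part of the original walkthrough - it only uses the serious and downgraded columns we created above - but it's useful context for the evaluation discussion later.
# Optional sanity check: how unbalanced are our labels after downgrading?
print(df.serious.value_counts(normalize=True))
print(df.downgraded.value_counts(normalize=True))
# Every downgraded report should now be sitting in serious == 0
pd.crosstab(df.serious, df.downgraded)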
Vectorize#
%%time
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
stemmer = SnowballStemmer('english')
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Run the standard TF-IDF analyzer, then stem each word it produces
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (stemmer.stem(word) for word in analyzer(doc))
vectorizer = StemmedTfidfVectorizer(min_df=15, max_df=0.5, max_features=1000)
X = vectorizer.fit_transform(df.DO_NARRATIVE)
words_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
words_df.head(5)
words_df.shape
Classify#
Previously we stuck to using a LinearSVC classifier. How do other classifiers compare? We'll train each one, then look at their confusion matrix.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from nltk.classify import MaxentClassifier
X = words_df
y = df.serious
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.shape
Create and train a logistic regression classifier#
Logistic regression classifiers take a while to train.
%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)
Create and train a random forest classifier#
Random forests train pretty slowly, too. If you make them perform better by increasing `n_estimators`, it takes even longer!
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X_train, y_train)
Create and train a linear support vector classifier (LinearSVC)#
This one will be nice and quick!
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X_train, y_train)
Create and train a multinomial naive bayes classifier (MultinomialNB)#
This one will also train quickly.
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X_train, y_train)
Checking each classifier's performance#
We'll use the accuracy score and confusion matrix to see how well each algorithm performs. While we're mostly interested in the confusion matrix, seeing the accuracy score is a good reminder that accuracy is a terrible evaluation metric.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
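If you'd like a number that's more honest than accuracy, scikit-learn can also print per-class precision and recall. This is just a sketch using the logistic regression we trained above - below we'll stick with accuracy plus the confusion matrix, to match the original analysis.
from sklearn.metrics import classification_report

# Per-class precision and recall say much more than overall accuracy
# when one class is far rarer than the other (sketch, not part of the original analysis)
print(classification_report(y_test,
                            logreg.predict(X_test),
                            target_names=['not serious', 'serious']))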
Logistic Regression#
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Random Forest#
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
LinearSVC#
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Naive Bayes#
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
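The four cells above are deliberately repetitive so each classifier's results are easy to scan. If the copy-and-paste bothers you, here's a sketch of the same evaluation written as a loop - it should produce the same numbers.
# Same evaluation as above, written as a loop (sketch)
classifiers = {
    'Logistic Regression': logreg,
    'Random Forest': forest,
    'LinearSVC': svc,
    'Multinomial Naive Bayes': bayes
}
label_names = pd.Series(['negative', 'positive'])
for name, classifier in classifiers.items():
    y_pred = classifier.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    # display() is available inside Jupyter notebooks
    display(pd.DataFrame(matrix,
                         columns='Predicted ' + label_names,
                         index='Is ' + label_names).div(matrix.sum(axis=1), axis=0))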
Since a good number of our serious crimes have been downgraded to not-serious, the labels we're training and testing against are themselves partly wrong - so better performance at predicting the serious label could actually be a problem. What should we try to measure for our evaluation metric?
We'll cover it (and measure it) later, but you should think reaaaally hard about it right now.
Making predictions to find downgraded crimes#
To see if our algorithm can find downgraded reports, we'll first ask it to make predictions on each of the descriptions we have. If a report is listed as not serious, but the algorithm thinks it should be serious, we should examine the report further.
# Feed the classifier the word counts (X) to have it make the prediction
df['logreg_pred'] = logreg.predict(X)
df['forest_pred'] = forest.predict(X)
df['svc_pred'] = svc.predict(X)
df['bayes_pred'] = bayes.predict(X)
df.head()
Crimes with a `1` in `serious` are serious, and ones with a `1` in `downgraded` were downgraded. If either of those columns is `1`, then the prediction should have been `1`.
df[df.downgraded == 1].head(5)
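We can also pull out the actual reports we'd want to read more closely: the ones marked not serious that a classifier thinks should be serious. This sketch isn't part of the original analysis - it uses the LinearSVC predictions, but any of the prediction columns would work, and flagged is just a throwaway name.
# Reports marked not serious that the LinearSVC flags as serious -
# these are the ones we'd want to examine further (sketch)
flagged = df[(df.serious == 0) & (df.svc_pred == 1)]
print(len(flagged), "reports flagged for review")
flagged[['CCDESC', 'DO_NARRATIVE', 'downgraded', 'svc_pred']].head(10)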
Combining classifiers#
When the LA Times did their analysis, they used two different classifiers. We can actually do the same thing!
Each predictor gave us a `0` or a `1`, where `1` means it thinks the report should be classified as serious. What if we just said hey, did any of you think this report should be serious?
df[['logreg_pred', 'forest_pred', 'svc_pred', 'bayes_pred']].head()
df['combined_pred'] = df[['logreg_pred',
'forest_pred',
'svc_pred',
'bayes_pred']].any(axis=1).astype(int)
df.head()
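Using .any(axis=1) is an "OR" vote: one classifier crying serious is enough to flag the report. If that casts too wide a net, a stricter alternative is a majority vote, sketched below with a hypothetical majority_pred column. We won't use it in the rest of the notebook, but it's worth knowing the option exists.
# A stricter alternative (sketch, not used below): require at least two
# of the four classifiers to agree before calling a report serious
df['majority_pred'] = (df[['logreg_pred',
                           'forest_pred',
                           'svc_pred',
                           'bayes_pred']].sum(axis=1) >= 2).astype(int)
# Compare how often each approach flags a report as serious
df[['combined_pred', 'majority_pred']].mean()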
y_true = df.serious
y_pred = df.combined_pred
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
AMAZING!!! Incredible!!!
But again: the accuracy score is a terrible metric and the confusion matrix might not even be that good, as what we're actually interested in is whether we can detect downgrades.
Let's try to do that now: out of all of the reports marked not serious - the candidates for being secretly downgraded - how well do we pick out the ones that actually were downgraded?
# Out of all of the ones marked as not serious
not_serious = df[df.serious == 0]
# How many did we think were serious?
y_true = not_serious.downgraded
y_pred = not_serious.combined_pred
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['not serious', 'serious'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Okay, not nearly as good as the earlier confusion matrix suggested. Using this new metric, how does LinearSVC, one of our individual classifiers, compare?
# Out of all of the ones marked as not serious
not_serious = df[df.serious == 0]
# How many did we think were serious?
y_true = not_serious.downgraded
y_pred = not_serious.svc_pred
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['not serious', 'serious'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
The combined predictions are definitely better, flagging about 150 additional downgraded cases for us to go through.
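To see where that improvement comes from, here's a quick sketch: among the reports that actually were downgraded, count how many each set of predictions flags as serious. It only uses the prediction columns we created above.
# Among the actually-downgraded reports, how many does each
# set of predictions flag as serious? (sketch)
downgraded = df[df.downgraded == 1]
downgraded[['logreg_pred', 'forest_pred', 'svc_pred', 'bayes_pred', 'combined_pred']].sum()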
Review#
We're working on reproducing a Los Angeles Times piece where they uncovered serious assaults that had been downgraded by the LAPD to simple assault. They used multiple machine learning classifiers in their investigation, so we tried to see whether we could do the same.
We used four different kinds of classifiers - a linear support vector classifier, a random forest, a logistic regression, and a multinomial naive Bayes classifier - and compared their performance. While the classifiers performed roughly similarly to one another, some did outperform others. In the end, though, we combined the predictions of all four to cast the widest net for misclassified crime reports.
It was also important to determine what our evaluation metric should be. While we trained on everything - serious assaults, non-serious assaults, and downgraded assaults - we eventually realized that what we really needed to measure was how well we separated the genuinely non-serious assaults from the downgraded ones.
Discussion topics#
- Our algorithm had 88% accuracy overall, but only 65% in detecting downgraded crimes. What's the difference here? How important is one score compared to the other?
- We only hit around 65% accuracy in finding downgraded crimes. Is this a useful score? How does it compare to random guessing, or going one-by-one through the crimes marked as non-serious?
- What techniques could we have used to find downgraded crimes if we didn't use machine learning?
- Is there a difference between looking at the prediction - the 0 or 1 - and looking at the output of `decision_function`? (See the sketch after this list.)
- What happens if our algorithm errs on the side of calling non-serious crimes serious crimes? What if it errs on the side of calling serious crimes non-serious crimes?
- If we want to find more downgraded cases (but do more work), we'll want to err on the side of examining more potentially-serious cases. Is there a better method than picking random cases?
- One of our first steps was to eliminate all crimes that weren't assaults. How do you think this helped or hindered our analysis?
- Why did we use LinearSVC instead of another classifier such as LogisticRegression, RandomForest or Naive Bayes (MultinomialNB)? Why might we try or not try those?
- You don't work for the LAPD, so you can only be so sure what should and shouldn't be a serious crime. What can you do to help feel confident that a case should be one or the other, or that our algorithm is working as promised?
- In this case, we randomly picked serious crimes to downgrade. Would it be easier or more difficult if the LAPD were systematically downgrading certain types of serious crimes? Can you think of a way to get around that sort of trickery?
- Many people say you need to release your data and analysis in order to have people trust what you've done. With something like this dataset, however, you're dealing with real things that happened to real people, many of whom would probably prefer to keep these things private. Is that a reasonable expectation? If it is, what can be done to bridge the gap between releasing all of the original data and keeping our process secret?
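For the decision_function question above, here's a small sketch of what the extra information buys you. Instead of a 0/1 prediction, LinearSVC's decision_function returns how far each report sits from the decision boundary, so you can rank the not-serious reports from most to least suspicious and work through them in that order. The svc_score column is just an illustrative name, not part of the original analysis.
# Sketch: use the raw decision_function score instead of the 0/1 prediction.
# Higher scores mean the LinearSVC leans more strongly toward "serious".
df['svc_score'] = svc.decision_function(words_df)

# Rank the reports marked not serious from most to least suspicious
suspicious = df[df.serious == 0].sort_values('svc_score', ascending=False)
suspicious[['DO_NARRATIVE', 'downgraded', 'svc_pred', 'svc_score']].head(10)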