Trying out different classifiers#
When the Los Angeles Times was using machine learning to detect serious assaults that the LAPD had downgraded into simple assaults, they used a combination of two different classification algorithms to find suspicious reports. In the spirit of completeness, let's take a look at how several classifiers perform in the task.
Imports and setup#
First we'll set up a few options to make everything display correctly. It's mostly because these assault descriptions can be quite long, and the default is to truncate text after a few words.
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 300)
%matplotlib inline
Repeat our analysis#
First we'll repeat the majority of our processing and analysis from the first notebook, then we'll get into the critique.
Read in our data#
Our dataset is going to be a database of crimes committed between 2008 and 2012. It will start off with two columns:
- `CCDESC`, what criminal code was violated
- `DO_NARRATIVE`, a short text description of what happened
We're going to use this description to see if we can separate serious cases of assault from non-serious ones.
We won't spend much time explaining the vectorizing and classifier-building steps in this notebook - that's covered in the first notebook. Instead, we're going to focus on how several different classifiers compare, and on the possible shortcomings of our analysis, both conceptually and technically.
# Read in our dataset
df = pd.read_csv("data/2008-2012.csv")
# Only use reports classified as types of assault
df = df[df.CCDESC.str.contains("ASSAULT")].copy()
# Classify as serious or non-serious
df['serious'] = df.CCDESC.str.contains("AGGRAVATED") | df.CCDESC.str.contains("DEADLY")
df['serious'] = df['serious'].astype(int)
# Downgrade 15% from aggravated to simple assault
serious_subset = df[df.serious == 1].sample(frac=0.15)
df['downgraded'] = 0
df.loc[serious_subset.index, 'downgraded'] = 1
df.loc[serious_subset.index, 'serious'] = 0
# Take a sample of 50,000
df = df.sample(n=50000)
# Examine the first few
df.head()
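Before vectorizing, it can't hurt to peek at how unbalanced the labels are after the simulated downgrading. This quick check isn't part of the original walkthrough - it only uses the serious and downgraded columns we created above - but it's useful context for the evaluation discussion later.
# Optional sanity check: how unbalanced are our labels after downgrading?
print(df.serious.value_counts(normalize=True))
print(df.downgraded.value_counts(normalize=True))
# Every downgraded report should now be sitting in serious == 0
pd.crosstab(df.serious, df.downgraded)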
Vectorize#
%%time
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
stemmer = SnowballStemmer('english')
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Run the standard TF-IDF analyzer, then stem each word it produces
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (stemmer.stem(word) for word in analyzer(doc))
vectorizer = StemmedTfidfVectorizer(min_df=15, max_df=0.5, max_features=1000)
X = vectorizer.fit_transform(df.DO_NARRATIVE)
words_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
words_df.head(5)
words_df.shape
Classify#
Previously we stuck to using a LinearSVC classifier. How do other classifiers compare? We'll train each one, then look at their confusion matrix.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from nltk.classify import MaxentClassifier
X = words_df
y = df.serious
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train.shape
Create and train a logistic regression classifier#
Logistic regression classifiers take a while to train.
%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)
Create and train a random forest classifier#
Random forests train pretty slowly, too. If you make them perform better by increasing `n_estimators`, it takes even longer!
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X_train, y_train)
Create and train a linear support vector classifier (LinearSVC)#
This one will be nice and quick!
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X_train, y_train)
Create and train a multinomial naive bayes classifier (MultinomialNB)#
This one will also train quickly.
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X_train, y_train)
Checking each classifier's performance#
We'll use the accuracy score and confusion matrix to see how well each algorithm performs. While we're mostly interested in the confusion matrix, seeing the accuracy score is a good reminder that accuracy is a terrible evaluation metric.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
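If you'd like a number that's more honest than accuracy, scikit-learn can also print per-class precision and recall. This is just a sketch using the logistic regression we trained above - below we'll stick with accuracy plus the confusion matrix, to match the original analysis.
from sklearn.metrics import classification_report

# Per-class precision and recall say much more than overall accuracy
# when one class is far rarer than the other (sketch, not part of the original analysis)
print(classification_report(y_test,
                            logreg.predict(X_test),
                            target_names=['not serious', 'serious']))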
Logistic Regression#
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Random Forest#
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
LinearSVC#
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Naive Bayes#
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
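The four cells above are deliberately repetitive so each classifier's results are easy to scan. If the copy-and-paste bothers you, here's a sketch of the same evaluation written as a loop - it should produce the same numbers.
# Same evaluation as above, written as a loop (sketch)
classifiers = {
    'Logistic Regression': logreg,
    'Random Forest': forest,
    'LinearSVC': svc,
    'Multinomial Naive Bayes': bayes
}
label_names = pd.Series(['negative', 'positive'])
for name, classifier in classifiers.items():
    y_pred = classifier.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, y_pred))
    matrix = confusion_matrix(y_test, y_pred)
    # display() is available inside Jupyter notebooks
    display(pd.DataFrame(matrix,
                         columns='Predicted ' + label_names,
                         index='Is ' + label_names).div(matrix.sum(axis=1), axis=0))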
Since a good number of our serious crimes have been downgraded to not-serious, the labels we're training and testing against are themselves partly wrong - so better performance at predicting the serious label could actually be a problem. What should we try to measure for our evaluation metric?
We'll cover it (and measure it) later, but you should think reaaaally hard about it right now.
Making predictions to find downgraded crimes#
To see if our algorithm can find downgraded reports, we'll first ask it to make predictions on each of the descriptions we have. If a report is listed as not serious, but the algorithm thinks it should be serious, we should examine the report further.
# Feed the classifier the word counts (X) to have it make the prediction
df['logreg_pred'] = logreg.predict(X)
df['forest_pred'] = forest.predict(X)
df['svc_pred'] = svc.predict(X)
df['bayes_pred'] = bayes.predict(X)
df.head()
Crimes with a `1` in `serious` are serious, and ones with a `1` in `downgraded` were downgraded. If either of those columns is `1`, then the prediction should have been `1`.
df[df.downgraded == 1].head(5)
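We can also pull out the actual reports we'd want to read more closely: the ones marked not serious that a classifier thinks should be serious. This sketch isn't part of the original analysis - it uses the LinearSVC predictions, but any of the prediction columns would work, and flagged is just a throwaway name.
# Reports marked not serious that the LinearSVC flags as serious -
# these are the ones we'd want to examine further (sketch)
flagged = df[(df.serious == 0) & (df.svc_pred == 1)]
print(len(flagged), "reports flagged for review")
flagged[['CCDESC', 'DO_NARRATIVE', 'downgraded', 'svc_pred']].head(10)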
Combining classifiers#
When the LA Times did their analysis, they used two different classifiers. We can actually do the same thing!
Each predictor gave us a `0` or a `1`, where `1` means it thinks the report should be classified as serious. What if we just said hey, did any of you think this report should be serious?
df[['logreg_pred', 'forest_pred', 'svc_pred', 'bayes_pred']].head()
df['combined_pred'] = df[['logreg_pred',
'forest_pred',
'svc_pred',
'bayes_pred']].any(axis=1).astype(int)
df.head()
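Using .any(axis=1) is an "OR" vote: one classifier crying serious is enough to flag the report. If that casts too wide a net, a stricter alternative is a majority vote, sketched below with a hypothetical majority_pred column. We won't use it in the rest of the notebook, but it's worth knowing the option exists.
# A stricter alternative (sketch, not used below): require at least two
# of the four classifiers to agree before calling a report serious
df['majority_pred'] = (df[['logreg_pred',
                           'forest_pred',
                           'svc_pred',
                           'bayes_pred']].sum(axis=1) >= 2).astype(int)
# Compare how often each approach flags a report as serious
df[['combined_pred', 'majority_pred']].mean()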
y_true = df.serious
y_pred = df.combined_pred
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
AMAZING!!! Incredible!!!
But again: the accuracy score is a terrible metric and the confusion matrix might not even be that good, as what we're actually interested in is whether we can detect downgrades.
Let's try to do that now: out of all of the reports marked not serious - the candidates for being secretly downgraded - how well do we pick out the ones that actually were downgraded?
# Out of all of the ones marked as not serious
not_serious = df[df.serious == 0]
# How many did we think were serious?
y_true = not_serious.downgraded
y_pred = not_serious.combined_pred
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['not serious', 'serious'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Okay, not nearly as good as the earlier confusion matrix suggested. Using this new metric, how does LinearSVC, one of our individual classifiers, compare?
# Out of all of the ones marked as not serious
not_serious = df[df.serious == 0]
# How many did we think were serious?
y_true = not_serious.downgraded
y_pred = not_serious.svc_pred
matrix = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)
label_names = pd.Series(['not serious', 'serious'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
The combined predictions are definitely better, flagging about 150 additional downgraded cases for us to go through.
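To see where that improvement comes from, here's a quick sketch: among the reports that actually were downgraded, count how many each set of predictions flags as serious. It only uses the prediction columns we created above.
# Among the actually-downgraded reports, how many does each
# set of predictions flag as serious? (sketch)
downgraded = df[df.downgraded == 1]
downgraded[['logreg_pred', 'forest_pred', 'svc_pred', 'bayes_pred', 'combined_pred']].sum()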
Review#
We're working on reproducing a Los Angeles Times piece where they uncovered serious assaults that had been downgraded by the LAPD to simple assault. They used multiple machine learning classifiers in their investigation, so we tried to see whether we could do the same.
We used four different kinds of classifiers - a linear support vector classifier, a random forest, a logistic regression, and a multinomial naive Bayes classifier - and compared their performance. While the classifiers performed roughly similarly to one another, some did outperform others. In the end, though, we combined the predictions of all four to cast the widest net for misclassified crime reports.
It was also important to determine what our evaluation metric should be. While we trained on everything - serious assaults, non-serious assaults, and downgraded assaults - we eventually realized that what we really needed to measure was how well we separated the genuinely non-serious assaults from the downgraded ones.
Discussion topics#
- Our algorithm had 88% accuracy overall, but only 65% in detecting downgraded crimes. What's the difference here? How important is one score compared to the other?
- We only hit around 65% accuracy in finding downgraded crimes. Is this a useful score? How does it compare to random guessing, or going one-by-one through the crimes marked as non-serious?
- What techniques could we have used to find downgraded crimes if we didn't use machine learning?
- Is there a difference between looking at the prediction - the 0 or 1 - and looking at the output of `decision_function`? (See the sketch after this list.)
- What happens if our algorithm errs on the side of calling non-serious crimes serious crimes? What if it errs on the side of calling serious crimes non-serious crimes?
- If we want to find more downgraded cases (but do more work), we'll want to err on the side of examining more potentially-serious cases. Is there a better method than picking random cases?
- One of our first steps was to eliminate all crimes that weren't assaults. How do you think this helped or hindered our analysis?
- Why did we use LinearSVC instead of another classifier such as LogisticRegression, RandomForest or Naive Bayes (MultinomialNB)? Why might we try or not try those?
- You don't work for the LAPD, so you can only be so sure what should and shouldn't be a serious crime. What can you do to help feel confident that a case should be one or the other, or that our algorithm is working as promised?
- In this case, we randomly picked serious crimes to downgrade. Would it be easier or more difficult if the LAPD were systematically downgrading certain types of serious crimes? Can you think of a way to get around that sort of trickery?
- Many people say you need to release your data and analysis in order to have people trust what you've done. With something like this dataset, however, you're dealing with real things that happened to real people, many of whom would probably prefer to keep these things private. Is that a reasonable expectation? If it is, what can be done to bridge the gap between releasing all of the original data and keeping our process secret?
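For the decision_function question above, here's a small sketch of what the extra information buys you. Instead of a 0/1 prediction, LinearSVC's decision_function returns how far each report sits from the decision boundary, so you can rank the not-serious reports from most to least suspicious and work through them in that order. The svc_score column is just an illustrative name, not part of the original analysis.
# Sketch: use the raw decision_function score instead of the 0/1 prediction.
# Higher scores mean the LinearSVC leans more strongly toward "serious".
df['svc_score'] = svc.decision_function(words_df)

# Rank the reports marked not serious from most to least suspicious
suspicious = df[df.serious == 0].sort_values('svc_score', ascending=False)
suspicious[['DO_NARRATIVE', 'downgraded', 'svc_pred', 'svc_score']].head(10)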