Trying out different classifiers#

When the Los Angeles Times was using machine learning to detect serious assaults that the LAPD had downgraded into simple assaults, they used a combination of two different classification algorithms to find suspicious reports. In the spirit of completeness, let's take a look at how several classifiers perform in the task.

Imports and setup#

First we'll set some options up to make everything display correctly. It's mostly because these assault descriptions can be quite long, and the default is to truncate text after a few words.

```import pandas as pd
import numpy as np

pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 300)

%matplotlib inline
```

Repeat our analysis#

First we'll repeat the majority of our processing and analysis from the first notebook, then we'll get into the critique.

Our dataset is going to be a database of crimes committed between 2008 and 2012. It will start off with two columns:

• `CCDESC`, what criminal code was violated
• `DO_NARRATIVE`, a short text description of what happened

We're going to use this description to see if we can separate serious cases of assault from non-serious ones.

We won't be covering the process of vectorizing our dataset and creating our classifier in this notebook. Instead, we're going to focus on analyzing the possible shortcomings of our analysis, both conceptually and technically.

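```python
# Read in our dataset
# (the loading step is not shown here; `df` is assumed to already hold the raw reports)

# Only use reports classified as types of assault
df = df[df.CCDESC.str.contains("ASSAULT")].copy()

# Classify as serious or non-serious
# (assumption: aggravated and deadly-weapon assaults count as serious)
df['serious'] = df.CCDESC.str.contains("AGGRAVATED|DEADLY")
df['serious'] = df['serious'].astype(int)

# Downgrade 15% from aggravated to simple assault,
# keeping track of which reports we changed
df['downgraded'] = 0
serious_subset = df[df.serious == 1].sample(frac=0.15)
df.loc[serious_subset.index, 'serious'] = 0
df.loc[serious_subset.index, 'downgraded'] = 1

# Take a sample of 50,000
df = df.sample(n=50000)

# Examine the first few
df.head()
```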
| | CCDESC | DO_NARRATIVE | serious | downgraded |
|---|---|---|---|---|
| 405689 | BATTERY - SIMPLE ASSAULT | DO-SUSP PUSHED VICT | 0 | 0 |
| 258993 | INTIMATE PARTNER - SIMPLE ASSAULT | DO-S AND V LIVE TOGETHER HAVE 1 CIC S PUNCHED V IN THE FACE | 0 | 0 |
| 531609 | BATTERY - SIMPLE ASSAULT | DO-DURING AN ARGUMENT SUSP SLAPPED BOTH VICTS | 0 | 0 |
| 365367 | INTIMATE PARTNER - SIMPLE ASSAULT | DO-SUSP PUSHED VICT DURING AN ARGUMENT SUPS THEN THREW A SMALL TABLE AT VICTS LEGS CASING INJURY | 0 | 0 |
| 700699 | ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT | DO- UNK SUSP HIT VICTS VEH W SUSP VEH WITH INTENTION OF MAKING VICT STOP UNK SUSP FLED WB ON VALERIO ST TOWARDS VAN NUYS BL | 1 | 0 |
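
If the downgrading step feels abstract, it can help to try it on a tiny table first. This is only a sketch: the `toy` DataFrame, its 20 rows, and `random_state=0` are made up for illustration, and the `downgraded` flag is there so we can later check which rows were flipped.

```python
import pandas as pd

# Hypothetical toy data: 20 reports, all originally serious
toy = pd.DataFrame({"serious": [1] * 20})
toy["downgraded"] = 0

# Pick 15% of the serious rows (3 of 20); random_state is only here for repeatability
subset = toy[toy.serious == 1].sample(frac=0.15, random_state=0)

# Relabel them as not serious, but remember that we did it
toy.loc[subset.index, "serious"] = 0
toy.loc[subset.index, "downgraded"] = 1

print(toy[["serious", "downgraded"]].sum())
```

Every row ends up with exactly one of the two flags set: it is either still labelled serious or it was downgraded.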

Vectorize#

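```python
%%time

from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer('english')

# TF-IDF vectorizer that stems every token, so "punched" and "punching" count as the same word
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: (stemmer.stem(word) for word in analyzer(doc))

# Keep at most 1,000 stems, each found in at least 15 reports but in no more than half of them
vectorizer = StemmedTfidfVectorizer(min_df=15, max_df=0.5, max_features=1000)

X = vectorizer.fit_transform(df.DO_NARRATIVE)
# get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names()
words_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()
```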
```CPU times: user 30 s, sys: 967 ms, total: 31 s
Wall time: 34.4 s
```
10 11 12 13 15 18th 1x 1yr 20 2x 2yr 390 3x 3yr abdomen abl about abov abras abus abv acceler accus across adn adv advis after again against aggress ago aid air alcohol all alley allow almost along also alt alterc am amount an andpunch anger angri ani ... wall want warn was watch water way wb weapon went were west westbound western what wheelchair when where whi which while white who wife will window windshield wit wit1 with witha without woke wood wooden word work would wound wrap wrestl wrist wth yard year yell you your yr yrs
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.405713 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.217172 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.27343 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.111141 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1000 columns

```words_df.shape
```
`(50000, 1000)`
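
The `min_df` and `max_df` options are what keep the vocabulary manageable. As a rough sketch of how they prune words, here is a plain `TfidfVectorizer` (no stemming) on four made-up narratives; the `docs`, `toy_vectorizer`, and `toy_X` names are invented for this example and are not part of the analysis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature narratives, not taken from the dataset
docs = [
    "susp pushed vict",
    "susp punched vict in face",
    "susp punched vict and fled",
    "susp pushed vict and fled",
]

# min_df=2: drop words found in fewer than 2 documents
# max_df=0.5: drop words found in more than half of the documents
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=0.5)
toy_X = toy_vectorizer.fit_transform(docs)

kept_words = sorted(toy_vectorizer.vocabulary_)
print(kept_words)
print(toy_X.shape)
```

Words like `susp` and `vict` show up everywhere, so they can't help tell reports apart and are dropped. Words that appear only once are dropped too.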

Classify#

Previously we stuck to using a LinearSVC classifier. How do other classifiers compare? We'll train each one, then look at their confusion matrix.

```from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
```
```X = words_df
y = df.serious

X_train, X_test, y_train, y_test = train_test_split(X, y)
```
```X_train.shape
```
`(37500, 1000)`

Create and train a logistic regression classifier#

Logistic regression classifiers take a while to train.

```%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)
```
```CPU times: user 42.2 s, sys: 1.07 s, total: 43.3 s
Wall time: 37 s
```
```LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, l1_ratio=None,
max_iter=1000, multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)```

Create and train a random forest classifier#

Random forests train pretty slowly, too. If you make them perform better by increasing `n_estimators` it takes even longer!

```%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X_train, y_train)
```
```CPU times: user 49.2 s, sys: 821 ms, total: 50 s
Wall time: 59.4 s
```
```RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=50,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)```

Create and train a linear support vector classifier (LinearSVC)#

This one will be nice and quick!

```%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X_train, y_train)
```
```CPU times: user 753 ms, sys: 28.4 ms, total: 781 ms
Wall time: 1.01 s
```
```LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0)```

Create and train a multinomial naive bayes classifier (MultinomialNB)#

This one will also train quickly.

```%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X_train, y_train)
```
```CPU times: user 251 ms, sys: 32.6 ms, total: 284 ms
Wall time: 312 ms
```
`MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)`

Checking each classifier's performance#

We'll use the accuracy score and confusion matrix to see how well each algorithm performs. While we're mostly interested in the confusion matrix, seeing the accuracy score is a good reminder that accuracy is a terrible evaluation metric.

```from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
```
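
Why is accuracy such a poor guide? Here is a minimal sketch with made-up labels (the 90/10 split is an assumption chosen for illustration, not the dataset's real balance): a "classifier" that never flags anything still scores well.

```python
# Hypothetical labels: 90 non-serious reports (0) and 10 serious ones (1)
toy_true = [0] * 90 + [1] * 10

# A useless "classifier" that calls every report non-serious
toy_pred = [0] * 100

correct = sum(t == p for t, p in zip(toy_true, toy_pred))
toy_accuracy = correct / len(toy_true)

# Share of the serious reports that were actually caught (recall)
caught = sum(t == 1 and p == 1 for t, p in zip(toy_true, toy_pred))
toy_recall = caught / sum(toy_true)

print("Accuracy:", toy_accuracy)  # 0.9
print("Recall on serious reports:", toy_recall)  # 0.0
```

Ninety percent accuracy, and it found none of the cases we care about. That is why we look at the confusion matrix instead.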

Logistic Regression#

```y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
```
```Accuracy score 0.86368
```
| | Predicted negative | Predicted positive |
|---|---|---|
| Is negative | 0.934727 | 0.065273 |
| Is positive | 0.353801 | 0.646199 |
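
A note on the last line of that cell: `.div(matrix.sum(axis=1), axis=0)` divides each row by its own total, so each row should add up to 1. The sketch below uses made-up counts (`counts` and `toy_matrix` are hypothetical) to show why the `axis=0` matters, since plain `/` quietly lines the row totals up with the columns instead.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: rows are the true class, columns the predicted class
counts = np.array([[90, 10],
                   [20, 30]])
toy_matrix = pd.DataFrame(counts,
                          columns=["Predicted negative", "Predicted positive"],
                          index=["Is negative", "Is positive"])

# Row-normalised: each row is divided by its own total, so rows sum to 1
by_row = toy_matrix.div(counts.sum(axis=1), axis=0)

# Plain division matches the row totals to the columns instead,
# so the off-diagonal cells are divided by the wrong total
misaligned = toy_matrix / counts.sum(axis=1)

print(by_row)
print(misaligned)
```

A quick sanity check for any normalised confusion matrix: if the rows don't sum to 1, the division was probably aligned the wrong way.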

Random Forest#

```y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
```
```Accuracy score 0.86648
```
| | Predicted negative | Predicted positive |
|---|---|---|
| Is negative | 0.944704 | 0.055296 |
| Is positive | 0.372969 | 0.627031 |

LinearSVC#

```y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
```
```Accuracy score 0.86544
```
| | Predicted negative | Predicted positive |
|---|---|---|
| Is negative | 0.937487 | 0.062513 |
| Is positive | 0.355101 | 0.644899 |

Naive Bayes#

```y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
```
```Accuracy score 0.85016
```
| | Predicted negative | Predicted positive |
|---|---|---|
| Is negative | 0.944385 | 0.055615 |
| Is positive | 0.438272 | 0.561728 |

Since a good number of our serious crimes have been downgraded to not-serious, a better performance on predicting serious crimes could actually be a problem. What should we try to measure for our evaluation metric?

We'll cover it (and measure it) later, but you should think reaaaally hard about it right now.

Making predictions to find downgraded crimes#

To see if our algorithm can find downgraded reports, we'll first ask it to make predictions on each of the descriptions we have. If a report is listed as not serious, but the algorithm thinks it should be serious, we should examine the report further.

```python
# Feed each classifier the TF-IDF features (X) to have it make the prediction
df['logreg_pred'] = logreg.predict(X)
df['forest_pred'] = forest.predict(X)
df['svc_pred'] = svc.predict(X)
df['bayes_pred'] = bayes.predict(X)

df.head()
```
| | CCDESC | DO_NARRATIVE | serious | downgraded | logreg_pred | forest_pred | svc_pred | bayes_pred |
|---|---|---|---|---|---|---|---|---|
| 405689 | BATTERY - SIMPLE ASSAULT | DO-SUSP PUSHED VICT | 0 | 0 | 0 | 0 | 0 | 0 |
| 258993 | INTIMATE PARTNER - SIMPLE ASSAULT | DO-S AND V LIVE TOGETHER HAVE 1 CIC S PUNCHED V IN THE FACE | 0 | 0 | 0 | 0 | 0 | 0 |
| 531609 | BATTERY - SIMPLE ASSAULT | DO-DURING AN ARGUMENT SUSP SLAPPED BOTH VICTS | 0 | 0 | 0 | 0 | 0 | 0 |
| 365367 | INTIMATE PARTNER - SIMPLE ASSAULT | DO-SUSP PUSHED VICT DURING AN ARGUMENT SUPS THEN THREW A SMALL TABLE AT VICTS LEGS CASING INJURY | 0 | 0 | 0 | 0 | 0 | 0 |
| 700699 | ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT | DO- UNK SUSP HIT VICTS VEH W SUSP VEH WITH INTENTION OF MAKING VICT STOP UNK SUSP FLED WB ON VALERIO ST TOWARDS VAN NUYS BL | 1 | 0 | 1 | 1 | 1 | 1 |

Crimes with a `1` in serious are serious, and ones with a `1` in downgraded were downgraded. If either of those columns is `1`, then the prediction should have been `1`.
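
That rule can be written down directly. Here is a small sketch on four made-up rows (the `reports` table and the `truly_serious` and `missed` columns are hypothetical names for this example) that builds the "should have been `1`" target and marks the misses.

```python
import pandas as pd

# Hypothetical rows using the same column names as our real table
reports = pd.DataFrame({
    "serious":    [0, 0, 1, 0],
    "downgraded": [0, 1, 0, 1],
    "svc_pred":   [0, 1, 1, 0],
})

# A report is truly serious if it is labelled serious or was downgraded
reports["truly_serious"] = (reports.serious | reports.downgraded)

# A miss: truly serious, but the classifier predicted 0
reports["missed"] = (reports.truly_serious == 1) & (reports.svc_pred == 0)

print(reports)
```

Only the last row is a miss: it was downgraded, and the classifier didn't catch it.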

```df[df.downgraded == 1].head(5)
```
| | CCDESC | DO_NARRATIVE | serious | downgraded | logreg_pred | forest_pred | svc_pred | bayes_pred |
|---|---|---|---|---|---|---|---|---|
| 575961 | ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT | DO-V WAS APP BY S S PUNCHD V IN FACE S1 PROD KNIFE AND SWUNG KNIFE AT V CAUSG CUT ABOVE HER LFT EYE | 0 | 1 | 1 | 1 | 1 | 1 |
| 137899 | ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER | DO-SUSP THREW GLASS BOTTLE AT VICT FROM THIRD FLOOR APT BALCONY ALMOST HITTING VICT SUSP FLED INTO APT | 0 | 1 | 1 | 0 | 1 | 0 |
| 405522 | ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT | DO-V WAS STRUCK IN THE HEAD TWICE WITH BASEBALL BAT BY UNK SUSP V WAS TRANS TO HOSP BY PRIVATE PARTY AND DROPPED DUMPED OFF | 0 | 1 | 1 | 0 | 1 | 0 |
| 750347 | ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT | DO-S ENGAGED IN A VERBAL ALTERCATION WITH V S STRUCK V ON HAND WITH A STICK S CHASED V AND STRUCK V A SECOND TIME ON VICTS BACK WITH STICK ARREST | 0 | 1 | 1 | 0 | 1 | 0 |
| 173459 | ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT | DO-SUSPECT PICKED UP A 50 GALLON METAL TRASH CAN AND THREW IT AT VICTIM HITTING HER ON HER LEFT LEG | 0 | 1 | 0 | 0 | 0 | 0 |

Combining classifiers#

When the LA Times did their analysis, they used two different classifiers. We can actually do the same thing!

Each predictor gave us a `0` or a `1`, where `1` means it thinks the report should be classified as serious. What if we just said hey, did any of you think this report should be serious?

```df[['logreg_pred', 'forest_pred', 'svc_pred', 'bayes_pred']].head()
```
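| | logreg_pred | forest_pred | svc_pred | bayes_pred |
|---|---|---|---|---|
| 405689 | 0 | 0 | 0 | 0 |
| 258993 | 0 | 0 | 0 | 0 |
| 531609 | 0 | 0 | 0 | 0 |
| 365367 | 0 | 0 | 0 | 0 |
| 700699 | 1 | 1 | 1 | 1 |

```python
# Flag a report if at least one classifier predicted serious
df['combined_pred'] = df[['logreg_pred',
                          'forest_pred',
                          'svc_pred',
                          'bayes_pred']].any(axis=1).astype(int)
df.head()
```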
| | CCDESC | DO_NARRATIVE | serious | downgraded | logreg_pred | forest_pred | svc_pred | bayes_pred | combined_pred |
|---|---|---|---|---|---|---|---|---|---|
| 405689 | BATTERY - SIMPLE ASSAULT | DO-SUSP PUSHED VICT | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 258993 | INTIMATE PARTNER - SIMPLE ASSAULT | DO-S AND V LIVE TOGETHER HAVE 1 CIC S PUNCHED V IN THE FACE | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 531609 | BATTERY - SIMPLE ASSAULT | DO-DURING AN ARGUMENT SUSP SLAPPED BOTH VICTS | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 365367 | INTIMATE PARTNER - SIMPLE ASSAULT | DO-SUSP PUSHED VICT DURING AN ARGUMENT SUPS THEN THREW A SMALL TABLE AT VICTS LEGS CASING INJURY | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 700699 | ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT | DO- UNK SUSP HIT VICTS VEH W SUSP VEH WITH INTENTION OF MAKING VICT STOP UNK SUSP FLED WB ON VALERIO ST TOWARDS VAN NUYS BL | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
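
Asking "did any of you say serious?" is the most permissive way to combine votes. As a sketch of the alternatives, we could count the votes and compare them to a threshold. The three toy rows and the `votes`, `any_vote`, and `majority_vote` names below are made up for illustration, and this is not a description of the rule the LA Times used.

```python
import pandas as pd

pred_cols = ["logreg_pred", "forest_pred", "svc_pred", "bayes_pred"]

# Hypothetical predictions for three reports
toy_preds = pd.DataFrame(
    [[0, 0, 0, 0],
     [1, 0, 1, 0],
     [1, 1, 1, 1]],
    columns=pred_cols)

# Number of classifiers that called each report serious
toy_preds["votes"] = toy_preds[pred_cols].sum(axis=1)

# Flag if at least one classifier said serious (same result as .any(axis=1))
toy_preds["any_vote"] = (toy_preds.votes >= 1).astype(int)

# Stricter: flag only if more than half of the classifiers agree
toy_preds["majority_vote"] = (toy_preds.votes > len(pred_cols) / 2).astype(int)

print(toy_preds)
```

The middle row is the interesting one: two of four classifiers flag it, so it counts under "any" but not under a strict majority. A lower threshold casts a wider net and means more reports to read.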
```python
y_true = df.serious
y_pred = df.combined_pred
matrix = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
```
```
Accuracy score 0.9241
```
| | Predicted negative | Predicted positive |
|---|---|---|
| Is negative | 0.921468 | 0.078532 |
| Is positive | 0.067638 | 0.932362 |

AMAZING!!! Incredible!!!

But again: the accuracy score is a terrible metric, and the confusion matrix might not be that good either. These predictions cover every report, including the ones the classifiers were trained on, and what we're actually interested in is whether we can detect downgrades.

Let's try to do that now: out of all of the reports marked not serious - the candidates for being secretly downgraded - how many of the ones that actually were downgraded did we flag as serious?

```python
# Out of all the reports marked as not serious...
not_serious = df[df.serious == 0]

# ...which ones were actually downgraded?
y_true = not_serious.downgraded

# And which ones did we think were serious?
y_pred = not_serious.combined_pred
matrix = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)

# Here "Is serious" means the report was downgraded
label_names = pd.Series(['not serious', 'serious'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
```
```Accuracy score 0.9444371192742808
```
| | Predicted not serious | Predicted serious |
|---|---|---|
| Is not serious | 0.957708 | 0.042292 |
| Is serious | 0.287665 | 0.712335 |

Okay, not nearly as good as what the confusion matrix said. Using this new metric, how does it compare for LinearSVC, one of our standard classifiers?

```python
# Out of all the reports marked as not serious...
not_serious = df[df.serious == 0]

# ...which ones were actually downgraded?
y_true = not_serious.downgraded

# And which ones did LinearSVC think were serious?
y_pred = not_serious.svc_pred
matrix = confusion_matrix(y_true, y_pred)

accuracy = accuracy_score(y_true, y_pred)
print("Accuracy score", accuracy)

# Here "Is serious" means the report was downgraded
label_names = pd.Series(['not serious', 'serious'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
```
```Accuracy score 0.957543313731178
```
| | Predicted not serious | Predicted serious |
|---|---|---|
| Is not serious | 0.975802 | 0.024198 |
| Is serious | 0.361775 | 0.638225 |

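The combined one is definitely better at catching downgrades: it finds about 71% of them versus LinearSVC's 64%, or about 150 additional cases. The price is more reports for our inspection to go through, since it also flags about 4% of the genuinely non-serious reports instead of 2%.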

Review#

We're working on reproducing a Los Angeles Times piece where they uncovered serious assaults that had been downgraded by the LAPD to simple assault. They used multiple machine learning classifiers in their investigation, so we tried to see whether we could do the same.

We used four different kinds of classifiers - a linear support vector classifier, a random forest, a logistic regression, and a naive bayes classifier - and compared their performance. While the classifiers performed roughly similarly to one another, some did outperform others. In the end, though, we combined the predictions of all of our classifiers to cast the widest net for misclassified crime reports.

It was also important to determine what our evaluation metric should be. While we trained our classifiers on everything - serious assaults, non-serious assaults, and downgraded assaults - we eventually realized all we should measure was how accurate we were in discriminating non-serious assaults from downgraded assaults.

Discussion topics#

• Our LinearSVC had about 87% accuracy overall, but only found about 64% of the downgraded crimes. What's the difference here? How important is one score compared to the other?
• We only found around 64% of the downgraded crimes. Is this a useful score? How does it compare to random guessing, or going one-by-one through the crimes marked as non-serious?
• What techniques could we have used to find downgraded crimes if we didn't use machine learning?
• Is there a difference between looking at the prediction - the 0 or 1 - and looking at the output of `decision_function`?
• What happens if our algorithm errs on the side of calling non-serious crimes serious crimes? What if it errs on the side of calling serious crimes non-serious crimes?
• If we want to find more downgraded cases (but do more work), we'll want to err on the side of examining more potentially-serious cases. Is there a better method than picking random cases?
• One of our first steps was to eliminate all crimes that weren't assaults. How do you think this helped or hindered our analysis?
• Why did we use LinearSVC instead of another classifier such as LogisticRegression, RandomForest or Naive Bayes (MultinomialNB)? Why might we try or not try those?
• You don't work for the LAPD, so you can only be so sure what should and shouldn't be a serious crime. What can you do to help feel confident that a case should be one or the other, or that our algorithm is working as promised?
• In this case, we randomly picked serious crimes to downgrade. Would it be easier or more difficult if the LAPD was systematically downgrading certain types of serious crimes? Can you think of a way around that sort of trickery?
• Many people say you need to release your data and analysis in order to have people trust what you've done. With something like this dataset, however, you're dealing with real things that happened to real people, many of whom would probably prefer to keep these things private. Is that a reasonable expectation? If it is, what can be done to bridge the gap between releasing all of the original data and keeping our process secret?
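
As a starting point for the `decision_function` question: for a linear classifier such as LinearSVC, the 0 or 1 prediction is just the sign of a score, and the score says how far a report sits from the boundary. The sketch below imitates that with made-up numbers (`features`, `weights`, and `intercept` are all hypothetical, not values from our fitted model) to show how scores could rank reports for review.

```python
import numpy as np

# Hypothetical feature rows for three reports and made-up model parameters
features = np.array([[0.0, 0.8],
                     [0.5, 0.1],
                     [0.9, 0.0]])
weights = np.array([2.0, -1.0])
intercept = -0.5

# Score of a linear classifier: positive means "serious", negative "not serious"
scores = features @ weights + intercept

# The 0/1 prediction keeps only the sign of the score
predictions = (scores > 0).astype(int)

# Sorting by score puts the most confidently "serious" reports first
review_order = np.argsort(-scores)

print(scores)
print(predictions)
print(review_order)
```

Two reports can both be predicted `1` while one has a much higher score than the other, which is one way to decide what to read first.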