Taking a closer look at our crime classifier's shortcomings#
In our last project, we built a classifier to identify serious assaults that had been downgraded to simple assaults. In this notebook we'll look more closely at how our algorithm operates and dig into what this technique might miss.
Imports and setup#
First we'll set up a few options to make everything display correctly. It's mostly because these assault descriptions can be quite long, and by default pandas truncates text after a few words.
import pandas as pd
import numpy as np
from sklearn.svm import LinearSVC
pd.set_option('display.max_colwidth', 200)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 300)
%matplotlib inline
Repeat our analysis#
First we'll repeat the majority of our processing and analysis from the first notebook, then we'll get into the critique.
Read in our data#
Our dataset is a database of crime reports between 2008 and 2012. It starts off with two columns:
- CCDESC, the criminal code that was violated
- DO_NARRATIVE, a short text description of what happened
We're going to use this description to see if we can separate serious assaults from non-serious ones.
We won't be covering the process of vectorizing our dataset and creating our classifier in this notebook. Instead, we're going to focus on analyzing the possible shortcomings of our analysis, both conceptually and technically.
# Read in our dataset
df = pd.read_csv("data/2008-2012.csv")
# Only use reports classified as types of assault
df = df[df.CCDESC.str.contains("ASSAULT")].copy()
# Classify as serious or non-serious
df['serious'] = df.CCDESC.str.contains("AGGRAVATED") | df.CCDESC.str.contains("DEADLY")
df['serious'] = df['serious'].astype(int)
# Downgrade 15% from aggravated to simple assault
serious_subset = df[df.serious == 1].sample(frac=0.15)
df['downgraded'] = 0
df.loc[serious_subset.index, 'downgraded'] = 1
df.loc[serious_subset.index, 'serious'] = 0
# Examine the first few
df.head()
Vectorize#
%%time
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
stemmer = SnowballStemmer('english')
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (stemmer.stem(word) for word in analyzer(doc))
vectorizer = StemmedTfidfVectorizer(min_df=20, max_df=0.5)
X = vectorizer.fit_transform(df.DO_NARRATIVE)
words_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
words_df.head(5)
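If you're curious what the stemming analyzer actually does to one of these narratives, you can run it by hand. A quick sketch, using an invented example sentence:
# Tokenize and stem a made-up narrative the same way the vectorizer does
analyzer = vectorizer.build_analyzer()
list(analyzer("DO-SUSP STRUCK VICTS WITH BASEBALL BATS AND KNIVES"))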
Classify#
%%time
X = words_df
y = df.serious
clf = LinearSVC()
clf.fit(X, y)
import eli5
eli5.show_weights(clf, vec=vectorizer, top=(20,20), horizontal_layout=True)
We can throw it into graph form, too.
eli5.explain_weights_df(
    clf,
    vec=vectorizer,
    top=(20, 20)
).plot(
    x='feature',
    y='weight',
    kind='barh',
    figsize=(10, 10)
)
Lots of interesting stuff in there!
- Does it sound reasonable which terms imply aggravated vs simple assault?
- Which ones are misspellings? Does that worry you?
- Are there any terms in there you don't quite understand?
I personally don't understand ppa, so I'm going to look it up in the dataset.
df[df.DO_NARRATIVE.str.contains("PPA")].head()
PPA stands for Private Person's Arrest; you can read more about it and even see the form that gets filled out. PPAs are often performed by private security guards.
plastic also seems to be popular under non-serious assaults.
df[df.DO_NARRATIVE.str.contains("PLASTIC")].head()
Guess they aren't too dangerous?
Making predictions to find downgraded crimes#
To see if our algorithm can find downgraded reports, we'll first ask it to make predictions on each of the descriptions we have. If a report is listed as not serious, but the algorithm thinks it should be serious, we should examine the report further.
# Feed the classifier the word counts (X) to have it make the prediction
df['prediction'] = clf.predict(X)
# Let's also see how certain the classifier is
df['prediction_dist'] = clf.decision_function(X)
df.head()
Crimes with a 1 in serious are serious, and ones with a 1 in downgraded were downgraded. If either of those columns is 1, then prediction would also be 1 for a correct prediction.
We also added how certain the classifier is about each prediction. Different classifiers measure this differently, but for a LinearSVC it's the .decision_function method. The closer the score is to 0, the less sure the algorithm is.
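If you'd like to see which reports the classifier is least sure about, you can sort by the absolute value of that score. A minimal sketch:
# Reports closest to the decision boundary are the ones the classifier is least sure about
df.loc[df.prediction_dist.abs().sort_values().index].head()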
Let's evaluate our classifier#
When you build a classifier, you'll talk about your evaluation metric - the measure you use to judge how well your algorithm performed. Typically this is accuracy: how often was your prediction correct?
How often did our prediction match whether a crime was listed as serious?#
(df.prediction == df.serious).value_counts(normalize=True)
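If you'd rather not compute this by hand, scikit-learn gives you the same number. A quick sketch using accuracy_score:
from sklearn.metrics import accuracy_score
# Fraction of reports where the prediction matches the (possibly downgraded) serious label
accuracy_score(df.serious, df.prediction)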
88% doesn't seem that bad!
Remember, though, 15% of the serious crimes have been downgraded, so for those reports the serious column is misleading. What we actually need to check is whether we correctly predicted the reports that were either marked as serious or downgraded.
How often did we match the true serious/not serious value?#
Since we're interested in uncovering the secretly-serious reports, we'll compare the prediction to whether a report was serious or downgraded.
(df.prediction == (df.serious | df.downgraded)).value_counts(normalize=True)
We actually did better when including the secrets! 89%!
While this seems good, it isn't what we're actually after. We're specifically doing research on finding downgraded reports, so what we're interested in is how often we found reports marked as non-serious that were downgraded from serious.
How often did we catch downgrades?#
# Only select downgraded reports
downgraded_df = df[df.downgraded == 1]
# How often did we predict they were serious?
(downgraded_df.prediction == 1).value_counts(normalize=True)
# And again, without the percentage
(downgraded_df.prediction == 1).value_counts()
We were able to find around 4,500 of our 7,000 downgraded offenses. That's about 65% of them.
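That percentage is effectively the recall on the downgraded reports, and you can get it in one line:
# Share of downgraded reports that the classifier flagged as serious
(downgraded_df.prediction == 1).mean()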
Whether this is good or bad is up for discussion at the end, but first let's examine the cases we got wrong. We'll start with crimes involving a machete - a weapon you'd expect to make an assault serious.
df[df.DO_NARRATIVE.str.contains("MACHETE")].head(10)
For those first ten, it looks like we predicted all of them accurately. But what about the machete-related crimes that our algorithm is very certain are non-serious assaults?
df[df.DO_NARRATIVE.str.contains("MACHETE")].sort_values(by='prediction_dist').head(10)
Some of them involve a machete being used as a threat, instead of it actually striking the person. Others involve actual violence with the machete, and seem like they should be classified as serious.
Reading through the descriptions carefully, you'll notice that the last description has a little typo in it - MACHETET instead of MACHETE. If we correct the typo, will the classifier predict it as serious?
sentence = "DO-SUSP GRABBED MACHETET AND SWANG IT AT 3 VICT MISSING VICT S STRUCK V1 ON EYE LEFT WITH CLOSED FIST"
sample_X = vectorizer.transform([sentence])
clf.predict(sample_X)
sentence = "DO-SUSP GRABBED MACHETE AND SWANG IT AT 3 VICT MISSING VICT S STRUCK V1 ON EYE LEFT WITH CLOSED FIST"
sample_X = vectorizer.transform([sentence])
clf.predict(sample_X)
Yes! The vectorizer we used to count words doesn't know that MACHETET is supposed to be MACHETE, so it treated it as a completely different word. It's only because we're aware of spelling errors that we were able to correct it to MACHETE.
While this isn't the case for all of the misclassified reports - for example, the first three would still be incorrectly predicted as non-serious - a small typo can definitely derail our classifier.
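One way to catch this sort of problem is to check which spellings actually made it into the vectorizer's vocabulary. A quick sketch - remember that the terms are stemmed and lowercased, and anything appearing in fewer than 20 reports was dropped by min_df:
# Vocabulary terms that look like "machete" - rare misspellings won't show up at all
[term for term in vectorizer.get_feature_names() if 'machet' in term]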
Examining more general misses#
Let's look at a handful of the reports we predicted incorrectly. We'll sort by using the distance measurement to find the ones that the classifier was very sure were non-serious.
predicted_correctly = (df.prediction == (df.serious | df.downgraded))
df[~predicted_correctly].sort_values(by='prediction_dist').head(15)
While the most important words for the classifier were weapons - baseball bats, knives, guns - causing visible injury through punching can also be classified as aggravated assault. It looks like the classifier doesn't agree.
punch_df = df[df.DO_NARRATIVE.str.contains("PUNCH")]
print("Predicted")
print(punch_df.prediction.value_counts(normalize=True))
print("Actual")
print((punch_df.serious | punch_df.downgraded).value_counts(normalize=True))
While the LAPD classifies assault as aggravated 13% of the time if punching is involved, the algorithm only classifies it as aggravated 5% of the time.
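If you want to read the reports behind that gap, a quick filter (just a sketch) pulls out the punch-related reports that were actually serious or downgraded but that the classifier called non-serious:
# Punch reports that are truly serious (or were downgraded) but predicted as non-serious
missed_punches = punch_df[((punch_df.serious | punch_df.downgraded) == 1) & (punch_df.prediction == 0)]
missed_punches.head()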
Now we can look at the opposite - cases where our predictor was certain something was a serious assault, but it was filed as a simple assault.
df[~predicted_correctly].sort_values(by='prediction_dist', ascending=False).head(n=20)
Again, mostly reports of domestic violence. This time they often involve stabbing and knives, words that we've already seen are strong triggers for a report to be marked as aggravated.
knife_df = df[df.DO_NARRATIVE.str.contains("KNIF")]
print("Predicted")
print(knife_df.prediction.value_counts(normalize=True))
print("Actual")
print((knife_df.serious | knife_df.downgraded).value_counts(normalize=True))
Oddly, though, it looks like the classifier tends to under-report stabbings, not over-report. Although 86% of reports involving the word "STAB" are marked as aggravated, the classifier only marks them as serious 81% of the time.
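A quick cross-tab of the true label against the prediction for these knife reports (just a sketch) shows where the missing cases end up:
# Rows are the true serious/downgraded label, columns are the classifier's prediction
pd.crosstab(knife_df.serious | knife_df.downgraded, knife_df.prediction)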
It appears that domestic abuse cases might be an especially problematic topic in terms of classification, with more going into the situation than just a clear-cut definition of assault categories.
Review#
We reproduced an ersatz version of a Los Angeles Times piece where they uncovered serious assaults that had been downgraded by the LAPD to simple assault. We don't have access to the original classifications, so we used a dataset of assaults between 2008 and 2012 and downgraded a random 15% of the serious assaults.
Using text analysis, we first analyzed the words used in a description of assault - less common words were given more weight, and incredibly common words were left out altogether. Using these results, we then created a classifier, teaching the classifier which words were associated with simple assault compared to aggravated assault.
Finally, we used the classifier to predict whether each assault was aggravated or simple assault. If a crime was predicted as serious but marked as non-serious, it needed to be examined as a possible downgrade. Our algorithm correctly pointed out around 65% of the randomly downgraded crimes.
Discussion topics#
- Our algorithm had 88% accuracy overall, but only 65% in detecting downgraded crimes. What's the difference here? How important is one score compared to the other?
- We only hit around 65% accuracy in finding downgraded crimes. Is this a useful score? How does it compare to random guessing, or going one-by-one through the crimes marked as non-serious?
- What techniques could we have used to find downgraded crimes if we didn't use machine learning?
- Is there a difference between looking at the prediction - the 0 or 1 - and looking at the output of decision_function?
- What happens if our algorithm errs on the side of calling non-serious crimes serious crimes? What if it errs on the side of calling serious crimes non-serious crimes?
- If we want to find more downgraded cases (but do more work), we'll want to err on the side of examining more potentially-serious cases. Is there a better method than picking random cases? (One option is sketched in the code below this list.)
- One of our first steps was to eliminate all crimes that weren't assaults. How do you think this helped or hindered our analysis?
- Why did we use LinearSVC instead of another classifier such as LogisticRegression, RandomForest or Naive Bayes (MultinomialNB)? Why might we try or not try those?
- You don't work for the LAPD, so you can only be so sure what should and shouldn't be a serious crime. What can you do to help feel confident that a case should be one or the other, or that our algorithm is working as promised?
- In this case, we randomly picked serious crimes to downgrade. Would it be easier or more difficult if the LAPD was systematically downgrading certain types of serious crimes? Can you think of a way to get around that sort of trickery?
- Many people say you need to release your data and analysis in order to have people trust what you've done. With something like this dataset, however, you're dealing with real things that happened to real people, many of whom would probably prefer to keep these things private. Is that a reasonable expectation? If it is, what can be done to bridge the gap between releasing all of the original data and keeping our process secret?
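As a starting point for the question about doing better than picking random cases: one option is to rank the reports the classifier called non-serious by how close they sat to the decision boundary, and review the most borderline ones first. A rough sketch, with an arbitrary review budget chosen purely for illustration:
# Reports predicted non-serious, ordered from most borderline (closest to the decision boundary) on down
borderline = df[df.prediction == 0].sort_values(by='prediction_dist', ascending=False)
# An arbitrary review budget of 500 reports, purely for illustration
borderline[['CCDESC', 'DO_NARRATIVE', 'prediction_dist']].head(500)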