Building a crime classification engine
Using machine learning as an investigative tool to cast light on years of underreporting by the Los Angeles Police Department.
Keywords: evaluation metrics, stemming, TF-IDF, sparse data, support vector machines, classification, natural language processing
Readings and links
The Los Angeles Police Department underreported serious assaults for years, classifying them as lower-level crimes instead. In a series of separate pieces, the Los Angeles Times analyzed crime reports first by hand and then using machine learning. They eventually uncovered more than 14,000 serious assaults that had been classified as lower-level simple assaults.
Working from a selection of assaults labeled as either serious or simple, we'll train a classifier to detect which is which based on the words included in police reports. We'll then turn this classifier back on the dataset itself, seeing if it can detect incorrectly categorized assaults.
This chapter builds on the simple text-analysis techniques learned in the previous chapters, moving to more advanced skills like using all of the words instead of a selection, and using "term-frequency inverse-document-frequency" (TF-IDF) to ignore common words.
Notebooks, Assignments, and Walkthroughs
This walkthrough goes through the start-to-finish process of tracking down misclassified crimes. Be warned that the dataset contains multiple descriptions of assaults.
Using LAPD summaries of 160,000 cases of assault, we'll build a classification engine to detect when a report has been downgraded from aggravated assault to simple assault (note that we're the ones performing the downgrading).
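The core of that engine can be sketched in a few lines, assuming scikit-learn with a TF-IDF vectorizer feeding a support vector machine; the toy reports and labels below are invented stand-ins for the real 160,000-case dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented stand-ins for LAPD report summaries
reports = [
    "victim stabbed with knife during argument",
    "suspect fired a handgun at victim",
    "suspect shoved victim during dispute",
    "victim slapped by known suspect",
]
labels = ["aggravated", "aggravated", "simple", "simple"]

# Convert the text into TF-IDF features, then fit a linear SVM
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reports)

clf = LinearSVC()
clf.fit(X, labels)

# Score a new, unseen report with the trained classifier
new_report = vectorizer.transform(["victim attacked with a knife"])
pred = clf.predict(new_report)
print(pred)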
While our classifier performed well, how did it work? Let's examine how our classifier succeeds and fails, and see how that may impact our investigation.
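One common way to examine those successes and failures is a confusion matrix, which counts each kind of error separately; a short sketch with hypothetical predictions (the numbers below are invented, not the chapter's results):

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true labels vs. what a classifier predicted
y_true = ["aggravated", "aggravated", "simple", "simple", "simple", "aggravated"]
y_pred = ["aggravated", "simple", "simple", "simple", "aggravated", "aggravated"]

# Rows are the true class, columns the predicted class
cm = confusion_matrix(y_true, y_pred, labels=["aggravated", "simple"])
print(cm)

# Per-class precision and recall show which direction the errors run
print(classification_report(y_true, y_pred))
```

The off-diagonal cells are the ones that matter for the investigation: an aggravated assault predicted as "simple" is a potential downgrade the classifier missed, while the reverse is a false lead.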
The LA Times used two different classifiers that they felt complemented each other in the investigation. Let's see how different machine learning algorithms compare, and how to combine multiple classifiers to cast the widest net for possible downgrades.
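A rough sketch of that "widest net" idea, assuming scikit-learn: train several different model types on the same TF-IDF features and flag a report for human review if any of them calls it aggravated (the training snippets and the specific trio of models are assumptions for illustration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Invented training data standing in for the labeled reports
reports = [
    "victim stabbed with knife",
    "suspect fired handgun at victim",
    "suspect shoved victim",
    "victim slapped during argument",
]
labels = ["aggravated", "aggravated", "simple", "simple"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reports)

# Three different algorithms trained on the same features
classifiers = [
    LinearSVC(),
    LogisticRegression(),
    RandomForestClassifier(random_state=0),
]
for clf in classifiers:
    clf.fit(X, labels)

# Cast the widest net: flag a report if ANY model calls it aggravated
candidate = vectorizer.transform(["suspect attacked victim with knife"])
flags = [clf.predict(candidate)[0] for clf in classifiers]
flag_for_review = any(f == "aggravated" for f in flags)
print(flags, flag_for_review)
```

Taking the union of the models' flags trades precision for recall: more false leads to check by hand, but fewer missed downgrades.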
Basic data-driven story questions
- What editorial decisions did we make when cleaning our data or putting together this analysis?
- If you wanted to criticize this analysis, what approaches might you take?
Different classifiers provided different results - we tried random forests, logistic regression, and a support vector machine. None of these results matched exactly!
- Which classifier did we have the most faith in, and why?
- How comfortable should we/can we be with our results if we don't have an understanding of the math or technology behind the processes?
- Should/does that lack of understanding limit what we can/should do with our results?
- If we don't have much of an understanding of the lower-level processes, how can we protect ourselves from being wrong?
We've been predicting whether a crime has potentially been misclassified as something of a lower severity. This won't be perfect - no matter how good we get, we'll definitely be wrong with our guesses sometimes. For these questions, it's important to think about what we are doing with these results.
- What are the consequences of us misclassifying crimes? How about for the LAPD?
- What are the consequences of predicting "minor" for a crime that should actually have been classified as "major?"
- Is that different from predicting "major" for a crime that should actually have been classified as "minor?"
- Are there processes in place that might limit the negative repercussions of incorrect assessments? If not, what could be done?
By publishing this analysis, we're releasing a lot of somewhat detailed crime reports.
- Are there any ethical concerns with publishing this dataset?
- If so, what steps could be taken to limit the unintended consequences of publishing this data?