5 Teaching computers to read

Last time we did text analysis, we picked a custom list of words that, if found, might imply sexual assault.

We could do the same thing here, trying to find crimes with especially violent words that were classified as Part II “simple” assault. That’s actually exactly how The LA Times did their original research! In their published piece they say:

Reporters searched the summaries for terms such as “stab” and “knife” to flag incidents that might meet the FBI criteria for serious offenses. They then read thousands of the summaries, which are typically two or three sentences long. They also reviewed court and police records for dozens of cases.

The problem with this is that we would have to guess useful words - like “stab,” “knife,” “gun,” “shot” - and then read through all of the results that come up. But what about other situations, ones that might be less clear-cut to non-experts, or assaults that involved less traditional weapons?

For example, machetes make appearances in plenty of aggravated assaults:

df[df.DO_NARRATIVE.str.contains("MACHETE", na=False)].head(5)

	CCDESC	DO_NARRATIVE	is_part_i	reported
1469	ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT	DO-SUSP STRUCK VICTS NOSE W/MACHETE	1	0
1752	ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT	DO-SUSP HELD MACHETE HANDLE AT SHOULDER HEIGHT AND CHARGED AT VICTIM CLOSING THE DISTANCE	1	1
2339	INTIMATE PARTNER - SIMPLE ASSAULT	DO-S ENGAGED V IN A VERBAL ARGUMENT S CHOKED THE V UNTIL SHE LOST CONSCIOUSNESS ONCE V REGAINED CONSCIOUSNESS THE S PUT A MACHETE TO THE V NECK	0	0
2416	ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT	DO-SUSP CONCEALED VICT AND SWUNG A MACHETE A VICT SUSP CHASED VICT UNTIL VICT FLAGGED DOWN PD SUSP WAS ARRESTED FOR ADW	1	0
2805	ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT	DO-SUSP AND VICT WERE WERE INVOLVED IN ARGUMENT SUSP TOLD VICT HE WAS GOING TO KILL HIM AND STRUCK HIM W A MACHETE	1	1
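The single-word search above extends naturally to a whole list of guessed terms. Here is a minimal sketch of that idea, assuming the same `DO_NARRATIVE` and `is_part_i` columns; the `toy` rows and the `keywords` list are made up for illustration and are not from the real dataset.

```python
import re
import pandas as pd

# Toy stand-in for the real data; these rows are invented for illustration.
toy = pd.DataFrame({
    "DO_NARRATIVE": [
        "DO-SUSP STABBED VICT WITH KNIFE",
        "DO-SUSP PUNCHED VICT IN FACE",
        "DO-SUSP SWUNG MACHETE AT VICT",
        None,
    ],
    "is_part_i": [1, 0, 0, 0],
})

# Hypothetical keyword list: our own guesses, not a vetted set of terms.
keywords = ["STAB", "KNIFE", "GUN", "SHOT", "MACHETE"]
pattern = "|".join(re.escape(word) for word in keywords)

# na=False treats missing narratives as non-matches.
has_keyword = toy.DO_NARRATIVE.str.contains(pattern, na=False)

# Violent-sounding narratives that were nonetheless filed as Part II.
flagged = toy[has_keyword & (toy.is_part_i == 0)]
print(flagged)
```

Every flagged row still has to be read by a person, and the search only ever finds words we thought to include.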

One thing we could do is talk to experts and see what words might be useful to search for. We could also read many, many narratives, and eventually learn some of the more obscure weapons and techniques that might make something a more serious Part I offense. Once we had our more complete list, we could then search the documents for those words and review the cases.

Both of those sound like a lot of work - this is what machine learning was invented for!

Instead of reading thousands of narratives and learning what words are important, we could just have the computer read the narratives for us. The computer can go case-by-case, reading each document and finding the words in it, and then figure out which words are more likely to imply a Part I vs. Part II crime.

When we start, the computer won’t know the difference between “stab” and “punch.” After some training, though, it will actually notice that “stab” appears more often with aggravated assaults, while “punch” is typically for simple assaults. Once we’ve told it to read enough cases, we can give the computer a description of a crime and it’ll be able to guess which type the crime should be classified as!
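To make "noticing" a little more concrete, here is a rough sketch that simply counts how often each word shows up in Part I versus Part II narratives. It is not a real classifier, only the counting intuition behind one; the `cases` list and the `part_i_share` helper are hypothetical and invented for illustration.

```python
from collections import Counter

# Made-up narratives for illustration: (text, is_part_i)
cases = [
    ("SUSP STABBED VICT WITH KNIFE", 1),
    ("SUSP STABBED VICT IN ARM", 1),
    ("SUSP PUNCHED VICT IN FACE", 0),
    ("SUSP PUNCHED AND KICKED VICT", 0),
]

# For each class, count how many narratives contain each word.
counts = {0: Counter(), 1: Counter()}
for text, is_part_i in cases:
    counts[is_part_i].update(set(text.split()))

def part_i_share(word):
    """Fraction of the narratives containing `word` that are Part I."""
    total = counts[0][word] + counts[1][word]
    return counts[1][word] / total if total else None

for word in ["STABBED", "PUNCHED", "VICT"]:
    print(word, part_i_share(word))
```

In this toy data, "STABBED" only appears in Part I cases and "PUNCHED" only in Part II cases, while "VICT" appears in both and tells us nothing. A trained model picks up on the same kind of signal, just across thousands of narratives and words at once.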

5.0.1 Data cleaning

Some of our offenses are missing a description, though. Since we can’t judge the crime classification if there isn’t a description of what happened, we’ll toss those out.

df = df.dropna(subset=['DO_NARRATIVE'])
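Before dropping rows, it can be worth checking how many are actually affected. This is a small sketch on a made-up `toy` dataframe, assuming only the `DO_NARRATIVE` column name from above.

```python
import pandas as pd

# Invented rows for illustration; the second one has no narrative.
toy = pd.DataFrame({
    "DO_NARRATIVE": ["DO-SUSP PUNCHED VICT", None, "DO-SUSP STABBED VICT"],
    "is_part_i": [0, 0, 1],
})

# Count the missing narratives before removing them.
n_missing = toy.DO_NARRATIVE.isna().sum()

# Keep only the rows that have a narrative to read.
cleaned = toy.dropna(subset=["DO_NARRATIVE"])

print(f"Dropped {n_missing} of {len(toy)} rows, {len(cleaned)} remain")
```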