5.2 Vectorizing
At the same time we tokenize, we’re probably also counting how many times each word appears. The process of turning text into numbers is called vectorization, and it’s a necessary step in helping computers understand text.
Every single time you deal with text you’ll need to vectorize it, otherwise the computer won’t know what’s going on!
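At its simplest, vectorization is just counting words. Before we hand the job over to scikit-learn, here's a hand-rolled sketch using only Python's standard library (the sentence is made up, but mimics the narratives below):

```python
from collections import Counter

# A tiny, do-it-yourself version of vectorization: split a
# sentence into words, then count how often each one shows up.
sentence = "SUSP PUSHED THE VICT AND PUSHED THE WITNESS"
counts = Counter(sentence.lower().split())

print(counts['pushed'])  # 2
print(counts['vict'])    # 1
```

Once every sentence is a bag of numbers like this, the computer can compare them, even though it has no idea what any of the words mean.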
Before we vectorize, let’s take a look at the first few cases to see what kinds of words we might be working with. There are lots of arguments and pushing, along with a couple of appearances of handguns and knives.
 | DO_NARRATIVE |
---|---|
0 | DO-SUSPS PULL UP NEXT TO VICT IN VEH SUSP1 SUSP2 SUSP3 EXIT VEH RUSH VICTSUSP1 PRODUCED FOLDING KNIFE AND STABBED VICT IN STOMACH SUSPS FLEE IN VEH |
1 | DO-VICT AND SUSP HAVE 2 CHILDREN IN COMMON BOTH INV IN A VEBAL ARGUMENT SUSP BECOMES IRATE AND HITS VICT |
2 | DO-SUSP PUSHED THE VICT AND SPANKED VICTIM APPROX THREE TIMES NOT CAUSING VISIBLE INJURY |
3 | DO-S1 V1 HAVE AND ALTERCATION OVER MONEY S1 BECAME ANGRY WITH V1 FOR NOT GIVING HER MONEY S1 THEN GOT ON TOP OF V1 AND ATTEMP TO WRESTLE THE MONEY AWAY |
4 | DO-S WAS VERBALLY CONFRONTED BY V WHO WAS ACCROSS THE STREET AFTER S DOG DEFACATED S APPROACHED V AND HIT PR HAND |
When we vectorize these descriptions, we get a new dataset. It’s formatted in a pretty technical way, but if you massage the results a little, you can get a nice dataframe to show you which words appear in which sentences.
Every column is a single word that the computer found somewhere in the descriptions:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(binary=True)
X = vec.fit_transform(df.DO_NARRATIVE)
word_appearances = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
word_appearances[['knife', 'argument', 'handgun', 'wrestle', 'pushed', 'injuries', 'fired', 'irate', 'umbrella']].head(10)
knife | argument | handgun | wrestle | pushed | injuries | fired | irate | umbrella |
---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
`0` means the word wasn’t found in that sentence, while `1` means it was. Compare each row with the sentences up above to see that yes, it worked!
I only picked a few words here because a lot of the words are pretty much garbage: misspellings, or words that only appear in one or two narratives. Even though I cleaned up the results a little bit, looking at them with our nice human brains is an exercise in getting a headache:
## Index(['09083', '09084', '092312', '10', '100', '1000', '100ft', '100yrs',
## '101', '101st', '102212', '102nd', '103rd', '105', '105th', '106th',
## '107th', '10860', '10861', '108th', '109th', '10feet', '10ft', '10inch',
## '10mos', '10mths', '10th', '10times', '10x', '10xs'],
## dtype='object')
Luckily computers don’t mind sorting the useful words from the useless words, so we don’t need to feel too guilty about the next steps.