Finding faulty airbags in a sea of consumer complaints by counting words and classifying the results
Topics: Vectorizing text
Datasets
- sampled-labeled.csv: a sample of vehicle complaints, labeled as suspicious or not
What's the goal?
It was too much work to read twenty years of vehicle complaints to find the ones related to dangerous airbags! The last two times we tried to pick out words that separate the dangerous-airbag complaints from the rest, but it didn't go so well because we weren't sure which words were the best ones to pick.
This time we're going to pick everything.
Setup
import pandas as pd
# Allow us to display 100 columns at a time, and 100 characters in each column (instead of ...)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)
Read in our labeled data
We'll start by reading in our complaints that have labels attached to them. Read in sampled-labeled.csv.
labeled = pd.read_csv("data/sampled-labeled.csv")
labeled.head()
|  | is_suspicious | CDESCR |
|---|---|---|
| 0 | 0.0 | ALTHOUGH I LOVED THE CAR OVERALL AT THE TIME I DECIDED TO OWN, , MY DREAM CAR CADILLAC CTS HAS T... |
| 1 | 0.0 | CONSUMER SHUT SLIDING DOOR WHEN ALL POWER LOCKS ON ALL DOORS LOCKED BY ITSELF, TRAPPING INFANT I... |
| 2 | 0.0 | DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT |
| 3 | 0.0 | TL* THE CONTACT OWNS A 2009 NISSAN ALTIMA. THE CONTACT STATED THAT THE START BUTTON FOR THE IGNI... |
| 4 | 0.0 | THE FRONT MIDDLE SEAT DOESN'T LOCK IN PLACE. *AK |
Even though it's called labeled, not all of them have labels. Drop the ones missing labels.
labeled = labeled.dropna()
See how many suspicious/not suspicious comments we have.
labeled.is_suspicious.value_counts()
0.0    150
1.0     15
Name: is_suspicious, dtype: int64
150 non-suspicious and 15 suspicious is a pretty terrible ratio, but we're remarkably lazy and not very many of the comments are actually suspicious.
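If you'd rather see that imbalance as percentages, value_counts can normalize the counts for us:
# Same tally as above, but as fractions: about 91% not suspicious, 9% suspicious
labeled.is_suspicious.value_counts(normalize=True)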
Now that we've read a few, let's train our classifier.
Creating features
Selecting our features and building a features dataframe
Last time, we thought of some words or phrases that might make a comment interesting or not interesting. We came up with this list:
- airbag
- air bag
- failed
- did not deploy
- violent
- explode
- shrapnel
We then built a dataframe with a column for each of those words - 1 if the word appears in the complaint, 0 if it doesn't - along with the is_suspicious label. That process looked like this:
train_df = pd.DataFrame({
'is_suspicious': labeled.is_suspicious,
'airbag': labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int),
'air bag': labeled.CDESCR.str.contains("AIR BAG", na=False).astype(int),
'failed': labeled.CDESCR.str.contains("FAILED", na=False).astype(int),
'did not deploy': labeled.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
'violent': labeled.CDESCR.str.contains("VIOLENT", na=False).astype(int),
'explode': labeled.CDESCR.str.contains("EXPLODE", na=False).astype(int),
'shrapnel': labeled.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
train_df.head()
|  | is_suspicious | airbag | air bag | failed | did not deploy | violent | explode | shrapnel |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
But as we found out later, picking which words are important - feature selection - can be a difficult process. There are a lot of words in there, and it isn't like we're going to go through and look at every single word, right?
Well, actually, it's definitely possible to look at every single word, and it takes way less code than what we did up above.
You can count words using the CountVectorizer from scikit-learn. Using .fit_transform below will learn all of the words in a column, then count them.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(labeled.CDESCR)
vectors
<165x2280 sparse matrix of type '<class 'numpy.int64'>' with 9089 stored elements in Compressed Sparse Row format>
But... what's a "sparse matrix"? We can see something that looks more familiar if we tell it to become an array (basically a list).
vectors.toarray()
array([[0, 0, 0, ..., 3, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
It's still a little hard to understand, but a list of lists? Sounds like a great opportunity for a dataframe!
pd.DataFrame(vectors.toarray())
[a 165 rows × 2280 columns dataframe: one row per complaint, columns numbered 0 through 2279, and almost every cell a 0]
Each row is a sentence, and each column is a word!
- We had 165 sentences, so we now have 165 rows
- There were 2280 words, so we have 2280 columns
If a word appears zero times in a sentence, that column gets a 0. If it appears one or two or twenty times, that number appears in the column instead.
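To make that concrete, here's a tiny example with two invented sentences (they aren't from our dataset) - small enough that we can see the entire matrix:
# pandas and CountVectorizer were already imported above
toy = ["the airbag did not deploy", "the airbag exploded and the car burned"]
toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy)
# "the" shows up twice in the second sentence, so its column gets a 2
pd.DataFrame(toy_vectors.toarray(), columns=toy_vectorizer.get_feature_names())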
The whole sparse matrix thing comes from scipy. The idea is that since the list of lists is mostly empty, Python can be lazy and not keep track of all of the 0s - instead, it only tracks where the non-0 numbers are. A sparse matrix is much more efficient with space if you have a whole lot of 0s!
We used .toarray() to turn it into a list of lists (although if we have a whole lot of words and sentences, our computer might not have enough memory to do it).
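If you're curious just how empty our matrix is, a sparse matrix can report how many non-zero cells it's actually storing. A quick check with the vectors from above:
# 165 rows x 2280 columns = 376,200 cells, but only 9,089 hold a non-zero count
total_cells = vectors.shape[0] * vectors.shape[1]
print(vectors.nnz, "stored values out of", total_cells)
print("only {:.1%} of the matrix is filled in".format(vectors.nnz / total_cells))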
How do we know which column is which word? When we told the vectorizer to count all of the words in each sentence, it also memorized all of the words separately.
print(vectorizer.get_feature_names())
['00', '000', '01', '01v347000', '02', '02v105000', '02v146000', '03', '03v455000', '04', '05', '05v395000', '06', '07', '08', '08v303000', '09', '10', '1000', '10017', '11', '12', '128', '12th', '13', '136', '13v136000', '14', '1420', '15', '150', '15pm', '16', '160lbs', '17', '180', '1996', '1997', '1998', '1999', '1st', '20', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '20k', '20mph', '22', '2300', '24', '25', '2500', '262', '28', '29', '2nd', '30', '300', '30miles', '30mph', '31', '32', '323i', '325xi', '32k', '35', '37', '39', '390', '3k', '3rd', '40', '40mph', '42', '440', '45mph', '48', '49', '4x4', '50', '500', '5000', '50000', '50k', '517', '55', '552', '57', '5th', '60k', '60mph', '65', '65000km', '68', '6th', '70', '71000', '75', '7500', '77', '775', '79', '795', '800', '8004341', '808680', '86', '87', '91', '915', '93k', '94', '98', '981', 'a1', 'aamco', 'able', 'about', 'above', 'abrasion', 'abrasions', 'abs', 'absence', 'absolutely', 'ac', 'accelerate', 'accelerated', 'acceleration', 'access', 'accident', 'accord', 'according', 'accurate', 'acknowledge', 'across', 'act', 'acted', 'action', 'activates', 'activations', 'active', 'actually', 'actuator', 'acura', 'addition', 'additional', 'address', 'addressed', 'adjacent', 'adjusted', 'advance', 'advise', 'advised', 'affairs', 'affecting', 'afford', 'afraid', 'after', 'again', 'against', 'age', 'agency', 'agents', 'ago', 'agony', 'air', 'airbag', 'airbags', 'aircondition', 'ak', 'alive', 'all', 'alley', 'allow', 'almost', 'along', 'already', 'also', 'although', 'altima', 'always', 'am', 'american', 'an', 'and', 'angles', 'another', 'answer', 'antenna', 'anti', 'antifreeze', 'any', 'anymore', 'anyone', 'anything', 'anywhere', 'apart', 'apparent', 'appeal', 'appear', 'appeared', 'appears', 'applied', 'apply', 'applying', 'appointment', 'appreciate', 'appreciated', 'approaching', 'approval', 'approx', 'approximate', 'approximately', 'april', 'aprox', 'are', 'area', 'arise', 'arm', 'arms', 'around', 'as', 'asked', 'assembly', 'assigned', 'assist', 'assistance', 'assume', 'at', 'attachment', 'attachments', 'attempted', 'attempting', 'attempts', 'attention', 'audible', 'aug', 'august', 'augusta', 'authorized', 'auto', 'automatic', 'automobiles', 'avail', 'available', 'avoid', 'aware', 'away', 'awfully', 'awhile', 'axel', 'axle', 'baby', 'back', 'backing', 'backwards', 'bad', 'bag', 'bags', 'bailed', 'ball', 'banged', 'banging', 'bar', 'barely', 'bargained', 'barrier', 'bars', 'battery', 'bc', 'be', 'beam', 'beams', 'became', 'because', 'been', 'beep', 'before', 'beg', 'began', 'behind', 'being', 'believe', 'believed', 'bell', 'belong', 'below', 'belt', 'belts', 'beltway', 'bent', 'benz', 'better', 'between', 'beyond', 'bf', 'bigger', 'binding', 'bit', 'bizarre', 'blew', 'blink', 'blinker', 'blinking', 'block', 'blocks', 'blowing', 'blowout', 'bmw', 'board', 'body', 'bolted', 'bone', 'booster', 'boosters', 'both', 'bother', 'bottom', 'bought', 'bound', 'box', 'bracket', 'brake', 'brakes', 'braking', 'brand', 'break', 'breaking', 'brick', 'bring', 'broadsided', 'broke', 'broken', 'brought', 'bruise', 'bruised', 'bruises', 'bruising', 'buckle', 'buckled', 'bug', 'buick', 'building', 'builds', 'built', 'bump', 'bumper', 'buns', 'buried', 'burned', 'burning', 'burns', 'busy', 'but', 'button', 'buttons', 'buy', 'buying', 'by', 'ca', 'cab', 'cable', 'cadillac', 'cafe', 'caliber', 'caliper', 'calipers', 'call', 'called', 'calls', 'cam', 'came', 'campaign', 
'campaigns', 'camping', 'camry', 'can', 'canada', 'canadian', 'cannot', 'cap', 'car', 'care', 'carefully', 'cares', 'carnival', 'caromed', 'carriers', 'carrying', 'cars', 'case', 'catalytic', 'catch', 'caught', 'cause', 'caused', 'causes', 'causing', 'cb', 'center', 'centre', 'ceo', 'certain', 'certainly', 'chain', 'chandra', 'change', 'changed', 'charge', 'charged', 'cheat', 'check', 'checked', 'cherokee', 'chest', 'chevrolet', 'chevy', 'child', 'children', 'chimes', 'chin', 'choose', 'chrysler', 'cinergy', 'circle', 'circuit', 'circuits', 'claim', 'claimed', 'clash', 'classic', 'clear', 'clearly', 'climb', 'clock', 'clockspring', 'close', 'closed', 'closure', 'clue', 'cn', 'co', 'coasting', 'codes', 'coil', 'coils', 'coincidentally', 'cold', 'collapse', 'collapsed', 'collapsing', 'collide', 'collided', 'collision', 'collison', 'column', 'com', 'combi', 'come', 'comes', 'coming', 'comment', 'common', 'company', 'compartment', 'compensation', 'complained', 'complaint', 'complaints', 'complete', 'completely', 'component', 'compressor', 'compromises', 'computer', 'concern', 'concerned', 'concerning', 'concerns', 'concord', 'concrete', 'concussion', 'condition', 'conditioner', 'conditions', 'conducted', 'confirm', 'consider', 'considerably', 'considered', 'console', 'constantly', 'consume', 'consumer', 'consumers', 'contact', 'contacted', 'contemplating', 'continue', 'continued', 'continues', 'contributed', 'control', 'controls', 'converter', 'cool', 'cooler', 'cooling', 'corner', 'corolla', 'corollas', 'corporation', 'correct', 'corrode', 'corrosion', 'cost', 'costs', 'could', 'country', 'county', 'couple', 'course', 'court', 'cover', 'covered', 'crack', 'cracked', 'cracking', 'crash', 'crashed', 'crazy', 'critical', 'cronic', 'cross', 'crossed', 'crossing', 'crossroads', 'cruise', 'cruising', 'crumpled', 'crv', 'cts', 'cupping', 'curb', 'curbing', 'current', 'currently', 'currents', 'curtain', 'customer', 'cut', 'cuts', 'cutting', 'cylinder', 'd4', 'daimler', 'damage', 'damaged', 'danger', 'dangerous', 'dash', 'dashboard', 'date', 'dating', 'daughter', 'day', 'days', 'daytime', 'dazed', 'dead', 'deadly', 'dealer', 'dealers', 'dealership', 'dealerships', 'dear', 'death', 'december', 'decent', 'decided', 'decides', 'decision', 'declared', 'deemed', 'deer', 'defect', 'defective', 'defects', 'defog', 'defogger', 'defrost', 'degree', 'degrees', 'delay', 'demons', 'denied', 'denies', 'dented', 'department', 'deploy', 'deployed', 'deploying', 'deployment', 'describe', 'design', 'despite', 'destroyed', 'destructive', 'detached', 'details', 'detect', 'determine', 'determined', 'developed', 'diagnose', 'diagnosed', 'diagnosis', 'diagnostic', 'diagnostics', 'did', 'didn', 'die', 'died', 'differ', 'different', 'difficult', 'difficulty', 'digital', 'ding', 'direct', 'directed', 'dirt', 'disabled', 'discovered', 'discs', 'discuss', 'dismantled', 'dispite', 'display', 'displayed', 'dissappointed', 'distance', 'distribution', 'ditch', 'do', 'doctor', 'documented', 'doddge', 'does', 'doesn', 'dog', 'dollars', 'don', 'done', 'door', 'doors', 'down', 'dream', 'drivable', 'drive', 'driven', 'driver', 'drivers', 'driveway', 'driving', 'drivng', 'drove', 'dry', 'dt', 'dual', 'due', 'duplicate', 'during', 'dust', 'duty', 'dvd', 'e320', 'ea13003', 'each', 'ear', 'early', 'ears', 'edmunds', 'either', 'elderly', 'electrical', 'electronic', 'electronics', 'elk', 'else', 'emailed', 'embankment', 'emergency', 'emitted', 'emptied', 'empty', 'en', 'encountered', 'end', 'ended', 'engaged', 'engine', 'enough', 'entered', 
'entering', 'enters', 'entertainment', 'entiire', 'entire', 'equipment', 'equipped', 'era', 'erratic', 'erratically', 'error', 'especially', 'esserman', 'estimated', 'et', 'etc', 'even', 'event', 'events', 'ever', 'every', 'everyday', 'everyone', 'everything', 'everywhere', 'evidence', 'evident', 'evidently', 'exact', 'examined', 'except', 'exceptional', 'excessive', 'exhaust', 'exists', 'exit', 'expedition', 'expense', 'expensive', 'experience', 'experienced', 'experiences', 'experiencing', 'expired', 'explained', 'explains', 'exploded', 'explorer', 'explosions', 'explosive', 'extended', 'extensive', 'extra', 'extremely', 'eye', 'face', 'facets', 'facility', 'facing', 'fact', 'factory', 'fades', 'fail', 'failed', 'failing', 'fails', 'failure', 'failures', 'faint', 'fairly', 'fallen', 'false', 'family', 'fan', 'far', 'fast', 'fatal', 'fate', 'father', 'fault', 'faults', 'faulty', 'fax', 'feature', 'features', 'february', 'federally', 'fee', 'feeding', 'feel', 'feels', 'feet', 'fell', 'felt', 'fence', 'fender', 'few', 'fiance', 'fifteen', 'fight', 'figure', 'file', 'filed', 'filler', 'filling', 'filter', 'final', 'finally', 'financially', 'find', 'fine', 'fire', 'firestone', 'firing', 'first', 'fit', 'five', 'fix', 'fixed', 'flagship', 'flashing', 'flaw', 'floor', 'flying', 'fm', 'fog', 'foia', 'fold', 'following', 'foot', 'for', 'force', 'forced', 'ford', 'forearm', 'fortunately', 'forum', 'forums', 'forward', 'found', 'four', 'fraction', 'fractured', 'frame', 'free', 'freedom', 'freon', 'from', 'front', 'frontal', 'frustrated', 'fuel', 'full', 'fully', 'function', 'functional', 'funtion', 'further', 'fuse', 'future', 'ga', 'garage', 'gas', 'gasket', 'gasoline', 'gate', 'gauge', 'gauges', 'gb250', 'gear', 'gears', 'get', 'gets', 'getting', 'give', 'given', 'glass', 'glove', 'gm', 'gmc', 'go', 'god', 'goes', 'going', 'golfs', 'gone', 'good', 'goodness', 'got', 'gotten', 'grace', 'grand', 'graph', 'gravel', 'green', 'grill', 'grinding', 'grip', 'grooves', 'ground', 'grove', 'gto', 'guard', 'had', 'hamilton', 'hand', 'handle', 'handles', 'handling', 'hands', 'happen', 'happened', 'happening', 'happens', 'hard', 'hardware', 'harness', 'harnessing', 'has', 'hasn', 'have', 'haven', 'having', 'hazard', 'hazardous', 'hd', 'he', 'head', 'headlight', 'headlights', 'headliner', 'hear', 'heard', 'heated', 'heater', 'heating', 'heavy', 'held', 'help', 'hence', 'her', 'here', 'hesitant', 'hesitated', 'high', 'higher', 'highway', 'him', 'hindered', 'hindering', 'hinge', 'hinges', 'his', 'hit', 'hitting', 'hold', 'holding', 'holds', 'holes', 'hollow', 'home', 'honda', 'honor', 'honored', 'hooked', 'hops', 'horn', 'horror', 'hose', 'hospital', 'hot', 'hour', 'hours', 'house', 'how', 'howe', 'however', 'hows', 'hub', 'huge', 'humid', 'humidity', 'hundreds', 'hurry', 'hurt', 'husband', 'husbands', 'hutchinson', 'hydraulic', 'hyosung', 'hyundai', 'i35s', 'i95', 'iahwan', 'id', 'idea', 'identified', 'ie', 'if', 'ignition', 'ii', 'illuminated', 'illuminating', 'im', 'imbursement', 'immediately', 'impact', 'impacting', 'impacts', 'impala', 'impeding', 'importation', 'in', 'inaccurate', 'inactive', 'inadvertent', 'inch', 'incident', 'include', 'included', 'including', 'indeed', 'independent', 'indianapolis', 'indicate', 'indicated', 'indicating', 'indication', 'indicator', 'indicators', 'infant', 'infiniti', 'inflate', 'inflater', 'information', 'informed', 'initial', 'initially', 'injet', 'injuires', 'injured', 'injuries', 'injuring', 'injury', 'inoperable', 'inserted', 'inside', 'insight', 'insists', 'inspect', 
'inspected', 'inspecting', 'inspection', 'inspector', 'installed', 'instead', 'instrument', 'insurance', 'integral', 'intended', 'interest', 'interior', 'intermittent', 'intermittently', 'internal', 'international', 'intersection', 'intersections', 'interstate', 'interval', 'intervals', 'interventions', 'into', 'intrusive', 'investigate', 'investigated', 'investigation', 'invoice', 'involve', 'involved', 'invoved', 'iraq', 'is', 'isn', 'isolated', 'issue', 'issued', 'issues', 'it', 'itbstruck', 'item', 'items', 'its', 'itself', 'jackets', 'january', 'japanese', 'jarred', 'jb', 'jeep', 'jerked', 'jersey', 'jetta', 'job', 'joint', 'js', 'juice', 'jump', 'jumping', 'june', 'just', 'justify', 'k2500', 'kb', 'keep', 'keeps', 'kept', 'key', 'kia', 'kicked', 'kids', 'kill', 'killed', 'killing', 'kind', 'kit', 'kits', 'km', 'kms', 'knee', 'knees', 'knew', 'knock', 'knocked', 'know', 'known', 'knows', 'la', 'labor', 'lacerations', 'lamps', 'lane', 'lanes', 'lap', 'laredo', 'large', 'last', 'lasting', 'latch', 'later', 'launch', 'lawn', 'laying', 'leading', 'leak', 'leakage', 'leaking', 'leaks', 'leaned', 'leased', 'least', 'leather', 'leave', 'leaving', 'left', 'leg', 'legs', 'lehmer', 'lengths', 'lesion', 'let', 'letter', 'level', 'lever', 'lied', 'life', 'lift', 'lifter', 'liftgate', 'light', 'lights', 'like', 'likely', 'lincoln', 'line', 'lines', 'link', 'list', 'listed', 'lit', 'literally', 'little', 'lives', 'lj', 'local', 'located', 'location', 'lock', 'locked', 'locking', 'locks', 'logical', 'long', 'longer', 'looked', 'looks', 'loose', 'lose', 'losing', 'loss', 'lost', 'lot', 'loud', 'loved', 'low', 'lower', 'luckily', 'lunch', 'lurched', 'luxury', 'ma', 'made', 'mail', 'main', 'maintain', 'maintained', 'maintenance', 'major', 'make', 'makes', 'making', 'malfunction', 'malfunctioning', 'malfunctions', 'malfuntioned', 'malibu', 'manager', 'maneuvering', 'manifold', 'manual', 'manually', 'manufacture', 'manufactured', 'manufacturer', 'many', 'march', 'market', 'massive', 'master', 'matrix', 'matter', 'maxima', 'may', 'maybe', 'mbrusman', 'mdx', 'me', 'means', 'mechanic', 'mechanical', 'mechanism', 'mechanisms', 'median', 'mediation', 'mediator', 'mention', 'mentioned', 'mercedes', 'merging', 'message', 'met', 'metal', 'meters', 'middle', 'mileage', 'mileages', 'miles', 'mine', 'minor', 'minutes', 'miraculously', 'mishap', 'missing', 'ml', 'model', 'models', 'moderate', 'module', 'moisture', 'molding', 'moldings', 'moment', 'money', 'month', 'months', 'more', 'morning', 'most', 'mostly', 'mother', 'motion', 'motor', 'motorcycle', 'mountain', 'mounting', 'mouth', 'mph', 'mr', 'much', 'multiple', 'murano', 'must', 'mustang', 'my', 'myself', 'na', 'name', 'nature', 'nc', 'near', 'neck', 'need', 'needed', 'needle', 'needles', 'needs', 'neighborhood', 'neither', 'nerve', 'nerves', 'net', 'never', 'new', 'newer', 'news', 'next', 'nhtsa', 'nice', 'night', 'nightmare', 'nissan', 'nj', 'nm', 'no', 'nobody', 'noise', 'noises', 'non', 'none', 'nor', 'normal', 'north', 'not', 'noted', 'nothing', 'notice', 'noticeable', 'noticed', 'notified', 'notifying', 'november', 'now', 'number', 'numbers', 'numerous', 'oakland', 'object', 'objects', 'obtain', 'obvious', 'obviously', 'occasion', 'occasions', 'occupant', 'occupants', 'occurred', 'occurrence', 'occurring', 'occurs', 'ocs', 'october', 'odi', 'odometer', 'odyssey', 'of', 'off', 'offer', 'offered', 'office', 'officer', 'official', 'often', 'oil', 'ok', 'old', 'on', 'once', 'oncoming', 'one', 'ones', 'oneself', 'ongoing', 'online', 'only', 'onto', 'oo', 
'op', 'open', 'opened', 'operate', 'operates', 'operation', 'opinion', 'opposite', 'optional', 'or', 'order', 'ordered', 'organs', 'original', 'orthopedic', 'other', 'others', 'otherwise', 'ounce', 'our', 'ours', 'out', 'outcome', 'outside', 'over', 'overall', 'overheating', 'overnight', 'own', 'owned', 'owner', 'owners', 'owns', 'oxygen', 'p225', 'pads', 'paid', 'pain', 'paint', 'panel', 'panic', 'park', 'parked', 'parking', 'parkway', 'part', 'partial', 'particular', 'parts', 'passanger', 'passat', 'passats', 'passed', 'passenger', 'passengers', 'passing', 'past', 'pavement', 'pay', 'pe', 'pedal', 'pedestrians', 'people', 'per', 'perfect', 'perfectly', 'performed', 'perhaps', 'period', 'periodic', 'permantly', 'persist', 'person', 'ph', 'phone', 'picked', 'pickup', 'pictures', 'pieces', 'pillar', 'pinion', 'pipe', 'pixels', 'place', 'placed', 'places', 'plan', 'plastic', 'play', 'please', 'plenty', 'plymouth', 'pocket', 'pockets', 'point', 'poles', 'police', 'pond', 'pontiac', 'pop', 'popped', 'popping', 'portion', 'position', 'positioned', 'possibilities', 'possible', 'possibly', 'postal', 'posted', 'potential', 'potentially', 'pothole', 'powder', 'power', 'powertrain', 'preoccupied', 'pressed', 'pressing', 'pressure', 'prevented', 'previous', 'prior', 'private', 'probably', 'problem', 'problems', 'process', 'produce', 'product', 'products', 'professional', 'program', 'prominent', 'promise', 'promised', 'prompted', 'proof', 'properly', 'protect', 'protection', 'proveout', 'provide', 'provincially', 'provoke', 'public', 'puddle', 'pull', 'pulled', 'pulley', 'pulling', 'pump', 'purchase', 'purchased', 'pursuant', 'pursuing', 'push', 'pushed', 'pushing', 'put', 'putting', 'quality', 'quart', 'quarter', 'quick', 'quickly', 'quite', 'quits', 'quote', 'r16', 'rack', 'radio', 'rail', 'raining', 'ran', 'random', 'randomly', 'range', 'rate', 'rather', 'rating', 'rattling', 'rav', 'rayed', 're', 'read', 'reading', 'readings', 'reads', 'realized', 'really', 'rear', 'rearfacing', 'reason', 'reasonable', 'reasoning', 'reasons', 'recall', 'recalled', 'recalls', 'receive', 'received', 'recent', 'recently', 'recode', 'recommended', 'record', 'recording', 'records', 'recovered', 'rectifying', 'recurred', 'recurring', 'red', 'redacted', 'redo', 'referred', 'refused', 'refuses', 'refusing', 'regarding', 'regards', 'regional', 'regular', 'regularly', 'reimbursement', 'related', 'relations', 'relay', 'releasing', 'relied', 'rely', 'remain', 'remained', 'remains', 'remedy', 'remember', 'reoccur', 'reoccurring', 'reopened', 'repair', 'repaired', 'repairs', 'repeated', 'repeatedly', 'replace', 'replaced', 'replacement', 'replacing', 'report', 'reported', 'reproduce', 'reps', 'request', 'requested', 'requires', 'requiring', 'research', 'reserve', 'reset', 'resistance', 'resolve', 'respond', 'responded', 'responsibility', 'responsible', 'rest', 'restart', 'restarted', 'restrain', 'restraint', 'result', 'resulted', 'resulting', 'resurface', 'retract', 'retractor', 'return', 'returned', 'returning', 'rewire', 'rhd', 'ribbon', 'ride', 'ridgeline', 'riding', 'right', 'rims', 'ringing', 'rings', 'ripping', 'risk', 'river', 'road', 'roadside', 'rod', 'roll', 'rollover', 'rondo', 'rotate', 'rotors', 'roughly', 'route', 'rt', 'rte', 'rubbing', 'ruined', 'run', 'running', 'runnings', 'rural', 'ruralinfo', 'rust', 'rusted', 'sacramento', 'sadly', 'safe', 'safely', 'safety', 'safey', 'said', 'salesman', 'salesperson', 'same', 'sanctioned', 'saturn', 'save', 'saved', 'saw', 'say', 'says', 'sc', 'scam', 'scared', 'school', 
'scn', 'screen', 'screw', 'scrutinized', 'sd', 'seal', 'seat', 'seatbelt', 'seatbelted', 'seatbelts', 'sebring', 'second', 'secondary', 'secure', 'security', 'see', 'seeing', 'seem', 'seemed', 'seems', 'seen', 'selling', 'semi', 'send', 'sensor', 'sensors', 'sent', 'separation', 'sept', 'sequoia', 'series', 'serious', 'seriously', 'serpentine', 'service', 'serviced', 'set', 'several', 'severe', 'shaft', 'shake', 'shared', 'shattering', 'she', 'sheered', 'shield', 'shift', 'shifting', 'shipping', 'shock', 'shocked', 'shocking', 'shop', 'short', 'shortly', 'should', 'shoulder', 'shouldn', 'show', 'showed', 'showing', 'shows', 'shrapnel', 'shrubs', 'shudder', 'shut', 'shutting', 'side', 'sideroof', 'sides', 'sideways', 'sign', 'signal', 'signals', 'significant', 'significantly', 'silent', 'silverado', 'similar', 'simply', 'simultaneously', 'since', 'single', 'sit', 'sited', 'sitting', 'situated', 'situation', 'six', 'skid', 'skidded', 'skull', 'slam', 'slammed', 'slc', 'sliding', 'slip', 'slipping', 'slither', 'slow', 'slowed', 'slowing', 'slumped', 'small', 'smashed', 'smd', 'smell', 'smoke', 'snapped', 'so', 'software', 'solara', 'solstice', 'solutions', 'some', 'someone', 'something', 'sometimes', 'somewhere', 'sonata', 'soon', 'soooo', 'sore', 'soreness', 'sorry', 'sort', 'sound', 'sounds', 'source', 'space', 'spanning', 'specialists', 'specific', 'speed', 'speeding', 'speedometer', 'spin', 'spiral', 'split', 'sport', 'spot', 'spots', 'sprained', 'spring', 'springclock', 'springs', 'spun', 'sputtered', 'srs', 'st', 'stabilitrack', 'stability', 'staff', 'stalled', 'stalls', 'stance', 'standards', 'staples', 'start', 'started', 'starting', 'starts', 'state', 'stated', 'statedon', 'states', 'stations', 'stay', 'stayed', 'stays', 'steel', 'steer', 'steering', 'stem', 'stick', 'still', 'stitches', 'stomach', 'stop', 'stopped', 'stopping', 'store', 'straight', 'stranded', 'strange', 'streeing', 'street', 'strike', 'strongly', 'struck', 'stuck', 'subject', 'submitted', 'subsequent', 'substantial', 'suburban', 'such', 'sudden', 'suddenly', 'suffer', 'suffered', 'summer', 'sun', 'sunroof', 'supply', 'support', 'supposed', 'supposedly', 'sure', 'surged', 'surprise', 'surprised', 'suspected', 'suspension', 'sustain', 'sustained', 'suv', 'sway', 'switch', 'swollen', 'symptoms', 'system', 'systems', 'tags', 'tail', 'tailgage', 'tailpipe', 'takata', 'take', 'taken', 'taking', 'talked', 'tank', 'tap', 'taurus', 'taylor', 'tcs', 'tear', 'technicians', 'tee', 'tell', 'telling', 'temp', 'temperature', 'tends', 'tennessee', 'terrible', 'terribly', 'test', 'tested', 'tgw', 'than', 'thank', 'thankfully', 'thanks', 'that', 'the', 'their', 'them', 'then', 'there', 'thereafter', 'these', 'they', 'thing', 'things', 'think', 'third', 'this', 'thoroughfare', 'those', 'though', 'thought', 'thousands', 'three', 'throttle', 'through', 'thrown', 'ticked', 'ticket', 'tie', 'tightened', 'tightening', 'tilt', 'time', 'times', 'tinted', 'tire', 'tires', 'tl', 'to', 'today', 'together', 'told', 'tomorrow', 'tone', 'too', 'took', 'top', 'total', 'totaled', 'totalled', 'totally', 'tow', 'toward', 'towards', 'towed', 'town', 'toyota', 'tr', 'trac', 'track', 'traction', 'trade', 'traffic', 'trailer', 'transfer', 'transferred', 'transmission', 'transportation', 'trapping', 'trauma', 'traveled', 'traveling', 'tread', 'tree', 'tried', 'trigger', 'triggering', 'trip', 'trips', 'troubleshoot', 'truck', 'trucks', 'trunk', 'trust', 'try', 'trying', 'ts', 'tt', 'turbo', 'turn', 'turned', 'turning', 'turns', 'twice', 'twitted', 'two', 
'type', 'tyre', 'umbrellas', 'un', 'unable', 'under', 'underneath', 'understand', 'understanding', 'unexpected', 'unexpectedly', 'unfortunately', 'unique', 'unit', 'unity', 'unknown', 'unless', 'unlock', 'unreadable', 'unreturned', 'unsafe', 'unsure', 'until', 'unwarranted', 'unwilling', 'up', 'update', 'updated', 'updates', 'upload', 'upon', 'upper', 'upright', 'upset', 'us', 'usage', 'use', 'used', 'using', 'usps', 'usually', 'vacation', 'van', 'vancouver', 'vans', 've', 'veer', 'veered', 'veering', 'vehicle', 'vehicles', 'verbal', 'vertebra', 'very', 'vibrate', 'video', 'view', 'vin', 'violated', 'violently', 'visibility', 'visit', 'visiting', 'voice', 'volkswagen', 'volvo', 'voyager', 'vsc', 'vulnerable', 'vw', 'wade', 'wait', 'waited', 'waiting', 'walked', 'wall', 'want', 'wanted', 'wants', 'warm', 'warms', 'warned', 'warning', 'warnings', 'warped', 'warrant', 'warranty', 'was', 'wasn', 'watched', 'water', 'way', 'we', 'weak', 'wear', 'wearing', 'weather', 'website', 'week', 'weeks', 'welding', 'well', 'went', 'were', 'weren', 'westbound', 'wet', 'what', 'wheel', 'when', 'where', 'whether', 'which', 'while', 'whiplash', 'who', 'whole', 'why', 'wider', 'wife', 'wiggling', 'will', 'willing', 'wilson', 'wind', 'window', 'windows', 'windshield', 'wiper', 'wipers', 'wires', 'wiring', 'wished', 'with', 'within', 'without', 'withstand', 'witnesses', 'won', 'wonder', 'woosh', 'word', 'work', 'working', 'works', 'worn', 'worse', 'worsened', 'worst', 'worth', 'would', 'wouldn', 'wrangler', 'wreck', 'wrecks', 'wrist', 'write', 'writes', 'writing', 'written', 'wrong', 'xterra', 'xxx', 'yards', 'yc', 'year', 'years', 'yes', 'yet', 'yield', 'york', 'you', 'your', 'zero', 'zone']
Big secret: The "fit" part of .fit_transform means "learn the words." The "transform" part means "count them."
You can take advantage of this list to build a nice-looking dataframe:
pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
[the same 165 rows × 2280 columns of counts, but now each column is labeled with its word, from 00 and 000 on the left all the way to zero and zone on the right]
Only counting with ones and zeros
It doesn't seem to matter too much whether a word shows up one or two or twenty times in a complaint - the only important thing is whether yes it shows up or no it doesn't show up.
To turn the counting into just 0s and 1s, we send an extra option to our CountVectorizer.
vectorizer = CountVectorizer(binary=True)
vectors = vectorizer.fit_transform(labeled.CDESCR)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
words_df.head()
[the first 5 rows of words_df: the same word-labeled columns (00 through zone), but every cell is now a 0 or a 1 - 5 rows × 2280 columns]
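If you want to confirm that binary=True did its job, the largest value anywhere in the dataframe should now be a 1:
# Every cell should be a 0 or a 1 now
words_df.max().max()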
Using our new dataframe in machine learning
We really like random forests now, right? They're more or less a big collection of decision trees, and they usually give pretty good results.
Let's try one out with our new every-single-word features.
Hot tip: a vector is just a list of numbers (for example, each row). A matrix is a list of vectors.
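In plain Python terms, that hot tip looks something like this (just an illustration, not code we'll use below):
vector = [0, 1, 0, 2]      # one row of word counts is a vector
matrix = [[0, 1, 0, 2],    # stack a few vectors together...
          [3, 0, 0, 1]]    # ...and you have a matrix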
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
Usually we do .drop to get rid of the label, but when we counted all of our words the label column (whether it's suspicious or not) didn't carry over. Instead, we'll just use the is_suspicious column from our original dataframe, the one with the actual text.
X = words_df
y = labeled.is_suspicious
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
Confusion matrix
With all of those incredible features, how did it do?
y_true = y
y_pred = clf.predict(X)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names)
|  | Predicted not suspicious | Predicted suspicious |
|---|---|---|
| Is not suspicious | 150 | 0 |
| Is suspicious | 0 | 15 |
Amazing!!! 100% accuracy!!! Loving it!!!
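You can double-check that with .score, which reports accuracy on whatever data you hand it - here, the very same data the classifier trained on (keep that detail in mind, it's about to matter):
# Accuracy on data the classifier has already seen: a perfect 1.0
clf.score(X, y)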
What did the random forest think were the important features?
import eli5
feature_names = list(X.columns)
# If you get a warning about judging the classifier, you can use this line instead
# eli5.show_weights(clf, feature_names=feature_names, show=eli5.formatters.fields.ALL)
eli5.show_weights(clf, feature_names=feature_names)
Weight | Feature |
---|---|
0.0184 ± 0.1117 | deployed |
0.0163 ± 0.0944 | pulling |
0.0163 ± 0.0904 | burns |
0.0147 ± 0.1022 | burning |
0.0146 ± 0.0926 | degree |
0.0141 ± 0.0701 | sunroof |
0.0129 ± 0.0798 | school |
0.0116 ± 0.0725 | 1st |
0.0110 ± 0.0776 | apart |
0.0107 ± 0.0736 | zone |
0.0102 ± 0.0564 | driver |
0.0086 ± 0.0574 | problem |
0.0085 ± 0.0610 | killing |
0.0083 ± 0.0573 | unexpectedly |
0.0081 ± 0.0537 | further |
0.0076 ± 0.0667 | sputtered |
0.0074 ± 0.0597 | 2nd |
0.0073 ± 0.0698 | street |
0.0071 ± 0.0611 | chin |
0.0068 ± 0.0556 | suffered |
… 2260 more … |
Sure, sure, that all makes sense.
No, wait! Let's train-test split
Oh boy, we totally forgot about train-test split - we were testing the classifier on complaints it had already seen. Let's split our data into a training set and a test set and try again.
from sklearn.model_selection import train_test_split
X = words_df
y = labeled.is_suspicious
X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names)
|  | Predicted not suspicious | Predicted suspicious |
|---|---|---|
| Is not suspicious | 39 | 0 |
| Is suspicious | 3 | 0 |
Oh no, that's horrible. That's terrible. Let's try looking at our feature importances, just to see if it's making dumb decisions.
eli5.show_weights(clf, feature_names=feature_names)
Weight | Feature |
---|---|
0.0146 ± 0.0872 | face |
0.0145 ± 0.0830 | problem |
0.0142 ± 0.1067 | deployed |
0.0123 ± 0.0874 | passenger |
0.0118 ± 0.0873 | killing |
0.0116 ± 0.0922 | burns |
0.0099 ± 0.0797 | ripping |
0.0097 ± 0.0830 | mouth |
0.0096 ± 0.0842 | malfunction |
0.0094 ± 0.0926 | suffered |
0.0093 ± 0.0692 | his |
0.0092 ± 0.0857 | both |
0.0092 ± 0.0720 | degree |
0.0089 ± 0.0855 | chin |
0.0088 ± 0.0666 | resulting |
0.0087 ± 0.0546 | unexpectedly |
0.0087 ± 0.0585 | 1st |
0.0085 ± 0.0635 | further |
0.0084 ± 0.0695 | 2nd |
0.0078 ± 0.0910 | apart |
… 2260 more … |
I mean, it makes sense, I guess. Even though we added all those new features, why doesn't it work well?
Trying again with a Logistic Classifier
Well, if there's one thing we know how to do, it's try again and again with different classifiers until something works. Let's see if a logistic classifier works any better!
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X_train, y_train)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names)
|  | Predicted not suspicious | Predicted suspicious |
|---|---|---|
| Is not suspicious | 39 | 0 |
| Is suspicious | 3 | 0 |
Just as bad! Sadly, even with this much information the classifier couldn't find a useful pattern. We can feel good about how explainable it is, though.
eli5.show_weights(clf, feature_names=feature_names, target_names=['not suspicious', 'suspicious'])
y=suspicious top features
Weight? | Feature |
---|---|
+3.187 | deployed |
+2.961 | passenger |
+2.145 | degree |
+2.017 | problem |
+1.869 | 1st |
+1.840 | 2nd |
+1.840 | hands |
+1.818 | face |
+1.772 | burns |
+1.612 | provide |
+1.377 | further |
… 859 more positive … | |
… 1155 more negative … | |
-1.376 | traveling |
-1.390 | light |
-1.465 | brake |
-1.510 | pads |
-1.546 | front |
-2.400 | is |
-2.888 | did |
-2.924 | not |
-6.441 | <BIAS> |
Review
While last time we hand-picked words for our classifier to pay attention to, this time we used a vectorizer to throw in all of the words. We figured that more information was better information, and we wouldn't even have to flag more complaints!
Unfortunately our classifier still didn't really find any suspicious complaints.
Discussion topics
Brainstorm reasons why more information didn't save us.
In classification problems, when might you want to hand-pick words, and when might you want to use a vectorizer? Compare this airbag situation with sentiment analysis of tweets and with separating sci-fi from romance novels.