Finding faulty airbags in a sea of consumer complaints by counting words and classifying the results#

Topics: Vectorizing text

Datasets

  • sampled-labeled.csv: a sample of vehicle complaints, labeled as suspicious or not

What's the goal?#

It was too much work to read twenty years of vehicle complaints to find the ones related to dangerous airbags! The last two times we tried to pick out words that signal dangerous/not dangerous airbags, but it didn't go so well because we weren't sure which words were the best ones to pick.

This time we're going to pick everything.

Setup#

import pandas as pd

# Allow us to display 100 columns at a time, and 100 characters in each column (instead of ...)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)

Read in our labeled data#

We'll start by reading in our complaints that have labels attached to them. Read in sampled-labeled.csv.

labeled = pd.read_csv("data/sampled-labeled.csv")
labeled.head()
is_suspicious CDESCR
0 0.0 ALTHOUGH I LOVED THE CAR OVERALL AT THE TIME I DECIDED TO OWN, , MY DREAM CAR CADILLAC CTS HAS T...
1 0.0 CONSUMER SHUT SLIDING DOOR WHEN ALL POWER LOCKS ON ALL DOORS LOCKED BY ITSELF, TRAPPING INFANT I...
2 0.0 DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT
3 0.0 TL* THE CONTACT OWNS A 2009 NISSAN ALTIMA. THE CONTACT STATED THAT THE START BUTTON FOR THE IGNI...
4 0.0 THE FRONT MIDDLE SEAT DOESN'T LOCK IN PLACE. *AK

Even though it's called labeled, not all of them have labels. Drop the ones missing labels.

labeled = labeled.dropna()

See how many suspicious/not suspicious comments we have.

labeled.is_suspicious.value_counts()
0.0    150
1.0     15
Name: is_suspicious, dtype: int64

150 non-suspicious and 15 suspicious is a pretty terrible ratio, but we're remarkably lazy and not very many of the comments are actually suspicious.
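That ratio is easier to appreciate as percentages. A quick sketch on a stand-in Series (in the notebook this would be `labeled.is_suspicious`):

```python
import pandas as pd

# Stand-in for labeled.is_suspicious: 150 zeros and 15 ones
is_suspicious = pd.Series([0.0] * 150 + [1.0] * 15)

# normalize=True turns the counts into proportions
shares = is_suspicious.value_counts(normalize=True)
print(shares)
# roughly 91% not suspicious, 9% suspicious
```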

Now that we've read a few, let's train our classifier

Creating features#

Selecting our features and building a features dataframe#

Last time, we thought of some words or phrases that might make a comment interesting or not interesting. We came up with this list:

  • airbag
  • air bag
  • failed
  • did not deploy
  • violent
  • explode
  • shrapnel

We then built a dataframe with a column for each of those words - 1 if the word shows up in the complaint, 0 if it doesn't - along with the is_suspicious label. That process looked like this:

train_df = pd.DataFrame({
    'is_suspicious': labeled.is_suspicious,
    'airbag': labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': labeled.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': labeled.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': labeled.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': labeled.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': labeled.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': labeled.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
train_df.head()
is_suspicious airbag air bag failed did not deploy violent explode shrapnel
0 0.0 0 0 0 0 0 0 0
1 0.0 0 0 0 0 0 0 0
2 0.0 0 0 0 0 0 0 0
3 0.0 0 0 0 0 0 0 0
4 0.0 0 0 0 0 0 0 0
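If typing out each `.str.contains` line feels repetitive, the same dataframe can be built in a loop. A sketch with two invented complaints standing in for the real data:

```python
import pandas as pd

# Two made-up complaints standing in for labeled.CDESCR
labeled = pd.DataFrame({
    "is_suspicious": [0.0, 1.0],
    "CDESCR": ["SEAT DOESN'T LOCK IN PLACE", "AIR BAG DID NOT DEPLOY"],
})

words = ["AIRBAG", "AIR BAG", "FAILED", "DID NOT DEPLOY",
         "VIOLENT", "EXPLODE", "SHRAPNEL"]

# Build the same 0/1 columns in a loop instead of typing each one out
train_df = pd.DataFrame({"is_suspicious": labeled.is_suspicious})
for word in words:
    train_df[word.lower()] = labeled.CDESCR.str.contains(word, na=False).astype(int)

print(train_df)
```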

But as we found out later, picking which words are important - feature selection - can be a difficult process. There are a lot of words in there, and it isn't like we're going to go through and look at every single word, right?

Well, actually, it's definitely possible to look at every single word, and it takes way less code than what we did up above.

You can count words using the CountVectorizer from scikit-learn. Using .fit_transform below will learn all of the words in a column, then count them.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

vectors = vectorizer.fit_transform(labeled.CDESCR)
vectors
<165x2280 sparse matrix of type '<class 'numpy.int64'>'
	with 9089 stored elements in Compressed Sparse Row format>

But... what's a "sparse matrix"? We can see something that looks more familiar if we tell it to become an array (basically a list).

vectors.toarray()
array([[0, 0, 0, ..., 3, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

It's still a little hard to read, but a list of lists? Sounds like a great opportunity for a dataframe!

pd.DataFrame(vectors.toarray())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 ... 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 3 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
160 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
161 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
162 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
163 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
164 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

165 rows × 2280 columns

Each row is a sentence, and each column is a word!

  • We had 165 sentences, so we now have 165 rows
  • There were 2280 words, so we have 2280 columns

If a word appears zero times in a sentence, that column gets a 0. If it appears one or two or twenty times, that number appears in the column instead.
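To see exactly what those rows and columns mean, here's the same vectorizer run on a tiny made-up corpus (both sentences are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny two-document corpus so the whole matrix fits on screen
texts = [
    "AIRBAG DID NOT DEPLOY",
    "THE AIRBAG EXPLODED AND THE AIRBAG BURNED ME",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts).toarray()

# Columns follow the sorted vocabulary; rows are documents
print(sorted(vectorizer.vocabulary_))
print(counts)
```

Since "AIRBAG" shows up twice in the second sentence, that row gets a 2 in the airbag column.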

The whole sparse matrix thing comes from scipy. It's the idea that since the list of lists was mostly empty, Python can be lazy and not keep track of all of the 0s - instead, it only tracks where there are non-0 numbers. A sparse matrix is much more efficient with space if you have a lot lot lot of 0's!

We used .toarray() to turn it into a list of lists (although if we have a lot lot lot of words and sentences, our computer might not have enough memory to do it).
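A minimal sketch of that space-saving idea (sparse matrices actually live in scipy.sparse, and convert to and from numpy arrays):

```python
import numpy as np
from scipy import sparse

# A mostly-zero matrix: a million cells, only two of them non-zero
dense = np.zeros((1000, 1000), dtype=np.int64)
dense[0, 0] = 5
dense[500, 250] = 2

# The sparse version only stores the non-zero entries
sp = sparse.csr_matrix(dense)
print(sp.nnz)      # number of stored (non-zero) elements: 2
print(dense.size)  # the dense version tracks all 1,000,000 cells
```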

How do we know which column is which word? When we told the vectorizer to count all of the words in each sentence, it also memorized all of the words separately. You can ask for them with .get_feature_names() (renamed .get_feature_names_out() in newer versions of scikit-learn).

print(vectorizer.get_feature_names())
['00', '000', '01', '01v347000', '02', '02v105000', '02v146000', '03', '03v455000', '04', '05', '05v395000', '06', '07', '08', '08v303000', '09', '10', '1000', '10017', '11', '12', '128', '12th', '13', '136', '13v136000', '14', '1420', '15', '150', '15pm', '16', '160lbs', '17', '180', '1996', '1997', '1998', '1999', '1st', '20', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '20k', '20mph', '22', '2300', '24', '25', '2500', '262', '28', '29', '2nd', '30', '300', '30miles', '30mph', '31', '32', '323i', '325xi', '32k', '35', '37', '39', '390', '3k', '3rd', '40', '40mph', '42', '440', '45mph', '48', '49', '4x4', '50', '500', '5000', '50000', '50k', '517', '55', '552', '57', '5th', '60k', '60mph', '65', '65000km', '68', '6th', '70', '71000', '75', '7500', '77', '775', '79', '795', '800', '8004341', '808680', '86', '87', '91', '915', '93k', '94', '98', '981', 'a1', 'aamco', 'able', 'about', 'above', 'abrasion', 'abrasions', 'abs', 'absence', 'absolutely', 'ac', 'accelerate', 'accelerated', 'acceleration', 'access', 'accident', 'accord', 'according', 'accurate', 'acknowledge', 'across', 'act', 'acted', 'action', 'activates', 'activations', 'active', 'actually', 'actuator', 'acura', 'addition', 'additional', 'address', 'addressed', 'adjacent', 'adjusted', 'advance', 'advise', 'advised', 'affairs', 'affecting', 'afford', 'afraid', 'after', 'again', 'against', 'age', 'agency', 'agents', 'ago', 'agony', 'air', 'airbag', 'airbags', 'aircondition', 'ak', 'alive', 'all', 'alley', 'allow', 'almost', 'along', 'already', 'also', 'although', 'altima', 'always', 'am', 'american', 'an', 'and', 'angles', 'another', 'answer', 'antenna', 'anti', 'antifreeze', 'any', 'anymore', 'anyone', 'anything', 'anywhere', 'apart', 'apparent', 'appeal', 'appear', 'appeared', 'appears', 'applied', 'apply', 'applying', 'appointment', 'appreciate', 'appreciated', 'approaching', 'approval', 'approx', 'approximate', 'approximately', 
'april', 'aprox', 'are', 'area', 'arise', 'arm', 'arms', 'around', 'as', 'asked', 'assembly', 'assigned', 'assist', 'assistance', 'assume', 'at', 'attachment', 'attachments', 'attempted', 'attempting', 'attempts', 'attention', 'audible', 'aug', 'august', 'augusta', 'authorized', 'auto', 'automatic', 'automobiles', 'avail', 'available', 'avoid', 'aware', 'away', 'awfully', 'awhile', 'axel', 'axle', 'baby', 'back', 'backing', 'backwards', 'bad', 'bag', 'bags', 'bailed', 'ball', 'banged', 'banging', 'bar', 'barely', 'bargained', 'barrier', 'bars', 'battery', 'bc', 'be', 'beam', 'beams', 'became', 'because', 'been', 'beep', 'before', 'beg', 'began', 'behind', 'being', 'believe', 'believed', 'bell', 'belong', 'below', 'belt', 'belts', 'beltway', 'bent', 'benz', 'better', 'between', 'beyond', 'bf', 'bigger', 'binding', 'bit', 'bizarre', 'blew', 'blink', 'blinker', 'blinking', 'block', 'blocks', 'blowing', 'blowout', 'bmw', 'board', 'body', 'bolted', 'bone', 'booster', 'boosters', 'both', 'bother', 'bottom', 'bought', 'bound', 'box', 'bracket', 'brake', 'brakes', 'braking', 'brand', 'break', 'breaking', 'brick', 'bring', 'broadsided', 'broke', 'broken', 'brought', 'bruise', 'bruised', 'bruises', 'bruising', 'buckle', 'buckled', 'bug', 'buick', 'building', 'builds', 'built', 'bump', 'bumper', 'buns', 'buried', 'burned', 'burning', 'burns', 'busy', 'but', 'button', 'buttons', 'buy', 'buying', 'by', 'ca', 'cab', 'cable', 'cadillac', 'cafe', 'caliber', 'caliper', 'calipers', 'call', 'called', 'calls', 'cam', 'came', 'campaign', 'campaigns', 'camping', 'camry', 'can', 'canada', 'canadian', 'cannot', 'cap', 'car', 'care', 'carefully', 'cares', 'carnival', 'caromed', 'carriers', 'carrying', 'cars', 'case', 'catalytic', 'catch', 'caught', 'cause', 'caused', 'causes', 'causing', 'cb', 'center', 'centre', 'ceo', 'certain', 'certainly', 'chain', 'chandra', 'change', 'changed', 'charge', 'charged', 'cheat', 'check', 'checked', 'cherokee', 'chest', 'chevrolet', 'chevy', 'child', 
'children', 'chimes', 'chin', 'choose', 'chrysler', 'cinergy', 'circle', 'circuit', 'circuits', 'claim', 'claimed', 'clash', 'classic', 'clear', 'clearly', 'climb', 'clock', 'clockspring', 'close', 'closed', 'closure', 'clue', 'cn', 'co', 'coasting', 'codes', 'coil', 'coils', 'coincidentally', 'cold', 'collapse', 'collapsed', 'collapsing', 'collide', 'collided', 'collision', 'collison', 'column', 'com', 'combi', 'come', 'comes', 'coming', 'comment', 'common', 'company', 'compartment', 'compensation', 'complained', 'complaint', 'complaints', 'complete', 'completely', 'component', 'compressor', 'compromises', 'computer', 'concern', 'concerned', 'concerning', 'concerns', 'concord', 'concrete', 'concussion', 'condition', 'conditioner', 'conditions', 'conducted', 'confirm', 'consider', 'considerably', 'considered', 'console', 'constantly', 'consume', 'consumer', 'consumers', 'contact', 'contacted', 'contemplating', 'continue', 'continued', 'continues', 'contributed', 'control', 'controls', 'converter', 'cool', 'cooler', 'cooling', 'corner', 'corolla', 'corollas', 'corporation', 'correct', 'corrode', 'corrosion', 'cost', 'costs', 'could', 'country', 'county', 'couple', 'course', 'court', 'cover', 'covered', 'crack', 'cracked', 'cracking', 'crash', 'crashed', 'crazy', 'critical', 'cronic', 'cross', 'crossed', 'crossing', 'crossroads', 'cruise', 'cruising', 'crumpled', 'crv', 'cts', 'cupping', 'curb', 'curbing', 'current', 'currently', 'currents', 'curtain', 'customer', 'cut', 'cuts', 'cutting', 'cylinder', 'd4', 'daimler', 'damage', 'damaged', 'danger', 'dangerous', 'dash', 'dashboard', 'date', 'dating', 'daughter', 'day', 'days', 'daytime', 'dazed', 'dead', 'deadly', 'dealer', 'dealers', 'dealership', 'dealerships', 'dear', 'death', 'december', 'decent', 'decided', 'decides', 'decision', 'declared', 'deemed', 'deer', 'defect', 'defective', 'defects', 'defog', 'defogger', 'defrost', 'degree', 'degrees', 'delay', 'demons', 'denied', 'denies', 'dented', 'department', 
'deploy', 'deployed', 'deploying', 'deployment', 'describe', 'design', 'despite', 'destroyed', 'destructive', 'detached', 'details', 'detect', 'determine', 'determined', 'developed', 'diagnose', 'diagnosed', 'diagnosis', 'diagnostic', 'diagnostics', 'did', 'didn', 'die', 'died', 'differ', 'different', 'difficult', 'difficulty', 'digital', 'ding', 'direct', 'directed', 'dirt', 'disabled', 'discovered', 'discs', 'discuss', 'dismantled', 'dispite', 'display', 'displayed', 'dissappointed', 'distance', 'distribution', 'ditch', 'do', 'doctor', 'documented', 'doddge', 'does', 'doesn', 'dog', 'dollars', 'don', 'done', 'door', 'doors', 'down', 'dream', 'drivable', 'drive', 'driven', 'driver', 'drivers', 'driveway', 'driving', 'drivng', 'drove', 'dry', 'dt', 'dual', 'due', 'duplicate', 'during', 'dust', 'duty', 'dvd', 'e320', 'ea13003', 'each', 'ear', 'early', 'ears', 'edmunds', 'either', 'elderly', 'electrical', 'electronic', 'electronics', 'elk', 'else', 'emailed', 'embankment', 'emergency', 'emitted', 'emptied', 'empty', 'en', 'encountered', 'end', 'ended', 'engaged', 'engine', 'enough', 'entered', 'entering', 'enters', 'entertainment', 'entiire', 'entire', 'equipment', 'equipped', 'era', 'erratic', 'erratically', 'error', 'especially', 'esserman', 'estimated', 'et', 'etc', 'even', 'event', 'events', 'ever', 'every', 'everyday', 'everyone', 'everything', 'everywhere', 'evidence', 'evident', 'evidently', 'exact', 'examined', 'except', 'exceptional', 'excessive', 'exhaust', 'exists', 'exit', 'expedition', 'expense', 'expensive', 'experience', 'experienced', 'experiences', 'experiencing', 'expired', 'explained', 'explains', 'exploded', 'explorer', 'explosions', 'explosive', 'extended', 'extensive', 'extra', 'extremely', 'eye', 'face', 'facets', 'facility', 'facing', 'fact', 'factory', 'fades', 'fail', 'failed', 'failing', 'fails', 'failure', 'failures', 'faint', 'fairly', 'fallen', 'false', 'family', 'fan', 'far', 'fast', 'fatal', 'fate', 'father', 'fault', 'faults', 
'faulty', 'fax', 'feature', 'features', 'february', 'federally', 'fee', 'feeding', 'feel', 'feels', 'feet', 'fell', 'felt', 'fence', 'fender', 'few', 'fiance', 'fifteen', 'fight', 'figure', 'file', 'filed', 'filler', 'filling', 'filter', 'final', 'finally', 'financially', 'find', 'fine', 'fire', 'firestone', 'firing', 'first', 'fit', 'five', 'fix', 'fixed', 'flagship', 'flashing', 'flaw', 'floor', 'flying', 'fm', 'fog', 'foia', 'fold', 'following', 'foot', 'for', 'force', 'forced', 'ford', 'forearm', 'fortunately', 'forum', 'forums', 'forward', 'found', 'four', 'fraction', 'fractured', 'frame', 'free', 'freedom', 'freon', 'from', 'front', 'frontal', 'frustrated', 'fuel', 'full', 'fully', 'function', 'functional', 'funtion', 'further', 'fuse', 'future', 'ga', 'garage', 'gas', 'gasket', 'gasoline', 'gate', 'gauge', 'gauges', 'gb250', 'gear', 'gears', 'get', 'gets', 'getting', 'give', 'given', 'glass', 'glove', 'gm', 'gmc', 'go', 'god', 'goes', 'going', 'golfs', 'gone', 'good', 'goodness', 'got', 'gotten', 'grace', 'grand', 'graph', 'gravel', 'green', 'grill', 'grinding', 'grip', 'grooves', 'ground', 'grove', 'gto', 'guard', 'had', 'hamilton', 'hand', 'handle', 'handles', 'handling', 'hands', 'happen', 'happened', 'happening', 'happens', 'hard', 'hardware', 'harness', 'harnessing', 'has', 'hasn', 'have', 'haven', 'having', 'hazard', 'hazardous', 'hd', 'he', 'head', 'headlight', 'headlights', 'headliner', 'hear', 'heard', 'heated', 'heater', 'heating', 'heavy', 'held', 'help', 'hence', 'her', 'here', 'hesitant', 'hesitated', 'high', 'higher', 'highway', 'him', 'hindered', 'hindering', 'hinge', 'hinges', 'his', 'hit', 'hitting', 'hold', 'holding', 'holds', 'holes', 'hollow', 'home', 'honda', 'honor', 'honored', 'hooked', 'hops', 'horn', 'horror', 'hose', 'hospital', 'hot', 'hour', 'hours', 'house', 'how', 'howe', 'however', 'hows', 'hub', 'huge', 'humid', 'humidity', 'hundreds', 'hurry', 'hurt', 'husband', 'husbands', 'hutchinson', 'hydraulic', 'hyosung', 'hyundai', 
'i35s', 'i95', 'iahwan', 'id', 'idea', 'identified', 'ie', 'if', 'ignition', 'ii', 'illuminated', 'illuminating', 'im', 'imbursement', 'immediately', 'impact', 'impacting', 'impacts', 'impala', 'impeding', 'importation', 'in', 'inaccurate', 'inactive', 'inadvertent', 'inch', 'incident', 'include', 'included', 'including', 'indeed', 'independent', 'indianapolis', 'indicate', 'indicated', 'indicating', 'indication', 'indicator', 'indicators', 'infant', 'infiniti', 'inflate', 'inflater', 'information', 'informed', 'initial', 'initially', 'injet', 'injuires', 'injured', 'injuries', 'injuring', 'injury', 'inoperable', 'inserted', 'inside', 'insight', 'insists', 'inspect', 'inspected', 'inspecting', 'inspection', 'inspector', 'installed', 'instead', 'instrument', 'insurance', 'integral', 'intended', 'interest', 'interior', 'intermittent', 'intermittently', 'internal', 'international', 'intersection', 'intersections', 'interstate', 'interval', 'intervals', 'interventions', 'into', 'intrusive', 'investigate', 'investigated', 'investigation', 'invoice', 'involve', 'involved', 'invoved', 'iraq', 'is', 'isn', 'isolated', 'issue', 'issued', 'issues', 'it', 'itbstruck', 'item', 'items', 'its', 'itself', 'jackets', 'january', 'japanese', 'jarred', 'jb', 'jeep', 'jerked', 'jersey', 'jetta', 'job', 'joint', 'js', 'juice', 'jump', 'jumping', 'june', 'just', 'justify', 'k2500', 'kb', 'keep', 'keeps', 'kept', 'key', 'kia', 'kicked', 'kids', 'kill', 'killed', 'killing', 'kind', 'kit', 'kits', 'km', 'kms', 'knee', 'knees', 'knew', 'knock', 'knocked', 'know', 'known', 'knows', 'la', 'labor', 'lacerations', 'lamps', 'lane', 'lanes', 'lap', 'laredo', 'large', 'last', 'lasting', 'latch', 'later', 'launch', 'lawn', 'laying', 'leading', 'leak', 'leakage', 'leaking', 'leaks', 'leaned', 'leased', 'least', 'leather', 'leave', 'leaving', 'left', 'leg', 'legs', 'lehmer', 'lengths', 'lesion', 'let', 'letter', 'level', 'lever', 'lied', 'life', 'lift', 'lifter', 'liftgate', 'light', 'lights', 
'like', 'likely', 'lincoln', 'line', 'lines', 'link', 'list', 'listed', 'lit', 'literally', 'little', 'lives', 'lj', 'local', 'located', 'location', 'lock', 'locked', 'locking', 'locks', 'logical', 'long', 'longer', 'looked', 'looks', 'loose', 'lose', 'losing', 'loss', 'lost', 'lot', 'loud', 'loved', 'low', 'lower', 'luckily', 'lunch', 'lurched', 'luxury', 'ma', 'made', 'mail', 'main', 'maintain', 'maintained', 'maintenance', 'major', 'make', 'makes', 'making', 'malfunction', 'malfunctioning', 'malfunctions', 'malfuntioned', 'malibu', 'manager', 'maneuvering', 'manifold', 'manual', 'manually', 'manufacture', 'manufactured', 'manufacturer', 'many', 'march', 'market', 'massive', 'master', 'matrix', 'matter', 'maxima', 'may', 'maybe', 'mbrusman', 'mdx', 'me', 'means', 'mechanic', 'mechanical', 'mechanism', 'mechanisms', 'median', 'mediation', 'mediator', 'mention', 'mentioned', 'mercedes', 'merging', 'message', 'met', 'metal', 'meters', 'middle', 'mileage', 'mileages', 'miles', 'mine', 'minor', 'minutes', 'miraculously', 'mishap', 'missing', 'ml', 'model', 'models', 'moderate', 'module', 'moisture', 'molding', 'moldings', 'moment', 'money', 'month', 'months', 'more', 'morning', 'most', 'mostly', 'mother', 'motion', 'motor', 'motorcycle', 'mountain', 'mounting', 'mouth', 'mph', 'mr', 'much', 'multiple', 'murano', 'must', 'mustang', 'my', 'myself', 'na', 'name', 'nature', 'nc', 'near', 'neck', 'need', 'needed', 'needle', 'needles', 'needs', 'neighborhood', 'neither', 'nerve', 'nerves', 'net', 'never', 'new', 'newer', 'news', 'next', 'nhtsa', 'nice', 'night', 'nightmare', 'nissan', 'nj', 'nm', 'no', 'nobody', 'noise', 'noises', 'non', 'none', 'nor', 'normal', 'north', 'not', 'noted', 'nothing', 'notice', 'noticeable', 'noticed', 'notified', 'notifying', 'november', 'now', 'number', 'numbers', 'numerous', 'oakland', 'object', 'objects', 'obtain', 'obvious', 'obviously', 'occasion', 'occasions', 'occupant', 'occupants', 'occurred', 'occurrence', 'occurring', 'occurs', 
'ocs', 'october', 'odi', 'odometer', 'odyssey', 'of', 'off', 'offer', 'offered', 'office', 'officer', 'official', 'often', 'oil', 'ok', 'old', 'on', 'once', 'oncoming', 'one', 'ones', 'oneself', 'ongoing', 'online', 'only', 'onto', 'oo', 'op', 'open', 'opened', 'operate', 'operates', 'operation', 'opinion', 'opposite', 'optional', 'or', 'order', 'ordered', 'organs', 'original', 'orthopedic', 'other', 'others', 'otherwise', 'ounce', 'our', 'ours', 'out', 'outcome', 'outside', 'over', 'overall', 'overheating', 'overnight', 'own', 'owned', 'owner', 'owners', 'owns', 'oxygen', 'p225', 'pads', 'paid', 'pain', 'paint', 'panel', 'panic', 'park', 'parked', 'parking', 'parkway', 'part', 'partial', 'particular', 'parts', 'passanger', 'passat', 'passats', 'passed', 'passenger', 'passengers', 'passing', 'past', 'pavement', 'pay', 'pe', 'pedal', 'pedestrians', 'people', 'per', 'perfect', 'perfectly', 'performed', 'perhaps', 'period', 'periodic', 'permantly', 'persist', 'person', 'ph', 'phone', 'picked', 'pickup', 'pictures', 'pieces', 'pillar', 'pinion', 'pipe', 'pixels', 'place', 'placed', 'places', 'plan', 'plastic', 'play', 'please', 'plenty', 'plymouth', 'pocket', 'pockets', 'point', 'poles', 'police', 'pond', 'pontiac', 'pop', 'popped', 'popping', 'portion', 'position', 'positioned', 'possibilities', 'possible', 'possibly', 'postal', 'posted', 'potential', 'potentially', 'pothole', 'powder', 'power', 'powertrain', 'preoccupied', 'pressed', 'pressing', 'pressure', 'prevented', 'previous', 'prior', 'private', 'probably', 'problem', 'problems', 'process', 'produce', 'product', 'products', 'professional', 'program', 'prominent', 'promise', 'promised', 'prompted', 'proof', 'properly', 'protect', 'protection', 'proveout', 'provide', 'provincially', 'provoke', 'public', 'puddle', 'pull', 'pulled', 'pulley', 'pulling', 'pump', 'purchase', 'purchased', 'pursuant', 'pursuing', 'push', 'pushed', 'pushing', 'put', 'putting', 'quality', 'quart', 'quarter', 'quick', 'quickly', 'quite', 
'quits', 'quote', 'r16', 'rack', 'radio', 'rail', 'raining', 'ran', 'random', 'randomly', 'range', 'rate', 'rather', 'rating', 'rattling', 'rav', 'rayed', 're', 'read', 'reading', 'readings', 'reads', 'realized', 'really', 'rear', 'rearfacing', 'reason', 'reasonable', 'reasoning', 'reasons', 'recall', 'recalled', 'recalls', 'receive', 'received', 'recent', 'recently', 'recode', 'recommended', 'record', 'recording', 'records', 'recovered', 'rectifying', 'recurred', 'recurring', 'red', 'redacted', 'redo', 'referred', 'refused', 'refuses', 'refusing', 'regarding', 'regards', 'regional', 'regular', 'regularly', 'reimbursement', 'related', 'relations', 'relay', 'releasing', 'relied', 'rely', 'remain', 'remained', 'remains', 'remedy', 'remember', 'reoccur', 'reoccurring', 'reopened', 'repair', 'repaired', 'repairs', 'repeated', 'repeatedly', 'replace', 'replaced', 'replacement', 'replacing', 'report', 'reported', 'reproduce', 'reps', 'request', 'requested', 'requires', 'requiring', 'research', 'reserve', 'reset', 'resistance', 'resolve', 'respond', 'responded', 'responsibility', 'responsible', 'rest', 'restart', 'restarted', 'restrain', 'restraint', 'result', 'resulted', 'resulting', 'resurface', 'retract', 'retractor', 'return', 'returned', 'returning', 'rewire', 'rhd', 'ribbon', 'ride', 'ridgeline', 'riding', 'right', 'rims', 'ringing', 'rings', 'ripping', 'risk', 'river', 'road', 'roadside', 'rod', 'roll', 'rollover', 'rondo', 'rotate', 'rotors', 'roughly', 'route', 'rt', 'rte', 'rubbing', 'ruined', 'run', 'running', 'runnings', 'rural', 'ruralinfo', 'rust', 'rusted', 'sacramento', 'sadly', 'safe', 'safely', 'safety', 'safey', 'said', 'salesman', 'salesperson', 'same', 'sanctioned', 'saturn', 'save', 'saved', 'saw', 'say', 'says', 'sc', 'scam', 'scared', 'school', 'scn', 'screen', 'screw', 'scrutinized', 'sd', 'seal', 'seat', 'seatbelt', 'seatbelted', 'seatbelts', 'sebring', 'second', 'secondary', 'secure', 'security', 'see', 'seeing', 'seem', 'seemed', 'seems', 
'seen', 'selling', 'semi', 'send', 'sensor', 'sensors', 'sent', 'separation', 'sept', 'sequoia', 'series', 'serious', 'seriously', 'serpentine', 'service', 'serviced', 'set', 'several', 'severe', 'shaft', 'shake', 'shared', 'shattering', 'she', 'sheered', 'shield', 'shift', 'shifting', 'shipping', 'shock', 'shocked', 'shocking', 'shop', 'short', 'shortly', 'should', 'shoulder', 'shouldn', 'show', 'showed', 'showing', 'shows', 'shrapnel', 'shrubs', 'shudder', 'shut', 'shutting', 'side', 'sideroof', 'sides', 'sideways', 'sign', 'signal', 'signals', 'significant', 'significantly', 'silent', 'silverado', 'similar', 'simply', 'simultaneously', 'since', 'single', 'sit', 'sited', 'sitting', 'situated', 'situation', 'six', 'skid', 'skidded', 'skull', 'slam', 'slammed', 'slc', 'sliding', 'slip', 'slipping', 'slither', 'slow', 'slowed', 'slowing', 'slumped', 'small', 'smashed', 'smd', 'smell', 'smoke', 'snapped', 'so', 'software', 'solara', 'solstice', 'solutions', 'some', 'someone', 'something', 'sometimes', 'somewhere', 'sonata', 'soon', 'soooo', 'sore', 'soreness', 'sorry', 'sort', 'sound', 'sounds', 'source', 'space', 'spanning', 'specialists', 'specific', 'speed', 'speeding', 'speedometer', 'spin', 'spiral', 'split', 'sport', 'spot', 'spots', 'sprained', 'spring', 'springclock', 'springs', 'spun', 'sputtered', 'srs', 'st', 'stabilitrack', 'stability', 'staff', 'stalled', 'stalls', 'stance', 'standards', 'staples', 'start', 'started', 'starting', 'starts', 'state', 'stated', 'statedon', 'states', 'stations', 'stay', 'stayed', 'stays', 'steel', 'steer', 'steering', 'stem', 'stick', 'still', 'stitches', 'stomach', 'stop', 'stopped', 'stopping', 'store', 'straight', 'stranded', 'strange', 'streeing', 'street', 'strike', 'strongly', 'struck', 'stuck', 'subject', 'submitted', 'subsequent', 'substantial', 'suburban', 'such', 'sudden', 'suddenly', 'suffer', 'suffered', 'summer', 'sun', 'sunroof', 'supply', 'support', 'supposed', 'supposedly', 'sure', 'surged', 'surprise', 
'surprised', 'suspected', 'suspension', 'sustain', 'sustained', 'suv', 'sway', 'switch', 'swollen', 'symptoms', 'system', 'systems', 'tags', 'tail', 'tailgage', 'tailpipe', 'takata', 'take', 'taken', 'taking', 'talked', 'tank', 'tap', 'taurus', 'taylor', 'tcs', 'tear', 'technicians', 'tee', 'tell', 'telling', 'temp', 'temperature', 'tends', 'tennessee', 'terrible', 'terribly', 'test', 'tested', 'tgw', 'than', 'thank', 'thankfully', 'thanks', 'that', 'the', 'their', 'them', 'then', 'there', 'thereafter', 'these', 'they', 'thing', 'things', 'think', 'third', 'this', 'thoroughfare', 'those', 'though', 'thought', 'thousands', 'three', 'throttle', 'through', 'thrown', 'ticked', 'ticket', 'tie', 'tightened', 'tightening', 'tilt', 'time', 'times', 'tinted', 'tire', 'tires', 'tl', 'to', 'today', 'together', 'told', 'tomorrow', 'tone', 'too', 'took', 'top', 'total', 'totaled', 'totalled', 'totally', 'tow', 'toward', 'towards', 'towed', 'town', 'toyota', 'tr', 'trac', 'track', 'traction', 'trade', 'traffic', 'trailer', 'transfer', 'transferred', 'transmission', 'transportation', 'trapping', 'trauma', 'traveled', 'traveling', 'tread', 'tree', 'tried', 'trigger', 'triggering', 'trip', 'trips', 'troubleshoot', 'truck', 'trucks', 'trunk', 'trust', 'try', 'trying', 'ts', 'tt', 'turbo', 'turn', 'turned', 'turning', 'turns', 'twice', 'twitted', 'two', 'type', 'tyre', 'umbrellas', 'un', 'unable', 'under', 'underneath', 'understand', 'understanding', 'unexpected', 'unexpectedly', 'unfortunately', 'unique', 'unit', 'unity', 'unknown', 'unless', 'unlock', 'unreadable', 'unreturned', 'unsafe', 'unsure', 'until', 'unwarranted', 'unwilling', 'up', 'update', 'updated', 'updates', 'upload', 'upon', 'upper', 'upright', 'upset', 'us', 'usage', 'use', 'used', 'using', 'usps', 'usually', 'vacation', 'van', 'vancouver', 'vans', 've', 'veer', 'veered', 'veering', 'vehicle', 'vehicles', 'verbal', 'vertebra', 'very', 'vibrate', 'video', 'view', 'vin', 'violated', 'violently', 'visibility', 'visit', 
'visiting', 'voice', 'volkswagen', 'volvo', 'voyager', 'vsc', 'vulnerable', 'vw', 'wade', 'wait', 'waited', 'waiting', 'walked', 'wall', 'want', 'wanted', 'wants', 'warm', 'warms', 'warned', 'warning', 'warnings', 'warped', 'warrant', 'warranty', 'was', 'wasn', 'watched', 'water', 'way', 'we', 'weak', 'wear', 'wearing', 'weather', 'website', 'week', 'weeks', 'welding', 'well', 'went', 'were', 'weren', 'westbound', 'wet', 'what', 'wheel', 'when', 'where', 'whether', 'which', 'while', 'whiplash', 'who', 'whole', 'why', 'wider', 'wife', 'wiggling', 'will', 'willing', 'wilson', 'wind', 'window', 'windows', 'windshield', 'wiper', 'wipers', 'wires', 'wiring', 'wished', 'with', 'within', 'without', 'withstand', 'witnesses', 'won', 'wonder', 'woosh', 'word', 'work', 'working', 'works', 'worn', 'worse', 'worsened', 'worst', 'worth', 'would', 'wouldn', 'wrangler', 'wreck', 'wrecks', 'wrist', 'write', 'writes', 'writing', 'written', 'wrong', 'xterra', 'xxx', 'yards', 'yc', 'year', 'years', 'yes', 'yet', 'yield', 'york', 'you', 'your', 'zero', 'zone']

Big secret: The "fit" part of .fit_transform means "learn the words." The "transform" part means "count them."
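You can see the fit/transform split in action on two invented sentences: .fit_transform learns the vocabulary and counts in one step, while .transform alone reuses the already-learned vocabulary on new text.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# fit_transform learns the words AND counts them
vectorizer.fit_transform(["AIRBAG FAILED", "AIRBAG DID NOT DEPLOY"])

# transform alone just counts, using the vocabulary learned above -
# words it never saw during fit (like "SHRAPNEL") are silently dropped
new_counts = vectorizer.transform(["AIRBAG SHRAPNEL EVERYWHERE"]).toarray()
print(new_counts)
```

This matters later: when we classify new complaints, they need to be transformed with the same fitted vectorizer so the columns line up.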

You can take advantage of this list to build a nice-looking dataframe:

pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
00 000 01 01v347000 02 02v105000 02v146000 03 03v455000 04 05 05v395000 06 07 08 08v303000 09 10 1000 10017 11 12 128 12th 13 136 13v136000 14 1420 15 150 15pm 16 160lbs 17 180 1996 1997 1998 1999 1st 20 2000 2001 2002 2003 2004 2005 2006 2007 ... window windows windshield wiper wipers wires wiring wished with within without withstand witnesses won wonder woosh word work working works worn worse worsened worst worth would wouldn wrangler wreck wrecks wrist write writes writing written wrong xterra xxx yards yc year years yes yet yield york you your zero zone
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 3 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
160 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
161 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
162 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
163 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
164 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

165 rows × 2280 columns

Only counting with ones and zeros#

It doesn't seem to matter too much whether a word shows up once or twice or twenty times in a complaint - the only important thing is whether it shows up at all.

To turn the counting into just 0s and 1s, we send an extra option to our CountVectorizer.

vectorizer = CountVectorizer(binary=True)

vectors = vectorizer.fit_transform(labeled.CDESCR)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
words_df.head()
00 000 01 01v347000 02 02v105000 02v146000 03 03v455000 04 05 05v395000 06 07 08 08v303000 09 10 1000 10017 11 12 128 12th 13 136 13v136000 14 1420 15 150 15pm 16 160lbs 17 180 1996 1997 1998 1999 1st 20 2000 2001 2002 2003 2004 2005 2006 2007 ... window windows windshield wiper wipers wires wiring wished with within without withstand witnesses won wonder woosh word work working works worn worse worsened worst worth would wouldn wrangler wreck wrecks wrist write writes writing written wrong xterra xxx yards yc year years yes yet yield york you your zero zone
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 2280 columns
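To see what `binary=True` actually changed, here's a one-line comparison on a made-up document (a toy sketch, not our data):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["airbag airbag airbag failed"]

counts = CountVectorizer().fit_transform(doc).toarray()
binary = CountVectorizer(binary=True).fit_transform(doc).toarray()

# counts sees "airbag" three times; binary only records that it showed up
```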

Using our new dataframe in machine learning#

We really like random forests now, right? They're more or less a bunch of fancy decision trees voting together, and they usually give pretty good results.

Let's try one out with our new every-single-word features.

Hot tip: a vector is just a list of numbers (for example, each row). A matrix is a list of vectors.
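In code, that hot tip looks like this (a sketch with two toy sentences; scikit-learn actually hands back a compact sparse matrix, which `.toarray()` expands into the full thing):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True)
vectors = vectorizer.fit_transform(["car crashed", "car caught fire"])

matrix = vectors.toarray()  # the whole matrix: a list of vectors
row_vector = matrix[0]      # one vector: the word flags for one document
```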

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

Usually we do .drop to get rid of the label, but words_df only contains word counts - the label column (whether it's suspicious or not) never made it in. Instead, we'll take the is_suspicious column from our original dataframe, the one with the actual text.

X = words_df
y = labeled.is_suspicious

clf = RandomForestClassifier(n_estimators=100)

clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Confusion matrix#

With all of those incredible features, how did it do?

y_true = y
y_pred = clf.predict(X)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted not suspicious Predicted suspicious
Is not suspicious 150 0
Is suspicious 0 15

Amazing!!! 100% accuracy!!! Loving it!!!
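If you want to pull that headline number out of the matrix yourself, accuracy is just the diagonal (the correct predictions) divided by the total, recomputed here from the four cells above:

```python
import numpy as np

# the confusion matrix cells from above
matrix = np.array([[150, 0],
                   [0, 15]])

# correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(matrix) / matrix.sum()
```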

What did the random forest think were the important features?

import eli5
feature_names = list(X.columns)

# If eli5 warns about judging this type of classifier, use this line instead:
# eli5.show_weights(clf, feature_names=feature_names, show=eli5.formatters.fields.ALL)
eli5.show_weights(clf, feature_names=feature_names)
Weight Feature
0.0184 ± 0.1117 deployed
0.0163 ± 0.0944 pulling
0.0163 ± 0.0904 burns
0.0147 ± 0.1022 burning
0.0146 ± 0.0926 degree
0.0141 ± 0.0701 sunroof
0.0129 ± 0.0798 school
0.0116 ± 0.0725 1st
0.0110 ± 0.0776 apart
0.0107 ± 0.0736 zone
0.0102 ± 0.0564 driver
0.0086 ± 0.0574 problem
0.0085 ± 0.0610 killing
0.0083 ± 0.0573 unexpectedly
0.0081 ± 0.0537 further
0.0076 ± 0.0667 sputtered
0.0074 ± 0.0597 2nd
0.0073 ± 0.0698 street
0.0071 ± 0.0611 chin
0.0068 ± 0.0556 suffered
… 2260 more …

Sure, sure, that all makes sense.

No, wait! Let's train-test split#

Oh boy, we totally forgot about train-test split - we were testing the classifier on things it had already seen. Let's split our data into train and test sets and try again.

from sklearn.model_selection import train_test_split

X = words_df
y = labeled.is_suspicious

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
y_true = y_test
y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted not suspicious Predicted suspicious
Is not suspicious 39 0
Is suspicious 3 0

Oh no, that's horrible. That's terrible. Let's try looking at our feature importances, just to see if it's making dumb decisions.

eli5.show_weights(clf, feature_names=feature_names)
Weight Feature
0.0146 ± 0.0872 face
0.0145 ± 0.0830 problem
0.0142 ± 0.1067 deployed
0.0123 ± 0.0874 passenger
0.0118 ± 0.0873 killing
0.0116 ± 0.0922 burns
0.0099 ± 0.0797 ripping
0.0097 ± 0.0830 mouth
0.0096 ± 0.0842 malfunction
0.0094 ± 0.0926 suffered
0.0093 ± 0.0692 his
0.0092 ± 0.0857 both
0.0092 ± 0.0720 degree
0.0089 ± 0.0855 chin
0.0088 ± 0.0666 resulting
0.0087 ± 0.0546 unexpectedly
0.0087 ± 0.0585 1st
0.0085 ± 0.0635 further
0.0084 ± 0.0695 2nd
0.0078 ± 0.0910 apart
… 2260 more …

I mean, it makes sense, I guess. Even though we added all those new features, why doesn't it work well?
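One suspect worth naming before we blame the classifier: the 150-to-15 imbalance. A plain train_test_split can leave the test set nearly empty of suspicious rows. Here's a sketch (with fake stand-in data, not our complaints) of using stratify=y to keep the same ratio in both halves:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Fake stand-in data with our 150 not-suspicious / 15 suspicious ratio
X = np.random.rand(165, 5)
y = np.array([0] * 150 + [1] * 15)

# stratify=y keeps roughly the same 10:1 ratio in train and test,
# so the rare suspicious rows are guaranteed to show up in both
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)
```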

Trying again with a Logistic Classifier#

Well, if there's one thing we know to do, it's try again and again with different classifiers until something works. Let's see if a logistic classifier works any better!

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1e9, solver='lbfgs')

clf.fit(X_train, y_train)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
y_true = y_test
y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted not suspicious Predicted suspicious
Is not suspicious 39 0
Is suspicious 3 0

Just as bad! Sadly, with this much information there's no good pattern. We can feel good about how explainable it is, though.

eli5.show_weights(clf, feature_names=feature_names, target_names=['not suspicious', 'suspicious'])

y=suspicious top features

Weight? Feature
+3.187 deployed
+2.961 passenger
+2.145 degree
+2.017 problem
+1.869 1st
+1.840 2nd
+1.840 hands
+1.818 face
+1.772 burns
+1.612 provide
+1.377 further
… 859 more positive …
… 1155 more negative …
-1.376 traveling
-1.390 light
-1.465 brake
-1.510 pads
-1.546 front
-2.400 is
-2.888 did
-2.924 not
-6.441 <BIAS>
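Those logistic regression weights are log-odds, so they turn multiplicative once you exponentiate them. A quick check using the "deployed" weight from the table above:

```python
import numpy as np

weight = 3.187  # the weight eli5 reports for "deployed"

# the presence of "deployed" multiplies the predicted odds of
# "suspicious" by e^weight - roughly 24x
odds_multiplier = np.exp(weight)
```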

Review#

While last time we hand-picked the words our classifier should pay attention to, this time we used a vectorizer to just use all of the words. We figured that more information was better information, and we wouldn't even have to pick the words ourselves!

Unfortunately our classifier still didn't really find any suspicious complaints.

Discussion topics#

Brainstorm reasons why more information didn't save us.

In classification problems, when might you want to hand-pick words and when might you want to use a vectorizer? Compare this airbag situation, sentiment analysis of tweets, and separating sci-fi and romance novels.