Finding faulty airbags in a sea of consumer complaints by counting words and classifying the results#

Topics: Vectorizing text

Datasets

  • sampled-labeled.csv: a sample of vehicle complaints, labeled as suspicious or not

What's the goal?#

It was too much work to read twenty years of vehicle complaints to find the ones related to dangerous airbags! The last two times we tried to pick out words that signal dangerous/not dangerous airbags, but it didn't go so well because we weren't sure which words were the best ones to pick.

This time we're going to pick everything.

Setup#

import pandas as pd

# Allow us to display 100 columns at a time, and 100 characters in each column (instead of ...)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)

Read in our labeled data#

We'll start by reading in our complaints that have labels attached to them. Read in sampled-labeled.csv.

labeled = pd.read_csv("data/sampled-labeled.csv")
labeled.head()
is_suspicious CDESCR
0 0.0 ALTHOUGH I LOVED THE CAR OVERALL AT THE TIME I DECIDED TO OWN, , MY DREAM CAR CADILLAC CTS HAS T...
1 0.0 CONSUMER SHUT SLIDING DOOR WHEN ALL POWER LOCKS ON ALL DOORS LOCKED BY ITSELF, TRAPPING INFANT I...
2 0.0 DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT
3 0.0 TL* THE CONTACT OWNS A 2009 NISSAN ALTIMA. THE CONTACT STATED THAT THE START BUTTON FOR THE IGNI...
4 0.0 THE FRONT MIDDLE SEAT DOESN'T LOCK IN PLACE. *AK

Even though it's called labeled, not all of them have labels. Drop the ones missing labels.

labeled = labeled.dropna()

See how many suspicious/not suspicious comments we have.

labeled.is_suspicious.value_counts()
0.0    150
1.0     15
Name: is_suspicious, dtype: int64

150 non-suspicious and 15 suspicious is a pretty terrible ratio, but we're remarkably lazy and not very many of the comments are actually suspicious.
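That ratio is easier to appreciate as percentages. A quick sketch on a stand-in Series (in the notebook this would be `labeled.is_suspicious`):

```python
import pandas as pd

# Stand-in for labeled.is_suspicious: 150 zeros and 15 ones
is_suspicious = pd.Series([0.0] * 150 + [1.0] * 15)

# normalize=True turns the counts into proportions
shares = is_suspicious.value_counts(normalize=True)
print(shares)
# roughly 91% not suspicious, 9% suspicious
```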

Now that we've read a few, let's train our classifier

Creating features#

Selecting our features and building a features dataframe#

Last time, we thought of some words or phrases that might make a comment interesting or not interesting. We came up with this list:

  • airbag
  • air bag
  • failed
  • did not deploy
  • violent
  • explode
  • shrapnel

We then built a dataframe with a column for each of those words - 1 if the word shows up in the complaint, 0 if it doesn't - along with the is_suspicious label. That process looked like this:

train_df = pd.DataFrame({
    'is_suspicious': labeled.is_suspicious,
    'airbag': labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': labeled.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': labeled.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': labeled.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': labeled.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': labeled.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': labeled.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
train_df.head()
is_suspicious airbag air bag failed did not deploy violent explode shrapnel
0 0.0 0 0 0 0 0 0 0
1 0.0 0 0 0 0 0 0 0
2 0.0 0 0 0 0 0 0 0
3 0.0 0 0 0 0 0 0 0
4 0.0 0 0 0 0 0 0 0
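If typing out each `.str.contains` line feels repetitive, the same dataframe can be built in a loop. A sketch with two invented complaints standing in for the real data:

```python
import pandas as pd

# Two made-up complaints standing in for labeled.CDESCR
labeled = pd.DataFrame({
    "is_suspicious": [0.0, 1.0],
    "CDESCR": ["SEAT DOESN'T LOCK IN PLACE", "AIR BAG DID NOT DEPLOY"],
})

words = ["AIRBAG", "AIR BAG", "FAILED", "DID NOT DEPLOY",
         "VIOLENT", "EXPLODE", "SHRAPNEL"]

# Build the same 0/1 columns in a loop instead of typing each one out
train_df = pd.DataFrame({"is_suspicious": labeled.is_suspicious})
for word in words:
    train_df[word.lower()] = labeled.CDESCR.str.contains(word, na=False).astype(int)

print(train_df)
```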

But as we found out later, picking which words are important - feature selection - can be a difficult process. There are a lot of words in there, and it isn't like we're going to go through and look at every single word, right?

Well, actually, it's definitely possible to look at every single word, and it takes way less code than what we did up above.

You can count words using the CountVectorizer from scikit-learn. Using .fit_transform below will learn all of the words in a column, then count them.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

vectors = vectorizer.fit_transform(labeled.CDESCR)
vectors
<165x2280 sparse matrix of type '<class 'numpy.int64'>'
	with 9089 stored elements in Compressed Sparse Row format>

But... what's a "sparse matrix"? We can see something that looks more familiar if we tell it to become an array (basically a list).

vectors.toarray()
array([[0, 0, 0, ..., 3, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

It's still a little hard to read, but a list of lists? Sounds like a great opportunity for a dataframe!

pd.DataFrame(vectors.toarray())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 ... 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 3 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
160 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
161 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
162 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
163 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
164 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

165 rows × 2280 columns

Each row is a sentence, and each column is a word!

  • We had 165 sentences, so we now have 165 rows
  • There were 2280 words, so we have 2280 columns

If a word appears zero times in a sentence, that column gets a 0. If it appears one or two or twenty times, that number appears in the column instead.
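To see exactly what those rows and columns mean, here's the same vectorizer run on a tiny made-up corpus (both sentences are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny two-document corpus so the whole matrix fits on screen
texts = [
    "AIRBAG DID NOT DEPLOY",
    "THE AIRBAG EXPLODED AND THE AIRBAG BURNED ME",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts).toarray()

# Columns follow the sorted vocabulary; rows are documents
print(sorted(vectorizer.vocabulary_))
print(counts)
```

Since "AIRBAG" shows up twice in the second sentence, that row gets a 2 in the airbag column.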

The whole sparse matrix thing comes from scipy. It's the idea that since the list of lists was mostly empty, Python can be lazy and not keep track of all of the 0s - instead, it only tracks where there are non-0 numbers. A sparse matrix is much more efficient with space if you have a lot lot lot of 0's!

We used .toarray() to turn it into a list of lists (although if we have a lot lot lot of words and sentences, our computer might not have enough memory to do it).
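A minimal sketch of that space-saving idea (sparse matrices actually live in scipy.sparse, and convert to and from numpy arrays):

```python
import numpy as np
from scipy import sparse

# A mostly-zero matrix: a million cells, only two of them non-zero
dense = np.zeros((1000, 1000), dtype=np.int64)
dense[0, 0] = 5
dense[500, 250] = 2

# The sparse version only stores the non-zero entries
sp = sparse.csr_matrix(dense)
print(sp.nnz)      # number of stored (non-zero) elements: 2
print(dense.size)  # the dense version tracks all 1,000,000 cells
```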

How do we know which column is which word? When we told the vectorizer to count all of the words in each sentence, it also memorized all of the words separately. You can ask for them with .get_feature_names() (renamed .get_feature_names_out() in newer versions of scikit-learn).

print(vectorizer.get_feature_names())
['00', '000', '01', '01v347000', '02', '02v105000', '02v146000', '03', '03v455000', '04', '05', '05v395000', '06', '07', '08', '08v303000', '09', '10', '1000', '10017', '11', '12', '128', '12th', '13', '136', '13v136000', '14', '1420', '15', '150', '15pm', '16', '160lbs', '17', '180', '1996', '1997', '1998', '1999', '1st', '20', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '20k', '20mph', '22', '2300', '24', '25', '2500', '262', '28', '29', '2nd', '30', '300', '30miles', '30mph', '31', '32', '323i', '325xi', '32k', '35', '37', '39', '390', '3k', '3rd', '40', '40mph', '42', '440', '45mph', '48', '49', '4x4', '50', '500', '5000', '50000', '50k', '517', '55', '552', '57', '5th', '60k', '60mph', '65', '65000km', '68', '6th', '70', '71000', '75', '7500', '77', '775', '79', '795', '800', '8004341', '808680', '86', '87', '91', '915', '93k', '94', '98', '981', 'a1', 'aamco', 'able', 'about', 'above', 'abrasion', 'abrasions', 'abs', 'absence', 'absolutely', 'ac', 'accelerate', 'accelerated', 'acceleration', 'access', 'accident', 'accord', 'according', 'accurate', 'acknowledge', 'across', 'act', 'acted', 'action', 'activates', 'activations', 'active', 'actually', 'actuator', 'acura', 'addition', 'additional', 'address', 'addressed', 'adjacent', 'adjusted', 'advance', 'advise', 'advised', 'affairs', 'affecting', 'afford', 'afraid', 'after', 'again', 'against', 'age', 'agency', 'agents', 'ago', 'agony', 'air', 'airbag', 'airbags', 'aircondition', 'ak', 'alive', 'all', 'alley', 'allow', 'almost', 'along', 'already', 'also', 'although', 'altima', 'always', 'am', 'american', 'an', 'and', 'angles', 'another', 'answer', 'antenna', 'anti', 'antifreeze', 'any', 'anymore', 'anyone', 'anything', 'anywhere', 'apart', 'apparent', 'appeal', 'appear', 'appeared', 'appears', 'applied', 'apply', 'applying', 'appointment', 'appreciate', 'appreciated', 'approaching', 'approval', 'approx', 'approximate', 'approximately', 
'april', 'aprox', 'are', 'area', 'arise', 'arm', 'arms', 'around', 'as', 'asked', 'assembly', 'assigned', 'assist', 'assistance', 'assume', 'at', 'attachment', 'attachments', 'attempted', 'attempting', 'attempts', 'attention', 'audible', 'aug', 'august', 'augusta', 'authorized', 'auto', 'automatic', 'automobiles', 'avail', 'available', 'avoid', 'aware', 'away', 'awfully', 'awhile', 'axel', 'axle', 'baby', 'back', 'backing', 'backwards', 'bad', 'bag', 'bags', 'bailed', 'ball', 'banged', 'banging', 'bar', 'barely', 'bargained', 'barrier', 'bars', 'battery', 'bc', 'be', 'beam', 'beams', 'became', 'because', 'been', 'beep', 'before', 'beg', 'began', 'behind', 'being', 'believe', 'believed', 'bell', 'belong', 'below', 'belt', 'belts', 'beltway', 'bent', 'benz', 'better', 'between', 'beyond', 'bf', 'bigger', 'binding', 'bit', 'bizarre', 'blew', 'blink', 'blinker', 'blinking', 'block', 'blocks', 'blowing', 'blowout', 'bmw', 'board', 'body', 'bolted', 'bone', 'booster', 'boosters', 'both', 'bother', 'bottom', 'bought', 'bound', 'box', 'bracket', 'brake', 'brakes', 'braking', 'brand', 'break', 'breaking', 'brick', 'bring', 'broadsided', 'broke', 'broken', 'brought', 'bruise', 'bruised', 'bruises', 'bruising', 'buckle', 'buckled', 'bug', 'buick', 'building', 'builds', 'built', 'bump', 'bumper', 'buns', 'buried', 'burned', 'burning', 'burns', 'busy', 'but', 'button', 'buttons', 'buy', 'buying', 'by', 'ca', 'cab', 'cable', 'cadillac', 'cafe', 'caliber', 'caliper', 'calipers', 'call', 'called', 'calls', 'cam', 'came', 'campaign', 'campaigns', 'camping', 'camry', 'can', 'canada', 'canadian', 'cannot', 'cap', 'car', 'care', 'carefully', 'cares', 'carnival', 'caromed', 'carriers', 'carrying', 'cars', 'case', 'catalytic', 'catch', 'caught', 'cause', 'caused', 'causes', 'causing', 'cb', 'center', 'centre', 'ceo', 'certain', 'certainly', 'chain', 'chandra', 'change', 'changed', 'charge', 'charged', 'cheat', 'check', 'checked', 'cherokee', 'chest', 'chevrolet', 'chevy', 'child', 
'children', 'chimes', 'chin', 'choose', 'chrysler', 'cinergy', 'circle', 'circuit', 'circuits', 'claim', 'claimed', 'clash', 'classic', 'clear', 'clearly', 'climb', 'clock', 'clockspring', 'close', 'closed', 'closure', 'clue', 'cn', 'co', 'coasting', 'codes', 'coil', 'coils', 'coincidentally', 'cold', 'collapse', 'collapsed', 'collapsing', 'collide', 'collided', 'collision', 'collison', 'column', 'com', 'combi', 'come', 'comes', 'coming', 'comment', 'common', 'company', 'compartment', 'compensation', 'complained', 'complaint', 'complaints', 'complete', 'completely', 'component', 'compressor', 'compromises', 'computer', 'concern', 'concerned', 'concerning', 'concerns', 'concord', 'concrete', 'concussion', 'condition', 'conditioner', 'conditions', 'conducted', 'confirm', 'consider', 'considerably', 'considered', 'console', 'constantly', 'consume', 'consumer', 'consumers', 'contact', 'contacted', 'contemplating', 'continue', 'continued', 'continues', 'contributed', 'control', 'controls', 'converter', 'cool', 'cooler', 'cooling', 'corner', 'corolla', 'corollas', 'corporation', 'correct', 'corrode', 'corrosion', 'cost', 'costs', 'could', 'country', 'county', 'couple', 'course', 'court', 'cover', 'covered', 'crack', 'cracked', 'cracking', 'crash', 'crashed', 'crazy', 'critical', 'cronic', 'cross', 'crossed', 'crossing', 'crossroads', 'cruise', 'cruising', 'crumpled', 'crv', 'cts', 'cupping', 'curb', 'curbing', 'current', 'currently', 'currents', 'curtain', 'customer', 'cut', 'cuts', 'cutting', 'cylinder', 'd4', 'daimler', 'damage', 'damaged', 'danger', 'dangerous', 'dash', 'dashboard', 'date', 'dating', 'daughter', 'day', 'days', 'daytime', 'dazed', 'dead', 'deadly', 'dealer', 'dealers', 'dealership', 'dealerships', 'dear', 'death', 'december', 'decent', 'decided', 'decides', 'decision', 'declared', 'deemed', 'deer', 'defect', 'defective', 'defects', 'defog', 'defogger', 'defrost', 'degree', 'degrees', 'delay', 'demons', 'denied', 'denies', 'dented', 'department', 
'deploy', 'deployed', 'deploying', 'deployment', 'describe', 'design', 'despite', 'destroyed', 'destructive', 'detached', 'details', 'detect', 'determine', 'determined', 'developed', 'diagnose', 'diagnosed', 'diagnosis', 'diagnostic', 'diagnostics', 'did', 'didn', 'die', 'died', 'differ', 'different', 'difficult', 'difficulty', 'digital', 'ding', 'direct', 'directed', 'dirt', 'disabled', 'discovered', 'discs', 'discuss', 'dismantled', 'dispite', 'display', 'displayed', 'dissappointed', 'distance', 'distribution', 'ditch', 'do', 'doctor', 'documented', 'doddge', 'does', 'doesn', 'dog', 'dollars', 'don', 'done', 'door', 'doors', 'down', 'dream', 'drivable', 'drive', 'driven', 'driver', 'drivers', 'driveway', 'driving', 'drivng', 'drove', 'dry', 'dt', 'dual', 'due', 'duplicate', 'during', 'dust', 'duty', 'dvd', 'e320', 'ea13003', 'each', 'ear', 'early', 'ears', 'edmunds', 'either', 'elderly', 'electrical', 'electronic', 'electronics', 'elk', 'else', 'emailed', 'embankment', 'emergency', 'emitted', 'emptied', 'empty', 'en', 'encountered', 'end', 'ended', 'engaged', 'engine', 'enough', 'entered', 'entering', 'enters', 'entertainment', 'entiire', 'entire', 'equipment', 'equipped', 'era', 'erratic', 'erratically', 'error', 'especially', 'esserman', 'estimated', 'et', 'etc', 'even', 'event', 'events', 'ever', 'every', 'everyday', 'everyone', 'everything', 'everywhere', 'evidence', 'evident', 'evidently', 'exact', 'examined', 'except', 'exceptional', 'excessive', 'exhaust', 'exists', 'exit', 'expedition', 'expense', 'expensive', 'experience', 'experienced', 'experiences', 'experiencing', 'expired', 'explained', 'explains', 'exploded', 'explorer', 'explosions', 'explosive', 'extended', 'extensive', 'extra', 'extremely', 'eye', 'face', 'facets', 'facility', 'facing', 'fact', 'factory', 'fades', 'fail', 'failed', 'failing', 'fails', 'failure', 'failures', 'faint', 'fairly', 'fallen', 'false', 'family', 'fan', 'far', 'fast', 'fatal', 'fate', 'father', 'fault', 'faults', 
'faulty', 'fax', 'feature', 'features', 'february', 'federally', 'fee', 'feeding', 'feel', 'feels', 'feet', 'fell', 'felt', 'fence', 'fender', 'few', 'fiance', 'fifteen', 'fight', 'figure', 'file', 'filed', 'filler', 'filling', 'filter', 'final', 'finally', 'financially', 'find', 'fine', 'fire', 'firestone', 'firing', 'first', 'fit', 'five', 'fix', 'fixed', 'flagship', 'flashing', 'flaw', 'floor', 'flying', 'fm', 'fog', 'foia', 'fold', 'following', 'foot', 'for', 'force', 'forced', 'ford', 'forearm', 'fortunately', 'forum', 'forums', 'forward', 'found', 'four', 'fraction', 'fractured', 'frame', 'free', 'freedom', 'freon', 'from', 'front', 'frontal', 'frustrated', 'fuel', 'full', 'fully', 'function', 'functional', 'funtion', 'further', 'fuse', 'future', 'ga', 'garage', 'gas', 'gasket', 'gasoline', 'gate', 'gauge', 'gauges', 'gb250', 'gear', 'gears', 'get', 'gets', 'getting', 'give', 'given', 'glass', 'glove', 'gm', 'gmc', 'go', 'god', 'goes', 'going', 'golfs', 'gone', 'good', 'goodness', 'got', 'gotten', 'grace', 'grand', 'graph', 'gravel', 'green', 'grill', 'grinding', 'grip', 'grooves', 'ground', 'grove', 'gto', 'guard', 'had', 'hamilton', 'hand', 'handle', 'handles', 'handling', 'hands', 'happen', 'happened', 'happening', 'happens', 'hard', 'hardware', 'harness', 'harnessing', 'has', 'hasn', 'have', 'haven', 'having', 'hazard', 'hazardous', 'hd', 'he', 'head', 'headlight', 'headlights', 'headliner', 'hear', 'heard', 'heated', 'heater', 'heating', 'heavy', 'held', 'help', 'hence', 'her', 'here', 'hesitant', 'hesitated', 'high', 'higher', 'highway', 'him', 'hindered', 'hindering', 'hinge', 'hinges', 'his', 'hit', 'hitting', 'hold', 'holding', 'holds', 'holes', 'hollow', 'home', 'honda', 'honor', 'honored', 'hooked', 'hops', 'horn', 'horror', 'hose', 'hospital', 'hot', 'hour', 'hours', 'house', 'how', 'howe', 'however', 'hows', 'hub', 'huge', 'humid', 'humidity', 'hundreds', 'hurry', 'hurt', 'husband', 'husbands', 'hutchinson', 'hydraulic', 'hyosung', 'hyundai', 
'i35s', 'i95', 'iahwan', 'id', 'idea', 'identified', 'ie', 'if', 'ignition', 'ii', 'illuminated', 'illuminating', 'im', 'imbursement', 'immediately', 'impact', 'impacting', 'impacts', 'impala', 'impeding', 'importation', 'in', 'inaccurate', 'inactive', 'inadvertent', 'inch', 'incident', 'include', 'included', 'including', 'indeed', 'independent', 'indianapolis', 'indicate', 'indicated', 'indicating', 'indication', 'indicator', 'indicators', 'infant', 'infiniti', 'inflate', 'inflater', 'information', 'informed', 'initial', 'initially', 'injet', 'injuires', 'injured', 'injuries', 'injuring', 'injury', 'inoperable', 'inserted', 'inside', 'insight', 'insists', 'inspect', 'inspected', 'inspecting', 'inspection', 'inspector', 'installed', 'instead', 'instrument', 'insurance', 'integral', 'intended', 'interest', 'interior', 'intermittent', 'intermittently', 'internal', 'international', 'intersection', 'intersections', 'interstate', 'interval', 'intervals', 'interventions', 'into', 'intrusive', 'investigate', 'investigated', 'investigation', 'invoice', 'involve', 'involved', 'invoved', 'iraq', 'is', 'isn', 'isolated', 'issue', 'issued', 'issues', 'it', 'itbstruck', 'item', 'items', 'its', 'itself', 'jackets', 'january', 'japanese', 'jarred', 'jb', 'jeep', 'jerked', 'jersey', 'jetta', 'job', 'joint', 'js', 'juice', 'jump', 'jumping', 'june', 'just', 'justify', 'k2500', 'kb', 'keep', 'keeps', 'kept', 'key', 'kia', 'kicked', 'kids', 'kill', 'killed', 'killing', 'kind', 'kit', 'kits', 'km', 'kms', 'knee', 'knees', 'knew', 'knock', 'knocked', 'know', 'known', 'knows', 'la', 'labor', 'lacerations', 'lamps', 'lane', 'lanes', 'lap', 'laredo', 'large', 'last', 'lasting', 'latch', 'later', 'launch', 'lawn', 'laying', 'leading', 'leak', 'leakage', 'leaking', 'leaks', 'leaned', 'leased', 'least', 'leather', 'leave', 'leaving', 'left', 'leg', 'legs', 'lehmer', 'lengths', 'lesion', 'let', 'letter', 'level', 'lever', 'lied', 'life', 'lift', 'lifter', 'liftgate', 'light', 'lights', 
'like', 'likely', 'lincoln', 'line', 'lines', 'link', 'list', 'listed', 'lit', 'literally', 'little', 'lives', 'lj', 'local', 'located', 'location', 'lock', 'locked', 'locking', 'locks', 'logical', 'long', 'longer', 'looked', 'looks', 'loose', 'lose', 'losing', 'loss', 'lost', 'lot', 'loud', 'loved', 'low', 'lower', 'luckily', 'lunch', 'lurched', 'luxury', 'ma', 'made', 'mail', 'main', 'maintain', 'maintained', 'maintenance', 'major', 'make', 'makes', 'making', 'malfunction', 'malfunctioning', 'malfunctions', 'malfuntioned', 'malibu', 'manager', 'maneuvering', 'manifold', 'manual', 'manually', 'manufacture', 'manufactured', 'manufacturer', 'many', 'march', 'market', 'massive', 'master', 'matrix', 'matter', 'maxima', 'may', 'maybe', 'mbrusman', 'mdx', 'me', 'means', 'mechanic', 'mechanical', 'mechanism', 'mechanisms', 'median', 'mediation', 'mediator', 'mention', 'mentioned', 'mercedes', 'merging', 'message', 'met', 'metal', 'meters', 'middle', 'mileage', 'mileages', 'miles', 'mine', 'minor', 'minutes', 'miraculously', 'mishap', 'missing', 'ml', 'model', 'models', 'moderate', 'module', 'moisture', 'molding', 'moldings', 'moment', 'money', 'month', 'months', 'more', 'morning', 'most', 'mostly', 'mother', 'motion', 'motor', 'motorcycle', 'mountain', 'mounting', 'mouth', 'mph', 'mr', 'much', 'multiple', 'murano', 'must', 'mustang', 'my', 'myself', 'na', 'name', 'nature', 'nc', 'near', 'neck', 'need', 'needed', 'needle', 'needles', 'needs', 'neighborhood', 'neither', 'nerve', 'nerves', 'net', 'never', 'new', 'newer', 'news', 'next', 'nhtsa', 'nice', 'night', 'nightmare', 'nissan', 'nj', 'nm', 'no', 'nobody', 'noise', 'noises', 'non', 'none', 'nor', 'normal', 'north', 'not', 'noted', 'nothing', 'notice', 'noticeable', 'noticed', 'notified', 'notifying', 'november', 'now', 'number', 'numbers', 'numerous', 'oakland', 'object', 'objects', 'obtain', 'obvious', 'obviously', 'occasion', 'occasions', 'occupant', 'occupants', 'occurred', 'occurrence', 'occurring', 'occurs', 
'ocs', 'october', 'odi', 'odometer', 'odyssey', 'of', 'off', 'offer', 'offered', 'office', 'officer', 'official', 'often', 'oil', 'ok', 'old', 'on', 'once', 'oncoming', 'one', 'ones', 'oneself', 'ongoing', 'online', 'only', 'onto', 'oo', 'op', 'open', 'opened', 'operate', 'operates', 'operation', 'opinion', 'opposite', 'optional', 'or', 'order', 'ordered', 'organs', 'original', 'orthopedic', 'other', 'others', 'otherwise', 'ounce', 'our', 'ours', 'out', 'outcome', 'outside', 'over', 'overall', 'overheating', 'overnight', 'own', 'owned', 'owner', 'owners', 'owns', 'oxygen', 'p225', 'pads', 'paid', 'pain', 'paint', 'panel', 'panic', 'park', 'parked', 'parking', 'parkway', 'part', 'partial', 'particular', 'parts', 'passanger', 'passat', 'passats', 'passed', 'passenger', 'passengers', 'passing', 'past', 'pavement', 'pay', 'pe', 'pedal', 'pedestrians', 'people', 'per', 'perfect', 'perfectly', 'performed', 'perhaps', 'period', 'periodic', 'permantly', 'persist', 'person', 'ph', 'phone', 'picked', 'pickup', 'pictures', 'pieces', 'pillar', 'pinion', 'pipe', 'pixels', 'place', 'placed', 'places', 'plan', 'plastic', 'play', 'please', 'plenty', 'plymouth', 'pocket', 'pockets', 'point', 'poles', 'police', 'pond', 'pontiac', 'pop', 'popped', 'popping', 'portion', 'position', 'positioned', 'possibilities', 'possible', 'possibly', 'postal', 'posted', 'potential', 'potentially', 'pothole', 'powder', 'power', 'powertrain', 'preoccupied', 'pressed', 'pressing', 'pressure', 'prevented', 'previous', 'prior', 'private', 'probably', 'problem', 'problems', 'process', 'produce', 'product', 'products', 'professional', 'program', 'prominent', 'promise', 'promised', 'prompted', 'proof', 'properly', 'protect', 'protection', 'proveout', 'provide', 'provincially', 'provoke', 'public', 'puddle', 'pull', 'pulled', 'pulley', 'pulling', 'pump', 'purchase', 'purchased', 'pursuant', 'pursuing', 'push', 'pushed', 'pushing', 'put', 'putting', 'quality', 'quart', 'quarter', 'quick', 'quickly', 'quite', 
'quits', 'quote', 'r16', 'rack', 'radio', 'rail', 'raining', 'ran', 'random', 'randomly', 'range', 'rate', 'rather', 'rating', 'rattling', 'rav', 'rayed', 're', 'read', 'reading', 'readings', 'reads', 'realized', 'really', 'rear', 'rearfacing', 'reason', 'reasonable', 'reasoning', 'reasons', 'recall', 'recalled', 'recalls', 'receive', 'received', 'recent', 'recently', 'recode', 'recommended', 'record', 'recording', 'records', 'recovered', 'rectifying', 'recurred', 'recurring', 'red', 'redacted', 'redo', 'referred', 'refused', 'refuses', 'refusing', 'regarding', 'regards', 'regional', 'regular', 'regularly', 'reimbursement', 'related', 'relations', 'relay', 'releasing', 'relied', 'rely', 'remain', 'remained', 'remains', 'remedy', 'remember', 'reoccur', 'reoccurring', 'reopened', 'repair', 'repaired', 'repairs', 'repeated', 'repeatedly', 'replace', 'replaced', 'replacement', 'replacing', 'report', 'reported', 'reproduce', 'reps', 'request', 'requested', 'requires', 'requiring', 'research', 'reserve', 'reset', 'resistance', 'resolve', 'respond', 'responded', 'responsibility', 'responsible', 'rest', 'restart', 'restarted', 'restrain', 'restraint', 'result', 'resulted', 'resulting', 'resurface', 'retract', 'retractor', 'return', 'returned', 'returning', 'rewire', 'rhd', 'ribbon', 'ride', 'ridgeline', 'riding', 'right', 'rims', 'ringing', 'rings', 'ripping', 'risk', 'river', 'road', 'roadside', 'rod', 'roll', 'rollover', 'rondo', 'rotate', 'rotors', 'roughly', 'route', 'rt', 'rte', 'rubbing', 'ruined', 'run', 'running', 'runnings', 'rural', 'ruralinfo', 'rust', 'rusted', 'sacramento', 'sadly', 'safe', 'safely', 'safety', 'safey', 'said', 'salesman', 'salesperson', 'same', 'sanctioned', 'saturn', 'save', 'saved', 'saw', 'say', 'says', 'sc', 'scam', 'scared', 'school', 'scn', 'screen', 'screw', 'scrutinized', 'sd', 'seal', 'seat', 'seatbelt', 'seatbelted', 'seatbelts', 'sebring', 'second', 'secondary', 'secure', 'security', 'see', 'seeing', 'seem', 'seemed', 'seems', 
'seen', 'selling', 'semi', 'send', 'sensor', 'sensors', 'sent', 'separation', 'sept', 'sequoia', 'series', 'serious', 'seriously', 'serpentine', 'service', 'serviced', 'set', 'several', 'severe', 'shaft', 'shake', 'shared', 'shattering', 'she', 'sheered', 'shield', 'shift', 'shifting', 'shipping', 'shock', 'shocked', 'shocking', 'shop', 'short', 'shortly', 'should', 'shoulder', 'shouldn', 'show', 'showed', 'showing', 'shows', 'shrapnel', 'shrubs', 'shudder', 'shut', 'shutting', 'side', 'sideroof', 'sides', 'sideways', 'sign', 'signal', 'signals', 'significant', 'significantly', 'silent', 'silverado', 'similar', 'simply', 'simultaneously', 'since', 'single', 'sit', 'sited', 'sitting', 'situated', 'situation', 'six', 'skid', 'skidded', 'skull', 'slam', 'slammed', 'slc', 'sliding', 'slip', 'slipping', 'slither', 'slow', 'slowed', 'slowing', 'slumped', 'small', 'smashed', 'smd', 'smell', 'smoke', 'snapped', 'so', 'software', 'solara', 'solstice', 'solutions', 'some', 'someone', 'something', 'sometimes', 'somewhere', 'sonata', 'soon', 'soooo', 'sore', 'soreness', 'sorry', 'sort', 'sound', 'sounds', 'source', 'space', 'spanning', 'specialists', 'specific', 'speed', 'speeding', 'speedometer', 'spin', 'spiral', 'split', 'sport', 'spot', 'spots', 'sprained', 'spring', 'springclock', 'springs', 'spun', 'sputtered', 'srs', 'st', 'stabilitrack', 'stability', 'staff', 'stalled', 'stalls', 'stance', 'standards', 'staples', 'start', 'started', 'starting', 'starts', 'state', 'stated', 'statedon', 'states', 'stations', 'stay', 'stayed', 'stays', 'steel', 'steer', 'steering', 'stem', 'stick', 'still', 'stitches', 'stomach', 'stop', 'stopped', 'stopping', 'store', 'straight', 'stranded', 'strange', 'streeing', 'street', 'strike', 'strongly', 'struck', 'stuck', 'subject', 'submitted', 'subsequent', 'substantial', 'suburban', 'such', 'sudden', 'suddenly', 'suffer', 'suffered', 'summer', 'sun', 'sunroof', 'supply', 'support', 'supposed', 'supposedly', 'sure', 'surged', 'surprise', 
'surprised', 'suspected', 'suspension', 'sustain', 'sustained', 'suv', 'sway', 'switch', 'swollen', 'symptoms', 'system', 'systems', 'tags', 'tail', 'tailgage', 'tailpipe', 'takata', 'take', 'taken', 'taking', 'talked', 'tank', 'tap', 'taurus', 'taylor', 'tcs', 'tear', 'technicians', 'tee', 'tell', 'telling', 'temp', 'temperature', 'tends', 'tennessee', 'terrible', 'terribly', 'test', 'tested', 'tgw', 'than', 'thank', 'thankfully', 'thanks', 'that', 'the', 'their', 'them', 'then', 'there', 'thereafter', 'these', 'they', 'thing', 'things', 'think', 'third', 'this', 'thoroughfare', 'those', 'though', 'thought', 'thousands', 'three', 'throttle', 'through', 'thrown', 'ticked', 'ticket', 'tie', 'tightened', 'tightening', 'tilt', 'time', 'times', 'tinted', 'tire', 'tires', 'tl', 'to', 'today', 'together', 'told', 'tomorrow', 'tone', 'too', 'took', 'top', 'total', 'totaled', 'totalled', 'totally', 'tow', 'toward', 'towards', 'towed', 'town', 'toyota', 'tr', 'trac', 'track', 'traction', 'trade', 'traffic', 'trailer', 'transfer', 'transferred', 'transmission', 'transportation', 'trapping', 'trauma', 'traveled', 'traveling', 'tread', 'tree', 'tried', 'trigger', 'triggering', 'trip', 'trips', 'troubleshoot', 'truck', 'trucks', 'trunk', 'trust', 'try', 'trying', 'ts', 'tt', 'turbo', 'turn', 'turned', 'turning', 'turns', 'twice', 'twitted', 'two', 'type', 'tyre', 'umbrellas', 'un', 'unable', 'under', 'underneath', 'understand', 'understanding', 'unexpected', 'unexpectedly', 'unfortunately', 'unique', 'unit', 'unity', 'unknown', 'unless', 'unlock', 'unreadable', 'unreturned', 'unsafe', 'unsure', 'until', 'unwarranted', 'unwilling', 'up', 'update', 'updated', 'updates', 'upload', 'upon', 'upper', 'upright', 'upset', 'us', 'usage', 'use', 'used', 'using', 'usps', 'usually', 'vacation', 'van', 'vancouver', 'vans', 've', 'veer', 'veered', 'veering', 'vehicle', 'vehicles', 'verbal', 'vertebra', 'very', 'vibrate', 'video', 'view', 'vin', 'violated', 'violently', 'visibility', 'visit', 
'visiting', 'voice', 'volkswagen', 'volvo', 'voyager', 'vsc', 'vulnerable', 'vw', 'wade', 'wait', 'waited', 'waiting', 'walked', 'wall', 'want', 'wanted', 'wants', 'warm', 'warms', 'warned', 'warning', 'warnings', 'warped', 'warrant', 'warranty', 'was', 'wasn', 'watched', 'water', 'way', 'we', 'weak', 'wear', 'wearing', 'weather', 'website', 'week', 'weeks', 'welding', 'well', 'went', 'were', 'weren', 'westbound', 'wet', 'what', 'wheel', 'when', 'where', 'whether', 'which', 'while', 'whiplash', 'who', 'whole', 'why', 'wider', 'wife', 'wiggling', 'will', 'willing', 'wilson', 'wind', 'window', 'windows', 'windshield', 'wiper', 'wipers', 'wires', 'wiring', 'wished', 'with', 'within', 'without', 'withstand', 'witnesses', 'won', 'wonder', 'woosh', 'word', 'work', 'working', 'works', 'worn', 'worse', 'worsened', 'worst', 'worth', 'would', 'wouldn', 'wrangler', 'wreck', 'wrecks', 'wrist', 'write', 'writes', 'writing', 'written', 'wrong', 'xterra', 'xxx', 'yards', 'yc', 'year', 'years', 'yes', 'yet', 'yield', 'york', 'you', 'your', 'zero', 'zone']

Big secret: The "fit" part of .fit_transform means "learn the words." The "transform" part means "count them."
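You can see the fit/transform split in action on two invented sentences: .fit_transform learns the vocabulary and counts in one step, while .transform alone reuses the already-learned vocabulary on new text.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# fit_transform learns the words AND counts them
vectorizer.fit_transform(["AIRBAG FAILED", "AIRBAG DID NOT DEPLOY"])

# transform alone just counts, using the vocabulary learned above -
# words it never saw during fit (like "SHRAPNEL") are silently dropped
new_counts = vectorizer.transform(["AIRBAG SHRAPNEL EVERYWHERE"]).toarray()
print(new_counts)
```

This matters later: when we classify new complaints, they need to be transformed with the same fitted vectorizer so the columns line up.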

You can take advantage of this list to build a nice-looking dataframe:

pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
00 000 01 01v347000 02 02v105000 02v146000 03 03v455000 04 05 05v395000 06 07 08 08v303000 09 10 1000 10017 11 12 128 12th 13 136 13v136000 14 1420 15 150 15pm 16 160lbs 17 180 1996 1997 1998 1999 1st 20 2000 2001 2002 2003 2004 2005 2006 2007 ... window windows windshield wiper wipers wires wiring wished with within without withstand witnesses won wonder woosh word work working works worn worse worsened worst worth would wouldn wrangler wreck wrecks wrist write writes writing written wrong xterra xxx yards yc year years yes yet yield york you your zero zone
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 3 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
160 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
161 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
162 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
163 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
164 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

165 rows × 2280 columns

Only counting with ones and zeros#

It doesn't seem to matter too much whether a word shows up once or twice or twenty times in a complaint - the only important thing is whether it shows up at all.

To turn the counting into just 0s and 1s, we send an extra option to our CountVectorizer.

vectorizer = CountVectorizer(binary=True)

vectors = vectorizer.fit_transform(labeled.CDESCR)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
words_df.head()
00 000 01 01v347000 02 02v105000 02v146000 03 03v455000 04 05 05v395000 06 07 08 08v303000 09 10 1000 10017 11 12 128 12th 13 136 13v136000 14 1420 15 150 15pm 16 160lbs 17 180 1996 1997 1998 1999 1st 20 2000 2001 2002 2003 2004 2005 2006 2007 ... window windows windshield wiper wipers wires wiring wished with within without withstand witnesses won wonder woosh word work working works worn worse worsened worst worth would wouldn wrangler wreck wrecks wrist write writes writing written wrong xterra xxx yards yc year years yes yet yield york you your zero zone
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 rows × 2280 columns
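To see what `binary=True` actually changed, here's a one-line comparison on a made-up document (a toy sketch, not our data):

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = ["airbag airbag airbag failed"]

counts = CountVectorizer().fit_transform(doc).toarray()
binary = CountVectorizer(binary=True).fit_transform(doc).toarray()

# counts sees "airbag" three times; binary only records that it showed up
```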

Using our new dataframe in machine learning#

We really like random forests now, right? They're more or less a bunch of fancy decision trees voting together, and they usually give pretty good results.

Let's try one out with our new every-single-word features.

Hot tip: a vector is just a list of numbers (for example, each row). A matrix is a list of vectors.
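In code, that hot tip looks like this (a sketch with two toy sentences; scikit-learn actually hands back a compact sparse matrix, which `.toarray()` expands into the full thing):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True)
vectors = vectorizer.fit_transform(["car crashed", "car caught fire"])

matrix = vectors.toarray()  # the whole matrix: a list of vectors
row_vector = matrix[0]      # one vector: the word flags for one document
```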

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

Usually we do .drop to get rid of the label, but words_df only contains word counts - the label column (whether it's suspicious or not) never made it in. Instead, we'll take the is_suspicious column from our original dataframe, the one with the actual text.

X = words_df
y = labeled.is_suspicious

clf = RandomForestClassifier(n_estimators=100)

clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Confusion matrix#

With all of those incredible features, how did it do?

y_true = y
y_pred = clf.predict(X)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted not suspicious Predicted suspicious
Is not suspicious 150 0
Is suspicious 0 15

Amazing!!! 100% accuracy!!! Loving it!!!
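If you want to pull that headline number out of the matrix yourself, accuracy is just the diagonal (the correct predictions) divided by the total, recomputed here from the four cells above:

```python
import numpy as np

# the confusion matrix cells from above
matrix = np.array([[150, 0],
                   [0, 15]])

# correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(matrix) / matrix.sum()
```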

What did the random forest think were the important features?

import eli5
feature_names = list(X.columns)

# If eli5 warns about judging this type of classifier, use this line instead:
# eli5.show_weights(clf, feature_names=feature_names, show=eli5.formatters.fields.ALL)
eli5.show_weights(clf, feature_names=feature_names)
Weight Feature
0.0184 ± 0.1117 deployed
0.0163 ± 0.0944 pulling
0.0163 ± 0.0904 burns
0.0147 ± 0.1022 burning
0.0146 ± 0.0926 degree
0.0141 ± 0.0701 sunroof
0.0129 ± 0.0798 school
0.0116 ± 0.0725 1st
0.0110 ± 0.0776 apart
0.0107 ± 0.0736 zone
0.0102 ± 0.0564 driver
0.0086 ± 0.0574 problem
0.0085 ± 0.0610 killing
0.0083 ± 0.0573 unexpectedly
0.0081 ± 0.0537 further
0.0076 ± 0.0667 sputtered
0.0074 ± 0.0597 2nd
0.0073 ± 0.0698 street
0.0071 ± 0.0611 chin
0.0068 ± 0.0556 suffered
… 2260 more …

Sure, sure, that all makes sense.

No, wait! Let's train-test split#

Oh boy, we totally forgot about train-test split - we were testing the classifier on things it had already seen. Let's split our data into train and test sets and try again.

from sklearn.model_selection import train_test_split

X = words_df
y = labeled.is_suspicious

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
y_true = y_test
y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted not suspicious Predicted suspicious
Is not suspicious 39 0
Is suspicious 3 0

Oh no, that's horrible. That's terrible. Let's try looking at our feature importances, just to see if it's making dumb decisions.

eli5.show_weights(clf, feature_names=feature_names)
Weight Feature
0.0146 ± 0.0872 face
0.0145 ± 0.0830 problem
0.0142 ± 0.1067 deployed
0.0123 ± 0.0874 passenger
0.0118 ± 0.0873 killing
0.0116 ± 0.0922 burns
0.0099 ± 0.0797 ripping
0.0097 ± 0.0830 mouth
0.0096 ± 0.0842 malfunction
0.0094 ± 0.0926 suffered
0.0093 ± 0.0692 his
0.0092 ± 0.0857 both
0.0092 ± 0.0720 degree
0.0089 ± 0.0855 chin
0.0088 ± 0.0666 resulting
0.0087 ± 0.0546 unexpectedly
0.0087 ± 0.0585 1st
0.0085 ± 0.0635 further
0.0084 ± 0.0695 2nd
0.0078 ± 0.0910 apart
… 2260 more …

I mean, it makes sense, I guess. Even though we added all those new features, why doesn't it work well?
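One suspect worth naming before we blame the classifier: the 150-to-15 imbalance. A plain train_test_split can leave the test set nearly empty of suspicious rows. Here's a sketch (with fake stand-in data, not our complaints) of using stratify=y to keep the same ratio in both halves:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Fake stand-in data with our 150 not-suspicious / 15 suspicious ratio
X = np.random.rand(165, 5)
y = np.array([0] * 150 + [1] * 15)

# stratify=y keeps roughly the same 10:1 ratio in train and test,
# so the rare suspicious rows are guaranteed to show up in both
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)
```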

Trying again with a Logistic Classifier#

Well, if there's one thing we know to do, it's try again and again with different classifiers until something works. Let's see if a logistic classifier works any better!

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1e9, solver='lbfgs')

clf.fit(X_train, y_train)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
y_true = y_test
y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted not suspicious Predicted suspicious
Is not suspicious 39 0
Is suspicious 3 0

Just as bad! Sadly, with this much information there's no good pattern. We can feel good about how explainable it is, though.

eli5.show_weights(clf, feature_names=feature_names, target_names=['not suspicious', 'suspicious'])

y=suspicious top features

Weight? Feature
+3.187 deployed
+2.961 passenger
+2.145 degree
+2.017 problem
+1.869 1st
+1.840 2nd
+1.840 hands
+1.818 face
+1.772 burns
+1.612 provide
+1.377 further
… 859 more positive …
… 1155 more negative …
-1.376 traveling
-1.390 light
-1.465 brake
-1.510 pads
-1.546 front
-2.400 is
-2.888 did
-2.924 not
-6.441 <BIAS>
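Those logistic regression weights are log-odds, so they turn multiplicative once you exponentiate them. A quick check using the "deployed" weight from the table above:

```python
import numpy as np

weight = 3.187  # the weight eli5 reports for "deployed"

# the presence of "deployed" multiplies the predicted odds of
# "suspicious" by e^weight - roughly 24x
odds_multiplier = np.exp(weight)
```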

Review#

While last time we hand-picked the words our classifier should pay attention to, this time we used a vectorizer to just use all of the words. We figured that more information was better information, and we wouldn't even have to pick the words ourselves!

Unfortunately our classifier still didn't really find any suspicious complaints.

Discussion topics#

Brainstorm reasons why more information didn't save us.

In classification problems, when might you want to hand-pick words and when might you want to use a vectorizer? Compare this airbag situation, sentiment analysis of tweets, and separating sci-fi and romance novels.