# Finding faulty airbags in a sea of consumer complaints by counting words and classifying the results

**Topics:** Vectorizing text

**Datasets**

* **sampled-labeled.csv:** a sample of vehicle complaints, labeled as suspicious or not

## What's the goal?

It was too much work to read twenty years of vehicle comments to find the ones related to dangerous airbags! The last two times, we tried to pick out the words that best separate dangerous from not-dangerous airbag complaints, but it didn't go so well because we weren't sure which words were the best ones to pick.

This time we're going to pick _everything_.

<p class="reading-options">
  <a class="btn" href="/nyt-takata-airbags/airbag-classifier-search-countvectorizer">
    <i class="fa fa-sm fa-book"></i>
    Read online
  </a>
  <a class="btn" href="/nyt-takata-airbags/notebooks/Airbag classifier search (CountVectorizer).ipynb">
    <i class="fa fa-sm fa-download"></i>
    Download notebook
  </a>
  <a class="btn" href="https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/nyt-takata-airbags/notebooks/Airbag classifier search (CountVectorizer).ipynb" target="_new">
    <i class="fa fa-sm fa-laptop"></i>
    Interactive version
  </a>
</p>

### Prep work: Downloading necessary files
Before we get started, we need to download all of the data we'll be using.
* **sampled-labeled.csv:** labeled complaints - a sample of vehicle complaints, labeled as suspicious or not


In [None]:
# Make data directory if it doesn't exist
!mkdir -p data
!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/nyt-takata-airbags/data/sampled-labeled.csv -P data

## Setup

In [1]:
import pandas as pd

# Allow us to display 100 columns at a time, and 100 characters in each column (instead of ...)
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_colwidth", 100)

## Read in our labeled data

We'll start by reading in our complaints that have labels attached to them. **Read in `sampled-labeled.csv`.**

In [2]:
labeled = pd.read_csv("data/sampled-labeled.csv")
labeled.head()

Unnamed: 0,is_suspicious,CDESCR
0,0.0,"ALTHOUGH I LOVED THE CAR OVERALL AT THE TIME I DECIDED TO OWN, , MY DREAM CAR CADILLAC CTS HAS T..."
1,0.0,"CONSUMER SHUT SLIDING DOOR WHEN ALL POWER LOCKS ON ALL DOORS LOCKED BY ITSELF, TRAPPING INFANT I..."
2,0.0,DRIVERS SEAT BACK COLLAPSED AND BENT WHEN REAR ENDED. PLEASE DESCRIBE DETAILS. TT
3,0.0,TL* THE CONTACT OWNS A 2009 NISSAN ALTIMA. THE CONTACT STATED THAT THE START BUTTON FOR THE IGNI...
4,0.0,THE FRONT MIDDLE SEAT DOESN'T LOCK IN PLACE. *AK


Even though it's called `labeled`, not all of the rows have labels. **Drop the ones missing labels.**

In [3]:
labeled = labeled.dropna()

See how many **suspicious/not suspicious comments** we have.

In [4]:
labeled.is_suspicious.value_counts()

0.0    150
1.0     15
Name: is_suspicious, dtype: int64

150 non-suspicious and 15 suspicious is a pretty terrible ratio, but we're remarkably lazy and not very many of the comments are actually suspicious.
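
To see why that ratio matters, here is a minimal sketch using the counts above. The variable names are made up for illustration. The point is that a "classifier" that always answers "not suspicious" would already look quite accurate on this data without finding a single suspicious complaint.

```python
# Counts taken from the value_counts() output above
not_suspicious = 150
suspicious = 15
total = not_suspicious + suspicious

# Share of complaints that are actually suspicious
suspicious_share = suspicious / total

# Accuracy of a do-nothing "classifier" that always predicts "not suspicious"
baseline_accuracy = not_suspicious / total

print(f"Suspicious share: {suspicious_share:.1%}")
print(f"Always-guess-no accuracy: {baseline_accuracy:.1%}")
```

Any accuracy number we see later should be compared against that baseline.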

Now that we've read a few, let's train our classifier.

## Creating features

### Selecting our features and building a features dataframe

Last time, we thought of some words or phrases that might make a comment interesting or not interesting. We came up with this list:

* airbag
* air bag
* failed
* did not deploy
* violent
* explode
* shrapnel

We then built a dataframe that included those words for each row - `1` if it's in there, `0` if it isn't - along with the `is_suspicious` label. That process looked like this:

In [5]:
train_df = pd.DataFrame({
    'is_suspicious': labeled.is_suspicious,
    'airbag': labeled.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': labeled.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': labeled.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': labeled.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': labeled.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': labeled.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': labeled.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
train_df.head()

Unnamed: 0,is_suspicious,airbag,air bag,failed,did not deploy,violent,explode,shrapnel
0,0.0,0,0,0,0,0,0,0
1,0.0,0,0,0,0,0,0,0
2,0.0,0,0,0,0,0,0,0
3,0.0,0,0,0,0,0,0,0
4,0.0,0,0,0,0,0,0,0


But as we found out later, picking which words are important - **feature selection** - can be a difficult process. There are a _lot_ of words in there, and it isn't like we're going to go through and look at *every single word*, right?

Well, actually, **it's definitely possible to look at every single word**, and it takes way less code than what we did up above.

You can count words using the `CountVectorizer` from scikit-learn. Using `.fit_transform` below will learn all of the words in a column, then count them.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

vectors = vectorizer.fit_transform(labeled.CDESCR)
vectors

<165x2280 sparse matrix of type '<class 'numpy.int64'>'
	with 9089 stored elements in Compressed Sparse Row format>

But... what's a "sparse matrix"? We can see something that looks more familiar if we tell it to become an array (basically a list).

In [7]:
vectors.toarray()

array([[0, 0, 0, ..., 3, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

It's still a little hard to understand, but a list of lists? Sounds like a great opportunity for a dataframe!

In [8]:
pd.DataFrame(vectors.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,...,2230,2231,2232,2233,2234,2235,2236,2237,2238,2239,2240,2241,2242,2243,2244,2245,2246,2247,2248,2249,2250,2251,2252,2253,2254,2255,2256,2257,2258,2259,2260,2261,2262,2263,2264,2265,2266,2267,2268,2269,2270,2271,2272,2273,2274,2275,2276,2277,2278,2279
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,3,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
160,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
161,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
162,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
163,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


**Each row is a sentence (here, one complaint), and each column is a word!**

* We had 165 complaints, so we now have 165 rows
* There were 2280 different words, so we have 2280 columns

If a word appears zero times in a sentence, that column gets a `0`. If it appears one or two or twenty times, that number appears in the column instead.
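
The counting described above can be sketched on a tiny scale. The two texts below are made up for illustration, and `toy_vectorizer` is just a name chosen for this example. Note that `CountVectorizer` lowercases text by default, which is why the complaints are in capitals but the learned words are not.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up mini "complaints", just for illustration
toy_texts = [
    "AIRBAG FAILED",
    "THE AIRBAG AND THE OTHER AIRBAG EXPLODED",
]

toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_texts)

# The learned vocabulary maps each (lowercased) word to its column number
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab, key=vocab.get))

# One row per text, one column per word, each cell is a count
print(toy_vectors.toarray())
```

The second text mentions "airbag" and "the" twice each, so those columns hold a `2` in the second row.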

The whole **sparse matrix** thing comes from SciPy, the library scikit-learn uses to store the counts. It's the idea that since the list of lists is mostly empty, Python can be lazy and not keep track of all of the `0`s. Instead, it only tracks where there are non-`0` numbers. A sparse matrix is much more efficient with space if you have a lot lot lot of `0`s!

We used `.toarray()` to turn it into a list of lists (although sometimes if we have a lot lot lot of words and sentences our computer might not be able to do it).
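
To get a feel for how empty our matrix is, here is a rough sketch. The first part reuses the shape and stored-element count from the sparse matrix output earlier. The second part builds a tiny made-up matrix with `scipy.sparse.csr_matrix` (the "Compressed Sparse Row" format named in that output) to show that only the non-zero values are kept.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Shape and stored-element count copied from the sparse matrix output above
rows, cols, stored = 165, 2280, 9089
total_cells = rows * cols
print(f"{stored} of {total_cells} cells are non-zero ({stored / total_cells:.1%})")

# A tiny made-up matrix to show the same idea on a small scale
dense = np.array([
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
])
sparse = csr_matrix(dense)

# Only the non-zero values are stored, not the zeros
print(sparse.nnz)        # number of stored values
print(sparse.data)       # the stored values themselves
print(sparse.toarray())  # converting back rebuilds the full grid, zeros included
```

Converting back with `.toarray()` has to write out every zero, which is why it can become a memory problem on much larger collections of text.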

**How do we know which column is which word?** When we told the vectorizer to count all of the words in each sentence, it also memorized all of the words separately.

In [9]:
# On scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead
print(vectorizer.get_feature_names())



> **Big secret:** The "fit" part of `.fit_transform` means "learn the words." The "transform" part means "count them."

You can take advantage of this list to build a nice-looking dataframe:

In [10]:
pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())

Unnamed: 0,00,000,01,01v347000,02,02v105000,02v146000,03,03v455000,04,05,05v395000,06,07,08,08v303000,09,10,1000,10017,11,12,128,12th,13,136,13v136000,14,1420,15,150,15pm,16,160lbs,17,180,1996,1997,1998,1999,1st,20,2000,2001,2002,2003,2004,2005,2006,2007,...,window,windows,windshield,wiper,wipers,wires,wiring,wished,with,within,without,withstand,witnesses,won,wonder,woosh,word,work,working,works,worn,worse,worsened,worst,worth,would,wouldn,wrangler,wreck,wrecks,wrist,write,writes,writing,written,wrong,xterra,xxx,yards,yc,year,years,yes,yet,yield,york,you,your,zero,zone
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,3,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
160,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
161,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
162,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
163,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
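
The difference between "fit" and "transform" is easier to see when they are called separately. This is a small sketch with made-up texts, not part of the airbag analysis: the vectorizer learns words from one list, then counts words in a different one. Words it never learned are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts, just for illustration
known_texts = ["AIRBAG FAILED", "AIRBAG EXPLODED"]
new_texts = ["AIRBAG EXPLODED VIOLENTLY"]

demo_vectorizer = CountVectorizer()

# "fit" learns the vocabulary: airbag, exploded, failed
demo_vectorizer.fit(known_texts)

# "transform" counts only the words that were learned during fit
new_vectors = demo_vectorizer.transform(new_texts)

print(sorted(demo_vectorizer.vocabulary_))
print(new_vectors.toarray())
```

"VIOLENTLY" was not in the texts used for fitting, so it gets no column and does not show up in the counts.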


## Only counting with ones and zeros

It doesn't seem to matter too much whether a word shows up one or two or twenty times in a complaint - the only important thing is whether **yes** it shows up or **no** it doesn't show up.

To turn the counting into just `0`s and `1`s, we send an extra option to our `CountVectorizer`.

In [11]:
vectorizer = CountVectorizer(binary=True)

vectors = vectorizer.fit_transform(labeled.CDESCR)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
words_df.head()

Unnamed: 0,00,000,01,01v347000,02,02v105000,02v146000,03,03v455000,04,05,05v395000,06,07,08,08v303000,09,10,1000,10017,11,12,128,12th,13,136,13v136000,14,1420,15,150,15pm,16,160lbs,17,180,1996,1997,1998,1999,1st,20,2000,2001,2002,2003,2004,2005,2006,2007,...,window,windows,windshield,wiper,wipers,wires,wiring,wished,with,within,without,withstand,witnesses,won,wonder,woosh,word,work,working,works,worn,worse,worsened,worst,worth,would,wouldn,wrangler,wreck,wrecks,wrist,write,writes,writing,written,wrong,xterra,xxx,yards,yc,year,years,yes,yet,yield,york,you,your,zero,zone
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
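
A quick side-by-side on a made-up text shows what `binary=True` changes. The text and variable names here are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A made-up text where one word repeats
repeat_text = ["AIRBAG AIRBAG AIRBAG FAILED"]

counts = CountVectorizer().fit_transform(repeat_text)
flags = CountVectorizer(binary=True).fit_transform(repeat_text)

# Columns are in alphabetical order: airbag, failed
print(counts.toarray())  # regular counting
print(flags.toarray())   # binary=True caps every non-zero count at 1
```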


## Using our new dataframe in machine learning

We really like random forests now, right? They're more or less a fancy decision tree, and they usually give pretty good results.

Let's try one out with our new every-single-word features.

> **Hot tip:** a vector is just a list of numbers (for example, each row). A matrix is a list of vectors.

In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

Usually we do `.drop` to get rid of the label, but when we counted all of our words it didn't carry over the label column (whether it's suspicious or not). Instead, we'll just use the `is_suspicious` column from our original dataframe, the one with the actual text.

In [13]:
X = words_df
y = labeled.is_suspicious

clf = RandomForestClassifier(n_estimators=100)

clf.fit(X, y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

## Confusion matrix

With all of those incredible features, how did it do?

In [14]:
y_true = y
y_pred = clf.predict(X)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not suspicious,Predicted suspicious
Is not suspicious,150,0
Is suspicious,0,15


Amazing!!! 100% accuracy!!! Loving it!!!

**What did the random forest think were the important features?**

In [15]:
import eli5
feature_names = list(X.columns)

# Use this line instead to also show eli5's caveats about judging these classifiers
# eli5.show_weights(clf, feature_names=feature_names, show=eli5.formatters.fields.ALL)
eli5.show_weights(clf, feature_names=feature_names)

Weight,Feature
0.0184  ± 0.1117,deployed
0.0163  ± 0.0944,pulling
0.0163  ± 0.0904,burns
0.0147  ± 0.1022,burning
0.0146  ± 0.0926,degree
0.0141  ± 0.0701,sunroof
0.0129  ± 0.0798,school
0.0116  ± 0.0725,1st
0.0110  ± 0.0776,apart
0.0107  ± 0.0736,zone


Sure, sure, that all makes sense.

## No, wait! Let's train-test split

Oh boy, we totally forgot about the train-test split! We were testing the classifier on things it had already seen. Let's split the data into a training set and a test set and try again.

In [16]:
from sklearn.model_selection import train_test_split

X = words_df
y = labeled.is_suspicious

X_train, X_test, y_train, y_test = train_test_split(X, y)
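
With only 15 suspicious complaints, a purely random split can leave very few of them in the test set. One option, which is not what this notebook does, is the `stratify` parameter of `train_test_split`. The sketch below uses stand-in data with the same 150/15 balance. `X_demo`, `y_demo`, and the `random_state` value are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same 150/15 class balance as our labels
y_demo = np.array([0] * 150 + [1] * 15)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

# stratify=y_demo asks for the same class balance in both pieces;
# random_state is only here to make the sketch repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

print("Suspicious in train:", y_tr.sum())
print("Suspicious in test: ", y_te.sum())
```

Stratifying does not create more suspicious examples. It only makes sure the few we have are shared between the two sets in proportion.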

In [17]:
clf = RandomForestClassifier(n_estimators=100)

clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [18]:
y_true = y_test
y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not suspicious,Predicted suspicious
Is not suspicious,39,0
Is suspicious,3,0


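**Oh no, that's horrible.** The classifier labeled every single test complaint as not suspicious, so it missed all three of the suspicious ones. Let's try looking at our feature importances, just to see if it's making dumb decisions.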

In [19]:
eli5.show_weights(clf, feature_names=feature_names)

Weight,Feature
0.0146  ± 0.0872,face
0.0145  ± 0.0830,problem
0.0142  ± 0.1067,deployed
0.0123  ± 0.0874,passenger
0.0118  ± 0.0873,killing
0.0116  ± 0.0922,burns
0.0099  ± 0.0797,ripping
0.0097  ± 0.0830,mouth
0.0096  ± 0.0842,malfunction
0.0094  ± 0.0926,suffered


I mean, it makes sense, I guess. **Even though we added all those new features, why doesn't it work well?**
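
To put numbers on how poorly it did, here is a small worked example using the test-set confusion matrix above. The variable names are chosen for illustration. Accuracy and recall are standard measures, but the notebook itself does not compute them.

```python
# Confusion matrix values copied from the test-set results above
true_negatives, false_positives = 39, 0
false_negatives, true_positives = 3, 0

total = true_negatives + false_positives + false_negatives + true_positives

# Accuracy: share of all complaints that were classified correctly
accuracy = (true_negatives + true_positives) / total

# Recall: share of the truly suspicious complaints that were caught
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.1%}")
print(f"Recall:   {recall:.1%}")
```

The accuracy looks respectable only because most complaints are not suspicious. The recall shows the real problem: none of the suspicious complaints were found.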

## Trying again with a Logistic Classifier

Well, if there's one thing we know to do, it's try again and again with different classifiers until something works. **Let's see if a logistic classifier works any better!**

In [20]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1e9, solver='lbfgs')

clf.fit(X_train, y_train)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [21]:
y_true = y_test
y_pred = clf.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not suspicious', 'suspicious'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not suspicious,Predicted suspicious
Is not suspicious,39,0
Is suspicious,3,0


Just as bad! Sadly, even with this much information the classifier didn't find a useful pattern. We can feel good about how explainable it is, though.

In [22]:
eli5.show_weights(clf, feature_names=feature_names, target_names=['not suspicious', 'suspicious'])

Weight?,Feature
+3.187,deployed
+2.961,passenger
+2.145,degree
+2.017,problem
+1.869,1st
+1.840,2nd
+1.840,hands
+1.818,face
+1.772,burns
+1.612,provide
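
One way to read these weights, assuming they are the logistic regression coefficients, is as changes in log-odds. The sketch below converts the weight for "deployed" into an odds multiplier. Treat it as a reading aid rather than a finding: with so few suspicious examples and almost no regularization (`C=1e9`), the weights themselves are shaky.

```python
import math

# Weight copied from the table above for the word "deployed"
deployed_weight = 3.187

# In logistic regression a weight is a change in log-odds, so
# exponentiating it gives the multiplier applied to the odds
odds_multiplier = math.exp(deployed_weight)

print(f"Odds multiplier: {odds_multiplier:.1f}")
```

In other words, under this model a complaint containing "deployed" has roughly 24 times the odds of being flagged as suspicious, all else being equal.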


## Review

While last time we just gave our classifier **hand-picked words** to pay attention to, this time we used a **vectorizer** to just use _all_ of the words. We figured that more information was better information, and we wouldn't even have to flag more complaints!

Unfortunately our classifier still didn't really find any suspicious complaints.

## Discussion topics

Brainstorm reasons why more information didn't save us.

In classification problems, when might you want to hand-pick words and when might you want to use a vectorizer? Compare this airbag situation with sentiment analysis of tweets and with separating sci-fi novels from romance novels.