5.3 Using our classifier

The point of a classifier is to classify documents it hasn’t seen before: to read them and sort them into the appropriate category. Before we can do that, we need to extract features from our original dataframe, the one that doesn’t have labels.

We’ll do this the same way we did with our labeled data: by checking whether each complaint contains each word on our list.

# One 0/1 column per keyword: does the complaint mention it or not?
# na=False counts missing complaint text as "no match" instead of NaN
features = pd.DataFrame({
    'airbag': df.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': df.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': df.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': df.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': df.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': df.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': df.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
features.head()
   airbag  air bag  failed  did not deploy  violent  explode  shrapnel
0       0        0       0               0        0        0         0
1       0        0       0               0        0        0         0
2       0        0       0               0        0        0         0
3       0        0       0               0        0        0         0
4       0        0       0               0        0        0         0
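
The first five rows are all zeros, which doesn’t tell us much on its own. Summing each 0/1 column is a quick way to count how many complaints matched each word and confirm the features aren’t empty across the board:

# How many complaints matched each keyword?
features.sum()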

Notice that we didn’t create an is_suspicious column - that’s because we don’t know yet which of these complaints are suspicious!
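
One thing worth double-checking before we ask for predictions: scikit-learn wants these columns to match the ones the classifier was trained on, in the same order. If clf was fit on a dataframe (scikit-learn 1.0 or newer), it remembers the training column names, so a one-line assertion can catch a mismatch - a quick sanity check, not something you always need:

# feature_names_in_ exists when clf was fit on a dataframe (sklearn 1.0+)
assert list(features.columns) == list(clf.feature_names_in_)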

Now we can have our classifier predict whether each complaint is suspicious based on which of the suspicious words it contains. Let’s store the classifier’s predictions as a new is_suspicious column.

features['is_suspicious'] = clf.predict(features)
features.head()
   airbag  air bag  failed  did not deploy  violent  explode  shrapnel  is_suspicious
0       0        0       0               0        0        0         0              0
1       0        0       0               0        0        0         0              0
2       0        0       0               0        0        0         0              0
3       0        0       0               0        0        0         0              0
4       0        0       0               0        0        0         0              0
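
Before we dig in, it’s worth knowing how many complaints got flagged overall. And if your classifier supports .predict_proba - most scikit-learn classifiers like LogisticRegression or RandomForestClassifier do - you can ask for a probability instead of a hard yes/no. A sketch, assuming clf is one of those:

# How many complaints did the classifier flag as suspicious?
features.is_suspicious.value_counts()

# Confidence of each prediction: column 1 is the "suspicious" class.
# We drop the column we just added so the input matches training.
suspicious_prob = clf.predict_proba(features.drop(columns='is_suspicious'))[:, 1]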

Let’s take a look at only the suspicious ones.

features[features.is_suspicious == 1].head(20)
       airbag  air bag  failed  did not deploy  violent  explode  shrapnel  is_suspicious
56          0        0       0               0        1        0         0              1
1217        1        0       0               0        1        0         0              1
1868        0        0       0               0        1        0         0              1
2035        0        0       0               0        1        0         0              1
2960        0        0       0               0        1        0         0              1
4129        0        0       0               0        1        0         0              1
5362        0        0       0               0        1        0         0              1
5663        0        1       0               0        1        0         0              1
5672        0        1       0               0        1        0         0              1
7507        0        0       0               0        1        0         0              1
7581        0        0       0               0        1        0         0              1
7686        0        0       0               0        1        0         0              1
8100        0        0       0               0        1        0         0              1
8834        0        0       0               0        1        0         0              1
10341       0        1       0               0        1        0         0              1
10425       0        1       0               0        1        0         0              1
11196       0        1       0               0        1        0         0              1
11202       0        1       0               0        1        0         0              1
13135       0        0       0               0        1        0         0              1
13518       0        1       0               0        1        0         0              1

We can see every one marked as suspicious includes the word “violent,” a handful also include “airbag” or “air bag,” and none of them include “failed” or “did not deploy.” The airbag-related ones make sense, but what about all of the ones that include the word “violent” without “airbag” or “air bag”? None of those should count!
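
Those airbag-free matches are worth reading before we trust them. As a quick sketch - remember df is the unlabeled dataframe we built the features from, so the row indexes line up - we can pull out the flagged complaints that never mention an airbag and read their original text:

# Flagged complaints that never mention an airbag: likely false positives
flagged = features[features.is_suspicious == 1]
no_airbag = flagged[(flagged['airbag'] == 0) & (flagged['air bag'] == 0)]

# Read the original complaint text for a few of them
df.loc[no_airbag.index].CDESCR.head()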

While we could just filter the results down to complaints that actually mention an airbag, what we really need is a way to test the quality of our classifier.