5.3 Using our classifier
The point of a classifier is to classify documents it hasn't seen before: to read them and put them into the appropriate category. Before we can do this, we need to extract features from our original dataframe, the one that doesn't have labels.
We'll do this the same way we did with our labeled data: by checking each complaint for every word in our list.
```python
# Flag each complaint for the presence of each suspicious word or phrase.
# The complaint text in CDESCR is uppercase in this dataset, so we search
# for uppercase strings; na=False treats missing descriptions as "no match."
features = pd.DataFrame({
    'airbag': df.CDESCR.str.contains("AIRBAG", na=False).astype(int),
    'air bag': df.CDESCR.str.contains("AIR BAG", na=False).astype(int),
    'failed': df.CDESCR.str.contains("FAILED", na=False).astype(int),
    'did not deploy': df.CDESCR.str.contains("DID NOT DEPLOY", na=False).astype(int),
    'violent': df.CDESCR.str.contains("VIOLENT", na=False).astype(int),
    'explode': df.CDESCR.str.contains("EXPLODE", na=False).astype(int),
    'shrapnel': df.CDESCR.str.contains("SHRAPNEL", na=False).astype(int),
})
features.head()
```
| | airbag | air bag | failed | did not deploy | violent | explode | shrapnel |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Notice that we didn't create an is_suspicious column this time. That's because we don't know whether these complaints are suspicious or not!
Now we can have our classifier predict whether each one is suspicious, based on which of the suspicious words it contains. Let's add the prediction to our features dataframe as a new is_suspicious column.
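Here's a minimal sketch of that step, assuming the classifier we trained in the previous section is stored in a variable named `clf` (the name is an assumption; any scikit-learn classifier with a `.predict()` method would work the same way):

```python
# `clf` is assumed to be the classifier trained on the labeled data
# earlier. .predict() returns one predicted label (0 or 1) per row,
# which we store as a new column alongside our features.
features['is_suspicious'] = clf.predict(features)
features.head()
```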
| | airbag | air bag | failed | did not deploy | violent | explode | shrapnel | is_suspicious |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Let’s take a look at only the suspicious ones.
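In pandas that's just a boolean filter on the column we added above:

```python
# Keep only the complaints the classifier flagged as suspicious
features[features.is_suspicious == 1]
```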
| | airbag | air bag | failed | did not deploy | violent | explode | shrapnel | is_suspicious |
|---|---|---|---|---|---|---|---|---|
| 56 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1217 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1868 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2035 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 2960 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4129 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 5362 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 5663 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 5672 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 7507 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 7581 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 7686 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 8100 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 8834 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 10341 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 10425 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 11196 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 11202 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 13135 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 13518 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
We can see that every complaint marked as suspicious includes the word "violent," several also include "airbag" or "air bag," and none include "failed" or "did not deploy." The airbag ones make sense, but what about all of the ones that include "violent" without "airbag" or "air bag"? None of those should be good!
While we could just filter the results down to ones that also mention "airbag," what we really need is a way to test the quality of our classifier.
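For reference, that extra filter would look something like this, a sketch built on the `features` dataframe from above (the column name with a space needs bracket notation):

```python
# Flagged complaints that also mention "airbag" or "air bag"
suspicious = features[features.is_suspicious == 1]
suspicious[(suspicious.airbag == 1) | (suspicious['air bag'] == 1)]
```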