7.1 Trying other classifiers

But now we get to do one of the most exciting and absurd parts of machine learning: replacing our classifier!

One of the magical parts of machine learning is that there are dozens of techniques - different types of regression, decision trees, random forests, and so on - and they all operate the same way when you're programming with them. You give them your data, they learn, you ask for a prediction. You usually only have to change one line to switch the type of classifier you're using!

Most of the time different techniques perform roughly the same. It’s kind of like driving an SUV vs a sedan vs a sports car - if you’re driving around the suburbs in normal weather, there isn’t a big difference between all of them. Every once in a while one does a better job because of peculiarities in your data, but you don’t always know ahead of time which one that will be.
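To make that concrete, here's a rough sketch of what that shared interface looks like. It uses a tiny synthetic dataset from scikit-learn's make_classification instead of our actual crime reports, so the names and numbers are purely for illustration - the point is that .fit() and .predict() are called the exact same way no matter which classifier you pick.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

# A tiny made-up dataset, just to demonstrate the shared interface
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

# Every scikit-learn classifier works the same way:
# create it, .fit() it on your data, then .predict() with it
for clf in [LogisticRegression(), RandomForestClassifier(n_estimators=100), LinearSVC()]:
    clf.fit(X_demo, y_demo)
    predictions = clf.predict(X_demo)
    print(type(clf).__name__, (predictions == y_demo).mean())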

7.1.1 Trying: LogisticRegression

Since our random forest did a terrible job, let's try one called a logistic regression classifier. Don't get scared because the name sounds fancy and mathematical! At this point we don't care about how these things work or even what they are: we're just swapping code in and out to see how it performs.

We’re using this one because the LA Times used one for their piece, although they referred to it as a “maximum entropy classifier,” which is definitely a cooler name.

As we change the classifier, compare our new code to the code above. Notice how we only change the clf = line where we create the classifier:

from sklearn.linear_model import LogisticRegression

# Create a new classifier
# (C=1e9 means very, very little regularization; a nearly unregularized
# logistic regression is the same thing as the "maximum entropy"
# classifier mentioned above)
clf = LogisticRegression(C=1e9)

# Teach the classifier how the words (X) relate to the categories (y)
clf.fit(X, y)

# Make our predictions and see how it looks
## LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
##                    fit_intercept=True, intercept_scaling=1, l1_ratio=None,
##                    max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
##                    random_state=None, solver='warn', tol=0.0001, verbose=0,
##                    warm_start=False)
## 
## /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
##   FutureWarning)
## /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
##   "the number of iterations.", ConvergenceWarning)
df['prediction'] = clf.predict(X)
df[(df.is_part_i == 1) & (df.reported == 0)].prediction.value_counts()
## 0    1496
## 1     448
## Name: prediction, dtype: int64

Wow, that’s a huge improvement! Instead of 8% correctly marked as Part I, we’re now catching 448 of these 1,944 downgraded reports, or about 23%!
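If you'd rather have pandas compute that share for you instead of doing the division by hand, value_counts can return fractions instead of counts. This wasn't part of the original analysis, it's just a one-line sketch using the same df and prediction column from above:

# Same filter as above, but normalize=True turns the counts into fractions
df[(df.is_part_i == 1) & (df.reported == 0)].prediction.value_counts(normalize=True)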

7.1.2 Trying: LinearSVC

And how about a linear support vector machine? Again, you don’t need to know what it is, and again, we’ll only change the clf = line where we create the classifier:

from sklearn.svm import LinearSVC

# Create a new classifier
clf = LinearSVC()

# Teach the classifier how the words (X) relate to the categories (y)
clf.fit(X, y)

# Make our predictions and see how it looks
## LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
##           intercept_scaling=1, loss='squared_hinge', max_iter=1000,
##           multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
##           verbose=0)
df['prediction'] = clf.predict(X)
df[(df.is_part_i == 1) & (df.reported == 0)].prediction.value_counts()
## 0    1399
## 1     545
## Name: prediction, dtype: int64

The huge improvement we saw with the logistic regression also shows up with LinearSVC, the support vector machine: it catches 545 of the 1,944 downgraded Part I crimes, or about 28%.
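Since swapping classifiers really is a one-line change, you can also compare a few of them in a single loop. This is only a sketch, assuming the same X, y, and df from above; it fits each classifier and prints the fraction of downgraded Part I crimes it catches:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Fit each classifier and see what fraction of the downgraded
# Part I crimes it correctly flags (assumes X, y and df from above)
for clf in [LogisticRegression(C=1e9), LinearSVC()]:
    clf.fit(X, y)
    df['prediction'] = clf.predict(X)
    caught = df[(df.is_part_i == 1) & (df.reported == 0)].prediction.mean()
    print(type(clf).__name__, round(caught, 2))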