5.1 Building a classifier

Classification is the act of putting things in categories. There are approximately ten billion kinds of classifiers, most of which work roughly the same way.

  1. Features: the 1s and 0s marking which words appear in each comment
  2. Labels: 1 or 0, whether the comment was suspicious or not

If you’re familiar with linear regression, you know it takes a bunch of inputs and predicts a number. We’re going to use its cousin, logistic regression, which takes a bunch of inputs and predicts a category (in this case: suspicious or not).

from sklearn.linear_model import LogisticRegression

# Every column EXCEPT whether it's suspicious
X = training_features.drop(columns='is_suspicious')
# ONLY the column for whether it's suspicious
y = training_features.is_suspicious

# Build a new classifier
# C=1e9 is a very large C, which (more or less) turns off regularization
clf = LogisticRegression(C=1e9)

# Teach the classifier about the complaints we read
clf.fit(X, y)
## LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
##                    fit_intercept=True, intercept_scaling=1, l1_ratio=None,
##                    max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
##                    random_state=None, solver='warn', tol=0.0001, verbose=0,
##                    warm_start=False)
## 
## /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
##   FutureWarning)

That was it! That’s all it took! Our classifier is done!

When it comes to machine learning, you’ll spend most of your time finding and tending to your data. Once you get to the actual “algorithm” part it’s usually just a matter of making small tweaks here or there.
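Once the classifier is trained, using it is one more line: hand `.predict` a frame with the same columns as `X`. A self-contained sketch with a tiny made-up training set (the two word columns are hypothetical stand-ins for your real features):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set: two word-presence columns
X = pd.DataFrame({'free': [1, 1, 0, 0], 'thanks': [0, 0, 1, 1]})
y = [1, 1, 0, 0]

clf = LogisticRegression(C=1e9, solver='lbfgs')
clf.fit(X, y)

# A new comment that contains "free" but not "thanks"
new_comment = pd.DataFrame({'free': [1], 'thanks': [0]})

prediction = clf.predict(new_comment)         # hard 0/1 answer
probability = clf.predict_proba(new_comment)  # [prob of 0, prob of 1]
```

`.predict` gives you a plain 0 or 1, while `.predict_proba` tells you how confident the classifier is — handy when you'd rather rank comments by suspiciousness than sort them into two bins.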