6.2 Training our classifier

Technically speaking, our classifier is called a model, because it mathematically models the way our words reflect our categories. The first thing we need to do with our model is feed it our data so it can learn. It needs two things:

  • The word frequencies (we saved them as X before)
  • The categories (whether each offense is Part I or Part II, stored as a 1 or a 0)

It’s probably a waste of time to teach it using every single offense, so we’ll just use the first 10,000. That should give it a pretty good idea of what is a Part I crime and what is a Part II crime. It’d probably be better to train it with more, but I’m more than a little impatient.

# Only train on the first 10,000 offenses (see above)
training = df.head(10000)
# We call these X and y because everyone else on earth does
# .fit_transform learns the words AND counts them
X = vec.fit_transform(training.DO_NARRATIVE)
y = training.reported
clf.fit(X, y)
## RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
##                        max_depth=None, max_features='auto', max_leaf_nodes=None,
##                        min_impurity_decrease=0.0, min_impurity_split=None,
##                        min_samples_leaf=1, min_samples_split=2,
##                        min_weight_fraction_leaf=0.0, n_estimators=100,
##                        n_jobs=None, oob_score=False, random_state=None,
##                        verbose=0, warm_start=False)
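
Did it actually learn anything? As a quick sanity check, we can hand the model a couple of narratives and see what it guesses. This is just a sketch, and the narratives below are invented: the important part is that we use .transform instead of .fit_transform, so the vectorizer reuses the vocabulary it learned during training instead of starting over.

# A couple of invented narratives, just to poke at the model
sample_narratives = [
    "SUSP STOLE VICTS VEH FROM DRIVEWAY",          # reads like a Part I theft
    "SUSP SPRAY PAINTED GRAFFITI ON VICTS FENCE",  # reads like Part II vandalism
]
# .transform (not .fit_transform!) counts words using the vocabulary learned above
sample_X = vec.transform(sample_narratives)
# Following our coding above, 1 means Part I and 0 means Part II
clf.predict(sample_X)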

Notice that we’re teaching our model with the “fake” categories, the ones where we downgraded 15% of the Part I offenses. That’s because in the real world we wouldn’t know which ones are accurately reported and which ones aren’t. We can only hope that enough are correctly reported for our model to learn well!
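
If you’re curious what that kind of downgrading looks like in code, here’s a minimal sketch of the technique, not the exact code from earlier: it assumes a hypothetical true_label column holding the honest 1/0 categories.

# A sketch only: 'true_label' is a hypothetical column with the honest categories
df['reported'] = df['true_label']
# Randomly pick 15% of the Part I offenses...
downgraded = df[df['true_label'] == 1].sample(frac=0.15)
# ...and downgrade them to Part II
df.loc[downgraded.index, 'reported'] = 0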