5.3 Random Forest

A random forest is a collection of slightly different decision trees, hence the name. Each tree is trained on a random sample of the data (and considers a random subset of the features at each split), and the forest combines their predictions. By pooling many trees instead of relying on one, you can get a performance boost over a single decision tree.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
## RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
##                        max_depth=None, max_features='auto', max_leaf_nodes=None,
##                        min_impurity_decrease=0.0, min_impurity_split=None,
##                        min_samples_leaf=1, min_samples_split=2,
##                        min_weight_fraction_leaf=0.0, n_estimators=100,
##                        n_jobs=None, oob_score=False, random_state=None,
##                        verbose=0, warm_start=False)
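To see the "many slightly different trees" idea in action, you can peek at the individual trees through `forest.estimators_` and tally their votes yourself. This is a sketch on synthetic data (our real `X_train`/`y_train` come from the FOIA dataset); note that scikit-learn's forest technically averages predicted probabilities rather than counting hard votes, but the idea is the same.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Synthetic stand-in for our real dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Ask every tree in the forest for its prediction on the first five rows
votes = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Majority vote across the 100 trees
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("hand-counted majority:", majority)
print("forest's prediction:  ", forest.predict(X[:5]))
```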

Even though we aren't big fans of accuracy as a metric, we might as well check it.

# Check its accuracy
accuracy_score(y_test, forest.predict(X_test))
## 0.7700087950747582
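One caveat: that number comes from a single train/test split, so it can wobble depending on which rows landed in the test set. Cross-validation averages the score over several splits for a steadier estimate. A quick sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for our real dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Train and score on 5 different train/test splits
scores = cross_val_score(forest, X, y, cv=5)
print("mean accuracy:", scores.mean().round(3), "+/-", scores.std().round(3))
```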

And now the confusion matrix.

y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
##                  Predicted unsuccessful  Predicted successful
## Is unsuccessful                    1384                   183
## Is successful                       340                   367
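A 2x2 confusion matrix also lets you compute precision and recall for the "successful" class by hand, which tells a fuller story than accuracy. Using the counts from the table above:

```python
import numpy as np

# Counts from our confusion matrix
matrix = np.array([[1384, 183],
                   [ 340, 367]])
tn, fp, fn, tp = matrix.ravel()

precision = tp / (tp + fp)  # of the requests we predicted successful, how many were?
recall = tp / (tp + fn)     # of the actually successful requests, how many did we catch?
print(round(precision, 3), round(recall, 3))
## 0.667 0.519
```

So when the forest predicts success it's right about two-thirds of the time, but it only catches about half of the truly successful requests.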

Let's also ask eli5 which features the forest is leaning on.

eli5.explain_weights_df(forest, feature_names=list(X.columns))
##                    feature     weight        std
##   high_success_rate_agency  0.3413467  0.0219515
##                 word_count  0.2297791  0.0203512
##                avg_sen_len  0.2273866  0.0199684
##                specificity  0.1608934  0.0274081
##                  hyperlink  0.0132717  0.0058599
##                   ref_foia  0.0127861  0.0049166
##                   ref_fees  0.0082162  0.0034040
##              email_address  0.0063202  0.0027815
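Those weights are the forest's built-in impurity-based feature importances, so you can get the same ranking without eli5 by reading `forest.feature_importances_` directly. A sketch with synthetic data and made-up column names standing in for our real ones:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for our real dataset and columns
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=['feat_a', 'feat_b', 'feat_c', 'feat_d'])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance per column; they always sum to 1
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```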

It does a bit better than the other classifiers we've seen so far, and we can see it's using more than just the ever-popular high_success_rate_agency column to figure out whether a request will be successful or not.