5.3 Random Forest
A random forest is a collection of slightly different decision trees, hence the name. Each tree makes its own prediction, and by combining their votes you can usually get a performance boost over any single decision tree.
from sklearn.ensemble import RandomForestClassifier
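# Build a forest of 100 trees and train it on the same features as before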
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
## RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
## max_depth=None, max_features='auto', max_leaf_nodes=None,
## min_impurity_decrease=0.0, min_impurity_split=None,
## min_samples_leaf=1, min_samples_split=2,
## min_weight_fraction_leaf=0.0, n_estimators=100,
## n_jobs=None, oob_score=False, random_state=None,
## verbose=0, warm_start=False)
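To make "combining their votes" concrete, we can poll the trees one by one. scikit-learn keeps the individual fitted trees in forest.estimators_, and the forest's final prediction comes from averaging the trees' predicted probabilities, which usually matches the majority vote. A quick sketch for intuition, assuming our labels are 0/1 with 1 meaning successful:

import numpy as np

# Ask every tree in the forest about the first row of the test set
first_row = X_test[:1]
votes = [tree.predict(first_row)[0] for tree in forest.estimators_]

# With 0/1 labels, the sum is the number of trees voting "successful"
print("trees voting successful:", int(np.sum(votes)))
print("the forest's prediction:", forest.predict(first_row)[0])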
Even though we aren't big fans of accuracy as a metric, we might as well check it.

forest.score(X_test, y_test)

## 0.7700087950747582
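Part of why accuracy is underwhelming here: our classes are imbalanced, so always guessing "unsuccessful" already scores reasonably well. A quick baseline check (a sketch, assuming y_test is a pandas Series):

# Accuracy you'd get by always predicting the most common class
baseline = y_test.value_counts(normalize=True).max()
print(baseline)

From the confusion matrix below, that baseline works out to about 1567 / 2274, or roughly 0.69, so 0.77 is an improvement, just not a dramatic one.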
And now the confusion matrix.
from sklearn.metrics import confusion_matrix
import pandas as pd

y_true = y_test
y_pred = forest.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
|  | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1384 | 183 |
| Is successful | 340 | 367 |
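The forest can also tell us which columns it leans on. One way to build a table like the one below (a sketch, not necessarily how these exact numbers were computed): take forest.feature_importances_ as the weight, and measure how much each feature's importance varies from tree to tree as the std. This assumes X_train is a dataframe, so the feature names live in X_train.columns.

import numpy as np

# Mean importance of each feature across the whole forest
weights = forest.feature_importances_

# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

pd.DataFrame({
    'feature': X_train.columns,
    'weight': weights,
    'std': std
}).sort_values('weight', ascending=False)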
| feature | weight | std |
|---|---|---|
| high_success_rate_agency | 0.3413467 | 0.0219515 |
| word_count | 0.2297791 | 0.0203512 |
| avg_sen_len | 0.2273866 | 0.0199684 |
| specificity | 0.1608934 | 0.0274081 |
| hyperlink | 0.0132717 | 0.0058599 |
| ref_foia | 0.0127861 | 0.0049166 |
| ref_fees | 0.0082162 | 0.0034040 |
| email_address | 0.0063202 | 0.0027815 |
It does a bit better than the other classifiers we've seen so far, and we can see it's using more than just the ever-popular high_success_rate_agency column to figure out whether a request will be granted or not.