5.3 Random Forest
A random forest is a collection of slightly different decision trees, hence the name. Each tree makes its own prediction, and by combining their votes you can usually get a performance boost over any single decision tree.
from sklearn.ensemble import RandomForestClassifier
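# Build a forest of 100 trees and train it on the same features as before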
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
## RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
## max_depth=None, max_features='auto', max_leaf_nodes=None,
## min_impurity_decrease=0.0, min_impurity_split=None,
## min_samples_leaf=1, min_samples_split=2,
## min_weight_fraction_leaf=0.0, n_estimators=100,
## n_jobs=None, oob_score=False, random_state=None,
## verbose=0, warm_start=False)
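To make "combining their votes" concrete, we can poll the trees one by one. scikit-learn keeps the individual fitted trees in forest.estimators_, and the forest's final prediction comes from averaging the trees' predicted probabilities, which usually matches the majority vote. A quick sketch for intuition, assuming our labels are 0/1 with 1 meaning successful:

import numpy as np

# Ask every tree in the forest about the first row of the test set
first_row = X_test[:1]
votes = [tree.predict(first_row)[0] for tree in forest.estimators_]

# With 0/1 labels, the sum is the number of trees voting "successful"
print("trees voting successful:", int(np.sum(votes)))
print("the forest's prediction:", forest.predict(first_row)[0])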
Even though we aren't big fans of accuracy as a metric, we might as well check it.

forest.score(X_test, y_test)

## 0.7700087950747582
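Part of why accuracy is underwhelming here: our classes are imbalanced, so always guessing "unsuccessful" already scores reasonably well. A quick baseline check (a sketch, assuming y_test is a pandas Series):

# Accuracy you'd get by always predicting the most common class
baseline = y_test.value_counts(normalize=True).max()
print(baseline)

From the confusion matrix below, that baseline works out to about 1567 / 2274, or roughly 0.69, so 0.77 is an improvement, just not a dramatic one.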
And now the confusion matrix.
from sklearn.metrics import confusion_matrix
import pandas as pd

y_true = y_test
y_pred = forest.predict(X_test)

matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
|  | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1384 | 183 |
| Is successful | 340 | 367 |
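The forest can also tell us which columns it leans on. One way to build a table like the one below (a sketch, not necessarily how these exact numbers were computed): take forest.feature_importances_ as the weight, and measure how much each feature's importance varies from tree to tree as the std. This assumes X_train is a dataframe, so the feature names live in X_train.columns.

import numpy as np

# Mean importance of each feature across the whole forest
weights = forest.feature_importances_

# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)

pd.DataFrame({
    'feature': X_train.columns,
    'weight': weights,
    'std': std
}).sort_values('weight', ascending=False)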
| feature | weight | std |
|---|---|---|
| high_success_rate_agency | 0.3413467 | 0.0219515 |
| word_count | 0.2297791 | 0.0203512 |
| avg_sen_len | 0.2273866 | 0.0199684 |
| specificity | 0.1608934 | 0.0274081 |
| hyperlink | 0.0132717 | 0.0058599 |
| ref_foia | 0.0127861 | 0.0049166 |
| ref_fees | 0.0082162 | 0.0034040 |
| email_address | 0.0063202 | 0.0027815 |
It does a bit better than the other classifiers we've seen so far, and we can see it's using more than just the ever-popular high_success_rate_agency column to figure out whether a request will be granted or not.