5.3 Random Forest
A random forest is a collection of slightly different decision trees, hence the name. If you combine their outputs, you can get a performance boost over a single decision tree.
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
## RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
## max_depth=None, max_features='auto', max_leaf_nodes=None,
## min_impurity_decrease=0.0, min_impurity_split=None,
## min_samples_leaf=1, min_samples_split=2,
## min_weight_fraction_leaf=0.0, n_estimators=100,
## n_jobs=None, oob_score=False, random_state=None,
## verbose=0, warm_start=False)
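If you’re curious what a “collection of slightly different decision trees” looks like in practice, the individual trees live in forest.estimators_. Here’s a minimal sketch of poking at them — it reuses X_test from above, and note that scikit-learn actually averages the trees’ predicted probabilities rather than counting strict majority votes.

import numpy as np

# Each of the 100 trees makes its own prediction for every request...
tree_preds = np.array([tree.predict(X_test) for tree in forest.estimators_])

# ...and we can see how often a lone tree agrees with the forest's
# final answer. It won't be 100% - that disagreement between the
# trees is where the ensemble's boost comes from.
(tree_preds == forest.predict(X_test)).mean()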
Even though accuracy isn’t our favorite metric, we might as well check it.
forest.score(X_test, y_test)
## 0.7700087950747582
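To see the promised boost over a single decision tree for yourself, you could fit one lone tree on the same split. A quick sketch that reuses X_train and X_test from above — your exact score will vary from run to run:

from sklearn.tree import DecisionTreeClassifier

# One tree on its own, for comparison with the forest's score
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
tree.score(X_test, y_test)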
And now the confusion matrix.
from sklearn.metrics import confusion_matrix
import pandas as pd

y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)

| | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1384 | 183 |
| Is successful | 340 | 367 |
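The forest can also tell us which features it leaned on. The table below lists each feature’s importance (its weight) along with the standard deviation of that importance across the individual trees. The code that produced it isn’t shown here, but a minimal sketch using scikit-learn’s feature_importances_ might look like this, assuming X_train is a pandas DataFrame with named columns:

import numpy as np

# Mean importance across the whole forest, plus the spread of
# that importance across the individual trees
weights = pd.DataFrame({
    'feature': X_train.columns,
    'weight': forest.feature_importances_,
    'std': np.std([tree.feature_importances_
                   for tree in forest.estimators_], axis=0),
}).sort_values('weight', ascending=False)
weights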
| feature | weight | std |
|---|---|---|
| high_success_rate_agency | 0.3413467 | 0.0219515 |
| word_count | 0.2297791 | 0.0203512 |
| avg_sen_len | 0.2273866 | 0.0199684 |
| specificity | 0.1608934 | 0.0274081 |
| hyperlink | 0.0132717 | 0.0058599 |
| ref_foia | 0.0127861 | 0.0049166 |
| ref_fees | 0.0082162 | 0.0034040 |
| email_address | 0.0063202 | 0.0027815 |
It does a bit better than the other classifiers we’ve seen so far, and we can see it’s relying on more than just the ever-popular high_success_rate_agency column to figure out whether a request will be granted or not.