5.2 Decision Trees
A decision tree is another classifier we can try out. Can it beat logistic regression?
from sklearn import tree
dec_tree = tree.DecisionTreeClassifier(max_depth=4)
dec_tree.fit(X_train, y_train)
# Check its accuracy
dec_tree.score(X_test, y_test)
## DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
## max_features=None, max_leaf_nodes=None,
## min_impurity_decrease=0.0, min_impurity_split=None,
## min_samples_leaf=1, min_samples_split=2,
## min_weight_fraction_leaf=0.0, presort=False,
## random_state=None, splitter='best')
## 0.7783641160949868
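To answer the "can it beat logistic regression?" question directly, we can score both models on the same test set. A minimal sketch, assuming the fitted logistic regression from the previous section is still around as logreg (your variable name there may differ):

# Compare both classifiers on the same held-out test set
# (assumes `logreg` is the fitted LogisticRegression from earlier)
print("Logistic regression:", logreg.score(X_test, y_test))
print("Decision tree:", dec_tree.score(X_test, y_test))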
from sklearn.metrics import confusion_matrix
import pandas as pd

y_true = y_test
y_pred = dec_tree.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
|  | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1387 | 180 |
| Is successful | 324 | 383 |
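Raw counts hide the per-class tradeoffs a little. scikit-learn's classification_report turns the same predictions into precision and recall for each class; a minimal sketch, assuming the class labels sort as unsuccessful-then-successful, matching the table above:

from sklearn.metrics import classification_report

# Precision, recall and F1 for each class, from the same predictions
# (target_names assumes the label order unsuccessful, successful)
print(classification_report(y_true, y_pred,
                            target_names=['unsuccessful', 'successful']))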
The decision tree performs similarly to the logistic regression, but it has one big benefit: to explain it, we can draw a super-fun diagram.
import pydotplus

# Export the tree as Graphviz dot data, then render it to a PNG
dot_data = tree.export_graphviz(dec_tree,
                                max_depth=3,
                                feature_names=X.columns,
                                class_names=dec_tree.classes_.astype(str),
                                out_file=None, filled=True, rounded=True, proportion=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png("output.png")
# To display it inline in a notebook instead:
# from IPython.display import Image
# Image(graph.create_png())
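If getting graphviz and pydotplus installed is a hassle, newer versions of scikit-learn (0.21+) can draw the same diagram with matplotlib alone. A minimal sketch using the same options:

import matplotlib.pyplot as plt

# Same tree diagram, drawn with matplotlib instead of graphviz
fig, ax = plt.subplots(figsize=(20, 10))
tree.plot_tree(dec_tree,
               max_depth=3,
               feature_names=X.columns,
               class_names=dec_tree.classes_.astype(str),
               filled=True, rounded=True, proportion=True, ax=ax)
fig.savefig("tree.png")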
I can also get a chart similar to the logistic regression one, showing how important each feature is to the tree's decisions.
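Decision trees don't have coefficients, but they do expose feature_importances_. A minimal sketch of one way to build a table like the one below, assuming X is the same features DataFrame used for training:

# Pair each feature with its importance, largest first
pd.DataFrame({
    'feature': X.columns,
    'weight': dec_tree.feature_importances_
}).sort_values('weight', ascending=False)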
| feature | weight |
|---|---|
| high_success_rate_agency | 0.8791873 |
| word_count | 0.0418041 |
| specificity | 0.0376105 |
| avg_sen_len | 0.0324807 |
| ref_foia | 0.0080910 |
| ref_fees | 0.0008264 |
| email_address | 0.0000000 |
| hyperlink | 0.0000000 |
Again, it's the high_success_rate_agency feature that does the heavy lifting.
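To see just how much lifting, you could train a one-split "stump" on that single feature and compare its accuracy to the full tree. A minimal sketch of one way to check (not part of the original analysis):

# Hypothetical check: how well does high_success_rate_agency do alone?
stump = tree.DecisionTreeClassifier(max_depth=1)
stump.fit(X_train[['high_success_rate_agency']], y_train)
stump.score(X_test[['high_success_rate_agency']], y_test)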