6.1 Leaving out our best feature
Even though high_success_rate_agency is a great feature at predicting success or failure, it's so powerful that it seems like cheating. We don't want to hear that we're being rejected just because we're sending a FOIA to the CIA - we want to build the best predictor we can!

What happens if we ignore the agency completely? Let's remove high_success_rate_agency from our feature set and see how our classifiers perform.

By dropping it we're implying our current classifiers are useless, but are they really? It's an open question! Knowing that we can be lazy with a request to the EPA but need to put in more effort when applying to the CIA might actually be useful.
6.1.1 Setting up our features
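If you've been following along from the earlier sections, the only change here is dropping the high_success_rate_agency column before splitting into training and test sets. Here's a minimal sketch of what that setup might look like - the dataframe name (df) and the label column ('successful') are assumptions, so swap in whatever names your data actually uses; the imports mirror what the classifiers below expect.

# Imports carried over from earlier sections (assumed)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Assumed names: df is our dataframe of FOIA requests, 'successful' is the label.
# Drop the label plus the feature we're deliberately leaving out this time.
X = df.drop(columns=['successful', 'high_success_rate_agency'])
y = df.successful

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)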
6.1.2 K-nearest neighbors
# Train a knn classifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
# Build a confusion matrix
y_true = y_test
y_pred = knn.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
| | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1351 | 249 |
| Is successful | 475 | 199 |
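If the raw counts are hard to compare at a glance, one option is to show each row of the confusion matrix as a percentage of that row's total. A quick sketch, reusing the matrix and label_names from above:

# Turn each row of the confusion matrix into fractions of that row's total
results = pd.DataFrame(matrix,
                       columns='Predicted ' + label_names,
                       index='Is ' + label_names)
results.div(results.sum(axis=1), axis=0).round(2)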
6.1.3 Logistic regression
# Train a logistic regression classifier
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)
# Build a confusion matrix
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
| | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1592 | 8 |
| Is successful | 664 | 10 |
6.1.4 Decision tree
# Train a decision tree
dec_tree = tree.DecisionTreeClassifier(max_depth=4)
dec_tree.fit(X_train, y_train)
# Build a confusion matrix
y_true = y_test
y_pred = dec_tree.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
| | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1557 | 43 |
| Is successful | 622 | 52 |
6.1.5 Random forest
# Train a random forest
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
# Build a confusion matrix
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
| | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1272 | 328 |
| Is successful | 409 | 265 |
6.1.6 Summary
Was it shocking? Looking at our confusion matrices, it's clear that some algorithms leaned heavily on high_success_rate_agency, while others held up reasonably well once it was removed. Reading the numbers (and the quick accuracy check sketched after this list):

- The k-nearest neighbors classifier did worse, but still caught a decent number of successful requests (199 of 674)
- The logistic regression looks nearly as bad: it barely predicts success at all, flagging only 18 requests as successful and getting 10 of those right
- The decision tree does about as well as k-nearest neighbors on overall accuracy, although it catches far fewer successful requests (52)
- The random forest performed slightly better than the KNN, catching the most successful requests (265) at the cost of mislabeling more unsuccessful ones (328)
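To put a single number behind each of those impressions, we can score every fitted classifier on the same held-out test set. A rough sketch, assuming the four classifiers trained above (knn, logreg, dec_tree, forest) are still in scope:

from sklearn.metrics import accuracy_score

# Compare overall test-set accuracy for each classifier we trained
classifiers = {
    'K-nearest neighbors': knn,
    'Logistic regression': logreg,
    'Decision tree': dec_tree,
    'Random forest': forest,
}

for name, clf in classifiers.items():
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {accuracy:.1%}")

Keep in mind that overall accuracy hides the difference between the two kinds of mistakes: the logistic regression can score respectably here mostly by predicting "unsuccessful" almost every time.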
Random forests are generally thought of as a good “general purpose” classification algorithm. They’re flexible, they’re interpretable, and they usually work pretty well. We’ll see some problems with them in other chapters - they can be super slow, for example - but they get the job done.