6.1 Leaving out our best feature

Even though high_success_rate_agency is a great feature in terms of predicting success or failure, it’s so powerful that it seems like cheating. We don’t want to hear that we’re being rejected just because we’re sending a FOIA to the CIA - we want to build the best predictor we can!

What happens if we ignore the agency completely? Let’s remove high_success_rate_agency from our feature set and see how our classifiers perform.

Does that mean our current classifiers are useless? Not necessarily - it’s an open question! Knowing that we can be lazy with a request to the EPA but need to put in more effort when writing to the CIA might actually be useful.

6.1.1 Setting up our features

# pandas and scikit-learn pieces used in this section
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Remove high_success_rate_agency from our feature set
X = df[[
  'word_count', 'avg_sen_len',
  'ref_foia', 'ref_fees', 'hyperlink',
  'email_address', 'specificity'
]]
y = df.successful

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y)
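
Before we train anything, it’s worth checking how lopsided the outcomes are. If most requests fail, a classifier can rack up a decent-looking score just by predicting failure every time. A minimal sketch of that check, using the y we just built:

# What fraction of requests succeeded vs. failed?
# normalize=True gives proportions instead of raw counts
y.value_counts(normalize=True)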

6.1.2 K-nearest neighbors

# Train a knn classifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

# Build a confusion matrix
y_true = y_test
y_pred = knn.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
                 Predicted unsuccessful  Predicted successful
Is unsuccessful                    1351                   249
Is successful                       475                   199
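
Raw counts are hard to compare because there are far more unsuccessful requests than successful ones. One option is to turn each row into percentages, so you can read off “what share of the truly successful requests did we catch?” A small sketch, reusing the matrix and label_names from above:

# Divide each row of the confusion matrix by its row total
row_percents = matrix / matrix.sum(axis=1, keepdims=True)

pd.DataFrame(row_percents.round(2),
     columns='Predicted ' + label_names,
     index='Is ' + label_names)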

6.1.3 Logistic regression

# Train a logistic regression classifier
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)

# Build a confusion matrix
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
                 Predicted unsuccessful  Predicted successful
Is unsuccessful                    1592                     8
Is successful                       664                    10
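
That matrix looks suspicious: the logistic regression almost never says “successful.” A quick sanity check is to count how often each class gets predicted:

# How many times did the model predict each class on the test set?
pd.Series(y_pred).value_counts()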

6.1.4 Decision tree

# Train a decision tree
dec_tree = tree.DecisionTreeClassifier(max_depth=4)
dec_tree.fit(X_train, y_train)

# Build a confusion matrix
y_true = y_test
y_pred = dec_tree.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
                 Predicted unsuccessful  Predicted successful
Is unsuccessful                    1557                    43
Is successful                       622                    52

6.1.5 Random forest

# Train a random forest classifier
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)

# Build a confusion matrix
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
                 Predicted unsuccessful  Predicted successful
Is unsuccessful                    1272                   328
Is successful                       409                   265

6.1.6 Summary

Was it shocking? Looking at our confusion matrices, it’s clear that some algorithms leaned heavily on high_success_rate_agency, while others stayed flexible once it was removed. A quick side-by-side score comparison is sketched after the list below.

  • The KNN did worse, but still got a decent number correct
  • The logistic regression hardly ever predicts “successful” at all - it only got 10 right here
  • The decision tree does about as well as k-nearest neighbors
  • The random forest did roughly as well as the KNN overall, and caught the most successful requests
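
Here’s that side-by-side comparison: a minimal sketch that asks each fitted classifier for its accuracy on the test set (scikit-learn’s .score returns the fraction of predictions it got right). Since we didn’t set a random_state on our split, your exact numbers will differ.

# Overall accuracy for each classifier, using the same test set
for name, clf in [('KNN', knn),
                  ('Logistic regression', logreg),
                  ('Decision tree', dec_tree),
                  ('Random forest', forest)]:
    print(name, round(clf.score(X_test, y_test), 3))

Accuracy alone hides the imbalance between successful and unsuccessful requests, though, which is why the confusion matrices above are still the more honest thing to stare at.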

Random forests are generally thought of as a good “general purpose” classification algorithm. They’re flexible, they’re interpretable, and they usually work pretty well. We’ll see some problems with them in other chapters - they can be super slow, for example - but they get the job done.
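
Since we just called random forests interpretable: one thing they give you for free after fitting is a rough ranking of how much each feature mattered. A minimal sketch using the forest we trained above (feature_importances_ is a standard scikit-learn attribute):

# Which features did the forest lean on most?
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances.sort_values(ascending=False)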