6.1 Leaving out our best feature
Even though high_success_rate_agency is a great feature at predicting success or failure, it's so powerful that it seems like cheating. We don't want to hear that we're being rejected just because we're sending a FOIA to the CIA - we want to build the best predictor we can!

What happens if we ignore the agency completely? Let's remove high_success_rate_agency from our feature set and see how our classifiers perform.

By dropping it we're implying our current classifiers are useless, but are they really? It's an open question! Knowing that we can be lazy with a request to the EPA but need to put in more effort when applying to the CIA might actually be useful.
6.1.1 Setting up our features
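If you've been following along from the earlier sections, the only change here is dropping the high_success_rate_agency column before splitting into training and test sets. Here's a minimal sketch of what that setup might look like - the dataframe name (df) and the label column ('successful') are assumptions, so swap in whatever names your data actually uses; the imports mirror what the classifiers below expect.

# Imports carried over from earlier sections (assumed)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Assumed names: df is our dataframe of FOIA requests, 'successful' is the label.
# Drop the label plus the feature we're deliberately leaving out this time.
X = df.drop(columns=['successful', 'high_success_rate_agency'])
y = df.successful

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)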
6.1.2 K-nearest neighbors
# Train a knn classifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
# Build a confusion matrix
y_true = y_test
y_pred = knn.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
| | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1351 | 249 |
| Is successful | 475 | 199 |
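If the raw counts are hard to compare at a glance, one option is to show each row of the confusion matrix as a percentage of that row's total. A quick sketch, reusing the matrix and label_names from above:

# Turn each row of the confusion matrix into fractions of that row's total
results = pd.DataFrame(matrix,
                       columns='Predicted ' + label_names,
                       index='Is ' + label_names)
results.div(results.sum(axis=1), axis=0).round(2)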
6.1.3 Logistic regression
# Train a logistic regression classifier
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)
# Build a confusion matrix
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
| | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1592 | 8 |
| Is successful | 664 | 10 |
6.1.4 Decision tree
# Train a decision tree
dec_tree = tree.DecisionTreeClassifier(max_depth=4)
dec_tree.fit(X_train, y_train)
# Build a confusion matrix
y_true = y_test
y_pred = dec_tree.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
| | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1557 | 43 |
| Is successful | 622 | 52 |
6.1.5 Random forest
# Train a random forest
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
# Build a confusion matrix
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
| | Predicted unsuccessful | Predicted successful |
|---|---|---|
| Is unsuccessful | 1272 | 328 |
| Is successful | 409 | 265 |
6.1.6 Summary
Was it shocking? Looking at our confusion matrices, it's clear that some algorithms leaned heavily on high_success_rate_agency, while others held up reasonably well once it was removed. Reading the numbers (and the quick accuracy check sketched after this list):

- The k-nearest neighbors classifier did worse, but still caught a decent number of successful requests (199 of 674)
- The logistic regression looks nearly as bad: it barely predicts success at all, flagging only 18 requests as successful and getting 10 of those right
- The decision tree does about as well as k-nearest neighbors on overall accuracy, although it catches far fewer successful requests (52)
- The random forest performed slightly better than the KNN, catching the most successful requests (265) at the cost of mislabeling more unsuccessful ones (328)
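To put a single number behind each of those impressions, we can score every fitted classifier on the same held-out test set. A rough sketch, assuming the four classifiers trained above (knn, logreg, dec_tree, forest) are still in scope:

from sklearn.metrics import accuracy_score

# Compare overall test-set accuracy for each classifier we trained
classifiers = {
    'K-nearest neighbors': knn,
    'Logistic regression': logreg,
    'Decision tree': dec_tree,
    'Random forest': forest,
}

for name, clf in classifiers.items():
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {accuracy:.1%}")

Keep in mind that overall accuracy hides the difference between the two kinds of mistakes: the logistic regression can score respectably here mostly by predicting "unsuccessful" almost every time.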
Random forests are generally thought of as a good “general purpose” classification algorithm. They’re flexible, they’re interpretable, and they usually work pretty well. We’ll see some problems with them in other chapters - they can be super slow, for example - but they get the job done.