6 Feature selection and engineering

One thing that always bothered me about the FOIA Predictor is that high_success_rate_agency column. All of the classifiers seem to really love it, so what if we remove every feature except that one?

Let’s see what it looks like with a logistic regression using only high_success_rate_agency.

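# Imports this cell relies on (df is assumed to be loaded already)
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Only select one feature this time
X = df[['high_success_rate_agency']]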
y = df.successful

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Train a logistic regression classifier
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)

# Build a confusion matrix
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
                 Predicted unsuccessful  Predicted successful
Is unsuccessful                    1364                   207
Is successful                       274                   429
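To put numbers on that matrix, here is a small sketch that turns the four counts into the usual summary scores. The counts are copied straight from the output above; the variable names and the "always guess unsuccessful" baseline are my own additions for illustration, not part of the original notebook.

```python
# Counts copied from the confusion matrix above
# (rows are the actual classes, columns are the predicted classes).
tn, fp = 1364, 207   # actually unsuccessful: predicted unsuccessful / successful
fn, tp = 274, 429    # actually successful:   predicted unsuccessful / successful

total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of all predictions that were right
precision = tp / (tp + fp)        # of the predicted successes, how many were real
recall = tp / (tp + fn)           # of the real successes, how many we caught
baseline = (tn + fp) / total      # accuracy of always guessing "unsuccessful"

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"baseline:  {baseline:.3f}")
```

On this particular test split the one-column model gets roughly 79% of requests right, against roughly 69% for always guessing "unsuccessful". Since train_test_split is not seeded here, a rerun would give slightly different counts.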

That was disappointing! Even though we have all of those other columns, it looks like high_success_rate_agency isn’t just the most important feature, it’s practically the only thing we need.

It feels like adding those other columns would only give us marginal improvement over just knowing whether we’re making a request to a high success rate agency. If you tell the algorithm “hey, I’m sending a FOIA to the EPA” it can completely ignore the actual content of the FOIA itself and tell you “oh sure yeah it’s gonna get granted.”
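One way to see why the content of the request gets ignored: a model trained on a single yes/no column can only ever make two different predictions, one for each value of that column. The sketch below uses made-up toy rows (not the FOIA data) and a plain majority vote instead of logistic regression. With one binary feature and almost no regularization the two should usually agree, but treat that as an assumption rather than something demonstrated here.

```python
from collections import Counter, defaultdict

# Made-up toy rows for illustration only: (high_success_rate_agency, successful)
rows = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]

# Tally the outcomes separately for each value of the flag
outcomes = defaultdict(Counter)
for flag, successful in rows:
    outcomes[flag][successful] += 1

# The whole "model": predict the most common outcome for each flag value
rule = {flag: counts.most_common(1)[0][0] for flag, counts in outcomes.items()}

for flag, prediction in sorted(rule.items()):
    print(f"high_success_rate_agency={flag} -> predict successful={prediction}")
```

Whatever else is in the request, the prediction is a two-row lookup table keyed on the agency flag.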

This leads to an important question about what we’re even using this predictor for: if our classifier is only paying attention to the agency, is it helpful at all?