5.1 Logistic Regression

The logistic regression classifier is one alternative to the k-nearest neighbors classifier.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)
## LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
##                    fit_intercept=True, intercept_scaling=1, l1_ratio=None,
##                    max_iter=1000, multi_class='warn', n_jobs=None, penalty='l2',
##                    random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
##                    warm_start=False)

# Check its accuracy
accuracy_score(y_test, logreg.predict(X_test))
## 0.7801231310466139

Accuracy looks a few points better than the KNN's, but we know not to trust accuracy on its own as an evaluation metric.
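To see why, here's a rough sketch using our test set's class counts (1,567 unsuccessful vs. 707 successful requests): a do-nothing classifier that always predicts "unsuccessful" already scores nearly 69% accuracy without learning anything.

```python
import numpy as np

# Hypothetical labels with the same class balance as our test set:
# 1,567 unsuccessful (0) and 707 successful (1) requests.
y = np.array([0] * 1567 + [1] * 707)

# Always predicting "unsuccessful" is right whenever the true label
# is 0, so its accuracy is just the share of zeros in the data.
baseline = (y == 0).mean()
print(round(baseline, 3))
## 0.689
```

A 78% accuracy is only about nine points better than that lazy baseline, which is why we look at the confusion matrix instead.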

from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
                 Predicted unsuccessful  Predicted successful
Is unsuccessful                    1372                   195
Is successful                       305                   402

It turns out a logistic regression does way better than k-nearest neighbors! For our successful requests, KNN predicted just shy of 200 of them. Logistic regression was able to predict over four hundred!

The KNN did better with our unsuccessful requests, though: it correctly predicted around 1,500 of them compared to the logistic regression classifier's roughly 1,400.

When facing a choice between different classifiers, these are the sorts of things that might make you go in one direction or another. If we do better at identifying successful requests, is it okay to have a few unsuccessful requests incorrectly predicted as successful?
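One way to put numbers on that trade-off is precision and recall for the "successful" class. Here's a quick sketch computed directly from the logistic regression's confusion matrix, with the counts copied in by hand:

```python
# Counts from the confusion matrix above.
tn, fp = 1372, 195   # truly unsuccessful: correctly / wrongly classified
fn, tp = 305, 402    # truly successful: missed / caught

# Of the requests we predicted as successful, how many really were?
precision = tp / (tp + fp)
# Of the truly successful requests, how many did we catch?
recall = tp / (tp + fn)

print(round(precision, 2), round(recall, 2))
## 0.67 0.57
```

If wrongly flagging unsuccessful requests as successful is cheap, you'd favor the higher-recall classifier; if those false alarms are costly, you'd care more about precision.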

Along with performing better than a KNN algorithm, logistic regression is also remarkably convenient when it comes to explaining the result! We can use the Python package eli5 to see which columns are important.

import eli5
eli5.explain_weights_df(logreg, feature_names=list(X.columns))
target  feature                       weight
     1  high_success_rate_agency   2.3657931
     1  ref_foia                   0.5136506
     1  email_address              0.0687234
     1  word_count                 0.0007750
     1  avg_sen_len               -0.0008912
     1  specificity               -0.0358758
     1  hyperlink                 -0.1191786
     1  ref_fees                  -0.3808806
     1  <BIAS>                    -1.4161082
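Those weights aren't magic: eli5 is reading them straight off the fitted model's `coef_` attribute (the `<BIAS>` row is the intercept). Here's a minimal sketch of doing the same thing without eli5, using tiny made-up data and placeholder feature names since our real `X` and `y` aren't reproduced in this snippet:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: feature_a matters most, feature_b a bit,
# feature_c not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

logreg = LogisticRegression(solver='lbfgs').fit(X, y)

# Pair each (made-up) feature name with its coefficient, biggest first.
feature_names = ['feature_a', 'feature_b', 'feature_c']
for name, weight in sorted(zip(feature_names, logreg.coef_[0]),
                           key=lambda pair: -pair[1]):
    print(f'{name:10} {weight: .3f}')
print(f'{"<BIAS>":10} {logreg.intercept_[0]: .3f}')
```

eli5 just wraps this lookup in a tidy DataFrame and handles the feature-name bookkeeping for you.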

From these results we can see that by far the most important feature is whether you're applying to a high-success-rate agency. We'll address that later; for now, let's move on to another classifier.

There’s a lot more to be said about logistic regression! After you’re done here, each one of these classifiers has more examples and explanations in other chapters.