5.1 Logistic Regression
The logistic regression classifier is one alternative to the k-nearest neighbors classifier.
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X_train, y_train)
# Check its accuracy on the test set
logreg.score(X_test, y_test)
## LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
## fit_intercept=True, intercept_scaling=1, l1_ratio=None,
## max_iter=1000, multi_class='warn', n_jobs=None, penalty='l2',
## random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
## warm_start=False)
## 0.7801231310466139
Accuracy looks like it’s a few points better, but we know better than to trust accuracy alone as an evaluation metric.
from sklearn.metrics import confusion_matrix
import pandas as pd

y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['unsuccessful', 'successful'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
|                 | Predicted unsuccessful | Predicted successful |
|-----------------|------------------------|----------------------|
| Is unsuccessful | 1372                   | 195                  |
| Is successful   | 305                    | 402                  |
It turns out logistic regression does far better than k-nearest neighbors at finding successful requests: KNN correctly predicted just shy of 200 of them, while logistic regression caught over four hundred!
The KNN did better with our unsuccessful requests, though: it correctly predicted around 1,500, compared to the logistic regression classifier’s roughly 1,400.
When facing a choice between different classifiers, these are the sorts of things that might make you go in one direction or another. If we do better at identifying successful requests, is it okay to have a few unsuccessful requests incorrectly predicted as successful?
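One way to put numbers on that trade-off is to compute per-class recall straight from the confusion matrix — a quick sketch using the counts from the table above:

```python
import numpy as np

# Confusion matrix from the table above:
# rows are the true labels, columns are the predicted labels
matrix = np.array([[1372, 195],
                   [305, 402]])

# Recall for each class: correct predictions divided by the row total
recall = matrix.diagonal() / matrix.sum(axis=1)
print(f"Unsuccessful recall: {recall[0]:.2f}")  # 0.88
print(f"Successful recall:   {recall[1]:.2f}")  # 0.57
```

Logistic regression catches 88% of unsuccessful requests but only 57% of successful ones — exactly the kind of imbalance you’d weigh when choosing between classifiers.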
Along with performing better than the KNN algorithm, logistic regression is also remarkably convenient when it comes to explaining the result! We can use the Python package eli5 to see which columns are important.
| target | feature                  | weight     |
|--------|--------------------------|------------|
| 1      | high_success_rate_agency | 2.3657931  |
| 1      | ref_foia                 | 0.5136506  |
| 1      | email_address            | 0.0687234  |
| 1      | word_count               | 0.0007750  |
| 1      | avg_sen_len              | -0.0008912 |
| 1      | specificity              | -0.0358758 |
| 1      | hyperlink                | -0.1191786 |
| 1      | ref_fees                 | -0.3808806 |
| 1      | &lt;BIAS&gt;             | -1.4161082 |
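That eli5 table is, under the hood, just the fitted model’s coefficients sorted by weight. If you’d rather skip the extra package, you can pull the same ranking from the model directly — a sketch using a tiny synthetic dataset with illustrative column names (the real code would use your trained `logreg` and your feature columns):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for the real features; column names are
# illustrative, borrowed from the table above
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=['high_success_rate_agency', 'ref_foia', 'ref_fees'])
y = (X['high_success_rate_agency'] + rng.normal(size=200) > 0).astype(int)

logreg = LogisticRegression().fit(X, y)

# One coefficient per feature, sorted largest-to-smallest
weights = pd.Series(logreg.coef_[0], index=X.columns)
print(weights.sort_values(ascending=False))
```

The intercept (`logreg.intercept_`) is what eli5 reports as the `<BIAS>` row.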
From these results we can see that by far the most important thing is whether you’re applying to a high-success-rate agency. We’ll address that later; for now, let’s move on to another classifier.
There’s a lot more to be said about logistic regression! After you’re done here, each one of these classifiers has more examples and explanations in other chapters.