Need to predict what category something goes into? Hate speech/not hate speech, spy plane/not spy plane, dangerous doctor/not dangerous doctor? Classification is what you're looking for!

While it's best if you've read up on logistic regression first, you'll definitely be okay even if you didn't. Maybe a few feelings of being left out here or there, but 100% physically survivable.

Our dataset#

Once upon a time when we were studying logistic regression, we were really really into knitting scarves. We had a list of scarves we had attempted to knit, and whether we were successful or not in completing them. We're going to revisit our adventures in knitting, starting with these four measurements on each attempted scarf:

  • 📏 How long each scarf was
  • 🧹 Whether we used a large-gauge knitting needle
  • 🎨 The color of our yarn
  • 🧣 Whether we finished the scarf or not

With these four pieces of data about each scarf, we wound up with a dataset that looked something like this:

📏 🧹 🎨 🧣
55 inches Yes orange Finished!
55 inches No orange Finished!
55 inches No brown Finished!
60 inches No brown Finished!
60 inches No grey Nope
70 inches No grey Finished!
70 inches No orange Nope
82 inches Yes grey Finished!
82 inches No brown Nope
82 inches No orange Nope
82 inches Yes brown Nope

We used logistic regression to see how scarf length, needle gauge, and scarf color affected our ability to complete the scarf. Our findings were as follows:

  • As scarves got longer we were less likely to complete them...
  • ...but if we used large-gauge needles we were more likely to finish them.
  • Certain colors seemed to have a positive or negative effect, but their p-values were just too high to consider significant.

Let's review how we did this, nice and quickly. We'll be performing our logistic regression with the statsmodels package, which is a delight when it comes to running linear and logistic regression in Python.

First we'll make our dataframe.

import pandas as pd

df = pd.DataFrame([
    { 'length_in': 55, 'large_gauge': 1, 'color': 'orange', 'completed': 1 },
    { 'length_in': 55, 'large_gauge': 0, 'color': 'orange', 'completed': 1 },
    { 'length_in': 55, 'large_gauge': 0, 'color': 'brown', 'completed': 1 },
    { 'length_in': 60, 'large_gauge': 0, 'color': 'brown', 'completed': 1 },
    { 'length_in': 60, 'large_gauge': 0, 'color': 'grey', 'completed': 0 },
    { 'length_in': 70, 'large_gauge': 0, 'color': 'grey', 'completed': 1 },
    { 'length_in': 70, 'large_gauge': 0, 'color': 'orange', 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 1, 'color': 'grey', 'completed': 1 },
    { 'length_in': 82, 'large_gauge': 0, 'color': 'brown', 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 0, 'color': 'orange', 'completed': 0 },
    { 'length_in': 82, 'large_gauge': 1, 'color': 'brown', 'completed': 0 },
])
df

length_in large_gauge color completed
0 55 1 orange 1
1 55 0 orange 1
2 55 0 brown 1
3 60 0 brown 1
4 60 0 grey 0
5 70 0 grey 1
6 70 0 orange 0
7 82 1 grey 1
8 82 0 brown 0
9 82 0 orange 0
10 82 1 brown 0

Then we'll feed it into statsmodels, asking it how the completed column is related to the three columns we're using as features.

import statsmodels.formula.api as smf

model = smf.logit("completed ~ length_in + large_gauge + C(color, Treatment('orange'))", data=df)
results = model.fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.424906
         Iterations 7
Logit Regression Results
Dep. Variable: completed No. Observations: 11
Model: Logit Df Residuals: 6
Method: MLE Df Model: 4
Date: Fri, 20 Dec 2019 Pseudo R-squ.: 0.3833
Time: 14:52:06 Log-Likelihood: -4.6740
converged: True LL-Null: -7.5791
Covariance Type: nonrobust LLR p-value: 0.2138
coef std err z P>|z| [0.025 0.975]
Intercept 12.1245 8.094 1.498 0.134 -3.740 27.989
C(color, Treatment('orange'))[T.brown] 0.4594 2.257 0.204 0.839 -3.965 4.884
C(color, Treatment('orange'))[T.grey] 1.4708 2.289 0.643 0.520 -3.015 5.957
length_in -0.1944 0.126 -1.540 0.124 -0.442 0.053
large_gauge 2.8814 2.845 1.013 0.311 -2.694 8.457

These results don't mean much, as we need to convert the coefficients into odds ratios to see how much better or worse our odds get as our features change.

import numpy as np

coefs = pd.DataFrame({
    'coef': results.params.values,
    'odds ratio': np.exp(results.params.values),
    'name': results.params.index
})
coefs
coef odds ratio name
0 12.124529 184338.541216 Intercept
1 0.459412 1.583143 C(color, Treatment('orange'))[T.brown]
2 1.470759 4.352538 C(color, Treatment('orange'))[T.grey]
3 -0.194425 0.823308 length_in
4 2.881375 17.838786 large_gauge

Each extra inch decreases our odds of finishing by about 18%, while using large-gauge needles improves our odds of finishing by about 18x. That's logistic regression: how much each input changes the output.
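If the jump from odds ratio to "about 18%" feels like a magic trick, here is a small sketch of the arithmetic. It only uses the two coefficients printed in the table above; the helper name odds_change_percent is made up for this example.

```python
import math

def odds_change_percent(coef):
    """Turn a logistic regression coefficient into a percent change in odds."""
    odds_ratio = math.exp(coef)
    return (odds_ratio - 1) * 100

# Coefficients copied from the statsmodels summary above
for name, coef in [("length_in", -0.194425), ("large_gauge", 2.881375)]:
    print(f"{name}: odds ratio {math.exp(coef):.2f}, "
          f"odds change {odds_change_percent(coef):+.1f}%")
```

An odds ratio below 1 shrinks the odds and one above 1 grows them, which is where "about 18% worse per inch" and "about 18x better with large-gauge needles" come from.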

Logistic Classifier#

Classifiers go in the opposite direction: if I want to make a scarf that looks like _, am I going to finish it?

Let's say we have a bunch of scarves in our to-do list, scarves we haven't made, and we're curious as to whether we're going to complete them or not.

📏 🧹 🎨 🧣
55 inches Yes grey ???
65 inches No grey ???
75 inches Yes orange ???
80 inches No orange ???
90 inches Yes brown ???

A classifier will tell us whether or not we're likely to finish each scarf. To do this we're going to abandon our friend statsmodels and use a new Python library called scikit-learn (or sklearn). Scikit-learn is a machine learning library that can do all sorts of data science magic.

If you're thinking "wait, we made predictions before!" you're perfectly right, but you're just going to have to hold on for a while.

There are all sorts of classifiers - scikit-learn has at least ten different ones - but since we learned about logistic regression already we're going to start with a logistic classifier. It uses the same sort of math as the logistic regression, it's just more focused around prediction as opposed to giving us odds ratios and the like.

Let's hop in!

# Import the classifier from scikit-learn
from sklearn.linear_model import LogisticRegression

# Create a new classifier
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=4000, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

When we make a LogisticRegression in scikit-learn, we give it a few options. While we don't need all of these all of the time, I find it useful to always use them for ease of cut-and-paste.

  • C=1e9 is a magic number that Just Makes Things Always Work. Don't worry about it, just use it. Sorry!!
  • solver='lbfgs' is the default technique for solving the regression in newer versions of scikit-learn. We specify it here so we don't get a warning in older versions.
  • max_iter=4000 tells it to try reaaaally hard to find a solution to the problem. The default is 100 iterations, we up it to 4000 just in case working extra hard gets an answer.

We call our classifier clf because it's, well, a classifier, and for some reason programmers hate typing more than three or four letters at a time.
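If "don't worry about it" makes you worry about it: according to scikit-learn's documentation, C is the inverse of regularization strength, so a huge value like 1e9 means "barely penalize large coefficients at all." Here is a minimal sketch comparing C=1e9 with a much smaller value. The name scarves and the choice of C=0.01 are just for illustration, and the data is our same eleven scarves without the color column.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Same scarves as above, minus the color column
scarves = pd.DataFrame({
    "length_in":   [55, 55, 55, 60, 60, 70, 70, 82, 82, 82, 82],
    "large_gauge": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    "completed":   [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})
X = scarves[["length_in", "large_gauge"]]
y = scarves["completed"]

# Large C = very weak regularization; small C = strong regularization
weak = LogisticRegression(C=1e9, solver="lbfgs", max_iter=4000).fit(X, y)
strong = LogisticRegression(C=0.01, solver="lbfgs", max_iter=4000).fit(X, y)

# Compare the coefficient each model learned for each feature
print("C=1e9 :", dict(zip(X.columns, weak.coef_[0])))
print("C=0.01:", dict(zip(X.columns, strong.coef_[0])))
```

With the small C, the large_gauge coefficient should get squeezed toward zero. With C=1e9 the classifier is mostly left alone to fit the data, which is the behavior we want when comparing against a plain logistic regression.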

Training our classifier#

When we build a classifier, we can't just tell it to start guessing answers about scarves - first we need to show it what completed and incomplete scarves look like. This is called training or fitting your model.

To train our model on what a complete or incomplete scarf looks like, we need a dataset that we know the answers to. This might seem obvious, but it's an important part! This labeled dataset is (somewhat obviously) called your training data.

We'll be using our original dataframe as our training data. To keep things simple, we're also going to get rid of the color column (don't worry, it'll come back later).

train_df = df.drop('color', axis=1).copy()
train_df.head()
length_in large_gauge completed
0 55 1 1
1 55 0 1
2 55 0 1
3 60 0 1
4 60 0 0

This dataframe can be our training data because it has features or inputs (length and whether we used large gauge needles), as well as the output label or class (whether we finished it or not).

Separating features and labels#

Before we train our classifier, we'll need to actually separate the features and the labels. Typically these are called X (the features) and y (the labels). We'll follow convention because we love to fit in!

# X will be the features, so we'll drop the 'completed' column
X = train_df.drop('completed', axis=1)
# y will be the labels, so just the 'completed' column
y = train_df.completed

Let's take a look, just to be sure. The features - the inputs that determine whether we're successful or not - are hiding inside of X. This should look like a dataframe.

X.head()
length_in large_gauge
0 55 1
1 55 0
2 55 0
3 60 0
4 60 0

While whether we completed each scarf or not is inside of y. This should look like a single column.

y.head()
0    1
1    1
2    1
3    1
4    0
Name: completed, dtype: int64

Training our model#

Now that we've split apart our features and labels, we can train our classifier. This will teach the logistic classifier about the relationship between the features and the labels, and what kind of label (completed/not completed) to connect with scarf lengths and needle gauges.

# Teach the classifier about scarves
clf.fit(X, y)
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=4000, multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Anticlimactic, yeah? Sorry!

Notice that when you fit an sklearn classifier, it doesn't give fancy, readable results like statsmodels. It might be a lot of the same math behind the scenes, but scikit-learn's logistic regression is built for classification, not for showing nice people-understandable relationships between inputs and outputs.
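scikit-learn won't print a summary table, but the fitted numbers are still in there if you go looking. A minimal sketch, assuming the same eleven scarves without the color column: coef_ and intercept_ are attributes of a fitted LogisticRegression, while scarves and summary are names made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Same scarves as above, minus the color column
scarves = pd.DataFrame({
    "length_in":   [55, 55, 55, 60, 60, 70, 70, 82, 82, 82, 82],
    "large_gauge": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    "completed":   [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})
X = scarves[["length_in", "large_gauge"]]
y = scarves["completed"]

clf = LogisticRegression(C=1e9, solver="lbfgs", max_iter=4000)
clf.fit(X, y)

# For a two-class problem coef_ has a single row, one value per feature
summary = pd.DataFrame({
    "feature": X.columns,
    "coef": clf.coef_[0],
    "odds ratio": np.exp(clf.coef_[0]),
})
print(summary)
print("intercept:", clf.intercept_[0])
```

Since this model leaves out color, the numbers won't match the statsmodels table exactly, but the story should be the same: longer scarves hurt, large-gauge needles help.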

Making predictions#

Now that the classifier has been successfully trained (no error message, right?), it should be able to make predictions about whether we'll finish a scarf or not.

Let's build a new dataset of mystery scarves. We haven't started on them, so we don't know whether we'll finish them or not.

unknown = pd.DataFrame([
    { 'length_in': 55, 'large_gauge': 1, 'color': 'grey' },
    { 'length_in': 65, 'large_gauge': 0, 'color': 'grey' },
    { 'length_in': 75, 'large_gauge': 1, 'color': 'orange' },
    { 'length_in': 80, 'large_gauge': 0, 'color': 'orange' },
    { 'length_in': 90, 'large_gauge': 1, 'color': 'brown' },
])
unknown
length_in large_gauge color
0 55 1 grey
1 65 0 grey
2 75 1 orange
3 80 0 orange
4 90 1 brown

Just like we made an X variable with our training features, I like to make an X_unknown variable for the features of our mystery dataset. Since we aren't using the scarf's color when we make the prediction, I'm going to drop the column.

X_unknown = unknown.drop('color', axis=1)
X_unknown
length_in large_gauge
0 55 1
1 65 0
2 75 1
3 80 0
4 90 1

We can then use the classifier's .predict method to determine whether we're going to finish the scarf or not.

clf.predict(X_unknown)
array([1, 1, 1, 0, 0])

That's not very pleasant-looking, so let's actually insert it into our dataframe.

unknown['predicted'] = clf.predict(X_unknown)
unknown
length_in large_gauge color predicted
0 55 1 grey 1
1 65 0 grey 1
2 75 1 orange 1
3 80 0 orange 0
4 90 1 brown 0

There we go!!!!! A prediction!!! Incredible!!!

So now we're guaranteed to finish the first three scarves, and the final two are doomed to be ignored forever? Not so fast!

Classes and probabilities#

The biggest flaw of using a classifier is falling in love with that 1 or 0 predicted value. We love the finality of it, the distinct separation into yes/no categories. If you want to use these tools successfully and ethically, though, I have a big surprise for you:

  • 1 means you're more likely to finish the scarf than NOT finish the scarf
  • 0 means you're more likely to NOT finish the scarf than to finish the scarf

While that seems kind of obvious, what's hiding inside is the fact that a scarf we have a 50.1% chance of completing gets a 1, the exact same as a scarf we have a 99% chance of completing. A scarf we have a 1% chance of completing gets marked as firmly in the 0 camp, same as one that we have a 49.99% chance of finishing!

When running a classifier, the output labels are just shorthand for probabilities (or similar, depending on the classifier). We're forcing our classifier to say yes or no, so it's going to put every data point in one bucket or the other.
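To make the "one bucket or the other" idea concrete, here is a tiny sketch of what collapsing a probability into a label looks like. This illustrates the idea only and is not scikit-learn's internal code; the function name to_label and the 0.5 cutoff are assumptions for the example.

```python
def to_label(probability_of_finishing, threshold=0.5):
    """Collapse a probability into a hard 1/0 label."""
    return 1 if probability_of_finishing > threshold else 0

# Very different probabilities can land in the same bucket
for p in [0.99, 0.501, 0.4999, 0.01]:
    print(f"{p:.2%} chance of finishing -> label {to_label(p)}")
```

The 99% scarf and the 50.1% scarf both come out as 1, and all the nuance between them is gone.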

If we feel like it's important to know about that probability, it's actually pretty easy to get (if a little awkward).

# Predict the probability of each category
clf.predict_proba(X_unknown)
array([[0.00694023, 0.99305977],
       [0.45775348, 0.54224652],
       [0.21465282, 0.78534718],
       [0.92958777, 0.07041223],
       [0.81040751, 0.18959249]])

The results are a little more complicated than a raw probability, so let's break it down. It's a list of lists: each datapoint has two numbers.

  • The first row, [0.00694023, 0.99305977], means that the first datapoint has a 0.7% chance of being in class 0 (incomplete), and a 99.3% chance of being in class 1 (completed)
  • The second row - the second datapoint - lists [0.45775348, 0.54224652], which is a 46% chance of being incomplete (0) and 54% chance of being completed (1)
  • The third datapoint gets [0.21465282, 0.78534718], which is a 21% chance of being incomplete, and 79% of being completed
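How do we know which number goes with which class? A fitted classifier keeps its labels in a classes_ attribute, and the columns of predict_proba follow that order. A minimal sketch, assuming the same eleven scarves without the color column; scarves, mystery, and column_for_finished are names made up for this example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Same scarves as above, minus the color column
scarves = pd.DataFrame({
    "length_in":   [55, 55, 55, 60, 60, 70, 70, 82, 82, 82, 82],
    "large_gauge": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    "completed":   [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})
X = scarves[["length_in", "large_gauge"]]
y = scarves["completed"]
clf = LogisticRegression(C=1e9, solver="lbfgs", max_iter=4000).fit(X, y)

# Two scarves we haven't made yet
mystery = pd.DataFrame({"length_in": [55, 65], "large_gauge": [1, 0]})

# classes_ lists the labels in the same order as predict_proba's columns
print(clf.classes_)
column_for_finished = list(clf.classes_).index(1)

probs = clf.predict_proba(mystery)
print(probs[:, column_for_finished])
```

Looking up the column this way is a little more typing than hard-coding a position, but it saves you from guessing when your labels aren't a tidy 0 and 1.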

In this case (and in most cases) we're interested in the second number - the chance of being a 1 - so we'll only grab that number for our dataframe.

# Predict the probability of each category,
# but only keep the probability for the label '1'
unknown['predict_proba'] = clf.predict_proba(X_unknown)[:,1]
unknown
length_in large_gauge color predicted predict_proba
0 55 1 grey 1 0.993060
1 65 0 grey 1 0.542247
2 75 1 orange 1 0.785347
3 80 0 orange 0 0.070412
4 90 1 brown 0 0.189592

We can see that even though the first three are likely to be completed, the 75" scarf is actually more likely to be completed than the 65" scarf! Must be those large gauge needles.
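Here is a rough sketch of the arithmetic behind that comparison. Big assumption: the three coefficients below were back-calculated from the probabilities printed above, not read from the fitted classifier, so treat them as approximate. The function name chance_of_finishing is made up for this example.

```python
import math

# ASSUMPTION: approximate values back-calculated from the probabilities
# shown above, not read from the fitted classifier
INTERCEPT = 12.085
PER_INCH = -0.18332
LARGE_GAUGE = 2.9609

def chance_of_finishing(length_in, large_gauge):
    """Logistic function: add up the log-odds, then squash into 0-1."""
    log_odds = INTERCEPT + PER_INCH * length_in + LARGE_GAUGE * large_gauge
    return 1 / (1 + math.exp(-log_odds))

for length_in, large_gauge in [(65, 0), (75, 1)]:
    p = chance_of_finishing(length_in, large_gauge)
    print(f"{length_in} inches, large gauge={large_gauge}: {p:.3f}")
```

Under these assumed numbers, ten extra inches cost roughly 1.8 in log-odds while large-gauge needles add roughly 3, which is why the longer scarf still comes out ahead.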

The reason we need to jump through these hoops is this classifier can predict more than just yes/no! Given some information about an animal, for example, we could predict whether it was a bear, a wolf, or a housecat (very useful example, of course). We'll keep it to two categories for now.

Depending on what you're doing with your machine learning, the difference between using the class or using a probability might be a very very big deal.

For example, if we wanted to make the best use of our time, we might start with the scarf we're most likely to finish and work our way down the list.

unknown.sort_values(by='predict_proba', ascending=False)
length_in large_gauge color predicted predict_proba
0 55 1 grey 1 0.993060
2 75 1 orange 1 0.785347
1 65 0 grey 1 0.542247
4 90 1 brown 0 0.189592
3 80 0 orange 0 0.070412

If we successfully finished the first three and got to the 90-inch scarf, we wouldn't throw up our arms and say we shouldn't try this one! Instead, we'd start working on it knowing that maybe we won't finish it, treating the 0 prediction as nothing more than a maybe-wrong prediction.

More practically, journalists often do this when searching through documents. Let's say we have a dump of 100,000 ultra-secret PDFs - are we going to be able to read them all? Probably not! After training a classifier about what "interesting" documents look like, we'll start down the list from the top and work our way down. Even if the classifier thinks something isn't interesting, the classification is just a suggestion. Going down the list until we find what we need (or hit our deadline) is probably a better journalistic practice.

Review#


In this section we looked at classification, which is a machine learning technique to predict yes/no answers (or other categories). You train a classifier with some known examples, then let it loose to make predictions on unknown data. While there are many kinds of classifiers, this time we used one based on logistic regression.

When making predictions, classifiers put each element into a category, only caring if it's the most likely category. This makes it seem like "yes" or "no" are definitive statements, but it might be the difference between 49.9% likely and 50.1% likely!

Outside of predictions, you can also have your classifier report back the raw probability (well, most of the time!). Sometimes it's more useful to use this number than the actual predicted class.

Discussion topics#

Should we not start on scarves our classifier thinks we won't finish?

Let's say we stop trying to make anything the classifier says we won't finish, and keep adding our finished 55-inch scarves to our dataset. Will that change anything about what we do in the future? We could experiment by adding 55-inch completed scarves to the dataset, or treat it as a thought experiment.

If we wanted to finish the maximum number of scarves with the least amount of failures, how might we order our work?

We might be missing the emotional part of finishing a project. We might fail on a long scarf, but pick up something nice and short next in order to fill the "oh you didn't finish" hole in our heart. How could I build that into the model?

When we talked about logistic regression, it was a reaaaaally big deal that we should always check our results with someone. Let's say we used a logistic classifier to investigate app reviews about sexual content: is it still important to talk to someone else? A stats person? Why or why not, and about what?