Classification snippets

Python data science coding reference from investigate.ai

Creating classifiers

Logistic Regression

A classifier based on logistic regression. We set a few options:

  • C=1e9 because, well, a huge C essentially turns off regularization
  • solver='lbfgs' because it's the default solver in newer sklearn
  • max_iter=4000 in case the solver needs to work a little harder to converge

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
clf.fit(X, y)

Decision Tree

Standard decision tree, with nothing changed from the defaults. If you'd like to graph it you might want to set a max_depth, as shown after the code.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X, y)
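If you do want to graph it, here's a minimal sketch using sklearn's plot_tree (available in newer versions of sklearn, and assuming you have matplotlib and a dataframe X):

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Cap the depth so the graph stays readable
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, y)

plt.figure(figsize=(12, 6))
plot_tree(clf, feature_names=list(X.columns), filled=True)
plt.show()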

Random Forest

By default it only uses 10 estimators, which is being increased to 100 in a future version of sklearn. Let's make the future be now! If it's exceptionally slow because you have a lot of data you can decrease that number.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)

LinearSVC

With class_weight='balanced', we give a little boost to underrepresented categories of data so the classifier is sure to pay attention to them. You can do this with almost every classifier, actually! See the sketch after the code.

from sklearn.svm import LinearSVC

clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)
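For example, a quick sketch of the same balanced boost applied to the logistic regression from earlier:

from sklearn.linear_model import LogisticRegression

# Same trick, different classifier
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000,
                         class_weight='balanced')
clf.fit(X, y)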

Multinomial Naive Bayes

Common Naive Bayes for text classification. There are other Naive Bayes classifiers, too.

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X, y)
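If your features aren't word counts, a quick sketch of two of the other flavors sklearn offers:

# GaussianNB is for continuous features, BernoulliNB for binary ones
from sklearn.naive_bayes import GaussianNB, BernoulliNB

clf = GaussianNB()
clf.fit(X, y)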

Predictions

Original data predictions

You can get the predictions for the data you trained with by passing the same X to clf.predict(). You might use this to see where your classifier disagrees with the actual class.

I like to save it back into the original dataframe for safekeeping.

df['predicted'] = clf.predict(X)

New data predictions

Make sure the columns you pick for your clf.predict are the same columns that you trained with.

X_unknown = df[['col1', 'col2', 'col3']]
df['prediction'] = clf.predict(X_unknown)

Class probability

Typically you have two classes, 0 and 1. This will give you the probability that each row is in the 1 class. Not all classifiers support this, for example LinearSVC.

You'll probably use this if you're looking for the top 500 (or whatever) most likely candidates for class 1, even if they didn't meet the threshold to actually be class 1.

If you have more than a yes/no question, things get more complicated: clf.predict_proba(X_unknown) returns one column per class, so take a look at its full output.

X_unknown = df[['col1', 'col2', 'col3']]
df['predict_proba'] = clf.predict_proba(X_unknown)[:,1]
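For example, a sketch of grabbing the 500 most likely candidates for class 1 (500 is an arbitrary cutoff):

# Sort by probability of being class 1, take the top 500
most_likely = df.sort_values('predict_proba', ascending=False).head(500)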

Class probability (kind of)

This is .predict_proba for classifiers that don't support it! Typically you have two classes, 0 and 1. This method won't give you probability, but the closer to 0 the number is the more uncertain the classifier is. From that:

  • The higher the number, the more certain it's class 1.
  • The further below zero, the more certain it's class 0.

You'll probably use this if you're looking for the top 500 (or whatever) most likely candidates for class 1, even if they didn't meet the threshold to actually be class 1.

X_unknown = df[['col1', 'col2', 'col3']]
df['predict_proba'] = clf.decision_function(X_unknown)

Evaluating classifiers

Train/test split

I always forget whether it's test_train_split or train_test_split, so I use this snippet every day of my life.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
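If you want the same split every time you run your notebook, a sketch with a fixed seed (42 is an arbitrary choice, and test_size=0.25 is just the default made explicit):

from sklearn.model_selection import train_test_split

# random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)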

Accuracy

Compute accuracy from predicted and actual values. How often was the prediction correct?

from sklearn.metrics import accuracy_score

y_true = y_test
y_pred = clf.predict(X_test)

accuracy_score(y_true, y_pred)
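If you'd like to see what accuracy_score is actually computing, it's equivalent to this one-liner (assuming y_true is a pandas Series):

# Fraction of predictions that match the actual values
(y_true == y_pred).mean()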

F1 score

Compared to the accuracy score, F1 isn't only worried about maximizing how often you're right. It balances precision and recall, so false positives and false negatives both count against you.

from sklearn.metrics import f1_score

y_true = y_test
y_pred = clf.predict(X_test)

f1_score(y_true, y_pred)

Confusion matrix

This is a nicer-looking confusion matrix than the default, with a little extra clarity about what's being predicted and what the actual values are.

Make sure your label_names are in the right order.

import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Confusion matrix (percents)

This is a nicer-looking confusion matrix than the default, with a little extra clarity about what's being predicted and what the actual values are. Displays percentages instead of raw numbers.

Make sure your label_names are in the right order. This will also work for more than two classes.

import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

Feature importance

Importance explanation

I am firmly in the camp that ELI5 is the greatest blessing to the modern world.

import eli5

# X is your dataframe of features
# you could also write the column names out manually
feature_names = list(X.columns)

eli5.show_weights(clf, feature_names=feature_names)

Scoring details

While eli5 typically just shows you feature importance, adding in show= will also give you descriptions of what the numbers mean, and possibly warnings about how to interpret them. You can also pick and choose specific fields to show, but using ALL is easier to remember.

import eli5

# X is your dataframe of features
# you could also write the column names out manually
feature_names = list(X.columns)

eli5.show_weights(clf,
                  feature_names=feature_names,
                  show=eli5.formatters.fields.ALL)

Text feature importance

When your features come from vectorized text, you should pass the vectorizer to eli5. If you don't, each feature will just show up as a number. If you pass the vectorizer, it automatically shows you what each term actually is.

import eli5

eli5.show_weights(clf, vec=vectorizer)

Graphing features

Here we're using eli5 to extract a dataframe of the top 20 and bottom 20 most important features, then graphing it. We add a little fun magic - coloring and resizing - but you can skip that if you'd like.

import eli5

# Build a dataframe of what we have above
weights = eli5.explain_weights_df(clf, vec=vectorizer, top=(20, 20))

# Sort first so the colors line up with the sorted bars
weights = weights.sort_values(by='weight', ascending=True)

# Pick colors based on being above or below zero
colors = weights.weight.apply(
    lambda weight: 'lightblue' if weight > 0 else 'tan'
)

# Plot it
weights.plot(
    y='weight',
    x='feature',
    kind='barh',
    figsize=(7,8),
    color=colors,
    legend=False
)

Explain one prediction

You use this if you aren't interested in understanding your whole model, you're interested in understanding one single prediction.

import eli5

eli5.show_prediction(clf, X.iloc[0], vec=vectorizer)