Classification snippets
Python data science coding reference from investigate.ai
Creating classifiers
Logistic Regression
A classifier based on logistic regression. We set a few options:
- C=1e9 because, well, it makes it work all the time
- solver='lbfgs' because it's the default solver in newer sklearn
- max_iter=4000 just in case your classifier needs to work a little harder
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
clf.fit(X, y)
Decision Tree
A standard decision tree, with nothing set by default. If you'd like to graph it you might want to set a max_depth.
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X, y)
Random Forest
By default older versions of sklearn only use 10 estimators; the default was bumped to 100 in sklearn 0.22. Let's make the future be now and set it explicitly! If it's exceptionally slow because you have a lot of data you can decrease that number.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)
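If it's slow, another option - not from the original snippet, but a standard sklearn parameter - is to keep all 100 trees and spread the work across your CPU cores with n_jobs.
from sklearn.ensemble import RandomForestClassifier
# n_jobs=-1 builds trees on every available core
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X, y)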
LinearSVC
With class_weight='balanced', we give a little boost to underrepresented categories of data so the classifier is sure to pay attention to them. You can do this with almost every classifier, actually!
from sklearn.svm import LinearSVC
clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)
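For example - a sketch, not part of the original snippet - the same class_weight option on logistic regression:
from sklearn.linear_model import LogisticRegression
# class_weight='balanced' works here, too
clf = LogisticRegression(class_weight='balanced', max_iter=4000)
clf.fit(X, y)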
Multinomial Naive Bayes
Common Naive Bayes for text classification. There are other Naive Bayes classifiers, too.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X, y)
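As a sketch of one of those others: GaussianNB, which suits continuous features instead of word counts (BernoulliNB is the binary-feature version).
from sklearn.naive_bayes import GaussianNB
# GaussianNB assumes continuous, roughly normal features
clf = GaussianNB()
clf.fit(X, y)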
Predictions
Original data predictions
You can get predictions for the data you trained with by passing the same features back into clf.predict(). You might use this to see where your classifier disagrees with the actual class.
I like to save it back into the original dataframe for safekeeping.
df['predicted'] = clf.predict(X)
New data predictions
Make sure the columns you pick for your clf.predict are the same columns that you trained with.
X_unknown = df[['col1', 'col2', 'col3']]
df['prediction'] = clf.predict(X_unknown)
Class probability
Typically you have two classes, 0 and 1. This will give you the probability that each row is in the 1 class. Not all classifiers support this, for example LinearSVC.
You'll probably use this if you're looking for the top 500 (or whatever) most likely candidates for class 1, even if they didn't meet the threshold to actually be class 1.
If you have more than a yes/no question, things get more complicated. Take a look at clf.predict_proba(X_unknown).
X_unknown = df[['col1', 'col2', 'col3']]
df['predict_proba'] = clf.predict_proba(X_unknown)[:,1]
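If you do have more than two classes, a sketch of one approach - keep every class's probability, labeling the columns straight from clf.classes_:
import pandas as pd
# One probability column per class, in the order sklearn stores them
probs = pd.DataFrame(clf.predict_proba(X_unknown),
                     columns=clf.classes_)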
Class probability (kind of)
This is .predict_proba for classifiers that don't support it! Typically you have two classes, 0 and 1. This method won't give you a probability, but the closer the number is to 0 the more uncertain the classifier is. From that:
- The higher the number, the more certain it's class 1.
- The further below zero, the more certain it's class 0.
You'll probably use this if you're looking for the top 500 (or whatever) most likely candidates for class 1, even if they didn't meet the threshold to actually be class 1.
X_unknown = df[['col1', 'col2', 'col3']]
df['predict_proba'] = clf.decision_function(X_unknown)
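To actually grab those top 500 candidates - a sketch using the column we just saved:
# Highest scores first, then keep the top 500
top_candidates = df.sort_values(by='predict_proba', ascending=False).head(500)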
Evaluating classifiers
Train/test split
I always forget whether it's test_train_split or train_test_split, so I use this snippet every day of my life.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
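If your classes are lopsided, you can also ask for a stratified split so the train and test sets keep the same mix - a sketch, not part of the original snippet, using standard train_test_split options:
# Hold out 25% for testing, keeping the class balance in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y)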
Accuracy
Compute accuracy from predicted and actual values. How often was the prediction correct?
from sklearn.metrics import accuracy_score
y_true = y_test
y_pred = clf.predict(X_test)
accuracy_score(y_true, y_pred)
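As a shortcut - not in the original snippet - every sklearn classifier also has a .score method, which reports accuracy by default:
# Same result as accuracy_score(y_test, clf.predict(X_test))
clf.score(X_test, y_test)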
F1 score
Compared to the accuracy score, F1 isn't only worried about being right as often as possible: it tries to balance false positives and false negatives.
from sklearn.metrics import f1_score
y_true = y_test
y_pred = clf.predict(X_test)
f1_score(y_true, y_pred)
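One caveat: by default f1_score assumes a binary 0/1 problem. With more classes you'll need to pick an average= option - this sketch uses 'macro', which treats every class equally:
# Average the per-class F1 scores, weighting every class the same
f1_score(y_true, y_pred, average='macro')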
Confusion matrix
This is a nicer-looking confusion matrix than the default, with a little extra clarity about what's being predicted and what the actual values are.
Make sure your label_names are in the right order.
import pandas as pd
from sklearn.metrics import confusion_matrix
y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
Confusion matrix (percents)
This is a nicer-looking confusion matrix than the default, with a little extra clarity about what's being predicted and what the actual values are. It displays percentages instead of raw numbers.
Make sure your label_names are in the right order. This will also work for more than two classes.
import pandas as pd
from sklearn.metrics import confusion_matrix
y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
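If you'd rather see 45.2% than 0.452, pandas can format the display for you - a sketch, not part of the original snippet, where .style.format only changes how the table prints:
percents = pd.DataFrame(matrix,
                        columns='Predicted ' + label_names,
                        index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
# Formats the display only, the underlying numbers stay fractions
percents.style.format('{:.1%}')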
Feature importance
Importance explanation
I am firmly in the camp that ELI5 is the greatest blessing to the modern world.
import eli5
# X is your list of features
# you could also write them out manually
feature_names = list(X.columns)
eli5.show_weights(clf, feature_names=feature_names)
Scoring details
While eli5 typically just shows you feature importance, adding in show= will also give you descriptions of what the numbers mean, and possibly warnings about how to interpret them. You can also pick specific things to show, but using ALL is easier to remember.
import eli5
# X is your list of features
# you could also write them out manually
feature_names = list(X.columns)
eli5.show_weights(clf,
                  feature_names=feature_names,
                  show=eli5.formatters.fields.ALL)
Text feature importance
When each feature is a piece of text, you should pass the vectorizer to eli5. If you don't, each feature will just be a number. If you pass the vectorizer, it automatically shows you what each term actually is.
import eli5
eli5.show_weights(clf, vec=vectorizer)
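In case "the vectorizer" is fuzzy: it's whatever object turned your text into features before training. A sketch with a TfidfVectorizer - df.content is a hypothetical text column here:
from sklearn.feature_extraction.text import TfidfVectorizer
# df.content is a hypothetical column of raw text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df.content)
clf.fit(X, y)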
Graphing features
Here we're using eli5 to extract a dataframe of the top 20 and bottom 20 most important features, then graphing it. We add a little fun magic - coloring and resizing - but you can skip that if you'd like.
import eli5
# Build a dataframe of what we have above
weights = eli5.explain_weights_df(clf, vec=vectorizer, top=(20, 20))
# Sort from most negative to most positive weight
weights = weights.sort_values(by='weight', ascending=True)
# Pick colors based on being above or below zero
# (after sorting, so each color lines up with its bar)
colors = weights.weight.apply(
    lambda weight: 'lightblue' if weight > 0 else 'tan'
)
# Plot it
weights.plot(
    y='weight',
    x='feature',
    kind='barh',
    figsize=(7, 8),
    color=colors,
    legend=False
)
Explain one prediction
You use this if you aren't interested in understanding your model, you're interested in understanding one single prediction.
import eli5
eli5.show_prediction(clf, X.iloc[0], vec=vectorizer)