# Classification snippets

Python data science coding reference from investigate.ai

## Creating classifiers

### Logistic Regression

A classifier based on logistic regression. We set a few options:

• `C=1e9` because a very large `C` effectively turns off regularization, so the coefficients aren't shrunk toward zero
• `solver='lbfgs'` because it's the default solver in newer sklearn
• `max_iter=4000` just in case your classifier needs to work a little harder to converge

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
clf.fit(X, y)
```

### Decision Tree

A standard decision tree with no options set, just the defaults. If you'd like to graph it, you might want to set a `max_depth` so the picture stays readable.

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X, y)
```
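
If you do want to look at the tree, capping the depth keeps it small enough to read. Here's a minimal sketch, assuming the iris toy dataset stands in for your own `X` and `y`. It uses `export_text` for a plain-text view; `plot_tree` from the same module draws a picture instead.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data standing in for your own X and y (an assumption for this sketch)
iris = load_iris()
X, y = iris.data, iris.target

# Cap the depth so the printed tree stays small
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# One line per split or leaf, indented by depth
tree_text = export_text(clf, feature_names=list(iris.feature_names))
print(tree_text)
```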

### Random Forest

Before version 0.22, sklearn only used 10 estimators by default; newer versions default to 100. Setting it explicitly means you get 100 no matter which version you're on. If it's exceptionally slow because you have a lot of data, you can decrease that number.

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)
```

### LinearSVC

With `class_weight='balanced'`, we give a little boost to underrepresented categories of data so the classifier is sure to pay attention to them. You can do this with many other classifiers, too (logistic regression, decision trees, and random forests all accept `class_weight`).

```python
from sklearn.svm import LinearSVC

clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)
```
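
If you're curious how big that boost is, sklearn can compute the `'balanced'` weights for you. This is a small sketch with made-up labels (90 rows of class `0`, 10 rows of class `1`); each weight is `n_samples / (n_classes * count_of_that_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels for illustration: 90 rows of class 0, 10 rows of class 1
y = np.array([0] * 90 + [1] * 10)

classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)

# Map each class to its weight; the rare class gets the larger one
class_weights = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weights)
```

Here class `1` ends up weighted nine times as heavily as class `0`, which is exactly the imbalance in the data.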

### Multinomial Naive Bayes

Common Naive Bayes for text classification. There are other Naive Bayes classifiers, too.

```python
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X, y)
```
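
For text, `X` needs to be word counts rather than raw strings. Here's a rough sketch of the whole round trip, assuming a `CountVectorizer` and a handful of made-up sentences (the texts and labels are invented for illustration).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up documents and labels (1 = sports, 0 = cooking)
texts = [
    "the team won the game",
    "a late goal won the match",
    "simmer the sauce and stir",
    "bake the bread until golden",
]
labels = [1, 1, 0, 0]

# Turn each document into word counts, which is what MultinomialNB expects
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

# New text must go through the same fitted vectorizer (transform, not fit_transform)
X_new = vectorizer.transform(["the team scored a goal"])
prediction = clf.predict(X_new)
```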

## Predictions

### Original data predictions

You can get the predictions for the data you trained with by passing that same `X` back to `clf.predict()`. You might use this to see where your classifier disagrees with the actual class.

I like to save it back into the original dataframe for safekeeping.

```python
df['predicted'] = clf.predict(X)
```

### New data predictions

Make sure the columns you pick for your `clf.predict` are the same columns that you trained with.

```python
X_unknown = df[['col1', 'col2', 'col3']]
df['prediction'] = clf.predict(X_unknown)
```

### Class probability

Typically you have two classes, `0` and `1`. This will give you the probability that each row is in the `1` class. Not all classifiers support this; `LinearSVC`, for example, does not.

You'll probably use this if you're looking for the top 500 (or whatever) most likely candidates for class `1`, even if they didn't meet the threshold to actually be class `1`.

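If you have more than two classes, things get more complicated: `clf.predict_proba(X_unknown)` returns one column per class, in the order of `clf.classes_`, so take a look at the full output instead of just column `1`.

```python
X_unknown = df[['col1', 'col2', 'col3']]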
df['predict_proba'] = clf.predict_proba(X_unknown)[:,1]
```
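
For the more-than-two-classes case, one way to keep things readable is to label the columns with `clf.classes_`. This is a sketch, assuming the three-class iris dataset stands in for your data; `top_class` and `top_proba` are column names made up for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Three-class toy data standing in for your own X and y
iris = load_iris()
X, y = iris.data, iris.target

clf = LogisticRegression(max_iter=4000)
clf.fit(X, y)

# One column per class, in the same order as clf.classes_
class_cols = list(clf.classes_)
proba = pd.DataFrame(clf.predict_proba(X), columns=class_cols)

# Most likely class for each row, plus how confident the classifier was
proba['top_class'] = proba[class_cols].idxmax(axis=1)
proba['top_proba'] = proba[class_cols].max(axis=1)
```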

### Class probability (kind of)

This is `.predict_proba` for classifiers that don't support it! Typically you have two classes, `0` and `1`. This method won't give you probability, but the closer to `0` the number is the more uncertain the classifier is. From that:

• The higher the number, the more certain it's class `1`.
• The further below zero, the more certain it's class `0`.

You'll probably use this if you're looking for the top 500 (or whatever) most likely candidates for class `1`, even if they didn't meet the threshold to actually be class `1`.

```python
X_unknown = df[['col1', 'col2', 'col3']]
# Named for what it is: a score, not a probability
df['decision_function'] = clf.decision_function(X_unknown)
```
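
Pulling out those top candidates is just a sort on the score. Here's a minimal sketch, assuming synthetic data from `make_classification` in place of your dataframe and a top 10 instead of a top 500; the `score` column name is made up for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data standing in for your own dataframe
X, y = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
feature_cols = ['col1', 'col2', 'col3']
df = pd.DataFrame(X, columns=feature_cols)

clf = LinearSVC(class_weight='balanced', max_iter=10000)
clf.fit(df[feature_cols], y)

# Positive scores lean toward class 1, negative scores toward class 0
df['score'] = clf.decision_function(df[feature_cols])

# The 10 rows the classifier is most confident belong to class 1
top_candidates = df.nlargest(10, 'score')
```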

## Evaluating classifiers

### Train/test split

I always forget whether it's `test_train_split` or `train_test_split`, so I use this snippet every day of my life.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
```
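
With no options, `train_test_split` holds out 25% of the rows and shuffles differently on every run. If you want more control, here's a sketch of the options I'd reach for first, using made-up data (100 rows, 20 of them in class `1`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 20 of them in class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% instead of the default 25%
    random_state=42,   # same split every time you run it
    stratify=y,        # keep the class balance the same in both pieces
)
```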

### Accuracy

Compute accuracy from predicted and actual values. How often was the prediction correct?

```python
from sklearn.metrics import accuracy_score

y_true = y_test
y_pred = clf.predict(X_test)

accuracy_score(y_true, y_pred)
```

### F1 score

Compared to the accuracy score, F1 isn't only concerned with how often the prediction was right. It's the harmonic mean of precision and recall, so it balances false positives against false negatives.

```python
from sklearn.metrics import f1_score

y_true = y_test
y_pred = clf.predict(X_test)

f1_score(y_true, y_pred)
```
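
The difference shows up most on lopsided data. Here's a tiny worked example with made-up labels: a classifier that finds only 1 of 10 positives still looks great on accuracy, but not on F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 90 negatives, 10 positives
y_true = [0] * 90 + [1] * 10

# A lazy classifier that finds only 1 of the 10 positives
y_pred = [0] * 90 + [1] + [0] * 9

# 91 of 100 rows are correct, so accuracy is 0.91
accuracy = accuracy_score(y_true, y_pred)

# Precision is 1/1 and recall is 1/10, so F1 = 2 * (1 * 0.1) / (1 + 0.1), about 0.18
f1 = f1_score(y_true, y_pred)
```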

### Confusion matrix

This is a nicer-looking confusion matrix than the default, with a little extra clarity about what's being predicted and what are the actual values.

Make sure your `label_names` are in the same order as the sorted class labels (`0` first, then `1`).

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
```

### Confusion matrix (percents)

This is a nicer-looking confusion matrix than the default, with a little extra clarity about what's being predicted and what are the actual values. Instead of raw counts, each cell shows the share of its row (the actual class), so every row adds up to 1. Multiply by 100 if you want true percentages.

Make sure your `label_names` are in the same order as the sorted class labels (`0` first, then `1`). This will also work for more than two classes.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
```
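
If you're on sklearn 0.22 or newer, `confusion_matrix` can do the row-wise division itself with `normalize='true'`. Here's a small sketch with made-up labels so the numbers are easy to check by hand.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels: four actual negatives, two actual positives
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# normalize='true' divides each row by the number of rows actually in that class
matrix = confusion_matrix(y_true, y_pred, normalize='true')

label_names = pd.Series(['negative', 'positive'])
result = pd.DataFrame(matrix,
                      columns='Predicted ' + label_names,
                      index='Is ' + label_names)
```

Three of the four actual negatives were predicted negative, so the top-left cell is 0.75.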

## Feature importance

### Importance explanation

I am firmly in the camp that ELI5 is the greatest blessing to the modern world.

```python
import eli5

# X is the dataframe of features you trained with
# you could also write the names out manually
feature_names = list(X.columns)

eli5.show_weights(clf, feature_names=feature_names)
```
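
If you can't install `eli5`, you can get at similar numbers by hand. For a linear classifier the weights live in `clf.coef_` (tree-based models have `clf.feature_importances_` instead). This is a rough sketch without `eli5`, assuming the breast cancer toy dataset and scaled features so the weights are comparable; it isn't meant to reproduce `eli5`'s output exactly.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data standing in for your own X and y
data = load_breast_cancer()
feature_names = list(data.feature_names)

# Scale the features so the coefficients can be compared to each other
X = pd.DataFrame(StandardScaler().fit_transform(data.data), columns=feature_names)
y = data.target

clf = LogisticRegression(max_iter=4000)
clf.fit(X, y)

# For a two-class problem coef_ has a single row: one weight per feature
weights = pd.DataFrame({
    'feature': feature_names,
    'weight': clf.coef_[0],
}).sort_values(by='weight', ascending=False)
```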

### Scoring details

While `eli5` typically just shows you feature importance, adding in `show=` will also give you descriptions of what the numbers mean, and possibly warnings about how to interpret them. You can also pick specific fields to show, but using `ALL` is easier to remember.

```python
import eli5

# X is the dataframe of features you trained with
# you could also write the names out manually
feature_names = list(X.columns)

eli5.show_weights(clf,
                  feature_names=feature_names,
                  show=eli5.formatters.fields.ALL)
```

### Text feature importance

When each feature is a piece of text, you should pass the vectorizer to `eli5`. If you don't, each feature will just be a number. If you pass the vectorizer, it automatically shows you what each term actually is.

```python
import eli5

eli5.show_weights(clf, vec=vectorizer)
```

### Graphing features

Here we're using `eli5` to extract a dataframe of the top 20 and bottom 20 most important features, then graphing it. We add a little fun magic - coloring and resizing - but you can skip that if you'd like.

```python
import eli5

# Build a dataframe of what we have above
weights = eli5.explain_weights_df(clf, vec=vectorizer, top=(20, 20))

# Sort first so the colors line up with the bars
weights = weights.sort_values(by='weight', ascending=True)

# Pick colors based on being above or below zero
colors = weights.weight.apply(
    lambda weight: 'lightblue' if weight > 0 else 'tan'
)

# Plot it
weights.plot(
    y='weight',
    x='feature',
    kind='barh',
    figsize=(7, 8),
    color=colors,
    legend=False
)
```

### Explain one prediction

You use this if you aren't interested in understanding your whole model, you're interested in understanding one single prediction.

```python
import eli5

# Pick one row to explain; 0 is the first row, swap in whichever you need
eli5.show_prediction(clf, X.iloc[0], vec=vectorizer)
```