# Classification snippets

Python data science coding reference from **investigate.ai**

## Creating classifiers

### Logistic Regression

A classifier based on logistic regression. We set a few options:

- `C=1e9` because, well, it makes it work all the time (a huge `C` effectively turns off regularization)
- `solver='lbfgs'` because it's the default solver in newer sklearn
- `max_iter=4000` just in case your classifier needs to work a little harder

```
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
clf.fit(X, y)
```

### Decision Tree

A standard decision tree with default settings. If you'd like to graph it, you might want to set a `max_depth`.

```
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X, y)
```

### Random Forest

By default it only uses 10 estimators, which is being increased to 100 in a future version of sklearn. Let's make the future be now! If it's exceptionally slow because you have a lot of data you can decrease that number.

```
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)
```

### LinearSVC

With `class_weight='balanced'`, we give a little boost to underrepresented categories of data so the classifier is sure to pay attention to them. You can do this with almost every classifier, actually!

```
from sklearn.svm import LinearSVC
clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)
```
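Since `class_weight='balanced'` works on most classifiers, here's a minimal sketch of it on `LogisticRegression` instead. The tiny imbalanced dataset is invented, purely for illustration:

```
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical imbalanced data: eight 0s, only two 1s
X = np.array([[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# 'balanced' weights each class inversely to its frequency,
# so the rare 1s count as much as the common 0s
clf = LogisticRegression(class_weight='balanced')
clf.fit(X, y)
```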

### Multinomial Naive Bayes

Common Naive Bayes for text classification. There are other Naive Bayes classifiers, too.

```
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X, y)
```
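To show how this fits with text, here's a sketch pairing it with a `CountVectorizer` in a pipeline. The example texts and labels are made up:

```
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training texts, where 1 = positive and 0 = negative
texts = ["a good movie", "a bad movie", "a great film", "a terrible film"]
labels = [1, 0, 1, 0]

# The vectorizer turns text into word counts, which is what
# MultinomialNB expects as features
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(texts, labels)

preds = pipe.predict(["a good film"])
```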

## Predictions

### Original data predictions

You can get the predictions for the data you trained with by passing the same `X` to `clf.predict()`. You might use this to see where your classifier disagrees with the actual class.

I like to save it back into the original dataframe for safekeeping.

```
df['predicted'] = clf.predict(X)
```

### New data predictions

Make sure the columns you pick for `clf.predict` are the same columns that you trained with.

```
X_unknown = df[['col1', 'col2', 'col3']]
df['prediction'] = clf.predict(X_unknown)
```

### Class probability

Typically you have two classes, `0` and `1`. This will give you the probability that each row is in the `1` class. **Not all classifiers support this,** for example `LinearSVC`.

You'll probably use this if you're looking for the top 500 (or whatever) most likely candidates for class `1`, even if they didn't meet the threshold to actually be class `1`.

If you have more than a yes/no question, things get more complicated. Take a look at the full output of `clf.predict_proba(X_unknown)`.

```
X_unknown = df[['col1', 'col2', 'col3']]
df['predict_proba'] = clf.predict_proba(X_unknown)[:,1]
```
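For the multi-class case, here's a sketch of what the full `predict_proba` output looks like: one row per sample, one column per class, in the order of `clf.classes_`. The toy data below is invented:

```
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data with three well-separated classes
X = np.array([[0], [1], [2], [10], [11], [12], [20], [21], [22]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Shape is (n_samples, n_classes) and each row sums to 1
probs = clf.predict_proba(X)
```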

### Class probability (kind of)

This is `.predict_proba`

for classifiers that don't support it! Typically you have two classes, `0`

and `1`

. This method **won't give you probability**, but the closer to `0`

the number is the more uncertain the classifier is. From that:

- The higher the number, the more certain it's class
`1`

. - The further below zero, the more certain it's class
`0`

.

You'll probably you use this if you're looking for the top 500 (or whatever) most likely candidates for class `1`

, even if they didn't meet the threshold to actually be class `1`

.

```
X_unknown = df[['col1', 'col2', 'col3']]
df['predict_proba'] = clf.decision_function(X_unknown)
```

## Evaluating classifiers

### Train/test split

I always forget whether it's `test_train_split` or `train_test_split`, so I use this snippet every day of my life.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
```
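If you want more control, here's a sketch with a few common options. The sizes and seed are arbitrary choices, not requirements:

```
import numpy as np
from sklearn.model_selection import train_test_split

# Tiny invented dataset: 10 rows, balanced classes
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # hold out 30% for testing
    random_state=42,   # reproducible split
    stratify=y         # keep the class balance in both halves
)
```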

### Accuracy

Compute accuracy from predicted and actual values. How often was the prediction correct?

```
from sklearn.metrics import accuracy_score
y_true = y_test
y_pred = clf.predict(X_test)
accuracy_score(y_true, y_pred)
```

### F1 score

Compared to the accuracy score, F1 is not *only* worried about maximizing being right. It tries to balance false positives and false negatives.

```
from sklearn.metrics import f1_score
y_true = y_test
y_pred = clf.predict(X_test)
f1_score(y_true, y_pred)
```
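With more than two classes, `f1_score` needs an `average=` argument. A minimal sketch with invented labels:

```
from sklearn.metrics import f1_score

# Invented true and predicted labels for three classes
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# 'macro' averages the per-class F1 scores equally;
# 'weighted' weights them by class size instead
score = f1_score(y_true, y_pred, average='macro')
```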

### Confusion matrix

This is a nicer-looking confusion matrix than the default, with a little extra clarity about what's being predicted and what the actual values are.

Make sure your `label_names` are in the right order.

```
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names)
```

### Confusion matrix (percents)

This is a nicer-looking confusion matrix than the default, with a little extra clarity about what's being predicted and what the actual values are. It displays percentages instead of raw numbers.

Make sure your `label_names` are in the right order. This will also work for more than two classes.

```
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
             columns='Predicted ' + label_names,
             index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
```

## Feature importance

### Importance explanation

I am firmly in the camp that ELI5 is the greatest blessing to the modern world.

```
import eli5
# X is your list of features
# you could also write them out manually
feature_names = list(X.columns)
eli5.show_weights(clf, feature_names=feature_names)
```

### Scoring details

While `eli5` typically just shows you feature importance, adding `show=eli5.formatters.fields.ALL` will also give you descriptions of what the numbers mean, and possibly warnings about how to interpret them. You could also pick specific things to show, but using `ALL` is easier to remember.

```
import eli5
# X is your list of features
# you could also write them out manually
feature_names = list(X.columns)
eli5.show_weights(clf,
                  feature_names=feature_names,
                  show=eli5.formatters.fields.ALL)
```

### Text feature importance

When each feature is a piece of text, you should pass the vectorizer to `eli5`. If you don't, each feature will just be a number. If you pass the vectorizer, it automatically shows you what each term actually is.

```
import eli5
eli5.show_weights(clf, vec=vectorizer)
```

### Graphing features

Here we're using `eli5` to extract a dataframe of the top 20 and bottom 20 most important features, then graphing it. We add a little fun magic - coloring and resizing - but you can skip that if you'd like.

```
import eli5

# Build a dataframe of what we have above, sorted for plotting
weights = eli5.explain_weights_df(
    clf, vec=vectorizer, top=(20, 20)
).sort_values(by='weight', ascending=True)

# Pick colors based on being above or below zero
# (computed after sorting so the colors line up with the bars)
colors = weights.weight.apply(
    lambda weight: 'lightblue' if weight > 0 else 'tan'
)

# Plot it
weights.plot(
    y='weight',
    x='feature',
    kind='barh',
    figsize=(7, 8),
    color=colors,
    legend=False
)
```

### Explain one prediction

You use this if you aren't interested in understanding your model, you're interested in understanding *one single prediction*.

```
import eli5
eli5.show_prediction(clf, X.iloc[0], vec=vectorizer)
```