Using classification algorithms with text#

It's easy to understand how a classifier might figure things out with numbers: "these numbers are close, so these topics must be related." But how does that go down when we're talking about language?

The foundations#

When you work on a classifier with text, the very first thing you need to do is turn those words into numbers. No matter what you use - Python's Counter, scikit-learn's CountVectorizer, or even something like spaCy, NLTK, or Gensim - it all ends up more or less the same in the end.
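
For a taste of what that looks like, here's a tiny sketch using scikit-learn's CountVectorizer on a couple of made-up sentences (the sentences and variable names are just for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences to convert into numbers
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(texts)

# One row per sentence, one column per word, counts in the cells
print(vectorizer.get_feature_names())
print(matrix.toarray())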

Let's build a dataset#

We have a lot of walkthroughs of real, published classifiers that use text, so we'll take care of this one nice and quick. Let's say we have a dataset of recipe ingredients.

import pandas as pd
pd.set_option("display.max_colwidth", 150)

df = pd.read_csv('data/recipes-indian.csv')
df.head()
cuisine id ingredient_list is_indian
0 indian 23348 minced ginger, garlic, oil, coriander powder, chickpeas, onions, chopped tomatoes, salt, lemon juice, fenugreek leaves, chili powder, cumin seed, ... 1
1 indian 18869 chicken, chicken breasts 1
2 indian 36405 flour, rose essence, frying oil, powdered milk, ghee, sugar, baking powder 1
3 indian 11494 soda, ghee, sugar, khoa, maida flour, milk, oil 1
4 indian 32675 tumeric, garam masala, salt, chicken, curry leaves, water, ginger, cinnamon sticks, fresh spinach, crushed red pepper flakes, cumin seed, tomatoes... 1
df.cuisine.value_counts()
indian          3000
italian          703
mexican          497
southern_us      325
chinese          228
french           211
thai             132
cajun_creole     111
japanese         105
greek             95
spanish           86
british           72
vietnamese        68
moroccan          63
filipino          61
korean            60
irish             60
jamaican          44
russian           40
brazilian         39
Name: cuisine, dtype: int64
(df.cuisine == 'indian').value_counts()
True     3000
False    3000
Name: cuisine, dtype: int64

First things first: convert everything to numbers.

Convert our target to numbers#

Right now our cuisine column is the name of the cuisine - a string! Let's fix it up into a number real quick. We'll use the "make it True or False and then turn that into a number" trick to make it happen.

df['is_indian'] = (df.cuisine == "indian").astype(int)
df.head()
cuisine id ingredient_list is_indian
0 indian 23348 minced ginger, garlic, oil, coriander powder, chickpeas, onions, chopped tomatoes, salt, lemon juice, fenugreek leaves, chili powder, cumin seed, ... 1
1 indian 18869 chicken, chicken breasts 1
2 indian 36405 flour, rose essence, frying oil, powdered milk, ghee, sugar, baking powder 1
3 indian 11494 soda, ghee, sugar, khoa, maida flour, milk, oil 1
4 indian 32675 tumeric, garam masala, salt, chicken, curry leaves, water, ginger, cinnamon sticks, fresh spinach, crushed red pepper flakes, cumin seed, tomatoes... 1
df.is_indian.value_counts()
1    3000
0    3000
Name: is_indian, dtype: int64

Convert our words to numbers#

Now we'll need to fix up our words. We've hopefully been through counting words already, and in this situation we're specifically using a TF-IDF vectorizer to de-emphasize more common words.

I'm stealing this code from the reference page on vectorization.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(df.ingredient_list)

# Note: in scikit-learn 1.0+ this is vectorizer.get_feature_names_out()
words_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
words_df.head()
10 14 95 abalone abura acai achiote acid ackee acorn ... yolks yoplait york yucca yukon zest zesty zinfandel ziti zucchini
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1813 columns

Build our classifier#

Setting up our data#

In order to teach our classifier what an Indian recipe is and what it isn't, we need two variables:

  • X, the features (the ingredients)
  • y, the target labels (whether each recipe is Indian or not)
X = words_df
y = df.is_indian

We can look at them if we want!

# Our features
X.head(2)
10 14 95 abalone abura acai achiote acid ackee acorn ... yolks yoplait york yucca yukon zest zesty zinfandel ziti zucchini
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

2 rows × 1813 columns

# Our labels
y.head(2)
0    1
1    1
Name: is_indian, dtype: int64

Now all we need to do is make a classifier and feed it our data.

Training our classifier#

We have a million and one classifiers at our disposal, but we'll keep things simple and use a logistic regression classifier for now. Honestly, we're using it because it's just so crisp and nice to explain. Whether it's the best classifier for the job isn't something we're addressing right now.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Making predictions#

Let's build a new dataframe with a few recipes and see what the algorithm thinks of them.

unknown = pd.DataFrame({
    'content': [
        'microwaveable chicken breast, lettuce, taco seasoning, salsa, tortilla chips',
        'onions, besan, green chilies, cumin seeds, turmeric, coriander powder, salt, oil for deep-frying',
        'spinach, olive oil, vinegar, carrots, cucumbers, cilantro'
    ]
})
unknown
content
0 microwaveable chicken breast, lettuce, taco seasoning, salsa, tortilla chips
1 onions, besan, green chilies, cumin seeds, turmeric, coriander powder, salt, oil for deep-frying
2 spinach, olive oil, vinegar, carrots, cucumbers, cilantro

In order to make a prediction, we need to transform these strings into numbers. Last time we used vectorizer.fit_transform, but we don't do that this time! When you do .fit_transform, it does two things:

  • fit learns all the words
  • transform turns the strings into numbers

We already know all the words we need to know. Any new words won't add anything, because we won't know whether they're related to Indian cuisine or not. Instead of using .fit_transform we'll just use .transform!

# Re-use our vectorizer, making sure to use .transform and NOT .fit_transform
# We're also skipping turning this into a words_df - we just did that last time
# so we could look at it with our poor human eyes
matrix = vectorizer.transform(unknown.content)

unknown['prediction'] = clf.predict(matrix)

What's it think?

unknown
content prediction
0 microwaveable chicken breast, lettuce, taco seasoning, salsa, tortilla chips 0
1 onions, besan, green chilies, cumin seeds, turmeric, coriander powder, salt, oil for deep-frying 1
2 spinach, olive oil, vinegar, carrots, cucumbers, cilantro 0
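
The .predict method only gives us a hard 0 or 1. If you'd rather see how confident the classifier is, logistic regression can also give probabilities through .predict_proba (the prob_indian column name is just our own choice):

# Each row of predict_proba is [probability of 0, probability of 1]
unknown['prob_indian'] = clf.predict_proba(matrix)[:, 1]
unknown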

Check the evaluating classifiers section for more details on how to know if our classifier did a good job or not.
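
If you'd like a quick preview of what that looks like, the usual approach is to hold out part of the dataset during training and score on it. A minimal sketch - the 80/20 split, random_state=42 and the test_clf name are illustrative choices, not part of this walkthrough:

from sklearn.model_selection import train_test_split

# Hold back 20% of the recipes so the classifier never sees them
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use a separate classifier so we don't overwrite the one trained above
test_clf = LogisticRegression()
test_clf.fit(X_train, y_train)

# Fraction of held-out recipes classified correctly
test_clf.score(X_test, y_test)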

Explaining our classifier#

It's one thing to make predictions and judge how well our classifier did, but the really important thing is figuring out how to explain what it was up to. We'll use the eli5 library to do this, because it's perfect: you just send it the classifier and your vectorizer and voilà, you have your top terms.

import eli5

eli5.show_weights(clf, vec=vectorizer)

y=1 top features

Weight? Feature
+8.038 curry
+5.363 cardamom
+5.349 masala
+4.307 yogurt
+4.239 ginger
+4.026 seeds
+3.961 garam
+3.735 cumin
+3.478 yoghurt
+3.301 turmeric
+3.225 ghee
+2.952 powder
+2.857 basmati
+2.728 coriander
+2.615 coconut
… 530 more positive …
… 1264 more negative …
-2.588 fish
-2.766 cheese
-2.889 eggs
-2.964 oregano
-3.830 sauce

According to the classifier, "curry" is the top signifier of a recipe being Indian (notice the y=1 up top?). eggs, oregano and sauce are the major signals that it is not Indian.
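
If you're curious, eli5 isn't doing anything magical here: for a logistic regression, those weights are just the classifier's coefficients paired up with the vectorizer's vocabulary. A rough hand-rolled equivalent (the weights name is ours):

# clf.coef_[0] holds one coefficient per word in the vocabulary
weights = pd.Series(clf.coef_[0], index=vectorizer.get_feature_names())

# The biggest positive coefficients push hardest toward "Indian"
weights.sort_values(ascending=False).head(15)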


Explaining individual predictions#

We can also explain individual predictions!

# Explain the first one
eli5.show_prediction(clf, unknown.content[0], vec=vectorizer)

y=0 (probability 0.967, score -3.377) top features

Contribution? Feature
+1.594 <BIAS>
+0.482 salsa
+0.417 lettuce
+0.407 seasoning
+0.197 tortilla
+0.177 taco
+0.174 chips
+0.011 chicken
-0.082 breast

Be sure to read up top and notice it's predicting y=0. That's why salsa's contribution is positive at +0.482 - it's pressure toward this not being Indian food. Compare that with the second prediction:

# Second one
eli5.show_prediction(clf, unknown.content[1], vec=vectorizer)

y=1 (probability 0.893, score 2.119) top features

Contribution? Feature
+0.710 seeds
+0.633 turmeric
+0.526 cumin
+0.448 coriander
+0.413 powder
+0.302 chilies
+0.217 salt
+0.198 oil
+0.166 onions
+0.077 frying
+0.052 for
+0.017 besan
-0.044 green
-1.594 <BIAS>

It's talking about y=1 up top, so seeds and turmeric are both pushing strongly for it being Indian food. For the third one we're back at y=0, so the positive contributions are pushing toward it not being Indian.

# Third one
eli5.show_prediction(clf, unknown.content[2], vec=vectorizer)

y=0 (probability 0.671, score -0.712) top features

Contribution? Feature
+1.594 <BIAS>
+0.362 olive
+0.159 vinegar
+0.007 cucumbers
-0.064 carrots
-0.284 oil
-0.521 cilantro
-0.540 spinach

Review#

In this section we learned how to combine word counts with a classifier: we converted our text into numbers with a TF-IDF vectorizer, trained a logistic regression on the result, and used eli5 to explain both the classifier's overall weights and its individual predictions.

Discussion topics#

In this situation we used a TfidfVectorizer to vectorize our text. It's great at stressing less-often-used words. What if we were trying to determine whether a recipe was vegetarian or not? What argument could be made that instead of TF-IDF, we should just use a 0/1 for whether each ingredient was included?
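
If you want to experiment with that approach, scikit-learn's CountVectorizer takes a binary=True option that records whether a word appears at all instead of how many times (the binary_vectorizer name is just ours):

from sklearn.feature_extraction.text import CountVectorizer

# binary=True turns each ingredient into a simple yes/no flag
binary_vectorizer = CountVectorizer(binary=True)
binary_matrix = binary_vectorizer.fit_transform(df.ingredient_list)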

Your vectorizer can also filter out words based on how often they occur. For example, "only words that show up in less than 25% of recipes" or "words that must show up in at least 5 recipes." What could be the benefits of doing this?
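
If you want to try it, those filters map to the max_df and min_df arguments on scikit-learn's vectorizers. A sketch matching the two examples above:

# max_df=0.25: ignore words that show up in more than 25% of recipes
# min_df=5: ignore words that show up in fewer than 5 recipes
filtered_vectorizer = TfidfVectorizer(max_df=0.25, min_df=5)
filtered_matrix = filtered_vectorizer.fit_transform(df.ingredient_list)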