Using classification algorithms with text#
It's easy to understand how a classifier might figure things out with numbers: "these numbers are close, so these topics must be related." But how does that go down when we're talking about language?
The foundations#
When you work on a classifier with text, the very first thing you need to do is turn those words into numbers. No matter what you use - Python's Counter, scikit-learn's CountVectorizer, or even something like spaCy, NLTK, or Gensim - it all ends up more or less the same in the end.
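To make that concrete, here's a tiny sketch using scikit-learn's CountVectorizer on a couple of made-up ingredient lists (the texts and variable names here are just for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up "recipes" - each row becomes a recipe,
# each column becomes a word, each cell counts that word
demo_texts = ['salt pepper salt', 'pepper cumin']
demo_vectorizer = CountVectorizer()
demo_matrix = demo_vectorizer.fit_transform(demo_texts)

print(demo_vectorizer.get_feature_names_out())
print(demo_matrix.toarray())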
Let's build a dataset#
We have a lot of walkthroughs of real, published classifiers that use text, so we'll take care of this nice and quickly. Let's say we have a dataset of recipe ingredients.
import pandas as pd
pd.set_option("display.max_colwidth", 150)
df = pd.read_csv('data/recipes-indian.csv')
df.head()
df.cuisine.value_counts()
(df.cuisine == 'indian').value_counts()
First things first: convert everything to numbers.
Convert our target to numbers#
Right now our cuisine column is the name of the cuisine - a string! Let's fix it up into a number real quick. We'll use the "make it True or False and then turn that into a number" trick to make it happen.
df['is_indian'] = (df.cuisine == "indian").astype(int)
df.head()
df.is_indian.value_counts()
Convert our words to numbers#
Now we'll need to fix up our words. We've hopefully been through counting words, and in this situation we're specifically using a TF-IDF vectorizer to de-emphasize more common words.
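Here's a quick toy example of that de-emphasizing, using made-up ingredient lists rather than our real data: "salt" shows up in every recipe, so TF-IDF scores it lower than rarer words.

from sklearn.feature_extraction.text import TfidfVectorizer

# 'salt' is in every recipe, so it gets a lower TF-IDF score
# than rarer words like 'turmeric', 'pepper' or 'cumin'
demo = ['salt turmeric', 'salt pepper', 'salt cumin']
demo_vec = TfidfVectorizer()
print(demo_vec.fit_transform(demo).toarray().round(2))
print(demo_vec.get_feature_names_out())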
I'm stealing this code from the reference page for vectorization.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(df.ingredient_list)
words_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()
X = words_df
y = df.is_indian
We can look at them if we want!
# Our features
X.head(2)
# Our labels
y.head(2)
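If you want to double-check that everything lines up, each row of X should pair with exactly one label in y.

# One row per recipe, one column per word in our vocabulary
print(X.shape)
print(y.shape)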
Now all we need to do is make a classifier and feed it our data.
Training our classifier#
We have a million and one classifiers at our disposal, but we'll keep things simple and use a logistic regression classifier for now. Honestly, we're using it because when we try to explain it, it's just so crisp and nice. Whether it's the best classifier is not something we're addressing right now.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
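If you'd like a quick gut check, you can ask the classifier to score itself on the data it was trained on. Take the number with a grain of salt - it's over-optimistic, since the classifier has already seen every one of these recipes.

# Accuracy on the training data (over-optimistic!)
clf.score(X, y)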
Making predictions#
Let's build a new dataframe with a few recipes and see what the algorithm thinks of them.
unknown = pd.DataFrame({
'content': [
'microwaveable chicken breast, lettuce, taco seasoning, salsa, tortilla chips',
'onions, besan, green chilies, cumin seeds, turmeric, coriander powder, salt, oil for deep-frying',
'spinach, olive oil, vinegar, carrots, cucumbers, cilantro'
]
})
unknown
In order to make a prediction, we need to transform these strings into numbers. Last time we used vectorizer.fit_transform, but we don't do that this time! When you do .fit_transform, it does two things:

- fit learns all the words
- transform turns the strings into numbers

We already know all the words we need to know. Any new words won't add anything, because we won't know whether they're related to Indian cuisine or not. Instead of using .fit_transform we'll just use .transform!
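If you want proof that new words really are ignored, try transforming a made-up ingredient list - "xyzzy" is a word we invented, and the vectorizer just skips it. The output has exactly the same columns as our training matrix.

# 'xyzzy' never appeared in our training data, so it's silently dropped
vectorizer.transform(['cumin xyzzy turmeric']).shape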
# Re-use our vectorizer, making sure to use .transform and NOT .fit_transform
# We're also skipping turning this into a words_df - we just did that last time
# so we could look at it with our poor human eyes
matrix = vectorizer.transform(unknown.content)
unknown['prediction'] = clf.predict(matrix)
What's it think?
unknown
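.predict gives you a hard yes-or-no answer. Since we're using a logistic regression, we can also ask for probabilities to see how confident each prediction is.

# Each row is [probability not Indian, probability Indian]
clf.predict_proba(matrix)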
Check the evaluating classifiers section for more details on how to know if our classifier did a good job or not.
Explaining our classifier#
It's one thing to make predictions or to judge how well the classifier did, but the really important thing is to figure out how to explain what it was up to. We'll use the eli5 library to do this, because it's perfect. You just send it the classifier and your vectorizer and voilà, you have your top terms.
import eli5
eli5.show_weights(clf, vec=vectorizer)
According to the classifier, "curry" is the top signifier of a recipe being Indian (notice the y=1 up top?). eggs, oregano and sauce are the major signifiers that it is not Indian.
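Under the hood, eli5 is just reading the classifier's coefficients. If you ever want those numbers by hand - say, to save them to a spreadsheet - here's one way to pull them out yourself. This is a sketch of the same idea, not something eli5 requires:

# Pair each word with its coefficient: positive pushes toward
# Indian (y=1), negative pushes away from it
coefs = pd.DataFrame({
    'word': vectorizer.get_feature_names_out(),
    'coefficient': clf.coef_[0]
})
coefs.sort_values('coefficient', ascending=False).head(10)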
Explaining individual predictions#
We can also explain individual predictions!
# Explain the first one
eli5.show_prediction(clf, unknown.content[0], vec=vectorizer)
Be sure to read up top and notice it's predicting y=0. That's why salsa is +1.594 - it's pushing strongly toward this not being Indian food. Compare that with the second prediction:
# Second one
eli5.show_prediction(clf, unknown.content[1], vec=vectorizer)
It's talking about y=1 up top, so seeds and turmeric are both pushing strongly for it being Indian food. For the third one, we're back at y=0, so the top terms are telling you it is not Indian.
# Third one
eli5.show_prediction(clf, unknown.content[2], vec=vectorizer)
Review#
In this section we learned how to combine word counts with a classifier. We turned our text into numbers with a TF-IDF vectorizer, trained a logistic regression on the result, used it to predict whether new recipes were Indian, and explained the classifier's decisions with the eli5 library.
Discussion topics#
In this situation we used a TfidfVectorizer to vectorize our text. They're great at stressing less-often-used words. What if we were trying to determine whether a recipe was vegetarian or not? What argument could be made that instead of TF-IDF, we should just use a 0/1 as to whether each ingredient was included?
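If you'd like to experiment with that idea, CountVectorizer has a binary=True option that records only presence or absence. A quick sketch, reusing our ingredient_list column:

from sklearn.feature_extraction.text import CountVectorizer

# binary=True marks each word as present (1) or absent (0)
# instead of counting or weighting it
binary_vec = CountVectorizer(binary=True)
binary_matrix = binary_vec.fit_transform(df.ingredient_list)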
Your vectorizer can filter out words based on how often they occur. For example, "only words that show up in less than 25% of recipes" or "must show up in at least 5 recipes." What could be the benefits of doing this?
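Those filters are the max_df and min_df arguments to your vectorizer. A sketch of the two examples above:

from sklearn.feature_extraction.text import TfidfVectorizer

# max_df=0.25: drop words that show up in more than 25% of recipes
# min_df=5: a word must show up in at least 5 recipes to be kept
filtered_vec = TfidfVectorizer(max_df=0.25, min_df=5)
filtered_matrix = filtered_vec.fit_transform(df.ingredient_list)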