Using classification algorithms with text#
It's easy to understand how a classifier might figure things out with numbers: "these numbers are close, so these topics must be related." But how does that go down when we're talking about language?
The foundations#
When you work on a classifier with text, the very first thing you need to do is turn those words into numbers. No matter what you use - Python's Counter, scikit-learn's CountVectorizer, or even something like spaCy, NLTK, or Gensim - it all ends up more or less the same in the end.
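To make that concrete, here's a tiny sketch using scikit-learn's CountVectorizer on a couple of made-up ingredient lists (the texts and variable names here are just for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up "recipes" - each row becomes a recipe,
# each column becomes a word, each cell counts that word
demo_texts = ['salt pepper salt', 'pepper cumin']
demo_vectorizer = CountVectorizer()
demo_matrix = demo_vectorizer.fit_transform(demo_texts)

print(demo_vectorizer.get_feature_names_out())
print(demo_matrix.toarray())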
Let's build a dataset#
We have a lot of walkthroughs of real, published classifiers that use text, so we'll take care of this nice and quickly. Let's say we have a dataset of recipe ingredients.
import pandas as pd
pd.set_option("display.max_colwidth", 150)
df = pd.read_csv('data/recipes-indian.csv')
df.head()
df.cuisine.value_counts()
(df.cuisine == 'indian').value_counts()
First things first: convert everything to numbers.
Convert our target to numbers#
Right now our cuisine column is the name of the cuisine - a string! Let's fix it up into a number real quick. We'll use the "make it True or False and then turn that into a number" trick to make it happen.
df['is_indian'] = (df.cuisine == "indian").astype(int)
df.head()
df.is_indian.value_counts()
Convert our words to numbers#
Now we'll need to fix up our words. We've hopefully been through counting words, and in this situation we're specifically using a TF-IDF vectorizer to de-emphasize more common words.
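Here's a quick toy example of that de-emphasizing, using made-up ingredient lists rather than our real data: "salt" shows up in every recipe, so TF-IDF scores it lower than rarer words.

from sklearn.feature_extraction.text import TfidfVectorizer

# 'salt' is in every recipe, so it gets a lower TF-IDF score
# than rarer words like 'turmeric', 'pepper' or 'cumin'
demo = ['salt turmeric', 'salt pepper', 'salt cumin']
demo_vec = TfidfVectorizer()
print(demo_vec.fit_transform(demo).toarray().round(2))
print(demo_vec.get_feature_names_out())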
I'm stealing this code from the reference page for vectorization.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(df.ingredient_list)
words_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()
X = words_df
y = df.is_indian
We can look at them if we want!
# Our features
X.head(2)
# Our labels
y.head(2)
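If you want to double-check that everything lines up, each row of X should pair with exactly one label in y.

# One row per recipe, one column per word in our vocabulary
print(X.shape)
print(y.shape)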
Now all we need to do is make a classifier and feed it our data.
Training our classifier#
We have a million and one classifiers at our disposal, but we'll keep things simple and use a logistic regression classifier for now. Honestly, we're using it because when we try to explain it, it's just so crisp and nice. Whether it's the best classifier is not something we're addressing right now.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
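If you'd like a quick gut check, you can ask the classifier to score itself on the data it was trained on. Take the number with a grain of salt - it's over-optimistic, since the classifier has already seen every one of these recipes.

# Accuracy on the training data (over-optimistic!)
clf.score(X, y)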
Making predictions#
Let's build a new dataframe with a few recipes and see what the algorithm thinks of them.
unknown = pd.DataFrame({
'content': [
'microwaveable chicken breast, lettuce, taco seasoning, salsa, tortilla chips',
'onions, besan, green chilies, cumin seeds, turmeric, coriander powder, salt, oil for deep-frying',
'spinach, olive oil, vinegar, carrots, cucumbers, cilantro'
]
})
unknown
In order to make a prediction, we need to transform these strings into numbers. Last time we used vectorizer.fit_transform, but we don't do that this time! When you do .fit_transform, it does two things:

- fit learns all the words
- transform turns the strings into numbers

We already know all the words we need to know. Any new words won't add anything, because we won't know whether they're related to Indian cuisine or not. Instead of using .fit_transform we'll just use .transform!
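If you want proof that new words really are ignored, try transforming a made-up ingredient list - "xyzzy" is a word we invented, and the vectorizer just skips it. The output has exactly the same columns as our training matrix.

# 'xyzzy' never appeared in our training data, so it's silently dropped
vectorizer.transform(['cumin xyzzy turmeric']).shape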
# Re-use our vectorizer, making sure to use .transform and NOT .fit_transform
# We're also skipping turning this into a words_df - we just did that last time
# so we could look at it with our poor human eyes
matrix = vectorizer.transform(unknown.content)
unknown['prediction'] = clf.predict(matrix)
What's it think?
unknown
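.predict gives you a hard yes-or-no answer. Since we're using a logistic regression, we can also ask for probabilities to see how confident each prediction is.

# Each row is [probability not Indian, probability Indian]
clf.predict_proba(matrix)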
Check the evaluating classifiers section for more details on how to know if our classifier did a good job or not.
Explaining our classifier#
It's one thing to make predictions or to judge how well the classifier did, but the really important thing is to figure out how to explain what it was up to. We'll use the eli5 library to do this, because it's perfect. You just send it the classifier and your vectorizer and voilà, you have your top terms.
import eli5
eli5.show_weights(clf, vec=vectorizer)
According to the classifier, "curry" is the top signifier of a recipe being Indian (notice the y=1 up top?). eggs, oregano and sauce are the major signifiers that it is not Indian.
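Under the hood, eli5 is just reading the classifier's coefficients. If you ever want those numbers by hand - say, to save them to a spreadsheet - here's one way to pull them out yourself. This is a sketch of the same idea, not something eli5 requires:

# Pair each word with its coefficient: positive pushes toward
# Indian (y=1), negative pushes away from it
coefs = pd.DataFrame({
    'word': vectorizer.get_feature_names_out(),
    'coefficient': clf.coef_[0]
})
coefs.sort_values('coefficient', ascending=False).head(10)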
Explaining individual predictions#
We can also explain individual predictions!
# Explain the first one
eli5.show_prediction(clf, unknown.content[0], vec=vectorizer)
Be sure to read up top and notice it's predicting y=0. That's why salsa is +1.594 - it's pushing strongly toward this not being Indian food. Compare that with the second prediction:
# Second one
eli5.show_prediction(clf, unknown.content[1], vec=vectorizer)
It's talking about y=1 up top, so seeds and turmeric are both pushing strongly for it being Indian food. For the third one, we're back at y=0, so the top terms are telling you it is not Indian.
# Third one
eli5.show_prediction(clf, unknown.content[2], vec=vectorizer)
Review#
In this section we learned how to combine word counts with a classifier. We turned our text into numbers with a TF-IDF vectorizer, trained a logistic regression on the result, used it to predict whether new recipes were Indian, and explained the classifier's decisions with the eli5 library.
Discussion topics#
In this situation we used a TfidfVectorizer to vectorize our text. They're great at stressing less-often-used words. What if we were trying to determine whether a recipe was vegetarian or not? What argument could be made that instead of TF-IDF, we should just use a 0/1 as to whether each ingredient was included?
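If you'd like to experiment with that idea, CountVectorizer has a binary=True option that records only presence or absence. A quick sketch, reusing our ingredient_list column:

from sklearn.feature_extraction.text import CountVectorizer

# binary=True marks each word as present (1) or absent (0)
# instead of counting or weighting it
binary_vec = CountVectorizer(binary=True)
binary_matrix = binary_vec.fit_transform(df.ingredient_list)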
Your vectorizer can filter out words based on how often they occur. For example, "only words that show up in less than 25% of recipes" or "must show up in at least 5 recipes." What could be the benefits of doing this?
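Those filters are the max_df and min_df arguments to your vectorizer. A sketch of the two examples above:

from sklearn.feature_extraction.text import TfidfVectorizer

# max_df=0.25: drop words that show up in more than 25% of recipes
# min_df=5: a word must show up in at least 5 recipes to be kept
filtered_vec = TfidfVectorizer(max_df=0.25, min_df=5)
filtered_matrix = filtered_vec.fit_transform(df.ingredient_list)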