What is an n-gram in text analysis?

When you're performing text analysis, you usually stick to looking at individual words. Sometimes, though, you need a little more specificity or a little more context, and you have to move to multi-word phrases. This is where n-grams come in.

Our dataset

Let's say we have some sentences.

sentences = [
    'Leopold ate the fish',
    'The fish ate Leopold',
    'Nora ate the fish',
    'The fish ate Nora',
    'Nora ate the bread',
]

Word counting and similarity

Which ones are the most similar? With a little natural language processing, we can find out! We'll start by counting the words.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfVectorizer 

# We just want yes/no for our words, so we use binary=True
vectorizer = CountVectorizer(binary=True)
# Later, try uncommenting this line and see what happens to the chart below!
# vectorizer = TfidfVectorizer(use_idf=False)
matrix = vectorizer.fit_transform(sentences)
counts = pd.DataFrame(
    matrix.toarray(),
    index=sentences,
    columns=vectorizer.get_feature_names_out())
counts
                      ate  bread  fish  leopold  nora  the
Leopold ate the fish    1      0     1        1     0    1
The fish ate Leopold    1      0     1        1     0    1
Nora ate the fish       1      0     1        0     1    1
The fish ate Nora       1      0     1        0     1    1
Nora ate the bread      1      1     0        0     1    1

We can see that "Leopold ate the fish" and "The fish ate Leopold" have all the same words, so they should definitely match up. Let's put a number to it to see how similar each of the sentences is:

counts.dot(counts.T) \
    .style \
    .background_gradient(axis=None)
                      Leopold ate the fish  The fish ate Leopold  Nora ate the fish  The fish ate Nora  Nora ate the bread
Leopold ate the fish                     4                     4                  3                  3                   2
The fish ate Leopold                     4                     4                  3                  3                   2
Nora ate the fish                        3                     3                  4                  4                   3
The fish ate Nora                        3                     3                  4                  4                   3
Nora ate the bread                       2                     2                  3                  3                   4
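
Where do those numbers come from? The dot product just multiplies the matching columns of two sentences together and adds up the result. We can try it by hand for one pair:

# Multiply the 0/1 columns of two sentences, then add up the matches
a = counts.loc['Leopold ate the fish']
b = counts.loc['Nora ate the fish']
(a * b).sum()
3

Three shared words: "ate," "fish," and "the."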

Each sentence matches itself perfectly, so it gets a nice big score. "The fish ate Leopold" and "Leopold ate the fish" share every single word, so they also get nice big scores.

Here's the problem, though: those sentences don't mean the same thing at all! I'd argue that "Leopold ate the fish" and "Nora ate the fish" are far more similar, but they get a lower score.

What makes those sentences the same or different? It isn't just which words they contain, it's context, it's the order the words show up in.

Technically speaking, just basing analysis on the words is called "bag of words," because it's like you threw all the words into a bag and shook them up!
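
To see the bag of words idea in action, here's a tiny sketch in plain Python: once the words are tossed into the bag, our two troublesome sentences become impossible to tell apart.

from collections import Counter

# Once word order is thrown away, these two sentences look identical
Counter('leopold ate the fish'.split()) == Counter('the fish ate leopold'.split())
True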

Instead of just counting the words, we have an alternative: count phrases. That way we can see when "ate the fish" repeats.

Counting n-grams

In the world of natural language processing, phrases are called n-grams, where n is the number of words you're looking at. 1-grams are one word, 2-grams are two words, 3-grams are three words. If you're feeling fancy, you can also call them unigrams, bigrams, or trigrams.
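
There's no magic to building them, either: an n-gram is just a window sliding across the sentence, one word at a time. Here's a little hand-rolled sketch (scikit-learn will handle this for us in a moment):

def ngrams(words, n):
    # Slide a window of length n across the list of words
    return [' '.join(words[i:i+n]) for i in range(len(words) - n + 1)]

ngrams('leopold ate the fish'.split(), 3)
['leopold ate the', 'ate the fish']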

Let's use the same code we did before, but throw in an additional option.

# We just want yes/no for our words, so we use binary=True
vectorizer = CountVectorizer(binary=True, ngram_range=(3,3))
# Later, try uncommenting this line and see what happens to the chart below!
# vectorizer = TfidfVectorizer(use_idf=False, ngram_range=(3,3))
matrix = vectorizer.fit_transform(sentences)
counts = pd.DataFrame(
    matrix.toarray(),
    index=sentences,
    columns=vectorizer.get_feature_names_out())
counts
                      ate the bread  ate the fish  fish ate leopold  fish ate nora  leopold ate the  nora ate the  the fish ate
Leopold ate the fish              0             1                 0              0                1             0             0
The fish ate Leopold              0             0                 1              0                0             0             1
Nora ate the fish                 0             1                 0              0                0             1             0
The fish ate Nora                 0             0                 0              1                0             0             1
Nora ate the bread                1             0                 0              0                0             1             0

Now we're only counting three-word phrases, aka trigrams. If we use these counts to compute the similarity between the sentences, we feel... kind of better?

counts.dot(counts.T) \
    .style \
    .background_gradient(axis=None)
                      Leopold ate the fish  The fish ate Leopold  Nora ate the fish  The fish ate Nora  Nora ate the bread
Leopold ate the fish                     2                     0                  1                  0                   0
The fish ate Leopold                     0                     2                  0                  1                   0
Nora ate the fish                        1                     0                  2                  0                   1
The fish ate Nora                        0                     1                  0                  2                   0
Nora ate the bread                       0                     0                  1                  0                   2

First off, the good thing: "Leopold ate the fish" and "Nora ate the fish" are now showing up as matches!

But then there's the bad thing: "The fish ate Leopold" is certainly a little similar to "Leopold ate the fish," but it shows up as not matching at all. What a crisis!!!
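
You don't have to take my word for it, either. Taking the dot product of just those two rows shows they share exactly zero trigrams:

# Not a single trigram in common between these two sentences
counts.loc['Leopold ate the fish'].dot(counts.loc['The fish ate Leopold'])
0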

Variable-length phrases

One more attempt: instead of just looking at 1-grams or 3-grams, we can look at 1-grams, 2-grams, and 3-grams, all at the same time.
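
To get a feel for what that means, we can peek at what scikit-learn's analyzer pulls out of a single sentence when we hand it ngram_range=(1,3):

# Every 1-, 2- and 3-gram extracted from one sentence
analyzer = CountVectorizer(ngram_range=(1,3)).build_analyzer()
analyzer('Leopold ate the fish')
['leopold', 'ate', 'the', 'fish', 'leopold ate', 'ate the', 'the fish', 'leopold ate the', 'ate the fish']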

# We have more columns than we're used to, so I'm increasing the number pandas will display.
pd.set_option("display.max_columns", 30)

# We just want yes/no for our words, so we use binary=True
vectorizer = CountVectorizer(binary=True, ngram_range=(1,3))
# Later, try uncommenting this line and see what happens to the chart below!
# vectorizer = TfidfVectorizer(use_idf=False, ngram_range=(1,3))
matrix = vectorizer.fit_transform(sentences)
counts = pd.DataFrame(
    matrix.toarray(),
    index=sentences,
    columns=vectorizer.get_feature_names_out())
counts
                      ate  ate leopold  ate nora  ate the  ate the bread  ate the fish  bread  fish  fish ate  fish ate leopold  fish ate nora  leopold  leopold ate  leopold ate the  nora  nora ate  nora ate the  the  the bread  the fish  the fish ate
Leopold ate the fish    1            0         0        1              0             1      0     1         0                 0              0        1            1                1     0         0             0    1          0         1             0
The fish ate Leopold    1            1         0        0              0             0      0     1         1                 1              0        1            0                0     0         0             0    1          0         1             1
Nora ate the fish       1            0         0        1              0             1      0     1         0                 0              0        0            0                0     1         1             1    1          0         1             0
The fish ate Nora       1            0         1        0              0             0      0     1         1                 0              1        0            0                0     1         0             0    1          0         1             1
Nora ate the bread      1            0         0        1              1             0      1     0         0                 0              0        0            0                0     1         1             1    1          1         0             0

We have a lot lot lot more columns now, maybe more than we can easily judge by eye, but that's why we have that nice colorful chart!

counts.dot(counts.T) \
    .style \
    .background_gradient(axis=None)
                      Leopold ate the fish  The fish ate Leopold  Nora ate the fish  The fish ate Nora  Nora ate the bread
Leopold ate the fish                     9                     5                  6                  4                   3
The fish ate Leopold                     5                     9                  4                  6                   2
Nora ate the fish                        6                     4                  9                  5                   6
The fish ate Nora                        4                     6                  5                  9                   3
Nora ate the bread                       3                     2                  6                  3                   9

Looking good! Comparisons are made based on single words, sure, but also on larger phrases up to 3 words long. Even though it takes a lot more columns of data, I think it does a much better job of allowing us to make useful comparisons between sentences.
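
As a last sanity check, we can pull out exactly which n-grams "Leopold ate the fish" and "Nora ate the fish" have in common. It's a healthy mix of single words and longer phrases:

# Which columns do both sentences have a 1 in?
a = counts.loc['Leopold ate the fish']
b = counts.loc['Nora ate the fish']
counts.columns[(a * b) > 0].tolist()
['ate', 'ate the', 'ate the fish', 'fish', 'the', 'the fish']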

Review

Text analysis often deals with words individually, ignoring the order they come in. This is called the bag of words technique. While this is a simple technique, it turns out you lose some of the meaning of a sentence when you look at words in isolation.

To solve this, we turned to the idea of n-grams, where you use combinations of 2 or 3 (or even more!) words instead of looking at each word individually. It ends up creating more data for you to deal with (and possibly more processing time), but it allows your code to make more nuanced comparisons with the text.