Counting words in Python with sklearn's CountVectorizer#

There are several ways to count words in Python: the easiest is probably to use a Counter! We'll be covering another technique here, the CountVectorizer from scikit-learn.
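For comparison, a bare-bones Counter approach might look like this (a rough sketch: splitting on whitespace means punctuation stays stuck to the words):

from collections import Counter

# Split on whitespace and count - quick, but "fishing." keeps its period
words = "Yesterday I went fishing. I don't fish that often.".lower().split()
Counter(words).most_common(3)
[('i', 2), ('yesterday', 1), ('went', 1)]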

CountVectorizer is a little more intense than using Counter, but don't let that frighten you off! If your project is more complicated than "count the words in this book," the CountVectorizer might actually be easier in the long run.

Using CountVectorizer#

While Counter is used for counting all sorts of things, the CountVectorizer is specifically used for counting words. The vectorizer part of CountVectorizer refers to (technically speaking!) the process of converting text into some sort of number-y thing that computers can understand.

Unfortunately, the "number-y thing that computers can understand" is kind of hard for us to understand. See below:

from sklearn.feature_extraction.text import CountVectorizer

# Build our text
text = """Yesterday I went fishing. I don't fish that often, 
so I didn't catch any fish. I was told I'd enjoy myself, 
but it didn't really seem that fun."""

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform([text])
matrix
<1x20 sparse matrix of type '<class 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>

We need to do a little magic to turn the results into a format we can understand.

import pandas as pd

# (heads up: in scikit-learn 1.0 and up, get_feature_names() is
# renamed to get_feature_names_out() - here and everywhere below)
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names())
counts
any but catch didn don enjoy fish fishing fun it myself often really seem so that told was went yesterday
0 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1

Understanding CountVectorizer#

Let's break it down line by line.

Creating and using a vectorizer#

First, we made a new CountVectorizer. This is the thing that's going to understand and count the words for us. It has a lot of different options, but we'll just use the normal, standard version for now.

vectorizer = CountVectorizer()

Then we told the vectorizer to read the text for us.

matrix = vectorizer.fit_transform([text])
matrix
<1x20 sparse matrix of type '<class 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>

Notice that we gave it [text] instead of just text. This is because sklearn is typically meant for the world of MACHINE LEARNING, where you're probably reading a lot of documents at once. Sklearn doesn't even want to deal with texts one at a time, so we have to send it a list.
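To see what the list buys us: hand it two documents and each one gets its own row of counts. A quick sketch with two made-up sentences:

# Two documents in means two rows of counts out
demo = CountVectorizer().fit_transform(["I like fish", "I like cats and fish"])
demo.shape
(2, 4)

That's two rows (one per document) and four columns (one per unique word - the single-letter I gets dropped, which we'll get to below).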

When we ran .fit_transform(), it did two things:

  1. Found all of the different words in the text
  2. Counted how many of each there were
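Those two steps also exist as separate methods, .fit() and .transform(), which is handy when you want to count new text against a vocabulary you've already learned. A minimal sketch:

vectorizer = CountVectorizer()
vectorizer.fit([text])                 # step 1: learn the vocabulary
matrix = vectorizer.transform([text])  # step 2: count each word against it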

The matrix variable it sent back is a big ugly thing just for computers. If we want to look at it, though, we can!

matrix.toarray()
array([[1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1]])

Each of those numbers is how many times a word showed up - most words showed up one time, and some showed up twice. But how do we know which word is which?

print(vectorizer.get_feature_names())
['any', 'but', 'catch', 'didn', 'don', 'enjoy', 'fish', 'fishing', 'fun', 'it', 'myself', 'often', 'really', 'seem', 'so', 'that', 'told', 'was', 'went', 'yesterday']

The order of the words matches the order of the numbers! First in the words list is any, and first in the numbers list is 1. That means "any" showed up once. In the same way you can figure out that fish is the seventh word in the list, which (count to the seventh number) showed up 2 times.
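If counting along the list by hand sounds error-prone, you can pair each word with its count directly (a quick sketch using Python's built-in zip):

# Match each word up with its count
dict(zip(vectorizer.get_feature_names(), matrix.toarray()[0]))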

Converting the output#

Reading the matrix output gets easier if we move it into a pandas dataframe.

counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names())
counts
any but catch didn don enjoy fish fishing fun it myself often really seem so that told was went yesterday
0 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1

If we want to see a sorted list similar to what Counter gave us, though, we need to do a little shifting around. .T flips the dataframe so each word gets its own row, and then sort_values(by=0, ascending=False) sorts by column 0 - our one and only row of counts.

counts.T.sort_values(by=0, ascending=False).head(10)
0
didn 2
fish 2
that 2
any 1
often 1
went 1
was 1
told 1
so 1
seem 1

There's something a little weird about this. didn isn't a word - it should be didn't, right? And i isn't in our list, even though the text starts with "Yesterday I went fishing." The reasons why:

  • By default, the CountVectorizer splits words on punctuation, so didn't becomes two words - didn and t. The reasoning is that didn't is really "did not" and the pieces shouldn't be kept together (you can see this in action below).
  • By default, the CountVectorizer also only keeps words that are 2 or more letters long. So i doesn't make the cut, nor does the t up above.
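You can watch the default tokenizer do this with .build_analyzer(), which hands back the function CountVectorizer uses to split text into words:

# The default analyzer turns "didn't" into just ['didn'] - the 't' is
# split off at the apostrophe, then dropped for being only one letter
analyzer = CountVectorizer().build_analyzer()
analyzer("didn't")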

Customizing CountVectorizer#

We don't have a good solution to the first one, but we can customize CountVectorizer to include 1-character words.

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")

matrix = vectorizer.fit_transform([text])
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names())

counts
any but catch d didn don enjoy fish fishing fun ... often really seem so t that told was went yesterday
0 1 1 1 1 2 1 1 2 1 1 ... 1 1 1 1 3 2 1 1 1 1

1 rows × 23 columns

This ability to customize CountVectorizer means that even for intermediate text analysis, it's usually more useful than Counter.

This was a boring example that makes CountVectorizer seem like more trouble than it's worth, but it has a lot of other options we aren't dealing with here, too.
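For example, here are a few of those options (all real CountVectorizer parameters, shown with illustrative values):

vectorizer = CountVectorizer(
    stop_words='english',  # throw out common words like 'the' and 'of'
    min_df=2,              # ignore words that show up in fewer than 2 documents
    max_features=1000      # only keep the 1,000 most frequent words
)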

CountVectorizer in practice#

Counting words in a book#

Now that we know the basics of how to clean text and do text analysis with CountVectorizer, let's try it with an actual book! We'll use Jane Austen's Pride and Prejudice.

import requests

# Download the book
response = requests.get('http://www.gutenberg.org/cache/epub/42671/pg42671.txt')
text = response.text

# Look at some text in the middle
print(text[4100:4600])
d to be any thing extraordinary now. When a woman has
five grown up daughters, she ought to give over thinking of her own
beauty."

"In such cases, a woman has not often much beauty to think of."

"But, my dear, you must indeed go and see Mr. Bingley when he comes into
the neighbourhood."

"It is more than I engage for, I assure you."

"But consider your daughters. Only think what an establishment it would
be for one of them. Sir William and Lady Lucas are determined to go,
merely o

To count the words in the book, we're going to use the same code we used before. Since we have new content in text, we can 100% cut-and-paste.

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform([text])
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names())

# Show us the top 10 most common words
counts.T.sort_values(by=0, ascending=False).head(10)
0
the 4520
to 4242
of 3749
and 3662
her 2205
in 1941
was 1846
she 1689
that 1566
it 1549

How often is love used?

counts['love']
0    92
Name: love, dtype: int64

How about hate?

counts['hate']
0    9
Name: hate, dtype: int64

Counting words in multiple books#

Remember how I said CountVectorizer is better at handling multiple pieces of text? Let's use that ability! We'll use a few books: Pride and Prejudice, Frankenstein, Dr. Jekyll and Mr. Hyde, and Great Expectations.

We'll create a dataframe out of each book's name and URL, then grab the contents of the books from the URLs.

# Build our dataframe
df = pd.DataFrame([
    { 'name': 'Pride and Prejudice', 'url': 'http://www.gutenberg.org/cache/epub/42671/pg42671.txt' },
    { 'name': 'Frankenstein', 'url': 'https://www.gutenberg.org/files/84/84-0.txt' },
    { 'name': 'Dr. Jekyll and Mr. Hyde', 'url': 'https://www.gutenberg.org/files/43/43-0.txt' },
    { 'name': 'Great Expectations', 'url': 'https://www.gutenberg.org/files/1400/1400-0.txt' },
])

# Download the contents of the book, put it in the 'content' column
df['content'] = df.url.apply(lambda url: requests.get(url).text)

# How'd it turn out?
df
name url content
0 Pride and Prejudice http://www.gutenberg.org/cache/epub/42671/pg42... The Project Gutenberg eBook, Pride and Prejud...
1 Frankenstein https://www.gutenberg.org/files/84/84-0.txt \r\nProject Gutenberg's Frankenstein, by Ma...
2 Dr. Jekyll and Mr. Hyde https://www.gutenberg.org/files/43/43-0.txt \r\nThe Project Gutenberg EBook of The Strange...
3 Great Expectations https://www.gutenberg.org/files/1400/1400-0.txt The Project Gutenberg EBook of Great Expect...

Now we just feed it to the CountVectorizer, and we get a nice organized dataframe of the words counted in each book!

vectorizer = CountVectorizer()

# Use the content column instead of our single text variable
matrix = vectorizer.fit_transform(df.content)
counts = pd.DataFrame(matrix.toarray(),
                      index=df.name,
                      columns=vectorizer.get_feature_names())

counts.head()
000 10 10_th 11 11th 12 12th 13 13th 14 ... yourselves youth youthful youthfulness youths youâ zeal zealous zest zip
name
Pride and Prejudice 1 0 0 0 0 0 0 0 0 0 ... 2 9 0 0 1 0 0 0 0 3
Frankenstein 1 2 0 2 2 2 2 3 1 2 ... 1 21 3 0 0 1 4 0 0 1
Dr. Jekyll and Mr. Hyde 1 0 1 0 0 0 1 0 0 0 ... 0 2 0 0 0 1 0 0 0 1
Great Expectations 1 0 0 0 0 0 0 0 0 0 ... 2 9 2 1 0 0 2 2 1 1

4 rows × 16183 columns

We can even use it to select interesting words out of each!

counts[['love', 'hate', 'murder', 'terror', 'cried', 'food', 'dead', 'sister', 'husband', 'wife']]
love hate murder terror cried food dead sister husband wife
name
Pride and Prejudice 92 9 0 0 91 0 5 217 50 47
Frankenstein 59 9 21 10 15 27 23 26 2 11
Dr. Jekyll and Mr. Hyde 3 1 10 12 11 0 13 0 0 1
Great Expectations 60 4 20 28 60 8 49 170 16 27

Although Python's Counter might be easier in situations where we're just looking at one piece of text and have time to clean it up, if you're looking to do more heavy lifting (including machine learning!) you'll want to turn to scikit-learn's vectorizers.

While we talked at length about CountVectorizer here, TfidfVectorizer is another common one that will take into account how often a word is used, and whether your texts are book-long or tweet-short.
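TfidfVectorizer is a drop-in replacement for CountVectorizer - a minimal sketch, reusing our df of books from above:

from sklearn.feature_extraction.text import TfidfVectorizer

# Same fit_transform pattern, but weighted scores replace the raw counts
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(df.content)
scores = pd.DataFrame(matrix.toarray(),
                      index=df.name,
                      columns=vectorizer.get_feature_names())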

Review#

We covered how to count words in documents with scikit-learn's CountVectorizer. It works best with multiple documents at once and is a lot more complicated than working with Python's Counter.

We'll forgive CountVectorizer for its complexity because it's the foundation of a lot of machine learning and text analysis that we'll cover later.