Counting words in Python with sklearn's CountVectorizer#

There are several ways to count words in Python: the easiest is probably to use a Counter! We'll be covering another technique here, the CountVectorizer from scikit-learn.
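For comparison, a bare-bones Counter approach might look like this (a rough sketch: splitting on whitespace means punctuation stays stuck to the words):

from collections import Counter

# Split on whitespace and count - quick, but "fishing." keeps its period
words = "Yesterday I went fishing. I don't fish that often.".lower().split()
Counter(words).most_common(3)
[('i', 2), ('yesterday', 1), ('went', 1)]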

CountVectorizer is a little more intense than using Counter, but don't let that frighten you off! If your project is more complicated than "count the words in this book," the CountVectorizer might actually be easier in the long run.

Using CountVectorizer#

While Counter is used for counting all sorts of things, the CountVectorizer is specifically used for counting words. The vectorizer part of CountVectorizer refers to (technically speaking!) the process of converting text into some sort of number-y thing that computers can understand.

Unfortunately, the "number-y thing that computers can understand" is kind of hard for us to understand. See below:

from sklearn.feature_extraction.text import CountVectorizer

# Build our text
text = """Yesterday I went fishing. I don't fish that often, 
so I didn't catch any fish. I was told I'd enjoy myself, 
but it didn't really seem that fun."""

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform([text])
matrix
<1x20 sparse matrix of type '<class 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>

We need to do a little magic to turn the results into a format we can understand.

import pandas as pd

# (heads up: in scikit-learn 1.0 and up, get_feature_names() is
# renamed to get_feature_names_out() - here and everywhere below)
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names())
counts
any but catch didn don enjoy fish fishing fun it myself often really seem so that told was went yesterday
0 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1

Understanding CountVectorizer#

Let's break it down line by line.

Creating and using a vectorizer#

First, we made a new CountVectorizer. This is the thing that's going to understand and count the words for us. It has a lot of different options, but we'll just use the normal, standard version for now.

vectorizer = CountVectorizer()

Then we told the vectorizer to read the text for us.

matrix = vectorizer.fit_transform([text])
matrix
<1x20 sparse matrix of type '<class 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>

Notice that we gave it [text] instead of just text. This is because sklearn is typically meant for the world of MACHINE LEARNING, where you're probably reading a lot of documents at once. Sklearn doesn't even want to deal with texts one at a time, so we have to send it a list.
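To see what the list buys us: hand it two documents and each one gets its own row of counts. A quick sketch with two made-up sentences:

# Two documents in means two rows of counts out
demo = CountVectorizer().fit_transform(["I like fish", "I like cats and fish"])
demo.shape
(2, 4)

That's two rows (one per document) and four columns (one per unique word - the single-letter I gets dropped, which we'll get to below).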

When we ran .fit_transform(), it did two things:

  1. Found all of the different words in the text
  2. Counted how many of each there were
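Those two steps also exist as separate methods, .fit() and .transform(), which is handy when you want to count new text against a vocabulary you've already learned. A minimal sketch:

vectorizer = CountVectorizer()
vectorizer.fit([text])                 # step 1: learn the vocabulary
matrix = vectorizer.transform([text])  # step 2: count each word against it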

The matrix variable it sent back is a big ugly thing just for computers. If we want to look at it, though, we can!

matrix.toarray()
array([[1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1]])

Each of those numbers is how many times a word showed up - most words showed up one time, and some showed up twice. But how do we know which word is which?

print(vectorizer.get_feature_names())
['any', 'but', 'catch', 'didn', 'don', 'enjoy', 'fish', 'fishing', 'fun', 'it', 'myself', 'often', 'really', 'seem', 'so', 'that', 'told', 'was', 'went', 'yesterday']

The order of the words matches the order of the numbers! First in the words list is any, and first in the numbers list is 1. That means "any" showed up once. In the same way you can figure out that fish is the seventh word in the list, which (count to the seventh number) showed up 2 times.
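If counting along the list by hand sounds error-prone, you can pair each word with its count directly (a quick sketch using Python's built-in zip):

# Match each word up with its count
dict(zip(vectorizer.get_feature_names(), matrix.toarray()[0]))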

Converting the output#

Reading the matrix output gets easier if we move it into a pandas dataframe.

counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names())
counts
any but catch didn don enjoy fish fishing fun it myself often really seem so that told was went yesterday
0 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1

If we want to see a sorted list similar to what Counter gave us, though, we need to do a little shifting around. .T flips the dataframe so each word gets its own row, and then sort_values(by=0, ascending=False) sorts by column 0 - our one and only row of counts.

counts.T.sort_values(by=0, ascending=False).head(10)
0
didn 2
fish 2
that 2
any 1
often 1
went 1
was 1
told 1
so 1
seem 1

There's something a little weird about this. didn isn't a word - it should be didn't, right? And i isn't in our list, even though the text starts with "Yesterday I went fishing." The reasons why:

  • By default, the CountVectorizer splits words on punctuation, so didn't becomes two words - didn and t. The reasoning is that didn't is really "did not" and the pieces shouldn't be kept together (you can see this in action below).
  • By default, the CountVectorizer also only keeps words that are 2 or more letters long. So i doesn't make the cut, nor does the t up above.
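You can watch the default tokenizer do this with .build_analyzer(), which hands back the function CountVectorizer uses to split text into words:

# The default analyzer turns "didn't" into just ['didn'] - the 't' is
# split off at the apostrophe, then dropped for being only one letter
analyzer = CountVectorizer().build_analyzer()
analyzer("didn't")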

Customizing CountVectorizer#

We don't have a good solution to the first one, but we can customize CountVectorizer to include 1-character words.

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")

matrix = vectorizer.fit_transform([text])
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names())

counts
any but catch d didn don enjoy fish fishing fun ... often really seem so t that told was went yesterday
0 1 1 1 1 2 1 1 2 1 1 ... 1 1 1 1 3 2 1 1 1 1

1 rows × 23 columns

This ability to customize CountVectorizer means that even for intermediate text analysis, it's usually more useful than Counter.

This was a boring example that makes CountVectorizer seem like more trouble than it's worth, but it has a lot of other options we aren't dealing with here, too.
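For example, here are a few of those options (all real CountVectorizer parameters, shown with illustrative values):

vectorizer = CountVectorizer(
    stop_words='english',  # throw out common words like 'the' and 'of'
    min_df=2,              # ignore words that show up in fewer than 2 documents
    max_features=1000      # only keep the 1,000 most frequent words
)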

CountVectorizer in practice#

Counting words in a book#

Now that we know the basics of how to clean text and do text analysis with CountVectorizer, let's try it with an actual book! We'll use Jane Austen's Pride and Prejudice.

import requests

# Download the book
response = requests.get('http://www.gutenberg.org/cache/epub/42671/pg42671.txt')
text = response.text

# Look at some text in the middle
print(text[4100:4600])
d to be any thing extraordinary now. When a woman has
five grown up daughters, she ought to give over thinking of her own
beauty."

"In such cases, a woman has not often much beauty to think of."

"But, my dear, you must indeed go and see Mr. Bingley when he comes into
the neighbourhood."

"It is more than I engage for, I assure you."

"But consider your daughters. Only think what an establishment it would
be for one of them. Sir William and Lady Lucas are determined to go,
merely o

To count the words in the book, we're going to use the same code we used before. Since we have new content in text, we can 100% cut-and-paste.

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform([text])
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names())

# Show us the top 10 most common words
counts.T.sort_values(by=0, ascending=False).head(10)
0
the 4520
to 4242
of 3749
and 3662
her 2205
in 1941
was 1846
she 1689
that 1566
it 1549

How often is love used?

counts['love']
0    92
Name: love, dtype: int64

How about hate?

counts['hate']
0    9
Name: hate, dtype: int64

Counting words in multiple books#

Remember how I said CountVectorizer is better at handling multiple pieces of text? Let's use that ability! We'll use a few books: Pride and Prejudice, Frankenstein, Dr. Jekyll and Mr. Hyde, and Great Expectations.

We'll create a dataframe out of each book's name and URL, then grab the contents of the books from the URLs.

# Build our dataframe
df = pd.DataFrame([
    { 'name': 'Pride and Prejudice', 'url': 'http://www.gutenberg.org/cache/epub/42671/pg42671.txt' },
    { 'name': 'Frankenstein', 'url': 'https://www.gutenberg.org/files/84/84-0.txt' },
    { 'name': 'Dr. Jekyll and Mr. Hyde', 'url': 'https://www.gutenberg.org/files/43/43-0.txt' },
    { 'name': 'Great Expectations', 'url': 'https://www.gutenberg.org/files/1400/1400-0.txt' },
])

# Download the contents of the book, put it in the 'content' column
df['content'] = df.url.apply(lambda url: requests.get(url).text)

# How'd it turn out?
df
name url content
0 Pride and Prejudice http://www.gutenberg.org/cache/epub/42671/pg42... The Project Gutenberg eBook, Pride and Prejud...
1 Frankenstein https://www.gutenberg.org/files/84/84-0.txt \r\nProject Gutenberg's Frankenstein, by Ma...
2 Dr. Jekyll and Mr. Hyde https://www.gutenberg.org/files/43/43-0.txt \r\nThe Project Gutenberg EBook of The Strange...
3 Great Expectations https://www.gutenberg.org/files/1400/1400-0.txt The Project Gutenberg EBook of Great Expect...

Now we just feed it to the CountVectorizer, and we get a nice organized dataframe of the words counted in each book!

vectorizer = CountVectorizer()

# Use the content column instead of our single text variable
matrix = vectorizer.fit_transform(df.content)
counts = pd.DataFrame(matrix.toarray(),
                      index=df.name,
                      columns=vectorizer.get_feature_names())

counts.head()
000 10 10_th 11 11th 12 12th 13 13th 14 ... yourselves youth youthful youthfulness youths youâ zeal zealous zest zip
name
Pride and Prejudice 1 0 0 0 0 0 0 0 0 0 ... 2 9 0 0 1 0 0 0 0 3
Frankenstein 1 2 0 2 2 2 2 3 1 2 ... 1 21 3 0 0 1 4 0 0 1
Dr. Jekyll and Mr. Hyde 1 0 1 0 0 0 1 0 0 0 ... 0 2 0 0 0 1 0 0 0 1
Great Expectations 1 0 0 0 0 0 0 0 0 0 ... 2 9 2 1 0 0 2 2 1 1

4 rows × 16183 columns

We can even use it to select interesting words out of each!

counts[['love', 'hate', 'murder', 'terror', 'cried', 'food', 'dead', 'sister', 'husband', 'wife']]
love hate murder terror cried food dead sister husband wife
name
Pride and Prejudice 92 9 0 0 91 0 5 217 50 47
Frankenstein 59 9 21 10 15 27 23 26 2 11
Dr. Jekyll and Mr. Hyde 3 1 10 12 11 0 13 0 0 1
Great Expectations 60 4 20 28 60 8 49 170 16 27

Although Python's Counter might be easier in situations where we're just looking at one piece of text and have time to clean it up, if you're looking to do more heavy lifting (including machine learning!) you'll want to turn to scikit-learn's vectorizers.

While we talked at length about CountVectorizer here, TfidfVectorizer is another common one that will take into account how often a word is used, and whether your texts are book-long or tweet-short.
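TfidfVectorizer is a drop-in replacement for CountVectorizer - a minimal sketch, reusing our df of books from above:

from sklearn.feature_extraction.text import TfidfVectorizer

# Same fit_transform pattern, but weighted scores replace the raw counts
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(df.content)
scores = pd.DataFrame(matrix.toarray(),
                      index=df.name,
                      columns=vectorizer.get_feature_names())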

Review#

We covered how to count words in documents with scikit-learn's CountVectorizer. It works best with multiple documents at once and is a lot more complicated than working with Python's Counter.

We'll forgive CountVectorizer for its complexity because it's the foundation of a lot of machine learning and text analysis that we'll cover later.