Counting words in Python with sklearn's CountVectorizer#
There are several ways to count words in Python: the easiest is probably to use a Counter! We'll be covering another technique here, the CountVectorizer from scikit-learn.
CountVectorizer is a little more intense than using Counter, but don't let that frighten you off! If your project is more complicated than "count the words in this book," the CountVectorizer might actually be easier in the long run.
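For comparison, here's what the `Counter` approach looks like — a minimal sketch using a made-up sentence:

```python
from collections import Counter

text = "the cat sat on the mat"

# Counter tallies whatever you feed it - here, a list of words
word_counts = Counter(text.split())

# .most_common() gives us a sorted list of (word, count) pairs
print(word_counts.most_common(2))
```

It's short and sweet, but it counts exactly what you give it — you're responsible for splitting, lowercasing, and stripping punctuation yourself.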
Using CountVectorizer#
While Counter is used for counting all sorts of things, the CountVectorizer is specifically used for counting words. The vectorizer part of CountVectorizer is (technically speaking!) the process of converting text into some sort of number-y thing that computers can understand.
Unfortunately, the "number-y thing that computers can understand" is kind of hard for us to understand. See below:
from sklearn.feature_extraction.text import CountVectorizer
# Build our text
text = """Yesterday I went fishing. I don't fish that often,
so I didn't catch any fish. I was told I'd enjoy myself,
but it didn't really seem that fun."""
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform([text])
matrix
We need to do a little magic to turn the results into a format we can understand.
import pandas as pd
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())
counts
Let's break that down, step by step. First we created our vectorizer.
vectorizer = CountVectorizer()
Then we told the vectorizer to read the text for us.
matrix = vectorizer.fit_transform([text])
matrix
Notice that we gave it `[text]` instead of just `text`. This is because sklearn is typically meant for the world of MACHINE LEARNING, where you're probably reading a lot of documents at once. Sklearn doesn't even want to deal with texts one at a time, so we have to send it a list.
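Because it expects a list, the vectorizer is just as happy with several documents as with one. A small sketch with two made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I like fish", "I like to go fishing"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

# One row per document, one column per unique word
# (note that single-letter "I" gets dropped - more on that below)
print(matrix.shape)
```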
When we did `.fit_transform()`, it did two things:
- Found all of the different words in the text
- Counted how many of each there were
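Those two steps are also available separately as `.fit()` and `.transform()` — `.fit_transform()` is just shorthand for doing both at once. A sketch on a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "I went fishing yesterday"

vectorizer = CountVectorizer()

# .fit() learns the vocabulary from the text...
vectorizer.fit([text])

# ...and .transform() does the counting using that vocabulary
matrix = vectorizer.transform([text])

# .vocabulary_ maps each learned word to its column number
print(sorted(vectorizer.vocabulary_))
```

Splitting them up matters later in machine learning, where you fit on one set of documents and transform another.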
The `matrix` variable it sent back is a big ugly thing just for computers. If we want to look at it, though, we can!
matrix.toarray()
Each of those numbers is how many times a word showed up - most words showed up one time, and some showed up twice. But how do we know which word is which?
print(vectorizer.get_feature_names_out())
The order of the words matches the order of the numbers! First in the words list is `any`, and first in the numbers list is `1`. That means "any" showed up once. In the same way you can figure out that `fish` is the seventh word in the list, which (count to the seventh number) showed up `2` times.
Converting the output#
Reading the `matrix` output gets easier if we move it into a pandas dataframe.
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())
counts
If we want to see a sorted list similar to what Counter gave us, though, we need to do a little shifting around.
counts.T.sort_values(by=0, ascending=False).head(10)
There's something a little weird about this. `didn` isn't a word - it should be `didn't`, right? And `i` isn't in our list, even though the text starts with "Yesterday I went fishing." The reasons why:
- By default, the CountVectorizer splits words on punctuation, so `didn't` becomes two words - `didn` and `t`. Their argument is that it's actually "did not" and shouldn't be kept together. You can read more about this right here.
- By default, the CountVectorizer also only uses words that are 2 or more letters. So `i` doesn't make the cut, nor does the `t` up above.
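Both behaviors come from the same place: CountVectorizer's default token pattern is the regular expression `(?u)\b\w\w+\b`, which only matches runs of two or more word characters and treats the apostrophe as a boundary. You can watch the splitting happen with Python's `re` module alone:

```python
import re

# CountVectorizer's default token pattern: two or more word characters
tokens = re.findall(r"(?u)\b\w\w+\b", "didn't")

# "didn" matches, but the lone "t" is too short to count as a token
print(tokens)
```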
Customizing CountVectorizer#
We don't have a good solution to the first one, but we can customize CountVectorizer to include 1-character words.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
matrix = vectorizer.fit_transform([text])
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())
counts
This ability to customize `CountVectorizer` means that even for intermediate text analysis, it's usually more useful than `Counter`.
This was a boring example that makes CountVectorizer seem like more trouble than it's worth, but it has a lot of other options we aren't dealing with here, too.
CountVectorizer in practice#
Counting words in a book#
Now that we know the basics of how to clean text and do text analysis with CountVectorizer
, let's try it with an actual book! We'll use Jane Austen's Pride and Prejudice.
import requests
# Download the book
response = requests.get('http://www.gutenberg.org/cache/epub/42671/pg42671.txt')
text = response.text
# Look at some text in the middle
print(text[4100:4600])
To count the words in the book, we're going to use the same code we used before. Since we have new content in `text`, we can 100% cut-and-paste.
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform([text])
counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())
# Show us the top 10 most common words
counts.T.sort_values(by=0, ascending=False).head(10)
How often is love used?
counts['love']
How about hate?
counts['hate']
Counting words in multiple books#
Remember how I said CountVectorizer is better at handling multiple pieces of text? Let's use that ability! We'll count words across a few classic novels.
We'll create a dataframe out of the name and URL, then grab the contents of the books from the URL.
# Build our dataframe
df = pd.DataFrame([
{ 'name': 'Pride and Prejudice', 'url': 'http://www.gutenberg.org/cache/epub/42671/pg42671.txt' },
{ 'name': 'Frankenstein', 'url': 'https://www.gutenberg.org/files/84/84-0.txt' },
{ 'name': 'Dr. Jekyll and Mr. Hyde', 'url': 'https://www.gutenberg.org/files/43/43-0.txt' },
{ 'name': 'Great Expectations', 'url': 'https://www.gutenberg.org/files/1400/1400-0.txt' },
])
# Download the contents of the book, put it in the 'content' column
df['content'] = df.url.apply(lambda url: requests.get(url).text)
# How'd it turn out?
df
Now we just feed it to the CountVectorizer, and we get a nice organized dataframe of the words counted in each book!
vectorizer = CountVectorizer()
# Use the content column instead of our single text variable
matrix = vectorizer.fit_transform(df.content)
counts = pd.DataFrame(matrix.toarray(),
                      index=df.name,
                      columns=vectorizer.get_feature_names_out())
counts.head()
We can even use it to select a few interesting words out of each!
counts[['love', 'hate', 'murder', 'terror', 'cried', 'food', 'dead', 'sister', 'husband', 'wife']]
Although Python's Counter might be easier in situations where we're just looking at one piece of text and have time to clean it up, if you're looking to do more heavy lifting (including machine learning!) you'll want to turn to scikit-learn's vectorizers.
While we talked at length about CountVectorizer here, TfidfVectorizer is another common one that will take into account how often a word is used, and whether your texts are book-long or tweet-short.
Review#
We covered how to count words in documents with scikit-learn's CountVectorizer. It works best with multiple documents at once and is a lot more complicated than working with Python's Counter.
We'll forgive CountVectorizer for its complexity because it's the foundation of a lot of machine learning and text analysis that we'll cover later.