Counting words with Python's Counter

Like all things, counting words using Python can be done two different ways: the easy way or the hard way. Using the Counter tool is the easy way!

Counter is generally used for, well, counting things.

Getting started

from collections import Counter

Counter([1, 4, 3, 2, 3, 3, 2, 1, 3, 4, 1, 2])
Counter({1: 3, 4: 2, 3: 4, 2: 3})

If you have a list of words, you can use it to count how many times each word appears.

Counter(['hello', 'goodbye', 'goodbye', 'hello', 'hello', 'party'])
Counter({'hello': 3, 'goodbye': 2, 'party': 1})
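A Counter behaves like a dictionary, so you can look up individual words too. The short sketch below reuses the same word list: asking for a word that never appeared returns 0 instead of raising a KeyError, and .update adds more items to the existing counts.

```python
from collections import Counter

counts = Counter(['hello', 'goodbye', 'goodbye', 'hello', 'hello', 'party'])

# Look up a single word, just like a dictionary
print(counts['hello'])    # 3

# Words that never appeared count as 0 instead of raising KeyError
print(counts['missing'])  # 0

# .update adds new observations to the existing counts
counts.update(['party', 'party'])
print(counts['party'])    # 3
```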

If we want to use it to count words in a normal piece of text, though, we'll have to turn our text into a list of words. We also need to do a little bit of cleanup - removing punctuation, making everything lowercase, just making sure the only things left are words.

import re

text = """Yesterday I went fishing. I don't fish that often, 
so I didn't catch any fish. I was told I'd enjoy myself, 
but it didn't really seem that fun."""

# Lowercase everything, because FISH and fish and Fish are the same word
text = text.lower()

# Remove anything that isn't a word character or a space
# We could use .replace(".", "") but regex is a lot easier!
# The r"" prefix keeps Python from treating \w as a string escape
text = re.sub(r"[^\w ]", "", text)

print("Cleaned sentence is:", text)

words = text.split(" ")
Counter(words)
Cleaned sentence is: yesterday i went fishing i dont fish that often so i didnt catch any fish i was told id enjoy myself but it didnt really seem that fun
Counter({'yesterday': 1,
         'i': 4,
         'went': 1,
         'fishing': 1,
         'dont': 1,
         'fish': 2,
         'that': 2,
         'often': 1,
         'so': 1,
         'didnt': 2,
         'catch': 1,
         'any': 1,
         'was': 1,
         'told': 1,
         'id': 1,
         'enjoy': 1,
         'myself': 1,
         'but': 1,
         'it': 1,
         'really': 1,
         'seem': 1,
         'fun': 1})
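Notice that the cleanup turned "don't" into "dont" and "I'd" into "id". If you would rather keep contractions intact, one possible tweak (an assumption about what you want, not the only option) is to let apostrophes survive the cleanup. The sentence below is a shortened version of the example text.

```python
import re
from collections import Counter

sentence = "I don't fish that often, so I didn't catch any fish."

# Same cleanup as before, but the apostrophe is added to the
# set of characters we keep
cleaned = re.sub(r"[^\w' ]", "", sentence.lower())

counts = Counter(cleaned.split())
print(counts["don't"])  # 1
print(counts["fish"])   # 2
```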

If you have a lot of text, you're usually only interested in the most common words. If you just want the top words, .most_common is going to be your best friend.

Counter(words).most_common(5)
[('i', 4), ('fish', 2), ('that', 2), ('didnt', 2), ('yesterday', 1)]
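When several words are tied, .most_common lists them in the order they were first seen. That is the only reason 'yesterday' lands in fifth place above: it was the first of the once-only words. A tiny example with made-up letters shows the same thing, along with the fact that calling .most_common() with no number returns everything.

```python
from collections import Counter

# 'b' and 'a' both appear twice, but 'b' is seen first
counts = Counter(['b', 'a', 'b', 'a', 'c'])

print(counts.most_common(1))  # [('b', 2)]

# With no argument you get every item, most common first
print(counts.most_common())   # [('b', 2), ('a', 2), ('c', 1)]
```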

Counting words in a book

Now that we know the basics of how to clean text and do text analysis with Counter, let's try it with an actual book! We'll use Jane Austen's Pride and Prejudice.

import requests

response = requests.get('http://www.gutenberg.org/cache/epub/42671/pg42671.txt')
text = response.text

print(text[4100:4500])
d to be any thing extraordinary now. When a woman has
five grown up daughters, she ought to give over thinking of her own
beauty."

"In such cases, a woman has not often much beauty to think of."

"But, my dear, you must indeed go and see Mr. Bingley when he comes into
the neighbourhood."

"It is more than I engage for, I assure you."

"But consider your daughters. Only think what an es

The easiest and most boring thing we can do is count the words in it. So, let's count the words in it.

text = text.lower()
text = re.sub(r"[^\w ]", "", text)

words = text.split(" ")
Counter(words).most_common(20)
[('the', 3751),
 ('to', 3746),
 ('of', 3298),
 ('', 3289),
 ('and', 3113),
 ('her', 1811),
 ('a', 1745),
 ('in', 1679),
 ('i', 1655),
 ('was', 1622),
 ('she', 1385),
 ('that', 1325),
 ('it', 1294),
 ('not', 1278),
 ('he', 1148),
 ('you', 1145),
 ('be', 1101),
 ('his', 1061),
 ('as', 1052),
 ('had', 1036)]
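The '' entry in that list is a side effect of the cleanup. Splitting on a single space produces an empty string wherever two spaces sit next to each other, and because line breaks are deleted rather than replaced, the last word of one line gets glued to the first word of the next (which is where tokens like 'camedown' come from later on). Here is a minimal sketch of the problem and one possible fix, using a made-up sample string instead of the book. The counts in this article were produced with the original approach, so they would shift somewhat if you switched.

```python
import re

# Made-up sample with a line break and a double space
sample = "he came\ndown the  stairs"

# The approach used above: newlines are deleted, double spaces yield ''
naive = re.sub(r"[^\w ]", "", sample.lower()).split(" ")
print(naive)   # ['he', 'camedown', 'the', '', 'stairs']

# Alternative: keep every kind of whitespace (\s), then let .split()
# with no argument break on any run of spaces, tabs or newlines
tokens = re.sub(r"[^\w\s]", "", sample.lower()).split()
print(tokens)  # ['he', 'came', 'down', 'the', 'stairs']
```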

Secret tricks with Counter

Counting words is all fine and good, but with a little bit of regular expression skill we can dig a little bit deeper!

Only extracting some words with regular expressions

Do men and women do different things in this book? Let's look at she ____ and he ____ to see what we can find out!

\b marks a word boundary. Without it, the pattern he (\w+) would also match inside the phrase "she talks", so every word that follows "she" would be counted for "he" as well.
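To see the difference \b makes, here is a quick check on a made-up sentence. Without the boundary, the "he" hiding inside "she" matches too.

```python
import re

sentence = "she talks and he walks"

# No word boundary: the 'he' inside 'she' also matches
print(re.findall(r"he (\w+)", sentence))    # ['talks', 'walks']

# With \b, only the standalone word 'he' matches
print(re.findall(r"\bhe (\w+)", sentence))  # ['walks']
```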

# Catch every word after 'she'
she_words = re.findall(r"\b[Ss]he (\w+)", text)
she_words[:5]
['for', 'ought', 'is', 'was', 'was']
# Catch every word after 'he'
he_words = re.findall(r"\b[Hh]e (\w+)", text)
he_words[:5]
['is', 'had', 'camedown', 'agreed', 'isto']

Most common verbs

Then we can use .most_common to get the top verbs for both men and women. The words we caught aren't guaranteed to be verbs, but most of them should be.

# Most common words after 'he'
Counter(he_words).most_common(20)
[('had', 139),
 ('was', 129),
 ('is', 54),
 ('has', 42),
 ('could', 32),
 ('would', 24),
 ('did', 24),
 ('should', 23),
 ('will', 21),
 ('must', 20),
 ('might', 17),
 ('replied', 14),
 ('said', 12),
 ('thought', 11),
 ('does', 10),
 ('may', 10),
 ('looked', 9),
 ('never', 9),
 ('came', 9),
 ('continued', 8)]
# Most common words after 'she'
Counter(she_words).most_common(20)
[('was', 165),
 ('had', 152),
 ('could', 102),
 ('is', 46),
 ('would', 44),
 ('did', 26),
 ('felt', 26),
 ('might', 21),
 ('has', 19),
 ('will', 16),
 ('saw', 16),
 ('added', 15),
 ('should', 14),
 ('said', 13),
 ('i', 11),
 ('must', 10),
 ('found', 10),
 ('cried', 10),
 ('spoke', 10),
 ('began', 9)]

Data! It's a very, very naive example of text analysis, but at least it's a start.
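One more trick: Counters can be subtracted from each other, and the result keeps only the words with a positive count left over. The sketch below uses two tiny made-up lists (not the book data) to show which words follow "she" more often than "he".

```python
from collections import Counter

# Hypothetical mini word lists, just for illustration
he_sample = Counter(['was', 'was', 'had', 'said'])
she_sample = Counter(['was', 'felt', 'felt', 'had', 'had'])

# Subtraction drops anything that ends up at zero or below
more_she = she_sample - he_sample
print(more_she)  # Counter({'felt': 2, 'had': 1})
```

It's a rough comparison because it works on raw counts, so whichever pronoun shows up more often overall has an advantage.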

Comparing top words

Now that we have two datasets created with Counter, we can actually push them into a pandas dataframe and do a comparison.

We'll get the raw counts into the he and she columns, and then do a little bit of calculating to get a percentage column.

import pandas as pd

df = pd.DataFrame({
    'he': Counter(he_words),
    'she': Counter(she_words)    
}).fillna(0)

df['total'] = df.he + df.she
df['pct_she'] = df.she / df.total * 100
df.head()
             he    she  total     pct_she
is         54.0   46.0  100.0   46.000000
had       139.0  152.0  291.0   52.233677
camedown    1.0    0.0    1.0    0.000000
agreed      1.0    0.0    1.0    0.000000
isto        1.0    0.0    1.0    0.000000

Let's look at words used ten or more times in total, sorted by the percentage of those uses that follow "she".

df[df.total >= 10].sort_values(by='pct_she', ascending=False).head(5)
          he    she  total     pct_she
cried    0.0   10.0   10.0  100.000000
felt     3.0   26.0   29.0   89.655172
saw      2.0   16.0   18.0   88.888889
i        3.0   11.0   14.0   78.571429
could   32.0  102.0  134.0   76.119403
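If the pandas steps feel like magic, here is roughly the same calculation in plain Python, using a handful of counts copied from the tables above. It's only a sketch to show what pct_she means: the number of times a word follows "she", divided by the number of times it follows either pronoun, times 100.

```python
from collections import Counter

# A few counts copied from the tables above
he = Counter({'is': 54, 'had': 139, 'felt': 3, 'agreed': 1})
she = Counter({'is': 46, 'had': 152, 'felt': 26})

rows = []
for word in he.keys() | she.keys():
    # A missing word counts as 0, which plays the role of .fillna(0)
    total = he[word] + she[word]
    if total >= 10:  # same filter as df.total >= 10
        rows.append((word, total, she[word] / total * 100))

# Highest share of 'she' first, like sort_values(ascending=False)
rows.sort(key=lambda row: row[2], reverse=True)

for word, total, pct_she in rows:
    print(f"{word:<6} {total:>4} {pct_she:6.2f}")
```

With these numbers 'felt' comes out on top, followed by 'had' and 'is', while 'agreed' is filtered out for appearing only once.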

Again: super naive text analysis, with a totally cherry-picked example to make "cried" and "felt" show up at the top. Feels like we did something cool, though, right? You can find other books at Project Gutenberg if you're interested in doing more.

Review

We used Python's Counter tool to easily count words in a document or two. It also works well with pandas dataframes, allowing us to make simple comparisons.