from collections import Counter
Counter([1, 4, 3, 2, 3, 3, 2, 1, 3, 4, 1, 2])
If you have a list of words, you can use `Counter` to count how many times each word appears.
Counter(['hello', 'goodbye', 'goodbye', 'hello', 'hello', 'party'])
If we want to use it to count words in a normal piece of text, though, we'll have to turn our text into a list of words. We also need to do a little bit of cleanup - removing punctuation, making everything lowercase, just making sure the only things left are words.
import re
text = """Yesterday I went fishing. I don't fish that often,
so I didn't catch any fish. I was told I'd enjoy myself,
but it didn't really seem that fun."""
# Force to all be lowercase because FISH and fish and Fish are the same
text = text.lower()
# Remove anything that isn't a word character or a space
# We could use .replace(".", "") but regex is a lot easier!
text = re.sub(r"[^\w ]", "", text)
print("Cleaned sentence is:", text)
words = text.split(" ")
Counter(words)
If you have a lot of text, you're usually only interested in the most common words. If you just want the top words, `.most_common` is going to be your best friend.
Counter(words).most_common(5)
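As a side note, calling `.most_common()` with no argument returns every entry, sorted from most to least common. The tiny word list below is made up just for illustration:

```python
from collections import Counter

words = ['the', 'fish', 'the', 'cat', 'the', 'dog']

# With no argument, .most_common() returns every entry, highest count first
print(Counter(words).most_common())
# [('the', 3), ('fish', 1), ('cat', 1), ('dog', 1)]
```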
## Counting words in a book
Now that we know the basics of how to clean text and do text analysis with `Counter`, let's try it with an actual book! We'll use Jane Austen's *Pride and Prejudice*.
import requests
response = requests.get('http://www.gutenberg.org/cache/epub/42671/pg42671.txt')
text = response.text
print(text[4100:4500])
The easiest and most boring thing we can do is count the words in it. So, let's count the words in it.
text = text.lower()
text = re.sub(r"[^\w ]", "", text)
words = text.split(" ")
Counter(words).most_common(20)
## Secret tricks with Counter
Counting words is all fine and good, but with a little bit of regular expressions skill we can dig a little deeper!
### Only extracting some words with regular expressions
Do men and women do different things in this book? Let's look at `she ____` and `he ____` to see what we can find out!
`\b` marks a word boundary; otherwise the phrase "she talks" would match both `she (\w+)` and `he (\w+)`.
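To see the difference, here's a quick sketch using a made-up sample sentence. Without `\b`, the `he` pattern also matches the tail end of "she":

```python
import re

sample = "she talks while he listens"

# Without \b, the "he " inside "she " also matches
print(re.findall(r"he (\w+)", sample))
# ['talks', 'listens']

# With \b, only the standalone "he" matches
print(re.findall(r"\bhe (\w+)", sample))
# ['listens']
```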
# Catch every word after 'she'
she_words = re.findall(r"\b[Ss]he (\w+)", text)
she_words[:5]
# Catch every word after 'he'
he_words = re.findall(r"\b[Hh]e (\w+)", text)
he_words[:5]
### Most common verbs
Then we can use `.most_common` to get the top verbs for both men and women. While they aren't necessarily verbs, they mostly should be.
# Most common words after 'he'
Counter(he_words).most_common(20)
# Most common words after 'she'
Counter(she_words).most_common(20)
Data! It's a very, very naive example of text analysis, but at least it's a start.
### Comparing top words
Now that we have two datasets created with `Counter`, we can actually push them into a pandas dataframe and do a comparison.
We'll get the raw counts into the `he` and `she` columns, and then do a little bit of calculating to get a percentage column.
import pandas as pd
df = pd.DataFrame({
'he': Counter(he_words),
'she': Counter(she_words)
}).fillna(0)
df['total'] = df.he + df.she
df['pct_she'] = df.she / df.total * 100
df.head()
Let's look at words used ten or more times, sorted by what percentage of the time they follow "she."
df[df.total >= 10].sort_values(by='pct_she', ascending=False).head(5)
Again: super naive text analysis, with a totally cherry-picked example to make "cried" and "felt" show up at the top. Feels like we did something cool, though, right? You can find other books at Project Gutenberg if you're interested in doing more.
## Review
We used Python's `Counter` tool to easily count words in a document or two. It also works well with pandas dataframes, allowing us to make simple comparisons.
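Pulling it all together, the whole workflow is only a few lines. The sample text below is made up just for illustration:

```python
import re
from collections import Counter

text = "The cat sat on the mat. The dog sat, too."

# Lowercase, strip punctuation, split into words, then count
cleaned = re.sub(r"[^\w ]", "", text.lower())
counts = Counter(cleaned.split(" "))

print(counts.most_common(3))
# [('the', 3), ('sat', 2), ('cat', 1)]
```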