Categorizing text based on keyword matching#
Sometimes instead of using a fancy classifier or topic modeling, you just want to do a simple keyword search to assign categories to your data points. For example: if it has "cat" or "dog" we'll label it pets, and if it has "boat" or "train" we'll label it transportation. In this case we're reproducing this Bloomberg piece.
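The basic idea can be sketched in a few lines. This is a toy version with made-up sentences and the pets/transportation keywords from above, not the approach we'll actually use below (naive substring matching has problems — "cat" would match "category" — which is part of why we'll reach for stemming and a vectorizer later):

```python
# Hypothetical keyword lists, one per category
keywords = {
    'pets': ['cat', 'dog'],
    'transportation': ['boat', 'train'],
}

def assign_category(text):
    # Return the first category whose keywords appear in the text
    for category, words in keywords.items():
        if any(word in text.lower() for word in words):
            return category
    return 'uncategorized'

assign_category("I took my dog on the train")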
Our data#
We have billions upon billions of tweets from Democratic presidential candidates, let's see if we can put them into categories. Topic modeling didn't work out so hot, so we're going to do this manually now.
import pandas as pd
# We don't need all of the columns, let's leave out a lot of them
columns = ['username', 'text', 'date']
df = pd.read_csv("data/tweets.csv", usecols=columns)
df.sample(5)
df.shape
Sorry, not billions of tweets - more like 39k.
Categorizing the tweets#
We're going to work off of Austin Wehrwein's take (in R), where he made a short list of words associated with each topic. Our approach is going to be kind of awkward, but it's pretty flexible for things you might want to do in the future.
# We're only using single words (no "green new deal") because the
# stemmer won't work with multiple words
categories = {
'immigration': ['immigration', 'border', 'wall'],
'education': ['students', 'education', 'teacher'],
'foreign_policy': ['foreign', 'peace'],
'climate_change': ['climate', 'emissions', 'carbon'],
'economy': ['economy', 'tariffs', 'taxes'],
'military': ['veterans', 'troops', 'war'],
'jobs': ['jobs', 'unemployment', 'wages'],
'drugs': ['drugs', 'opioid'],
'health': ['health', 'insurance', 'medicare'],
'repro_rights': ['reproductive', 'abortion'],
'gun_control': ['gun'],
}
categories
We'll turn these into a nice long dataframe of words and category names. We'll also stem the keywords so they'll match a bit more broadly. For example, immigrant, immigrants, and immigration will all end up as immigr.

"green new deal" ends up as "green new d" when stemmed, so we'll stick with single words.
import Stemmer
stemmer = Stemmer.Stemmer('en')
dfs = []
for key, values in categories.items():
    words = pd.DataFrame({'category': key, 'term': stemmer.stemWords(values)})
    dfs.append(words)
terms_df = pd.concat(dfs)
terms_df
Now we're going to build a somewhat unusual kind of vectorizer: we're only going to count words from this list, and we're only going to record a yes/no for each of them.
from sklearn.feature_extraction.text import CountVectorizer
import Stemmer
# Using pyStemmer because it's way faster than NLTK
stemmer = Stemmer.Stemmer('en')
# Based on CountVectorizer
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))
# Take the 'term' column from our list of terms
term_list = list(terms_df.term)
# binary=True only does 0/1
# vocabulary= is the list of words we're interested in tracking
vectorizer = StemmedCountVectorizer(binary=True, vocabulary=term_list)
matrix = vectorizer.fit_transform(df.text)
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
words_df.head()
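If the binary=True and vocabulary= options are new to you, here's what they do on their own, using a plain CountVectorizer (no stemming) and a couple of made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical tweets
docs = [
    "Our teacher deserves a raise",
    "The border wall debate continues",
]

# vocabulary= means "only track these words, ignore everything else"
# binary=True means "record 1 if the word appears, 0 if not"
vec = CountVectorizer(binary=True, vocabulary=['teacher', 'border', 'wall'])
matrix = vec.fit_transform(docs)
matrix.toarray()
# First row matches only 'teacher', second matches 'border' and 'wall'
```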
Out of the first five tweets, it looks like only the fourth one fits into a category - it includes the word teacher, so we'll put it into the education group.
We're going to loop through each category, and then see if any of the terms for that category have a 1 in them. If so, we'll assign that row a 1 for the category. If not, we'll give it a 0.
For example, row 3 does not have student or educ, but it does have teacher. As a result, it will get a 1 for the education category.
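The "any of these columns is a 1" trick is worth seeing on its own before we use it in the loop. A tiny sketch with a two-row stand-in for our words matrix:

```python
import pandas as pd

# Toy version of words_df: one row per tweet, one 0/1 column per term
toy = pd.DataFrame({
    'student': [0, 0],
    'educ':    [0, 0],
    'teacher': [0, 1],
})

# .any(axis=1) checks across each row: True if any column is 1
# .astype(int) turns True/False into 1/0
flags = toy[['student', 'educ', 'teacher']].any(axis=1).astype(int)
flags
# First row has no matches (0), second row has 'teacher' (1)
```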
# Group the terms by category, then loop through each category
for category_name, rows in terms_df.groupby('category'):
    # Convert the terms for that category into a simple list
    # for example, ['student', 'educ', 'teacher']
    terms = list(rows['term'])
    print(f"Looking at {category_name} with terms {terms}")

    # words_df[terms] gets the columns for 'student', 'educ', and 'teacher'
    # .any(axis=1) sees if any of them are a 1, gives True/False
    # .astype(int) converts True/False to 1/0
    # df[category_name] = will assign that value to df['education']
    df[category_name] = words_df[terms].any(axis=1).astype(int)
Let's see how it looks.
df.sample(4)
Many of these don't have categories, but we did a really really really bad job coming up with a list of terms. In the "real world" you'd probably have more than three keywords per category!
Let's take a second to save the labeled tweets, as we'll need them in the future.
df.to_csv("data/tweets-categorized.csv", index=False)
Exploring categorized tweets#
Now that we have a set of tweets that are labeled with different categories, we can start to count and compare them. For example, we can see who tweets the most about jobs.
df.groupby('username').jobs.sum().sort_values(ascending=False)
In fact, since these 0's and 1's are our only numeric columns, we can just ask the dataframe to group by username and add up every category.
overall = df.groupby('username').sum()
overall
The problem with this view is that some candidates tweet a lot, and some candidates tweet much less. If we graph it, it isn't going to give a good view of what topics the candidates' campaigns value.
ax = overall.plot(kind='bar', stacked=True, figsize=(13,6), width=0.9)
# Move the legend off of the chart
ax.legend(loc=(1.04,0))
What we need is for this to be based on percentages. In order to do that, we'll need to divide each row by the sum of the counts in that row. Because of Weird Pandas Magic we'll need to use .div instead of /. The division sign might not give you an error, but it will definitely give you incorrect results!
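A small sketch of the .div technique with two made-up candidates, so you can see each row being divided by its own total:

```python
import pandas as pd

# Hypothetical topic counts: one row per candidate
counts = pd.DataFrame(
    {'jobs': [2, 1], 'health': [2, 3]},
    index=['candidate_a', 'candidate_b'],
)

# counts.sum(axis=1) is each row's total; axis=0 tells .div to
# line that total up against the rows, not the columns
pct = counts.div(counts.sum(axis=1), axis=0)
pct
# Every row now sums to 1.0
```

The plain `/` operator aligns the row totals against the *column* labels instead, which is where the silently-wrong results come from.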
overall_pct = overall.div(overall.sum(axis=1), axis=0)
overall_pct
Now we can plot it successfully!
ax = overall_pct.plot(kind='bar', stacked=True, figsize=(13,6), width=0.9)
# Move the legend off of the chart
ax.legend(loc=(1.04,0))
Matplotlib is pretty horrifying to look at, though, so we might want to upgrade to plotly instead of the normal .plot. Unfortunately that will involve reshaping our data!
reshaped = overall_pct.reset_index().melt(id_vars=['username'], var_name='topic', value_name='pct')
reshaped.head()
Once it's reshaped we're free to plot.
import plotly.express as px
fig = px.bar(reshaped, x='username', y='pct', color='topic')
fig.show()
Review#
In this section we learned how to categorize text based on lists of words.
Discussion topics#
How does this approach compare to something like classification? Is there a difference?
With this approach we're double-counting tweets that fall into two categories (for example, a tweet could count for climate change as well as foreign policy). Should this give us a panic attack regarding our 100% stacked bar graph? Why or why not?
What is stressed in the data differently between the "normal" stacked bar as compared to the 100% stacked bar?
What is an alternative method we could use instead of counting the tweet as being once for climate change and once for foreign policy? We don't need to implement it, just figure an alternative out.