Categorizing text based on keyword matching#

Sometimes instead of using a fancy classifier or topic modeling, you just want to do a simple keyword search to assign categories to your data points. For example: if it has "cat" or "dog" we'll label it pets, and if it has "boat" or "train" we'll label it transportation. In this case we're reproducing this Bloomberg piece.
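Here's a toy sketch of that idea in plain Python (the function name and example are made up for illustration; note that naive substring matching would happily match "catalog" too, which is part of why we'll use a real tokenizer later):

# A hypothetical, bare-bones version of keyword-based labeling
def simple_label(text):
    if 'cat' in text or 'dog' in text:
        return 'pets'
    if 'boat' in text or 'train' in text:
        return 'transportation'
    return None

simple_label('my dog ate my homework')
# 'pets'

The rest of this section is a scaled-up, slightly smarter version of that if statement.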

Our data#

We have billions upon billions of tweets from Democratic presidential candidates; let's see if we can put them into categories. Topic modeling didn't work out so hot, so we're going to do this manually now.

import pandas as pd

# We don't need all of the columns, so let's only load a few
columns = ['username', 'text', 'date']

df = pd.read_csv("data/tweets.csv", usecols=columns)
df.sample(5)
username text date
36633 AndrewYang As a politician I’m a pretty good entrepreneur. 2019-03-27 19:45:19+00:00
37588 AndrewYang Would you believe that we are passing 28,000 d... 2019-02-24 03:39:36+00:00
26789 CoryBooker It's about time. 2019-04-18 03:01:23+00:00
13770 marwilliamson All that a country is is a collection of peopl... 2019-07-31 02:43:47+00:00
15597 marwilliamson The USA has essentially reverted to an aristoc... 2019-02-20 13:06:50+00:00
df.shape
(38559, 3)

Sorry, not billions of tweets - more like 39k.

Categorizing the tweets#

We're going to work off of Austin Wehrwein's take (in R), where he made a short list of words associated with each topic. Our approach is going to be kind of awkward, but it's pretty flexible for things you might want to do in the future.

# We're only using single words (no "green new deal") because the
# stemmer won't work with multiple words

categories = {
    'immigration': ['immigration', 'border', 'wall'],
    'education': ['students', 'education', 'teacher'],
    'foreign_policy': ['foreign policy', 'peace'],
    'climate_change': ['climate', 'emissions', 'carbon'],
    'economy': ['economy', 'tariffs', 'taxes'],
    'military': ['veterans', 'troops', 'war'],
    'jobs': ['jobs', 'unemployment', 'wages'],
    'drugs': ['drugs', 'opioid'],
    'health': ['health', 'insurance', 'medicare'],
    'repro_rights': ['reproductive', 'abortion'],
    'gun_control': ['gun'],
}

We'll turn these into a nice long dataframe of words and category names. We'll also stem the keywords so they'll match a bit more broadly. For example, immigrant, immigrants, and immigration will all end up as immigr.

"green new deal" ends up as "green new d" when stemmed, so we'll stick with single words.

import Stemmer

stemmer = Stemmer.Stemmer('en')

# Build one small dataframe of stemmed terms for each category,
# then stack them all into one long dataframe
dfs = []
for key, values in categories.items():
    words = pd.DataFrame({'category': key, 'term': stemmer.stemWords(values)})
    dfs.append(words)

terms_df = pd.concat(dfs)

terms_df
category term
0 immigration immigr
1 immigration border
2 immigration wall
0 education student
1 education educ
2 education teacher
0 foreign_policy foreign polici
1 foreign_policy peac
0 climate_change climat
1 climate_change emiss
2 climate_change carbon
0 economy economi
1 economy tariff
2 economy tax
0 military veteran
1 military troop
2 military war
0 jobs job
1 jobs unemploy
2 jobs wage
0 drugs drug
1 drugs opioid
0 health health
1 health insur
2 health medicar
0 repro_rights reproduct
1 repro_rights abort
0 gun_control gun

Now we're going to build a somewhat unusual vectorizer: it will only count the words in our term list, and it will only record a yes/no for each of them instead of a full count.

from sklearn.feature_extraction.text import CountVectorizer
import Stemmer

# Using PyStemmer because it's way faster than NLTK
stemmer = Stemmer.Stemmer('en')

# A CountVectorizer that stems every token before it gets counted
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

# Take the 'term' column from our list of terms
term_list = list(terms_df.term)

# binary=True only records 0/1 instead of full counts
# vocabulary= is the list of words we're interested in tracking
vectorizer = StemmedCountVectorizer(binary=True, vocabulary=term_list)
matrix = vectorizer.fit_transform(df.text)
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
# (on scikit-learn older than 1.0, use .get_feature_names() instead)
words_df.head()
words_df.head()
immigr border wall student educ teacher foreign polici peac climat emiss ... unemploy wage drug opioid health insur medicar reproduct abort gun
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 28 columns

Out of the first five tweets, it looks like only the fourth one fits into a category: it includes the word teacher, so we'll put it into the education group.
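If you want to peek at what our custom vectorizer is doing under the hood, you can pull out its analyzer and feed it a sentence (a quick sanity check, with the output shown as a comment):

analyzer = vectorizer.build_analyzer()

analyzer('Teachers care about education')
# ['teacher', 'care', 'about', 'educ']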

We're going to loop through each category, and then see if any of the terms for that category have a 1 in them. If so, we'll assign that row a 1 for the category. If not, we'll give it a 0.

For example, row 3 does not have student or educ, but it does have teacher. As a result, it will get a 1 for the education category.

# Group the terms by category, then loop through each category
for category_name, rows in terms_df.groupby('category'):
    # Convert the terms for that category into a simple list
    # for example, ['student', 'educ', 'teacher']
    terms = list(rows['term'])
    print(f"Looking at {category_name} with terms {terms}")

    # words_df[terms] gets the columns for 'student', 'educ', and 'teacher'
    # .any(axis=1) sees if any of them are a 1, gives True/False
    # .astype(int) converts True/False to 1/0
    # df[category_name] = will assign that value to df['education']
    df[category_name] = words_df[terms].any(axis=1).astype(int)
Looking at climate_change with terms ['climat', 'emiss', 'carbon']
Looking at drugs with terms ['drug', 'opioid']
Looking at economy with terms ['economi', 'tariff', 'tax']
Looking at education with terms ['student', 'educ', 'teacher']
Looking at foreign_policy with terms ['foreign polici', 'peac']
Looking at gun_control with terms ['gun']
Looking at health with terms ['health', 'insur', 'medicar']
Looking at immigration with terms ['immigr', 'border', 'wall']
Looking at jobs with terms ['job', 'unemploy', 'wage']
Looking at military with terms ['veteran', 'troop', 'war']
Looking at repro_rights with terms ['reproduct', 'abort']

Let's see how it looks.

df.sample(4)
username text date climate_change drugs economy education foreign_policy gun_control health immigration jobs military repro_rights
4676 BetoORourke Grateful for the opportunity to bring everyone... 2019-04-19 14:57:00+00:00 0 0 0 0 0 0 0 0 0 0 0
7526 JoeBiden Trump continues to undermine our standing in t... 2019-06-19 01:34:30+00:00 0 0 0 0 0 0 0 0 0 0 0
5729 amyklobuchar AG Barr told me to ask Director Mueller for Pr... 2019-05-09 12:05:49+00:00 0 0 1 0 0 0 0 0 0 0 0
18305 JayInslee Donald Trump is for environmentalism like he i... 2019-07-09 23:46:04+00:00 0 0 0 0 0 0 0 0 0 0 0

Many of these don't have categories, but we did a really really really bad job coming up with a list of terms. In the "real world" you'd probably have more than three keywords per category!

Let's take a second to save the labeled tweets, as we'll need them in the future.

df.to_csv("data/tweets-categorized.csv", index=False)

Exploring categorized tweets#

Now that we have a set of tweets that are labeled with different categories, we can start to count and compare them. For example, we can see who tweets the most about jobs.

df.groupby('username').jobs.sum().sort_values(ascending=False)
username
BernieSanders      218
AndrewYang         132
ewarren            122
JohnDelaney        118
JayInslee          113
KamalaHarris       113
TimRyan             82
GovernorBullock     75
BetoORourke         63
Hickenlooper        60
marwilliamson       59
SenGillibrand       52
CoryBooker          51
WayneMessam         41
JoeBiden            41
TomSteyer           40
amyklobuchar        39
sethmoulton         38
BilldeBlasio        31
JulianCastro        30
PeteButtigieg       29
TulsiGabbard        27
MichaelBennet       26
ericswalwell        19
JoeSestak           15
Name: jobs, dtype: int64

In fact, since these 0's and 1's are our only numeric columns, we can just ask the dataframe to group by username and add up every category.

overall = df.groupby('username').sum()
overall
climate_change drugs economy education foreign_policy gun_control health immigration jobs military repro_rights
username
AndrewYang 41 20 142 65 2 21 51 27 132 38 5
BernieSanders 87 72 111 146 16 38 321 123 218 73 36
BetoORourke 86 23 57 146 9 83 126 165 63 94 24
BilldeBlasio 10 1 18 13 1 11 29 14 31 8 4
CoryBooker 18 37 26 39 4 108 62 49 51 33 33
GovernorBullock 62 19 59 63 2 30 34 32 75 32 3
Hickenlooper 46 9 70 20 7 72 43 24 60 23 15
JayInslee 747 4 121 53 3 41 64 52 113 28 53
JoeBiden 40 7 48 48 16 37 72 37 41 20 0
JoeSestak 32 6 29 16 9 1 21 22 15 30 7
JohnDelaney 130 43 152 64 3 14 152 58 118 37 6
JulianCastro 25 2 20 55 4 28 37 139 30 14 24
KamalaHarris 65 31 88 200 8 147 208 94 113 23 54
MichaelBennet 54 11 58 69 0 19 124 29 26 15 3
PeteButtigieg 24 2 24 27 6 21 35 11 29 29 13
SenGillibrand 54 19 54 51 8 62 127 57 52 23 111
TimRyan 42 18 92 103 7 32 126 18 82 30 13
TomSteyer 117 2 67 26 3 27 17 55 40 18 6
TulsiGabbard 15 14 27 18 39 1 22 16 27 160 1
WayneMessam 16 1 40 80 6 27 13 35 41 4 2
amyklobuchar 45 56 47 41 3 59 82 31 39 18 8
ericswalwell 12 6 26 58 7 182 37 36 19 35 12
ewarren 79 44 151 195 1 65 95 131 122 29 43
marwilliamson 23 13 63 53 87 18 46 37 59 79 5
sethmoulton 43 16 40 24 9 34 60 49 38 110 5

The problem with this view is that some candidates tweet a lot, and some candidates tweet much less. If we graph it, it isn't going to give a good view of what topics the candidates' campaigns value.

ax = overall.plot(kind='bar', stacked=True, figsize=(13,6), width=0.9)

# Move the legend off of the chart
ax.legend(loc=(1.04,0))
<matplotlib.legend.Legend at 0x115a5b710>

What we need is for this to be based on percentages. To do that, we'll divide each row by that row's total. Because of the way pandas aligns data, we'll need to use .div with axis=0 instead of the / operator: plain division lines the row totals up against the column names instead of the row labels. The division sign might not give you an error, but it will definitely give you incorrect results!
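To see the difference on something tiny, here's a made-up two-row frame (demo and row_totals are hypothetical names, just for illustration):

import pandas as pd

demo = pd.DataFrame({'a': [1, 3], 'b': [1, 1]}, index=['x', 'y'])
row_totals = demo.sum(axis=1)  # x: 2, y: 4

# Plain division aligns row_totals against the COLUMN names ('a' and 'b'),
# nothing matches, and every cell silently becomes NaN
demo / row_totals

# .div(..., axis=0) aligns against the row labels instead, so each row
# is divided by its own total, which is what we actually want
demo.div(row_totals, axis=0)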

overall_pct = overall.div(overall.sum(axis=1), axis=0)
overall_pct
climate_change drugs economy education foreign_policy gun_control health immigration jobs military repro_rights
username
AndrewYang 0.075368 0.036765 0.261029 0.119485 0.003676 0.038603 0.093750 0.049632 0.242647 0.069853 0.009191
BernieSanders 0.070105 0.058018 0.089444 0.117647 0.012893 0.030620 0.258662 0.099114 0.175665 0.058824 0.029009
BetoORourke 0.098174 0.026256 0.065068 0.166667 0.010274 0.094749 0.143836 0.188356 0.071918 0.107306 0.027397
BilldeBlasio 0.071429 0.007143 0.128571 0.092857 0.007143 0.078571 0.207143 0.100000 0.221429 0.057143 0.028571
CoryBooker 0.039130 0.080435 0.056522 0.084783 0.008696 0.234783 0.134783 0.106522 0.110870 0.071739 0.071739
GovernorBullock 0.150852 0.046229 0.143552 0.153285 0.004866 0.072993 0.082725 0.077859 0.182482 0.077859 0.007299
Hickenlooper 0.118252 0.023136 0.179949 0.051414 0.017995 0.185090 0.110540 0.061697 0.154242 0.059126 0.038560
JayInslee 0.584050 0.003127 0.094605 0.041439 0.002346 0.032056 0.050039 0.040657 0.088350 0.021892 0.041439
JoeBiden 0.109290 0.019126 0.131148 0.131148 0.043716 0.101093 0.196721 0.101093 0.112022 0.054645 0.000000
JoeSestak 0.170213 0.031915 0.154255 0.085106 0.047872 0.005319 0.111702 0.117021 0.079787 0.159574 0.037234
JohnDelaney 0.167310 0.055341 0.195624 0.082368 0.003861 0.018018 0.195624 0.074646 0.151866 0.047619 0.007722
JulianCastro 0.066138 0.005291 0.052910 0.145503 0.010582 0.074074 0.097884 0.367725 0.079365 0.037037 0.063492
KamalaHarris 0.063046 0.030068 0.085354 0.193986 0.007759 0.142580 0.201746 0.091174 0.109602 0.022308 0.052376
MichaelBennet 0.132353 0.026961 0.142157 0.169118 0.000000 0.046569 0.303922 0.071078 0.063725 0.036765 0.007353
PeteButtigieg 0.108597 0.009050 0.108597 0.122172 0.027149 0.095023 0.158371 0.049774 0.131222 0.131222 0.058824
SenGillibrand 0.087379 0.030744 0.087379 0.082524 0.012945 0.100324 0.205502 0.092233 0.084142 0.037217 0.179612
TimRyan 0.074600 0.031972 0.163410 0.182948 0.012433 0.056838 0.223801 0.031972 0.145648 0.053286 0.023091
TomSteyer 0.309524 0.005291 0.177249 0.068783 0.007937 0.071429 0.044974 0.145503 0.105820 0.047619 0.015873
TulsiGabbard 0.044118 0.041176 0.079412 0.052941 0.114706 0.002941 0.064706 0.047059 0.079412 0.470588 0.002941
WayneMessam 0.060377 0.003774 0.150943 0.301887 0.022642 0.101887 0.049057 0.132075 0.154717 0.015094 0.007547
amyklobuchar 0.104895 0.130536 0.109557 0.095571 0.006993 0.137529 0.191142 0.072261 0.090909 0.041958 0.018648
ericswalwell 0.027907 0.013953 0.060465 0.134884 0.016279 0.423256 0.086047 0.083721 0.044186 0.081395 0.027907
ewarren 0.082723 0.046073 0.158115 0.204188 0.001047 0.068063 0.099476 0.137173 0.127749 0.030366 0.045026
marwilliamson 0.047619 0.026915 0.130435 0.109731 0.180124 0.037267 0.095238 0.076605 0.122153 0.163561 0.010352
sethmoulton 0.100467 0.037383 0.093458 0.056075 0.021028 0.079439 0.140187 0.114486 0.088785 0.257009 0.011682

Now we can plot it successfully!

ax = overall_pct.plot(kind='bar', stacked=True, figsize=(13,6), width=0.9)

# Move the legend off of the chart
ax.legend(loc=(1.04,0))
<matplotlib.legend.Legend at 0x11e9c06d8>

Matplotlib is pretty horrifying to look at, though, so we might want to upgrade to using plotly instead of the normal .plot. Unfortunately that will involve reshaping our data!

reshaped = overall_pct.reset_index().melt(id_vars=['username'], var_name='topic', value_name='pct')
reshaped.head()
username topic pct
0 AndrewYang climate_change 0.075368
1 BernieSanders climate_change 0.070105
2 BetoORourke climate_change 0.098174
3 BilldeBlasio climate_change 0.071429
4 CoryBooker climate_change 0.039130

Once it's reshaped we're free to plot.

import plotly.express as px

fig = px.bar(reshaped, x='username', y='pct', color='topic')
fig.show()

Review#

In this section we learned how to categorize text based on lists of keywords.

Discussion topics#

How does this approach compare to something like classification? Is there a difference?

With this approach we're double-counting tweets that fall into two categories (for example, a tweet could count for climate change as well as foreign policy). Should this give us a panic attack regarding our 100% stacked bar graph? Why or why not?

What does a "normal" stacked bar chart stress in the data, as compared to the 100% stacked bar?

What is an alternative method we could use instead of counting a tweet once for climate change and once more for foreign policy? We don't need to implement it, just figure out an alternative.