Using topic modeling to find topics discussed in Democratic presidential candidate tweets#

Can topic modeling help us understand what topics Democratic presidential candidates are talking about? Let's find out!

Since we're analyzing text, we'll need to increase the amount of text that's displayed in each pandas column. We're also increasing the number of columns displayed, which helps when we look at wider tables of topic modeling output.

import pandas as pd

pd.set_option("display.max_columns", 60)
pd.set_option("display.max_colwidth", 300)

Our data#

We have around 40,000 tweets from Democratic presidential candidates, starting in January 2019. We scraped them using GetOldTweets3 in the previous section.

# We don't need all of the columns, let's leave out a lot of them
columns = ['username', 'text', 'date']

df = pd.read_csv("data/tweets.csv", usecols=columns)
df.sample(5)
username text date
24277 JohnDelaney On Tuesday's debate I compared the Warren/Sanders agenda to the too far left shift that McGovern, Mondale and Dukakis made. McGovern lost 49 states, Mondale lost 49 states and Dukakis lost 40 states. We have to run on big economic ideas that also appeal to centrist voters. 2019-08-02 17:50:57+00:00
25018 JohnDelaney The President clearly cares more about his Twitter followers than the American people. His continued dishonesty and weaponization of social media has been divisive. I am calling on all Americans to #UnfollowTrump and hit him where it actually hurts him... his ego. 2019-04-24 15:58:08+00:00
36444 AndrewYang An AI system defeated elite Chinese doctors in a two-round brain tumor diagnosis competition on both speed and accuracy. This could do incredible good but is another example of areas in which new technology is capable of beating humans. We have to evolve quickly. 2019-04-10 15:36:02+00:00
19139 JayInslee We must cut off the gravy train of federal subsidies for oil and gas companies. They’re literally killing us. 2019-05-07 20:00:18+00:00
8764 sethmoulton The Second Amendment was written in 1791 when people were firing single rounds out of a musket and dueling with pistols. 2019-08-07 16:42:48+00:00

And how many do we have?

df.shape
(38559, 3)

And how many from each candidate?

df.username.value_counts()
AndrewYang         4425
marwilliamson      2571
ewarren            2570
JayInslee          2120
KamalaHarris       2110
JohnDelaney        1913
BernieSanders      1881
GovernorBullock    1721
ericswalwell       1705
BetoORourke        1667
SenGillibrand      1538
TimRyan            1481
amyklobuchar       1405
CoryBooker         1315
TomSteyer          1279
sethmoulton        1239
JulianCastro       1220
Hickenlooper        959
MichaelBennet       904
TulsiGabbard        893
PeteButtigieg       856
JoeBiden            856
WayneMessam         815
JoeSestak           619
BilldeBlasio        497
Name: username, dtype: int64

Using topic modeling#

We'll be trying to use topic modeling to generate a list of topics each tweet is about, as well as words associated with each topic. Why do we think that will work? Because the methodology text told us so!

The initial keywords were generated by topic modeling the entire corpus of tweets, then supplemented manually with additional keywords.

First we'll need to vectorize our text into numbers that scikit-learn can understand, and then we'll use topic modeling to find the topics inside.

Vectorize the text#

When you're doing topic modeling, the kind of vectorizing you use depends on the kind of topic model you're going to build. Using an LDA topic model requires a CountVectorizer, while the other kinds of topic model we'll try here tend to work better with a TfidfVectorizer. LDA is built around raw word counts and does its own accounting for low-frequency and high-frequency words, so it doesn't need TF-IDF weighting applied first.

I'm lazy and LDA takes a long time to run, so we're not going to use LDA, which means we'll need a TfidfVectorizer. Since I want words like "tomato" and "tomatoes" combined, I'm also going to use a stemmer. More or less we're just stealing from the reference page.

from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

# Using pyStemmer because it's way faster than NLTK
stemmer = Stemmer.Stemmer('en')

# Based on TfidfVectorizer
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Tokenize and drop stop words as usual, then stem whatever is left
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

We're only going to keep words that show up in at least one hundred tweets. If a word isn't used in a hundred of our 40k tweets, it's probably not that important.

%%time
vectorizer = StemmedTfidfVectorizer(stop_words='english',
                                    min_df=100)

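# Strip everything except word characters and spaces before vectorizing.
# regex=True is spelled out because newer pandas treats the pattern as a literal string by default.
matrix = vectorizer.fit_transform(df.text.str.replace(r"[^\w ]", "", regex=True))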
CPU times: user 2.77 s, sys: 85 ms, total: 2.85 s
Wall time: 3.21 s
matrix.shape
(38559, 975)

Down to just under a thousand. Now that we're vectorized we can head on to topic modeling.

Using LSI/SVD for topic modeling#

Whenever we're building a topic model, we have the important question of how many topics? The Bloomberg piece uses fourteen categories, so let's pick seventeen to add a little bit of buffer room.

%%time
from sklearn.decomposition import TruncatedSVD

# Tell the model to find the topics
model = TruncatedSVD(n_components=17)
model.fit(matrix)

# Print the top 10 words per category
n_words = 10
# get_feature_names() was removed in newer scikit-learn releases
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(model.components_):
    message = "Topic #%d: " % topic_idx
    message += ", ".join([feature_names[i]
                         for i in topic.argsort()[:-n_words - 1:-1]])
    print(message)
print()
Topic #0: thank, support, great, work, peopl, im, make, need, fight, time
Topic #1: peopl, need, american, presid, make, work, countri, im, right, trump
Topic #2: care, health, right, need, american, women, trump, afford, protect, access
Topic #3: climat, chang, trump, presid, need, donald, defeat, crisi, nation, threat
Topic #4: climat, chang, health, care, need, plan, new, debat, great, crisi
Topic #5: right, im, fight, presid, women, climat, trump, run, chang, vote
Topic #6: need, im, debat, make, care, health, help, stage, campaign, just
Topic #7: gun, violenc, im, need, fight, end, live, love, peopl, work
Topic #8: gun, need, violenc, health, care, join, presid, trump, look, end
Topic #9: need, right, time, debat, make, let, help, great, vote, donor
Topic #10: peopl, look, right, join, forward, campaign, like, tune, polit, just
Topic #11: love, trump, make, support, famili, day, donald, let, debat, work
Topic #12: love, presid, like, peopl, need, im, run, happi, look, new
Topic #13: need, work, look, right, new, forward, countri, join, good, worker
Topic #14: time, like, look, im, new, famili, forward, plan, support, pay
Topic #15: look, forward, make, gun, let, great, violenc, plan, fight, state
Topic #16: new, support, hampshir, let, day, campaign, presid, peopl, team, today

CPU times: user 739 ms, sys: 121 ms, total: 859 ms
Wall time: 526 ms

So we've got topics about thank yous/appreciation, general praise of America, climate change, gun violence, something that might be healthcare, Trump... They seem reasonable, right?

That was so fast, we might as well try it with another topic modeling algorithm, too.

Topic modeling with NMF#

What's the difference between this version of topic modeling and the previous one? For right now: who cares! Let's just try it out.

%%time
from sklearn.decomposition import NMF

# Tell the model to find the topics
model = NMF(n_components=17)
model.fit(matrix)

# Print the top 10 words per category
n_words = 10
# get_feature_names() was removed in newer scikit-learn releases
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(model.components_):
    message = "Topic #%d: " % topic_idx
    message += ", ".join([feature_names[i]
                         for i in topic.argsort()[:-n_words - 1:-1]])
    print(message)
print()
Topic #0: thank, have, leadership, appreci, come, soon, host, share, amaz, convers
Topic #1: work, famili, worker, pay, american, year, countri, job, economi, america
Topic #2: im, join, live, campaign, tune, run, talk, fight, tonight, iowa
Topic #3: trump, presid, donald, administr, immigr, mr, run, elect, state, unit
Topic #4: climat, chang, crisi, defeat, plan, threat, ourclimatemo, action, issu, big
Topic #5: right, fight, women, vote, protect, stand, human, reproduct, equal, abort
Topic #6: gun, violenc, end, epidem, communiti, live, action, safeti, nra, check
Topic #7: great, iowa, meet, day, talk, morn, today, enjoy, state, convers
Topic #8: care, health, afford, plan, medicar, access, mental, insur, univers, million
Topic #9: make, let, debat, sure, help, just, stage, happen, donat, donor
Topic #10: peopl, american, power, polit, govern, campaign, money, want, young, dont
Topic #11: need, dont, help, countri, real, donor, that, america, talk, secur
Topic #12: love, happi, day, hate, one, today, life, celebr, world, birthday
Topic #13: time, spend, long, year, past, come, congress, impeach, start, act
Topic #14: look, forward, like, soon, see, join, come, good, way, hope
Topic #15: new, hampshir, plan, york, citi, green, town, event, soon, state
Topic #16: support, appreci, team, donat, proud, grate, help, yes, campaign, debat

CPU times: user 4.8 s, sys: 304 ms, total: 5.1 s
Wall time: 5.77 s

These actually look a bit firmer - appreciation, working and families, climate change, reproductive rights/women's issues, Iowa, gun violence, healthcare, donors and political power, and maybe a little bit of the Green New Deal.

What we do with topic models#

Now that we have our topic models, the big question is: what do we do with them? Usually you use topic models to automatically assign categories to things - "this is about healthcare," "this is about gun violence," etc - but things are a little different here.
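For reference, the "usual" approach of assigning each document to a topic could look something like the sketch below. It uses a hypothetical four-sentence corpus and a two-topic NMF model so it runs on its own; the names `toy_vectorizer`, `toy_matrix` and `toy_model` are made up for this example.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for illustration
docs = [
    "health care access",
    "affordable health care insurance",
    "gun violence epidemic",
    "gun violence background checks save lives",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(docs)

# Two topics is plenty for four sentences
toy_model = NMF(n_components=2, init="nndsvd", random_state=0)

# One row per document, one column per topic
doc_topics = toy_model.fit_transform(toy_matrix)

# Assign each document to the topic where it scores highest
assigned = np.argmax(doc_topics, axis=1)
for doc, topic in zip(docs, assigned):
    print(topic, doc)
```

On the real data, the same idea would presumably be `model.transform(matrix)` followed by the same `argmax`, although taking only the single highest-scoring topic is a simplification. A tweet can easily be about more than one thing.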

Let's review what the methodology note said:

The text of the tweets were classified programmatically using a body of keywords that corresponded to a larger bucket of topics categorized by Bloomberg News....The initial keywords were generated by topic modeling the entire corpus of tweets, then supplemented manually with additional keywords.

So they used keywords to assign a category (or categories) to each tweet. Sounds like something we might be able to do, until we get to the example:

For example, a May 12 tweet from Beto O'Rourke reading, "We will repeal the discriminatory and hateful transgender troop ban and replace it with the Equality Act to ensure full civil rights for LGBTQ Americans," was classified under "social issues" and "military."

Even though the military and social issues are major topics that candidates will tweet about, none of the categories our topic models uncovered were about the military or social issues. So what do we do? Looks like we'll just need to invent our own keywords!

And that's exactly what they did, too.

The text of the tweets were classified programmatically using a body of keywords that corresponded to a larger bucket of topics categorized by Bloomberg News.

Did we just... do that for no reason?
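To get a feel for the keyword approach the methodology describes, here's a rough sketch. The categories and keyword lists are invented for illustration and are not Bloomberg's actual lists, and the `classify` helper is a hypothetical name.

```python
import pandas as pd

# Hypothetical categories and keywords, invented for illustration
keywords = {
    "social issues": ["lgbtq", "civil rights", "equality act", "transgender"],
    "military": ["troop", "military", "veteran"],
    "health care": ["health care", "medicare", "insurance"],
}

def classify(text):
    """Return every category with at least one keyword found in the text."""
    lowered = text.lower()
    return [category
            for category, words in keywords.items()
            if any(word in lowered for word in words)]

tweets = pd.Series([
    "We will repeal the discriminatory and hateful transgender troop ban "
    "and replace it with the Equality Act to ensure full civil rights "
    "for LGBTQ Americans",
    "Looking forward to seeing everyone in Iowa tonight!",
])

# Each tweet gets a list of zero or more categories
categories = tweets.apply(classify)
print(categories.tolist())
```

With these made-up keywords the O'Rourke tweet lands in both "social issues" and "military", matching the example from the methodology note, while the second tweet gets no category at all. Plain substring matching is crude (a keyword like "troop" also matches "troops" and "trooper"), so a real version would need more care.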

Review#

In this section we applied topic modeling to a large number of tweets, comparing two different algorithms to see which one could best categorize our dataset. It turns out neither gave us the categories we needed, and we're just going to use keywords instead.

Discussion topics#

We don't know how the different topic modeling techniques differ from one another. What might be the downsides of that? What are the downsides of learning them?

If we didn't want to learn the intricacies of topic modeling, but still wanted to do this project using topic modeling, how could we find someone to give us advice?

We only selected words that showed up in at least one hundred tweets. Why?