Using topic modeling to find topics discussed in Democratic presidential candidate tweets#
Can topic modeling help us understand what topics Democratic presidential candidates are talking about? Let's find out!
Since we're analyzing text, we'll need to increase the text that's displayed in each pandas column. We're also increasing the number of columns displayed to help with some of the topic modeling stuff.
import pandas as pd

pd.set_option("display.max_columns", 60)
pd.set_option("display.max_colwidth", 300)
# We don't need all of the columns, let's leave out a lot of them
columns = ['username', 'text', 'date']
df = pd.read_csv("data/tweets.csv", usecols=columns)
df.sample(5)
| |username|text|date|
|---|---|---|---|
|24277|JohnDelaney|On Tuesday's debate I compared the Warren/Sanders agenda to the too far left shift that McGovern, Mondale and Dukakis made. McGovern lost 49 states, Mondale lost 49 states and Dukakis lost 40 states. We have to run on big economic ideas that also appeal to centrist voters.|2019-08-02 17:50:57+00:00|
|25018|JohnDelaney|The President clearly cares more about his Twitter followers than the American people. His continued dishonesty and weaponization of social media has been divisive. I am calling on all Americans to #UnfollowTrump and hit him where it actually hurts him... his ego.|2019-04-24 15:58:08+00:00|
|36444|AndrewYang|An AI system defeated elite Chinese doctors in a two-round brain tumor diagnosis competition on both speed and accuracy. This could do incredible good but is another example of areas in which new technology is capable of beating humans. We have to evolve quickly.|2019-04-10 15:36:02+00:00|
|19139|JayInslee|We must cut off the gravy train of federal subsidies for oil and gas companies. They’re literally killing us.|2019-05-07 20:00:18+00:00|
|8764|sethmoulton|The Second Amendment was written in 1791 when people were firing single rounds out of a musket and dueling with pistols.|2019-08-07 16:42:48+00:00|
And how many do we have?
And how many from each candidate?
df.username.value_counts()

AndrewYang         4425
marwilliamson      2571
ewarren            2570
JayInslee          2120
KamalaHarris       2110
JohnDelaney        1913
BernieSanders      1881
GovernorBullock    1721
ericswalwell       1705
BetoORourke        1667
SenGillibrand      1538
TimRyan            1481
amyklobuchar       1405
CoryBooker         1315
TomSteyer          1279
sethmoulton        1239
JulianCastro       1220
Hickenlooper        959
MichaelBennet       904
TulsiGabbard        893
PeteButtigieg       856
JoeBiden            856
WayneMessam         815
JoeSestak           619
BilldeBlasio        497
Name: username, dtype: int64
Using topic modeling#
We'll be trying to use topic modeling to generate a list of topics each tweet is about, as well as words associated with each topic. Why do we think that? Because the methodology text told us so!
The initial keywords were generated by topic modeling the entire corpus of tweets, then supplemented manually with additional keywords.
First we'll need to vectorize our text into numbers that scikit-learn can understand, and then we'll use topic modeling to find the topics inside.
Vectorize the text#
When you're doing topic modeling, the kind of vectorizing you use depends on the kind of topic model you're going to build. An LDA topic model requires a CountVectorizer, while any other kind of topic model works best with a TfidfVectorizer. LDA works from raw word counts and more or less has its own version of TF-IDF built in, so it already understands the difference between things like low-frequency and high-frequency words.
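To make the difference between counting and TF-IDF weighting concrete, here is a minimal hand-rolled sketch. The three toy tweets and the helper name `tfidf_weights` are made up for illustration. The formula is a simplified, smoothed TF-IDF similar to scikit-learn's default but without the length normalization, so the numbers will not match a real TfidfVectorizer exactly.

```python
import math
from collections import Counter

# Three toy "tweets" (hypothetical, already lowercased)
docs = [
    "we need health care",
    "we need climate action",
    "we need gun reform",
]

tokenized = [doc.split() for doc in docs]

# Raw counts: the kind of numbers a count-based vectorizer produces
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
n_docs = len(docs)

def tfidf_weights(doc_counts):
    """Simplified TF-IDF: count times a smoothed inverse document frequency."""
    return {
        word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in doc_counts.items()
    }

weights = tfidf_weights(counts[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} count={counts[0][word]} tfidf={weight:.2f}")
```

Every word in the first tweet has a count of 1, but "we" and "need" appear in all three tweets, so they end up with a lower TF-IDF weight than "health" and "care".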
I'm lazy and LDA takes a long time to run, so we're not going to use LDA, which means we'll need a TfidfVectorizer. Since I want words like "tomato" and "tomatoes" combined, I'm also going to use a stemmer. More or less we're just stealing from the reference page.
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

# Using pyStemmer because it's way faster than NLTK
stemmer = Stemmer.Stemmer('en')

# Based on TfidfVectorizer
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: stemmer.stemWords([w for w in analyzer(doc)])
We're going to keep every word that shows up in at least one hundred tweets, which is what min_df=100 means. If it isn't mentioned in a hundred of our roughly 40k tweets, the word is probably not that important.
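One detail worth spelling out: scikit-learn's `min_df` counts documents, not total occurrences, so a word repeated many times inside a single tweet still only counts once. A minimal sketch of that filtering, using made-up tweets and a threshold of 2 as a stand-in for 100:

```python
from collections import Counter

# Hypothetical tokenized tweets
tweets = [
    ["climate", "climate", "climate", "crisis"],
    ["health", "care", "crisis"],
    ["health", "care", "plan"],
]

min_df = 2  # stand-in for the min_df=100 used on the real data

# Count each word once per tweet, no matter how often it repeats inside it
doc_freq = Counter(word for tweet in tweets for word in set(tweet))

# Keep only the words that appear in enough different tweets
vocabulary = sorted(word for word, df in doc_freq.items() if df >= min_df)
print(vocabulary)
```

Here "climate" occurs three times overall but only in one tweet, so it is dropped along with "plan".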
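%%time
vectorizer = StemmedTfidfVectorizer(stop_words='english', min_df=100)
# Strip everything except word characters and spaces before vectorizing.
# regex=True is needed in newer pandas, where str.replace defaults to literal matching.
matrix = vectorizer.fit_transform(df.text.str.replace(r"[^\w ]", "", regex=True))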
CPU times: user 2.77 s, sys: 85 ms, total: 2.85 s
Wall time: 3.21 s
That leaves us with just under a thousand words. Now that we're vectorized we can head on to topic modeling.
Using LSI/SVD for topic modeling#
Whenever we're building a topic model, we face an important question: how many topics? Bloomberg uses fourteen categories, so let's pick seventeen to add a little bit of buffer room.
%%time
from sklearn.decomposition import TruncatedSVD

# Tell the model to find the topics
model = TruncatedSVD(n_components=17)
model.fit(matrix)

# Print the top 10 words per category
n_words = 10
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(model.components_):
    message = "Topic #%d: " % topic_idx
    message += ", ".join([feature_names[i] for i in topic.argsort()[:-n_words - 1:-1]])
    print(message)
print()
Topic #0: thank, support, great, work, peopl, im, make, need, fight, time
Topic #1: peopl, need, american, presid, make, work, countri, im, right, trump
Topic #2: care, health, right, need, american, women, trump, afford, protect, access
Topic #3: climat, chang, trump, presid, need, donald, defeat, crisi, nation, threat
Topic #4: climat, chang, health, care, need, plan, new, debat, great, crisi
Topic #5: right, im, fight, presid, women, climat, trump, run, chang, vote
Topic #6: need, im, debat, make, care, health, help, stage, campaign, just
Topic #7: gun, violenc, im, need, fight, end, live, love, peopl, work
Topic #8: gun, need, violenc, health, care, join, presid, trump, look, end
Topic #9: need, right, time, debat, make, let, help, great, vote, donor
Topic #10: peopl, look, right, join, forward, campaign, like, tune, polit, just
Topic #11: love, trump, make, support, famili, day, donald, let, debat, work
Topic #12: love, presid, like, peopl, need, im, run, happi, look, new
Topic #13: need, work, look, right, new, forward, countri, join, good, worker
Topic #14: time, like, look, im, new, famili, forward, plan, support, pay
Topic #15: look, forward, make, gun, let, great, violenc, plan, fight, state
Topic #16: new, support, hampshir, let, day, campaign, presid, peopl, team, today

CPU times: user 739 ms, sys: 121 ms, total: 859 ms
Wall time: 526 ms
So we've got topics about thank yous/appreciation, general praise of America, climate change, gun violence, something that might be healthcare, Trump... They seem reasonable, right?
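If the slice `topic.argsort()[:-n_words - 1:-1]` in the printing loop looks cryptic: `argsort` returns the word indices ordered from the smallest weight to the largest, and the negative-step slice walks backwards from the end to pick the `n_words` largest. Here is a pure-Python sketch of the same idea, with a made-up vocabulary and made-up weights:

```python
# Hypothetical vocabulary and one topic's weights (one weight per word)
feature_names = ["care", "climat", "gun", "health", "thank"]
topic = [0.70, 0.05, 0.01, 0.90, 0.10]
n_words = 3

# Equivalent of numpy's argsort: indices ordered from smallest to largest weight
order = sorted(range(len(topic)), key=lambda i: topic[i])

# Walk backwards from the end to take the n_words largest weights
top_indices = order[:-n_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]
print(", ".join(top_words))
```

The result lists the heaviest word first, which is why each topic line starts with its most characteristic words.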
That was so fast, we might as well try it with another topic modeling algorithm, too.
Topic modeling with NMF#
What's the difference between this version of topic modeling and the previous one? For right now: who cares! Let's just try it out.
%%time
from sklearn.decomposition import NMF

# Tell the model to find the topics
model = NMF(n_components=17)
model.fit(matrix)

# Print the top 10 words per category
n_words = 10
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(model.components_):
    message = "Topic #%d: " % topic_idx
    message += ", ".join([feature_names[i] for i in topic.argsort()[:-n_words - 1:-1]])
    print(message)
print()
Topic #0: thank, have, leadership, appreci, come, soon, host, share, amaz, convers
Topic #1: work, famili, worker, pay, american, year, countri, job, economi, america
Topic #2: im, join, live, campaign, tune, run, talk, fight, tonight, iowa
Topic #3: trump, presid, donald, administr, immigr, mr, run, elect, state, unit
Topic #4: climat, chang, crisi, defeat, plan, threat, ourclimatemo, action, issu, big
Topic #5: right, fight, women, vote, protect, stand, human, reproduct, equal, abort
Topic #6: gun, violenc, end, epidem, communiti, live, action, safeti, nra, check
Topic #7: great, iowa, meet, day, talk, morn, today, enjoy, state, convers
Topic #8: care, health, afford, plan, medicar, access, mental, insur, univers, million
Topic #9: make, let, debat, sure, help, just, stage, happen, donat, donor
Topic #10: peopl, american, power, polit, govern, campaign, money, want, young, dont
Topic #11: need, dont, help, countri, real, donor, that, america, talk, secur
Topic #12: love, happi, day, hate, one, today, life, celebr, world, birthday
Topic #13: time, spend, long, year, past, come, congress, impeach, start, act
Topic #14: look, forward, like, soon, see, join, come, good, way, hope
Topic #15: new, hampshir, plan, york, citi, green, town, event, soon, state
Topic #16: support, appreci, team, donat, proud, grate, help, yes, campaign, debat

CPU times: user 4.8 s, sys: 304 ms, total: 5.1 s
Wall time: 5.77 s
These actually look a bit firmer - appreciation, working and families, climate change, reproductive rights/women's issues, Iowa, gun violence, healthcare, donors and political power, and maybe a little bit of the Green New Deal.
What we do with topic models#
Now that we have our topic models, the big question is: what do we do with them? Usually you use topic models to automatically assign categories to things - "this is about healthcare," "this is about gun violence," etc - but things are a little different here.
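In that usual workflow, each document gets a weight for every topic (in scikit-learn, that's what `model.transform(matrix)` returns), and the simplest way to assign a category is to pick the topic with the largest weight. A minimal sketch of that step, where both the weights and the hand-written topic labels are made up for illustration:

```python
# Hypothetical human-written labels, one per topic
topic_labels = ["healthcare", "gun violence", "climate change"]

# Hypothetical document-topic weights: one row per tweet, one column per topic
doc_topic_weights = [
    [0.81, 0.02, 0.10],
    [0.05, 0.64, 0.00],
    [0.12, 0.09, 0.55],
]

def strongest_topic(weights):
    """Return the label of the topic with the largest weight."""
    best_index = max(range(len(weights)), key=lambda i: weights[i])
    return topic_labels[best_index]

assigned = [strongest_topic(row) for row in doc_topic_weights]
print(assigned)
```

Note that this gives every tweet exactly one category, even when its strongest topic is only barely ahead of the others.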
Let's review what the methodology note said:
The text of the tweets were classified programmatically using a body of keywords that corresponded to a larger bucket of topics categorized by Bloomberg News....The initial keywords were generated by topic modeling the entire corpus of tweets, then supplemented manually with additional keywords.
So they used keywords to assign a category (or categories) to each tweet. Sounds like something we might be able to do, until we get to the example:
For example, a May 12 tweet from Beto O'Rourke reading, "We will repeal the discriminatory and hateful transgender troop ban and replace it with the Equality Act to ensure full civil rights for LGBTQ Americans," was classified under "social issues" and "military."
Even though the military and social issues are major topics that candidates will tweet about, none of the categories our topic models uncovered were about the military or social issues. So what do we do? Looks like we'll just need to invent our own keywords!
And that's exactly what they did, too.
The text of the tweets were classified programmatically using a body of keywords that corresponded to a larger bucket of topics categorized by Bloomberg News.
Did we just... do that for no reason?
In this section we applied topic modeling to a large number of tweets, comparing two different algorithms to see which one could best categorize our dataset. It turns out they were both pretty bad at it, and we're just going to use keywords instead.
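As a rough sketch of what keyword-based classification could look like, here is a tiny classifier run against the Beto O'Rourke tweet from the methodology note. The keyword lists and the `classify` helper are hypothetical: the methodology note quoted here does not give Bloomberg's actual keywords, and plain substring matching is only one of several ways this could be done.

```python
import re

# Hypothetical keyword lists, invented for this example
TOPIC_KEYWORDS = {
    "social issues": ["transgender", "lgbtq", "civil rights", "equality"],
    "military": ["troop", "military", "veteran"],
    "health care": ["health care", "medicare", "insurance"],
}

def classify(text):
    """Return every topic that has at least one keyword in the text."""
    # Lowercase and strip punctuation, like the cleanup before vectorizing
    cleaned = re.sub(r"[^\w ]", "", text.lower())
    return sorted(
        topic
        for topic, keywords in TOPIC_KEYWORDS.items()
        if any(keyword in cleaned for keyword in keywords)
    )

tweet = (
    "We will repeal the discriminatory and hateful transgender troop ban "
    "and replace it with the Equality Act to ensure full civil rights "
    "for LGBTQ Americans"
)
print(classify(tweet))
```

Unlike picking a single strongest topic, this lets one tweet land in several categories, which matches the "social issues" and "military" example. Substring matching is crude, though: a keyword like "troop" would also match inside unrelated longer words.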
We don't know how the different topic modeling techniques differ from one another. What might be the downsides to that? What are the downsides to learning them?
If we didn't want to learn the intricacies of topic modeling, but still wanted to do this project using topic modeling, how could we find someone to give us advice?
We only selected words that showed up at least one hundred times. Why?