Using topic modeling to find topics discussed in Democratic presidential candidate tweets#
Can topic modeling help us understand what topics Democratic presidential candidates are talking about? Let's find out!
Since we're analyzing text, we'll need to increase the amount of text pandas displays in each column. We'll also increase the number of columns displayed to help with some of the topic modeling output later.
import pandas as pd
pd.set_option("display.max_columns", 60)
pd.set_option("display.max_colwidth", 300)
Our data#
We have around 40,000 tweets from Democratic presidential candidates, starting in January 2019. We scraped them using GetOldTweets3 in the previous section.
# We don't need all of the columns, let's leave out a lot of them
columns = ['username', 'text', 'date']
df = pd.read_csv("data/tweets.csv", usecols=columns)
df.sample(5)
And how many do we have?
df.shape
And how many from each candidate?
df.username.value_counts()
Using topic modeling#
We'll be trying to use topic modeling to generate a list of topics each tweet is about, as well as the words associated with each topic. Why do we think that will work? Because the methodology text told us so!
The initial keywords were generated by topic modeling the entire corpus of tweets, then supplemented manually with additional keywords.
First we'll need to vectorize our text into numbers that scikit-learn can understand, and then we'll use topic modeling to find the topics inside.
Vectorize the text#
When you're doing topic modeling, the kind of vectorizing you use depends on the kind of topic model you're going to build. An LDA topic model requires a CountVectorizer, while other kinds of topic models work best with a TfidfVectorizer. That's because LDA magically accounts for word frequency on its own, so it understands the difference between low-frequency and high-frequency words without any TF-IDF weighting.
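If the difference between the two vectorizers feels abstract, here's a quick sketch on a couple of made-up sentences. It isn't part of our pipeline, just an illustration.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["the tomato is red", "the tomato is a fruit"]

# CountVectorizer gives every word the same weight: "the" counts
# just as much as "red" or "fruit"
count_vec = CountVectorizer()
print(pd.DataFrame(count_vec.fit_transform(sentences).toarray(),
                   columns=count_vec.get_feature_names_out()))

# TfidfVectorizer downweights words that show up in every sentence,
# so the words unique to each one get the highest scores
tfidf_vec = TfidfVectorizer()
print(pd.DataFrame(tfidf_vec.fit_transform(sentences).toarray(),
                   columns=tfidf_vec.get_feature_names_out()))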
I'm lazy and LDA takes a long time to run, so we're not going to use LDA, which means we'll need a TfidfVectorizer. Since I want words like "tomato" and "tomatoes" combined, I'm also going to use a stemmer. More or less we're just stealing from the reference page.
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

# Using pyStemmer because it's way faster than NLTK
stemmer = Stemmer.Stemmer('en')

# A TfidfVectorizer that stems each word before counting it
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: stemmer.stemWords([w for w in analyzer(doc)])
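As a quick sanity check that the stemming does what we want, we can feed the stemmer a couple of variants by hand. Both should come back as "tomato".

# Singular and plural collapse into the same token
stemmer.stemWords(["tomato", "tomatoes"])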
We're only going to keep words that show up in at least one hundred tweets. If a word isn't mentioned a hundred times across 40k tweets, it's probably not that important.
%%time
vectorizer = StemmedTfidfVectorizer(stop_words='english',
                                    min_df=100)
# Strip punctuation before vectorizing
matrix = vectorizer.fit_transform(df.text.str.replace(r"[^\w ]", "", regex=True))
matrix.shape
Down to just under a thousand words. Now that we're vectorized we can head on to topic modeling. First, though, let's take a quick peek at which words made the cut.
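# A sample of the words that survived the min_df=100 cutoff
vectorizer.get_feature_names_out()[:20]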
Using LSI/SVD for topic modeling#
Whenever we're building a topic model, we have to answer one important question: how many topics? The Bloomberg piece uses fourteen categories, so let's pick seventeen to add a little bit of buffer room.
%%time
from sklearn.decomposition import TruncatedSVD
# Tell the model to find the topics
model = TruncatedSVD(n_components=17)
model.fit(matrix)
# Print the top 10 words per topic
n_words = 10
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(model.components_):
    message = "Topic #%d: " % topic_idx
    message += ", ".join([feature_names[i]
                          for i in topic.argsort()[:-n_words - 1:-1]])
    print(message)
    print()
So we've got topics about thank yous/appreciation, general praise of America, climate change, gun violence, something that might be healthcare, Trump... They seem reasonable, right?
That was so fast, we might as well try it with another topic modeling algorithm, too.
Topic modeling with NMF#
What's the difference between this version of topic modeling and the previous one? For right now: who cares! Let's just try it out.
%%time
from sklearn.decomposition import NMF
# Tell the model to find the topics
model = NMF(n_components=17)
model.fit(matrix)
# Print the top 10 words per topic
n_words = 10
feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(model.components_):
    message = "Topic #%d: " % topic_idx
    message += ", ".join([feature_names[i]
                          for i in topic.argsort()[:-n_words - 1:-1]])
    print(message)
    print()
These actually look a bit firmer - appreciation, working and families, climate change, reproductive rights/women's issues, Iowa, gun violence, healthcare, donors and political power, and maybe a little bit of the Green New Deal.
What we do with topic models#
Now that we have our topic models, the big question is: what do we do with them? Usually you use topic models to automatically assign categories to things - "this is about healthcare," "this is about gun violence," etc - but things are a little different here.
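If we did want the usual automatic route, it only takes a couple of lines: .transform scores every tweet against every topic, and argmax keeps the strongest one. This is just a sketch using the NMF model we fit above, and the topic column is our own invention, not anything from the methodology.

# Score every tweet against every topic, then keep the strongest topic
topic_scores = model.transform(matrix)
df['topic'] = topic_scores.argmax(axis=1)
df[['username', 'text', 'topic']].sample(5)

But again, that isn't what the Bloomberg piece actually did.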
Let's review what the methodology note said:
The text of the tweets were classified programmatically using a body of keywords that corresponded to a larger bucket of topics categorized by Bloomberg News....The initial keywords were generated by topic modeling the entire corpus of tweets, then supplemented manually with additional keywords.
So they used keywords to assign a category (or categories) to each tweet. Sounds like something we might be able to do, until we get to the example:
For example, a May 12 tweet from Beto O'Rourke reading, "We will repeal the discriminatory and hateful transgender troop ban and replace it with the Equality Act to ensure full civil rights for LGBTQ Americans," was classified under "social issues" and "military."
Even though the military and social issues are major topics that candidates will tweet about, none of the categories our topic models uncovered were about the military or social issues. So what do we do? Looks like we'll just need to invent our own keywords!
And that's exactly what they did, too.
The text of the tweets were classified programmatically using a body of keywords that corresponded to a larger bucket of topics categorized by Bloomberg News.
Did we just... do that for no reason?
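Well, not totally for no reason: at least now we know what the keyword approach looks like. A keyword classifier can be as simple as the sketch below. The keywords here are ones we're inventing for illustration, not Bloomberg's actual list.

# A hand-built keyword list. These keywords are invented for
# illustration - they are NOT Bloomberg's actual list
keywords = {
    'healthcare': ['health', 'insurance', 'medicare'],
    'military': ['troops', 'military', 'veteran'],
}

# Tag each tweet with every category whose keywords appear in it
for category, words in keywords.items():
    df[f'is_{category}'] = df.text.str.contains('|'.join(words), case=False)

df[['text', 'is_healthcare', 'is_military']].sample(5)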
Review#
In this section we applied topic modeling to tens of thousands of tweets, trying a couple of different algorithms to see which one could best categorize our dataset. It turns out they were all pretty bad, and we're just going to use keywords instead.
Discussion topics#
We don't know how the different topic modeling techniques work under the hood. What might be the downsides to that? What would be the downsides to taking the time to learn them?
If we didn't want to learn the intricacies of topic modeling, but still wanted to do this project using topic modeling, how could we find someone to give us advice?
We only selected words that showed up at least one hundred times. Why?