Picking the "right" number of topics for a scikit-learn topic model#
When you ask a topic model to find topics in documents for you, you only need to provide it with one thing: a number of topics to find. Somehow that one little number ends up being a lot of trouble! Let's figure out best practices for finding a good number of topics.
import pandas as pd
import matplotlib.pyplot as plt
# These styles look nicer than default pandas
plt.style.use('ggplot')
# We'll be able to see more text at once
pd.set_option("display.max_colwidth", 100)
Our dataset#
We'll use the same dataset of State of the Union addresses as in our last exercise.
# Read in our data
speeches = pd.read_csv("data/state-of-the-union.csv")
# Remove anything that isn't a letter or a space (numbers, punctuation, ___, etc.)
speeches.content = speeches.content.str.replace("[^A-Za-z ]", " ", regex=True)
speeches.sample(5)
Vectorizing our data#
We'll also use the same vectorizer as last time - a stemmed TF-IDF vectorizer that requires each term to appear in at least 5 documents, but in no more than half of them.
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

# English stemmer from pyStemmer
stemmer = Stemmer.Stemmer('en')

# Override TfidfVectorizer's analyzer so every token gets stemmed
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

vectorizer = StemmedTfidfVectorizer(stop_words='english', min_df=5, max_df=0.5)
matrix = vectorizer.fit_transform(speeches.content)
words_df = pd.DataFrame(matrix.toarray(),
columns=vectorizer.get_feature_names())
words_df.head()
Building our model#
Previously we used NMF (non-negative matrix factorization) for topic modeling. It seemed to work okay! We asked for fifteen topics.
from sklearn.decomposition import NMF
# Use NMF to look for 15 topics
n_topics = 15
model = NMF(n_components=n_topics)
model.fit(matrix)
# Print the top 10 words
n_words = 10
feature_names = vectorizer.get_feature_names()
topic_list = []
for topic_idx, topic in enumerate(model.components_):
    top_n = [feature_names[i]
             for i in topic.argsort()[-n_words:]][::-1]
    top_features = ' '.join(top_n)
    topic_list.append(f"topic_{'_'.join(top_n[:3])}")
    print(f"Topic {topic_idx}: {top_features}")
But how do we know we don't need twenty-five topics instead of just fifteen?
# Use NMF to look for 25 topics
n_topics = 25
model = NMF(n_components=n_topics)
model.fit(matrix)
# Print the top 10 words per topic
n_words = 10
feature_names = vectorizer.get_feature_names()
topic_list = []
for topic_idx, topic in enumerate(model.components_):
    top_n = [feature_names[i]
             for i in topic.argsort()[-n_words:]][::-1]
    top_features = ' '.join(top_n)
    topic_list.append(f"topic_{'_'.join(top_n[:3])}")
    print(f"Topic {topic_idx}: {top_features}")
I mean yeah, that honestly looks even better! These topics all seem to make sense. Should we go even higher?
Comparing topic models#
Scikit-learn comes with a magic thing called GridSearchCV. Any time you can't figure out the "right" combination of options to use with something, you can feed them to GridSearchCV and it will try them all. After it's done, it'll check the score on each to let you know the best combination.
We have a little problem, though: NMF can't be scored (at least in scikit-learn!). Because our model can't give us a number that represents how well it did, we can't compare it to other models, which means the only way to differentiate between 15 topics or 20 topics or 30 topics is how we feel about them.
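If you want to see that for yourself, scoring inside GridSearchCV leans on the estimator's score method, and NMF just doesn't have one. A quick check:

from sklearn.decomposition import NMF

# GridSearchCV relies on the estimator's .score() method to compare models,
# and NMF doesn't define one
hasattr(NMF(), 'score')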
This is not good! We want to be able to point to a number and say, "look! we did it right!" and have everyone nod their head in agreement.
Fortunately, though, there's a topic model that we haven't tried yet! LDA, a.k.a. latent Dirichlet allocation. Let's sidestep GridSearchCV for a second and see if LDA can help us.
Introducing LDA#
LDA is another topic model that we haven't covered yet because it's so much slower than NMF. Even if it's better, it's painful to sit around for minutes waiting for the computer to give us a result when NMF gets it done in under a second.
With that complaining out of the way, let's give LDA a shot. The code looks almost exactly like the NMF code; we just use something else to build our model.
There's one big difference: LDA expects raw word counts rather than TF-IDF weights, so we need to use a CountVectorizer as the vectorizer instead of a TfidfVectorizer. If you don't do this your results will be tragic.
from sklearn.feature_extraction.text import CountVectorizer
import Stemmer

# English stemmer from pyStemmer
stemmer = Stemmer.Stemmer('en')

# Override CountVectorizer's analyzer so every token gets stemmed
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

vectorizer = StemmedCountVectorizer(stop_words='english', min_df=5, max_df=0.5)
matrix = vectorizer.fit_transform(speeches.content)
words_df = pd.DataFrame(matrix.toarray(),
columns=vectorizer.get_feature_names())
words_df.head()
We're going to use %%time at the top of the cell to see how long this takes to run. Just remember that NMF took all of a second.
%%time
from sklearn.decomposition import LatentDirichletAllocation
# Use LDA to look for 15 topics
n_topics = 15
model = LatentDirichletAllocation(n_components=n_topics)
model.fit(matrix)
# Print the top 10 words per topic
n_words = 10
feature_names = vectorizer.get_feature_names()
topic_list = []
for topic_idx, topic in enumerate(model.components_):
    top_n = [feature_names[i]
             for i in topic.argsort()[-n_words:]][::-1]
    top_features = ' '.join(top_n)
    topic_list.append(f"topic_{'_'.join(top_n[:3])}")
    print(f"Topic {topic_idx}: {top_features}")
Those results look great, and ten seconds isn't so bad! The problem comes when you have larger datasets, so it's a good thing we picked one with under 300 documents. Let's see how our topic scores look for each document.
# Convert our counts into numbers
amounts = model.transform(matrix) * 100
# Set it up as a dataframe
topics = pd.DataFrame(amounts, columns=topic_list)
topics.head(2)
Uh, hm, that's kind of weird. Lots of really low numbers, and then it jumps up super high for some topics. How's it look graphed?
x_axis = speeches.year
y_axis = topics
fig, ax = plt.subplots(figsize=(10,5))
# Plot a stackplot - https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/stackplot_demo.html
ax.stackplot(x_axis, y_axis.T, baseline='wiggle', labels=y_axis.columns)
# Move the legend off of the chart
ax.legend(loc=(1.04,0))
Ouch. Looks like LDA doesn't like having topics shared in a document, while NMF was all about it. Let's keep on going, though!
Spoiler: It gives you different results every time, but this graph always looks wild and black.
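Before we move on: if you want to put a number on that "doesn't like to share" behavior, check how much of each speech goes to its single biggest topic. A quick sketch using the topics dataframe from above:

# What share of each speech goes to its single biggest topic?
# With LDA this is usually very high - one topic dominates each speech
topics.max(axis=1).describe()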
Using GridSearchCV to pick the best number of topics#
We'll need to build a dictionary for GridSearchCV to explain all of the options we're interested in changing, along with what they should be set to.
How many topics? Who knows! Somewhere between 5 and 30, maybe? We'll feed it a list of all of the different values we might set n_components to be.
We can also change the learning_decay option, which does Other Things That Change The Output. That's capitalized because we'll just treat it as fact instead of something to be investigated.
The learning decay doesn't actually have an agreed-upon default value! In scikit-learn it's 0.7, but Gensim uses 0.5 instead.
Remember that GridSearchCV is going to try every single combination. For example, let's say you had the following:
search_params = {
    'n_components': [20, 40, 60],
    'learning_decay': [.5, .7]
}
It builds, trains and scores a separate model for each combination of the two options, leading you to six different runs:
| n_components | learning_decay |
|---|---|
| 20 | 0.5 |
| 20 | 0.7 |
| 40 | 0.5 |
| 40 | 0.7 |
| 60 | 0.5 |
| 60 | 0.7 |
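If you'd rather see those combinations programmatically, scikit-learn's ParameterGrid expands the same dictionary for you. Also keep in mind that GridSearchCV cross-validates, so each combination actually gets fit several times (five folds by default in recent versions):

from sklearn.model_selection import ParameterGrid

# Every combination GridSearchCV will try with the example options above
search_params = {
    'n_components': [20, 40, 60],
    'learning_decay': [.5, .7]
}
for params in ParameterGrid(search_params):
    print(params)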
That means that if your LDA is slow, this is going to be much much slower. You might need to walk away and get a coffee while it's working its way through.
%%time
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV
# Options to try with our LDA
# Beware it will try *all* of the combinations, so it'll take ages
search_params = {
    'n_components': [5, 10, 15, 20, 25, 30],
    'learning_decay': [.5, .7]
}
# Set up LDA with the options we'll keep static
model = LatentDirichletAllocation(learning_method='online')
# Try all of the options
gridsearch = GridSearchCV(model, param_grid=search_params, n_jobs=-1, verbose=1)
gridsearch.fit(matrix)
# What did we find?
print("Best Model's Params: ", gridsearch.best_params_)
print("Best Log Likelihood Score: ", gridsearch.best_score_)
Great, we've been presented with the best option:
Best Model's Params: {'learning_decay': 0.7, 'n_components': 5}
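The winning combination isn't the only thing worth a look. GridSearchCV stores the score for every combination in cv_results_, which drops neatly into a dataframe:

# Peek at how every combination scored, best first
results = pd.DataFrame(gridsearch.cv_results_)
results[['param_n_components', 'param_learning_decay', 'mean_test_score']].sort_values('mean_test_score', ascending=False)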
Let's see what that best combination looks like.
%%time
from sklearn.decomposition import LatentDirichletAllocation
# Use LDA to look for 5 topics
n_topics = 5
model = LatentDirichletAllocation(learning_method='online', n_components=n_topics, learning_decay=0.7)
model.fit(matrix)
# Print the top 10 words per topic
n_words = 10
feature_names = vectorizer.get_feature_names()
topic_list = []
for topic_idx, topic in enumerate(model.components_):
    top_n = [feature_names[i]
             for i in topic.argsort()[-n_words:]][::-1]
    top_features = ' '.join(top_n)
    topic_list.append(f"topic_{'_'.join(top_n[:3])}")
    print(f"Topic {topic_idx}: {top_features}")
Might as well graph it while we're at it.
# Convert our counts into numbers
amounts = model.transform(matrix) * 100
# Set it up as a dataframe
topics = pd.DataFrame(amounts, columns=topic_list)
topics.head(2)
x_axis = speeches.year
y_axis = topics
fig, ax = plt.subplots(figsize=(10,5))
# Plot a stackplot - https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/stackplot_demo.html
ax.stackplot(x_axis, y_axis.T, baseline='wiggle', labels=y_axis.columns)
# Move the legend off of the chart
ax.legend(loc=(1.04,0))
While that makes perfect sense (I guess), it just doesn't feel right. Even trying fifteen topics looked better than that.
%%time
from sklearn.decomposition import LatentDirichletAllocation
# Use LDA to look for 15 topics
n_topics = 15
model = LatentDirichletAllocation(n_components=n_topics)
model.fit(matrix)
# Print the top 10 words per topic
n_words = 10
feature_names = vectorizer.get_feature_names()
topic_list = []
for topic_idx, topic in enumerate(model.components_):
    top_n = [feature_names[i]
             for i in topic.argsort()[-n_words:]][::-1]
    top_features = ' '.join(top_n)
    topic_list.append(f"topic_{'_'.join(top_n[:3])}")
    print(f"Topic {topic_idx}: {top_features}")
Right? They seem pretty reasonable, even if the graph looked horrible because LDA doesn't like to share. And hey, maybe NMF wasn't so bad after all. Just because we can't score it doesn't mean we can't enjoy it.
%%time
# Use NMF to look for 15 topics
n_topics = 15
model = NMF(n_components=n_topics)
model.fit(matrix)
# Print the top 10 words per topic
n_words = 10
feature_names = vectorizer.get_feature_names()
topic_list = []
for topic_idx, topic in enumerate(model.components_):
    top_n = [feature_names[i]
             for i in topic.argsort()[-n_words:]][::-1]
    top_features = ' '.join(top_n)
    topic_list.append(f"topic_{'_'.join(top_n[:3])}")
    print(f"Topic {topic_idx}: {top_features}")