Topic models with Gensim#
Gensim is a popular library for topic modeling. Here we'll see how it stacks up to scikit-learn.
Gensim vs. Scikit-learn#
Gensim is a very popular piece of software for topic modeling (as is MALLET, if you're making a list). Since we're using scikit-learn for everything else, though, we stick with scikit-learn when we get to topic modeling.
Since someone might show up one day offering us tens of thousands of dollars to demonstrate proficiency in Gensim, though, we might as well see how it works compared to scikit-learn.
Our data#
We'll be using the same dataset as we did with scikit-learn: State of the Union addresses from 1790 to 2012, where America's president addresses the Congress about the coming year.
import pandas as pd
df = pd.read_csv("data/state-of-the-union.csv")
# Clean it up a little bit, removing non-word characters (numbers and ___ etc)
df.content = df.content.str.replace(r"[^A-Za-z ]", " ", regex=True)
df.head()
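As a quick sanity check, here's what that cleanup pattern does to a sample string (a toy example, not a line from the dataset):

```python
import re

# Everything that isn't a letter or a space becomes a space,
# so years, punctuation and underscores all disappear
sample = "In 1790, Washington addressed Congress."
print(re.sub(r"[^A-Za-z ]", " ", sample))
```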
Using Gensim#
#!pip install --upgrade gensim
from gensim.utils import simple_preprocess
texts = df.content.apply(simple_preprocess)
from gensim import corpora
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(text) for text in texts]
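`doc2bow` turns a list of tokens into sparse `(word id, count)` pairs. Conceptually it does something like this pure-Python sketch (`toy_doc2bow` and the tiny vocabulary are invented for illustration):

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    # Count tokens that are in the dictionary; unknown tokens are dropped
    counts = Counter(t for t in tokens if t in token2id)
    # Return sorted (id, count) pairs, like gensim's doc2bow
    return sorted((token2id[t], n) for t, n in counts.items())

token2id = {"union": 0, "state": 1, "congress": 2}
print(toy_doc2bow(["state", "of", "the", "union", "state"], token2id))
# [(0, 1), (1, 2)]
```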
from gensim import models
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
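By default, gensim's `TfidfModel` weights each raw count by `log2(N / document frequency)` and then L2-normalizes the document vector. A rough sketch of that default behavior (the function name and numbers are invented for illustration):

```python
import math

def toy_tfidf(bow, doc_freq, n_docs):
    # Global weight: log2(total docs / docs containing the term)
    weighted = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in bow]
    # L2-normalize the document vector, as gensim does by default
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else weighted

# A term appearing in every document gets weight log2(N/N) = 0
print(toy_tfidf([(0, 3), (1, 1)], {0: 4, 1: 1}, n_docs=4))
```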
n_topics = 15
# Build an LSI model
lsi_model = models.LsiModel(corpus_tfidf,
                            id2word=dictionary,
                            num_topics=n_topics)
lsi_model.print_topics()
Gensim is all about how important each word is to each topic. Why not visualize it? First we'll make a dataframe that shows each topic, its top ten words, and their values.
n_words = 10
topic_words = pd.DataFrame({})
for i, topic in enumerate(lsi_model.get_topics()):
    top_feature_ids = topic.argsort()[-n_words:][::-1]
    feature_values = topic[top_feature_ids]
    words = [dictionary[id] for id in top_feature_ids]
    topic_df = pd.DataFrame({'value': feature_values, 'word': words, 'topic': i})
    topic_words = pd.concat([topic_words, topic_df], ignore_index=True)
topic_words.head()
Then we'll use seaborn to visualize it.
import matplotlib.pyplot as plt
import seaborn as sns
g = sns.FacetGrid(topic_words, col="topic", col_wrap=3, sharey=False)
g.map(plt.barh, "word", "value")
Using LDA with Gensim#
Now we'll use LDA.
from gensim.utils import simple_preprocess
texts = df.content.apply(simple_preprocess)
from gensim import corpora
dictionary = corpora.Dictionary(texts)
# This time we also cap the vocabulary at the 2,000 most frequent surviving words
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=2000)
corpus = [dictionary.doc2bow(text) for text in texts]
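`filter_extremes` prunes the vocabulary: a token must appear in at least `no_below` documents and in at most `no_above` of the corpus (as a fraction), and only the `keep_n` most frequent survivors are kept. A rough sketch of that logic (the function name and the word counts below are invented for illustration):

```python
def toy_filter_extremes(doc_freq, n_docs, no_below=5, no_above=0.5, keep_n=2000):
    # Drop tokens that are too rare or too common
    kept = [t for t, df in doc_freq.items()
            if df >= no_below and df / n_docs <= no_above]
    # Of the survivors, keep only the keep_n most document-frequent
    kept.sort(key=lambda t: doc_freq[t], reverse=True)
    return set(kept[:keep_n])

doc_freq = {"congress": 200, "the": 220, "tariff": 40, "blockchain": 2}
# "congress" and "the" are too common, "blockchain" is too rare
print(toy_filter_extremes(doc_freq, n_docs=226))
# {'tariff'}
```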
from gensim import models
n_topics = 15
# Pass id2word so print_topics shows words instead of numeric ids
lda_model = models.LdaModel(corpus=corpus, num_topics=n_topics, id2word=dictionary)
lda_model.print_topics()
import pyLDAvis
import pyLDAvis.gensim  # in pyLDAvis >= 3.0 this module is named pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
vis