Using topic modeling to extract topics from documents#
Sometimes you have a nice big set of documents, and all you wish for is to know what's hiding inside. But without reading them, of course! Two approaches to try to lazily get some information from your texts are topic modeling and clustering.
How computers read#
I'm going to tell you a big secret: computers are really really really bad at reading documents and figuring out what they're about. Text is for people to read, people with a collective knowledge of The World At Large and a history of reading things and all kinds of other tricky secret little things we don't think about that help us understand what a piece of text means.
When it comes to understanding content, computers are good at doing very specific things in very specific situations. Alternatively, they can do a not-that-great job when you aren't going to be terribly picky about the results.
Do I sound a little biased? Oh, but aren't we all. It isn't going to stop us from talking about it, though!
Before we start, let's make some assumptions:
- When you're dealing with documents, each document is (typically) about something.
- You can tell what each document is about by looking at the words in the document.
- Documents with similar words are probably about similar things.
We have two major options available to us: topic modeling and clustering. There's a lot of NLP nuance going on between the two, but we're going to keep it simple:
Topic modeling is for when each document can be about multiple topics. There might be 100 different topics, and a document might be 30% about one topic, 20% about another, and 50% spread out among the rest.
Clustering is for when each document fits into exactly one topic. It's an all-or-nothing approach.
The most important part of all of this is the fact that the computer figures out these topics by itself. You don't tell it what to do! If you're teaching the algorithm what different specific topics look like, that's classification. In this case we're just saying "hey computer, please figure this out!"
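To make the distinction concrete, here's a toy sketch (the documents and numbers are made up, not from our recipe dataset) contrasting the two kinds of output: NMF-style topic modeling gives each document a weight for every topic, while KMeans-style clustering assigns each document exactly one label.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

docs = [
    "tomato garlic basil olive oil pasta",
    "flour sugar butter eggs vanilla",
    "soy sauce ginger garlic rice sesame",
    "tomato basil mozzarella olive oil",
]

matrix = TfidfVectorizer().fit_transform(docs)

# Topic modeling: every document gets a weight for every topic
weights = NMF(n_components=2, init='nndsvda').fit_transform(matrix)
print(weights.round(2))  # one row per document, one column per topic

# Clustering: every document gets exactly one label
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)
print(labels)
```

Same documents, two very different answers: fractions of topics on one hand, a single cluster number on the other.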
Let's get started.
import pandas as pd
import matplotlib.pyplot as plt
# These styles look nicer than default pandas
plt.style.use('ggplot')
# We'll be able to see more text at once
pd.set_option("display.max_colwidth", 100)
recipes = pd.read_csv("data/recipes.csv")
recipes.head()
In order to analyze the text, we'll need to count the words in each recipe. To do that we're going to use a stemmed TF-IDF vectorizer from scikit-learn.
- Stemming will allow us to combine words like tomato and tomatoes
- Using TF-IDF will allow us to devalue common ingredients like salt and water
I'm using the code from the reference section, just adjusted from a CountVectorizer to a TfidfVectorizer, and set so that ingredients have to appear in at least fifty recipes.
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

# English stemmer from PyStemmer
stemmer = Stemmer.Stemmer('en')

# Override TfidfVectorizer so every token gets stemmed after tokenizing
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))
vectorizer = StemmedTfidfVectorizer(min_df=50)
matrix = vectorizer.fit_transform(recipes.ingredient_list)
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
words_df.head()
Looks like we have 752 ingredients! Yes, there are some numbers in there and probably other things we aren't interested in, but let's stick with it for now.
Topic modeling#
There are multiple techniques for topic modeling, but in the end they do the same thing: you get a list of topics, and a list of words associated with each topic.
Let's tell it to break them down into five topics.
from sklearn.decomposition import NMF
model = NMF(n_components=5)
model.fit(matrix)
Why five topics? Because we have to tell it something. Our job is to decide the number of topics, and it's the computer's job to find the topics. We'll talk about how to pick the "right" number later, but for now: it's magic.
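One common (if rough) way to take some of the magic out of picking a number is NMF's reconstruction error: refit with different n_components values and watch how much the error drops. A minimal sketch on toy documents (in the notebook you'd loop over the real matrix instead):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "tomato garlic basil olive oil",
    "flour sugar butter eggs",
    "soy sauce ginger rice",
    "tomato basil mozzarella",
    "sugar vanilla cream butter",
    "ginger garlic soy noodles",
]
matrix = TfidfVectorizer().fit_transform(docs)

# Fit a model for each candidate number of topics and record the error
errors = {}
for k in range(1, 5):
    model = NMF(n_components=k, init='nndsvda', max_iter=500)
    model.fit(matrix)
    errors[k] = model.reconstruction_err_

# Error always shrinks as k grows - look for where it stops dropping quickly
for k, err in errors.items():
    print(k, round(err, 3))
```

There's no single "right" answer, but an elbow in that curve is a reasonable place to start.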
Fitting the model allowed it to "learn" what the ingredients are and how they're organized, we just need to find out what's inside. Let's ask for the top ten terms in each group.
n_words = 10
feature_names = vectorizer.get_feature_names_out()
topic_list = []

for topic_idx, topic in enumerate(model.components_):
    top_features = [feature_names[i] for i in topic.argsort()[::-1][:n_words]]
    top_n = ' '.join(top_features)
    topic_list.append(f"topic_{'_'.join(top_features[:3])}")
    print(f"Topic {topic_idx}: {top_n}")

print(topic_list)
print(topic_list)
Those actually seem like pretty good topics. Italian-ish, then baking, then Chinese, maybe Latin American or Indian food, and then dairy. What if we did it with fifteen topics instead?
model = NMF(n_components=15)
model.fit(matrix)
n_words = 10
feature_names = vectorizer.get_feature_names_out()
topic_list = []

for topic_idx, topic in enumerate(model.components_):
    top_n = [feature_names[i] for i in topic.argsort()[-n_words:]][::-1]
    top_features = ' '.join(top_n)
    topic_list.append(f"topic_{'_'.join(top_n[:3])}")
    print(f"Topic {topic_idx}: {top_features}")
This is where we start to see the big difference between categories and topics. The grouping with five groups seemed very much like cuisines - Italian, Chinese, etc. But now that we're breaking it down further, the groups have changed a bit.
They're now more like classes of ingredients. Chicken gets a category - chicken breast boneless skinless - and so do generic Mediterranean ingredients - oliv extra virgin oil clove garlic fresh salt. The algorithm got a little confused about black pepper vs. hot pepper flakes vs. green/yellow bell peppers when it created pepper bell red green onion celeri flake black, but we understand what it's going for.
Remember, the important thing about topic modeling is that every row in our dataset is a combination of topics. It might be a little bit about one thing, a little bit less about another, and so on. Let's take a look at how that works.
# If we don't want 'real' names for the topics, we can run this line
# topic_list = [f"topic_{i}" for i in range(model.n_components_)]
# Turn each document into its mix of topics
amounts = model.transform(matrix) * 100
# Set it up as a dataframe
topics = pd.DataFrame(amounts, columns=topic_list)
topics.head(2)
Our first recipe is primarily topic_3, with a score of 2.44, but it's also a bit topic 0 and topic 8, with scores of 1.5 and 1.36.
Our second recipe is a bit bolder - it scores a whopping 5.7 in topic_7, with topics 0, 8 and 14 coming in around the 2.5-3 range.
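If you ever want a clustering-style answer out of this, you can grab each row's single biggest topic with idxmax. A sketch with made-up weights and column names (in the notebook you'd run this on the topics dataframe above):

```python
import pandas as pd

# Toy topic-weight dataframe shaped like the one above
topics = pd.DataFrame({
    'topic_tomato_basil_oliv': [2.44, 0.1],
    'topic_soy_ginger_garlic': [0.3, 5.7],
})

# The single strongest topic for each document
dominant = topics.idxmax(axis=1)
print(dominant)
```

That throws away all the "30% this, 20% that" nuance, of course, but it's handy for labeling.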
Let's combine this topics dataframe with our original dataframe so we can see it all in one place.
merged = recipes.merge(topics, right_index=True, left_index=True)
merged.head(2)
Now we can do things like...
- Uncover possible topics discussed in the dataset
- See how many documents cover each topic
- Find the top documents in each topic
And graph it! Let's see what our distribution of topics looks like.
ax = merged[topic_list].sum().to_frame().T.plot(kind='barh', stacked=True)
# Move the legend off of the chart
ax.legend(loc=(1.04,0))
Suspiciously even, but that's an investigation for another day. Let's try a different dataset that splits a little differently.
Attempt two: State of the Union addresses#
One of the fun things to do with topic modeling is see how things change over time. For this example, we're going to reproduce an assignment from Jonathan Stray's Computational Journalism course.
At the beginning of each year, the President of the United States traditionally addresses Congress in a speech called the State of the Union. It's a good way to judge what's important in the country at the time, because the speech is sure to be used as a platform to address the legislative agenda for the year. Let's see if topic modeling can help illustrate how it's changed over time.
Our data#
We have a simple CSV of State of the Union addresses, nothing too crazy.
speeches = pd.read_csv("data/state-of-the-union.csv")
speeches.sample(5)
It's not too many, only a little over 200. Because it's a smaller dataset, we're able to try more computationally intensive forms of topic modeling (LDA, for example) without sitting around getting bored.
speeches.shape
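If you do want to try LDA here, note that it expects raw word counts rather than TF-IDF weights, so it pairs with a CountVectorizer. A minimal sketch on toy sentences (not the speeches themselves):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "war army soldiers battle",
    "economy jobs taxes budget",
    "war navy battle victory",
    "budget taxes spending economy",
]

# LDA works on raw counts, not TF-IDF weights
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
weights = lda.fit_transform(counts)

# Each row is a probability distribution over topics, summing to 1
print(weights.round(2))
```

Unlike NMF's unbounded scores, LDA's outputs read directly as "this speech is 80% topic A, 20% topic B."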
To help the analysis out a bit, we're going to clean the text. Only a little bit, though - we'll just remove anything that isn't a word.
# Remove anything that isn't a letter or a space - numbers, underscores, etc.
speeches.content = speeches.content.str.replace("[^A-Za-z ]", " ", regex=True)
speeches.head()
Vectorize#
We're going to use the same TF-IDF vectorizer we used up above, which stems words in addition to vectorizing. We'll reproduce the code down here for completeness' sake (and easy cut-and-paste).
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

# English stemmer from PyStemmer
stemmer = Stemmer.Stemmer('en')

# Override TfidfVectorizer so every token gets stemmed after tokenizing
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))
With our first pass we'll vectorize everything, no limits!
vectorizer = StemmedTfidfVectorizer(stop_words='english')
matrix = vectorizer.fit_transform(speeches.content)
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
words_df.head()
Running NMF topic modeling#
Now we'll leap into topic modeling. We'll look at fifteen topics, since we're covering a long span of time where lots of different things may have happened.
model = NMF(n_components=15)
model.fit(matrix)
n_words = 10
feature_names = vectorizer.get_feature_names_out()
topic_list = []

for topic_idx, topic in enumerate(model.components_):
    top_n = [feature_names[i] for i in topic.argsort()[-n_words:]][::-1]
    top_features = ' '.join(top_n)
    topic_list.append(f"topic_{'_'.join(top_n[:3])}")
    print(f"Topic {topic_idx}: {top_features}")
Let's be honest with ourselves: we expected something a bit better. So many of these words are so common that it doesn't do much to convince me these are meaningful concepts.
Adjusting our min and max document frequency#
One way to cut those overly broad topics from our topic model is to remove them from the vectorizer. Instead of accepting all words, we can set minimum or maximum limits.
Let's only accept words that are used in at least 5 speeches but don't appear in more than half of them.
vectorizer = StemmedTfidfVectorizer(stop_words='english', min_df=5, max_df=0.5)
matrix = vectorizer.fit_transform(speeches.content)
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
words_df.head()
And now we'll check the topic model.
model = NMF(n_components=15)
model.fit(matrix)
n_words = 10
feature_names = vectorizer.get_feature_names_out()
topic_list = []

for topic_idx, topic in enumerate(model.components_):
    top_n = [feature_names[i] for i in topic.argsort()[-n_words:]][::-1]
    top_features = ' '.join(top_n)
    topic_list.append(f"topic_{'_'.join(top_n[:3])}")
    print(f"Topic {topic_idx}: {top_features}")
That's looking a little more interesting! Lots of references to wars and political conflict, along with slavery and monetary policy.
Visualizing the outcome#
We can get a better handle on what our data looks like through a little visualization. We'll start by loading up the topic popularity dataframe. Remember that each row is one of our speeches.
# Turn each speech into its mix of topics
amounts = model.transform(matrix) * 100
# Set it up as a dataframe
topics = pd.DataFrame(amounts, columns=topic_list)
topics.head(2)
The first row is our first speech, the second row is our second speech, and so on.
ax = topics.sum().to_frame().T.plot(kind='barh', stacked=True)
# Move the legend off of the chart
ax.legend(loc=(1.04,0))
Again, pretty even! A few are larger or smaller, but overall the topics seem pretty evenly distributed.
Looking at totals across all time doesn't mean much, though - we're interested in change over time.
The hip way to show this is with a streamgraph, a stacked area chart that wiggles around a central baseline instead of sitting on the x-axis. Usually you'd have to merge the two dataframes in order to graph, but we can sneakily get around it since we aren't plotting with pandas (plotting streamgraphs means talking to matplotlib directly).
x_axis = speeches.year
y_axis = topics
fig, ax = plt.subplots(figsize=(10,5))
# Plot a stackplot - https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/stackplot_demo.html
ax.stackplot(x_axis, y_axis.T, baseline='wiggle', labels=y_axis.columns)
# Move the legend off of the chart
ax.legend(loc=(1.04,0))
I know that "Presidents talk about current news topics" is probably not the most exciting thing you've ever seen, but you can watch topics rise and fall easily enough.
merged = topics.join(speeches)
ax = merged.plot(x='year', y=['topic_kansa_slave_slaveri', 'topic_soviet_communist_atom'], figsize=(10,3))
ax.legend(loc=(1.04,0))
So what do you do with this?#
Good question. TODO.
Review#
In this section we looked at topic modeling, a technique for extracting topics from text datasets. Unlike clustering, where each document is assigned a single category, in topic modeling each document is considered a blend of different topics.
You don't need to "teach" a topic model anything about your dataset, you just let it loose and it comes back with what terms represent each topic. The only thing you need to give it is the number of topics to find.
The way you pre-process the text is very important to a topic model. We found that common words ended up appearing in many topics until we used max_df in our vectorizer to filter out high-frequency words.
There are many different algorithms to use for topic modeling, but we're saving that for a later section.
Discussion topics#
TODO