Trump's tone to Congress#
We're going to reproduce Trump Sounds a Different Tone in First Address to Congress from The Upshot.
Data source 1: The NRC Emotional Lexicon, a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). The annotations were manually done by crowdsourcing.
Data source 2: A database of Trump speeches, one speech per file. There are a lot of GitHub repositories of Trump speeches, but this one was the best available at the time this analysis was performed.
Data source 3: State of the Union addresses taken from this repo's data directory. I also cheated and pasted Trump's SOTU-y address in.
Our target#
Here's the graphic we're trying to reproduce:
State of the Union addresses in one color, Trump speeches in another. Anger on one axis, positivity on the other.
Let's get started!
import pandas as pd
%matplotlib inline
Reading in the EmoLex#
I'm just copying this from the intro to the Emotional Lexicon notebook! It's the one at the very bottom that does a lot of reshaping, as I think that layout is the easiest to work with.
filepath = "data/NRC-Emotion-Lexicon-v0.92/NRC-emotion-lexicon-wordlevel-alphabetized-v0.92.txt"
emolex_df = pd.read_csv(filepath, names=["word", "emotion", "association"], skiprows=45, sep='\t', keep_default_na=False)
emolex_df = emolex_df.pivot(index='word', columns='emotion', values='association').reset_index()
emolex_df.head()
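If that pivot step is mysterious, here's a toy version with made-up values: the raw file is "long" - one row per word/emotion pair - and .pivot reshapes it "wide", with one row per word and one column per emotion.
# A tiny made-up example of what .pivot is doing - the real associations
# come from the lexicon file, these values are just for illustration
toy = pd.DataFrame({
    'word': ['cake', 'cake', 'doom', 'doom'],
    'emotion': ['anger', 'joy', 'anger', 'joy'],
    'association': [0, 1, 1, 0]
})
toy.pivot(index='word', columns='emotion', values='association').reset_index()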
import glob
filenames = glob.glob("data/trump_speeches-master/data/speech*")
filenames[:5]
Read them all in individually#
speeches = [open(filename).read() for filename in filenames]
len(speeches)
Create a dataframe out of the results#
Instead of passing a list of dictionaries to pd.DataFrame, we pass a dictionary that says "here are all of the filenames" and "here are all of the texts," and pandas turns each list into its own column.
speeches_df = pd.DataFrame({
'text': speeches,
'filename': filenames
})
speeches_df.head(3)
Splitting out the title and content of the speech#
The "text" column starts with the title of the speech, followed by the speech itself. Like this:
speeches_df.loc[0]['text'][:200]
We're going to split those out into multiple columns, then delete the original column so we don't get mixed up later.
# The first line of each file is the title; everything after the first newline is the speech
speeches_df['name'] = speeches_df['text'].apply(lambda value: value.split("\n")[0])
speeches_df['content'] = speeches_df['text'].apply(lambda value: value.split("\n", 1)[1])
del speeches_df['text']
speeches_df.head(2)
How does Trump sound?#
Let's analyze by counting words.
We could use the code below to count all of his words. But do we really want all of the words?
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
matrix = vec.fit_transform(speeches_df['content'])
vocab = vec.get_feature_names()
wordcount_df = pd.DataFrame(matrix.toarray(), columns=vocab)
wordcount_df.head()
While we could count all the words, remember that the NRC Emotional Lexicon only includes some words. It'd kind of be a waste of time to count them all, right?
emolex_df.word.head(3)
Instead of letting the vectorizer count willy-nilly, we'll feed the vectorizer just the words in the lexicon. It's easy-peasy: you just pass vocabulary= when you're building your vectorizer.
We're going to use a TfidfVectorizer here because we don't care about raw counts - otherwise a longer speech would tend to seem angrier or more surprised or happier than a shorter one! Instead we're looking for percentages. If I say "I hate these onions," we'd count 25% of the words in there as negative (hate, specifically).
To get percentages in our dataframe, we can use a combination of use_idf=False and norm='l1'.
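Before we run it on the speeches, here's a tiny sanity check of that claim. (One caveat: scikit-learn's default tokenizer ignores one-letter words, so I swapped "I" for "we" to keep the math at four words.)
from sklearn.feature_extraction.text import TfidfVectorizer
# With use_idf=False and norm='l1', each word's score is its share of the
# document's words, so every row sums to 1.0
demo_vec = TfidfVectorizer(use_idf=False, norm='l1')
demo_matrix = demo_vec.fit_transform(["we hate these onions"])
pd.DataFrame(demo_matrix.toarray(), columns=demo_vec.get_feature_names())
# each of the four words gets 0.25, so 'hate' is 25% of the sentence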
from sklearn.feature_extraction.text import TfidfVectorizer
# I only want you to look for words in the emotional lexicon
# because we don't know what's up with the other words
vec = TfidfVectorizer(vocabulary=emolex_df.word,
use_idf=False,
norm='l1') # ELL - ONE
matrix = vec.fit_transform(speeches_df.content)
vocab = vec.get_feature_names()
wordcount_df = pd.DataFrame(matrix.toarray(), columns=vocab)
wordcount_df.head()
Analysis without the EmoLex#
Let's poke around at the results a little bit. We can sort by one word...
wordcount_df.sort_values(by='america', ascending=False).head(5)
But since our words come in groups - angry, happy, etc. - we'll want to grab a whole collection at a time.
wordcount_df[['murder', 'america', 'great', 'prison', 'immigrant']].head(2)
What are some negative words? Let's experiment a little bit.
# bad bad bad = 100% negative
# bad bad evil evil = 50% bad + 50% evil = 100% negative
# bad fish evil fish = 25% bad + 25% evil = 50% negative
# awful % + hate % + bad % + worse % + evil % = negative %
wordcount_df[['awful', 'hate', 'bad', 'worse', 'evil']].sum(axis=1).head(20)
If we thought those were all of the negative words that existed in the world, we could add them up to get a "percentage of the speech that is these words" number, which we could also treat as a "percentage of the speech that was negative" number.
speeches_df['negative'] = wordcount_df[['awful', 'hate', 'bad', 'worse', 'evil']].sum(axis=1)
speeches_df.head(3)
We could do the same thing about policy if we had a list of words about policy.
speeches_df['policy'] = wordcount_df[['crime', 'discrimination', 'poverty', 'border']].sum(axis=1)
speeches_df.head(3)
And then magically enough we can plot them against each other!
speeches_df.plot(x='negative',
y='policy',
kind='scatter',
ylim=(0,0.01),
xlim=(0,0.005))
Adding in the EmoLex#
Instead of a list of semi-random words, we'll use the NRC Emotional Lexicon.
emolex_df.head()
What words are angry?
emolex_df[emolex_df.anger == 1].head()
We don't need all those columns, right? We just need the words themselves.
# Get your list of angry words
angry_words = emolex_df[emolex_df.anger == 1]['word']
angry_words.head()
Previously we asked the wordcount_df for specific words, words that we chose.
wordcount_df[['awful', 'hate', 'bad', 'worse', 'evil']]
But what if instead we just... fed it the list of angry words from the emotional lexicon? In the same way we could do ['awful', 'hate', 'bad', 'worse', 'evil'], we could also just feed it the list of angry_words from above.
wordcount_df[angry_words].head()
Now we just need to add them up, just like we did with "policy" and "negative" above.
# Only give me the columns of angry words
speeches_df['anger'] = wordcount_df[angry_words].sum(axis=1)
speeches_df.head(3)
Let's repeat that process with positivity. It's the same thing we did with the anger words, just condensed into a single cell.
# Get your list of positive words
positive_words = emolex_df[emolex_df.positive == 1].word
# Only give me the columns of positive words
speeches_df['positivity'] = wordcount_df[positive_words].sum(axis=1)
speeches_df.head(3)
Plot our results#
speeches_df.plot(x='positivity', y='anger', kind='scatter')
Okay, looks good so far. But we need to plot it against State of the Union addresses to fully reproduce the graphic.
Reading in the SOTU addresses#
Pretty much the same thing as what we did with Trump!
# Get the filenames
# Read them in
# Create a dataframe from the results
filenames = glob.glob("data/SOTU/*.txt")
contents = [open(filename).read() for filename in filenames]
sotu_df = pd.DataFrame({
'content': contents,
'filename': filenames
})
sotu_df.head(3)
Add a column for the name#
We don't have a name for these, so we'll just use the filename.
sotu_df['name'] = sotu_df['filename']
sotu_df.head()
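If the full path looks ugly as a name, you could (optionally) trim it down to just the file's name. This is purely cosmetic, so feel free to skip it.
import os
# Strip the directory part off of the filename so the 'name' column is easier to read
sotu_df['name'] = sotu_df['filename'].apply(os.path.basename)
sotu_df.head()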
How do State of the Unions sound?#
Let's analyze by counting words. Same thing we did with Trump - set the vocabulary, use_idf=False, and norm='l1'.
from sklearn.feature_extraction.text import TfidfVectorizer
# I only want you to look for words in the emotional lexicon
# because we don't know what's up with the other words
vec = TfidfVectorizer(vocabulary=emolex_df.word,
use_idf=False,
norm='l1') # ELL - ONE
matrix = vec.fit_transform(sotu_df['content'])
vocab = vec.get_feature_names()
sotu_wordcount_df = pd.DataFrame(matrix.toarray(), columns=vocab)
sotu_wordcount_df.head()
Sum up anger and positivity#
Then we'll reach into the NRC Emotional Lexicon and total up the positivity and anger.
# Get your list of positive words
positive_words = emolex_df[emolex_df.positive == 1]['word']
# Only give me the columns of positive words
sotu_df['positivity'] = sotu_wordcount_df[positive_words].sum(axis=1)
sotu_df.head(3)
# Get your list of angry words
angry_words = emolex_df[emolex_df.anger == 1].word
# Only give me the columns of angry words
sotu_df['anger'] = sotu_wordcount_df[angry_words].sum(axis=1)
sotu_df.head(3)
Comparing SOTU vs Trump#
Now that we have our two dataframes with positivity and anger, we can plot our graphic!
ax = speeches_df.plot(x='positivity', y='anger', kind='scatter')
sotu_df.plot(x='positivity', y='anger', kind='scatter', c='red', ax=ax)
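If you can't tell which dots are which, passing a label to each plot gets you a legend. Same chart, just a little friendlier.
# Same plot as above, just with labels and a legend so you can tell
# the two sets of speeches apart
ax = speeches_df.plot(x='positivity', y='anger', kind='scatter', label='Trump speeches')
sotu_df.plot(x='positivity', y='anger', kind='scatter', c='red', ax=ax, label='State of the Union')
ax.legend()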
Review#
In this section, we used the Emotional Lexicon to compare several sets of political speeches. Instead of just positive and negative sentiment, we were able to graph anger compared to positivity.
The Emotional Lexicon depends on each individual word having emotional ratings. Our approach was to add up the percentage of words associated with each emotion and use that as our score.
Discussion topics#
To build the NRC Emotional Lexicon, people were asked what emotions individual words carried, without any context at all. But somehow when you run it against this dataset, the aggregate seems to make sense. Does this give us more faith in scoring each word individually?
To a large degree, this piece reflects many people's sense that Trump's speeches are negative and angry. If the visual showed them to be more positive and less angry than a normal State of the Union, would we still trust the dataset? Would we still publish this piece? Why or why not?
If the results didn't seem to make sense, would we try again with other emotions on the axes?
Do you think it would work as well with 'anger' vs. 'negative' instead of 'anger' vs. 'positive'? Try it out and see what you think about the results - there's a starter sketch below!
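Here's a minimal sketch to get you started on that last one, assuming the dataframes from above are still around. I'm calling the new column negative_emolex so it doesn't collide with the hand-picked 'negative' column from earlier.
# Score every speech by the share of its words the lexicon marks as negative
negative_words = emolex_df[emolex_df.negative == 1].word
speeches_df['negative_emolex'] = wordcount_df[negative_words].sum(axis=1)
sotu_df['negative_emolex'] = sotu_wordcount_df[negative_words].sum(axis=1)

# Plot anger against negativity instead of anger against positivity
ax = speeches_df.plot(x='negative_emolex', y='anger', kind='scatter')
sotu_df.plot(x='negative_emolex', y='anger', kind='scatter', c='red', ax=ax)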