What does Trump tweet about?

An analysis of over 11,000 tweets.

Tags: natural language processing, text analysis, topic modeling, doing it by hand, reading lots of documents


The New York Times wanted to analyze the contents of over 11,000 of Trump's tweets, so they took an unexpected approach: they read them. From their "how they did it" page:

KAREN YOURISH After Mr. Trump tweeted attacks on “the Squad,” the four Democratic congresswomen of color, our executive editor, Dean Baquet, felt like it would be worth doing a deep dive into what Mr. Trump has been tweeting. We wanted to know how many of those tweets are attacks on specific people, or minorities, or other groups, and shed light on something that is unique to this president.

My colleague Larry Buchanan and I decided, in the beginning, that it made sense to read through all of the tweets. Doing a data analysis without actually reading the content of the tweets wasn’t going to give us the kind of detail that we needed.

There's nothing more exciting to me than an analysis of a large number of documents by actual journalist eyes and brains. Text was meant for humans to read, so humans are going to be the best at it!

Discussion topics

It took two people over a month to read all the tweets. If an algorithm could read all the tweets after a day of tagging and a day of programming/tweaking, why would you take this (much!) longer option instead?

"for each tweet I went through and determined if it was an attack, if it was praise, if it was both." A close technical alternative to this would be a sentiment analysis tool. What are the downsides of using sentiment analysis in this way? How about writing your own classifier to distinguish praise from attacks?
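To see why sentiment scores are a rough stand-in for "attack vs. praise," it helps to look at how lexicon-based tools work under the hood. Below is a toy scorer in the spirit of tools like VADER; the word lists and example tweets are invented for illustration, and a real lexicon would have thousands of weighted entries.

```python
# Toy lexicon-based sentiment scorer, in the spirit of tools like VADER.
# The tiny word lists here are illustrative stand-ins, not a real lexicon.
POSITIVE = {"great", "terrific", "wonderful", "tremendous", "best"}
NEGATIVE = {"fake", "failing", "crooked", "disaster", "worst"}

def sentiment_score(tweet: str) -> float:
    """Return a score in [-1, 1]: positive-ish for praise, negative-ish for attacks."""
    words = [w.strip(".,!?\"'").lower() for w in tweet.split()]
    hits = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    # Scale by tweet length and clamp to [-1, 1]
    return max(-1.0, min(1.0, hits / max(len(words), 1) * 10))

print(sentiment_score("The failing New York Times is fake news!"))  # negative
print(sentiment_score("Great rally tonight, tremendous crowd!"))    # positive
```

Notice the gap this exposes: a tweet can be negative in tone without being an *attack on a person*, which is exactly the distinction the Times' hand-coding captured and a generic sentiment score can't.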

"I also had a separate category where I wrote down who he was attacking or what he was attacking." You could also try this with spaCy's entity recognition. What might be different if you used the technical solution instead?
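spaCy's pretrained models can tag PERSON and ORG entities, but even a dependency-free sketch shows the flavor (and the limits) of extracting attack targets automatically. The regexes and the example tweet below are invented for illustration; they grab @-mentions and runs of capitalized words as a crude proxy for entity recognition.

```python
import re

def rough_targets(tweet: str) -> list[str]:
    """Crude stand-in for entity recognition: @-mentions plus
    runs of two or more capitalized words (e.g. "New York Times")."""
    mentions = re.findall(r"@\w+", tweet)
    caps = re.findall(r"(?:\b[A-Z][a-z]+\b[ ]?){2,}", tweet)
    return mentions + [c.strip() for c in caps]

print(rough_targets("The Failing New York Times and @CNN are the enemy!"))
# → ['@CNN', 'The Failing New York Times']
```

A real NER model would do far better than these regexes, but it still only tells you *who is mentioned*, not whether the mention is an attack, praise, or something in between, which is why the reporters kept that judgment manual.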

"We also analyzed the accounts that he retweeted and separated them into verified accounts and unverified accounts." You could try to do this with a scraper or using the Twitter API. What might be different when collecting this information automatically as opposed to doing it manually?
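If you pulled the retweeted accounts programmatically, the `verified` flag would come back from the Twitter API (for example via a client library like Tweepy); the tallying step itself is simple. The account records below are made-up stand-ins for what an API response might contain.

```python
# Sketch: tally verified vs. unverified retweeted accounts.
# These records are invented stand-ins for data the Twitter API would return.
from collections import Counter

retweets = [
    {"screen_name": "WhiteHouse", "verified": True},
    {"screen_name": "random_fan_123", "verified": False},
    {"screen_name": "random_fan_456", "verified": False},
]

counts = Counter("verified" if rt["verified"] else "unverified" for rt in retweets)
share_unverified = counts["unverified"] / len(retweets)

print(counts)                     # Counter({'unverified': 2, 'verified': 1})
print(f"{share_unverified:.0%}")  # 67%
```

The automated version is fast, but it snapshots verification status *today*; accounts get verified, suspended, or deleted over time, which is one thing a manual pass at a fixed moment sidesteps.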

Was the point of segmenting into verified/unverified to be able to say "Trump retweets XXX% unverified accounts"?

"The graphics editors Karen Yourish and Larry Buchanan read every tweet — more than 11,000 of them — twice." What benefits does reading and categorizing the tweets by hand have compared to using a classification algorithm? Does it enable you to use different language in your article about your results?
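For contrast, here is what the classification-algorithm route might look like: once you have some hand-labeled examples, a bag-of-words classifier takes minutes to train. The five tweets and their labels below are invented for illustration, and this is a sketch of the general technique, not the Times' method.

```python
# Sketch: a tiny attack-vs-praise classifier trained on hand-labeled tweets.
# The example tweets and labels are invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = [
    "Crooked media, total disaster, sad!",
    "Fake news, the worst, failing badly",
    "Great American patriots, thank you!",
    "Tremendous success, a beautiful event",
    "Total witch hunt, disgraceful!",
]
labels = ["attack", "attack", "praise", "praise", "attack"]

# Word counts -> Naive Bayes, chained into one pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(tweets, labels)

print(model.predict(["What a disaster, so sad"]))
```

A model like this would report, say, "tweets our classifier labeled as attacks," whereas the hand-coded approach lets the Times write flatly that Trump *attacked* someone — a claim two human readers verified twice.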

How many different stories did The New York Times publish as a result of this dataset? What are they able to do in the future?