{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using topic modeling to find topics discussed in Democratic presidential candidate tweets\n", "\n", "Can topic modeling help us understand what topics Democratic presidential candidates are talking about? Let's find out!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<p class=\"reading-options\">\n <a class=\"btn\" href=\"/bloomberg-tweet-topics/topic-modeling-for-tweets\">\n <i class=\"fa fa-sm fa-book\"></i>\n Read online\n </a>\n <a class=\"btn\" href=\"/bloomberg-tweet-topics/notebooks/Topic modeling for tweets.ipynb\">\n <i class=\"fa fa-sm fa-download\"></i>\n Download notebook\n </a>\n <a class=\"btn\" href=\"https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/bloomberg-tweet-topics/notebooks/Topic modeling for tweets.ipynb\" target=\"_new\">\n <i class=\"fa fa-sm fa-laptop\"></i>\n Interactive version\n </a>\n</p>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Prep work: Downloading necessary files\n", "Before we get started, we need to download all of the data we'll be using.\n", "* **tweets.csv:** raw tweets - Approximately 39k tweets from Democratic presidential candidates\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Make data directory if it doesn't exist\n", "!mkdir -p data\n", "!wget -nc https://nyc3.digitaloceanspaces.com/ml-files-distro/v1/bloomberg-tweet-topics/data/tweets.csv -P data" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we're analyzing text, we'll need to increase the text that's displayed in each pandas column. We're also increasing the number of columns displayed to help with some of the topic modeling stuff." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "pd.set_option(\"display.max_columns\", 60)\n", "pd.set_option(\"display.max_colwidth\", 300)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Our data\n", "\n", "We have around 40,000 tweets from Democratic presidential candidates, starting in January 2019. We scraped them using [GetOldTweets3](https://github.com/Mottl/GetOldTweets3/) in the previous section." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>username</th>\n", " <th>text</th>\n", " <th>date</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>24277</th>\n", " <td>JohnDelaney</td>\n", " <td>On Tuesday's debate I compared the Warren/Sanders agenda to the too far left shift that McGovern, Mondale and Dukakis made. McGovern lost 49 states, Mondale lost 49 states and Dukakis lost 40 states. We have to run on big economic ideas that also appeal to centrist voters.</td>\n", " <td>2019-08-02 17:50:57+00:00</td>\n", " </tr>\n", " <tr>\n", " <th>25018</th>\n", " <td>JohnDelaney</td>\n", " <td>The President clearly cares more about his Twitter followers than the American people. His continued dishonesty and weaponization of social media has been divisive. 
I am calling on all Americans to #UnfollowTrump and hit him where it actually hurts him... his ego.</td>\n", " <td>2019-04-24 15:58:08+00:00</td>\n", " </tr>\n", " <tr>\n", " <th>36444</th>\n", " <td>AndrewYang</td>\n", " <td>An AI system defeated elite Chinese doctors in a two-round brain tumor diagnosis competition on both speed and accuracy. This could do incredible good but is another example of areas in which new technology is capable of beating humans. We have to evolve quickly.</td>\n", " <td>2019-04-10 15:36:02+00:00</td>\n", " </tr>\n", " <tr>\n", " <th>19139</th>\n", " <td>JayInslee</td>\n", " <td>We must cut off the gravy train of federal subsidies for oil and gas companies. They\u2019re literally killing us.</td>\n", " <td>2019-05-07 20:00:18+00:00</td>\n", " </tr>\n", " <tr>\n", " <th>8764</th>\n", " <td>sethmoulton</td>\n", " <td>The Second Amendment was written in 1791 when people were firing single rounds out of a musket and dueling with pistols.</td>\n", " <td>2019-08-07 16:42:48+00:00</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " username \\\n", "24277 JohnDelaney \n", "25018 JohnDelaney \n", "36444 AndrewYang \n", "19139 JayInslee \n", "8764 sethmoulton \n", "\n", " text \\\n", "24277 On Tuesday's debate I compared the Warren/Sanders agenda to the too far left shift that McGovern, Mondale and Dukakis made. McGovern lost 49 states, Mondale lost 49 states and Dukakis lost 40 states. We have to run on big economic ideas that also appeal to centrist voters. \n", "25018 The President clearly cares more about his Twitter followers than the American people. His continued dishonesty and weaponization of social media has been divisive. I am calling on all Americans to #UnfollowTrump and hit him where it actually hurts him... his ego. \n", "36444 An AI system defeated elite Chinese doctors in a two-round brain tumor diagnosis competition on both speed and accuracy. This could do incredible good but is another example of areas in which new technology is capable of beating humans. We have to evolve quickly. \n", "19139 We must cut off the gravy train of federal subsidies for oil and gas companies. They\u2019re literally killing us. \n", "8764 The Second Amendment was written in 1791 when people were firing single rounds out of a musket and dueling with pistols. \n", "\n", " date \n", "24277 2019-08-02 17:50:57+00:00 \n", "25018 2019-04-24 15:58:08+00:00 \n", "36444 2019-04-10 15:36:02+00:00 \n", "19139 2019-05-07 20:00:18+00:00 \n", "8764 2019-08-07 16:42:48+00:00 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We don't need all of the columns, so let's leave most of them out\n", "columns = ['username', 'text', 'date']\n", "\n", "df = pd.read_csv(\"data/tweets.csv\", usecols=columns)\n", "df.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And how many do we have?" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(38559, 3)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And how many from each candidate?"
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AndrewYang 4425\n", "marwilliamson 2571\n", "ewarren 2570\n", "JayInslee 2120\n", "KamalaHarris 2110\n", "JohnDelaney 1913\n", "BernieSanders 1881\n", "GovernorBullock 1721\n", "ericswalwell 1705\n", "BetoORourke 1667\n", "SenGillibrand 1538\n", "TimRyan 1481\n", "amyklobuchar 1405\n", "CoryBooker 1315\n", "TomSteyer 1279\n", "sethmoulton 1239\n", "JulianCastro 1220\n", "Hickenlooper 959\n", "MichaelBennet 904\n", "TulsiGabbard 893\n", "PeteButtigieg 856\n", "JoeBiden 856\n", "WayneMessam 815\n", "JoeSestak 619\n", "BilldeBlasio 497\n", "Name: username, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.username.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using topic modeling\n", "\n", "We'll be trying to use **topic modeling** to generate a list of topics each tweet is about, as well as words associated with each topic. Why do we think that? Because the methodology text told us so!\n", "\n", "> The initial keywords were generated by topic modeling the entire corpus of tweets, then supplemented manually with additional keywords. \n", "\n", "First we'll need to **vectorize** our text into numbers that scikit-learn can understand, and then we'll use topic modeling to find the topics inside.\n", "\n", "### Vectorize the text\n", "\n", "When you're doing topic modeling, **the kind of vectorizing you use depends on the kind of topic model you're going to build.** Using and LDA topic model required a `CountVectorizer`, while any other kind of topic model works best with a `TfidfVectorizer`. LDA magically has TF-IDF built in, so it understands the difference between things like low-frequency and high-frequency words.\n", "\n", "I'm lazy and LDA takes a long time to run, so we're _not_ going to use LDA, which means we'll need a `TfidfVectorizer`. Since I want words like \"tomato\" and \"tomatos\" and \"tomatoes\" combined, I'm also going to use a stemmer. More or less we're just stealing from [the reference page](/reference/vectorizing/)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "import Stemmer\n", "\n", "# Using pyStemmer because it's way faster than NLTK\n", "stemmer = Stemmer.Stemmer('en')\n", "\n", "# Based on TfidfVectorizer\n", "class StemmedTfidfVectorizer(TfidfVectorizer):\n", " def build_analyzer(self):\n", " analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()\n", " return lambda doc: stemmer.stemWords([w for w in analyzer(doc)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to count all words that show up **at least one hundred times.** If it isn't mentioned a hundred times across 40k tweets, the word is probably not that important." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.77 s, sys: 85 ms, total: 2.85 s\n", "Wall time: 3.21 s\n" ] } ], "source": [ "%%time\n", "vectorizer = StemmedTfidfVectorizer(stop_words='english',\n", " min_df=100)\n", "\n", "matrix = vectorizer.fit_transform(df.text.str.replace(\"[^\\w ]\", \"\"))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(38559, 975)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matrix.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Down to just under a thousand. Now that we're vectorized we can head on to topic modeling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using LSI/SVD for topic modeling\n", "\n", "Whenever we're building a topic model, we have the important question of **how many topics?** The Bloomberg uses fourteen categories, so let's pick seventeen to add a little bit of buffer room." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic #0: thank, support, great, work, peopl, im, make, need, fight, time\n", "Topic #1: peopl, need, american, presid, make, work, countri, im, right, trump\n", "Topic #2: care, health, right, need, american, women, trump, afford, protect, access\n", "Topic #3: climat, chang, trump, presid, need, donald, defeat, crisi, nation, threat\n", "Topic #4: climat, chang, health, care, need, plan, new, debat, great, crisi\n", "Topic #5: right, im, fight, presid, women, climat, trump, run, chang, vote\n", "Topic #6: need, im, debat, make, care, health, help, stage, campaign, just\n", "Topic #7: gun, violenc, im, need, fight, end, live, love, peopl, work\n", "Topic #8: gun, need, violenc, health, care, join, presid, trump, look, end\n", "Topic #9: need, right, time, debat, make, let, help, great, vote, donor\n", "Topic #10: peopl, look, right, join, forward, campaign, like, tune, polit, just\n", "Topic #11: love, trump, make, support, famili, day, donald, let, debat, work\n", "Topic #12: love, presid, like, peopl, need, im, run, happi, look, new\n", "Topic #13: need, work, look, right, new, forward, countri, join, good, worker\n", "Topic #14: time, like, look, im, new, famili, forward, plan, support, pay\n", "Topic #15: look, forward, make, gun, let, great, violenc, plan, fight, state\n", "Topic #16: new, support, hampshir, let, day, campaign, presid, peopl, team, today\n", "\n", "CPU times: user 739 ms, sys: 121 ms, total: 859 ms\n", "Wall time: 526 ms\n" ] } ], "source": [ "%%time\n", "from sklearn.decomposition import TruncatedSVD\n", "\n", "# Tell the model to find the topics\n", "model = TruncatedSVD(n_components=17)\n", "model.fit(matrix)\n", "\n", "# Print the top 10 words per category\n", "n_words = 10\n", "feature_names = vectorizer.get_feature_names()\n", "\n", "for topic_idx, topic in enumerate(model.components_):\n", " message = \"Topic #%d: \" % topic_idx\n", " message += \", \".join([feature_names[i]\n", " for i in topic.argsort()[:-n_words - 1:-1]])\n", " print(message)\n", "print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So we've got topics about thank yous/appreciation, general praise of America, climate change, gun violence, something that might be healthcare, Trump... 
{ "cell_type": "markdown", "metadata": {}, "source": [ "That was so fast, we might as well try it with another topic modeling algorithm, too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Topic modeling with NMF\n", "\n", "What's the difference between this version of topic modeling and the previous one? For right now: **who cares!** Let's just try it out." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Topic #0: thank, have, leadership, appreci, come, soon, host, share, amaz, convers\n", "Topic #1: work, famili, worker, pay, american, year, countri, job, economi, america\n", "Topic #2: im, join, live, campaign, tune, run, talk, fight, tonight, iowa\n", "Topic #3: trump, presid, donald, administr, immigr, mr, run, elect, state, unit\n", "Topic #4: climat, chang, crisi, defeat, plan, threat, ourclimatemo, action, issu, big\n", "Topic #5: right, fight, women, vote, protect, stand, human, reproduct, equal, abort\n", "Topic #6: gun, violenc, end, epidem, communiti, live, action, safeti, nra, check\n", "Topic #7: great, iowa, meet, day, talk, morn, today, enjoy, state, convers\n", "Topic #8: care, health, afford, plan, medicar, access, mental, insur, univers, million\n", "Topic #9: make, let, debat, sure, help, just, stage, happen, donat, donor\n", "Topic #10: peopl, american, power, polit, govern, campaign, money, want, young, dont\n", "Topic #11: need, dont, help, countri, real, donor, that, america, talk, secur\n", "Topic #12: love, happi, day, hate, one, today, life, celebr, world, birthday\n", "Topic #13: time, spend, long, year, past, come, congress, impeach, start, act\n", "Topic #14: look, forward, like, soon, see, join, come, good, way, hope\n", "Topic #15: new, hampshir, plan, york, citi, green, town, event, soon, state\n", "Topic #16: support, appreci, team, donat, proud, grate, help, yes, campaign, debat\n", "\n", "CPU times: user 4.8 s, sys: 304 ms, total: 5.1 s\n", "Wall time: 5.77 s\n" ] } ], "source": [ "%%time\n", "from sklearn.decomposition import NMF\n", "\n", "# Tell the model to find the topics\n", "model = NMF(n_components=17)\n", "model.fit(matrix)\n", "\n", "# Print the top 10 words per category\n", "n_words = 10\n", "feature_names = vectorizer.get_feature_names()\n", "\n", "for topic_idx, topic in enumerate(model.components_):\n", " message = \"Topic #%d: \" % topic_idx\n", " message += \", \".join([feature_names[i]\n", " for i in topic.argsort()[:-n_words - 1:-1]])\n", " print(message)\n", "print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These actually look a bit firmer - appreciation, working and families, climate change, reproductive rights/women's issues, Iowa, gun violence, healthcare, donors and political power, and maybe a little bit of the Green New Deal."
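] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since NMF scores every tweet against every topic, we can already peek at which topic \"wins\" for any given tweet. This is a quick sketch rather than anything from the Bloomberg methodology - `.transform` gives us a tweets-by-topics matrix of scores, and `.argmax` grabs the highest-scoring topic for each tweet. The `top_topic` column is a name we just made up." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Score every tweet against every topic...\n", "doc_topics = model.transform(matrix)\n", "\n", "# ...then tag each tweet with its highest-scoring topic number\n", "df['top_topic'] = doc_topics.argmax(axis=1)\n", "df[['username', 'text', 'top_topic']].sample(3)"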
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What we do with topic models\n", "\n", "Now that we have our topic models, the big question is: **what do we do with them?** Usually you use topic models to automatically assign categories to things - \"this is about healthcare,\" \"this is about gun violence,\" etc - but things are a little different here.\n", "\n", "Let's review what the methodology note said:\n", "\n", "> The text of the tweets were classified programmatically using a body of keywords that corresponded to a larger bucket of topics categorized by Bloomberg News....The initial keywords were generated by topic modeling the entire corpus of tweets, then supplemented manually with additional keywords.\n", "\n", "So they used **keywords** to assign a category (or categories) to each tweet. Sounds like something we might be able to do, until we get to the example:\n", "\n", "> For example, a May 12 tweet from Beto O'Rourke reading, \"We will repeal the discriminatory and hateful transgender troop ban and replace it with the Equality Act to ensure full civil rights for LGBTQ Americans,\" was classified under \"social issues\" and \"military.\"\n", "\n", "Even though the military and social issues are major topics that candidates will tweet about, **none of the categories our topic models uncovered were about the military or social issues.** So what do we do? Looks like we'll just need to invent our own keywords!\n", "\n", "And that's exactly what they did, too.\n", "\n", "> The text of the tweets were classified programmatically using a body of keywords that corresponded to a larger bucket of topics categorized by Bloomberg News.\n", "\n", "Did we just... do that for no reason?\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review\n", "\n", "In this section we applied **topic modeling** to a large number of tweets, comparing several different algorithms to see which one could best categorize our dataset. It turns out they were all pretty bad, and we're just going to use keywords instead.\n", "\n", "## Discussion topics\n", "\n", "We don't know the difference between how the different topic modeling techniques work. What might be downsides to that? What are the downsides to learning them?\n", "\n", "If we didn't want to learn the intricacies of topic modeling, but still wanted to do this project using topic modeling, how could we find someone to give us advice?\n", "\n", "We only selected words that showed up at least one hundred times. Why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }