{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping tweets from Democratic presidential primary candidates\n", "\n", "What's a person to do when the Twitter API only lets you go back so far? Scraping to the rescue! And luckily we can use a *library* to scrape instead of having to write a scraper ourselves." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<p class=\"reading-options\">\n <a class=\"btn\" href=\"/bloomberg-tweet-topics/scrape-tweets-from-presidential-primary-candidates\">\n <i class=\"fa fa-sm fa-book\"></i>\n Read online\n </a>\n <a class=\"btn\" href=\"/bloomberg-tweet-topics/notebooks/Scrape tweets from presidential primary candidates.ipynb\">\n <i class=\"fa fa-sm fa-download\"></i>\n Download notebook\n </a>\n <a class=\"btn\" href=\"https://colab.research.google.com/github/littlecolumns/ds4j-notebooks/blob/master/bloomberg-tweet-topics/notebooks/Scrape tweets from presidential primary candidates.ipynb\" target=\"_new\">\n <i class=\"fa fa-sm fa-laptop\"></i>\n Interactive version\n </a>\n</p>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introducing GetOldTweets3\n", "\n", "We'll be using the adorably-named [GetOldTweets3](https://github.com/Mottl/GetOldTweets3) library to acquire the Twitter history of the candidates in the Democratic presidential primary. We *could* use the [Twitter API](https://developer.twitter.com/en/docs), but unfortunately it doesn't let you go all the way back to the beginning of a user's timeline.\n", "\n", "GetOldTweets3, though, will help us get each and every tweet from 2019 by scraping each user's public timeline." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!pip install GetOldTweets3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scraping our tweets\n", "\n", "We're going to start with a list of usernames we're interested in, then loop through each one and use GetOldTweets3 to save the tweets into a CSV file named after the username." ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [], "source": [ "usernames = [\n", " 'joebiden', 'corybooker','petebuttigieg','juliancastro','kamalaharris',\n", " 'amyklobuchar','betoorourke','berniesanders','ewarren','andrewyang',\n", " 'michaelbennet','governorbullock','billdeblasio','johndelaney',\n", " 'tulsigabbard','waynemessam','timryan','joesestak','tomsteyer',\n", " 'marwilliamson','sengillibrand','hickenlooper','jayinslee',\n", " 'sethmoulton','ericswalwell'\n", "]" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading for joebiden\n", "(859, 15)\n", "Downloading for corybooker\n", "(1317, 15)\n", "Downloading for petebuttigieg\n", "(866, 15)\n", "Downloading for juliancastro\n", "(1231, 15)\n", "Downloading for kamalaharris\n", "(2114, 15)\n", "Downloading for amyklobuchar\n", "(1405, 15)\n", "Downloading for betoorourke\n", "(1683, 15)\n", "Downloading for berniesanders\n", "(1881, 15)\n", "Downloading for ewarren\n", "(2571, 15)\n", "Downloading for andrewyang\n", "(4475, 15)\n", "Downloading for michaelbennet\n", "(906, 15)\n", "Downloading for governorbullock\n", "(1722, 15)\n", "Downloading for billdeblasio\n", "(500, 15)\n", "Downloading for johndelaney\n", "(1921, 15)\n", "Downloading for tulsigabbard\n", "(900, 15)\n", "Downloading for waynemessam\n", "(817, 15)\n", "Downloading for timryan\n", "(1486, 15)\n", "Downloading for joesestak\n", "(621, 15)\n", "Downloading for tomsteyer\n", "(1279, 15)\n", "Downloading for 
marwilliamson\n", "(2637, 15)\n", "Downloading for sengillibrand\n", "(1538, 15)\n", "Downloading for hickenlooper\n", "(973, 15)\n", "Downloading for jayinslee\n", "(2128, 15)\n", "Downloading for sethmoulton\n", "(1242, 15)\n", "Downloading for ericswalwell\n", "(1717, 15)\n" ] } ], "source": [ "import GetOldTweets3 as got\n", "import pandas as pd\n", "\n", "def download_tweets(username):\n", "    print(f\"Downloading for {username}\")\n", "    # Ask for this user's tweets within the since/until date range\n", "    tweetCriteria = (\n", "        got.manager.TweetCriteria()\n", "        .setUsername(username)\n", "        .setSince(\"2019-01-01\")\n", "        .setUntil(\"2019-09-01\")\n", "    )\n", "    tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n", "    # Each tweet object's attributes become one row of the dataframe\n", "    df = pd.DataFrame([tweet.__dict__ for tweet in tweets])\n", "    print(df.shape)\n", "    # One CSV per candidate, named after the username\n", "    df.to_csv(f\"data/tweets-raw-{username}.csv\", index=False)\n", "\n", "for username in usernames:\n", "    download_tweets(username)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combining our files\n", "\n", "We don't want to work with these tweets in separate files, though - we'd rather have them all in one place! We'll finish up our scraping by combining every candidate's tweets into a single file.\n", "\n", "We'll start by using the [glob library](https://pymotw.com/3/glob/) to get a list of the filenames." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['data/tweets-kamalaharris.csv', 'data/tweets-tomsteyer.csv', 'data/tweets-betoorourke.csv', 'data/tweets-amyklobuchar.csv', 'data/tweets-billdeblasio.csv', 'data/tweets-joebiden.csv', 'data/tweets-petebuttigieg.csv', 'data/tweets-sethmoulton.csv', 'data/tweets-joesestak.csv', 'data/tweets-juliancastro.csv', 'data/tweets-tulsigabbard.csv', 'data/tweets-waynemessam.csv', 'data/tweets-marwilliamson.csv', 'data/tweets-governorbullock.csv', 'data/tweets-jayinslee.csv', 'data/tweets-hickenlooper.csv', 'data/tweets-sengillibrand.csv', 'data/tweets-ericswalwell.csv', 'data/tweets-johndelaney.csv', 'data/tweets-corybooker.csv', 'data/tweets-michaelbennet.csv', 'data/tweets-timryan.csv', 'data/tweets-ewarren.csv', 'data/tweets-berniesanders.csv', 'data/tweets-andrewyang.csv']\n" ] } ], "source": [ "import glob\n", "\n", "filenames = glob.glob(\"data/tweets-raw-*.csv\")\n", "print(filenames)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll then use a list comprehension to read each file into a dataframe, then `pd.concat` to combine them into one." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(38789, 15)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "dataframes = [pd.read_csv(filename) for filename in filenames]\n", "df = pd.concat(dataframes)\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's pull a sample to make sure it looks like we think it should..." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>username</th>\n", " <th>to</th>\n", " <th>text</th>\n", " <th>retweets</th>\n", " <th>favorites</th>\n", " <th>replies</th>\n", " <th>id</th>\n", " <th>permalink</th>\n", " <th>author_id</th>\n", " <th>date</th>\n", " <th>formatted_date</th>\n", " <th>hashtags</th>\n", " <th>mentions</th>\n", " <th>geo</th>\n", " <th>urls</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>247</th>\n", " <td>TimRyan</td>\n", " <td>NaN</td>\n", " <td>Hate, racism, white nationalism is terrorizing...</td>\n", " <td>88</td>\n", " <td>401</td>\n", " <td>198</td>\n", " <td>1158019752931074049</td>\n", " <td>https://twitter.com/TimRyan/status/11580197529...</td>\n", " <td>466532637</td>\n", " <td>2019-08-04 14:19:58+00:00</td>\n", " <td>Sun Aug 04 14:19:58 +0000 2019</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>681</th>\n", " <td>amyklobuchar</td>\n", " <td>washingtonpost</td>\n", " <td>We need to see the full report in order to pro...</td>\n", " <td>518</td>\n", " <td>1813</td>\n", " <td>240</td>\n", " <td>1126198087213563905</td>\n", " <td>https://twitter.com/amyklobuchar/status/112619...</td>\n", " <td>33537967</td>\n", " <td>2019-05-08 18:52:02+00:00</td>\n", " <td>Wed May 08 18:52:02 +0000 2019</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>https://twitter.com/washingtonpost/status/1126...</td>\n", " </tr>\n", " <tr>\n", " <th>647</th>\n", " <td>GovernorBullock</td>\n", " <td>NaN</td>\n", " 
<td>McConnell has stood in the way of American pro...</td>\n", " <td>89</td>\n", " <td>392</td>\n", " <td>157</td>\n", " <td>1149037399030272011</td>\n", " <td>https://twitter.com/GovernorBullock/status/114...</td>\n", " <td>111721601</td>\n", " <td>2019-07-10 19:27:18+00:00</td>\n", " <td>Wed Jul 10 19:27:18 +0000 2019</td>\n", " <td>NaN</td>\n", " <td>@AmyMcGrathKY</td>\n", " <td>NaN</td>\n", " <td>http://bit.ly/2SdvJP6</td>\n", " </tr>\n", " <tr>\n", " <th>543</th>\n", " <td>ericswalwell</td>\n", " <td>NaN</td>\n", " <td>$1 could be the difference between 4 more year...</td>\n", " <td>327</td>\n", " <td>893</td>\n", " <td>877</td>\n", " <td>1133149644014391303</td>\n", " <td>https://twitter.com/ericswalwell/status/113314...</td>\n", " <td>377609596</td>\n", " <td>2019-05-27 23:15:02+00:00</td>\n", " <td>Mon May 27 23:15:02 +0000 2019</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>https://bit.ly/2EnCLuG</td>\n", " </tr>\n", " <tr>\n", " <th>1423</th>\n", " <td>marwilliamson</td>\n", " <td>maidenoftheair</td>\n", " <td>Hear hear.</td>\n", " <td>0</td>\n", " <td>5</td>\n", " <td>0</td>\n", " <td>1122503791553630210</td>\n", " <td>https://twitter.com/marwilliamson/status/11225...</td>\n", " <td>21522338</td>\n", " <td>2019-04-28 14:12:13+00:00</td>\n", " <td>Sun Apr 28 14:12:13 +0000 2019</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " username to \\\n", "247 TimRyan NaN \n", "681 amyklobuchar washingtonpost \n", "647 GovernorBullock NaN \n", "543 ericswalwell NaN \n", "1423 marwilliamson maidenoftheair \n", "\n", " text retweets favorites \\\n", "247 Hate, racism, white nationalism is terrorizing... 88 401 \n", "681 We need to see the full report in order to pro... 518 1813 \n", "647 McConnell has stood in the way of American pro... 89 392 \n", "543 $1 could be the difference between 4 more year... 327 893 \n", "1423 Hear hear. 
0 5 \n", "\n", " replies id \\\n", "247 198 1158019752931074049 \n", "681 240 1126198087213563905 \n", "647 157 1149037399030272011 \n", "543 877 1133149644014391303 \n", "1423 0 1122503791553630210 \n", "\n", " permalink author_id \\\n", "247 https://twitter.com/TimRyan/status/11580197529... 466532637 \n", "681 https://twitter.com/amyklobuchar/status/112619... 33537967 \n", "647 https://twitter.com/GovernorBullock/status/114... 111721601 \n", "543 https://twitter.com/ericswalwell/status/113314... 377609596 \n", "1423 https://twitter.com/marwilliamson/status/11225... 21522338 \n", "\n", " date formatted_date hashtags \\\n", "247 2019-08-04 14:19:58+00:00 Sun Aug 04 14:19:58 +0000 2019 NaN \n", "681 2019-05-08 18:52:02+00:00 Wed May 08 18:52:02 +0000 2019 NaN \n", "647 2019-07-10 19:27:18+00:00 Wed Jul 10 19:27:18 +0000 2019 NaN \n", "543 2019-05-27 23:15:02+00:00 Mon May 27 23:15:02 +0000 2019 NaN \n", "1423 2019-04-28 14:12:13+00:00 Sun Apr 28 14:12:13 +0000 2019 NaN \n", "\n", " mentions geo urls \n", "247 NaN NaN NaN \n", "681 NaN NaN https://twitter.com/washingtonpost/status/1126... \n", "647 @AmyMcGrathKY NaN http://bit.ly/2SdvJP6 \n", "543 NaN NaN https://bit.ly/2EnCLuG \n", "1423 NaN NaN NaN " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking good! Let's **remove any rows that are missing a value in the `text` column** (I don't know why, *but they exist*), and save the result so we can analyze it in the next notebook." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "df = df.dropna(subset=['text'])\n", "df.to_csv(\"data/tweets.csv\", index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review\n", "\n", "In this section we used the `GetOldTweets3` library to download large numbers of tweets that the API could not get us.\n", "\n", "## Discussion topics\n", "\n", "We're certainly breaking Twitter's Terms of Service by scraping these tweets. Should we not do it? What are the ethical and legal issues at play?\n", "\n", "Why are we scraping tweets as opposed to Facebook posts or campaign speeches?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }