{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Scraping tweets from Democratic presidential primary candidates\n",
                "\n",
                "What's a person to do when the Twitter API only lets you go back so far? Scraping to the rescue! And luckily we can use a *library* to scrape instead of having to write something manually."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Introducing GetOldTweets3\n",
                "\n",
                "We'll be using the adorably-named [GetOldTweets3](https://github.com/Mottl/GetOldTweets3) library to acquire the Twitter history of the candidates in the Democratic presidential primary. We *could* use the [Twitter API](https://developer.twitter.com/en/docs), but unfortunately it doesn't let you go all the way back to the begining.\n",
                "\n",
                "GetOldTweets3, though, will help you get each and every tweet from 2019 by scraping each user's public timeline."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "#!pip install GetOldTweets3"
            ]
        },
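        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Before we scrape everything, here's a minimal sketch of how the library works, based on its README: you build a `TweetCriteria`, pass it to `TweetManager.getTweets`, and get back a list of tweet objects. The username and `setMaxTweets(5)` below are just placeholders for a quick test, not part of the real scrape."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# A minimal sketch: fetch a handful of tweets for one example user\n",
                "import GetOldTweets3 as got\n",
                "\n",
                "criteria = got.manager.TweetCriteria().setUsername(\"joebiden\").setMaxTweets(5)\n",
                "sample_tweets = got.manager.TweetManager.getTweets(criteria)\n",
                "\n",
                "for tweet in sample_tweets:\n",
                "    # Each tweet object exposes attributes like .date and .text\n",
                "    print(tweet.date, tweet.text[:80])"
            ]
        },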
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Scraping our tweets\n",
                "\n",
                "We're going to start with a list of usernames we're interested in, then loop through each one and use GetOldTweets3 to save the tweets into a CSV file named after the username."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 60,
            "metadata": {},
            "outputs": [],
            "source": [
                "usernames = [\n",
                "    'joebiden', 'corybooker','petebuttigieg','juliancastro','kamalaharris',\n",
                "    'amyklobuchar','betoorourke','berniesanders','ewarren','andrewyang',\n",
                "    'michaelbennet','governorbullock','billdeblasio','johndelaney',\n",
                "    'tulsigabbard','waynemessam','timryan','joesestak','tomsteyer',\n",
                "    'marwilliamson','sengillibrand','hickenlooper','jayinslee',\n",
                "    'sethmoulton','ericswalwell'\n",
                "]"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 61,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "Downloading for joebiden\n",
                        "(859, 15)\n",
                        "Downloading for corybooker\n",
                        "(1317, 15)\n",
                        "Downloading for petebuttigieg\n",
                        "(866, 15)\n",
                        "Downloading for juliancastro\n",
                        "(1231, 15)\n",
                        "Downloading for kamalaharris\n",
                        "(2114, 15)\n",
                        "Downloading for amyklobuchar\n",
                        "(1405, 15)\n",
                        "Downloading for betoorourke\n",
                        "(1683, 15)\n",
                        "Downloading for berniesanders\n",
                        "(1881, 15)\n",
                        "Downloading for ewarren\n",
                        "(2571, 15)\n",
                        "Downloading for andrewyang\n",
                        "(4475, 15)\n",
                        "Downloading for michaelbennet\n",
                        "(906, 15)\n",
                        "Downloading for governorbullock\n",
                        "(1722, 15)\n",
                        "Downloading for billdeblasio\n",
                        "(500, 15)\n",
                        "Downloading for johndelaney\n",
                        "(1921, 15)\n",
                        "Downloading for tulsigabbard\n",
                        "(900, 15)\n",
                        "Downloading for waynemessam\n",
                        "(817, 15)\n",
                        "Downloading for timryan\n",
                        "(1486, 15)\n",
                        "Downloading for joesestak\n",
                        "(621, 15)\n",
                        "Downloading for tomsteyer\n",
                        "(1279, 15)\n",
                        "Downloading for marwilliamson\n",
                        "(2637, 15)\n",
                        "Downloading for sengillibrand\n",
                        "(1538, 15)\n",
                        "Downloading for hickenlooper\n",
                        "(973, 15)\n",
                        "Downloading for jayinslee\n",
                        "(2128, 15)\n",
                        "Downloading for sethmoulton\n",
                        "(1242, 15)\n",
                        "Downloading for ericswalwell\n",
                        "(1717, 15)\n"
                    ]
                }
            ],
            "source": [
                "import GetOldTweets3 as got\n",
                "\n",
                "def download_tweets(username):\n",
                "    print(f\"Downloading for {username}\")\n",
                "    tweetCriteria = got.manager.TweetCriteria().setUsername(username)\\\n",
                "                                               .setSince(\"2019-01-01\")\\\n",
                "                                               .setUntil(\"2019-09-01\")\\\n",
                "\n",
                "    tweets = got.manager.TweetManager.getTweets(tweetCriteria)\n",
                "    df = pd.DataFrame([tweet.__dict__ for tweet in tweets])\n",
                "    print(df.shape)\n",
                "    df.to_csv(f\"data/tweets-raw-{username}.csv\", index=False)\n",
                "    \n",
                "for username in usernames:\n",
                "    download_tweets(username)"
            ]
        },
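        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "If the scrape dies partway through (a flaky connection, or Twitter temporarily blocking requests), you don't want to lose the whole run. A hedged variation on the loop above - nothing GetOldTweets3-specific, just a `try`/`except` and a pause between users - might look like this. The 10-second delay is an arbitrary politeness guess, not a documented rate limit."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# A sketch of a more forgiving loop: skip users that fail and wait between requests\n",
                "import time\n",
                "\n",
                "for username in usernames:\n",
                "    try:\n",
                "        download_tweets(username)\n",
                "    except Exception as e:\n",
                "        print(f\"Skipping {username}: {e}\")\n",
                "    # Arbitrary pause between users, just to avoid hammering the site\n",
                "    time.sleep(10)"
            ]
        },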
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Combining our files\n",
                "\n",
                "We don't want to operate on these tweets in separate files, though - we'd rather have them all in one file! We'll finish up our data scraping by combining all of the tweets into one file.\n",
                "\n",
                "We'll start by using the [glob library](https://pymotw.com/3/glob/) to get a list of the filenames."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 2,
            "metadata": {},
            "outputs": [
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": [
                        "['data/tweets-kamalaharris.csv', 'data/tweets-tomsteyer.csv', 'data/tweets-betoorourke.csv', 'data/tweets-amyklobuchar.csv', 'data/tweets-billdeblasio.csv', 'data/tweets-joebiden.csv', 'data/tweets-petebuttigieg.csv', 'data/tweets-sethmoulton.csv', 'data/tweets-joesestak.csv', 'data/tweets-juliancastro.csv', 'data/tweets-tulsigabbard.csv', 'data/tweets-waynemessam.csv', 'data/tweets-marwilliamson.csv', 'data/tweets-governorbullock.csv', 'data/tweets-jayinslee.csv', 'data/tweets-hickenlooper.csv', 'data/tweets-sengillibrand.csv', 'data/tweets-ericswalwell.csv', 'data/tweets-johndelaney.csv', 'data/tweets-corybooker.csv', 'data/tweets-michaelbennet.csv', 'data/tweets-timryan.csv', 'data/tweets-ewarren.csv', 'data/tweets-berniesanders.csv', 'data/tweets-andrewyang.csv']\n"
                    ]
                }
            ],
            "source": [
                "import glob\n",
                "\n",
                "filenames = glob.glob(\"data/tweets-raw-*.csv\")\n",
                "print(filenames)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "We'll then use a list comprehension to turn each filename into a datframe, then `pd.concat` to combine them together."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 4,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/plain": [
                            "(38789, 15)"
                        ]
                    },
                    "execution_count": 4,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "import pandas as pd\n",
                "\n",
                "dataframes = [pd.read_csv(filename) for filename in filenames]\n",
                "df = pd.concat(dataframes)\n",
                "df.shape"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Let's pull a sample to make sure it looks like we think it should..."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 6,
            "metadata": {},
            "outputs": [
                {
                    "data": {
                        "text/html": [
                            "<div>\n",
                            "<style scoped>\n",
                            "    .dataframe tbody tr th:only-of-type {\n",
                            "        vertical-align: middle;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe tbody tr th {\n",
                            "        vertical-align: top;\n",
                            "    }\n",
                            "\n",
                            "    .dataframe thead th {\n",
                            "        text-align: right;\n",
                            "    }\n",
                            "</style>\n",
                            "<table border=\"1\" class=\"dataframe\">\n",
                            "  <thead>\n",
                            "    <tr style=\"text-align: right;\">\n",
                            "      <th></th>\n",
                            "      <th>username</th>\n",
                            "      <th>to</th>\n",
                            "      <th>text</th>\n",
                            "      <th>retweets</th>\n",
                            "      <th>favorites</th>\n",
                            "      <th>replies</th>\n",
                            "      <th>id</th>\n",
                            "      <th>permalink</th>\n",
                            "      <th>author_id</th>\n",
                            "      <th>date</th>\n",
                            "      <th>formatted_date</th>\n",
                            "      <th>hashtags</th>\n",
                            "      <th>mentions</th>\n",
                            "      <th>geo</th>\n",
                            "      <th>urls</th>\n",
                            "    </tr>\n",
                            "  </thead>\n",
                            "  <tbody>\n",
                            "    <tr>\n",
                            "      <th>247</th>\n",
                            "      <td>TimRyan</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>Hate, racism, white nationalism is terrorizing...</td>\n",
                            "      <td>88</td>\n",
                            "      <td>401</td>\n",
                            "      <td>198</td>\n",
                            "      <td>1158019752931074049</td>\n",
                            "      <td>https://twitter.com/TimRyan/status/11580197529...</td>\n",
                            "      <td>466532637</td>\n",
                            "      <td>2019-08-04 14:19:58+00:00</td>\n",
                            "      <td>Sun Aug 04 14:19:58 +0000 2019</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>681</th>\n",
                            "      <td>amyklobuchar</td>\n",
                            "      <td>washingtonpost</td>\n",
                            "      <td>We need to see the full report in order to pro...</td>\n",
                            "      <td>518</td>\n",
                            "      <td>1813</td>\n",
                            "      <td>240</td>\n",
                            "      <td>1126198087213563905</td>\n",
                            "      <td>https://twitter.com/amyklobuchar/status/112619...</td>\n",
                            "      <td>33537967</td>\n",
                            "      <td>2019-05-08 18:52:02+00:00</td>\n",
                            "      <td>Wed May 08 18:52:02 +0000 2019</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>https://twitter.com/washingtonpost/status/1126...</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>647</th>\n",
                            "      <td>GovernorBullock</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>McConnell has stood in the way of American pro...</td>\n",
                            "      <td>89</td>\n",
                            "      <td>392</td>\n",
                            "      <td>157</td>\n",
                            "      <td>1149037399030272011</td>\n",
                            "      <td>https://twitter.com/GovernorBullock/status/114...</td>\n",
                            "      <td>111721601</td>\n",
                            "      <td>2019-07-10 19:27:18+00:00</td>\n",
                            "      <td>Wed Jul 10 19:27:18 +0000 2019</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>@AmyMcGrathKY</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>http://bit.ly/2SdvJP6</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>543</th>\n",
                            "      <td>ericswalwell</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>$1 could be the difference between 4 more year...</td>\n",
                            "      <td>327</td>\n",
                            "      <td>893</td>\n",
                            "      <td>877</td>\n",
                            "      <td>1133149644014391303</td>\n",
                            "      <td>https://twitter.com/ericswalwell/status/113314...</td>\n",
                            "      <td>377609596</td>\n",
                            "      <td>2019-05-27 23:15:02+00:00</td>\n",
                            "      <td>Mon May 27 23:15:02 +0000 2019</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>https://bit.ly/2EnCLuG</td>\n",
                            "    </tr>\n",
                            "    <tr>\n",
                            "      <th>1423</th>\n",
                            "      <td>marwilliamson</td>\n",
                            "      <td>maidenoftheair</td>\n",
                            "      <td>Hear hear.</td>\n",
                            "      <td>0</td>\n",
                            "      <td>5</td>\n",
                            "      <td>0</td>\n",
                            "      <td>1122503791553630210</td>\n",
                            "      <td>https://twitter.com/marwilliamson/status/11225...</td>\n",
                            "      <td>21522338</td>\n",
                            "      <td>2019-04-28 14:12:13+00:00</td>\n",
                            "      <td>Sun Apr 28 14:12:13 +0000 2019</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "      <td>NaN</td>\n",
                            "    </tr>\n",
                            "  </tbody>\n",
                            "</table>\n",
                            "</div>"
                        ],
                        "text/plain": [
                            "             username              to  \\\n",
                            "247           TimRyan             NaN   \n",
                            "681      amyklobuchar  washingtonpost   \n",
                            "647   GovernorBullock             NaN   \n",
                            "543      ericswalwell             NaN   \n",
                            "1423    marwilliamson  maidenoftheair   \n",
                            "\n",
                            "                                                   text  retweets  favorites  \\\n",
                            "247   Hate, racism, white nationalism is terrorizing...        88        401   \n",
                            "681   We need to see the full report in order to pro...       518       1813   \n",
                            "647   McConnell has stood in the way of American pro...        89        392   \n",
                            "543   $1 could be the difference between 4 more year...       327        893   \n",
                            "1423                                         Hear hear.         0          5   \n",
                            "\n",
                            "      replies                   id  \\\n",
                            "247       198  1158019752931074049   \n",
                            "681       240  1126198087213563905   \n",
                            "647       157  1149037399030272011   \n",
                            "543       877  1133149644014391303   \n",
                            "1423        0  1122503791553630210   \n",
                            "\n",
                            "                                              permalink  author_id  \\\n",
                            "247   https://twitter.com/TimRyan/status/11580197529...  466532637   \n",
                            "681   https://twitter.com/amyklobuchar/status/112619...   33537967   \n",
                            "647   https://twitter.com/GovernorBullock/status/114...  111721601   \n",
                            "543   https://twitter.com/ericswalwell/status/113314...  377609596   \n",
                            "1423  https://twitter.com/marwilliamson/status/11225...   21522338   \n",
                            "\n",
                            "                           date                  formatted_date hashtags  \\\n",
                            "247   2019-08-04 14:19:58+00:00  Sun Aug 04 14:19:58 +0000 2019      NaN   \n",
                            "681   2019-05-08 18:52:02+00:00  Wed May 08 18:52:02 +0000 2019      NaN   \n",
                            "647   2019-07-10 19:27:18+00:00  Wed Jul 10 19:27:18 +0000 2019      NaN   \n",
                            "543   2019-05-27 23:15:02+00:00  Mon May 27 23:15:02 +0000 2019      NaN   \n",
                            "1423  2019-04-28 14:12:13+00:00  Sun Apr 28 14:12:13 +0000 2019      NaN   \n",
                            "\n",
                            "           mentions  geo                                               urls  \n",
                            "247             NaN  NaN                                                NaN  \n",
                            "681             NaN  NaN  https://twitter.com/washingtonpost/status/1126...  \n",
                            "647   @AmyMcGrathKY  NaN                              http://bit.ly/2SdvJP6  \n",
                            "543             NaN  NaN                             https://bit.ly/2EnCLuG  \n",
                            "1423            NaN  NaN                                                NaN  "
                        ]
                    },
                    "execution_count": 6,
                    "metadata": {},
                    "output_type": "execute_result"
                }
            ],
            "source": [
                "df.sample(5)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Looking good! Let's **remove any missing the `text` column** (I don't know why, *but they exist*), and save it so we can analyze it in the next notebook."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {},
            "outputs": [],
            "source": [
                "df = df.dropna(subset=['text'])\n",
                "df.to_csv(\"data/tweets.csv\", index=False)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "## Review\n",
                "\n",
                "In this section we used the `GetOldTweets3` library to download large numbers of tweets that the API could not get us.\n",
                "\n",
                "## Discussion topics\n",
                "\n",
                "We're certainly breaking Twitter's Terms of Service by scraping these tweets. Should we not do it? What are the ethical and legal issues at play?\n",
                "\n",
                "Why are we scraping tweets as opposed to Facebook posts or campaign speeches?"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": []
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "Python 3",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.6.8"
        },
        "toc": {
            "base_numbering": 1,
            "nav_menu": {},
            "number_sections": true,
            "sideBar": true,
            "skip_h1_title": false,
            "title_cell": "Table of Contents",
            "title_sidebar": "Contents",
            "toc_cell": false,
            "toc_position": {},
            "toc_section_display": true,
            "toc_window_display": false
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}