Scraping tweets from Democratic presidential primary candidates

What's a person to do when the Twitter API only lets you go back so far? Scraping to the rescue! And luckily we can use a library to scrape instead of having to write something manually.

Introducing GetOldTweets3

We'll be using the adorably named GetOldTweets3 library to acquire the Twitter history of the candidates in the Democratic presidential primary. The official Twitter API would be the obvious choice, but it limits how far back into an account's history you can go.

GetOldTweets3, though, scrapes each user's public timeline directly, so we can collect everything each candidate posted in our window: January 1 to September 1, 2019.

#!pip install GetOldTweets3

Scraping our tweets

We're going to start with a list of usernames we're interested in, then loop through each one and use GetOldTweets3 to save the tweets into a CSV file named after the username.

usernames = [
    'joebiden', 'corybooker', 'petebuttigieg', 'juliancastro', 'kamalaharris',
    'amyklobuchar', 'betoorourke', 'berniesanders', 'ewarren', 'andrewyang',
    'michaelbennet', 'governorbullock', 'billdeblasio', 'johndelaney',
    'tulsigabbard', 'waynemessam', 'timryan', 'joesestak', 'tomsteyer',
    'marwilliamson', 'sengillibrand', 'hickenlooper', 'jayinslee',
    'sethmoulton', 'ericswalwell'
]
import GetOldTweets3 as got
import pandas as pd

def download_tweets(username):
    print(f"Downloading for {username}")
    # Everything this user tweeted between Jan 1 and Sep 1, 2019
    tweetCriteria = (
        got.manager.TweetCriteria()
        .setUsername(username)
        .setSince("2019-01-01")
        .setUntil("2019-09-01")
    )

    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    # Each tweet object's attributes become one row of the dataframe
    df = pd.DataFrame([tweet.__dict__ for tweet in tweets])
    print(df.shape)
    df.to_csv(f"data/tweets-raw-{username}.csv", index=False)
    
for username in usernames:
    download_tweets(username)
Downloading for joebiden
(859, 15)
Downloading for corybooker
(1317, 15)
Downloading for petebuttigieg
(866, 15)
Downloading for juliancastro
(1231, 15)
Downloading for kamalaharris
(2114, 15)
Downloading for amyklobuchar
(1405, 15)
Downloading for betoorourke
(1683, 15)
Downloading for berniesanders
(1881, 15)
Downloading for ewarren
(2571, 15)
Downloading for andrewyang
(4475, 15)
Downloading for michaelbennet
(906, 15)
Downloading for governorbullock
(1722, 15)
Downloading for billdeblasio
(500, 15)
Downloading for johndelaney
(1921, 15)
Downloading for tulsigabbard
(900, 15)
Downloading for waynemessam
(817, 15)
Downloading for timryan
(1486, 15)
Downloading for joesestak
(621, 15)
Downloading for tomsteyer
(1279, 15)
Downloading for marwilliamson
(2637, 15)
Downloading for sengillibrand
(1538, 15)
Downloading for hickenlooper
(973, 15)
Downloading for jayinslee
(2128, 15)
Downloading for sethmoulton
(1242, 15)
Downloading for ericswalwell
(1717, 15)
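
Scraping 25 accounts takes a while, and any single request can fail partway through, so it can help to make the loop restartable. The sketch below is not part of the original notebook: `download_all` and `fetch` are hypothetical names, and `fetch` stands in for whatever returns the tweet objects (for example, a small function wrapping `got.manager.TweetManager.getTweets`). It skips usernames whose CSV already exists and keeps going when one account fails.

```python
import os
import pandas as pd

def download_all(usernames, fetch, out_dir="data"):
    """Save one CSV per username, skipping files that already exist.

    `fetch` is a hypothetical callable: username -> list of tweet objects.
    Returns the usernames that failed so they can be retried later.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for username in usernames:
        path = os.path.join(out_dir, f"tweets-raw-{username}.csv")
        if os.path.exists(path):
            # Already downloaded on an earlier run
            continue
        try:
            tweets = fetch(username)
        except Exception as error:
            print(f"Failed for {username}: {error}")
            failed.append(username)
            continue
        # Each tweet object's attributes become one row
        df = pd.DataFrame([tweet.__dict__ for tweet in tweets])
        df.to_csv(path, index=False)
    return failed
```

Calling it a second time with the returned list retries only the accounts that failed, without re-scraping the ones that already worked.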

Combining our files

We don't want to operate on these tweets in separate files, though - we'd rather have them all in one file! We'll finish up our data scraping by combining all of the tweets into one file.

We'll start by using the glob library to get a list of the filenames.

import glob

filenames = glob.glob("data/tweets-raw-*.csv")
print(filenames)
['data/tweets-raw-kamalaharris.csv', 'data/tweets-raw-tomsteyer.csv', 'data/tweets-raw-betoorourke.csv', 'data/tweets-raw-amyklobuchar.csv', 'data/tweets-raw-billdeblasio.csv', 'data/tweets-raw-joebiden.csv', 'data/tweets-raw-petebuttigieg.csv', 'data/tweets-raw-sethmoulton.csv', 'data/tweets-raw-joesestak.csv', 'data/tweets-raw-juliancastro.csv', 'data/tweets-raw-tulsigabbard.csv', 'data/tweets-raw-waynemessam.csv', 'data/tweets-raw-marwilliamson.csv', 'data/tweets-raw-governorbullock.csv', 'data/tweets-raw-jayinslee.csv', 'data/tweets-raw-hickenlooper.csv', 'data/tweets-raw-sengillibrand.csv', 'data/tweets-raw-ericswalwell.csv', 'data/tweets-raw-johndelaney.csv', 'data/tweets-raw-corybooker.csv', 'data/tweets-raw-michaelbennet.csv', 'data/tweets-raw-timryan.csv', 'data/tweets-raw-ewarren.csv', 'data/tweets-raw-berniesanders.csv', 'data/tweets-raw-andrewyang.csv']
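
glob returns filenames in arbitrary order, which is why the list above isn't alphabetical. If a stable order matters, or you want to know which username a file belongs to, something like the following sketch works. `username_from_path` is a hypothetical helper, and it assumes the `tweets-raw-<username>.csv` naming used when the files were saved.

```python
import os

def username_from_path(path):
    """Pull the username out of a path like data/tweets-raw-joebiden.csv."""
    name = os.path.basename(path)               # tweets-raw-joebiden.csv
    stem, _extension = os.path.splitext(name)   # tweets-raw-joebiden
    # Assumes the name starts with the "tweets-raw-" prefix
    return stem[len("tweets-raw-"):]

paths = ["data/tweets-raw-timryan.csv", "data/tweets-raw-ewarren.csv"]
# sorted() gives the same order on every machine
print([username_from_path(path) for path in sorted(paths)])
```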

We'll then use a list comprehension to turn each filename into a dataframe, then pd.concat to combine them together.

import pandas as pd

dataframes = [pd.read_csv(filename) for filename in filenames]
df = pd.concat(dataframes)
df.shape
(38789, 15)
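
As a sanity check, the 25 per-candidate row counts printed during the download add up to 38,789, so nothing was lost in the merge. One detail worth knowing: pd.concat keeps each file's original row numbers, so the combined index repeats (every file has its own row 0). That's fine for what we're doing here. The toy example below, using made-up data rather than the real tweets, shows the difference if you pass ignore_index=True.

```python
import pandas as pd

# Two tiny made-up dataframes standing in for two candidates' CSVs
first = pd.DataFrame({"username": ["a", "a"], "text": ["one", "two"]})
second = pd.DataFrame({"username": ["b"], "text": ["three"]})

kept = pd.concat([first, second])
fresh = pd.concat([first, second], ignore_index=True)

print(kept.index.tolist())   # index values repeat across files
print(fresh.index.tolist())  # one continuous index

# The combined row count should equal the sum of the parts
assert len(fresh) == len(first) + len(second)
```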

Let's pull a sample to make sure it looks like we think it should...

df.sample(5)
| | username | to | text | retweets | favorites | replies | id | permalink | author_id | date | formatted_date | hashtags | mentions | geo | urls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 247 | TimRyan | NaN | Hate, racism, white nationalism is terrorizing... | 88 | 401 | 198 | 1158019752931074049 | https://twitter.com/TimRyan/status/11580197529... | 466532637 | 2019-08-04 14:19:58+00:00 | Sun Aug 04 14:19:58 +0000 2019 | NaN | NaN | NaN | NaN |
| 681 | amyklobuchar | washingtonpost | We need to see the full report in order to pro... | 518 | 1813 | 240 | 1126198087213563905 | https://twitter.com/amyklobuchar/status/112619... | 33537967 | 2019-05-08 18:52:02+00:00 | Wed May 08 18:52:02 +0000 2019 | NaN | NaN | NaN | https://twitter.com/washingtonpost/status/1126... |
| 647 | GovernorBullock | NaN | McConnell has stood in the way of American pro... | 89 | 392 | 157 | 1149037399030272011 | https://twitter.com/GovernorBullock/status/114... | 111721601 | 2019-07-10 19:27:18+00:00 | Wed Jul 10 19:27:18 +0000 2019 | NaN | @AmyMcGrathKY | NaN | http://bit.ly/2SdvJP6 |
| 543 | ericswalwell | NaN | $1 could be the difference between 4 more year... | 327 | 893 | 877 | 1133149644014391303 | https://twitter.com/ericswalwell/status/113314... | 377609596 | 2019-05-27 23:15:02+00:00 | Mon May 27 23:15:02 +0000 2019 | NaN | NaN | NaN | https://bit.ly/2EnCLuG |
| 1423 | marwilliamson | maidenoftheair | Hear hear. | 0 | 5 | 0 | 1122503791553630210 | https://twitter.com/marwilliamson/status/11225... | 21522338 | 2019-04-28 14:12:13+00:00 | Sun Apr 28 14:12:13 +0000 2019 | NaN | NaN | NaN | NaN |

Looking good! Let's remove any rows that are missing the text column (I don't know why, but they exist), then save the result so we can analyze it in the next notebook.

df = df.dropna(subset=['text'])
df.to_csv("data/tweets.csv", index=False)
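
To see what dropna(subset=['text']) does, and to count how many rows it removes, here is a small sketch with made-up data. Only rows where text is missing are dropped; missing values in other columns, like hashtags, are left alone.

```python
import numpy as np
import pandas as pd

# Made-up rows: one has no text, two have no hashtags
toy = pd.DataFrame({
    "username": ["a", "b", "c"],
    "text": ["hello", np.nan, "world"],
    "hashtags": [np.nan, np.nan, "#x"],
})

cleaned = toy.dropna(subset=["text"])
removed = len(toy) - len(cleaned)
print(f"Removed {removed} of {len(toy)} rows")
```

Running the same before-and-after count on the real dataframe is a quick way to confirm that only a handful of tweets are being thrown away.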

Review

In this section we used the GetOldTweets3 library to download large numbers of tweets that the API could not get us, then combined the per-candidate files into a single CSV for analysis.

Discussion topics

We're certainly breaking Twitter's Terms of Service by scraping these tweets. Should we not do it? What are the ethical and legal issues at play?

Why are we scraping tweets as opposed to Facebook posts or campaign speeches?