Scraping tweets from Democratic presidential primary candidates#
What's a person to do when the Twitter API only lets you go back so far? Scraping to the rescue! Luckily, we can use a library to do the scraping instead of having to write something by hand.
Introducing GetOldTweets3#
We'll be using the adorably-named GetOldTweets3 library to acquire the Twitter history of the candidates in the Democratic presidential primary. We could use the Twitter API, but unfortunately it doesn't let you go all the way back to the beginning.
GetOldTweets3, though, will help you get each and every tweet from 2019 by scraping each user's public timeline.
#!pip install GetOldTweets3
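Before we download everything, a quick smoke test doesn't hurt. Here's a minimal sketch using the library's TweetCriteria/TweetManager interface; setMaxTweets caps the download so we only pull a handful of tweets (the "joebiden" account is just a convenient example):

import GetOldTweets3 as got

# Quick smoke test: grab just a few tweets from one account
criteria = got.manager.TweetCriteria().setUsername("joebiden").setMaxTweets(5)
tweets = got.manager.TweetManager.getTweets(criteria)
for tweet in tweets:
    print(tweet.date, tweet.text[:80])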
Scraping our tweets#
We're going to start with a list of usernames we're interested in, then loop through each one and use GetOldTweets3 to save the tweets into a CSV file named after the username.
usernames = [
    'joebiden', 'corybooker', 'petebuttigieg', 'juliancastro', 'kamalaharris',
    'amyklobuchar', 'betoorourke', 'berniesanders', 'ewarren', 'andrewyang',
    'michaelbennet', 'governorbullock', 'billdeblasio', 'johndelaney',
    'tulsigabbard', 'waynemessam', 'timryan', 'joesestak', 'tomsteyer',
    'marwilliamson', 'sengillibrand', 'hickenlooper', 'jayinslee',
    'sethmoulton', 'ericswalwell'
]
import pandas as pd
import GetOldTweets3 as got

def download_tweets(username):
    print(f"Downloading for {username}")
    # Limit the search to one user's timeline within our date range
    tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
        .setSince("2019-01-01")\
        .setUntil("2019-09-01")
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    # Each tweet object's attributes become one row of the dataframe
    df = pd.DataFrame([tweet.__dict__ for tweet in tweets])
    print(df.shape)
    df.to_csv(f"data/tweets-raw-{username}.csv", index=False)
for username in usernames:
    download_tweets(username)
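Scraping two dozen timelines takes a while, and a single network hiccup shouldn't wipe out the whole run. As an optional tweak (not part of the original code), you could wrap each download in a try/except so one failing account doesn't stop the loop:

# Optional variant: skip accounts that fail instead of crashing
for username in usernames:
    try:
        download_tweets(username)
    except Exception as e:
        print(f"Skipping {username}: {e}")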
Combining our files#
We don't want to operate on these tweets in separate files, though - we'd rather have them all in one file! We'll finish up our data scraping by combining all of the tweets into one file.
We'll start by using the glob library to get a list of the filenames.
import glob
filenames = glob.glob("data/tweets-raw-*.csv")
print(filenames)
We'll then use a list comprehension to read each file into a dataframe, then pd.concat to combine them into one big dataframe.
import pandas as pd
dataframes = [pd.read_csv(filename) for filename in filenames]
df = pd.concat(dataframes)
df.shape
Let's pull a sample to make sure it looks like we think it should...
df.sample(5)
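Another quick sanity check: counting tweets per account should show every candidate represented. Assuming the scraped columns include username (GetOldTweets3's tweet objects carry one), we can do:

# How many tweets did we collect for each candidate?
df.username.value_counts()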
Looking good! Let's remove any rows that are missing the text column (I don't know why they exist, but they do), and save the result so we can analyze it in the next notebook.
df = df.dropna(subset=['text'])
df.to_csv("data/tweets.csv", index=False)
Review#
In this section we used the GetOldTweets3 library to download a large number of tweets that the Twitter API couldn't give us.
Discussion topics#
We're certainly breaking Twitter's Terms of Service by scraping these tweets. Should we not do it? What are the ethical and legal issues at play?
Why are we scraping tweets as opposed to Facebook posts or campaign speeches?