Scraping tweets from Democratic presidential primary candidates#
What's a person to do when the Twitter API only lets you go back so far? Scraping to the rescue! Luckily, we can use a library to do the scraping instead of having to write something by hand.
Introducing GetOldTweets3#
We'll be using the adorably-named GetOldTweets3 library to acquire the Twitter history of the candidates in the Democratic presidential primary. We could use the Twitter API, but unfortunately it doesn't let you go all the way back to the beginning.
GetOldTweets3, though, will help you get each and every tweet from 2019 by scraping each user's public timeline.
#!pip install GetOldTweets3
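Before we download everything, a quick smoke test doesn't hurt. Here's a minimal sketch using the library's TweetCriteria/TweetManager interface; setMaxTweets caps the download so we only pull a handful of tweets (the "joebiden" account is just a convenient example):

import GetOldTweets3 as got

# Quick smoke test: grab just a few tweets from one account
criteria = got.manager.TweetCriteria().setUsername("joebiden").setMaxTweets(5)
tweets = got.manager.TweetManager.getTweets(criteria)
for tweet in tweets:
    print(tweet.date, tweet.text[:80])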
Scraping our tweets#
We're going to start with a list of usernames we're interested in, then loop through each one and use GetOldTweets3 to save the tweets into a CSV file named after the username.
usernames = [
    'joebiden', 'corybooker', 'petebuttigieg', 'juliancastro', 'kamalaharris',
    'amyklobuchar', 'betoorourke', 'berniesanders', 'ewarren', 'andrewyang',
    'michaelbennet', 'governorbullock', 'billdeblasio', 'johndelaney',
    'tulsigabbard', 'waynemessam', 'timryan', 'joesestak', 'tomsteyer',
    'marwilliamson', 'sengillibrand', 'hickenlooper', 'jayinslee',
    'sethmoulton', 'ericswalwell'
]
import pandas as pd
import GetOldTweets3 as got

def download_tweets(username):
    print(f"Downloading for {username}")
    # Limit the search to one user's timeline within our date range
    tweetCriteria = got.manager.TweetCriteria().setUsername(username)\
        .setSince("2019-01-01")\
        .setUntil("2019-09-01")
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    # Each tweet object's attributes become one row of the dataframe
    df = pd.DataFrame([tweet.__dict__ for tweet in tweets])
    print(df.shape)
    df.to_csv(f"data/tweets-raw-{username}.csv", index=False)
for username in usernames:
    download_tweets(username)
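Scraping two dozen timelines takes a while, and a single network hiccup shouldn't wipe out the whole run. As an optional tweak (not part of the original code), you could wrap each download in a try/except so one failing account doesn't stop the loop:

# Optional variant: skip accounts that fail instead of crashing
for username in usernames:
    try:
        download_tweets(username)
    except Exception as e:
        print(f"Skipping {username}: {e}")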
Combining our files#
We don't want to operate on these tweets in separate files, though - we'd rather have them all in one file! We'll finish up our data scraping by combining all of the tweets into one file.
We'll start by using the glob library to get a list of the filenames.
import glob
filenames = glob.glob("data/tweets-raw-*.csv")
print(filenames)
We'll then use a list comprehension to read each file into a dataframe, then pd.concat to combine them into one big dataframe.
import pandas as pd
dataframes = [pd.read_csv(filename) for filename in filenames]
df = pd.concat(dataframes)
df.shape
Let's pull a sample to make sure it looks like we think it should...
df.sample(5)
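Another quick sanity check: counting tweets per account should show every candidate represented. Assuming the scraped columns include username (GetOldTweets3's tweet objects carry one), we can do:

# How many tweets did we collect for each candidate?
df.username.value_counts()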
Looking good! Let's remove any rows that are missing the text column (I don't know why they exist, but they do), and save the result so we can analyze it in the next notebook.
df = df.dropna(subset=['text'])
df.to_csv("data/tweets.csv", index=False)
Review#
In this section we used the GetOldTweets3 library to download a large number of tweets that the Twitter API couldn't give us.
Discussion topics#
We're certainly breaking Twitter's Terms of Service by scraping these tweets. Should we not do it? What are the ethical and legal issues at play?
Why are we scraping tweets as opposed to Facebook posts or campaign speeches?