Cleaning the Sentiment140 data#
The Sentiment140 dataset is a collection of 1.6 million tweets that have been tagged as either positive or negative.
Before we clean it, a question: how'd they get so many tagged tweets? If you poke around on their documentation, the answer is hiding right here:
In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative.
That's worth discussing later, but for now let's just clean it up. In this notebook we'll remove the columns we don't want and standardize the sentiment column.
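That emoticon heuristic is easy to sketch. Here's a minimal, hypothetical version of the idea; the `label_tweet` function and the emoticon lists are assumptions for illustration, not Sentiment140's actual code:

```python
# Hypothetical sketch of emoticon-based labeling, as described in the
# Sentiment140 documentation. Emoticon lists here are illustrative only.
POSITIVE = [":)", ":-)", ":D", "=)"]
NEGATIVE = [":(", ":-("]

def label_tweet(text):
    """Return 1 for positive, 0 for negative, None if no emoticon is found."""
    if any(e in text for e in POSITIVE):
        return 1
    if any(e in text for e in NEGATIVE):
        return 0
    return None

print(label_tweet("just aced my exam :)"))  # 1
print(label_tweet("missed the bus :("))     # 0
```

Note that the file we load below is the "noemoticon" version of the dataset, where the emoticons themselves have been stripped from the tweet text, presumably so a classifier can't just memorize the labeling signal.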
Read the tweets in#
import pandas as pd
df = pd.read_csv("data/training.1600000.processed.noemoticon.csv",
                 names=['polarity', 'id', 'date', 'query', 'user', 'text'],
                 encoding='latin-1')
df.head()
Update polarity#
Right now the polarity column is 0 for negative and 4 for positive. Let's change that to 0 and 1 to make things a little more readable.
df.polarity.value_counts()
# Map 4 (positive) to 1; 0 (negative) stays 0
df.polarity = df.polarity.replace({4: 1})
df.polarity.value_counts()
Remove unneeded columns#
We don't need all those columns! Let's get rid of the ones that won't affect the sentiment.
df = df.drop(columns=['id', 'date', 'query', 'user'])
df.head()
Sample#
To make the file size a little smaller and pandas a little happier, let's knock this down to 500,000 tweets.
# random_state makes the sample reproducible
df = df.sample(n=500000, random_state=42)
df.polarity.value_counts()
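The full dataset is evenly split between the two classes, so a plain random sample stays roughly balanced. If you wanted an exactly balanced subset instead, one option (a sketch with a toy DataFrame, not what we did above) is to sample an equal number of rows per class:

```python
import pandas as pd

# Toy stand-in for the tweets DataFrame; in the notebook this is `df`.
df = pd.DataFrame({
    'polarity': [0, 0, 0, 0, 1, 1, 1, 1],
    'text': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
})

# Draw the same number of rows from each polarity group.
balanced = df.groupby('polarity').sample(n=2, random_state=0)
print(balanced.polarity.value_counts())
```

`groupby(...).sample(...)` requires pandas 1.1 or newer.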
df.to_csv("data/sentiment140-subset.csv", index=False)
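One reason for `index=False`: without it, pandas also writes the row index, which reappears as an extra unnamed column when the file is read back. A self-contained toy round trip (temporary file, illustrative data) shows the clean version:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the tweet subset.
df = pd.DataFrame({'polarity': [0, 1], 'text': ['bad day', 'great day']})

path = os.path.join(tempfile.mkdtemp(), "subset.csv")
df.to_csv(path, index=False)  # index=False: don't write the row index

reloaded = pd.read_csv(path)
print(list(reloaded.columns))  # ['polarity', 'text'] -- no stray index column
```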
Review#
In this section, we cleaned up the Sentiment140 dataset, a collection of 1.6 million tweets labeled as either positive or negative sentiment. We standardized the polarity column to 0/1, dropped the columns that don't affect sentiment, and saved a 500,000-tweet sample for later use.