Cleaning the Sentiment140 data

The Sentiment140 dataset is a collection of 1.6 million tweets that have been tagged as either positive or negative.

Before we clean it, a question: how'd they get so many tagged tweets? If you poke around in their documentation, the answer is hiding right there:

In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative.

That's worth discussing later, but for now let's just clean it up. In this notebook we'll standardize the sentiment column, remove the columns we don't want, and save a smaller sample of the tweets.
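To make the emoticon rule quoted above concrete, here is a minimal sketch of that kind of labeling. The emoticon lists and the `label_by_emoticon` helper are illustrative assumptions, not the lists or code the Sentiment140 authors used, and skipping tweets that contain both kinds of emoticon is also an assumption.

```python
# Illustrative emoticon lists (assumed, not the ones Sentiment140 actually used).
POSITIVE_EMOTICONS = (":)", ":-)", ":D")
NEGATIVE_EMOTICONS = (":(", ":-(")

def label_by_emoticon(text):
    """Return 'positive', 'negative', or None if the tweet can't be labeled."""
    has_positive = any(emoticon in text for emoticon in POSITIVE_EMOTICONS)
    has_negative = any(emoticon in text for emoticon in NEGATIVE_EMOTICONS)
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    # No emoticon, or mixed signals: leave unlabeled (an assumption of this sketch).
    return None

for tweet in ["finally home :)", "missed my bus :(", "just had lunch"]:
    print(tweet, "->", label_by_emoticon(tweet))
```

Notice that the label depends only on the emoticon, not on the words around it, which is exactly why this approach deserves a closer look later.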

Read the tweets in

import pandas as pd

df = pd.read_csv("data/training.1600000.processed.noemoticon.csv",
                names=['polarity', 'id', 'date', 'query', 'user', 'text'],
                encoding='latin-1')
df.head()
   polarity          id                          date     query             user                                               text
0         0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  _TheSpecialOne_  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1         0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY    scotthamilton  is upset that he can't update his Facebook by ...
2         0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY         mattycus  @Kenichan I dived many times for the ball. Man...
3         0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY          ElleCTF     my whole body feels itchy and like its on fire
4         0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY           Karoli  @nationwideclass no, it's not behaving at all....

Update polarity

Right now the polarity column is 0 for negative, 4 for positive. Let's change that to 0 and 1 to make things a little more readable.

df.polarity.value_counts()
4    800000
0    800000
Name: polarity, dtype: int64
df.polarity = df.polarity.replace({0: 0, 4: 1})
df.polarity.value_counts()
1    800000
0    800000
Name: polarity, dtype: int64
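One thing worth knowing about `.replace` is that it leaves any value missing from the mapping untouched, while `.map` turns it into `NaN`. The sketch below uses a small made-up series with a hypothetical stray value of 2 to show the difference. It is only an illustration and says nothing about what the real file contains.

```python
import pandas as pd

# Made-up polarity values; the 2 is a hypothetical stray value for illustration.
polarity = pd.Series([0, 4, 4, 0, 2])
mapping = {0: 0, 4: 1}

# .replace keeps values that are not in the mapping as they are.
replaced = polarity.replace(mapping)

# .map returns NaN for values that are not in the mapping, which makes them easy to spot.
mapped = polarity.map(mapping)

print(replaced.tolist())
print("unmapped rows:", mapped.isna().sum())
```

Since the `value_counts()` output above only shows 0 and 4, either method gives the same result here. `.map` is just the stricter choice if you want surprises to stand out.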

Remove unneeded columns

We don't need all those columns! Let's get rid of everything except the polarity and the text of the tweet.

df = df.drop(columns=['id', 'date', 'query', 'user'])
df.head()
   polarity                                               text
0         0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1         0  is upset that he can't update his Facebook by ...
2         0  @Kenichan I dived many times for the ball. Man...
3         0     my whole body feels itchy and like its on fire
4         0  @nationwideclass no, it's not behaving at all....
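If you already know which columns you want, another option is to skip the extra ones while reading by passing `usecols` to `pd.read_csv`. This is a sketch of that alternative, not what the notebook does. It reads two made-up rows from an in-memory string (an assumption, so the example runs without the real file).

```python
import io
import pandas as pd

# Two made-up rows laid out like the tweets file: six columns, no header row.
raw = io.StringIO(
    "0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,stuck in traffic again\n"
    "4,2,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,lovely morning\n"
)

columns = ['polarity', 'id', 'date', 'query', 'user', 'text']

# names= labels all six columns, usecols= keeps only the two we care about.
small = pd.read_csv(raw, names=columns, usecols=['polarity', 'text'])
print(small)
```

Dropping columns after the fact, as we did above, has the advantage that you get to look at everything with `df.head()` before deciding what to throw away.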

Sample

To make the file size a little smaller and pandas a little happier, let's knock this down to 500,000 tweets.

df = df.sample(n=500000)
df.polarity.value_counts()
0    250275
1    249725
Name: polarity, dtype: int64
df.to_csv("data/sentiment140-subset.csv", index=False)
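Because `df.sample` picks rows at random, the split between positive and negative lands close to even but not exactly, and it changes on every run. The sketch below shows two optional variations on a made-up toy dataframe: passing `random_state` to get the same rows each time, and sampling within each polarity group to get an exact balance. The seed value of 42 and the toy data are assumptions for illustration.

```python
import pandas as pd

# Made-up toy data: 50 negative and 50 positive "tweets".
toy = pd.DataFrame({
    'polarity': [0] * 50 + [1] * 50,
    'text': [f"tweet {i}" for i in range(100)],
})

# Same seed, same rows: the sample becomes repeatable.
first = toy.sample(n=20, random_state=42)
second = toy.sample(n=20, random_state=42)
print(first.index.equals(second.index))

# Sampling inside each group gives exactly 10 rows per polarity.
balanced = toy.groupby('polarity').sample(n=10, random_state=42)
print(balanced['polarity'].value_counts())
```

For the real data, the equivalent balanced sample would take 250,000 rows per group. Whether an exact balance matters depends on what you do with the subset afterward.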

Review

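In this notebook, we cleaned up the Sentiment140 tweet dataset, a collection of 1.6 million tweets marked as either positive or negative sentiment. We changed the polarity column from 0 and 4 to 0 and 1, dropped the columns we didn't need, and saved a random sample of 500,000 tweets to a new CSV file.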