Cleaning the Sentiment140 data#
The Sentiment140 dataset is a collection of 1.6 million tweets that have been tagged as either positive or negative.
Before we clean it, a question: how'd they get so many tagged tweets? If you poke around on their documentation, the answer is hiding right here:
In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative.
That's worth discussing later, but for now let's just clean it up. In this notebook we'll remove the columns we don't want and standardize the sentiment column.
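That emoticon heuristic is easy to sketch. Here's a minimal, hypothetical version of the idea; the `label_tweet` function and the emoticon lists are assumptions for illustration, not Sentiment140's actual code:

```python
# Hypothetical sketch of emoticon-based labeling, as described in the
# Sentiment140 documentation. Emoticon lists here are illustrative only.
POSITIVE = [":)", ":-)", ":D", "=)"]
NEGATIVE = [":(", ":-("]

def label_tweet(text):
    """Return 1 for positive, 0 for negative, None if no emoticon is found."""
    if any(e in text for e in POSITIVE):
        return 1
    if any(e in text for e in NEGATIVE):
        return 0
    return None

print(label_tweet("just aced my exam :)"))  # 1
print(label_tweet("missed the bus :("))     # 0
```

Note that the file we load below is the "noemoticon" version of the dataset, where the emoticons themselves have been stripped from the tweet text, presumably so a classifier can't just memorize the labeling signal.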
Read the tweets in#
import pandas as pd
df = pd.read_csv("data/training.1600000.processed.noemoticon.csv",
                 names=['polarity', 'id', 'date', 'query', 'user', 'text'],
                 encoding='latin-1')
df.head()
Update polarity#
Right now the polarity column is 0 for negative and 4 for positive. Let's change that to 0 and 1 to make things a little more readable.
df.polarity.value_counts()
# Map 4 (positive) to 1; 0 (negative) stays 0
df.polarity = df.polarity.replace({4: 1})
df.polarity.value_counts()
Remove unneeded columns#
We don't need all those columns! Let's get rid of the ones that won't affect the sentiment.
df = df.drop(columns=['id', 'date', 'query', 'user'])
df.head()
Sample#
To make the file size a little smaller and pandas a little happier, let's knock this down to 500,000 tweets.
# random_state makes the sample reproducible
df = df.sample(n=500000, random_state=42)
df.polarity.value_counts()
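The full dataset is evenly split between the two classes, so a plain random sample stays roughly balanced. If you wanted an exactly balanced subset instead, one option (a sketch with a toy DataFrame, not what we did above) is to sample an equal number of rows per class:

```python
import pandas as pd

# Toy stand-in for the tweets DataFrame; in the notebook this is `df`.
df = pd.DataFrame({
    'polarity': [0, 0, 0, 0, 1, 1, 1, 1],
    'text': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
})

# Draw the same number of rows from each polarity group.
balanced = df.groupby('polarity').sample(n=2, random_state=0)
print(balanced.polarity.value_counts())
```

`groupby(...).sample(...)` requires pandas 1.1 or newer.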
df.to_csv("data/sentiment140-subset.csv", index=False)
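One reason for `index=False`: without it, pandas also writes the row index, which reappears as an extra unnamed column when the file is read back. A self-contained toy round trip (temporary file, illustrative data) shows the clean version:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the tweet subset.
df = pd.DataFrame({'polarity': [0, 1], 'text': ['bad day', 'great day']})

path = os.path.join(tempfile.mkdtemp(), "subset.csv")
df.to_csv(path, index=False)  # index=False: don't write the row index

reloaded = pd.read_csv(path)
print(list(reloaded.columns))  # ['polarity', 'text'] -- no stray index column
```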
Review#
In this section, we cleaned up the Sentiment140 dataset, a collection of 1.6 million tweets labeled as either positive or negative sentiment. We standardized the polarity column to 0/1, dropped the columns that don't affect sentiment, and saved a 500,000-tweet sample for later use.