Does adding more data make our sentiment classifier more accurate?#

Last time we worked with a few tens of thousands of tweets to see whether we could predict if a tweet was positive or negative. We weren't necessarily impressed with our performance - our best classifier hit around 75% accuracy. That means one out of every four results is wrong!

We'd like to think that the more examples our classifier sees, the better it'll perform. Let's upgrade our selection to 500,000 tweets! We'll be using the same dataset from Sentiment140.

import pandas as pd

df = pd.read_csv("data/sentiment140-subset.csv")
df.head()
polarity text
0 0 @kconsidder You never tweet
1 0 Sick today coding from the couch.
2 1 @ChargerJenn Thx for answering so quick,I was ...
3 1 Wii fit says I've lost 10 pounds since last ti...
4 0 @MrKinetik Not a thing!!! I don't really have...
df.shape
(500000, 2)

Polarity is 0 for negative and 1 for positive. We should have roughly equal numbers of each.

df.polarity.value_counts()
0    250275
1    249725
Name: polarity, dtype: int64

Extracting our features#

Just like last time, we're going to use a TfidfVectorizer to count our words. It does a little more than count words - it pays less attention to popular words, and makes adjustments for short vs long tweets - but that's the general idea.
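
If you'd like to see that down-weighting in action, here's a tiny sketch on a made-up three-tweet corpus (the tweets themselves are invented for illustration): a word that appears in every tweet ends up with a lower score than a word that only appears in one. We're using get_feature_names() to match the scikit-learn version in this notebook; newer versions call it get_feature_names_out().

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# A made-up corpus: "love" appears in every tweet, "pizza" in only one
tiny_tweets = [
    "i love this day",
    "i love my dog",
    "i love pizza",
]

tiny_vectorizer = TfidfVectorizer()
tiny_vectors = tiny_vectorizer.fit_transform(tiny_tweets)

# In the last row, "love" gets a lower score than "pizza" because
# "love" shows up in every tweet while "pizza" shows up in just one
pd.DataFrame(tiny_vectors.toarray(),
             columns=tiny_vectorizer.get_feature_names())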

Last time we only looked at 1000 words, but more words have to help, too, right? Let's increase the number of words we're examining to 3000.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=3000)
vectors = vectorizer.fit_transform(df.text)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
words_df.head()
00 000 09 10 100 1000 11 12 13 14 ... youu yr yrs yu yuck yum yummy yup zoo ½s
0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.336949 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 3000 columns

Training our models#

We'll need to figure out what we're predicting, and what we're using to predict it. We're going to be using words to predict polarity. Let's assign these to the variable names that all data people seem to use.

X = words_df
y = df.polarity

We know whether every one of these tweets is positive or negative, but the algorithms don't! Not yet, anyway.

To test how well each algorithm performs, we're going to use some of the tweets to teach the algorithm and keep the rest secret to test it with later.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
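
By default train_test_split holds back 25% of the rows for testing, so with 500,000 tweets we're training on roughly 375,000 and testing on roughly 125,000. If you want to confirm that (just a quick check, nothing required), the shapes will tell you:

# The default test_size is 0.25: about 375,000 training rows and 125,000 test rows
X_train.shape, X_test.shape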

Picking our algorithms#

Last time we used four different algorithms:

  • LogisticRegression
  • RandomForestClassifier
  • LinearSVC
  • MultinomialNB

We have no idea how they work, but we did notice a difference: even though they all had about a 70-75% accuracy rate, the time it took to train them was very, very different!

With a few tens of thousands of tweets, our LogisticRegression and RandomForestClassifier both took well over a minute to train. We can only imagine it would be much, much worse with 500,000, so let's set them aside for now.

The other two - LinearSVC and MultinomialNB - both took under a second with our original dataset, so we can hopefully trust them not to take years on this new, larger set.

from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
%%time
svc = LinearSVC()
svc.fit(X_train, y_train)
CPU times: user 13.3 s, sys: 13.2 s, total: 26.5 s
Wall time: 34.9 s
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
%%time
bayes = MultinomialNB()
bayes.fit(X_train, y_train)
CPU times: user 12.1 s, sys: 23.9 s, total: 36.1 s
Wall time: 36.2 s
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Analyzing their performance#

We'll use a confusion matrix to see how well each algorithm performed. Last time we hit around 70-75% accuracy after training on a few tens of thousands of tweets. What kind of impact does upgrading to 15x as many tweets and 3x as many words have?

from sklearn.metrics import confusion_matrix

SVC#

y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
# Dividing each row by its total converts raw counts into percentages
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Predicted negative Predicted positive
Is negative 0.767197 0.235177
Is positive 0.197144 0.800846

Naive Bayes#

y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Predicted negative Predicted positive
Is negative 0.768645 0.233713
Is positive 0.236958 0.760626
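
The confusion matrices break performance down by class, but if you'd like a single overall accuracy number to compare against last time's ~75%, scikit-learn's accuracy_score will give you one (a quick sketch; the exact values depend on your train/test split):

from sklearn.metrics import accuracy_score

# Overall fraction of the held-out tweets each model classified correctly
accuracy_score(y_test, svc.predict(X_test)), accuracy_score(y_test, bayes.predict(X_test))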

In predicting positive tweets, LinearSVC won, improving from about 76% last time to about 80% now. MultinomialNB went from about 72% to about 76%.

In predicting negative tweets, though, MultinomialNB and LinearSVC both hit around 77%, which is just about the same as last time.

Note that your numbers might be slightly different! Between the randomized nature of machine learning algorithms and getting a different selection in the test/train split, we're only able to talk about rough estimates of performance.
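
If you want your own numbers to at least stay consistent between runs, you can pin the randomness with random_state (a quick sketch; 42 is an arbitrary choice):

# Passing random_state makes the split (and LinearSVC's internal shuffling)
# reproducible, so re-running the notebook gives the same numbers
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
svc = LinearSVC(random_state=42)
svc.fit(X_train, y_train)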

Review#

Last time we trained our sentiment analysis algorithms on tens of thousands of tweets, and we hoped that by increasing the amount of data we analyzed we'd also increase accuracy. We upgraded to hundreds of thousands of tweets, along with looking at more total words.

We only used two of the faster algorithms from last time, as the others would probably be too slow to finish in a reasonable amount of time. We still don't know the difference between them, but we're only concerned with the output at the moment.

This new approach took much more data and more training time, but it did improve performance by a few percentage points. We aren't anywhere near perfect, though: our best approach correctly identified 80% of positive tweets.

Discussion topics#

We grew our dataset by 15x and the vocabulary we were looking at by 3x - were the changes worth it?

The worst part about building a classifier is either finding a tagged dataset or building one yourself. In this case we downloaded the tweets from Sentiment140, and they were already marked as either positive or negative. If this were "real life," though, we'd probably have to put an army of interns on the task!

Sentiment140 tweets were automatically tagged based on the presence of :) or :( in the tweet. Does this seem reasonable?

Is a 4% gain in accuracy worth the tradeoff of having to acquire 15x more data and spend 60x more time training your model? When might it be worth it, and when might it not be?

Is 80% accuracy good? Do your feelings change if the performance is described as "incorrect one out of every five times?" What would your accuracy be for a random guess?

Going from 30k to 500k examples only got us a few percentage points of improvement: do you think this is always the case, or is this something special about tweets? Do you think product reviews would be the same?

How should we feel about not understanding the difference between the algorithms we're using or casting aside? Is knowing their training time and performance a good enough substitute for understanding what's going on inside?