Designing your own sentiment analysis tool#

While there are a lot of tools that will automatically give us the sentiment of a piece of text, we learned that they don't always agree! Let's design our own to see both how these tools work internally and how we can test them to see how well they might perform.

I've cleaned the dataset up a bit.

# !pip install scikit-learn

Training on tweets#

Let's say we were going to analyze the sentiment of tweets. If we had a list of tweets that were scored positive vs. negative, we could see which words are usually associated with positive scores and which are usually associated with negative scores.

Luckily, we have Sentiment140 - http://help.sentiment140.com/for-students - a list of 1.6 million tweets along with a score as to whether each one is negative or positive. We'll use it to build our own machine learning classifier that can separate positivity from negativity.

Read in our data#

import pandas as pd

df = pd.read_csv("data/sentiment140-subset.csv", nrows=30000)
df.head()
polarity text
0 0 @kconsidder You never tweet
1 0 Sick today coding from the couch.
2 1 @ChargerJenn Thx for answering so quick,I was ...
3 1 Wii fit says I've lost 10 pounds since last ti...
4 0 @MrKinetik Not a thing!!! I don't really have...

It isn't a very complicated dataset. polarity is whether the tweet is positive (1) or negative (0), and text is the text of the tweet itself.

How many rows do we have?

df.shape
(30000, 2)

How many positive tweets compared to how many negative tweets?

df.polarity.value_counts()
1    15064
0    14936
Name: polarity, dtype: int64

Train our algorithm#

Vectorize our tweets#

Create a TfidfVectorizer and use it to vectorize our tweets. Since we don't have all the time in the world, we should probably use max_features to only take a selection of terms - how about 1000 for now?

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
words_df.head()
10 100 11 12 15 1st 20 2day 2nd 30 ... yesterday yet yo you young your yourself youtube yum yup
0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.334095 0.0 0.0 0.0 0.0 0.0 0.0
1 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
2 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
3 0.427465 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
4 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1000 columns

Setting up our variables#

Because we want to fit in with all the other programmers, we need to create two variables: one called X and one called y.

X is all of our features, the things we use to predict positive or negative. That's going to be our words.

y is all of our labels, the positive or negative rating. We'll use the polarity column for that.

X = words_df
y = df.polarity

Picking an algorithm#

What kind of algorithm do we want? Who knows, we don't know anything about machine learning! Let's just pick ALL OF THEM.

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

Training our algorithms#

When we teach our algorithm what a positive or a negative tweet looks like, this is called training. Training can take different amounts of time based on what kind of algorithm you are using.

%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X, y)
CPU times: user 14.1 s, sys: 341 ms, total: 14.4 s
Wall time: 9.41 s
LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=1000, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X, y)
CPU times: user 51.8 s, sys: 940 ms, total: 52.7 s
Wall time: 1min 16s
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)
CPU times: user 388 ms, sys: 12.9 ms, total: 401 ms
Wall time: 458 ms
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X, y)
CPU times: user 174 ms, sys: 37.1 ms, total: 212 ms
Wall time: 169 ms
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
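
Now that all four models are trained, it can be illuminating to peek at what the logistic regression actually learned. This isn't part of the original exercise, but a minimal sketch like the one below - assuming the logreg and words_df objects from above - lists the words with the largest and smallest coefficients, which are the words most strongly associated with positive and negative tweets.

# Hypothetical peek at the model, not part of the original notebook:
# which words push a tweet toward positive (1) or negative (0)?
coefficients = pd.DataFrame({
    'word': words_df.columns,
    'coefficient': logreg.coef_[0]
})

# The largest coefficients lean positive, the smallest lean negative
print(coefficients.sort_values('coefficient', ascending=False).head(10))
print(coefficients.sort_values('coefficient').head(10))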

How long did each take to train? How much faster were some compared to others?
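
The %%time cell magic only works inside a notebook. If you ever want to compare training times in a plain Python script, a rough sketch along these lines - re-fitting the same models with the standard library's time module - would do the same job.

import time

# Hypothetical timing loop, equivalent to the %%time cells above
models = {
    'logistic regression': logreg,
    'random forest': forest,
    'linear SVC': svc,
    'naive bayes': bayes,
}

for name, model in models.items():
    start = time.perf_counter()
    model.fit(X, y)
    print(f"{name} took {time.perf_counter() - start:.1f} seconds to train")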

Use our models#

Now that we've trained our models, they can try to predict whether some content is positive or negative.

Preparing the data#

Add a few more sentences below. They should be a mix of positive and negative. They can be boring, they can be exciting, they can be short, they can be long.

# Create some test data

pd.set_option("display.max_colwidth", 200)

unknown = pd.DataFrame({'content': [
    "I love love love love this kitten",
    "I hate hate hate hate this keyboard",
    "I'm not sure how I feel about toast",
    "Did you see the baseball game yesterday?",
    "The package was delivered late and the contents were broken",
    "Trashy television shows are some of my favorites",
    "I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
    "I find chirping birds irritating, but I know I'm not the only one",
]})
unknown
content
0 I love love love love this kitten
1 I hate hate hate hate this keyboard
2 I'm not sure how I feel about toast
3 Did you see the baseball game yesterday?
4 The package was delivered late and the contents were broken
5 Trashy television shows are some of my favorites
6 I'm seeing a Kubrick film tomorrow, I hear not so great things about it.
7 I find chirping birds irritating, but I know I'm not the only one

First we need to vectorize our sentences into numbers so the algorithm can understand them.

Our algorithm only knows certain words. Run vectorizer.get_feature_names() to show you the list of the words it knows.

print(vectorizer.get_feature_names())
['10', '100', '11', '12', '15', '1st', '20', '2day', '2nd', '30', 'able', 'about', 'account', 'actually', 'add', 'after', 'afternoon', 'again', 'ago', 'agree', 'ah', 'ahh', 'ahhh', 'air', 'album', 'all', 'almost', 'alone', 'already', 'alright', 'also', 'although', 'always', 'am', 'amazing', 'amp', 'an', 'and', 'annoying', 'another', 'any', 'anymore', 'anyone', 'anything', 'anyway', 'app', 'apparently', 'apple', 'appreciate', 'are', 'around', 'art', 'as', 'ask', 'asleep', 'ass', 'at', 'ate', 'aw', 'awake', 'awards', 'away', 'awesome', 'aww', 'awww', 'baby', 'back', 'bad', 'band', 'bbq', 'bday', 'be', 'beach', 'beautiful', 'because', 'bed', 'been', 'beer', 'before', 'behind', 'being', 'believe', 'best', 'bet', 'better', 'big', 'bike', 'birthday', 'bit', 'bitch', 'black', 'blip', 'blog', 'blue', 'body', 'boo', 'book', 'books', 'bored', 'boring', 'both', 'bought', 'bout', 'box', 'boy', 'boys', 'break', 'breakfast', 'bring', 'bro', 'broke', 'broken', 'brother', 'brothers', 'btw', 'bus', 'business', 'busy', 'but', 'buy', 'by', 'bye', 'cake', 'call', 'called', 'came', 'can', 'cannot', 'cant', 'car', 'card', 'care', 'case', 'cat', 'catch', 'cause', 'cd', 'chance', 'change', 'channel', 'chat', 'check', 'chicken', 'chocolate', 'church', 'city', 'class', 'clean', 'cleaning', 'close', 'closed', 'club', 'coffee', 'cold', 'college', 'com', 'come', 'comes', 'coming', 'completely', 'computer', 'concert', 'congrats', 'cool', 'cos', 'could', 'couldn', 'country', 'couple', 'course', 'crap', 'crazy', 'cream', 'cry', 'crying', 'cut', 'cute', 'cuz', 'da', 'dad', 'damn', 'dance', 'date', 'daughter', 'david', 'day', 'days', 'ddlovato', 'de', 'dead', 'dear', 'decided', 'definitely', 'did', 'didn', 'didnt', 'die', 'died', 'dinner', 'dm', 'do', 'does', 'doesn', 'doesnt', 'dog', 'doing', 'don', 'done', 'dont', 'down', 'download', 'dream', 'dreams', 'dress', 'drink', 'drinking', 'drive', 'driving', 'drunk', 'dude', 'due', 'during', 'each', 'earlier', 'early', 'easy', 'eat', 'eating', 'either', 'else', 'em', 'email', 'end', 'ended', 'english', 'enjoy', 'enjoyed', 'enjoying', 'enough', 'episode', 'especially', 'even', 'evening', 'ever', 'every', 'everybody', 'everyone', 'everything', 'exactly', 'exam', 'exams', 'except', 'excited', 'exciting', 'eye', 'eyes', 'face', 'facebook', 'fact', 'fail', 'fair', 'fall', 'family', 'fan', 'fans', 'far', 'fast', 'favorite', 'fb', 'feel', 'feeling', 'feels', 'feet', 'fell', 'felt', 'few', 'ff', 'figure', 'final', 'finally', 'finals', 'find', 'fine', 'fingers', 'finish', 'finished', 'fire', 'first', 'fix', 'flight', 'flu', 'fly', 'fm', 'follow', 'followers', 'followfriday', 'following', 'food', 'for', 'forever', 'forget', 'forgot', 'forward', 'found', 'free', 'friday', 'friend', 'friends', 'from', 'front', 'fuck', 'fucking', 'full', 'fun', 'funny', 'game', 'games', 'garden', 'gave', 'gd', 'get', 'gets', 'getting', 'girl', 'girls', 'give', 'glad', 'go', 'god', 'goes', 'goin', 'going', 'gone', 'gonna', 'good', 'goodbye', 'goodnight', 'google', 'got', 'gotta', 'graduation', 'great', 'green', 'gt', 'guess', 'guitar', 'guy', 'guys', 'gym', 'ha', 'had', 'haha', 'hahaha', 'hair', 'half', 'hand', 'hang', 'happen', 'happened', 'happens', 'happy', 'hard', 'has', 'hate', 'hates', 'have', 'haven', 'havent', 'having', 'he', 'head', 'headache', 'headed', 'heading', 'hear', 'heard', 'heart', 'hehe', 'hell', 'hello', 'help', 'her', 'here', 'hey', 'hi', 'high', 'him', 'his', 'history', 'hit', 'hmm', 'holiday', 'home', 'homework', 'hope', 'hopefully', 'hoping', 'horrible', 'hot', 'hotel', 'hour', 
'hours', 'house', 'how', 'http', 'hubby', 'hug', 'huge', 'hugs', 'hun', 'hungry', 'hurt', 'hurts', 'ice', 'idea', 'idk', 'if', 'ill', 'im', 'in', 'inside', 'instead', 'interesting', 'internet', 'into', 'iphone', 'ipod', 'is', 'isn', 'isnt', 'it', 'its', 'ive', 'jealous', 'job', 'join', 'jonas', 'jonasbrothers', 'july', 'june', 'jus', 'just', 'keep', 'keeps', 'kid', 'kids', 'kill', 'kind', 'kinda', 'knew', 'know', 'knows', 'la', 'lady', 'lakers', 'lame', 'laptop', 'last', 'late', 'later', 'laugh', 'lazy', 'learn', 'learning', 'least', 'leave', 'leaving', 'left', 'less', 'let', 'lets', 'life', 'like', 'liked', 'lil', 'line', 'link', 'list', 'listen', 'listening', 'little', 'live', 'living', 'll', 'lmao', 'lol', 'london', 'lonely', 'long', 'longer', 'look', 'looked', 'looking', 'looks', 'lost', 'lot', 'lots', 'love', 'loved', 'lovely', 'loves', 'loving', 'lt', 'luck', 'lucky', 'lunch', 'luv', 'ly', 'ma', 'mac', 'mad', 'made', 'mail', 'major', 'make', 'makes', 'making', 'man', 'many', 'maths', 'matter', 'may', 'maybe', 'mcfly', 'me', 'mean', 'means', 'meant', 'meet', 'meeting', 'message', 'met', 'might', 'miley', 'mileycyrus', 'mind', 'mine', 'minute', 'minutes', 'miss', 'missed', 'missing', 'mom', 'moment', 'monday', 'money', 'month', 'months', 'mood', 'moon', 'more', 'morning', 'most', 'mother', 'mouth', 'move', 'movie', 'movies', 'moving', 'mr', 'mtv', 'much', 'mum', 'music', 'must', 'my', 'myloc', 'myself', 'myspace', 'name', 'nap', 'near', 'need', 'needed', 'needs', 'never', 'new', 'news', 'next', 'nice', 'night', 'nights', 'nite', 'no', 'nope', 'not', 'nothing', 'now', 'number', 'of', 'off', 'office', 'oh', 'ohh', 'ok', 'okay', 'old', 'omg', 'on', 'once', 'one', 'ones', 'online', 'only', 'open', 'or', 'other', 'ouch', 'our', 'out', 'outside', 'over', 'own', 'packing', 'page', 'pain', 'paper', 'parents', 'park', 'part', 'party', 'pass', 'past', 'pay', 'peace', 'people', 'perfect', 'person', 'phone', 'photo', 'photos', 'pic', 'pick', 'pics', 'picture', 'pictures', 'pink', 'pizza', 'place', 'plan', 'plans', 'play', 'played', 'playing', 'please', 'pls', 'plurk', 'plus', 'point', 'pool', 'poor', 'post', 'posted', 'power', 'ppl', 'pretty', 'probably', 'problem', 'profile', 'project', 'proud', 'put', 'quite', 'quot', 'radio', 'rain', 'raining', 'rainy', 'random', 'rather', 're', 'read', 'reading', 'ready', 'real', 'realized', 'really', 'reason', 'red', 'relaxing', 'remember', 'reply', 'rest', 'revision', 'ride', 'right', 'rip', 'road', 'rock', 'room', 'run', 'running', 'sad', 'sadly', 'safe', 'said', 'same', 'sat', 'saturday', 'save', 'saw', 'say', 'saying', 'says', 'scared', 'scary', 'school', 'season', 'second', 'see', 'seeing', 'seem', 'seems', 'seen', 'send', 'sent', 'seriously', 'set', 'shall', 'shame', 'share', 'she', 'shirt', 'shit', 'shoes', 'shop', 'shopping', 'short', 'should', 'show', 'shower', 'shows', 'sick', 'side', 'sigh', 'sign', 'silly', 'sims', 'since', 'singing', 'sister', 'site', 'sitting', 'sleep', 'sleeping', 'sleepy', 'slept', 'slow', 'small', 'smile', 'so', 'some', 'someone', 'something', 'sometimes', 'son', 'song', 'songs', 'soo', 'soon', 'sooo', 'soooo', 'sore', 'sorry', 'sound', 'sounds', 'special', 'spend', 'spending', 'spent', 'star', 'start', 'started', 'starting', 'starts', 'stay', 'still', 'stomach', 'stop', 'store', 'story', 'straight', 'stuck', 'study', 'studying', 'stuff', 'stupid', 'such', 'suck', 'sucks', 'summer', 'sun', 'sunday', 'sunny', 'sunshine', 'super', 'support', 'supposed', 'sure', 'sweet', 'take', 'takes', 'taking', 'talk', 'talking', 'taylor', 
'tea', 'team', 'tell', 'test', 'text', 'than', 'thank', 'thanks', 'that', 'thats', 'the', 'their', 'them', 'then', 'there', 'these', 'they', 'thing', 'things', 'think', 'thinking', 'thinks', 'this', 'tho', 'those', 'though', 'thought', 'three', 'throat', 'through', 'thru', 'thursday', 'thx', 'tickets', 'til', 'till', 'time', 'times', 'tinyurl', 'tired', 'to', 'today', 'together', 'told', 'tom', 'tommcfly', 'tomorrow', 'tonight', 'too', 'took', 'top', 'totally', 'tour', 'town', 'traffic', 'train', 'tried', 'trip', 'true', 'try', 'trying', 'tuesday', 'turn', 'tv', 'tweet', 'tweeting', 'tweets', 'twilight', 'twitpic', 'twitter', 'two', 'ugh', 'uk', 'under', 'understand', 'unfortunately', 'until', 'up', 'update', 'updates', 'upset', 'ur', 'us', 'use', 'used', 'using', 'vacation', 've', 'vegas', 'version', 'very', 'via', 'video', 'visit', 'voice', 'wait', 'waiting', 'wake', 'walk', 'wanna', 'want', 'wanted', 'wants', 'warm', 'was', 'wasn', 'watch', 'watched', 'watching', 'water', 'way', 'we', 'wear', 'weather', 'website', 'wedding', 'wednesday', 'week', 'weekend', 'weeks', 'weird', 'welcome', 'well', 'went', 'were', 'what', 'whats', 'when', 'where', 'which', 'while', 'white', 'who', 'whole', 'why', 'wife', 'will', 'win', 'wine', 'wish', 'wishes', 'wishing', 'wit', 'with', 'without', 'woke', 'won', 'wonder', 'wonderful', 'wondering', 'wont', 'woo', 'word', 'words', 'work', 'worked', 'working', 'works', 'world', 'worried', 'worry', 'worse', 'worst', 'worth', 'would', 'wouldn', 'wow', 'write', 'writing', 'wrong', 'wtf', 'www', 'xd', 'xoxo', 'xx', 'xxx', 'ya', 'yay', 'yea', 'yeah', 'year', 'years', 'yep', 'yes', 'yesterday', 'yet', 'yo', 'you', 'young', 'your', 'yourself', 'youtube', 'yum', 'yup']

Usually when we use the vectorizer, we write code like this:

vectors = vectorizer.fit_transform(....)

That call both learns all the words and counts them. In this case we've already learned the words from our tweets - we only want to count them in our new sentences. So instead of .fit_transform, we just use .transform:

unknown_vectors = vectorizer.transform(unknown.content)
unknown_words_df = ......

Finish making your unknown_words_df in the cell below.

# Put it through the vectorizer

# transform, not fit_transform, because we already learned all our words
unknown_vectors = vectorizer.transform(unknown.content)
unknown_words_df = pd.DataFrame(unknown_vectors.toarray(), columns=vectorizer.get_feature_names())
unknown_words_df.head()
10 100 11 12 15 1st 20 2day 2nd 30 ... yesterday yet yo you young your yourself youtube yum yup
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.537291 0.0 0.0 0.244939 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1000 columns

Confirm unknown_words_df has 8 rows (one per sentence) and 1,000 columns (one per known word).

unknown_words_df.shape
(8, 1000)

Predicting with our models#

To make a prediction for each of our sentences, you can use .predict with each of our models. For example, it would look like this for logistic regression:

unknown['pred_logreg'] = logreg.predict(unknown_words_df)

Running similar .predict code for the other models will give you a 0 (negative) or a 1 (positive) from each. Some models - logistic regression among them - can also give you the probability that the sentence is in the 1 category, instead of just the category itself. To do that, you use .predict_proba:

unknown['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]

Add new columns for each of the models you trained. If the model has a .predict_proba, add that as a column as well.

  • Tip: Tab completion is helpful for checking whether .predict_proba is an option.
  • Tip: Don't forget the [:,1] after .predict_proba - it means "give me the probability for category 1."
# Predict using all our models. 

# Logistic Regression predictions + probabilities
unknown['pred_logreg'] = logreg.predict(unknown_words_df)
unknown['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]

# Random forest predictions + probabilities
unknown['pred_forest'] = forest.predict(unknown_words_df)
unknown['pred_forest_proba'] = forest.predict_proba(unknown_words_df)[:,1]

# SVC predictions
unknown['pred_svc'] = svc.predict(unknown_words_df)

# Bayes predictions + probabilities
unknown['pred_bayes'] = bayes.predict(unknown_words_df)
unknown['pred_bayes_proba'] = bayes.predict_proba(unknown_words_df)[:,1]
unknown
content pred_logreg pred_logreg_proba pred_forest pred_forest_proba pred_svc pred_bayes pred_bayes_proba
0 I love love love love this kitten 1 0.950442 1 0.848665 1 1 0.747222
1 I hate hate hate hate this keyboard 0 0.009593 0 0.000000 0 0 0.122383
2 I'm not sure how I feel about toast 0 0.180952 0 0.240000 0 0 0.416819
3 Did you see the baseball game yesterday? 1 0.615063 1 0.660000 1 1 0.509662
4 The package was delivered late and the contents were broken 0 0.058171 0 0.460000 0 0 0.219788
5 Trashy television shows are some of my favorites 0 0.330293 0 0.440000 0 1 0.534234
6 I'm seeing a Kubrick film tomorrow, I hear not so great things about it. 1 0.558548 0 0.260000 1 1 0.533493
7 I find chirping birds irritating, but I know I'm not the only one 0 0.060122 0 0.440000 0 0 0.295739

Questions#

  • What do the numbers mean? What's the difference between a 0 and a 1? A 0.5? Negative numbers?
  • Were there any sentences the classifiers disagreed about? How do you feel about the amount of disagreement?
  • What's the difference between describing sentiment with a 0/1 label compared to a 0-1 score? When might you use one instead of the other?
  • What's the difference between the linear regression model (which we imported but never trained) and the other models we're using? Why might it fit or not fit this problem?
  • Between 0-1, what range do you think counts as "negative," "positive" and "neutral"?
  • Does the variation in scores reflect the variation you would see among people? Or is it better or worse?

Testing our models#

We can actually see which model performs the best! Remember how we trained our models on tweets? We can ask each model about each tweet, and see if it gets the right answer.

df.head()
polarity text
0 0 @kconsidder You never tweet
1 0 Sick today coding from the couch.
2 1 @ChargerJenn Thx for answering so quick,I was afraid I was gonna crash twitter with all the spamming I did 2 RR..sorry bout that
3 1 Wii fit says I've lost 10 pounds since last time
4 0 @MrKinetik Not a thing!!! I don't really have a life.....

Our original dataframe is a list of many, many tweets. We turned this into X - vectorized words - and y - whether the tweet is negative or positive.

Before, we used .fit(X, y) to train on all of our data. Instead, we can split our data into train and test sets, train on one portion, and then see whether the predictions on the held-out portion match the actual labels.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
%%time

print("Training logistic regression")
logreg.fit(X_train, y_train)

print("Training random forest")
forest.fit(X_train, y_train)

print("Training SVC")
svc.fit(X_train, y_train)

print("Training Naive Bayes")
bayes.fit(X_train, y_train)
Training logistic regression
Training random forest
Training SVC
Training Naive Bayes
CPU times: user 44.9 s, sys: 809 ms, total: 45.7 s
Wall time: 47.9 s
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Confusion matrices#

To see how well they did, we'll use a "confusion matrix" for each one. I think confusion matrices are called that because they are confusing. Each row is what the tweets actually were, and each column is what the model predicted them to be.

from sklearn.metrics import confusion_matrix

Logistic Regression#

y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted negative Predicted positive
Is negative 2782 1016
Is positive 869 2833

Random forest#

y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted negative Predicted positive
Is negative 2783 1015
Is positive 1019 2683

SVC#

y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted negative Predicted positive
Is negative 2772 1026
Is positive 854 2848

Multinomial Naive Bayes#

y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
Predicted negative Predicted positive
Is negative 2815 983
Is positive 935 2767

Percentage-based confusion matrices#

Those are kind of irritating in that they're just numbers. Let's try percentages instead.

Logistic regression#

y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Predicted negative Predicted positive
Is negative 0.732491 0.267509
Is positive 0.234738 0.765262

Random forest#

y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Predicted negative Predicted positive
Is negative 0.732754 0.267246
Is positive 0.275257 0.724743

SVC#

y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Predicted negative Predicted positive
Is negative 0.729858 0.270142
Is positive 0.230686 0.769314

Multinomial Naive Bayes#

y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Predicted negative Predicted positive
Is negative 0.741180 0.258820
Is positive 0.252566 0.747434
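
If you'd rather boil each confusion matrix down to a single number, scikit-learn's accuracy_score does that. A quick sketch, assuming the models and the train/test split from above:

from sklearn.metrics import accuracy_score

# Fraction of test tweets each model labels correctly
for name, model in [('logreg', logreg), ('forest', forest), ('svc', svc), ('bayes', bayes)]:
    print(name, accuracy_score(y_test, model.predict(X_test)))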

Review#

If you find yourself unsatisfied with a tool, you can try to build your own! This is exactly what we tried to do, using the Sentiment140 dataset and several machine learning algorithms.

Sentiment140 is a database of tweets that come pre-labeled with positive or negative sentiment, assigned automatically based on the presence of a :) or :( in the tweet. Our first step was using a vectorizer to convert the tweets into numbers a computer could understand.

After that, we built four different models using different machine learning algorithms. Each one was fed each tweet's features - the words - and each tweet's label - the sentiment - in the hope that it could later predict labels when given new tweets. This process of teaching the algorithm is called training.

In order to test our algorithms, we split our data into sections - train and test datasets. You teach the algorithm with the first group, and then ask it for predictions on the second set. You can then compare its predictions to the right answers using a confusion matrix.

Although the different algorithms took very different amounts of time to train, they all ended up with roughly 73-75% accuracy.
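
As a recap, everything above can be condensed into a few lines with a scikit-learn Pipeline. We didn't use a Pipeline in this notebook - this is just a sketch for illustration - but it chains the same kind of vectorizer and classifier together, and splitting the raw text (rather than the vectors) keeps the vectorizer from peeking at the test set.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Split the raw tweets, then let the pipeline vectorize and classify
train_text, test_text, train_labels, test_labels = train_test_split(df.text, df.polarity)

pipeline = make_pipeline(
    TfidfVectorizer(max_features=1000),
    LogisticRegression(max_iter=1000)
)
pipeline.fit(train_text, train_labels)

print(accuracy_score(test_labels, pipeline.predict(test_text)))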

Discussion topics#

  • Which models performed the best? Were there big differences?
  • Do you think it's more important to be sensitive to negativity or positivity? Do we want more positive things incorrectly marked as negative, or more negative things marked as positive?
  • They all had very different training times. Which ones offer the best combination of performance and not making you wait around for an hour?
  • If you have a decent algorithm that trains more quickly, what could that mean about feature selection or the size of your training set? Why did we use max_features= and only a 30,000-tweet subset of the data?
  • Is 75% accuracy good?
  • Do your feelings change if the performance is described as "incorrect one out of every four times?"
  • What would your accuracy be for a random guess?
  • How do you feel about sentiment analysis?
  • How do you feel about this piece from The Upshot that uses the Emotion Lexicon?
  • What would you feel comfortable using our sentiment classifier for?