Designing your own sentiment analysis tool#
While there are a lot of tools that will automatically give us the sentiment of a piece of text, we learned that they don't always agree! Let's design our own to see both how these tools work internally and how we can test them to see how well they might perform.
I've cleaned the dataset up a bit.
# !pip install scikit-learn
Training on tweets#
Let's say we were going to analyze the sentiment of tweets. If we had a list of tweets that were scored positive vs. negative, we could see which words are usually associated with positive scores and which are usually associated with negative scores.
Luckily, we have Sentiment140 - http://help.sentiment140.com/for-students - a list of 1.6 million tweets along with a score as to whether they're negative or positive. We'll use it to build our own machine learning algorithm that can separate positivity from negativity.
Read in our data#
import pandas as pd
df = pd.read_csv("data/sentiment140-subset.csv", nrows=30000)
df.head()
It isn't a very complicated dataset. polarity is whether the tweet is positive (1) or negative (0), and text is the text of the tweet itself.
How many rows do we have?
df.shape
How many positive tweets compared to how many negative tweets?
df.polarity.value_counts()
Train our algorithm#
Vectorize our tweets#
Create a TfidfVectorizer and use it to vectorize our tweets. Since we don't have all the time in the world, we should probably use max_features to only take a selection of terms - how about 1000 for now?
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=1000)
vectors = vectorizer.fit_transform(df.text)
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names())
words_df.head()
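If you're curious what those numbers actually are, here's a tiny toy example - completely separate from our tweets, and just a sketch - showing how TF-IDF turns a couple of sentences into one row of scores per sentence and one column per word.
# A toy example (not our real data) to see what TF-IDF output looks like
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform([
    "i love this kitten",
    "i hate this keyboard",
])

# One row per sentence, one column per word - higher scores mean the word is more distinctive for that sentence
# (on newer scikit-learn versions, get_feature_names() is called get_feature_names_out())
pd.DataFrame(toy_vectors.toarray(), columns=toy_vectorizer.get_feature_names())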
Setting up our variables#
Because we want to fit in with all the other programmers, we need to create two variables: one called X and one called y.
X is all of our features, the things we use to predict positive or negative. That's going to be our words.
y is all of our labels, the positive or negative rating. We'll use the polarity column for that.
X = words_df
y = df.polarity
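As a quick sanity check - my addition, not a required step - X and y should have the same number of rows, one per tweet.
# X has one row of word scores per tweet, y has one label per tweet
print(X.shape)
print(y.shape)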
Picking an algorithm#
What kind of algorithm do we want? Who knows, we don't know anything about machine learning! Let's just pick ALL OF THEM.
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
Training our algorithms#
When we teach our algorithm what a positive or a negative tweet looks like, that's called training. Training can take different amounts of time depending on what kind of algorithm you're using.
%%time
# Create and train a logistic regression
logreg = LogisticRegression(C=1e9, solver='lbfgs', max_iter=1000)
logreg.fit(X, y)
%%time
# Create and train a random forest classifier
forest = RandomForestClassifier(n_estimators=50)
forest.fit(X, y)
%%time
# Create and train a linear support vector classifier (LinearSVC)
svc = LinearSVC()
svc.fit(X, y)
%%time
# Create and train a multinomial naive bayes classifier (MultinomialNB)
bayes = MultinomialNB()
bayes.fit(X, y)
How long did each take to train? How much faster were some compared to others?
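If you'd rather see the training times side by side instead of scrolling back through the cells above, here's an optional sketch that re-trains each model just to time it (so it will roughly double your waiting).
import time

# Re-fit each model and record how long it takes, in seconds
models = {'logreg': logreg, 'forest': forest, 'svc': svc, 'bayes': bayes}
for name, model in models.items():
    start = time.time()
    model.fit(X, y)
    print(name, "took", round(time.time() - start, 1), "seconds")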
# Create some test data
pd.set_option("display.max_colwidth", 200)
unknown = pd.DataFrame({'content': [
"I love love love love this kitten",
"I hate hate hate hate this keyboard",
"I'm not sure how I feel about toast",
"Did you see the baseball game yesterday?",
"The package was delivered late and the contents were broken",
"Trashy television shows are some of my favorites",
"I'm seeing a Kubrick film tomorrow, I hear not so great things about it.",
"I find chirping birds irritating, but I know I'm not the only one",
]})
unknown
First we need to vectorize our sentences into numbers, so the algorithms can understand them.
Our algorithm only knows certain words. Run vectorizer.get_feature_names() to see the list of words it knows.
print(vectorizer.get_feature_names())
Usually when we use the vectorizer, we write code like this:
vectors = vectorizer.fit_transform(....)
which both learns all the words and counts them. In this case we already know the list of words - we only want to count them. So instead of .fit_transform, we just use .transform:
unknown_vectors = vectorizer.transform(unknown.content)
unknown_words_df = ......
Finish making your unknown_words_df in the cell below.
# Put it through the vectorizer
# transform, not fit_transform, because we already learned all our words
unknown_vectors = vectorizer.transform(unknown.content)
unknown_words_df = pd.DataFrame(unknown_vectors.toarray(), columns=vectorizer.get_feature_names())
unknown_words_df.head()
Confirm unknown_words_df is 8 rows (one per sentence) and 1,000 columns (one per word in the vocabulary).
unknown_words_df.shape
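One thing to keep in mind: any word in our test sentences that isn't in the vectorizer's 1,000-word vocabulary simply gets ignored. Here's an optional check - my addition, not part of the exercise - that lists the unknown words in each sentence.
# Words the vectorizer knows
known_words = set(vectorizer.get_feature_names())

# Use the vectorizer's own analyzer so words are split and lowercased the same way it does
analyzer = vectorizer.build_analyzer()

for sentence in unknown.content:
    missing = [word for word in analyzer(sentence) if word not in known_words]
    print(missing)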
Predicting with our models#
To make a prediction for each of our sentences, you can use .predict with each of our models. For example, it looks like this for logistic regression:
unknown['pred_logreg'] = logreg.predict(unknown_words_df)
You'd run similar .predict code for each of the other models, and each one will give you a 0 (negative) or a 1 (positive). For some models - logistic regression, for example - you can also ask for the probability that the sentence is in the 1 category, instead of just the category itself. To do that, you use this code:
unknown['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]
Add new columns for each of the models you trained. If the model has a .predict_proba, add that as a column as well.
- Tip: Tab autocompletion is helpful for knowing whether .predict_proba is an option.
- Tip: Don't forget the [:,1] after .predict_proba - it means "give me the probability for category 1."
# Predict using all our models.
# Logistic Regression predictions + probabilities
unknown['pred_logreg'] = logreg.predict(unknown_words_df)
unknown['pred_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]
# Random forest predictions + probabilities
unknown['pred_forest'] = forest.predict(unknown_words_df)
unknown['pred_forest_proba'] = forest.predict_proba(unknown_words_df)[:,1]
# SVC predictions
unknown['pred_svc'] = svc.predict(unknown_words_df)
# Bayes predictions + probabilities
unknown['pred_bayes'] = bayes.predict(unknown_words_df)
unknown['pred_bayes_proba'] = bayes.predict_proba(unknown_words_df)[:,1]
unknown
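Before you tackle the questions below, it might help to count how many of the four models voted "positive" for each sentence - a split vote means the models disagree. A quick optional sketch:
# Count how many of the four models predicted 1 (positive) for each sentence
prediction_columns = ['pred_logreg', 'pred_forest', 'pred_svc', 'pred_bayes']
unknown['positive_votes'] = unknown[prediction_columns].sum(axis=1)
unknown[['content', 'positive_votes']]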
Questions#
- What do the numbers mean? What's the difference between a 0 and a 1? A 0.5? Negative numbers?
- Were there any sentences the classifiers seemed to disagree about? How do you feel about the amount they disagree?
- What's the difference between describing sentiment with 0/1 compared to a 0-1 range? When might you use one instead of the other?
- What's the difference between a linear regression model (which we imported but never used) and the classifiers we trained? Why might it fit or not fit this problem?
- Between 0-1, what range do you think counts as "negative," "positive" and "neutral"?
- Does the variation in scores reflect the variation you would see among people? Or is it better or worse?
Testing our models#
We can actually see which model performs the best! Remember how we trained our models on tweets? We can ask each model about each tweet, and see if it gets the right answer.
df.head()
Our original dataframe is a list of many, many tweets. We turned this into X - the vectorized words - and y - whether each tweet is negative or positive.
Before, we used .fit(X, y) to train on all of our data. Instead, we can do a train/test split: train our models on one portion of the data, then check whether their predictions on the held-out portion match the actual labels.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
%%time
print("Training logistic regression")
logreg.fit(X_train, y_train)
print("Training random forest")
forest.fit(X_train, y_train)
print("Training SVC")
svc.fit(X_train, y_train)
print("Training Naive Bayes")
bayes.fit(X_train, y_train)
Confusion matrices#
To see how well they did, we'll use a "confusion matrix" for each one. I think confusion matrices are called that because they are confusing, but really they show where a model gets confused - how often each actual category gets predicted as the other one.
from sklearn.metrics import confusion_matrix
Logistic Regression#
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names)
Random forest#
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names)
SVC#
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names)
Multinomial Naive Bayes#
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names)
Percentage-based confusion matrices#
Those are kind of irritating in that they're just raw counts. Let's try percentages instead, dividing each row by its total so we can see what fraction of the actual negatives and positives each model got right.
Logistic regression#
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Random forest#
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
SVC#
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
Multinomial Naive Bayes#
y_true = y_test
y_pred = bayes.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)
label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
columns='Predicted ' + label_names,
index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
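Confusion matrices show you where each model goes wrong, but sometimes you just want one number per model. Here's an optional sketch using scikit-learn's accuracy_score on the same test set (assuming the models were trained on X_train above).
from sklearn.metrics import accuracy_score

# What fraction of the test tweets did each model label correctly?
for name, model in [('logreg', logreg), ('forest', forest), ('svc', svc), ('bayes', bayes)]:
    print(name, accuracy_score(y_test, model.predict(X_test)))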
Review#
If you find yourself unsatisfied with a tool, you can try to build your own! This is exactly what we tried to do, using the Sentiment140 dataset and several machine learning algorithms.
Sentiment140 is a database of tweets that come pre-labeled with positive or negative sentiment, assigned automatically by the presence of a :) or :( in the tweet. Our first step was using a vectorizer to convert the tweets into numbers a computer could understand.
After that, we built four different models using different machine learning algorithms. Each one was fed each tweet's features - the words - and each tweet's label - the sentiment - in the hopes that later it could predict labels when given new tweets. This process of teaching the algorithm is called training.
In order to test our algorithms, we split our data into sections - train and test datasets. You teach the algorithm with the first group, and then ask it for predictions on the second set. You can then compare its predictions to the right answers using a confusion matrix.
Although different algorithms took different amounts of time to train, they all ended up with about 70-75% accuracy.
Discussion topics#
- Which models performed the best? Were there big differences?
- Do you think it's more important to be sensitive to negativity or positivity? Do we want more positive things incorrectly marked as negative, or more negative things marked as positive?
- They all had very different training times. Which ones offer the best combination of performance and not making you wait around for an hour?
- If you have a decent algorithm that trains more quickly, what could that mean about feature selection or the size of your training set? Why did we use max_features= and only read in 30,000 of the tweets?
- Is 75% accuracy good?
- Do your feelings change if the performance is described as "incorrect one out of every four times?"
- What would your accuracy be for a random guess?
- How do you feel about sentiment analysis?
- How do you feel about this piece from The Upshot that uses the Emotional Lexicon?
- What would you feel comfortable using our sentiment classifier for?