Predicting reports of bullying, racism, and unwanted sexual behavior from app store reviews#
In its analysis, the Washington Post designed a machine learning algorithm to detect patterns of "unsafe" behavior by analyzing App Store reviews. Using a spreadsheet of hand-tagged reviews, we'll both train an algorithm to spot these behaviors and learn which words tip our algorithm off. Instead of reading all the reviews ourselves, we'll use the classifier to track down the reviews that are probably interesting to us as journalists.
Our data#
Starting from the app store reviews we scraped, we opened the results in Google Sheets and manually tagged reviews for the behaviors we're looking for (racism, bullying, sexual content).
To speed up the hunt for negativity to classify, I filtered to only show reviews with 1 or 2 stars. Let's move on to reading this dataset into pandas.
import pandas as pd
pd.set_option("display.max_colwidth", 300)
# Read in our data, then drop ones without a text
# review and get rid of a few unwanted columns
df = pd.read_csv("data/reviews-marked.csv")
df = df.dropna(subset=['Review'])
df = df.drop(columns=['Country', 'Date', 'Version'])
df.shape
Overall we have around 56k reviews. What do they look like?
df.head()
We've only filled in 0 and 1 for racism, bullying, and unwanted sexual behavior in a handful of reviews. We'll separate our content into two pieces: known reviews that we've labeled, and unknown reviews that we have not. We'll use the known ones to train our classifier, and then run it on the unknown ones to find possible reviews to examine.
known = df[df.sexual.notna()].copy()
unknown = df[df.sexual.isna()].copy()
We're using .copy() in case we add new columns. If we skip it, pandas will yell at us, because it won't know whether a new column added to known should also be saved back to df. How many did we label?
known.shape
known.head()
That's not nearly enough, but ok! Let's move on.
Vectorize our text#
We'll be using a stemmed TF-IDF vectorizer to both combine similar words, like "pic" and "pics", and give uncommon words a little more weight.
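If you're curious what the stemmer actually does, here's a quick sketch using the PyStemmer library we're about to install (the word list is just an example):

import Stemmer

stemmer = Stemmer.Stemmer('en')
# "pic" and "pics" collapse to the same stem, so the vectorizer
# will count them as a single feature
stemmer.stemWords(['pic', 'pics', 'message', 'messages'])
# roughly ['pic', 'pic', 'messag', 'messag']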
#!pip install pystemmer
#!pip install scikit-learn
%%time
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer
# English stemmer from pyStemmer
stemmer = Stemmer.Stemmer('en')
# Override TfidfVectorizer's analyzer so every token gets stemmed
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

# Create a new StemmedTfidfVectorizer
vectorizer = StemmedTfidfVectorizer()
matrix = vectorizer.fit_transform(known.Review)
# Build a dataframe of words, purely out of curiosity
words_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head(5)
We have 1,324 different words we're looking at, but only around 350 reviews. You should never have more features than data points. We're going to do it anyway, though, and see if things work out.
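If you want to double-check those numbers, the matrix's shape holds both: rows are labeled reviews, columns are stemmed words.

# (number of labeled reviews, number of stemmed words)
matrix.shape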
In the spirit of doing things right, we'll tweak our vectorizer a little bit: we'll only keep words that show up in fewer than 30% of reviews, and we'll take a maximum of 500 features.
vectorizer = StemmedTfidfVectorizer(max_features=500, max_df=0.30)
matrix = vectorizer.fit_transform(known.Review)
# Build a dataframe of words, purely out of curiosity
words_df = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head(5)
What's the split between "normal" reviews and ones featuring unwanted sexual behavior?
known.sexual.value_counts()
Notice we've run into another problem: very imbalanced data. Can a classifier accurately figure out what a sexual comment looks like if it's only seen sixteen of them?
Train a classifier#
While we could use almost any kind of classifier, LinearSVC is a good one for text analysis. To deal with the class imbalance issue, we could handle it the more complex (and correct) way, or we could just ask the classifier to balance out the classes while learning by passing class_weight='balanced'.
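If you're curious what 'balanced' actually means, scikit-learn weights each class by n_samples / (n_classes * class_count), so the handful of sexual-content reviews count for a lot more. Here's a quick sketch of that math using scikit-learn's helper (we don't need this for training, it's just a peek):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# class_weight='balanced' uses n_samples / (n_classes * class_count),
# so the rare 1.0 class gets a much larger weight than the 0.0 class
classes = np.unique(known.sexual)
weights = compute_class_weight('balanced', classes=classes, y=known.sexual)
dict(zip(classes, weights))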
%%time
from sklearn.svm import LinearSVC
X = matrix
y = known.sexual
clf = LinearSVC(class_weight='balanced')
clf.fit(X, y)
It would be nice to test our classifier, but I'm going to be honest: with only sixteen rows labeled as sexual content, testing isn't going to do us much good. We know we aren't doing this in the best possible way, but we're hoping we can find it helpful regardless.
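For reference, once more reviews are labeled, testing could look something like this cross-validation sketch (with only sixteen positive examples the scores won't mean much yet):

from sklearn.model_selection import cross_val_score

# Five-fold cross-validation: train on 4/5 of the labeled reviews,
# score on the held-out 1/5, and repeat five times
scores = cross_val_score(LinearSVC(class_weight='balanced'), X, y, cv=5)
scores.mean()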
Use our classifier#
When we vectorized our reviews before, we used vectorizer.fit_transform. This taught the vectorizer all of the words in the reviews and transformed them into numbers the algorithm could understand.
This time we just use .transform to convert our text into numbers. We don't need .fit because we don't need the vectorizer to learn any new words. Any word our classifier hasn't seen before won't be helpful anyway, because the classifier won't know whether it implies unwanted sexual behavior or not.
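To see what ignoring unseen words looks like in practice, here's a tiny sketch with a made-up review: none of its words are in the vocabulary, so the row it produces is all zeros.

# A made-up review full of words the vectorizer has never seen:
# .nnz counts the non-zero entries in the row, which is 0 here
vectorizer.transform(["zzzz qqqq wwww"]).nnz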
X = vectorizer.transform(unknown.Review)
unknown['predicted'] = clf.predict(X)
unknown['predicted_proba'] = clf.decision_function(X)
# If you use a different classifier, you might use .predict_proba instead
# unknown['predicted_proba'] = clf.predict_proba(X)[:, 1]
How many did our classifier predict as related to sexual content?
unknown.predicted.value_counts()
While there are over 500 that have been predicted to probably be about unwanted sexual behavior, we can actually dig a little deeper. Let's take a look at the values in the predicted_proba column.
unknown.sort_values(by='predicted_proba', ascending=True).head(2)
unknown.sort_values(by='predicted_proba', ascending=False).head(2)
Ones where predicted_proba is low are unlikely to be about unwanted sexual behavior, while ones where predicted_proba is high are likely to be about unwanted sexual behavior. The closer the value is to 0, the more uncertain the classifier is.
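One way to see that uncertainty is to count how many reviews sit right next to the decision boundary. The 0.25 cutoff below is an arbitrary choice, purely for illustration.

# Reviews with scores near zero are the ones the classifier is
# least sure about (0.25 is an arbitrary cutoff)
unknown[unknown.predicted_proba.abs() < 0.25].shape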
We called this column predicted_proba because for some classifiers it's a probability, e.g. an 80% chance the review is about unwanted sexual behavior. In the spirit of cutting and pasting code, it seems easier to just re-use the name and remember the difference instead of calling it decision_function.
Instead of just taking the reviews the classifier marked as definitely interesting to us, we could also sort by predicted_proba and take the top 1,000. That way we get the ones that are "definitely" about unwanted sexual behavior as well as some more borderline cases.
to_investigate = unknown.sort_values(by='predicted_proba', ascending=False).head(1000)
to_investigate.sample(5)
While we could look at them in the notebook here, it's probably better to save them to a CSV and take a better look later. Let's save the top 1000 for later research.
to_investigate.to_csv("data/to-investigate.csv", index=False)
Explaining our classifier#
While we're at it, what's the classifier even doing? We can assume that it's looking at the words and figuring out that certain words signal unwanted sexual behavior or not, but which words do which things?
This is called explainability, and it's a big part of being able to argue about whether your machine learning algorithm makes sense or not. The specific question of which words are more or less important is called feature importance.
There are a lot of different, somewhat complicated ways to ask scikit-learn which features are important, but there's a Python library called eli5 that does an almost-perfect job of making it take only one line of code.
import eli5
# eli5 gets our classifier and our vectorizer, so it knows what
# numbers are what words (otherwise you just get numbers and weights)
eli5.explain_weights(clf, vec=vectorizer)
Seems reasonable.
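If eli5 won't cooperate with your version of scikit-learn, you can get a rough equivalent by pairing the classifier's coefficients with the vectorizer's feature names yourself. This is a sketch, assuming the clf and vectorizer from above.

# Each stemmed word gets one weight in the linear model: large positive
# weights push a review toward the sexual-content label
weights_df = pd.DataFrame({
    'word': vectorizer.get_feature_names_out(),
    'weight': clf.coef_[0]
})
weights_df.sort_values(by='weight', ascending=False).head(10)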
Review#
We have a lot of app store reviews, some of which we have manually tagged for certain kinds of behavior. We use those to train a machine learning algorithm, teaching it which words are associated with which kinds of behavior.
After we've trained the algorithm on our known, labeled reviews we can then send it unknown reviews, and it will flag those we need to review. Since we don't mind reviewing a large number of reviews, we also save a handful that the algorithm flagged as barely not-interesting.
Discussion topics#
Did we label enough reviews? Beyond the idea of "more data is better," what are the benefits of labeling more reviews?
If we need to label more reviews, would it be helpful to search for words like "nude" and "guy" to see if we can easily track down some unwanted sexual behavior? Or should we just keep going one-by-one?
We selected the top 500 features, not including any words that showed up more than 30% of the time. What if a representative word like "nude" or "pervert" showed up in half of the reviews? How could we prevent this from happening?
We didn't test our classifier because we didn't have enough labeled as sexual content to really make a meaningful test. Is that okay? Think about what we were using this for.
The Washington Post used sentences like "At least 19 percent of the reviews on ChatLive mentioned unwanted sexual approaches." What would you need to do to feel comfortable making such a numbers-based statement?
Why did we use a stemmer? If we had thousands of labeled reviews instead of just a few dozen, would that have been as necessary?