8.2 Improving our vectorizer

If we want to make our classifier work a bit better, one technique we can use is stemming. Stemming is the process of chopping ends off of similar words so they show up as being the “same.”

For example: fish, fishes, fishing and fished. Stemming removes the endings to turn them all into a simple fish. That way every single sentence that is about fish or fishing can have that word in common!
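Here's a tiny sketch of that using NLTK's SnowballStemmer (the stemmer we'll lean on in a moment), just to show what the stems look like:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

# Every variant gets chopped down to the same stem, 'fish'
for word in ['fish', 'fishes', 'fishing', 'fished']:
    print(word, '->', stemmer.stem(word))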

To stem with your TfidfVectorizer, you need to jump through a few hoops. It might end up looking like this:

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Grab the vectorizer's normal analyzer (tokenizing, lowercasing,
        # removing stop words), then stem every word it produces
        analyzer = super().build_analyzer()
        return lambda doc: (stemmer.stem(word) for word in analyzer(doc))


vec = StemmedTfidfVectorizer(stop_words='english', min_df=5, max_df=0.5)
X = vec.fit_transform(df.DO_NARRATIVE)
print(vec.get_feature_names()[100:200])
## ['accus', 'acquaint', 'act', 'action', 'activ', 'adam', 'addit', 'address', 'adjac', 'administr', 'admit', 'adn', 'adult', 'adv', 'advanc', 'advd', 'advis', 'adw', 'affair', 'affili', 'afraid', 'aft', 'again', 'aggit', 'aggress', 'aggressor', 'agit', 'ago', 'agre', 'agress', 'agrument', 'aid', 'aim', 'aint', 'air', 'airsoft', 'aknif', 'alcohol', 'alleg', 'alley', 'allow', 'alongsid', 'alt', 'alter', 'alterc', 'altercatin', 'altercaton', 'alterct', 'aluminum', 'alvarado', 'ambul', 'amd', 'amt', 'an', 'anargu', 'and', 'and2', 'andbegan', 'andbit', 'andchok', 'andgrab', 'andhit', 'andkick', 'andpul', 'andpunch', 'andpush', 'andrew', 'andscratch', 'andstruck', 'andsusp', 'andthen', 'andthrew', 'andv', 'andvict', 'anger', 'angri', 'angrier', 'angryand', 'ankl', 'annoy', 'answer', 'anv', 'anymor', 'apart', 'app', 'appar', 'appear', 'appli', 'appox', 'appr', 'apprach', 'appraoch', 'apprch', 'apprchd', 'apprd', 'appro', 'approach', 'approachd', 'approahc', 'approch']
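If you're curious how much stemming actually changes things, one quick sanity check is to build a plain, unstemmed TfidfVectorizer with the same settings and compare vocabulary sizes. A rough sketch, reusing df.DO_NARRATIVE from above (the exact counts will depend on your data):

plain_vec = TfidfVectorizer(stop_words='english', min_df=5, max_df=0.5)
plain_vec.fit_transform(df.DO_NARRATIVE)

# Stemming should shrink the vocabulary, since variants like
# 'approached' and 'approaching' collapse into a single stem
print('without stemming:', len(plain_vec.vocabulary_))
print('with stemming:', len(vec.vocabulary_))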

Stemming can be kind of slow, however! The stemmer built into NLTK is notoriously slow, so you might want to upgrade to PyStemmer, a faster C-based implementation of the same Snowball stemmers. Using PyStemmer in a similar way looks like this:

from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

stemmer = Stemmer.Stemmer('en')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Same idea as before, but PyStemmer's stemWords can stem a
        # whole list of words at once, which is much faster
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

vec = StemmedTfidfVectorizer(stop_words='english', min_df=5, max_df=0.5)
X = vec.fit_transform(df.DO_NARRATIVE)

It’s an extra install, however, so I thought I’d give you the option to just move ahead with NLTK if you wanted.
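If you want to see whether the switch is worth it on your own dataset, a rough timing check is easy enough. Something like this, using whichever version of StemmedTfidfVectorizer you ended up with:

import time

# Rough timing -- the exact numbers depend on your corpus and stemmer
vec = StemmedTfidfVectorizer(stop_words='english', min_df=5, max_df=0.5)
start = time.perf_counter()
X = vec.fit_transform(df.DO_NARRATIVE)
print('vectorizing took', round(time.perf_counter() - start, 1), 'seconds')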