Vectorizing snippets

Python data science coding reference from investigate.ai

Simple counts

Word counts, one document

The built-in Counter tool is an easy way to count words in a single document.

from collections import Counter

text = 'We saw her duck'
# Counter(text) would count characters, so split into words first
counted = Counter(text.split())
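
Counter counts whatever items you hand it, so how you break the text into words matters. Here is a minimal sketch, assuming you want to ignore capitalization and punctuation; the sample sentence and the regular expression are illustrative choices, not part of the original snippet.

```python
import re
from collections import Counter

# Illustrative sample text with mixed case and punctuation
text = 'We saw her duck. Duck, we said!'

# Lowercase everything, then keep runs of letters and apostrophes as words
words = re.findall(r"[a-z']+", text.lower())
counted = Counter(words)

print(counted['duck'])
```

Without the lowercasing step, "Duck" and "duck" would be counted as two different words.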

Top words, one document

If you only have one document and you'd like to count the words in it, Counter makes it easy to find the top n terms. If you'd like more, just change .most_common(3) to a larger number.

from collections import Counter

text = """
Time flies like an arrow;
fruit flies like a banana
"""

# Split into words, count them, print top three
counted = Counter(text.split())
counted.most_common(3)

Word counts

If you have multiple documents or are doing machine learning, a CountVectorizer from scikit-learn might be a better option than Counter. If you're just doing simple work, though, Counter should be fine.

from sklearn.feature_extraction.text import CountVectorizer

# Make a vectorizer
vectorizer = CountVectorizer()

# Learn and count the words in df.content
matrix = vectorizer.fit_transform(df.content)
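
If you don't have a df handy, a tiny made-up one is enough to see what comes back. This is a sketch with a hypothetical two-row dataframe; the only thing it assumes is a text column named content.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for df: any dataframe with a text column works
df = pd.DataFrame({'content': [
    'Time flies like an arrow',
    'Fruit flies like a banana',
]})

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(df.content)

# One row per document, one column per distinct word
print(matrix.shape)
```

With the default settings the vectorizer lowercases the text and skips one-character tokens such as "a", so the column count can be smaller than you expect.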

Words used, yes/no v.1

This will give you a dataframe where each column is a word, and each row has a 0 or 1 as to whether it contains the word or not.

Instead of getting fancy with scikit-learn or spaCy, you can just make a dataframe that uses .str.contains to see if there's a word inside. You'll use this one when there is a short list of specific words.

You use .astype(int) to change the result from True and False to 1 and 0.

import pandas as pd

# Create a dataframe of 1's and 0's for each of the words
pd.DataFrame({
  'cat': df.content.str.contains('cat', na=False).astype(int),
  'dog': df.content.str.contains('dog', na=False).astype(int),
  'mouse': df.content.str.contains('mouse', na=False).astype(int)
})
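
One thing to watch: .str.contains looks for the text anywhere in the string, so 'cat' also matches "category", and it is case-sensitive by default. If that matters, a possible variation is to match whole words with a regular expression. This is a sketch on a hypothetical dataframe; the word list and the \b word-boundary pattern are assumptions you may need to adjust.

```python
import re
import pandas as pd

# Hypothetical sample data, including a missing value
df = pd.DataFrame({'content': ['The cat sat', 'A new category', 'Cats and dogs', None]})

words = ['cat', 'dog']

# \b marks a word boundary, so only the whole word matches
flags = pd.DataFrame({
    word: df.content.str.contains(rf'\b{re.escape(word)}\b', case=False, na=False).astype(int)
    for word in words
})
```

Note that whole-word matching is strict: "Cats" and "dogs" no longer count as "cat" and "dog".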

Words used, yes/no v.2

Sometimes you only want to say whether a word is included or not. You don't really care whether it was said three or four or sixteen times.

To make this work, you pass binary=True to your CountVectorizer, and it will only give you ones and zeroes, for text appearing or not appearing.

from sklearn.feature_extraction.text import CountVectorizer

# Make a vectorizer that only says 0/1
# instead of counting
vectorizer = CountVectorizer(binary=True)

matrix = vectorizer.fit_transform(df.content)
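
To see the difference binary=True makes, you can run the same text through both kinds of vectorizer. A small sketch with a made-up one-sentence document:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up document where one word repeats
docs = ['fluffy fluffy fluffy cat']

# Columns come out in alphabetical order: cat, fluffy
counts = CountVectorizer().fit_transform(docs).toarray()[0]
flags = CountVectorizer(binary=True).fit_transform(docs).toarray()[0]

print(counts, flags)
```

The plain vectorizer reports "fluffy" three times, while the binary one only records that it showed up.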

Standard TF-IDF

TF-IDF adjusts for three things: the length of a document, how often a word shows up in that document, and how frequently the word appears across the entire dataset.

from sklearn.feature_extraction.text import TfidfVectorizer

# Make a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Learn the words in df.content and compute their TF-IDF scores
matrix = vectorizer.fit_transform(df.content)

Advanced counts

Counting n-grams

Passing ngram_range=(x, y) will count multi-token phrases from x to y tokens long, instead of just one word at a time.

from sklearn.feature_extraction.text import TfidfVectorizer

# Count 1- and 2-token phrases
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(df.content)

Stem and vectorize

Stemming combines words by stripping their endings. For example, it will convert fish, fishes and fishing to fish.

We're using pyStemmer instead of NLTK's Snowball or Porter stemmers because it's much, much faster. This example is for English.

In an ideal world we'd use spaCy. This approach will not work with n-grams, because each multi-word phrase reaches the stemmer as a single string.

from sklearn.feature_extraction.text import CountVectorizer
import Stemmer

# English stemmer from pyStemmer
stemmer = Stemmer.Stemmer('en')

# Override CountVectorizer so every token is stemmed after tokenizing
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Get the default analyzer, then stem whatever it returns
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

# Create a new StemmedCountVectorizer
vectorizer = StemmedCountVectorizer()
matrix = vectorizer.fit_transform(df.content)

Stem and vectorize TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

# English stemmer from pyStemmer
stemmer = Stemmer.Stemmer('en')

# Override TfidfVectorizer so every token is stemmed after tokenizing
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Get the default analyzer, then stem whatever it returns
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))

# Create a new StemmedTfidfVectorizer
vectorizer = StemmedTfidfVectorizer()
matrix = vectorizer.fit_transform(df.content)

Word use percentages

In the sentence "Buffalo cannot buffalo here," the word buffalo will get a score of 0.5, while cannot and here both get a 0.25. Typically this number will go through a number of adjustments to be "real" TF-IDF, but setting these options for the vectorizer makes it a simple percentage.

from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorizer only counts percentages of words,
# not real TF-IDF
vectorizer = TfidfVectorizer(use_idf=False, norm='l1')

matrix = vectorizer.fit_transform(df.content)

Using the vectorizer

Vectorizer vocabulary

This will get you a list of all of the words the vectorizer has seen (make sure you .fit_transform or .fit it first!). It works with both a CountVectorizer and a TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(df.content)

# Show the words from .fit_transform
# (scikit-learn 1.0+; older versions use .get_feature_names())
vectorizer.get_feature_names_out()

Display counts as dataframe

You'll need to convert the "sparse matrix" of word counts using .toarray(), but after that it's easy to send it into a dataframe. To make each column be named after the word it's counting, you'll need the columns= line.

If each row of your original dataframe has a short description (a speaker, a filename, etc.), you can also pass that to your new dataframe as index=.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(df.content)

# Convert the matrix of counts to a dataframe
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())

Specific word counts

Your word count dataframe is exactly the same as a normal dataframe, even though it comes from weird scikit-learn stuff. If you'd like to pick the count of a single word, you can just ask for that column. You can also ask for multiple words at a time, e.g. words_df[['fluffy', 'scratchy']].

If you're using a TfidfVectorizer, note that the number will not be a count. It also won't be a percentage, unless you used use_idf=False and norm='l1' when creating the vectorizer.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(df.content)

# Convert the matrix of counts to a dataframe
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())

# How many times was one word used
words_df['fluffy']

Add word counts

If you have a dataframe of word counts, you can see how many times a combination of words appears by creating a subset using pandas and .sum(axis=1). If the first row uses "fluffy" twice and "scratchy" once, it will return a 3.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(df.content)
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())

# Get selected word counts, add across the row
words_df[['fluffy', 'scratchy', 'magic']].sum(axis=1)
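
If you want to try the row sum without running a vectorizer, a hand-built dataframe behaves the same way. The counts below are made up for illustration.

```python
import pandas as pd

# Hypothetical word counts: one row per document, one column per word
words_df = pd.DataFrame({
    'fluffy': [2, 0],
    'scratchy': [1, 0],
    'magic': [0, 4],
    'cat': [1, 1],
})

# Add up only the selected columns, row by row
totals = words_df[['fluffy', 'scratchy', 'magic']].sum(axis=1)
print(totals)
```

The cat column is left out of the selection, so it does not affect the totals.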