Vectorizing snippets
Python data science coding reference from investigate.ai
Simple counts
Word counts, one document
The built-in Counter tool is an easy way to count words in a single document.
from collections import Counter
text = 'We saw her duck'
# Split into words first; Counter(text) would count characters
counted = Counter(text.split())
Top words, one document
If you only have one document and you'd like to count the words in it, Counter
makes it easy to find the top n terms. If you'd like more, just change .most_common(3)
to a larger number.
from collections import Counter
text = """
Time flies like an arrow;
fruit flies like a banana
"""
# Count the words, print top three
counted = Counter(text.split())
counted.most_common(3)
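Counter counts whatever tokens you hand it, so punctuation and capitalization can split what should be one word into several. A small regex tokenizer (a sketch, not part of the original snippet) handles both:

```python
import re
from collections import Counter

text = """
Time flies like an arrow;
fruit flies like a banana
"""

# Lowercase, then pull out runs of letters so "arrow;" counts as "arrow"
words = re.findall(r"[a-z]+", text.lower())
counted = Counter(words)
counted.most_common(3)
```

Without the regex, "arrow;" and "arrow" would be counted as two different words.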
Word counts
If you have multiple documents or are doing machine learning, a CountVectorizer from scikit-learn might be a better option than Counter. If you're just doing simple work, though, Counter should be fine.
from sklearn.feature_extraction.text import CountVectorizer
# Make a vectorizer
vectorizer = CountVectorizer()
# Learn and count the words in df.content
matrix = vectorizer.fit_transform(df.content)
Words used, yes/no v.1
This will give you a dataframe where each column is a word and each row contains a 0 or 1 indicating whether the document contains that word.
Instead of getting fancy with scikit-learn or spaCy, you can just build a dataframe that uses .str.contains to see if the word appears. This approach works well when you have a short list of specific words. The .astype(int) call converts the True/False results to 1s and 0s.
import pandas as pd

# Create a dataframe of 1s and 0s for each of the words
pd.DataFrame({
    'cat': df.content.str.contains('cat', na=False).astype(int),
    'dog': df.content.str.contains('dog', na=False).astype(int),
    'mouse': df.content.str.contains('mouse', na=False).astype(int)
})
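A quick demonstration with a made-up dataframe (including a missing value, which is why na=False matters):

```python
import pandas as pd

# A small stand-in for the original dataframe
df = pd.DataFrame({'content': ['my cat and dog', 'a mouse ran by', None]})

flags = pd.DataFrame({
    'cat': df.content.str.contains('cat', na=False).astype(int),
    'dog': df.content.str.contains('dog', na=False).astype(int),
    'mouse': df.content.str.contains('mouse', na=False).astype(int)
})
print(flags)
```

One caveat: .str.contains matches substrings (and treats the pattern as a regex by default), so 'cat' would also match "catalog".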
Words used, yes/no v.2
Sometimes you only want to say whether a word is included or not; you don't really care whether it was said three or four or sixteen times. To make this work, pass binary=True to your CountVectorizer, and it will give you only ones and zeroes for text appearing or not appearing.
from sklearn.feature_extraction.text import CountVectorizer
# Make a vectorizer that only says 0/1
# instead of counting
vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(df.content)
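Here's a side-by-side check on a toy corpus (a sketch standing in for df.content) showing what binary=True changes:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["buffalo buffalo buffalo", "no buffalo here"]

counts = CountVectorizer().fit_transform(docs)
binary = CountVectorizer(binary=True).fit_transform(docs)

# Counts keep the 3; binary caps everything at 1
print(counts.toarray())
print(binary.toarray())
```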
Standard TF-IDF
TF-IDF adjusts for the length of a document, how often a word shows up in the document, and how frequently the word appears in the entire dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
# Make a TF-IDF vectorizer
vectorizer = TfidfVectorizer()
# Learn and count the words in df.content
matrix = vectorizer.fit_transform(df.content)
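The practical effect is that words appearing in every document get pushed down. A standalone sketch (toy corpus standing in for df.content):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# "the" appears in every document, "dog" in only one
docs = ["the cat", "the dog", "the bird"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_
scores = matrix.toarray()

# In document 1 ("the dog"), the rare word outscores the common one
print(scores[1][vocab['dog']], scores[1][vocab['the']])
```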
Advanced counts
Counting n-grams
Passing ngram_range=(x, y) will count multi-token phrases instead of just one word at a time.
from sklearn.feature_extraction.text import TfidfVectorizer
# Count 1- and 2-token phrases
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(df.content)
Stem and vectorize
Stemming combines words by stripping their endings; for example, it will convert fish, fishes, and fishing to fish. We're using pyStemmer instead of NLTK's Snowball or Porter stemmers because it's much, much faster. This example is for English. In an ideal world we'd use spaCy. Note that this approach will not work with n-grams.
from sklearn.feature_extraction.text import CountVectorizer
import Stemmer
# English stemmer from pyStemmer
stemmer = Stemmer.Stemmer('en')
# Override CountVectorizer's analyzer so every token is stemmed
class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))
# Create a new StemmedCountVectorizer
vectorizer = StemmedCountVectorizer()
matrix = vectorizer.fit_transform(df.content)
Stem and vectorize TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer
# English stemmer from pyStemmer
stemmer = Stemmer.Stemmer('en')
# Override TfidfVectorizer's analyzer so every token is stemmed
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: stemmer.stemWords(analyzer(doc))
# Create a new StemmedTfidfVectorizer
vectorizer = StemmedTfidfVectorizer()
matrix = vectorizer.fit_transform(df.content)
Word use percentages
In the sentence "Buffalo cannot buffalo here," the word buffalo will get a score of 0.5, while cannot and here each get 0.25. Typically these numbers would go through further adjustments to become "real" TF-IDF, but setting these options on the vectorizer keeps them as simple percentages.
from sklearn.feature_extraction.text import TfidfVectorizer
# Vectorizer only counts percentages of words,
# not real TF-IDF
vectorizer = TfidfVectorizer(use_idf=False, norm='l1')
matrix = vectorizer.fit_transform(df.content)
Using the vectorizer
Vectorizer vocabulary
This will get you a list of all of the words the vectorizer has seen (make sure you .fit_transform or .fit it first!). It works with both a CountVectorizer and a TfidfVectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(df.content)
# Show the words from .fit_transform
vectorizer.get_feature_names_out()
Display counts as dataframe
You'll need to convert the "sparse matrix" of word counts using .toarray(), but after that it's easy to send it into a dataframe. To name each column after the word it's counting, you'll need the columns= argument.
If you have a short description in each row of your original dataframe - a speaker, a filename, etc. - you can also pass that to your new dataframe as index=.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(df.content)
# Convert the matrix of counts to a dataframe
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
Specific word counts
Your word count dataframe is exactly the same as a normal dataframe, even though it comes from weird scikit-learn stuff. If you'd like the count of a single word, you can just ask for that column. You can also ask for multiple words at a time, e.g. words_df[['fluffy', 'scratchy']].
If you're using a TfidfVectorizer, note that the number will not be a count. It also won't be a percentage, unless you used use_idf=False and norm='l1' when creating the vectorizer.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(df.content)
# Convert the matrix of counts to a dataframe
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
# How many times was one word used
words_df['fluffy']
Add word counts
If you have a dataframe of word counts, you can see how many times a combination of words appears by selecting a subset of columns and calling .sum(axis=1). If the first row uses "fluffy" twice and "scratchy" once, it will return 3.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(df.content)
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())
# Get selected word counts, add across the row
words_df[['fluffy', 'scratchy', 'magic']].sum(axis=1)