Conceptual document similarity with word embeddings

There are a lot of different ways to compare the similarity of words and documents - some purely mechanical, some a little more conceptual. In this section we'll look at comparing what documents are actually about instead of just comparing the words they contain.

The issue

Computers are very, very, very particular about what counts as "the same." For example, "cat" is the same as "cat", but the capitalized "CAT" is not the same as the lowercase "cat".

print('Is cat the same as cat?', "cat" == "cat")
print('Is CAT the same as cat?', "CAT" == "cat")
Is cat the same as cat? True
Is CAT the same as cat? False

If we're looking at similarity just using the letters in the words, we can do things like lowercase everything, rearrange letters, or use edit distance for partial matches - all sorts of tricks to make "better" comparisons.
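
For instance, here's a minimal sketch of two of those tricks: lowercasing before comparing, and using Python's built-in difflib to get a rough partial-match score.

import difflib

# Lowercasing both sides makes "CAT" and "cat" count as the same
print("CAT".lower() == "cat".lower())

# SequenceMatcher gives a rough zero-to-one score for partial matches
print(difflib.SequenceMatcher(None, 'kitten', 'sitting').ratio())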

This pickiness ends up being a bit of a pain when comparing documents to see how similar they are to one another. No amount of counting words is going to get us the concepts in a piece, or the idea of what it's really about.

Let's start by taking a look at "normal" document similarity, then compare that approach with using word embeddings on the exact same sentences. Here are the sentences we'll be using.

sentences = [
    'Molly ate a donut',
    'Molly ate a fish',
    'Jen consumed a carp',
    'Lenny fears the lions'
]

print('\n'.join(sentences))
Molly ate a donut
Molly ate a fish
Jen consumed a carp
Lenny fears the lions

Which pair do you think is the most similar? Two of them are Molly eating something, while two of them are women eating fish. Lenny is... definitely an outlier. While you weigh the options, let's get analyzing!

Word counting

If you read our piece on n-gram document similarity, you know that there's a lot that goes into whether two documents are similar or not. And if you didn't, no big deal! It totally doesn't matter.

To judge similarity between these sentences, we're going to use a CountVectorizer from scikit-learn with binary=True. It just marks whether or not each word shows up in each sentence - no weighting, no fancy math. We're counting words in the simplest possible way.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(sentences)
counts = pd.DataFrame(
    matrix.toarray(),
    index=sentences,
    columns=vectorizer.get_feature_names_out())
counts
ate carp consumed donut fears fish jen lenny lions molly the
Molly ate a donut 1 0 0 1 0 0 0 0 0 1 0
Molly ate a fish 1 0 0 0 0 1 0 0 0 1 0
Jen consumed a carp 0 1 1 0 0 0 1 0 0 0 0
Lenny fears the lions 0 0 0 0 1 0 0 1 1 0 1
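
Quick aside: if you wanted rarer words to count for more and common words for less, you could swap in scikit-learn's TfidfVectorizer instead - same interface, fancier weighting. A minimal sketch, if you'd like to experiment:

from sklearn.feature_extraction.text import TfidfVectorizer

# Same idea, but each word is weighted by how rare it is overall
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(sentences)

pd.DataFrame(
    tfidf_matrix.toarray(),
    index=sentences,
    columns=tfidf_vectorizer.get_feature_names_out()).round(2)

We'll stick with the plain binary counts, though, so the numbers stay easy to read.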

We'll be measuring similarity via cosine similarity, a standard measure of similarity in natural language processing. It's similar to how we might look at a graph with points on it and measure how far apart they are - except instead of the distance between the points, cosine similarity measures the angle between them.
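
If you'd like to see the arithmetic, cosine similarity is just the dot product of two vectors divided by the product of their lengths. Here's a quick check by hand, using the rows for the two Molly sentences from our counts table:

import numpy as np

# "Molly ate a donut" and "Molly ate a fish" from the counts table above
donut = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0])
fish = np.array([1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0])

# Dot product divided by the product of the vector lengths
print(donut @ fish / (np.linalg.norm(donut) * np.linalg.norm(fish)))

That works out to 2/3, since the sentences share two of their three counted words. scikit-learn will happily do this for every pair at once: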

from sklearn.metrics.pairwise import cosine_similarity

# Compute the similarities using the word counts
similarities = cosine_similarity(matrix)

# Make a fancy colored dataframe about it
pd.DataFrame(similarities,
             index=sentences,
             columns=sentences) \
            .style \
            .background_gradient(axis=None)
Molly ate a donut Molly ate a fish Jen consumed a carp Lenny fears the lions
Molly ate a donut 1 0.666667 0 0
Molly ate a fish 0.666667 1 0 0
Jen consumed a carp 0 0 1 0
Lenny fears the lions 0 0 0 1

Document similarity is on a scale of zero to one, with zero being completely dissimilar and one being an exact match. Each sentence has a 1 when compared to itself - they're totally equal!

  • "Molly ate a donut" and "Molly ate a fish" are both pretty similar - over half - since there's only one word that's different between the two.
  • "Jen consumed a carp" only has the nigh-useless "a" in common with them, so it has a similarity score of 0 to both of the others.
  • Lenny's sentence also has no shared words with anything else. Nor any topics in common, although it doesn't matter right now.

In our brains, though, consumed means just about the same thing as ate. And a carp is a kind of fish, right? If only there were some way of teaching a computer the meaning behind words!

Word embeddings

Word embeddings are a step up from just counting words. They give the computer a sense of what words mean, teaching it that puppies are kind of like kittens, kittens are like cats, and shoes are very, very different from all of those animals.

We're going to be using the spaCy word embeddings. Each word comes with a 300-dimension vector that expresses things like how catlike the word is, whether you can wear it, if it's something people do during a basketball game (not those exactly, but the same idea). Think of it like 300 different scores for each word, all in different categories.

If you're running this notebook yourself, it'll take a while to load it in. It's a lot of data!

import spacy

# If this fails, download the model first: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

For example, let's check out the 300 dimensions of facts and feelings that spaCy knows about the word cat.

nlp('cat').vector
array([-0.15067  , -0.024468 , -0.23368  , -0.23378  , -0.18382  ,
        0.32711  , -0.22084  , -0.28777  ,  0.12759  ,  1.1656   ,
       -0.64163  , -0.098455 , -0.62397  ,  0.010431 , -0.25653  ,
        0.31799  ,  0.037779 ,  1.1904   , -0.17714  , -0.2595   ,
       -0.31461  ,  0.038825 , -0.15713  , -0.13484  ,  0.36936  ,
       -0.30562  , -0.40619  , -0.38965  ,  0.3686   ,  0.013963 ,
       -0.6895   ,  0.004066 , -0.1367   ,  0.32564  ,  0.24688  ,
       -0.14011  ,  0.53889  , -0.80441  , -0.1777   , -0.12922  ,
        0.16303  ,  0.14917  , -0.068429 , -0.33922  ,  0.18495  ,
       -0.082544 , -0.46892  ,  0.39581  , -0.13742  , -0.35132  ,
        0.22223  , -0.144    , -0.048287 ,  0.3379   , -0.31916  ,
        0.20526  ,  0.098624 , -0.23877  ,  0.045338 ,  0.43941  ,
        0.030385 , -0.013821 , -0.093273 , -0.18178  ,  0.19438  ,
       -0.3782   ,  0.70144  ,  0.16236  ,  0.0059111,  0.024898 ,
       -0.13613  , -0.11425  , -0.31598  , -0.14209  ,  0.028194 ,
        0.5419   , -0.42413  , -0.599    ,  0.24976  , -0.27003  ,
        0.14964  ,  0.29287  , -0.31281  ,  0.16543  , -0.21045  ,
       -0.4408   ,  1.2174   ,  0.51236  ,  0.56209  ,  0.14131  ,
        0.092514 ,  0.71396  , -0.021051 , -0.33704  , -0.20275  ,
       -0.36181  ,  0.22055  , -0.25665  ,  0.28425  , -0.16968  ,
        0.058029 ,  0.61182  ,  0.31576  , -0.079185 ,  0.35538  ,
       -0.51236  ,  0.4235   , -0.30033  , -0.22376  ,  0.15223  ,
       -0.048292 ,  0.23532  ,  0.46507  , -0.67579  , -0.32905  ,
        0.08446  , -0.22123  , -0.045333 ,  0.34463  , -0.1455   ,
       -0.18047  , -0.17887  ,  0.96879  , -1.0028   , -0.47343  ,
        0.28542  ,  0.56382  , -0.33211  , -0.38275  , -0.2749   ,
       -0.22955  , -0.24265  , -0.37689  ,  0.24822  ,  0.36941  ,
        0.14651  , -0.37864  ,  0.31134  , -0.28449  ,  0.36948  ,
       -2.8174   , -0.38319  , -0.022373 ,  0.56376  ,  0.40131  ,
       -0.42131  , -0.11311  , -0.17317  ,  0.1411   , -0.13194  ,
        0.18494  ,  0.097692 , -0.097341 , -0.23987  ,  0.16631  ,
       -0.28556  ,  0.0038654,  0.53292  , -0.32367  , -0.38744  ,
        0.27011  , -0.34181  , -0.27702  , -0.67279  , -0.10771  ,
       -0.062189 , -0.24783  , -0.070884 , -0.20898  ,  0.062404 ,
        0.022372 ,  0.13408  ,  0.1305   , -0.19546  , -0.46849  ,
        0.77731  , -0.043978 ,  0.3827   , -0.23376  ,  1.0457   ,
       -0.14371  , -0.3565   , -0.080713 , -0.31047  , -0.57822  ,
       -0.28067  , -0.069678 ,  0.068929 , -0.16227  , -0.63934  ,
       -0.62149  ,  0.11222  , -0.16969  , -0.54637  ,  0.49661  ,
        0.46565  ,  0.088294 , -0.48496  ,  0.69263  , -0.068977 ,
       -0.53709  ,  0.20802  , -0.42987  , -0.11921  ,  0.1174   ,
       -0.18443  ,  0.43797  , -0.1236   ,  0.3607   , -0.19608  ,
       -0.35366  ,  0.18808  , -0.5061   ,  0.14455  , -0.024368 ,
       -0.10772  , -0.0115   ,  0.58634  , -0.054461 ,  0.0076487,
       -0.056297 ,  0.27193  ,  0.23096  , -0.29296  , -0.24325  ,
        0.10317  , -0.10014  ,  0.7089   ,  0.17402  , -0.0037509,
       -0.46304  ,  0.11806  , -0.16457  , -0.38609  ,  0.14524  ,
        0.098122 , -0.12352  , -0.1047   ,  0.39047  , -0.3063   ,
       -0.65375  , -0.0044248, -0.033876 ,  0.037114 , -0.27472  ,
        0.0053147,  0.30737  ,  0.12528  , -0.19527  , -0.16461  ,
        0.087518 , -0.051107 , -0.16323  ,  0.521    ,  0.10822  ,
       -0.060379 , -0.71735  , -0.064327 ,  0.37043  , -0.41054  ,
       -0.2728   , -0.30217  ,  0.015771 , -0.43056  ,  0.35647  ,
        0.17188  , -0.54598  , -0.21541  , -0.044889 , -0.10597  ,
       -0.54391  ,  0.53908  ,  0.070938 ,  0.097839 ,  0.097908 ,
        0.17805  ,  0.18995  ,  0.49962  , -0.18529  ,  0.051234 ,
        0.019574 ,  0.24805  ,  0.3144   , -0.29304  ,  0.54235  ,
        0.46672  ,  0.26017  , -0.44705  ,  0.28287  , -0.033345 ,
       -0.33181  , -0.10902  , -0.023324 ,  0.2106   , -0.29633  ,
        0.81506  ,  0.038524 ,  0.46004  ,  0.17187  , -0.29804  ],
      dtype=float32)

We have absolutely no idea what it means, but sure, okay! If you don't trust that it means something, go pop back to the original word embeddings section, where we make some nice charts and graphs about puppies and kittens. They'll hopefully convince you!
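
Or, for a quicker sanity check, we can ask spaCy directly. The exact scores will vary a bit between model versions, but the ordering should hold:

# Both baby animals: should score high
print(nlp('puppy').similarity(nlp('kitten')))

# A baby animal and footwear: should score much lower
print(nlp('puppy').similarity(nlp('shoe')))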

In the same way that each word has a 300-dimension vector, entire sentences are just combinations of the words inside. We can feed spaCy a sentence and it'll spit out another 300 numbers, just like it did for cat.

nlp('Some people have never eaten a taco').vector
array([-9.35721472e-02, -6.58728555e-02, -1.09004125e-01, -7.26737157e-02,
       -3.09685934e-02,  2.14037150e-01, -8.41091350e-02, -1.10409282e-01,
        1.90102849e-02,  2.27796984e+00, -1.21881999e-01,  1.30912876e-02,
        1.47405550e-01, -9.49697122e-02, -1.73579574e-01, -8.44508559e-02,
       -7.09174722e-02,  7.88924217e-01, -1.76804587e-01,  9.65175703e-02,
        4.83218543e-02, -1.74057707e-01, -5.80976345e-02,  1.79618560e-02,
       -2.44358610e-02, -1.17278710e-01, -1.59582719e-01, -2.81308684e-02,
        2.24118426e-01, -1.89423576e-01,  2.19992865e-02,  6.82982877e-02,
       -3.31260003e-02, -1.45007715e-01,  5.38914166e-02,  6.10024370e-02,
        5.77065684e-02, -8.49509910e-02, -1.44049406e-01,  1.53504863e-01,
       -7.40417242e-02,  9.36794057e-02,  1.16259269e-01,  4.04221453e-02,
        4.16349284e-02,  2.40260854e-01, -2.14091435e-01,  8.17197189e-02,
        2.16164559e-01,  9.44399908e-02, -4.08502042e-01,  8.20949897e-02,
        9.62229446e-02, -6.38950020e-02,  3.17175716e-01,  2.05604173e-03,
       -9.53308716e-02,  1.39911519e-02,  2.02739313e-01, -7.19879419e-02,
        6.23806790e-02, -6.93957061e-02, -8.22085664e-02,  3.44308317e-01,
        2.65597850e-01, -3.00451577e-01,  1.44855291e-01,  4.71555404e-02,
        1.90307163e-02,  2.90072709e-01,  1.30759522e-01,  2.76162863e-01,
        1.75230265e-01, -6.57582805e-02,  3.85398529e-02, -6.87131882e-02,
        9.69880000e-02, -1.67390421e-01, -9.10142809e-02,  2.05520555e-01,
        4.29107063e-02, -6.57997131e-02, -1.61019713e-01, -4.04506505e-01,
        1.41531425e-02, -2.30476856e-01,  2.19237134e-01, -1.10937722e-01,
        1.02468587e-01, -9.79157165e-02, -7.46142864e-02, -7.69622847e-02,
       -8.88517424e-02,  1.71398520e-02,  1.93643287e-01,  3.26947540e-01,
        4.95455675e-02, -1.71902813e-02,  1.57075137e-01,  2.48948596e-02,
       -2.76879996e-01,  1.11022867e-01,  9.19000059e-02, -5.26774228e-02,
        6.60898611e-02, -1.01777709e+00,  2.40281433e-01, -7.11064339e-02,
        4.51328568e-02,  5.90729974e-02,  1.13391005e-01, -3.26913029e-01,
       -7.16882870e-02, -1.97745100e-01,  6.46417066e-02, -9.87487063e-02,
       -5.30570038e-02,  1.01930141e-01, -9.88009647e-02, -8.43843166e-03,
        1.11043565e-01, -1.35540769e-01,  1.27655283e-01, -2.35805824e-01,
        2.02826187e-01, -2.63518579e-02,  9.77357104e-02, -2.41911814e-01,
       -9.88130718e-02,  6.52734265e-02,  3.56577113e-02,  3.56990099e-03,
       -1.31837860e-01,  1.44046575e-01, -2.81907059e-02, -9.81576443e-02,
       -1.47752436e-02, -7.14665651e-02, -1.32309571e-01, -4.93672900e-02,
       -1.84619999e+00,  8.69504288e-02,  2.28673145e-01,  2.17525572e-01,
        8.86931494e-02, -3.49196941e-01, -2.16553703e-01,  7.58744329e-02,
        4.38381471e-02, -1.40141711e-01, -2.07391400e-02,  8.84117112e-02,
        2.27352589e-01,  3.74281704e-02, -1.32738560e-01, -9.07657575e-03,
        6.12827204e-02, -3.68411355e-02,  2.00857431e-01, -3.51425707e-02,
       -3.55821438e-02,  3.40611413e-02,  5.43617122e-02,  8.53485689e-02,
       -2.83111427e-02, -9.20164213e-02,  8.49358588e-02, -1.82713017e-01,
        1.08157143e-01, -1.53087139e-01,  1.35794729e-01, -1.25162890e-02,
        1.60192281e-01, -7.34578595e-02, -3.74644846e-01,  1.06393717e-01,
       -3.30167152e-02, -4.92965877e-02,  9.45858471e-03, -2.05162019e-01,
        8.53839964e-02, -1.15575567e-01, -2.00374439e-01, -1.73942715e-01,
       -1.14028148e-01, -7.96882287e-02, -7.54531473e-02,  8.16881433e-02,
        1.43071130e-01, -2.97021237e-03,  1.12871425e-02,  7.61740059e-02,
       -3.08336437e-01,  1.36619806e-03, -1.77368578e-02,  1.66434571e-01,
       -1.02977138e-02,  1.30870283e-01, -2.66993698e-02,  4.50940058e-02,
        8.27514194e-03, -1.20252989e-01, -1.29719853e-01, -5.60528878e-03,
        2.87312716e-01,  3.22982408e-02,  2.66613928e-03,  8.25715892e-04,
        2.81943709e-01,  1.43556282e-01, -1.04755424e-01,  8.37289914e-02,
       -1.49728566e-01, -9.85008478e-02,  1.09441550e-02,  4.33447175e-02,
       -1.80495784e-01, -4.64711748e-02, -1.16397150e-01, -4.20853682e-02,
       -4.46927138e-02,  4.53404300e-02,  1.88081563e-02,  4.89265732e-02,
        6.05882816e-02, -3.10234297e-02,  1.89144313e-02,  1.46742135e-01,
       -2.00652853e-01, -1.28169283e-01, -1.99491426e-01,  2.71632690e-02,
        6.68754801e-02, -4.79014255e-02,  2.00150281e-01, -4.61449958e-02,
        2.52022911e-02, -1.14205129e-01, -9.52874348e-02,  5.95805645e-02,
       -5.59562966e-02,  4.11645211e-02, -1.22511014e-01,  9.06759873e-02,
        1.96193129e-01, -3.29734012e-02, -3.23014450e-03, -2.19254717e-01,
       -2.41627008e-01,  1.17962845e-01, -1.18746571e-01, -1.49619982e-01,
       -7.90571515e-03,  2.41627172e-01, -1.69894267e-02,  3.32376450e-01,
        1.30081428e-02, -2.05825698e-02,  2.70169042e-02,  1.14712432e-01,
        4.37868424e-02,  1.26330018e-01, -1.85477287e-01,  3.10937595e-02,
        1.59345642e-01,  1.94155708e-01,  7.24185780e-02,  2.47033685e-02,
        8.27942938e-02,  1.61020145e-01,  1.49450287e-01,  3.09002846e-02,
       -1.49694577e-01, -2.64910012e-01, -1.13315262e-01,  8.05728585e-02,
       -1.44515008e-01,  1.37744620e-01,  3.91963422e-02,  3.92842859e-01,
        1.37859717e-01,  1.05463997e-01,  2.92863157e-02, -6.87809959e-02,
       -3.96080017e-02,  4.15625758e-02,  2.71515697e-01,  6.74895719e-02,
       -2.44102869e-02,  8.35534334e-02, -2.16302887e-01,  1.65082011e-02,
       -1.67593569e-01, -5.22805750e-02, -4.26700059e-03,  4.76507172e-02,
       -5.41351363e-02, -2.59147853e-01,  1.98398128e-01,  5.48950024e-02],
      dtype=float32)

Again: no idea what it means, but we'll trust it works (and we'll see later on!).
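
"Combinations" is meant literally, by the way: for this spaCy model, the sentence vector should just be the average of the individual word vectors. A quick way to double-check that claim (assuming en_core_web_md's averaging behavior):

import numpy as np

doc = nlp('Some people have never eaten a taco')

# Average the word vectors ourselves...
averaged = np.mean([token.vector for token in doc], axis=0)

# ...and compare to the vector spaCy assigns the whole sentence
print(np.allclose(doc.vector, averaged))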

In order to find the similarity of each of our sentences, we'll need to convert them each into vectors.

# We aren't printing these because it's 4 * 300 = 1,200 numbers
vectors = [nlp(sentence).vector for sentence in sentences]

# Print out some notes about it
print("We have", len(vectors), "different vectors")
print("And the first one has", len(vectors[0]), "measurements")
print("And the second one has", len(vectors[1]), "measurements")
print("And the third one has", len(vectors[2]), "measurements")
print("And the fourth one has", len(vectors[3]), "measurements")
We have 4 different vectors
And the first one has 300 measurements
And the second one has 300 measurements
And the third one has 300 measurements
And the fourth one has 300 measurements

It might be useful to compare these 300 measurements-per-sentence to what we were doing before. If we look back to when we were doing similarity by counting words, we only had eleven measurements for each sentence: one count for every unique word.

counts
ate carp consumed donut fears fish jen lenny lions molly the
Molly ate a donut 1 0 0 1 0 0 0 0 0 1 0
Molly ate a fish 1 0 0 0 0 1 0 0 0 1 0
Jen consumed a carp 0 1 1 0 0 0 1 0 0 0 0
Lenny fears the lions 0 0 0 0 1 0 0 1 1 0 1

Those extra 289 dimensions that word embeddings bring are packed with concepts instead of just word comparisons. Fishiness, eatinginess, catlikeness, it's all in there! (kind of, maybe, somewhat)

We'll use the same similarity measure as before, cosine similarity. It takes our numeric description of each sentence and measures the angle between each pair.

# Compute similarities
similarities = cosine_similarity(vectors)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)
Molly ate a donut Molly ate a fish Jen consumed a carp Lenny fears the lions
Molly ate a donut 1 0.86394 0.697743 0.346385
Molly ate a fish 0.86394 1 0.860143 0.442142
Jen consumed a carp 0.697743 0.860143 1 0.491809
Lenny fears the lions 0.346385 0.442142 0.491809 1

And there we go! The first thing to notice is that none of these sentences are totally dissimilar. Last time we had multiple zeroes - this time there are none at all, not even with Lenny's weird adventures.

Next up, and most importantly: "Molly ate a donut" and "Molly ate a fish" are still very similar, with a score in the high 0.8s - but it turns out "Molly ate a fish" is almost just as similar to "Jen consumed a carp"!

  • 0.86394 eating a donut vs. eating a fish
  • 0.86014 eating a fish vs. consuming a carp

Thanks to spaCy's word embeddings understanding the concepts behind the words, the computer was able to measure the similarity between "ate" and "consumed," along with "carp" and "fish." As a result we got magically high scores despite the important words not matching exactly!

We might also note that the "fish" and "carp" sentences are slightly more similar to Lenny's lions than the donut sentence. Might it be because those three involve animals?
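
One way to poke at that theory is to compare the individual words. If the animal connection is real, "fish" should land closer to "lions" than "donut" does:

# Are the animal words pulling these sentences together?
print(nlp('fish').similarity(nlp('lions')))
print(nlp('donut').similarity(nlp('lions')))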

Gotchas

While this is definitely exciting, using word embeddings for similarity isn't without its limits. The biggest issue is the gap between what we think is similar and what the computer thinks is similar. Because of the black-box magic of word embeddings, we can't exactly ask the computer which kinds of similarity it's paying attention to!

# Here are our sentences
sentences = [
    'Veronica hates mustard',
    'Veronica loves ketchup',
    'Joseph hates ketchup',
]
# Turn into vectors
vectors = [nlp(sentence).vector for sentence in sentences]

# Compute similarities
similarities = cosine_similarity(vectors)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)
Veronica hates mustard Veronica loves ketchup Joseph hates ketchup
Veronica hates mustard 1 0.866612 0.824512
Veronica loves ketchup 0.866612 1 0.823573
Joseph hates ketchup 0.824512 0.823573 1

So what's more similar: two people hating condiments, or two people having emotional feelings about ketchup? According to the computer it's nearly a tie - and it can't tell us why. This is a terrible example, but you'll survive until I figure out a better one.
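
One clue, though: words that show up in similar contexts get similar embeddings - and that includes opposites. The exact number depends on the model, but expect it to be surprisingly high:

# "loves" and "hates" appear in very similar contexts,
# so their embeddings end up close together
print(nlp('loves').similarity(nlp('hates')))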

Review

In this section, we looked at how word embeddings allow more in-depth comparisons between texts than plain word counting, letting the computer pick up on nuance and conceptual similarity instead of demanding exact word-by-word matches.

There are shortcomings to both word counting and word embeddings, but depending on whether you're looking for exact matches or a more conceptual pairing, either one can be the correct choice.

Discussion topics

I'm looking for news websites that republish others' work with slight edits and no attribution. Would I search for them using word counts or word embeddings? What would be the benefit of doing this versus reading them manually?

If I wanted to compare the platforms of two politicians, should I use word counts or word embeddings? What would be the benefit of doing this versus reading them manually?

If I wanted to compare the platforms of candidates in hundreds of local races, should I use word counts or word embeddings? What would be the benefit of doing this versus reading them manually? What about using something like topic modeling instead?