Comparing documents across languages with Universal Sentence Encoding and Tensorflow#

What do we do when we have terabytes of documents scattered across multiple languages? Well, if we find one document that's interesting, we might want to ask the computer to find anything that's similar to it. If we ask especially politely, we can have it find similar documents even in a different language.

I found out about this technique from a writeup of Quartz's analysis of the Luanda Leaks. I recommend giving it a read-through before you work through this section, just for a bit of context.

Note: I talk about documents a lot in this section, but what we're really interested in is sentences. When we get to the next section - how to apply these techniques to large datasets - the difference will become more clear.

Let's say we have a handful of sentences.

import pandas as pd

sentences = [
    "Molly ate a fish",
    "Jen consumed a carp",
    "I would like to sell you a house",
    "Я пытаюсь купить дачу", # I'm trying to buy a summer home
    "J'aimerais vous louer un grand appartement", # I would like to rent a large apartment to you
    "This is a wonderful investment opportunity",
    "Это прекрасная возможность для инвестиций", # investment opportunity
    "C'est une merveilleuse opportunité d'investissement", # investment opportunity
    "これは素晴らしい投資機会です", # investment opportunity
    "野球はあなたが思うよりも面白いことがあります", # baseball can be more interesting than you think
    "Baseball can be interesting than you'd think"
]

I used Google Translate to mix and match between languages - some Russian, some Japanese, some French - to varying degrees of similarity. Some are exactly the same (investment opportunities), while others are only roughly about the same topic (renting or buying houses/apartments).

Without spending time going through them one-by-one ourselves, how can we find sentences that are similar to one another?

Old method: Counting words#

Traditionally, document similarity is based on the words two documents have in common.

First, we'll note which words appear in each sentence (we're using binary=True, so we only mark whether a word shows up, not how many times).

from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(sentences)
counts = pd.DataFrame(
    matrix.toarray(),
    index=sentences,
    columns=vectorizer.get_feature_names())
counts.head()
aimerais appartement ate baseball be can carp consumed est fish ... возможность дачу для инвестиций купить прекрасная пытаюсь это これは素晴らしい投資機会です 野球はあなたが思うよりも面白いことがあります
Molly ate a fish 0 0 1 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
Jen consumed a carp 0 0 0 0 0 0 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
I would like to sell you a house 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Я пытаюсь купить дачу 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 1 0 1 0 0 0
J'aimerais vous louer un grand appartement 1 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 44 columns

Then we'll see how many words each sentence has in common with each other sentence. The more words two sentences have in common, the higher their similarity should be.
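(In case cosine similarity is new to you: it's the dot product of two vectors divided by the product of their lengths. As a minimal sketch of what's going on under the hood, here's the 0.154 you'll see in the matrix below, hand-computed with numpy - one shared word, "you", between a sentence with six counted words and one with seven. The vectors here are typed in by hand just for illustration.)

import numpy as np

# "I would like to sell you a house" keeps 6 words after CountVectorizer
# drops one-letter tokens, "Baseball can be interesting than you'd think"
# keeps 7, and the only word they share is "you"
vec_house = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
vec_baseball = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

# Dot product divided by the product of the vector lengths
vec_house @ vec_baseball / (np.linalg.norm(vec_house) * np.linalg.norm(vec_baseball))
# 0.1543...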

from sklearn.metrics.pairwise import cosine_similarity

# Compute the similarities using the word counts
similarities = cosine_similarity(matrix)

# Make a fancy colored dataframe about it
pd.DataFrame(similarities,
             index=sentences,
             columns=sentences) \
            .style \
            .background_gradient(axis=None)
Molly ate a fish Jen consumed a carp I would like to sell you a house Я пытаюсь купить дачу J'aimerais vous louer un grand appartement This is a wonderful investment opportunity Это прекрасная возможность для инвестиций C'est une merveilleuse opportunité d'investissement これは素晴らしい投資機会です 野球はあなたが思うよりも面白いことがあります Baseball can be interesting than you'd think
Molly ate a fish 1 0 0 0 0 0 0 0 0 0 0
Jen consumed a carp 0 1 0 0 0 0 0 0 0 0 0
I would like to sell you a house 0 0 1 0 0 0 0 0 0 0 0.154303
Я пытаюсь купить дачу 0 0 0 1 0 0 0 0 0 0 0
J'aimerais vous louer un grand appartement 0 0 0 0 1 0 0 0 0 0 0
This is a wonderful investment opportunity 0 0 0 0 0 1 0 0 0 0 0
Это прекрасная возможность для инвестиций 0 0 0 0 0 0 1 0 0 0 0
C'est une merveilleuse opportunité d'investissement 0 0 0 0 0 0 0 1 0 0 0
これは素晴らしい投資機会です 0 0 0 0 0 0 0 0 1 0 0
野球はあなたが思うよりも面白いことがあります 0 0 0 0 0 0 0 0 0 1 0
Baseball can be interesting than you'd think 0 0 0.154303 0 0 0 0 0 0 0 1

Pretty boring, right? These sentences share almost no words (ignoring things like a or the), so the only two sentences that are actually marked as similar are...

  • Baseball can be interesting than you'd think
  • I would like to sell you a house

...because they both contain the word you! While that's useless, it isn't unexpected: these sentences are all in different languages, so how in the world are we supposed to judge whether they're similar or not?
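If you'd rather not eyeball the grid for that, here's a quick sketch for pulling the single most similar pair out of the matrix programmatically (it reuses the similarities and sentences variables from above):

import numpy as np

# Zero out the diagonal, since every sentence matches itself perfectly
masked = similarities.copy()
np.fill_diagonal(masked, 0)

# Grab the row and column of the highest remaining score
i, j = np.unravel_index(masked.argmax(), masked.shape)
print(sentences[i], "<->", sentences[j], round(masked[i, j], 3))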

New method: Universal sentence encoder#

Once upon a time we talked about word embeddings, which are ways for each word to have multiple dimensions of meaning. "cat" and "lion" might both be catlike, while "lion" and "wolf" are both wild.

Imagine a graph that looks like this, but with three hundred dimensions:

To find words that are similar, you just find ones that are close to each other in that 300-dimension space: a certain amount about cats, a certain amount wild, a certain amount edible, a certain amount red, etc etc etc. Notice in the chart above, shoe is far far off to the left: that means it isn't very similar to those other four words! If you haven't seen it yet, it's a great idea to go read our word embeddings page for more details.
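To make "close to each other in that space" a little more concrete, here's a minimal sketch with made-up three-dimensional word vectors - the real thing uses hundreds of dimensions and learned values, not numbers I typed in by hand:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical tiny embeddings: [catlike, wild, wearable]
word_vectors = {
    "cat":  [0.9, 0.3, 0.0],
    "lion": [0.8, 0.9, 0.0],
    "wolf": [0.1, 0.9, 0.0],
    "shoe": [0.0, 0.0, 1.0],
}

# Words that point in similar directions get high scores: cat/lion and
# lion/wolf end up close together, while shoe is far from everything
sims = cosine_similarity(list(word_vectors.values()))
pd.DataFrame(sims, index=word_vectors.keys(), columns=word_vectors.keys()).round(2)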

Researchers took this idea of word embeddings and used some fun computer magic to take it one step further: they learned to apply it across different languages!

We aren't talking about just strict translation! While yes, cat and gato are just translations of the same word, multi-language word embeddings mean a lot more. A sentence that talks about meowing can be marked as similar to one that talks about gatos, even though the words aren't exact translation matches, just because both of those words are cat-related!

The Multilingual Universal Sentence Encoder is our new best friend. Using it along with Tensorflow, we'll be able to match up our similar sentences, even if they're in completely different languages.

Big thanks to Jeremy Merrill's tensorflow v1 example Asia spa, even though I can't agree with his choice in bagels.

And hey, 300 dimensions? Forget about that, let's upgrade to 512.

The code#

If you need to install tensorflow or its associated packages, uncomment and run the next line. Otherwise we're pretty much good to go!

# !pip install tensorflow tensorflow_hub tensorflow_text
# Import tensorflow and friends

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text

We'll start by loading the Multilingual Universal Sentence Encoder. We're using version 3, which is super user-friendly.

I believe this requires that we're on Tensorflow v2, but don't quote me on that.

# Load the Multilingual Universal Sentence Encoder, v3
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

We can now use this embed function to create our multilingual sentence embeddings. Congratulations!

What's it look like when we run an encoding? Let's find the 512 dimensions of knowledge about bagels.

embed("the only kind of bagel is everything")
<tf.Tensor: id=57071, shape=(1, 512), dtype=float32, numpy=
array([[-3.69759873e-02,  3.79814878e-02, -1.50387250e-02,
        -3.46106850e-02,  2.21144240e-02,  5.16897328e-02,
         8.20917264e-03,  1.37943355e-02, -3.79155353e-02,
        -1.65961019e-03,  5.37911337e-03,  1.48542887e-02,
         7.86846355e-02, -2.62473281e-02,  6.43585697e-02,
         4.98673990e-02, -7.89802819e-02, -3.48499864e-02,
         7.56129548e-02, -2.97897067e-02,  1.87768098e-02,
         6.11422174e-02,  9.61908046e-03,  8.94820690e-03,
        -6.60641526e-04, -3.11440807e-02, -1.06579633e-02,
        -3.30661237e-02,  5.29161189e-03,  4.56077345e-02,
        -2.63070073e-02, -2.36417707e-02,  4.46549021e-02,
        -5.67555539e-02,  5.66278994e-02,  4.85747606e-02,
         7.41910040e-02,  2.24836003e-02, -1.96227692e-02,
        -3.48150916e-02, -7.31992200e-02, -6.30672723e-02,
         3.54410671e-02,  1.33525990e-02,  7.31556565e-02,
         3.63616413e-03, -5.82444593e-02, -2.85111647e-02,
        -9.70507860e-02,  3.93075272e-02, -3.62347774e-02,
         1.41324457e-02,  8.10919795e-03,  2.64607463e-02,
         7.92743415e-02,  5.81673682e-02, -2.54460387e-02,
        -6.31796196e-02, -3.43535841e-02,  5.83359823e-02,
        -1.39280595e-02, -7.32193366e-02, -7.12036788e-02,
        -3.38253565e-03,  1.41925523e-02, -2.09060572e-02,
         7.14521483e-02, -2.88539138e-02, -4.43585776e-02,
         1.80798536e-03,  5.03119938e-02, -9.52464435e-03,
         2.14359239e-02,  7.95859657e-03,  3.79250906e-02,
         6.16297275e-02, -3.85400630e-03,  2.98931412e-02,
         4.10915278e-02,  6.33522123e-02, -9.40413550e-02,
         7.22554773e-02, -1.00268330e-02,  2.46127564e-02,
        -5.24484999e-02,  4.80766334e-02,  8.06390960e-03,
        -4.76065874e-02,  3.68852727e-02,  7.41375517e-03,
        -1.02010332e-02, -3.20407562e-02,  1.08915577e-02,
        -3.08416206e-02,  3.15842703e-02,  4.89321686e-02,
        -6.17381521e-02, -4.41623442e-02,  4.48219944e-03,
         2.18568556e-02, -5.12665920e-02, -5.68548255e-02,
         4.24940288e-02,  6.21532612e-02, -7.99592286e-02,
        -8.00034124e-03,  6.50194362e-02, -2.80270353e-02,
        -9.41600942e-04,  3.31163704e-02, -5.82195772e-03,
         3.99386957e-02, -1.48728639e-02, -1.39280427e-02,
        -4.34936285e-02, -5.35531938e-02,  4.61341180e-02,
        -6.81031421e-02,  8.82902965e-02, -3.97792123e-02,
         9.68311680e-04, -7.61798546e-02, -7.40375221e-02,
        -5.15683740e-02,  3.47172446e-03, -2.93960050e-02,
         1.99779384e-02,  8.74220729e-02, -4.94794920e-02,
         8.30933452e-02, -1.67170428e-02,  3.00323237e-02,
        -8.55879486e-02,  2.87602339e-02, -9.60664824e-02,
         7.32482746e-02, -2.68924031e-02,  3.78773212e-02,
        -4.59613875e-02, -6.91506565e-02,  6.93772361e-03,
         3.46894227e-02, -8.89625959e-03, -7.16783032e-02,
         4.37109321e-02,  5.09838909e-02, -6.21132553e-02,
         7.74390697e-02,  3.44788730e-02, -6.27935631e-03,
         1.39412303e-02,  7.35700056e-02, -9.47634727e-02,
        -3.50511447e-02,  6.94617331e-02, -5.53163961e-02,
         5.81471175e-02, -7.69591704e-02, -2.11736914e-02,
        -6.06859252e-02,  7.15053827e-02,  4.46358547e-02,
         2.42748298e-02,  1.54749798e-02,  1.08365268e-02,
         7.99995139e-02,  7.58065060e-02,  1.51214665e-02,
         2.03052592e-02,  5.27294874e-02,  4.77281176e-02,
         6.26818761e-02, -5.47395786e-04, -5.85503988e-02,
         5.47178611e-02,  1.02013946e-02, -3.36555950e-02,
         1.39712142e-02,  6.68759570e-02, -7.22111240e-02,
         2.58826390e-02,  1.74345840e-02, -7.67405927e-02,
        -5.33879586e-02,  4.12015244e-02, -1.79446824e-02,
         2.44576298e-02,  3.08953561e-02, -1.60510410e-02,
         8.39557797e-02, -2.60847881e-02,  4.11604904e-02,
        -1.43767996e-02, -5.31761311e-02,  3.51675530e-03,
         1.25689059e-02, -5.22525683e-02, -5.62273245e-03,
         4.22066338e-02,  3.73546854e-02,  1.15205310e-02,
         2.56110486e-02, -1.66541934e-02,  5.23796529e-02,
        -2.89855432e-02,  1.38174165e-02,  9.33580920e-02,
         1.20746475e-02, -8.60168412e-02, -7.53229558e-02,
        -3.66476886e-02, -4.30206582e-02,  7.09665066e-04,
         2.38361638e-02, -2.19409186e-02,  6.36263192e-02,
         4.53140447e-03,  2.11156346e-02,  5.57899475e-02,
        -6.80286139e-02, -4.37521338e-02, -8.21405202e-02,
         8.25821515e-03, -3.10159177e-02,  6.52143434e-02,
         3.32336314e-02, -5.03658084e-03, -7.25874230e-02,
         8.72287974e-02, -3.60807404e-02,  5.41775525e-02,
         1.50700854e-02,  7.90126026e-02,  2.86863651e-02,
         6.87979832e-02, -2.88775545e-02, -2.95095537e-02,
        -2.79238932e-02, -4.64438200e-02, -5.07920384e-02,
        -5.23046516e-02,  4.12296280e-02, -6.07346883e-03,
         6.44223839e-02,  2.46095266e-02,  2.52780542e-02,
         1.75630152e-02,  2.47574542e-02, -4.24813665e-02,
         9.73835267e-05,  9.94504150e-03, -6.55800030e-02,
         1.38729962e-03, -9.11064446e-03,  3.37656867e-03,
        -4.93610874e-02,  1.71818975e-02, -1.59767941e-02,
         6.33369461e-02,  5.42201772e-02,  1.25628002e-02,
         4.61697951e-02, -2.61488259e-02, -8.83363858e-02,
        -3.27492096e-02,  2.56966278e-02, -1.69585310e-02,
        -1.31883780e-02, -5.96446320e-02, -1.93749368e-02,
         6.35374635e-02,  4.03213799e-02,  1.50206396e-02,
        -3.30444537e-02,  4.51200977e-02, -3.72802652e-02,
         1.53144859e-02,  3.61363068e-02, -5.15875295e-02,
         4.27309833e-02,  8.54399148e-03, -5.93104064e-02,
        -8.88970029e-03,  5.95029034e-02,  1.43050188e-02,
         4.82057557e-02, -4.50867079e-02, -1.42679838e-02,
        -1.75049808e-02, -6.97534010e-02,  3.26799080e-02,
        -4.25592512e-02,  3.98812480e-02,  4.43578139e-02,
         6.87086061e-02, -7.22177699e-02, -6.84368089e-02,
         2.63370145e-02, -5.19983796e-03,  3.33114676e-02,
        -4.62440811e-02, -1.71188023e-02,  2.20262837e-02,
        -4.01439182e-02, -6.31575752e-03,  1.39666954e-02,
         4.65051495e-02,  2.49833278e-02, -6.01417758e-02,
         2.07149461e-02,  4.24126051e-02,  2.20183656e-02,
         1.85010955e-02, -4.78874706e-02,  4.42837588e-02,
        -8.97486694e-03, -4.83428985e-02, -3.95011716e-02,
        -6.19368851e-02, -3.97754647e-02, -7.47699961e-02,
        -7.32123554e-02, -7.45374337e-02, -7.39914924e-02,
         5.96006354e-03, -4.28537801e-02,  1.54198408e-02,
         4.98052947e-02,  6.51330724e-02, -2.96430737e-02,
        -1.49712358e-02, -1.08850775e-02, -5.07013239e-02,
         4.29822225e-03,  4.53428328e-02, -7.38566695e-03,
        -7.25991949e-02,  4.44002971e-02, -6.75813779e-02,
         1.18211927e-02, -2.97866892e-02, -3.73482518e-02,
        -4.67794947e-02,  3.05357184e-02, -1.21647986e-02,
         1.03800138e-02, -7.16410875e-02, -1.92064494e-02,
         6.72035292e-02, -2.99240481e-02, -7.09833428e-02,
        -6.13728836e-02,  2.70982310e-02, -4.65584062e-02,
         5.95511980e-02, -1.07485009e-02, -9.09862742e-02,
         6.19890727e-02,  2.46958770e-02, -4.43307031e-03,
        -3.04338802e-02, -2.94903982e-02, -1.95469502e-02,
        -6.29114499e-03, -2.35814806e-02, -2.30679251e-02,
        -4.01032381e-02,  3.82015258e-02, -1.01673668e-02,
         5.97134419e-03,  6.34997785e-02,  1.98718235e-02,
         5.89793250e-02, -4.62367833e-02, -4.86558117e-02,
        -3.51219401e-02, -3.38688605e-02, -3.06257401e-02,
        -6.32720068e-02, -3.25872265e-02,  5.16675413e-02,
        -3.51945013e-02,  4.85528074e-03,  1.71884224e-02,
         7.72346463e-03, -6.55070394e-02,  1.26291877e-02,
        -5.99653758e-02,  2.14297213e-02,  3.52965854e-02,
        -3.97071242e-03, -3.85490581e-02, -1.08859958e-02,
        -1.69256963e-02, -1.45414770e-02, -4.00506631e-02,
        -1.26000894e-02,  2.80001177e-03, -6.67512044e-03,
         5.08578978e-02, -1.37485405e-02, -6.61612749e-02,
         6.23165704e-02,  6.67946637e-02,  7.26433694e-02,
         2.15116981e-02, -4.77252118e-02,  7.99191836e-03,
        -4.56132516e-02,  3.04939933e-02, -2.27753241e-02,
        -3.81513499e-02,  6.66936934e-02, -2.02692579e-02,
         5.10043018e-02,  5.38241118e-03,  5.10982908e-02,
         6.05449863e-02,  2.77093835e-02,  5.21293879e-02,
         3.06411199e-02,  2.29520258e-03,  2.54960638e-02,
        -2.53749061e-02,  5.16510755e-02,  3.49155366e-02,
        -1.76921170e-02, -4.21949057e-03,  5.75346649e-02,
         3.40715274e-02, -2.60011870e-02, -2.31301617e-02,
        -2.24575177e-02,  4.20966148e-02,  7.15262294e-02,
         2.84943520e-03,  5.55586033e-02, -8.45558718e-02,
        -8.70346278e-02,  2.86608059e-02,  1.87469982e-02,
        -5.04754484e-02, -5.69880530e-02,  7.74223544e-03,
         3.73192341e-03, -5.65687902e-02,  8.77455547e-02,
         9.47866775e-03, -3.28676626e-02, -4.45270129e-02,
        -3.44688296e-02,  3.46173309e-02, -1.59422085e-02,
        -7.16032758e-02, -3.50505151e-02,  2.19682138e-02,
        -1.15693994e-02,  5.15987119e-03,  1.24197965e-02,
         4.86385562e-02,  4.66769412e-02, -3.39384414e-02,
        -5.91628812e-03, -3.57727185e-02,  2.89626531e-02,
         7.08281025e-02,  2.87774038e-02, -8.60370994e-02,
         4.42840196e-02,  4.36315611e-02, -3.02716661e-02,
         5.86745255e-02,  8.80599860e-03,  2.31303275e-02,
         8.78818426e-03,  3.76377404e-02, -5.98288625e-02,
        -2.32752468e-02,  5.25611602e-02, -7.05482140e-02,
        -3.60888466e-02, -4.51437533e-02,  3.18690725e-02,
         6.47546276e-02,  3.91254425e-02, -1.38891526e-02,
         1.20653771e-02, -3.18169221e-02, -1.03919273e-02,
         5.05215973e-02, -2.71414015e-02,  2.72577051e-02,
         6.02792948e-02, -1.34695508e-02,  2.01314427e-02,
        -3.72480750e-02,  4.02763300e-02, -5.74180968e-02,
        -3.81324664e-02, -3.94039601e-03, -1.68544881e-03,
        -1.97184626e-02, -6.36554360e-02, -4.06978801e-02,
        -7.84360990e-03,  4.36188653e-02,  1.68496016e-02,
        -5.26079684e-02, -4.31789458e-02, -3.07654589e-02,
        -3.78476046e-02, -1.37724436e-03]], dtype=float32)>
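That big wall of numbers is a Tensorflow tensor. If you'd rather work with a plain numpy array - say, to save it somewhere or hand it to another library - here's a quick sketch, assuming Tensorflow 2's eager mode (which is what the hub model is running in above):

# Convert the tensor to numpy: one row per input sentence, 512 columns each
vector = embed("the only kind of bagel is everything").numpy()
vector.shape
# (1, 512)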

Fun, right? So now we're going to feed all of our sentences into the encoder. Each sentence will get its own 512-dimensional representation, and then we'll use that to see which ones are close to each other.

# Generate embeddings for each sentence
embeddings = embed(sentences)
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarities exactly the same as we did before!
similarities = cosine_similarity(embeddings)

# Turn into a dataframe
pd.DataFrame(similarities,
            index=sentences,
            columns=sentences) \
            .style \
            .background_gradient(axis=None)
Molly ate a fish Jen consumed a carp I would like to sell you a house Я пытаюсь купить дачу J'aimerais vous louer un grand appartement This is a wonderful investment opportunity Это прекрасная возможность для инвестиций C'est une merveilleuse opportunité d'investissement これは素晴らしい投資機会です 野球はあなたが思うよりも面白いことがあります Baseball can be interesting than you'd think
Molly ate a fish 1 0.527974 0.069064 0.0583723 0.0330744 -0.013103 -0.0262051 0.0200289 -0.053362 0.081585 0.119151
Jen consumed a carp 0.527974 1 0.101584 0.138269 0.0447615 0.00845337 -0.0199944 0.0514989 0.00944404 0.0830695 0.147007
I would like to sell you a house 0.069064 0.101584 1 0.52998 0.542384 0.231101 0.215794 0.187328 0.214123 0.149138 0.182979
Я пытаюсь купить дачу 0.0583723 0.138269 0.52998 1 0.30713 0.156921 0.145542 0.169162 0.13936 -0.0209739 0.0458156
J'aimerais vous louer un grand appartement 0.0330744 0.0447615 0.542384 0.30713 1 0.283597 0.275903 0.279139 0.2666 0.162576 0.169971
This is a wonderful investment opportunity -0.013103 0.00845337 0.231101 0.156921 0.283597 1 0.920411 0.902763 0.90484 0.0907904 0.191868
Это прекрасная возможность для инвестиций -0.0262051 -0.0199944 0.215794 0.145542 0.275903 0.920411 1 0.885628 0.824693 0.0500936 0.147731
C'est une merveilleuse opportunité d'investissement 0.0200289 0.0514989 0.187328 0.169162 0.279139 0.902763 0.885628 1 0.831138 0.094717 0.192856
これは素晴らしい投資機会です -0.053362 0.00944404 0.214123 0.13936 0.2666 0.90484 0.824693 0.831138 1 0.104263 0.230147
野球はあなたが思うよりも面白いことがあります 0.081585 0.0830695 0.149138 -0.0209739 0.162576 0.0907904 0.0500936 0.094717 0.104263 1 0.703603
Baseball can be interesting than you'd think 0.119151 0.147007 0.182979 0.0458156 0.169971 0.191868 0.147731 0.192856 0.230147 0.703603 1

Magic, right?

The ones about housing are all grouped together, investment opportunities are marked as similar, and baseball as well. You'll notice it (somewhat obviously) even works within the same language - Jen consumed a carp and Molly ate a fish are both similar.
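To close the loop on the "find me anything similar" idea from the top of the page, here's a minimal sketch of the lookup workflow: embed a brand-new query sentence and rank our existing sentences against it. The query is just something I made up, and the housing-related sentences should float to the top:

# Embed a new sentence and compare it to the embeddings we already have
query = embed(["Quiero comprar una casa grande"]) # Spanish: "I want to buy a big house"

scores = cosine_similarity(query, embeddings)[0]
pd.Series(scores, index=sentences).sort_values(ascending=False).head()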

While this is fun conceptually and all, next up we'll see how to put this into production use!