What are word embeddings?#

Our relationship is troubled! We like words, but computers like math. Word embeddings are a way of bridging that gap (and saving our love!).

The problem#

You know how when we look at a crazy math formula, maybe our brain explodes a little?

\begin{equation*} \left( \sum_{k=1}^n a_k b_k \right)^2 \leq \left( \sum_{k=1}^n a_k^2 \right) \left( \sum_{k=1}^n b_k^2 \right) \end{equation*}

Yeah, that's exactly how computers feel when you use words. In the same way we might say "that weird angry capital E thing" to refer to Σ, computers look at the word "cat" and are like "uh, 0x63 0x61 0x74?"

While software might be able to understand that cat is three letters long, it's a c and an a and a t, and look up the definition in a dictionary for us, the computer doesn't really emotionally know what cats are. It can't feel what a cat is, know about its fur or how it meows, know about how it sleeps in the sun or tears apart our furniture or cruelly makes us take it to the vet on Christmas Day.
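
To make that concrete, here's a quick peek at everything the computer actually receives when we type "cat" - three byte values and nothing more:

# "cat" as the computer sees it: three bytes, no fur, no meowing
raw = 'cat'.encode('utf-8')
print(raw)                    # b'cat'
print([hex(b) for b in raw])  # ['0x63', '0x61', '0x74']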

Word embeddings are a way of bridging that gap, a way of using math to describe all of those delightful/horrible things about cats (and everything else).

An axis of meaning#

Let's say we have the concept of a cat. Everything we know about a cat, thrown down on the screen, all of it sitting inside a little pink dot. We'll make it look computational so the computer doesn't get scared yet.

[Figure: a single pink dot representing "cat"]

So far so good! Cats don't exist in the world by themselves, though, they exist in relation to other things. Like dogs, for example. Dogs are different than cats, so they should go... somewhere on the other side from cats, I guess?

Cool, great, amazing, wonderful.

It makes as much sense as something meaningless can, but it doesn't seem very much like math. Let's add an axis label to explain to the computer what's changing between "dog" on the left and "cat" on the right.
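
A chart like that doesn't take much to draw. Here's a rough matplotlib sketch - the positions come from our cat points, but the styling and the "More catlike" label are my own guesses, not the original chart's code:

import matplotlib.pyplot as plt

# Hypothetical recreation: dog at 0 and cat at 4 on a single "catlike" number line
fig, ax = plt.subplots(figsize=(8, 1.5))
ax.plot([0, 4], [0, 0], 'o', color='pink', markersize=15)
ax.text(0, 0.3, 'dog', ha='center')
ax.text(4, 0.3, 'cat', ha='center')
ax.set_xlim(-1, 5)
ax.set_ylim(-1, 1)
ax.set_yticks([])
ax.text(-1, -0.8, 'Less catlike', ha='left')
ax.text(5, -0.8, 'More catlike', ha='right')
plt.show()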

[Figure: cat and dog plotted along a single "catlike" axis, labeled "Less catlike" at one end]

If we count those little lines as points, we can see that cat is four points more catlike than dog. That's math! We can even put it into a pandas dataframe:

import pandas as pd

pd.DataFrame([
    { 'name': 'cat', 'cat_points': 4 },
    { 'name': 'dog', 'cat_points': 0 }
])
  name  cat_points
0  cat           4
1  dog           0

There are more animals than just cats and dogs, though, so let's add 'em! How about... a lion?

Lions are pretty catlike, but they're bigger and stronger and more powerful than most of the housecats that live with me (no offense). So we can give them slightly fewer cat points than cats, but definitely not push them anywhere near as far over as dogs.

[Figure: cat, lion, and dog plotted along the "catlike" axis]

And again, because computers love spreadsheets and counting, we can make another dataframe.

pd.DataFrame([
    { 'name': 'cat', 'cat_points': 4 },
    { 'name': 'dog', 'cat_points': 0 },
    { 'name': 'lion', 'cat_points': 3.5 }
])
   name  cat_points
0   cat         4.0
1   dog         0.0
2  lion         3.5

I've heard rumors of even more animals, so let's keep going. How about wolves?

A wolf is definitely much closer to a dog than to a cat. Since a wolf is more intimidating than a dog, I think it belongs even further away from cats than dog is.

[Figure: cat, lion, dog, and wolf plotted along the "catlike" axis]

And just so the computer won't feel left out, we can put it into a dataframe to make it nice and math-y.

pd.DataFrame([
    { 'name': 'cat', 'cat_points': 4 },
    { 'name': 'dog', 'cat_points': 0 },
    { 'name': 'lion', 'cat_points': 3.5 },
    { 'name': 'wolf', 'cat_points': -0.5 }
])
   name  cat_points
0   cat         4.0
1   dog         0.0
2  lion         3.5
3  wolf        -0.5

Another dimension#

We've all been to the zoo, we're all animal scientists, we've all watched Beastars, and we're all very very angry at this classification. Why are wolves and lions separated by dogs? How does that make any sense?

Sure, lions and cats are both felines, and wolves and dogs are both canines, but let's think about it:

  • Cats: totally domesticated
  • Dogs: totally domesticated
  • Wolves: totally wild
  • Lions: totally wild

If we're teaching our computer with just "hey this is like a cat" or "hey this is less like a cat" it isn't going to learn anything important. This is the nuance of our human experience of the world that computers are missing out on!

It's this nuance we're going to teach right now by giving our graph a brand new axis: wild or domesticated.

[Figure: cat, dog, lion, and wolf plotted on two axes, "catlike" and "More wild"]

Look at that beauty!!! It's explaining everything I could ever want. And just so we don't leave out the computer:

pd.DataFrame([
    { 'name': 'cat', 'cat_points': 4, 'wildness': 0.5 },
    { 'name': 'dog', 'cat_points': 0, 'wildness': 0 },
    { 'name': 'lion', 'cat_points': 3.5, 'wildness': 4 },
    { 'name': 'wolf', 'cat_points': -0.5, 'wildness': 4 }
])
   name  cat_points  wildness
0   cat         4.0       0.5
1   dog         0.0       0.0
2  lion         3.5       4.0
3  wolf        -0.5       4.0

This is an excellent graph, and it's an excellent (though not perfect) way to describe all sorts of animals! We can describe a few just for fun:

  • Tigers (basically where lions are)
  • Killer whales (not catlike at all, pretty wild)
  • Worms (very very not catlike, a little wild)

We keep putting numbers in that chart, and the computer keeps getting a better and better idea of which animals are similar to which other animals. Eventually it builds up a whole worldview of how catlike things are, and how wild they are, and then it can probably analyze something very complicated about zoology!
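
We haven't actually told the computer how to use those numbers yet, but here's one simple trick - made up just for illustration - that it could use: measure the straight-line distance between animals, where closer together means more similar.

import numpy as np

# Euclidean distance in our little two-dimensional animal world:
# smaller numbers mean "more similar"
animals = pd.DataFrame([
    { 'name': 'cat', 'cat_points': 4, 'wildness': 0.5 },
    { 'name': 'dog', 'cat_points': 0, 'wildness': 0 },
    { 'name': 'lion', 'cat_points': 3.5, 'wildness': 4 },
    { 'name': 'wolf', 'cat_points': -0.5, 'wildness': 4 }
]).set_index('name')

def distance(a, b):
    return np.linalg.norm(animals.loc[a] - animals.loc[b])

distance('cat', 'lion'), distance('cat', 'wolf')

Cat and lion land close together (about 3.5 apart), while cat and wolf land far apart (about 5.7) - exactly the kind of judgment we were hoping to teach.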

There's a problem lurking around the corner, though, and it's this: our computer is interested in things that aren't animals.

A third dimension#

We were feeling good for a hot second, but then we realized things other than animals existed. Like shoes, for example.

[Figure: the four animals plus a shoe on the catlike/wildness axes]

Shoes aren't like cats at all, and they aren't very wild either, but those two measurements alone don't do a good job of describing them. It's just like when we added wolves and lions and needed a new axis.

So what are we going to do? The exact same thing: add another piece of data to it! We'll call this axis something like "things you can wear."

df = pd.DataFrame([
    { 'name': 'cat', 'cat_points': 4, 'wildness': 0.5, 'wearability': 0.5 },
    { 'name': 'dog', 'cat_points': 0, 'wildness': 0, 'wearability': 0.25  },
    { 'name': 'lion', 'cat_points': 3.5, 'wildness': 4, 'wearability': -1  },
    { 'name': 'wolf', 'cat_points': -0.5, 'wildness': 4, 'wearability': -1  },
    { 'name': 'shoe', 'cat_points': -3.5, 'wildness': 0, 'wearability': 3  }
])
df
   name  cat_points  wildness  wearability
0   cat         4.0       0.5         0.50
1   dog         0.0       0.0         0.25
2  lion         3.5       4.0        -1.00
3  wolf        -0.5       4.0        -1.00
4  shoe        -3.5       0.0         3.00

And while two dimensions was all right, plotting this in three should really get our blood pumping! We're going to use a library called plotly to take care of this for us.

import plotly.graph_objects as go
import plotly.io as pio

pio.renderers.default = 'notebook'

fig = go.Figure(data=go.Scatter3d(
    x=df.cat_points,
    y=df.wildness,
    z=df.wearability,
    text=df.name,
    mode='markers+text',
    marker=dict(
        color = 'pink',
    )
))

fig.update_layout(scene = dict(
                    xaxis_title='cat points',
                    yaxis_title='wildness',
                    zaxis_title='wearability'
                 ))

pio.show(fig)

Word embeddings#

Now take this idea, and expand it into more and more and more dimensions. Is this word about houses? Is it about fish or space or is it something you can sit on or find in a treasure chest or breathe?

That's what word embeddings are. Things like GloVe or word2vec are many many many dimensions of knowledge, about many many many words. They basically forced a computer to read Wikipedia until it realized how everything was related.

Is that exactly how it works? Not quite, but kind of. Can we use word embeddings without knowing how they work? Definitely!

Showing off#

We're going to use the spaCy library in order to show off how this weird example of cats and dogs and shoes and stuff actually kind of works. First we'll import spacy and load in the database of word embeddings (fair warning, it's a lot of stuff, it might take up to a minute).

import spacy

nlp = spacy.load("en_core_web_md")

For each of the 1.3 million words in the database, it has 300 dimensions of information. The dimensions aren't broken down into "catlike" or "wearable," unfortunately, but 300 dimensions of data is a heck of a lot of nuance to carry around.
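
For example, we can peek at a single word's vector to see those 300 numbers in the flesh (the exact values will depend on which model version you downloaded):

cat = nlp("cat")

# Every word gets the same 300-dimensional treatment
print(cat.vector.shape)   # (300,)
print(cat.vector[:5])     # the first handful of dimensions, just to prove they exist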

To see what it can do, we're going to pick a few words that we're interested in. Because we can't graph all 300 dimensions at once, we're going to use some magic machine learning stuff called PCA to reduce it to 3 dimensions.

How does the "going from 300 dimensions to 3 dimensions" trick work? Roughly speaking, PCA squashes the data down while keeping as much of the variation between points as possible - you can read up on it later if you're curious.

We're going to go with words that follow a few patterns - cats and kittens, dogs and puppies, and then our good friend the shoe.

from sklearn.decomposition import PCA

words = ['cat', 'kitten', 'dog', 'puppy', 'shoe']
X = [nlp(word).vector for word in words]

pca = PCA(n_components=3)
pca.fit(X)

transformed = pca.transform(X)
df = pd.DataFrame(transformed, columns=['x', 'y', 'z'])
df['word'] = words
df
          x         y         z    word
0 -1.388262 -2.104630 -1.753189     cat
1 -1.305184 -2.062685  1.843480  kitten
2 -1.475557  1.811804 -1.735215     dog
3 -2.146706  2.105458  1.533496   puppy
4  6.315708  0.250052  0.111429    shoe

Our x, y and z dimensions don't necessarily mean anything, but once we graph them we'll see some patterns emerge.

import plotly.express as px

fig = px.scatter_3d(df, x='x', y='y', z='z', text='word')
fig.show()

Click and drag to move it around and explore!

Notice how shoe is really far from all the animals, cat/kitten is separate from dog/puppy, and (most excitingly) how the relationship between cat and kitten is the same as dog and puppy! Fun, right?
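
We can back up that eyeballing with actual numbers, too. spaCy's .similarity() method scores how close two pieces of text are (using cosine similarity under the hood), and we can try the classic embedding arithmetic on the raw vectors ourselves. The exact scores below are illustrative - they'll vary a little between model versions.

import numpy as np

# Related words score high, unrelated words score much lower
print(nlp("cat").similarity(nlp("kitten")))   # relatively high
print(nlp("cat").similarity(nlp("shoe")))     # much lower

# The "same relationship" trick: kitten - cat + dog should land near puppy
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

guess = nlp("kitten").vector - nlp("cat").vector + nlp("dog").vector
print(cosine(guess, nlp("puppy").vector))     # fairly high: the guess lands near "puppy"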

While not all examples are as cut-and-dried as this one, this is the general idea behind word embeddings. Instead of having a computer say "a cat is a cat and a lion is a lion and a dog is a dog," every word gets a bunch of ratings - whether it's catlike, how wild it is, whether it's wearable, things like that - which the computer can then use to see what words and concepts are related to each other.

Review#

In this section we looked at word embeddings, which are ways of teaching computers nuance about words and concepts. Words are scored on many, many fields - 300 dimensions, in the case of word2vec - and these scores can be used to compare and contrast the words with one another.

These word embeddings (the official name for the collection of scores) were created automatically by having the computer "read" large texts like Wikipedia. While the embeddings are machine-generated categories and not human-defined, we saw from the final example that they can be reduced into something that makes sense to real people.

Discussion topics#

TODO