GloVe is an unsupervised learning algorithm for obtaining vector representations for words. This post will give a brief overview of the theory behind them and how to use them from Python. For more details, the GloVe website is a useful resource.


Theory

The insight that led to the development of GloVe can be seen in the following table.
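
                           k = solid            k = gas              k = water            k = fashion
P(k | ice)                 1.9 \times 10^{-4}   6.6 \times 10^{-5}   3.0 \times 10^{-3}   1.7 \times 10^{-5}
P(k | steam)               2.2 \times 10^{-5}   7.8 \times 10^{-4}   2.2 \times 10^{-3}   1.8 \times 10^{-5}
P(k | ice) / P(k | steam)  8.9                  8.5 \times 10^{-2}   1.36                 0.96

(Co-occurrence probabilities for the target words ‘ice’ and ‘steam’ with selected context words, reproduced from the GloVe paper.)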

We start by looking at conditional probabilities of words and their context P(w_1 | w_2), loosely speaking the probability that we’ll observe word w_1 given we have observed word w_2 nearby.

The first two rows of the table above show some examples of what such a conditional probability distribution might look like. As you can see, it’s pretty hard to interpret. We can see some basic things, such as the fact that the word ‘solid’ is more likely to appear in the context of ‘ice’ than the word ‘gas’, but it’s hard to get a sense of whether a number like 1.9 \times 10^{-4} is significant.

If we instead look at the bottom row, where we consider a ratio of conditional probability distributions, we see that it captures the sort of information we care about in a more interpretable way.

For example, the first column tells us that the word ‘solid’ is significantly more likely to appear in the context of ‘ice’ than of ‘steam’, as the ratio is much larger than 1. Similarly, the second column shows that the word ‘gas’ is significantly more likely to appear in the context of ‘steam’ than of ‘ice’. The final two columns have ratios close to 1 because the words ‘water’ and ‘fashion’ are, respectively, roughly equally likely and equally unlikely to appear in the contexts of ‘ice’ and ‘steam’, so the ratio doesn’t favour either context.

We conclude that the ratio of the conditional probability distributions appears to capture useful information about language, and so GloVe attempts to find low dimensional word vectors that can be used to recover this ratio.

Specifically, we want to find word vectors v_i and v_j corresponding to words w_i and w_j such that v_i \cdot v_j \approx \log P(w_i \mid w_j), up to learned bias terms for each word. Differences of word vectors then capture the log of the ratios we looked at above: (v_i - v_j) \cdot v_k \approx \log \frac{P(w_k \mid w_i)}{P(w_k \mid w_j)}.

The vectors are found by minimising a weighted least squares cost function with stochastic gradient descent (the paper uses AdaGrad). For more details check out the GloVe paper, which is very accessible.
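
For reference, the cost function from the paper has the form

J = \sum_{i, j = 1}^{V} f(X_{ij}) \left( v_i \cdot \tilde{v}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where V is the vocabulary size, X_{ij} counts how often word w_j appears in the context of word w_i, \tilde{v}_j and \tilde{b}_j are a separate set of context vectors and biases, and f is a weighting function that stops very rare and very frequent co-occurrences from dominating the sum.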


Properties

GloVe exhibits many interesting properties that are also features of similar embedding schemes. For example:

Similar words have similar embedding vectors

Embedding vectors that are close in space (e.g. in Euclidean or cosine distance) are typically close in semantic meaning. Consider this list, obtained by the GloVe authors, of the words most similar to ‘frog’:

  1. frog
  2. frogs
  3. toad
  4. litoria
  5. leptodactylidae
  6. rana
  7. lizard
  8. eleutherodactylus

It’s a feature of the vast data sets that these models are trained on that they can learn useful contextual information – even about words that are outside the vocabulary of most humans.
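
As a quick illustration, here is a minimal sketch using the spaCy setup described later in this post (and assuming the en_core_web_md model from that section is installed). The exact scores depend on the model, but ‘frog’ and ‘toad’ should come out far more similar than ‘frog’ and ‘car’:

import spacy

# Assumes the en_core_web_md model introduced later in the post is installed
nlp = spacy.load("en_core_web_md")

frog, toad, car = nlp("frog toad car")

# Token.similarity returns the cosine similarity of the word vectors
print(frog.similarity(toad))  # semantically close, so relatively high
print(frog.similarity(car))   # unrelated, so noticeably lower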

The space has a meaningful linear substructure

The embedding space has a rich linear substructure, with direction vectors often capturing contextual or semantic information. For example, the difference vector between the embeddings of ‘king’ and ‘queen’ turns out to be very close to the difference vector between ‘uncle’ and ‘aunt’, or between ‘man’ and ‘woman’.
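
Here’s a minimal sketch of that idea, again assuming the en_core_web_md model described below; the vec and cosine helpers are just for illustration:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

def vec(word):
    # Look up the GloVe vector for a single word from spaCy's vocabulary
    return nlp.vocab[word].vector

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# The offset king - man + woman should land close to queen
target = vec("king") - vec("man") + vec("woman")
print(cosine(target, vec("queen")))  # relatively high
print(cosine(target, vec("table")))  # an unrelated word, for comparison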


Using GloVe from Python

The easiest way to use GloVe vectors in your work from Python is by installing SpaCy.

pip install spacy

From here you can easily download a number of pre-trained NLP models that come with GloVe vectors included. For example, to download SpaCy’s medium-sized English model, which is trained on web text, you can run the following at the command line:

python -m spacy download en_core_web_md

Now from Python you can run the following to get a list of GloVe vectors for the words in a given sentence:

import spacy

nlp = spacy.load("en_core_web_md")
parsed_text = nlp("This is a sentence")

glove_vectors = [w.vector for w in parsed_text]

Alternatively, to get the vector for a full sentence, simply run:

sentence_vector = nlp("This is a sentence").vector

This will average the word vectors over the sentence.
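
One detail worth knowing: words that are missing from the model’s vector table come back as an all-zero vector. Continuing from the snippet above, you can check for this with the has_vector attribute (and, in recent SpaCy versions, is_oov):

# A made-up word like 'floofle' won't be in the vector table
doc = nlp("I saw a floofle")

for token in doc:
    # has_vector is False (and the vector all zeros) for unknown words
    print(token.text, token.has_vector, token.is_oov)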


Interfacing with scikit-learn

To generate GloVe-based features for machine learning models in scikit-learn, it can be useful to wrap SpaCy in a custom transformer class so that it integrates cleanly with scikit-learn pipelines. Here’s a very simple example of how you could do this:

import numpy as np
import spacy
from sklearn.base import BaseEstimator, TransformerMixin

class GloveVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="en_core_web_md"):
        # Store the parameter under its own name so that scikit-learn's
        # get_params/clone machinery (used by e.g. cross-validation) works
        self.model_name = model_name
        self._nlp = spacy.load(model_name)

    def fit(self, X, y=None):
        # Nothing to learn here: the GloVe vectors are already pre-trained
        return self

    def transform(self, X):
        # One row per document: spaCy's doc.vector is the average of the
        # GloVe vectors of the tokens in the document
        return np.vstack([self._nlp(doc).vector for doc in X])

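Each document passed to transform is mapped to a single averaged GloVe vector, so the output has one row per document. With the 300-dimensional en_core_web_md vectors, a quick check looks like this:

g = GloveVectorizer()

# fit_transform comes for free from TransformerMixin
features = g.fit_transform(["I like cats", "Show me a rabbit"])
print(features.shape)  # (2, 300): one 300-dimensional vector per document
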
This can now be used like any other scikit-learn estimator. Let’s look at a toy example. Here’s a simple data set that contains a number of sentences, some mentioning animals, some not:

X = [
    "I like cats",
    "I like dogs",
    "Show me a rabbit",
    "I like hot chocolate",
    "I want to live on the moon",
    "What time is dinner?",
]

We label each sentence as True or False depending on whether it mentions an animal or not.

y = [True, True, True, False, False, False]

We also invent a test set. Note that in our test set, none of the animals mentioned appear in the training set.

X_test = [
    "Where is my hamster",
    "I own a nice car",
    "I can ride a horse",
    "What are the chances of that?",
]

We can now very easily build a classifier using scikit-learn pipelines to distinguish sentences mentioning an animal from those that do not.

from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

g = GloveVectorizer()
svm = LinearSVC()

pipeline = make_pipeline(g, svm)

pipeline.fit(X, y)

y_pred = pipeline.predict(X_test)
print(y_pred)
[ True False  True False]

Our test set here is pretty small, so we shouldn’t take the perfect accuracy too seriously, but it is nevertheless encouraging. It also demonstrates the extra power brought to our model by a pre-trained word embedding such as GloVe. The information needed to classify the test set correctly is not present in the training set. It’s only with additional linguistic information learned by training on huge web crawls that we are able to get this result.