# GloVe

by Dr Tom Begley

January 30, 2019

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. This post gives a brief overview of the theory behind GloVe vectors and shows how to use them from Python. For more details, the GloVe website is a useful resource.

## Theory

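The insight that led to the development of GloVe can be seen in the following table (values from Table 1 of the GloVe paper).

| Probability and ratio | $k$ = solid | $k$ = gas | $k$ = water | $k$ = fashion |
|---|---|---|---|---|
| $P(k \mid \text{ice})$ | $1.9 \times 10^{-4}$ | $6.6 \times 10^{-5}$ | $3.0 \times 10^{-3}$ | $1.7 \times 10^{-5}$ |
| $P(k \mid \text{steam})$ | $2.2 \times 10^{-5}$ | $7.8 \times 10^{-4}$ | $2.2 \times 10^{-3}$ | $1.8 \times 10^{-5}$ |
| $P(k \mid \text{ice}) / P(k \mid \text{steam})$ | $8.9$ | $8.5 \times 10^{-2}$ | $1.36$ | $0.96$ |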

We start by looking at conditional probabilities of words and their context $P(w_1 | w_2)$, loosely speaking the probability that we’ll observe word $w_1$ given we have observed word $w_2$ nearby.

The first two rows of the table above show some examples of what such a conditional probability distribution might look like. As you can see, it’s pretty hard to interpret. We can see some basic things, such as the fact that the word ‘solid’ is more likely to appear in the context of ‘ice’ than the word ‘gas’, but it’s hard to get a sense of whether a number like $1.9 \times 10^{-4}$ is significant.

If we instead look at the bottom row, where we consider a ratio of conditional probability distributions, we see that it captures the sort of information we care about in a more interpretable way.

For example, the first column tells us that the word ‘solid’ is significantly more likely to appear in the context of ‘ice’ than ‘steam’, as the ratio is much larger than $1$. Similarly, in the second column we see that the word ‘gas’ is significantly more likely to appear in the context of ‘steam’ than ‘ice’, as the ratio is much smaller than $1$. The final two columns have ratios close to $1$: ‘water’ is about equally likely to appear in the context of either word, while ‘fashion’ is about equally unlikely to appear in the context of either.

We conclude that the ratio of the conditional probability distributions appears to capture useful information about language, and so GloVe attempts to find low dimensional word vectors that can be used to recover this ratio.
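To make the idea concrete, here is a minimal sketch that estimates conditional probabilities and their ratios from co-occurrence counts. The corpus, the window size and the function names are all invented for illustration; real GloVe statistics come from corpora with billions of tokens.

```python
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = [
    "ice is a cold solid",
    "solid ice is frozen water",
    "steam is a hot gas",
    "steam is water as gas",
    "ice melts to water",
    "steam condenses to water",
    "ice can become gas",
    "steam can become solid",
]


def cooccurrence_counts(sentences, window=4):
    """Count how often each (word, neighbour) pair occurs within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def conditional_probability(counts, k, context):
    """Estimate P(k | context): the share of `context`'s co-occurrences that involve k."""
    total = sum(c for (word, _), c in counts.items() if word == context)
    return counts[(context, k)] / total if total else 0.0


def probability_ratio(counts, k, a, b):
    """Ratio P(k | a) / P(k | b); infinite if k never appears near b."""
    p_a = conditional_probability(counts, k, a)
    p_b = conditional_probability(counts, k, b)
    return p_a / p_b if p_b else float("inf")


counts = cooccurrence_counts(corpus)
for k in ["solid", "gas", "water"]:
    print(k, probability_ratio(counts, k, "ice", "steam"))
```

Even on this tiny corpus the ratio is above $1$ for ‘solid’, below $1$ for ‘gas’ and equal to $1$ for ‘water’, mirroring the pattern in the table.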

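Specifically, we want to find word vectors $v_i$ and $v_j$ corresponding to words $w_i$ and $w_j$ such that $$v_i \cdot v_j \approx \log P(w_i | w_j)$$

Because the logarithm of a ratio is a difference of logarithms, the ratio for two context words $w_j$ and $w_k$ can then be recovered from a difference of word vectors: $v_i \cdot (v_j - v_k) \approx \log \frac{P(w_i | w_j)}{P(w_i | w_k)}$.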

This is done via gradient descent to optimise a simple weighted least squares cost function. For more details check out the GloVe paper, which is very accessible.
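For reference, the cost function in the GloVe paper is $$J = \sum_{i,j} f(X_{ij}) \left( v_i \cdot \tilde{v}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$ where $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $\tilde{v}_j$ is a separate context vector, $b_i$ and $\tilde{b}_j$ are bias terms, and $f(x) = (x / x_{\max})^{\alpha}$ for $x < x_{\max}$ and $1$ otherwise (the paper uses $x_{\max} = 100$ and $\alpha = 3/4$). The sketch below evaluates this cost with NumPy. The function names and the tiny count matrix are assumptions made for illustration, and the sketch only computes the cost; it does not train anything.

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): damps rare pairs and caps frequent pairs at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted least squares cost over the non-zero entries of the count matrix X."""
    i, j = np.nonzero(X)  # only observed pairs contribute to the sum
    x = X[i, j]
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return float(np.sum(glove_weight(x) * (pred - np.log(x)) ** 2))


# Tiny example with 3 words and 2 dimensions; the counts are made up.
rng = np.random.default_rng(0)
X = np.array([[0.0, 4.0, 1.0],
              [4.0, 0.0, 2.0],
              [1.0, 2.0, 0.0]])
W, W_ctx = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
b, b_ctx = np.zeros(3), np.zeros(3)
print(glove_loss(W, W_ctx, b, b_ctx, X))
```

Training would adjust `W`, `W_ctx`, `b` and `b_ctx` to reduce this value.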

## Properties

GloVe exhibits many interesting properties that are also features of similar embedding schemes. For example:

### Similar words have similar embedding vectors

Embedding vectors that are close in space (e.g. in Euclidean or cosine distance) are typically close in semantic meaning. Consider this list obtained by the GloVe authors of the words most similar to ‘frog’:

1. frog
2. frogs
3. toad
4. litoria
5. leptodactylidae
6. rana
7. lizard
8. eleutherodactylus

Because these models are trained on vast data sets, they can learn useful contextual information, even about words that are outside the vocabulary of most humans.
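Lists like the one above are typically produced by ranking every other word by its cosine similarity to the query word. Here is a minimal sketch of that ranking. The three-dimensional vectors are made up for illustration (pre-trained GloVe vectors typically have between 50 and 300 dimensions), and the helper names are hypothetical.

```python
import numpy as np

# Made-up 3-dimensional vectors, for illustration only.
embeddings = {
    "frog": np.array([0.9, 0.8, 0.1]),
    "toad": np.array([0.8, 0.9, 0.2]),
    "lizard": np.array([0.6, 0.7, 0.4]),
    "car": np.array([0.1, 0.2, 0.9]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def most_similar(word, embeddings, top_n=3):
    """Rank all other words by cosine similarity to `word`, most similar first."""
    query = embeddings[word]
    scores = {
        other: cosine_similarity(query, vec)
        for other, vec in embeddings.items()
        if other != word
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(most_similar("frog", embeddings))
```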

### The space has a meaningful linear substructure

The embedding space has a rich linear substructure, with direction vectors often capturing contextual or semantic information. For example, if we take the direction vector between the embeddings of ‘king’ and ‘queen’, we find that it’s very close to the direction vector between ‘uncle’ and ‘aunt’ or ‘man’ and ‘woman’.
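One way to see what this linear substructure means is to compare direction vectors directly and to use them for simple analogies. The sketch below uses hand-made two-dimensional vectors, which are an assumption chosen so that the second axis roughly encodes gender; real embeddings are learned and much noisier. The `analogy` helper is hypothetical.

```python
import numpy as np

# Hand-made 2-D vectors for illustration; axis 1 loosely encodes gender.
vectors = {
    "king": np.array([0.95, 0.90]),
    "queen": np.array([1.00, -0.95]),
    "man": np.array([0.05, 1.00]),
    "woman": np.array([0.00, -0.90]),
    "uncle": np.array([0.50, 0.95]),
    "aunt": np.array([0.55, -1.00]),
}


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using the offset vectors[b] - vectors[a]."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# The king -> queen direction is almost parallel to uncle -> aunt.
print(cosine(vectors["queen"] - vectors["king"], vectors["aunt"] - vectors["uncle"]))
print(analogy("man", "king", "woman", vectors))
```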

## Using GloVe from Python

The easiest way to use GloVe vectors in your work from Python is by installing spaCy.

```shell
pip install spacy
```

From here you can easily download a number of pre-trained NLP models that include pre-trained GloVe vectors. For example, to download spaCy’s model trained on a medium web crawl of English language data, you can run the following at the command line:

```shell
python -m spacy download en_core_web_md
```

Now from Python you can run the following to get a list of GloVe vectors for the words in a given sentence:

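```python
import spacy

# Load the model downloaded above
nlp = spacy.load("en_core_web_md")

parsed_text = nlp("This is a sentence")

# One vector per token in the sentence
glove_vectors = [w.vector for w in parsed_text]
```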

Alternatively, to get the vector for a full sentence, simply run:

```python
sentence_vector = nlp("This is a sentence").vector
```

This will average the word vectors over the sentence.
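To illustrate what averaging means here, the sketch below computes a sentence vector by hand with NumPy. The lookup table and its three-dimensional vectors are made up, and treating unknown words as zero vectors is an assumption of this example rather than a statement about how any particular library behaves.

```python
import numpy as np

# Hypothetical lookup table with made-up 3-dimensional vectors.
lookup = {
    "this": np.array([0.2, 0.0, 0.4]),
    "is": np.array([0.0, 0.6, 0.2]),
    "a": np.array([0.4, 0.0, 0.0]),
}


def average_vector(tokens, lookup, dim=3):
    """Average the vectors of `tokens`; tokens missing from `lookup` count as zeros."""
    if not tokens:
        return np.zeros(dim)
    return np.mean([lookup.get(t, np.zeros(dim)) for t in tokens], axis=0)


print(average_vector(["this", "is", "a", "sentence"], lookup))
```

Note that averaging discards word order, so two sentences containing the same words in a different order get the same vector.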

### Interfacing with scikit-learn

To use spaCy to generate GloVe-based features for machine learning models in scikit-learn, it can be useful to create a custom transformer class built on top of spaCy for easy integration with scikit-learn pipelines. Here’s a very simple example of how you could do this:

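```python
import numpy as np
import spacy
from sklearn.base import BaseEstimator, TransformerMixin


class GloveVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, model_name="en_core_web_md"):
        # Keep the parameter under its own name so scikit-learn can inspect it
        self.model_name = model_name
        self._nlp = spacy.load(model_name)

    def fit(self, X, y=None):
        # Nothing to learn: the word vectors are pre-trained
        return self

    def transform(self, X):
        # One row per document: the average of its word vectors
        return np.concatenate(
            [self._nlp(doc).vector.reshape(1, -1) for doc in X]
        )
```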

This can now be used like any other scikit-learn estimator. Let’s look at a toy example. Here’s a simple data set that contains a number of sentences, some mentioning animals, some not:

```python
X = [
    "I like cats",
    "I like dogs",
    "Show me a rabbit",
    "I like hot chocolate",
    "I want to live on the moon",
    "What time is dinner?",
]
```

We label each sentence as True or False depending on whether it mentions an animal or not.

```python
y = [True, True, True, False, False, False]
```

We also invent a test set. Note that in our test set, none of the animals mentioned appear in the training set.

```python
X_test = [
    "Where is my hamster",
    "I own a nice car",
    "I can ride a horse",
    "What are the chances of that?",
]
```

We can now very easily build a classifier using scikit-learn pipelines to distinguish sentences mentioning an animal from those that do not.

```python
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

g = GloveVectorizer()
svm = LinearSVC()

pipeline = make_pipeline(g, svm)

pipeline.fit(X, y)

y_pred = pipeline.predict(X_test)
print(y_pred)
```

```
[ True False  True False]
```

Our test set here is pretty small, so we shouldn’t take the perfect accuracy too seriously, but it is nevertheless encouraging. It also demonstrates the extra power brought to our model by a pre-trained word embedding such as GloVe. The information needed to classify the test set correctly is not present in the training set. It’s only with additional linguistic information learned by training on huge web crawls that we are able to get this result.
