{{ message }}

Instantly share code, notes, and snippets.

# shagunsodhani/GloVe.md

Created Sep 18, 2016
Notes for GloVe paper

# GloVe: Global Vectors for Word Representation

## Introduction

• Introduces a new global log-bilinear regression model which combines the benefits of both global matrix factorization and local context window methods.

## Global Matrix Factorization Methods

• Decompose large matrices into low-rank approximations.
• eg - Latent Semantic Analysis (LSA)

### Limitations

• Poor performance on word analogy task
• Frequent words contribute disproportionately high to the similarity measure.

## Shallow, Local Context-Based Window Methods

• Learn word representations using adjacent words.
• eg - Continous bag-of-words (CBOW) model and skip-gram model.

### Limitations

• Since they do not operate directly on the global co-occurrence counts, they can not utilise the statistics of the corpus effectively.

## GloVe Model

• To capture the relationship between words i and j, word vector models should use ratios of co-occurene probabilites (with other words k) instead of using raw probabilites themselves.
• In most general form:
• F(wi, wj, wk~ ) = Pik/Pjk
• We want F to encode information in the vector space (which have a linear structure), so we can restrict to the difference of wi and wj
• F(wi - wj, wk~ ) = Pik/Pjk
• Since right hand side is a scalar and left hand side is a vector, we take dot product of the arguments.
• F( (wi - wj)T, wk~ ) = Pik/Pjk
• F should be invariant to order of the word pair i and j.
• F(wiTwk~) = Pik
• Doing further simplifications and optimisations (refer paper), we get cost function,
• J = Sum (over all i, j pairs in the vocabulary)[wiTwk + bi + bk - log(Xik)]2
• f is a weighing function.
• f(x) = min((x/xmax)α, 1)
• Typical values, xmax = 100 and α = 3/4
• b are the bias terms.

### Complexity

• Depends on a number of non-zero elements in the input matrix.
• Upper bound by the square of vocabulary size
• Since for shallow window-based approaches, complexity depends on |C| (size of the corpus), tighter bounds are needed.
• By modelling number of co-occurrences of words as power law function of frequency rank, the complexity can be shown to be proportional to |C|0.8

## Evaluation

• Word Analogies

• a is to b as c is to ___?
• Both semantic and syntactic pairs
• Find closest d to wb - wc + wa (using cosine similarity)
• Word Similarity

• Named Entity Recognition

### Datasets

• Wikipedia Dumps - 2010 and 2014
• Gigaword5
• Combination of Gigaword5 and Wikipedia2014
• CommonCrawl
• 400,000 most frequent words considered from the corpus.

### Hyperparameters

• Size of context window.
• Whether to distinguish left context from right context.
• f - Word pairs that are d words apart contribute 1/d to the total count.
• xmax = 100
• α = 3/4

### Models Compared With

• Singular Value Decomposition
• Continous Bag-Of-Words
• Skip-Gram

### Results

• Glove outperforms all other models significantly.
• Diminishing returns for vectors larger than 200 dimensions.
• Small and asymmetric context windows (context window only to the left) works better for syntactic tasks.
• Long and symmetric context windows (context window to both the sides) works better for semantic tasks.
• Syntactic task benefited from larger corpus though semantic task performed better with Wikipedia instead of Gigaword5 probably due to the comprehensiveness of Wikipedia and slightly outdated nature of Gigaword5.
• Word2vec’s performance decreases if the number of negative samples increases beyond about 10.
• For the same corpus, vocabulary, and window size GloVe consistently achieves better results, faster.

### Liskutin commented Mar 12, 2019

I know its old post, but I have a question. If the model is basically trying to reconstruct the big co-occurrence matrix [well in the paper they said they transform that into probability ratios matrix, i guess], is the model reconstructing the matrix line by line or word by word?

to join this conversation on GitHub. Already have an account? Sign in to comment