Summary of the paper "Improving Word Representations via Global Context and Multiple Word Prototypes"

Improving Word Representations via Global Context and Multiple Word Prototypes

Introduction

  • This paper pre-dated models like GloVe and Word2Vec and proposed an architecture that
    • combines local and global context while learning word embeddings, to better capture word semantics.
    • learns multiple embeddings per word to account for homonymy and polysemy.
  • Link to the paper

Global Context-Aware Neural Language Model

Training Objective

  • Given a word sequence s (local context) and a document d in which the sequence occurs (global context), learn word representations while learning to discriminate the correct last word of s from other words.
  • g(s, d) - scoring function giving the likelihood of the correct sequence.
  • g(s^w, d) - scoring function giving the likelihood of the sequence with the last word of s replaced by a word w.
  • Objective - g(s, d) > g(s^w, d) + 1 for any other word w, i.e. a margin-based ranking loss (a sketch follows this list).
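
A minimal sketch of this margin-based objective as a hinge loss in NumPy (the function and argument names are illustrative assumptions; the paper sums this term over sampled replacement words):

```python
import numpy as np

def ranking_loss(g_correct, g_corrupted):
    """Hinge loss encouraging g(s, d) > g(s^w, d) + 1 for each corrupted word w.

    g_correct:   scalar score g(s, d) of the true sequence in its document.
    g_corrupted: 1-D array of scores g(s^w, d), one per sampled replacement word w.
    (Illustrative sketch, not the paper's reference implementation.)
    """
    return np.sum(np.maximum(0.0, 1.0 - g_correct + g_corrupted))
```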

Architecture

  • Two scoring components (neural networks) to capture:

    • Local Context
      • Map the word sequence s into an ordered list of vectors x = [x_1, ..., x_m].
      • x_i - embedding corresponding to the i-th word in the sequence.
      • Compute the local score score_l using a neural network (with one hidden layer) over x.
      • Preserves word order and syntactic information.
    • Global Context
      • Map document d to an ordered list of word embeddings, d = (d_1, ..., d_k).
      • Compute c, the weighted average of all word vectors in the document.
      • The paper uses idf scores to weight the words in the document.
      • x̄ = concatenation of c and the embedding of the last word in s.
      • Compute the global score score_g using a neural network (with two hidden layers) over x̄.
      • The global component is similar to bag-of-words features.
      • Final score: score = score_l + score_g.
    • Train the weights of the hidden layers and the word embeddings jointly (see the sketch after this list).
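
A minimal sketch of the two scoring components in NumPy, assuming hypothetical parameter names and shapes (the `params` dictionary and its keys are assumptions; only the overall structure, one hidden layer locally and an idf-weighted average plus two hidden layers globally, follows the paper):

```python
import numpy as np

def score(seq_vecs, doc_vecs, doc_idf, params):
    """score = score_l (local, order-aware) + score_g (global, document context).

    seq_vecs: (m, n) embeddings of the m words in the local window s.
    doc_vecs: (k, n) embeddings of the k words in the document d.
    doc_idf:  (k,)   idf weight of each document word.
    params:   dict of weight matrices and biases (illustrative names/shapes).
    """
    # Local score: one hidden layer over the concatenated window embeddings.
    x = seq_vecs.reshape(-1)                          # [x_1; ...; x_m]
    h_local = np.tanh(params["W1"] @ x + params["b1"])
    score_l = params["w2"] @ h_local + params["b2"]

    # Global score: idf-weighted average of the document's word vectors,
    # concatenated with the embedding of the last word in the window.
    c = (doc_idf[:, None] * doc_vecs).sum(axis=0) / doc_idf.sum()
    x_bar = np.concatenate([c, seq_vecs[-1]])
    h1 = np.tanh(params["Wg1"] @ x_bar + params["bg1"])
    h2 = np.tanh(params["Wg2"] @ h1 + params["bg2"])
    score_g = params["wg3"] @ h2 + params["bg3"]

    return score_l + score_g
```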

Multi-Prototype Neural Language Model

  • Words can have different meanings in different contexts; these are difficult to capture when we train only one vector per word.

  • Solution - train multiple vectors per word to capture the different meanings.

  • Approach

    • Gather all fixed-size context windows for all occurrences of a given word.
    • Compute a context vector for each occurrence as the weighted average of the vectors of the words in the window.
    • Cluster the context vectors using spherical k-means.
    • Re-label each word occurrence in the corpus with its associated cluster.
    • To find the similarity between a pair of words (w, w'):
      • For each pair of clusters i and j corresponding to w and w' respectively, compute the similarity between the cluster centers of i and j, weighted by the product of the probabilities of w belonging to i and w' belonging to j given their respective contexts.
      • Average this value over the k² pairs (a sketch follows this list).
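
A minimal sketch of this context-weighted similarity, assuming the sense cluster centers and the per-sense probabilities given each word's context are already available (all names are illustrative, and cosine similarity is used here as the measure between cluster centers):

```python
import numpy as np

def multi_prototype_similarity(centers_w, centers_v, p_w, p_v):
    """Similarity between words w and w', each represented by k sense prototypes.

    centers_w, centers_v: (k, n) cluster centers (sense vectors) for w and w'.
    p_w, p_v:             (k,)   probability of each sense given the word's context.
    (Illustrative sketch, not the paper's exact formulation.)
    """
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    k = len(centers_w)
    total = 0.0
    for i in range(k):
        for j in range(k):
            # Weight each cluster-pair similarity by the product of sense probabilities.
            total += p_w[i] * p_v[j] * cosine(centers_w[i], centers_v[j])
    return total / (k * k)  # average over the k^2 cluster pairs
```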

Training

  • Dataset

    • Wikipedia corpus
  • Parameters

    • 10-word windows
    • 100 hidden units
    • No weight regularization
    • 10 different word embeddings learned for words with multiple meanings.

Evaluation

  • Dataset

    • WordSim-353
      • 353 pairs of nouns
      • words are presented without context
      • contains human similarity judgements on pairs of words
    • The paper contributed a new dataset that
      • captures human similarity judgements on pairs of words in the context of a sentence
      • consists of verbs and adjectives along with nouns
      • for details on how the dataset was constructed, refer to the paper
  • Performance

    • The proposed model achieves higher correlation with human scores than models using only the local or the global context.
    • Performance can be improved further by removing stop words.
    • Using the multi-prototype approach (multiple vectors for the same word) benefits the model on tasks where the context is also given.

Comments

  • This work predated the more general word embedding models like Word2Vec and GloVe. While this model performs well on intrinsic evaluation tasks like word similarity, it is outperformed by the more general and recent models on downstream tasks like NER.