shagunsodhani/Word2Vec.md

Efficient Estimation of Word Representations in Vector Space

Introduction


Introduces techniques to learn word vectors from large text datasets.
The learned vectors can be used to find similar words (semantically, syntactically, etc.).
Link to the paper
Link to open source implementation

Model Architecture


Computational complexity is defined as the number of parameters that need to be accessed to fully train the model.
Training complexity is proportional to E*T*Q, where:
E - Number of training epochs
T - Number of words in the training set
Q - Per-example cost, defined separately for each model architecture

Feedforward Neural Net Language Model (NNLM)


Probabilistic model with input, projection, hidden and output layers.
Input layer encodes the N previous words using 1-of-V encoding (V is the vocabulary size).
Input layer is projected onto a projection layer P with dimensionality N*D.
Hidden layer (of size H) is used to compute the probability distribution over all V words in the output layer.
Complexity per training example: Q = N*D + N*D*H + H*V
Q can be reduced by using hierarchical softmax with the vocabulary stored as a Huffman binary tree, which cuts the H*V output term to about H*log₂V.
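
To make the saving concrete, here is a minimal sketch that evaluates the NNLM cost with and without a tree-structured output layer. The function name and the layer sizes are assumptions chosen for illustration (V is a power of two so that log₂V is exact); they are not taken from the paper.

```python
import math

def nnlm_complexity(n, d, h, v, hierarchical=False):
    """Per-example cost Q of the feedforward NNLM described in the notes."""
    projection = n * d   # N*D: look up the N context words
    hidden = n * d * h   # N*D*H: projection layer -> hidden layer
    # Output term: H*V with a full softmax, about H*log2(V) when the
    # vocabulary is stored as a balanced binary tree.
    output = h * math.log2(v) if hierarchical else h * v
    return projection + hidden + output

# Illustrative sizes only (assumed, not from the notes).
N, D, H, V = 8, 64, 512, 2 ** 20
q_full = nnlm_complexity(N, D, H, V)
q_tree = nnlm_complexity(N, D, H, V, hierarchical=True)
print(q_full, q_tree)
```

With these sizes the output term stops being the bottleneck once the tree is used, and the N*D*H hidden-layer term dominates what is left.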

Recurrent Neural Net Language Model (RNNLM)


Similar to NNLM but without the projection layer: only input, hidden and output layers.
Complexity per training example: Q = H*H + H*V
Hierarchical softmax and a Huffman tree can be used here as well.
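
A Huffman tree helps because frequent words end up close to the root, so fewer output units are evaluated for them. The sketch below builds such a tree from word counts and reports each word's code length. The helper name and the counts are hypothetical, and this is a generic Huffman construction rather than the code used in the open source implementation.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {word: code length in bits} for a Huffman tree built from counts."""
    tiebreak = count()  # keeps heap entries comparable when counts tie
    # Each heap entry: (total count, tiebreak, {word: depth so far})
    heap = [(f, next(tiebreak), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every word in them one level deeper.
        merged = {w: depth + 1 for w, depth in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical word counts, for illustration only.
counts = {"the": 50, "of": 30, "cat": 12, "sat": 5, "zygote": 3}
lengths = huffman_code_lengths(counts)
print(lengths)
```

In this toy vocabulary the most frequent word gets a 1-bit code and the rarest words get 4-bit codes, so the average path length per training word is well below that of the rarest word.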

Log-Linear Models


The nonlinear hidden layer causes most of the complexity.
NNLMs can be successfully trained in two steps:

Learn continuous word vectors using a simple model.
Train an N-gram NNLM on top of these word vectors.


Continuous Bag-of-Words Model


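Similar to the feedforward NNLM, but with the nonlinear hidden layer removed.
Projection layer is shared for all words, so the order of words in the window does not influence the projection.
Log-linear classifier uses a window of surrounding words (both history and future) to predict the middle word.
Complexity per training example: Q = N*D + D*log₂V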
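
The order-independence of the projection can be seen in a small sketch, assuming the context vectors are averaged into a single D-dimensional vector. The function name and the toy embedding table are made up for illustration.

```python
def cbow_projection(context_ids, embeddings):
    """Average the embedding rows of the context words (the shared projection)."""
    dim = len(embeddings[0])
    total = [0.0] * dim
    for idx in context_ids:
        for j in range(dim):
            total[j] += embeddings[idx][j]
    return [x / len(context_ids) for x in total]

# Toy embedding table: 5 words, D = 2 (made-up values).
table = [[0.0, 1.0], [2.0, 0.0], [4.0, 2.0], [1.0, 1.0], [3.0, 4.0]]
ordered = cbow_projection([0, 1, 3, 4], table)   # context around word 2
shuffled = cbow_projection([4, 3, 1, 0], table)  # same words, different order
print(ordered, shuffled)
```

Both calls give the same projection, which is why the model is described as a bag of words: only which words are in the window matters, not where they sit.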

Continuous Skip-gram Model


Similar to Continuous Bag-of-Words, but uses the middle word of the window to predict the remaining words in the window.
Distant words are given less weight by sampling them less often as training targets.
Q = C*(D + D*log₂V), where C is the maximum distance of a context word from the middle word.
For each training word, a random R is chosen in the range 1 to C.
R words from the history (previous words) and R words from the future (next words) are then used as target outputs, requiring R*2 word classifications with the training word as input.
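
The window sampling described above can be sketched as follows. The function name, the example sentence and the fixed seed are assumptions for illustration, and truncating the window at sentence boundaries is a choice made here rather than something the notes specify.

```python
import random

def skipgram_pairs(tokens, max_window, rng):
    """Return (center, context) pairs using a random window R <= C per position."""
    pairs = []
    for i, center in enumerate(tokens):
        r = rng.randint(1, max_window)  # R drawn uniformly from 1..C, inclusive
        lo = max(0, i - r)                 # clip the window at the sentence start
        hi = min(len(tokens), i + r + 1)   # and at the sentence end
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, max_window=3, rng=random.Random(0))
print(len(pairs), pairs[:4])
```

A word at distance d from the center is only included when R ≥ d, which happens with probability (C - d + 1)/C. Nearby words are therefore always used, while the farthest ones are used in only 1/C of the draws.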

Results


Skip-gram beats all other models on semantic accuracy tasks (e.g. relating Athens to Greece).
Continuous Bag-of-Words outperforms other models on syntactic accuracy tasks (e.g. relating great to greater), with Skip-gram just behind in performance.
Skip-gram combined with RNNLMs outperforms RNNLMs alone (and other models) on the Microsoft Research Sentence Completion Challenge.
Model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman"), which yields a vector closest to Vector("Queen").
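
The vector arithmetic can be tried on a toy example. The 2-D vectors below are hand-made (axes loosely meaning "royalty" and "maleness"), not learned embeddings, and the function names are hypothetical. The nearest word is picked by cosine similarity, and the three query words are excluded from the candidates, which is a convention assumed here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Hand-made toy vectors, for illustration only.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [-1.0, 1.0],
    "woman": [-1.0, -1.0],
    "apple": [0.2, 0.9],
}
answer = analogy("king", "man", "woman", vectors)
print(answer)
```

With real embeddings the result of the arithmetic rarely matches a word vector exactly, so the nearest-neighbour search does the final step.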