Summary of paper titled "Efficient Estimation of Word Representations in Vector Space"

Efficient Estimation of Word Representations in Vector Space


Model Architecture

  • Computational complexity is defined as the number of parameters that must be accessed to fully train the model.
  • It is proportional to E*T*Q, where:
  • E - Number of training epochs
  • T - Number of words in the training set
  • Q - Per-example cost, defined separately for each model architecture

Feedforward Neural Net Language Model (NNLM)

  • Probabilistic model with input, projection, hidden and output layer.
  • Input layer encodes the N previous words using 1-of-V encoding (V is the vocabulary size).
  • Input layer is projected to a projection layer P with dimensionality N*D, using a shared projection matrix.
  • Hidden layer (of size H) feeds an output layer of size V that gives the probability distribution over all words.
  • Complexity per training example: Q = N*D + N*D*H + H*V
  • The dominant H*V term can be reduced to about H*log2(V) by using hierarchical softmax, with the vocabulary stored as a Huffman binary tree.
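
To make the effect of the output layer concrete, the per-example cost can be evaluated numerically. This is a minimal sketch: the function name and the layer sizes are illustrative assumptions, not values from the summary, and the hierarchical variant simply replaces H*V with H*log2(V).

```python
import math

def nnlm_complexity(n, d, h, v, hierarchical_softmax=False):
    """Per-example complexity Q of the feedforward NNLM.

    n: context words, d: word-vector size, h: hidden size, v: vocabulary size.
    """
    # Output layer: H*V with a full softmax, about H*log2(V) with a binary tree.
    output_cost = h * math.log2(v) if hierarchical_softmax else h * v
    return n * d + n * d * h + output_cost

# Illustrative sizes only (assumed for this sketch).
full = nnlm_complexity(n=10, d=50, h=500, v=1_000_000)
hier = nnlm_complexity(n=10, d=50, h=500, v=1_000_000, hierarchical_softmax=True)
print(f"full softmax:         Q = {full:,.0f}")
print(f"hierarchical softmax: Q = {hier:,.0f}")
```

With these sizes the H*V term dominates under a full softmax, while under hierarchical softmax the N*D*H term becomes the largest one.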

Recurrent Neural Net Language Model (RNNLM)

  • Similar to NNLM but without the projection layer; it has only input, hidden and output layers.
  • Complexity per training example: Q = H*H + H*V
  • Hierarchical softmax and a Huffman tree can be used here as well to reduce the H*V term.
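
The Huffman tree mentioned above gives frequent words short binary codes, so fewer output units are evaluated for them. The following is a rough sketch of that idea using only code lengths; the function name and the word counts are made up for illustration.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {word: code length in bits} for a Huffman tree built from word counts."""
    tiebreak = count()  # unique counter so the heap never compares the word lists
    heap = [(f, next(tiebreak), [w]) for w, f in freqs.items()]
    heapq.heapify(heap)
    lengths = {w: 0 for w in freqs}
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every word below them
        # moves one level deeper, so its code grows by one bit.
        f1, _, words1 = heapq.heappop(heap)
        f2, _, words2 = heapq.heappop(heap)
        for w in words1 + words2:
            lengths[w] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), words1 + words2))
    return lengths

# Hypothetical word counts, chosen only for the example.
toy_counts = {"the": 50, "of": 30, "king": 10, "queen": 6, "athens": 4}
code_lengths = huffman_code_lengths(toy_counts)
print(code_lengths)
```

In this toy vocabulary the most frequent word gets a 1-bit code and the rarest words get 4-bit codes, which is the property hierarchical softmax relies on.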

Log-Linear Models

  • The nonlinear hidden layer causes most of the complexity in the models above, which motivates simpler log-linear models.
  • NNLMs can be successfully trained in two steps:
    • Learn continuous word vectors using a simple model.
    • Train an N-gram NNLM on top of these word vectors.

Continuous Bag-of-Words Model

  • Similar to the feedforward NNLM, but with no nonlinear hidden layer.
  • Projection layer is shared for all words (their vectors are averaged), so the order of words does not influence the projection.
  • A log-linear classifier uses a window of past and future words to predict the middle word.
  • Complexity per training example: Q = N*D + D*log2(V)
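
The order-independence of the projection can be seen in a small sketch. It assumes the projection is the plain average of the context word vectors, as in the paper; the embedding values and function name are hypothetical.

```python
def cbow_projection(context_ids, embeddings):
    """Average the embedding rows of the context words into one vector."""
    dim = len(embeddings[0])
    summed = [0.0] * dim
    for word_id in context_ids:
        for k in range(dim):
            summed[k] += embeddings[word_id][k]
    return [s / len(context_ids) for s in summed]

# Hypothetical 2-dimensional embeddings for a 4-word vocabulary.
toy_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

forward = cbow_projection([0, 1, 3], toy_embeddings)
shuffled = cbow_projection([3, 0, 1], toy_embeddings)
print(forward == shuffled)  # the same bag of words gives the same projection
```

Because the context is reduced to an average, the classifier sees a "bag" of words, which is where the model gets its name.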

Continuous Skip-gram Model

  • Similar to Continuous Bag-of-Words, but uses the middle word of the window to predict the remaining words in the window.
  • Distant words are given less weight by sampling them less often.
  • Complexity per training example: Q = C*(D + D*log2(V)), where C is the maximum distance of a word from the middle word.
  • Given C and the training data, a random R in the range 1 to C is chosen for each training word.
  • The R words from the history (previous words) and the R words from the future (next words) are then used as target outputs for that training word.
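
The window sampling described above can be sketched as follows. This is an illustrative sketch, not the reference implementation: the function name is hypothetical and R is assumed to be drawn uniformly from 1 to C.

```python
import random

def skipgram_pairs(tokens, max_window, rng):
    """Generate (center, target) training pairs with a randomly shrunk window."""
    pairs = []
    for i, center in enumerate(tokens):
        r = rng.randint(1, max_window)  # R drawn from 1..C, both ends inclusive
        start = max(0, i - r)
        stop = min(len(tokens), i + r + 1)
        for j in range(start, stop):
            if j != i:  # the center word is never its own target
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for center, target in skipgram_pairs(sentence, max_window=3, rng=random.Random(0))[:5]:
    print(center, "->", target)
```

Under the uniform assumption, a word at distance d from the center is used only when R >= d, which happens with probability (C - d + 1)/C, so nearer words are sampled more often than distant ones.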


Results

  • Skip-gram beats all other models on semantic accuracy tasks (e.g. relating Athens with Greece).
  • Continuous Bag-of-Words Model outperforms other models on syntactic accuracy tasks (e.g. relating great with greater), with skip-gram just behind in performance.
  • Skip-gram architecture combined with RNNLMs outperforms RNNLMs (and other models) on the Microsoft Research Sentence Completion Challenge.
  • The model learns relationships like "Queen is to King as Woman is to Man". This allows algebraic operations: Vector("King") - Vector("Man") + Vector("Woman") yields a vector that is closest to Vector("Queen").
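
The vector arithmetic can be illustrated with a toy example. The 2-dimensional vectors below are hand-made for the sketch (roughly "royalty" and "gender" axes), not learned embeddings, and the helper names are hypothetical; the nearest word is found by cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c, vectors):
    """Word closest (by cosine) to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors, assumed purely for illustration.
toy_vectors = {
    "king": [1.0, 1.0], "queen": [1.0, -1.0],
    "man": [0.0, 1.0], "woman": [0.0, -1.0],
    "athens": [-1.0, 0.2],
}
answer = analogy("man", "king", "woman", toy_vectors)
print(answer)  # prints "queen"
```

The input words are excluded from the search, since the query vector often stays close to one of them.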