Created March 20, 2016 15:04
Summary of paper titled "Efficient Estimation of Word Representations in Vector Space"

Efficient Estimation of Word Representations in Vector Space


Model Architecture

  • Computational complexity is defined as the number of parameters accessed during model training.
  • Proportional to E*T*Q, where:
  • E - Number of training epochs
  • T - Number of words in the training set
  • Q - Model-dependent term, defined separately for each architecture below
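
The formula above can be illustrated with a quick back-of-the-envelope calculation (the values below are made up for the example, not taken from the paper):

```python
# Hypothetical illustration of the training-cost formula E * T * Q.
E = 3              # training epochs
T = 1_000_000_000  # words in the training set
Q = 500            # model-dependent cost per training example

total_ops = E * T * Q
print(f"~{total_ops:.2e} parameter accesses")
```

The per-architecture sections below are essentially about shrinking Q, since E and T are fixed by the training setup.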

Feedforward Neural Net Language Model (NNLM)

  • Probabilistic model with input, projection, hidden and output layer.
  • Input layer encodes the N previous words using 1-of-V encoding (V is vocabulary size).
  • Input layer is projected to a projection layer P of dimensionality N*D.
  • Hidden layer (of size H) computes the probability distribution over all words.
  • Complexity per training example: Q = N*D + N*D*H + H*V
  • Can reduce Q by using hierarchical softmax and Huffman binary tree (for storing vocabulary).
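
The effect of hierarchical softmax on Q can be sketched numerically. The sizes below are illustrative, chosen to be in the ballpark of the paper's setups rather than exact:

```python
import math

# Per-example complexity of the feedforward NNLM, Q = N*D + N*D*H + H*V,
# and its reduction when the output layer uses hierarchical softmax over a
# Huffman tree (V replaced by roughly log2(V)).
N, D, H, V = 10, 500, 500, 1_000_000  # illustrative sizes, not from the paper

q_full = N * D + N * D * H + H * V
q_hier = N * D + N * D * H + H * math.log2(V)

print(f"full softmax:         Q = {q_full:,.0f}")
print(f"hierarchical softmax: Q = {q_hier:,.0f}")
```

With these numbers the H*V output term dominates, so replacing V with log2(V) cuts Q by roughly two orders of magnitude.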

Recurrent Neural Net Language Model (RNNLM)

  • Similar to NNLM minus the projection layer.
  • Complexity per training example: Q = H*H + H*V
  • Hierarchical softmax and Huffman tree can be used here as well.
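
For the RNNLM the same trick shifts where the cost lives; with illustrative sizes (not from the paper), the recurrent H*H term becomes the dominant one:

```python
import math

# RNNLM per-example complexity Q = H*H + H*V. With hierarchical softmax the
# output cost H*V drops to roughly H*log2(V), leaving H*H as the bottleneck.
H, V = 500, 1_000_000  # illustrative sizes

q_full = H * H + H * V
q_hier = H * H + H * math.log2(V)

print(f"recurrent term H*H:   {H * H:,}")
print(f"full softmax:         Q = {q_full:,}")
print(f"hierarchical softmax: Q = {q_hier:,.0f}")
```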

Log-Linear Models

  • Nonlinear hidden layer causes most of the complexity.
  • NNLMs can be successfully trained in two steps:
    • First, learn continuous word vectors using a simple model.
    • Then, train the N-gram NNLM over those word vectors.

Continuous Bag-of-Words Model

  • Similar to feedforward NNLM.
  • No nonlinear hidden layer.
  • Projection layer shared for all words and order of words does not influence projection.
  • Log-linear classifier uses a window of words to predict the middle word.
  • Q = N*D + D*log2(V)
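
The bullets above can be sketched as a forward pass. This is a minimal illustration with made-up sizes, using a plain softmax for simplicity (the paper uses hierarchical softmax), not the actual implementation:

```python
import numpy as np

# Minimal CBOW forward-pass sketch: context vectors are averaged (so word
# order does not matter), then a log-linear classifier scores every
# vocabulary word to predict the middle word.
rng = np.random.default_rng(0)
V, D, N = 20, 8, 4  # hypothetical vocab size, vector dim, context size

W_in = rng.normal(size=(V, D))   # shared projection embeddings
W_out = rng.normal(size=(D, V))  # output weights of the classifier

context_ids = [3, 7, 11, 15]     # indices of the N surrounding words
h = W_in[context_ids].mean(axis=0)  # shared projection: order-free average
scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()                 # softmax over the vocabulary
predicted = int(probs.argmax())      # model's guess for the middle word
```

Because the projection is an average, shuffling `context_ids` yields the same `h` — this is the "order of words does not influence projection" property.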

Continuous Skip-gram Model

  • Similar to Continuous Bag-of-Words but uses the middle word of the window to predict the remaining words in the window.
  • Distant words are given less weight by sampling fewer distant words.
  • Q = C*(D + D*log2(V)), where C is the maximum distance of a word from the middle word.
  • For each training word, a random R is chosen in the range 1 to C; the R previous words and R next words are marked as target outputs, and the model is trained to predict them.

Results

  • Skip-gram beats all other models on semantic accuracy tasks (e.g., relating Athens to Greece).
  • Continuous Bag-of-Words outperforms the other models on syntactic accuracy tasks (e.g., relating great to greater), with skip-gram just behind in performance.
  • Skip-gram architecture combined with RNNLMs outperforms RNNLMs (and other models) for Microsoft Research Sentence Completion Challenge.
  • The model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations like Vector("King") - Vector("Man") + Vector("Woman"), whose result is closest to Vector("Queen").
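
The analogy arithmetic can be demonstrated on toy vectors. The 2-d vectors below are hand-picked so the analogy works exactly; a real model learns such offsets from data:

```python
import numpy as np

# Toy illustration of word-vector analogy arithmetic via cosine similarity.
vecs = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),  # distractor word
}

# Vector("King") - Vector("Man") + Vector("Woman")
target = vecs["king"] - vecs["man"] + vecs["woman"]

def nearest(v, exclude):
    # Return the most cosine-similar stored word, skipping the query words
    # (the standard convention in analogy evaluation).
    return max(
        (w for w in vecs if w not in exclude),
        key=lambda w: v @ vecs[w] / (np.linalg.norm(v) * np.linalg.norm(vecs[w])),
    )

print(nearest(target, exclude={"king", "man", "woman"}))  # → queen
```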