Efficient Estimation of Word Representations in Vector Space
- Introduces techniques to learn word vectors from large text datasets.
- The learned vectors can be used to find words that are semantically or syntactically similar.
- Link to the paper
- Link to open source implementation
- Computational complexity is defined as the number of parameters that need to be accessed to fully train the model.
- It is proportional to E*T*Q, where:
- E: number of training epochs
- T: number of words in the training set
- Q: per-example cost, which depends on the model architecture
Feedforward Neural Net Language Model (NNLM)
- Probabilistic model with input, projection, hidden and output layers.
- Input layer encodes the N previous words using 1-of-V encoding (V is the vocabulary size).
- Input layer is projected to a projection layer P with dimensionality N*D, using a shared projection matrix.
- Hidden layer (of size H) feeds an output layer of size V, which gives the probability distribution over all words in the vocabulary.
- Complexity per training example: Q = N*D + N*D*H + H*V
- Q can be reduced by using hierarchical softmax with the vocabulary stored as a Huffman binary tree, so that only about log2(V) output units are evaluated instead of V.
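The effect of hierarchical softmax on Q can be checked with a little arithmetic. The sketch below is only an illustration: the function name `nnlm_q` and the sizes N, D, H, V are assumptions chosen for the example, and the tree is approximated as balanced, so the output term H*V becomes roughly H*log2(V).

```python
import math

def nnlm_q(n, d, h, v, hierarchical=False):
    """Per-example cost Q = N*D + N*D*H + (output term).

    The output term is H*V for a full softmax. With hierarchical softmax it
    is approximated here as H*log2(V) (balanced binary tree assumption).
    """
    output_term = h * math.log2(v) if hierarchical else h * v
    return n * d + n * d * h + output_term

# Illustrative sizes only (assumptions, not values taken from these notes).
N, D, H, V = 10, 500, 500, 1_000_000

full = nnlm_q(N, D, H, V)
hier = nnlm_q(N, D, H, V, hierarchical=True)

print(f"full softmax:         Q = {full:,.0f}")
print(f"hierarchical softmax: Q = {hier:,.0f}")
print(f"share of N*D*H in hierarchical Q: {N * D * H / hier:.1%}")
```

With these sizes the H*V term dominates the full-softmax cost, while after the hierarchical approximation the N*D*H term accounts for nearly all of Q.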
Recurrent Neural Net Language Model (RNNLM)
- Similar to NNLM minus the projection layer.
- Complexity per training example: Q = H*H + H*V
- Hierarchical softmax and Huffman tree can be used here as well.
- In both NNLM and RNNLM, the nonlinear hidden layer causes most of the complexity.
- NNLMs can be successfully trained in two steps:
- Step 1: learn continuous word vectors using a simple model.
- Step 2: train an N-gram NNLM on top of these word vectors.
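To see why the hidden layer dominates the RNNLM cost once hierarchical softmax is used, the two terms of Q = H*H + H*V can be compared directly. This is a rough sketch: `rnnlm_q_terms` is a hypothetical helper, H and V are illustrative values, and the hierarchical output term is approximated as H*log2(V).

```python
import math

def rnnlm_q_terms(h, v, hierarchical=True):
    """Return (hidden_term, output_term) of the RNNLM per-example cost."""
    hidden_term = h * h  # recurrent hidden-to-hidden connections
    # Output term: H*V for full softmax, roughly H*log2(V) with a binary tree.
    output_term = h * math.log2(v) if hierarchical else h * v
    return hidden_term, output_term

# Illustrative sizes only (assumptions).
H, V = 500, 1_000_000

hidden, output = rnnlm_q_terms(H, V)
share = hidden / (hidden + output)
print(f"hidden term: {hidden:,}  output term: {output:,.0f}")
print(f"hidden-layer share of Q: {share:.1%}")
```

This is the motivation for the simpler models that follow: removing the nonlinear hidden layer removes the largest remaining term.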
Continuous Bag-of-Words Model
- Similar to feedforward NNLM.
- No nonlinear hidden layer.
- Projection layer is shared for all words (their vectors are averaged), so the order of words in the window does not influence the projection.
- Log-linear classifier uses a window of surrounding words (both history and future) to predict the middle word.
- Q = N*D + D*log2(V)
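The order-independence of the shared projection can be illustrated with a small sketch. It assumes the projection is a plain average of the context word vectors; the name `cbow_projection`, the tiny vocabulary size and the random embeddings are all made up for the example.

```python
import random

def cbow_projection(context_ids, embeddings):
    """Average the context word vectors into a single D-dimensional vector."""
    dim = len(embeddings[0])
    total = [0.0] * dim
    for word_id in context_ids:
        for k in range(dim):
            total[k] += embeddings[word_id][k]
    return [x / len(context_ids) for x in total]

# Toy setup (assumption): 6 words, 4-dimensional random vectors.
rng = random.Random(0)
V, D = 6, 4
embeddings = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(V)]

context = [1, 2, 4, 5]    # word ids around the middle word
shuffled = [5, 1, 4, 2]   # same words, different order

a = cbow_projection(context, embeddings)
b = cbow_projection(shuffled, embeddings)

# The two projections agree up to floating-point rounding.
same = all(abs(x - y) < 1e-12 for x, y in zip(a, b))
print("order-independent:", same)
```

Because the words are summed before any classifier sees them, the model is a "bag of words": only which words appear in the window matters, not where.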
Continuous Skip-gram Model
- Similar to Continuous Bag-of-Words, but uses the middle word of the window to predict the remaining words in the window.
- Distant words are given less weight by sampling fewer distant words.
- Q = C*(D + D*log2(V)), where C is the maximum distance of a context word from the middle word.
- For a chosen C, a random number R in the range 1 to C is drawn for each training word.
- R words from the history (previous words) and R words from the future (next words) are then used as target outputs, which gives 2R word classifications for that training word.
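The window sampling described above can be sketched as follows. This is a minimal illustration, not the reference implementation: `skipgram_pairs` is a hypothetical helper, the sentence is a toy example, and windows are simply truncated at the sentence boundaries.

```python
import random

def skipgram_pairs(tokens, max_distance, rng):
    """Return (center, context) training pairs using a random window per word."""
    pairs = []
    for i, center in enumerate(tokens):
        r = rng.randint(1, max_distance)  # R drawn uniformly from 1..C inclusive
        lo = max(0, i - r)                # up to R words of history
        hi = min(len(tokens), i + r + 1)  # up to R words of future
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, max_distance=3, rng=random.Random(0))

for center, context in pairs[:6]:
    print(center, "->", context)
print("total pairs:", len(pairs))
```

A word at distance d from the center is only included when R >= d, which happens with probability (C - d + 1) / C. That is how more distant words end up sampled less often.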
Results
- Skip-gram beats all other models on semantic accuracy tasks (e.g., relating Athens with Greece).
- Continuous Bag-of-Words Model outperforms other models on syntactic accuracy tasks (e.g., relating great with greater), with skip-gram just behind in performance.
- Skip-gram architecture combined with RNNLMs outperforms RNNLMs alone (and other models) on the Microsoft Research Sentence Completion Challenge.
- Model can learn relationships like "Queen is to King as Woman is to Man". This allows algebraic operations on the vectors: Vector("King") - Vector("Man") + Vector("Woman") yields a vector that is closest to Vector("Queen").
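The vector arithmetic behind such analogies can be sketched with hand-made toy vectors. Everything here is an assumption for illustration: the three-dimensional vectors are invented (real models learn hundreds of dimensions from data), cosine similarity is used as the closeness measure, and the three query words are excluded from the search.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made toy vectors (assumption); axes are roughly [royalty, male, female].
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", vectors))  # prints "queen"
```

With trained embeddings the same search runs over the whole vocabulary, and an analogy question counts as answered correctly only if the closest word is exactly the expected one.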