@shagunsodhani
Created December 3, 2016 09:36
Notes for Skip-Thought Vectors paper

Skip-Thought Vectors

Introduction

  • The paper describes an unsupervised approach to train a generic, distributed sentence encoder.
  • It also describes a vocabulary expansion method to encode words not seen at training time.
  • Link to the paper

Skip-Thoughts

  • Train an encoder-decoder model where the encoder maps the input sentence to a sentence vector and the decoder generates the sentences surrounding the original sentence.
  • The model is called skip-thoughts and the encoded vectors are called skip-thought vectors.
  • Similar to the skip-gram model in the sense that surrounding sentences are used to learn sentence vectors.

Architecture

  • Training data is in the form of sentence tuples (previous sentence, current sentence, next sentence).
  • Encoder
    • RNN Encoder with GRU.
  • Decoder
    • RNN Decoder with conditional GRU.
    • Conditioned on encoder output.
    • Extra matrices introduced to bias the update gate, reset gate and hidden state, given the encoder output.
    • Vocabulary matrix (V) - Weight matrix having one row (vector) for each word in the vocabulary.
    • Separate decoders for the previous and next sentence which share only V.
    • Given the decoder hidden state h (at any time step), the encoder output, and the words already generated for the output sentence, the probability of choosing w as the next word is proportional to exp(v_w · h), where v_w is the row of V for word w (see the decoder sketch after this list).
  • Objective
    • Sum of the log-probabilities of the previous and next sentences conditioned on the encoder output.
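
A minimal NumPy sketch of one step of the conditional GRU decoder and the next-word distribution described above. The parameter names (Wz, Uz, Cz, ...) and the helper functions are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def conditional_gru_step(x_t, h_prev, h_enc, p):
    """One decoder step; the extra C* matrices bias each gate and the
    candidate state with the encoder output h_enc."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["Cz"] @ h_enc)         # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["Cr"] @ h_enc)         # reset gate
    h_bar = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev) + p["C"] @ h_enc)  # candidate state
    return (1 - z) * h_prev + z * h_bar                                     # new hidden state

def next_word_distribution(h_t, V):
    """P(w) is proportional to exp(v_w · h_t), where v_w is word w's row of the
    vocabulary matrix V shared by both decoders."""
    return softmax(V @ h_t)
```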

Vocabulary Expansion

  • Use a model like word2vec, which can be trained to induce word representations, to obtain embeddings for all the words likely to be seen by the encoder.
  • Learn a matrix W such that encoder_embedding(word) ≈ W · word2vec(word) (a matrix-vector product) for all words common to both the word2vec model and the encoder model (a least-squares sketch follows this list).
  • Use W to generate embeddings for words that are not seen during encoder training.
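
A sketch of the expansion step, assuming two dictionaries of vectors, `w2v` (word2vec space) and `enc` (the encoder's word-embedding space); W is fit by ordinary least squares over the shared vocabulary. The function names are illustrative.

```python
import numpy as np

def fit_expansion_matrix(w2v, enc):
    """Solve enc[word] ~= W @ w2v[word] in the least-squares sense over the
    words shared by both vocabularies."""
    shared = sorted(set(w2v) & set(enc))
    X = np.stack([w2v[w] for w in shared])       # (n_words, d_w2v)
    Y = np.stack([enc[w] for w in shared])       # (n_words, d_enc)
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)    # X @ B ~= Y
    return B.T                                   # W, shape (d_enc, d_w2v)

def expand(word, w2v, W):
    """Embedding for a word unseen during encoder training."""
    return W @ w2v[word]
```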

Dataset

  • Trained on the BookCorpus dataset, a large collection of novels.

Training

  • uni-skip
    • Unidirectional encoder with 2400 dimensions.
  • bi-skip
    • Bidirectional model with forward (sentence given in correct order) and backward (sentence given in reverse order) encoders of 1200 dimensions each.
  • combine-skip
    • concatenation of uni-skip and bi-skip vectors.
  • Initialization
    • Recurrent matrices - orthogonal initialization.
    • Non-recurrent matrices - uniform distribution in [-0.1, 0.1].
  • Mini-batches of size 128.
  • Gradient Clipping at norm = 10.
  • Adam optimizer (training setup sketched below).
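
A sketch of these training choices in PyTorch terms. The encoder-decoder module (`model`) and the data loader are assumed to exist; only the initialization, clipping, and optimizer settings come from the notes above.

```python
import torch
from torch import nn

def init_weights(model):
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue                         # leave biases at their defaults
        if "weight_hh" in name:              # recurrent matrices: orthogonal init
            nn.init.orthogonal_(p)
        else:                                # non-recurrent matrices: U(-0.1, 0.1)
            nn.init.uniform_(p, -0.1, 0.1)

init_weights(model)                          # `model`: assumed GRU encoder-decoder
optimizer = torch.optim.Adam(model.parameters())

for prev_batch, curr_batch, next_batch in loader:      # mini-batches of size 128
    loss = model(prev_batch, curr_batch, next_batch)   # negative sum of log-probs
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # clip at norm 10
    optimizer.step()
```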

Experiments

  • After learning skip-thoughts, freeze the model and use the encoder only as a feature extractor (sketched below).
  • Evaluated the vectors with linear models on the following tasks:
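
Before the task-specific results, a minimal sketch of this frozen-feature protocol, assuming a frozen `encode` function and labelled data; the scikit-learn logistic regression is a stand-in for the paper's task-specific linear models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Extract features with the frozen skip-thought encoder (`encode` is assumed).
X_train = np.stack([encode(s) for s in train_sentences])
X_test = np.stack([encode(s) for s in test_sentences])

# Fit a simple linear model on top of the frozen features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)
print("accuracy:", clf.score(X_test, test_labels))
```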

Semantic Relatedness

  • Given a sentence pair, predict how closely related the two sentences are (pair features sketched after this list).
  • The skip-thoughts method outperforms all systems from the SemEval 2014 competition and is outperformed only by dependency tree-LSTMs.
  • Using features learned from an image-sentence embedding model on COCO boosts performance and brings it on par with dependency tree-LSTMs.
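
A sketch of pairwise features for sentence pairs (componentwise product and absolute difference of the two skip-thought vectors) fed to a linear regressor; ridge regression is used here as a stand-in, and `encode`, the training pairs, and the gold scores are assumed.

```python
import numpy as np
from sklearn.linear_model import Ridge

def pair_features(u, v):
    """Represent a sentence pair by the componentwise product and the
    absolute difference of its two skip-thought vectors."""
    return np.concatenate([u * v, np.abs(u - v)])

# Predict relatedness scores with a linear model on the pair features.
X = np.stack([pair_features(encode(a), encode(b)) for a, b in train_pairs])
reg = Ridge().fit(X, train_scores)
```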

Paraphrase detection

  • Skip-thoughts outperform recursive nets with dynamic pooling when no hand-crafted features are used.
  • Skip-thoughts with basic pairwise statistics produce results comparable to state-of-the-art systems that rely on complicated features and hand engineering.

Image-sentence Ranking

  • MS COCO dataset
  • Task
    • Image annotation
      • Given an image, rank the sentences on basis of how well they describe the image.
    • Image search
      • Given a caption, find the image that is being described.
  • Though the system does not outperform the baseline in all cases, the results do indicate that skip-thought vectors can capture image descriptions without having to learn their representations from scratch.

Classification

  • Skip-thoughts perform about as well as bag-of-words baselines but are outperformed by methods where the sentence representation is learned for the task at hand.
  • Combining skip-thoughts with bi-gram Naive Bayes (NB) features improves the performance.

Future Work

  • Variants to be explored include:
    • Fine-tuning the encoder-decoder model during the downstream task instead of freezing the weights.
    • Deep encoders and decoders.
    • Larger context windows.
    • Encoding and decoding paragraphs.
    • Alternative encoders, such as convnets.