@shagunsodhani
Created December 3, 2016 09:36
Notes for Skip-Thought Vectors paper

Skip-Thought Vectors

Introduction

  • The paper describes an unsupervised approach to train a generic, distributed sentence encoder.
  • It also describes a vocabulary expansion method to encode words not seen at training time.
  • Link to the paper

Skip-Thoughts

  • Train an encoder-decoder model where the encoder maps the input sentence to a sentence vector and the decoder generates the sentences surrounding the original sentence.
  • The model is called skip-thoughts and the encoded vectors are called skip-thought vectors.
  • Similar to the skip-gram model in the sense that surrounding sentences are used to learn sentence vectors.

Architecture

  • Training data is in the form of sentence tuples (previous sentence, current sentence, next sentence).
  • Encoder
    • RNN Encoder with GRU.
  • Decoder
    • RNN Decoder with conditional GRU.
    • Conditioned on encoder output.
    • Extra matrices introduced to bias the update gate, reset gate and hidden state, given the encoder output.
    • Vocabulary matrix (V) - Weight matrix having one row (vector) for each word in the vocabulary.
    • Separate decoders for the previous and next sentence which share only V.
    • Given the decoder hidden state h (at any time step), the encoder output, and the words already generated for the output sentence, the probability of choosing w as the next word is proportional to exp(v_w · h), where v_w is the row of V for word w (see the decoder sketch after this list).
  • Objective
    • Sum of the log-probabilities of the previous and next sentences conditioned on the encoder output.
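
A minimal NumPy sketch of one step of the conditional GRU decoder and the next-word distribution described above. The parameter names (Wz, Uz, Cz, ...) and the helper functions are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def conditional_gru_step(x_t, h_prev, h_enc, p):
    """One decoder step; the extra C* matrices bias each gate and the
    candidate state with the encoder output h_enc."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["Cz"] @ h_enc)         # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["Cr"] @ h_enc)         # reset gate
    h_bar = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev) + p["C"] @ h_enc)  # candidate state
    return (1 - z) * h_prev + z * h_bar                                     # new hidden state

def next_word_distribution(h_t, V):
    """P(w) is proportional to exp(v_w · h_t), where v_w is word w's row of the
    vocabulary matrix V shared by both decoders."""
    return softmax(V @ h_t)
```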

Vocabulary Expansion

  • Use a model like word2vec, which can be trained to induce word representations, to obtain embeddings for all the words likely to be seen by the encoder.
  • Learn a matrix W such that encoder_embedding(word) ≈ W · word2vec(word) (a matrix-vector product) for all words common to both the word2vec model and the encoder model (a least-squares sketch follows this list).
  • Use W to generate embeddings for words that are not seen during encoder training.
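
A sketch of the expansion step, assuming two dictionaries of vectors, `w2v` (word2vec space) and `enc` (the encoder's word-embedding space); W is fit by ordinary least squares over the shared vocabulary. The function names are illustrative.

```python
import numpy as np

def fit_expansion_matrix(w2v, enc):
    """Solve enc[word] ~= W @ w2v[word] in the least-squares sense over the
    words shared by both vocabularies."""
    shared = sorted(set(w2v) & set(enc))
    X = np.stack([w2v[w] for w in shared])       # (n_words, d_w2v)
    Y = np.stack([enc[w] for w in shared])       # (n_words, d_enc)
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)    # X @ B ~= Y
    return B.T                                   # W, shape (d_enc, d_w2v)

def expand(word, w2v, W):
    """Embedding for a word unseen during encoder training."""
    return W @ w2v[word]
```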

Dataset

  • Trained on the BookCorpus dataset, a large collection of novels.

Training

  • uni-skip
    • Unidirectional encoder with 2400 dimensions.
  • bi-skip
    • Bidirectional model with forward (sentence given in correct order) and backward (sentence given in reverse order) encoders of 1200 dimensions each.
  • combine-skip
    • concatenation of uni-skip and bi-skip vectors.
  • Initialization
    • Recurrent matrices - orthogonal initialization.
    • Non-recurrent matrices - uniform distribution in [-0.1, 0.1].
  • Mini-batches of size 128.
  • Gradient Clipping at norm = 10.
  • Adam optimizer (training setup sketched below).
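
A sketch of these training choices in PyTorch terms. The encoder-decoder module (`model`) and the data loader are assumed to exist; only the initialization, clipping, and optimizer settings come from the notes above.

```python
import torch
from torch import nn

def init_weights(model):
    for name, p in model.named_parameters():
        if p.dim() < 2:
            continue                         # leave biases at their defaults
        if "weight_hh" in name:              # recurrent matrices: orthogonal init
            nn.init.orthogonal_(p)
        else:                                # non-recurrent matrices: U(-0.1, 0.1)
            nn.init.uniform_(p, -0.1, 0.1)

init_weights(model)                          # `model`: assumed GRU encoder-decoder
optimizer = torch.optim.Adam(model.parameters())

for prev_batch, curr_batch, next_batch in loader:      # mini-batches of size 128
    loss = model(prev_batch, curr_batch, next_batch)   # negative sum of log-probs
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)  # clip at norm 10
    optimizer.step()
```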

Experiments

  • After learning skip-thoughts, freeze the model and use the encoder only as a feature extractor (sketched below).
  • Evaluated the vectors with linear models on the following tasks:
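
Before the task-specific results, a minimal sketch of this frozen-feature protocol, assuming a frozen `encode` function and labelled data; the scikit-learn logistic regression is a stand-in for the paper's task-specific linear models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Extract features with the frozen skip-thought encoder (`encode` is assumed).
X_train = np.stack([encode(s) for s in train_sentences])
X_test = np.stack([encode(s) for s in test_sentences])

# Fit a simple linear model on top of the frozen features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)
print("accuracy:", clf.score(X_test, test_labels))
```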

Semantic Relatedness

  • Given a sentence pair, predict how closely related the two sentences are (pair features sketched after this list).
  • The skip-thoughts method outperforms all systems from the SemEval 2014 competition and is outperformed only by dependency tree-LSTMs.
  • Using features learned from an image-sentence embedding model on COCO boosts performance and brings it on par with dependency tree-LSTMs.
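
A sketch of pairwise features for sentence pairs (componentwise product and absolute difference of the two skip-thought vectors) fed to a linear regressor; ridge regression is used here as a stand-in, and `encode`, the training pairs, and the gold scores are assumed.

```python
import numpy as np
from sklearn.linear_model import Ridge

def pair_features(u, v):
    """Represent a sentence pair by the componentwise product and the
    absolute difference of its two skip-thought vectors."""
    return np.concatenate([u * v, np.abs(u - v)])

# Predict relatedness scores with a linear model on the pair features.
X = np.stack([pair_features(encode(a), encode(b)) for a, b in train_pairs])
reg = Ridge().fit(X, train_scores)
```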

Paraphrase detection

  • Skip-thoughts outperform recursive nets with dynamic pooling when no hand-crafted features are used.
  • Skip-thoughts with basic pairwise statistics produce results comparable to state-of-the-art systems that rely on complicated features and hand engineering.

Image-sentence Ranking

  • MS COCO dataset
  • Task
    • Image annotation
      • Given an image, rank the sentences on basis of how well they describe the image.
    • Image search
      • Given a caption, find the image that is being described.
  • Though the system does not outperform the baseline in all cases, the results do indicate that skip-thought vectors can capture image descriptions without having to learn their representations from scratch.

Classification

  • Skip-thoughts perform about as well as bag-of-words baselines but are outperformed by methods where the sentence representation is learned for the task at hand.
  • Combining skip-thoughts with bi-gram Naive Bayes (NB) features improves the performance.

Future Work

  • Variants to be explored include:
    • Fine-tuning the encoder-decoder model during the downstream task instead of freezing the weights.
    • Deep encoders and decoders.
    • Larger context windows.
    • Encoding and decoding paragraphs.
    • Alternative encoders, such as convnets.