Summary of "Addressing the Rare Word Problem in Neural Machine Translation" Paper

Addressing the Rare Word Problem in Neural Machine Translation

Introduction

  • NMT (Neural Machine Translation) systems perform poorly on OOV (out-of-vocabulary) or rare words.
  • The paper presents a word-alignment-based technique for translating such rare words.
  • Link to the paper

Technique

  • Annotate the training corpus with information about what different OOV words (in the target sentence) correspond to in the source sentence.
  • The NMT system learns to track the alignment of rare words across source and target sentences and emits these alignments for test sentences.
  • As a post-processing step, use a dictionary to map rare words from the source language to the target language (a sketch of this step follows this list).
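
A minimal sketch of the dictionary-based post-processing step, assuming the PosUnk-style annotation described below. The token spelling unkpos<d>, the sign convention d = j - i, and the dictionary contents are illustrative assumptions, not the paper's exact implementation:

```python
# Sketch of the post-processing step. Assumptions (not from the paper):
# tokens are plain strings of the form "unkpos<d>", `dictionary` maps source
# words to target words, and d = j - i is the sign convention.

def postprocess(source, output, dictionary):
    """Replace each positional unk in `output` via its aligned source word."""
    translated = []
    for j, token in enumerate(output):
        if token.startswith("unkpos"):
            suffix = token[len("unkpos"):]
            if suffix.lstrip("-").isdigit():
                i = j - int(suffix)  # aligned source position (assumed convention)
                if 0 <= i < len(source):
                    word = source[i]
                    # Dictionary lookup; fall back to copying the source word.
                    translated.append(dictionary.get(word, word))
                    continue
        translated.append(token)  # null token or unusable alignment: keep as-is
    return translated

# Toy usage with a made-up sentence pair and dictionary.
src = ["the", "ecotax", "in", "Pont-de-Buis"]
out = ["l'", "unkpos0", "à", "unkpos0"]
print(postprocess(src, out, {"ecotax": "ecotaxe"}))
# ["l'", 'ecotaxe', 'à', 'Pont-de-Buis']
```

Note the fallback: when the aligned source word is missing from the dictionary (here, the proper noun "Pont-de-Buis"), it is copied into the output verbatim.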

Annotating the Corpus

Copy Model

  • Annotate the OOV words in the source sentence with tokens unk1, unk2, ..., such that repeated words get the same token.
  • In the target sentence, each OOV word that is aligned to some OOV word in the source sentence is assigned the same token as its source word.
  • A target OOV word that has no alignment, or is aligned to a known word in the source sentence, is assigned the null token (see the sketch after this list).
  • Pros
    • Very straightforward
  • Cons
    • Misses target OOV words whose aligned source words are in the vocabulary, since these all collapse to the null token.
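
A sketch of the Copy-model annotation. Assumed inputs (not specified in this form by the paper): `vocab` is a set of in-vocabulary words, `alignment` maps target positions to source positions (produced by an external word aligner), and "unknull" stands in for the null token:

```python
# Sketch of the Copy-model annotation under the assumptions stated above.

def annotate_copy(source, target, alignment, vocab):
    seen = {}  # OOV source word -> its unk token (repeats share a token)
    src_tokens = []
    for word in source:
        if word in vocab:
            src_tokens.append(word)
        else:
            if word not in seen:
                seen[word] = "unk%d" % (len(seen) + 1)
            src_tokens.append(seen[word])

    tgt_tokens = []
    for j, word in enumerate(target):
        if word in vocab:
            tgt_tokens.append(word)
            continue
        i = alignment.get(j)  # aligned source position, if any
        if i is not None and source[i] not in vocab:
            tgt_tokens.append(seen[source[i]])  # copy the source unk token
        else:
            tgt_tokens.append("unknull")  # unaligned, or aligned to a known word
    return src_tokens, tgt_tokens

# Toy usage: "ecotax" and "Pont-de-Buis" are OOV on both sides.
vocab = {"the", "in", "l'", "à"}
src = ["the", "ecotax", "in", "Pont-de-Buis"]
tgt = ["l'", "ecotaxe", "à", "Pont-de-Buis"]
print(annotate_copy(src, tgt, {1: 1, 3: 3}, vocab))
# (['the', 'unk1', 'in', 'unk2'], ["l'", 'unk1', 'à', 'unk2'])
```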

PosAll - Positional All Model

  • All OOV words in the source language are assigned a single unk token.
  • Every word in the target sentence is followed by a positional token p_d, where d = j − i denotes that the j-th target word is aligned to the i-th source word (see the sketch after this list).
  • Target words that are unaligned, or whose aligned source words are too far away, are assigned a null positional token.
  • Pros
    • Captures complete alignment between source and target sentences.
  • Cons
    • It doubles the length of target sentences.
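
A sketch of the PosAll annotation, with the same assumed inputs as the Copy-model sketch above; the token spellings "p<d>" and "pnull" and the exact distance cutoff are illustrative:

```python
# Sketch of the PosAll annotation under the assumptions stated above.

def annotate_posall(source, target, alignment, vocab, max_dist=7):
    src_tokens = [w if w in vocab else "unk" for w in source]

    tgt_tokens = []
    for j, word in enumerate(target):
        tgt_tokens.append(word if word in vocab else "unk")
        i = alignment.get(j)
        if i is not None and abs(j - i) <= max_dist:
            tgt_tokens.append("p%d" % (j - i))  # relative position d = j - i
        else:
            tgt_tokens.append("pnull")  # unaligned or too far apart
    return src_tokens, tgt_tokens
```

Because one positional token is emitted after every target word, `tgt_tokens` is twice as long as `target`, which is exactly the drawback noted above.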

PosUnk - Positional Unknown Model

  • All OOV words in the source language are assigned a single unk token.
  • Each OOV word in the target sentence is assigned a token unkpos_d, where d gives the relative position of the target word with respect to its aligned source word; unaligned words get a null token (see the sketch after this list).
  • Pros
    • Faster than the PosAll model, since target sentences keep their original length.
  • Cons
    • Does not capture alignment for all words.
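
A sketch of the PosUnk annotation, again under the same assumed inputs; only OOV target words receive tokens, so sentence length is unchanged. The token spelling matches the post-processing sketch in the Technique section:

```python
# Sketch of the PosUnk annotation under the assumptions stated above.

def annotate_posunk(source, target, alignment, vocab, max_dist=7):
    src_tokens = [w if w in vocab else "unk" for w in source]

    tgt_tokens = []
    for j, word in enumerate(target):
        if word in vocab:
            tgt_tokens.append(word)  # known words are left untouched
            continue
        i = alignment.get(j)
        if i is not None and abs(j - i) <= max_dist:
            tgt_tokens.append("unkpos%d" % (j - i))  # d = j - i
        else:
            tgt_tokens.append("unkposnull")  # unaligned or too far apart
    return src_tokens, tgt_tokens
```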

Experiments

Results

  • All three approaches improve the performance of existing NMT systems, in the order PosUnk > PosAll > Copy.
  • Ensemble models benefit more than individual models, as an ensemble of NMT models is better at aligning the OOV words.
  • Performance gains are larger when a smaller vocabulary is used.
  • Rare word analysis shows that performance gains are larger when the proportion of OOV words is higher.