@shagunsodhani
Created March 5, 2017 17:27
Summary of "PPDB: The Paraphrase Database" paper

PPDB: The Paraphrase Database

Introduction

  • The paper presents a database of ranked English and Spanish paraphrases derived by:
    • Extracting lexical, phrasal, and syntactic paraphrases from large bilingual parallel corpora.
    • Computing similarity scores for each paraphrase pair using Google n-grams and the Annotated Gigaword corpus.
  • Link to the paper

Extracting Paraphrases from Bilingual Text

  • The basic idea is that if two English strings e1 and e2 translate to the same foreign string f (called the pivot), they can be assumed to have the same meaning.
  • Informally speaking, the input to the system is translation triplets of the form < f, e, φ >, where
    • f is a foreign string
    • e is an English string
    • φ is a vector of feature functions
  • The system can pivot over f to create paraphrase triplets < e1, e2, φp > where φp is computed using translation feature vectors φ1 and φ2
  • For example, the conditional paraphrase probability p(e2|e1) can be approximated by marginalizing over all shared foreign-language translations f:
    • p(e2|e1) ≈ sum over all f of p(e2|f) p(f|e1)
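
The pivoting step can be illustrated with a minimal sketch. It assumes the translation model is available as two nested dictionaries, `p_f_given_e` holding p(f|e) and `p_e_given_f` holding p(e|f), and it sums p(e2|f) p(f|e1) over every pivot f that e1 translates to. The function name, the dictionary layout, and all probabilities below are made up for illustration; they are not taken from the paper or from the PPDB release.

```python
from collections import defaultdict


def pivot_paraphrase_probs(p_f_given_e, p_e_given_f):
    """Return para[e1][e2] = sum over f of p(e2|f) * p(f|e1).

    p_f_given_e[e][f] holds p(f|e); p_e_given_f[f][e] holds p(e|f).
    Both layouts are assumptions made for this sketch.
    """
    para = defaultdict(lambda: defaultdict(float))
    for e1, pivots in p_f_given_e.items():
        for f, p_f_e1 in pivots.items():
            for e2, p_e2_f in p_e_given_f.get(f, {}).items():
                if e2 != e1:  # a phrase is not listed as its own paraphrase
                    para[e1][e2] += p_e2_f * p_f_e1
    return {e1: dict(cands) for e1, cands in para.items()}


# Toy translation tables with invented probabilities (German pivots).
p_f_given_e = {
    "thrown into jail": {"festgenommen": 0.6, "inhaftiert": 0.4},
}
p_e_given_f = {
    "festgenommen": {"thrown into jail": 0.2, "arrested": 0.5, "imprisoned": 0.3},
    "inhaftiert": {"imprisoned": 0.7, "thrown into jail": 0.3},
}

paraphrases = pivot_paraphrase_probs(p_f_given_e, p_e_given_f)
# Candidates for "thrown into jail", best first.
ranked = sorted(paraphrases["thrown into jail"].items(), key=lambda kv: -kv[1])
```

In this toy table "imprisoned" is reachable through both pivots, so its score accumulates two terms, while "arrested" is reachable through only one.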

Scoring Paraphrases Using Monolingual Distributional Similarity

  • Measure the similarity of phrases using distributional similarity.
  • Can be used to rerank the paraphrases obtained from bilingual text or to obtain the paraphrases which could not be obtained from bilingual text alone.
  • To describe a given phrase e1, collect contextual features like:
    • n-gram based features for words (to the left and right of the given phrase)
    • Lexical, lemma-based, POS and named entity unigrams and bigrams
    • Dependency link features
    • Syntactic features
  • Aggregate all the features over all occurrences of e to obtain its distributional signature se.
  • Define the similarity between two phrases e1 and e2 as the cosine of their signatures:
    • sim(e1, e2) = dot(se1, se2) / (|se1| |se2|)
  • The paper mentions two instances of the database:
    • English paraphrases - 169.6 Million paraphrases
    • Spanish paraphrases - 161.6 Million paraphrases
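
The signature aggregation and cosine similarity described in this section can be sketched as follows. The sketch assumes each occurrence of a phrase has already been reduced to a list of string-valued context features; the feature names such as `L:the` and `R:of` are hypothetical stand-ins for the n-gram, lexical, and syntactic features listed above, and the counts are invented.

```python
import math
from collections import Counter


def signature(occurrences):
    """Sum the context features over all occurrences of a phrase."""
    sig = Counter()
    for features in occurrences:
        sig.update(features)
    return sig


def cosine(s1, s2):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(count * s2.get(feat, 0) for feat, count in s1.items())
    norm1 = math.sqrt(sum(c * c for c in s1.values()))
    norm2 = math.sqrt(sum(c * c for c in s2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty signature is treated as similar to nothing
    return dot / (norm1 * norm2)


# Two phrases, each observed twice, with hypothetical left/right context features.
sig_a = signature([["L:the", "R:of"], ["L:the", "R:in"]])
sig_b = signature([["L:the", "R:of"], ["L:a", "R:of"]])
similarity = cosine(sig_a, sig_b)
```

Sparse counters keep the sketch short; a real system over corpus-scale feature spaces would need a more compact representation, which this example does not attempt to show.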

Analysis

  • The paper analyses the precision-recall tradeoff for coverage of PropBank predicates and predicate-argument tuples.
  • Human evaluation was performed over a sample of 1900 paraphrases to establish the correlation of PPDB scores with human judgement.

Areas of Improvement

  • Segregation of data by domain or topic
  • Support for more languages
  • Improving paraphrase scores by using additional sources of information and better handling of paraphrase ambiguity.
@shagunsodhani
Author

I would recommend contacting the authors.