shagunsodhani/PPDB.md

## PPDB.md

      
PPDB: The Paraphrase Database

Introduction


The paper presents a database of ranked English and Spanish paraphrases derived by:

Extracting lexical, phrasal, and syntactic paraphrases from large bilingual parallel corpora.
Computing similarity scores for each pair of paraphrases using Google n-grams and the Annotated Gigaword corpus.


Link to the paper

Extracting Paraphrase from Bilingual Text


The basic idea is that if two English strings e₁ and e₂ translate to the same foreign string f (called the pivot), they can be assumed to have the same meaning.
Informally speaking, the input to the system is translation triplets of the form < f, e, φ >, where

f is a foreign string
e is an English string
φ is a vector of feature functions


The system can pivot over f to create paraphrase triplets < e₁, e₂, φ_p >, where φ_p is computed from the translation feature vectors φ₁ and φ₂ of e₁ and e₂.
For example, the conditional paraphrase probability p(e₂|e₁) can be computed by marginalizing over all shared foreign-language translations f:

p(e₂|e₁) ≈ Sum over all f, p(e₂|f) p(f|e₁)
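
The pivoting step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes the translation model is available as two nested dictionaries, `p_f_given_e` (holding p(f|e)) and `p_e_given_f` (holding p(e|f)), and it sums p(e₂|f) · p(f|e₁) over every pivot f shared by e₁ and e₂. The function name, the table layout, and all probabilities below are invented for illustration.

```python
from collections import defaultdict


def pivot_paraphrases(p_f_given_e, p_e_given_f):
    """Return scores[e1][e2] = sum over f of p(e2 | f) * p(f | e1)."""
    scores = defaultdict(lambda: defaultdict(float))
    for e1, pivots in p_f_given_e.items():
        for f, p_f_e1 in pivots.items():
            # Every English string that f translates to is a candidate paraphrase.
            for e2, p_e2_f in p_e_given_f.get(f, {}).items():
                if e2 != e1:  # skip the trivial identity paraphrase
                    scores[e1][e2] += p_e2_f * p_f_e1
    return {e1: dict(candidates) for e1, candidates in scores.items()}


# Toy translation tables (hypothetical numbers, not taken from the paper).
p_f_given_e = {
    "thrown into jail": {"festgenommen": 0.5, "inhaftiert": 0.5},
    "imprisoned": {"inhaftiert": 1.0},
}
p_e_given_f = {
    "festgenommen": {"thrown into jail": 0.25, "arrested": 0.75},
    "inhaftiert": {"thrown into jail": 0.5, "imprisoned": 0.5},
}

paraphrases = pivot_paraphrases(p_f_given_e, p_e_given_f)
# Rank the candidates for one phrase by their pivot probability.
ranked = sorted(paraphrases["thrown into jail"].items(), key=lambda kv: -kv[1])
```

Here `ranked` lists "arrested" before "imprisoned" for the phrase "thrown into jail", since more of the toy probability mass flows through the pivot "festgenommen".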


Scoring Paraphrases Using Monolingual Distributional Similarity


Phrase similarity is measured using distributional similarity: phrases that occur in similar contexts are assumed to have similar meanings.
This can be used to rerank the paraphrases obtained from bilingual text, or to obtain paraphrases that could not be obtained from bilingual text alone.
To describe a given phrase e₁, collect contextual features like:

n-gram based features for words (to the left and right of the given phrase)
Lexical, lemma-based, POS and named entity unigrams and bigrams
Dependency link features
Syntactic features


Aggregate all the features over all occurrences of e₁ to obtain its distributional signature s_e1.
Define the similarity between two phrases e₁ and e₂ as the cosine of their signatures:

sim(e₁, e₂) = dot(s_e1, s_e2) / (|s_e1| |s_e2|)
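
The signature-and-cosine computation can be sketched as follows. This is a simplified illustration under stated assumptions: it uses only left and right context words as features (the paper also uses lemma, POS, named-entity, dependency and syntactic features), and the helper names `signature` and `cosine`, the `window` parameter, and the tiny corpus are all hypothetical.

```python
import math
from collections import Counter


def signature(phrase, sentences, window=1):
    """Count left/right context words over all occurrences of `phrase`."""
    target = phrase.split()
    n = len(target)
    sig = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != target:
                continue
            for k in range(1, window + 1):
                if i - k >= 0:  # k-th word to the left of the phrase
                    sig[f"L{k}={tokens[i - k]}"] += 1
                if i + n - 1 + k < len(tokens):  # k-th word to the right
                    sig[f"R{k}={tokens[i + n - 1 + k]}"] += 1
    return sig


def cosine(s1, s2):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(v * s2[k] for k, v in s1.items() if k in s2)
    norm1 = math.sqrt(sum(v * v for v in s1.values()))
    norm2 = math.sqrt(sum(v * v for v in s2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a phrase that never occurs has no signature
    return dot / (norm1 * norm2)


# Toy monolingual corpus (invented for illustration).
corpus = [
    "the very big dog barked",
    "the very large dog barked",
    "a very big house stood",
    "a very large cat slept",
]

s_big = signature("big", corpus)
s_large = signature("large", corpus)
similarity = cosine(s_big, s_large)
```

In this toy corpus "big" and "large" share the left context "very" and the right context "dog", so their similarity is high (5/6) without reaching 1.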


The paper describes two instances of the database:

English paraphrases - 169.6 Million paraphrases
Spanish paraphrases - 161.6 Million paraphrases


Analysis


The paper analyses the precision-recall tradeoff for coverage of PropBank predicates and predicate-argument tuples.
Human evaluation was performed over a sample of 1900 paraphrases to establish how well PPDB scores correlate with human judgement.

Areas of Improvement


Segregation of data by domain or topic
Support for more languages
Improving paraphrase scores by using additional sources of information and better handling of paraphrase ambiguity