- The paper presents a database of ranked English and Spanish paraphrases derived by:
- Extracting lexical, phrasal, and syntactic paraphrases from large bilingual parallel corpora.
- Computing similarity scores for each pair of paraphrases using Google n-grams and the Annotated Gigaword corpus.
- Link to the paper
- The basic idea is that if two English strings e1 and e2 translate to the same foreign string f (also called the pivot), they can be assumed to have the same meaning.
- Informally speaking, the input to the system is translation triplets of the form < f, e, φ >, where
- f is a foreign string
- e is an English string
- φ is a vector of feature functions
- The system can pivot over f to create paraphrase triplets < e1, e2, φp > where φp is computed using translation feature vectors φ1 and φ2
- For example, conditional paraphrase probability p(e2|e1) can be computed by marginalizing over all shared foreign language translations f:
- p(e2|e1) ≈ Sum over all f of p(e2|f) p(f|e1)
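
The pivoting step can be illustrated with a minimal sketch, assuming the marginalization takes the form p(e2|e1) ≈ Σ_f p(e2|f) p(f|e1). The function name `pivot_paraphrase_probs`, the dictionary layout, and the toy probabilities are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import defaultdict


def pivot_paraphrase_probs(p_f_given_e, p_e_given_f):
    """Approximate p(e2 | e1) by summing p(e2 | f) * p(f | e1) over shared pivots f.

    p_f_given_e: {english phrase e1: {foreign phrase f: p(f | e1)}}
    p_e_given_f: {foreign phrase f: {english phrase e2: p(e2 | f)}}
    """
    scores = defaultdict(lambda: defaultdict(float))
    for e1, pivots in p_f_given_e.items():
        for f, p_f_e1 in pivots.items():
            for e2, p_e2_f in p_e_given_f.get(f, {}).items():
                if e2 != e1:  # skip the trivial identity paraphrase
                    scores[e1][e2] += p_e2_f * p_f_e1
    return {e1: dict(candidates) for e1, candidates in scores.items()}


# Toy, made-up translation probabilities (illustrative only).
p_f_given_e = {
    "thrown into jail": {"festgenommen": 0.5, "inhaftiert": 0.5},
}
p_e_given_f = {
    "festgenommen": {"thrown into jail": 0.25, "arrested": 0.75},
    "inhaftiert": {"thrown into jail": 0.5, "imprisoned": 0.5},
}

paraphrases = pivot_paraphrase_probs(p_f_given_e, p_e_given_f)
for e2, prob in sorted(paraphrases["thrown into jail"].items(), key=lambda kv: -kv[1]):
    print(f"p({e2!r} | 'thrown into jail') = {prob:.3f}")
```

Each candidate e2 accumulates probability mass from every pivot it shares with e1, so a paraphrase supported by several foreign translations scores higher than one supported by a single pivot.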
- Measure the similarity of phrases using distributional similarity.
- Can be used to rerank the paraphrases obtained from bilingual text or to obtain the paraphrases which could not be obtained from bilingual text alone.
- To describe a given phrase e1, collect contextual features like:
- n-gram based features for words (to the left and right of the given phrase)
- Lexical, lemma-based, POS and named entity unigrams and bigrams
- Dependency link features
- Syntactic features
- Aggregate all the features, over all the occurrences of a phrase e, to obtain its distributional signature se.
- Define the similarity between two phrases e1 and e2 as:
- sim(e1, e2) = dot(se1, se2)/(|se1||se2|)
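
As a rough sketch of the scoring step, the snippet below builds a signature from left and right context words only and compares two phrases by cosine similarity. The helper names `signature` and `cosine`, the whitespace tokenization, the one-token window, and the toy corpus are assumptions made for illustration; the paper's signatures use a much richer feature set (n-gram, lexical, POS, dependency, and syntactic features).

```python
import math
from collections import Counter


def signature(phrase, sentences, window=1):
    """Count context words up to `window` tokens left and right of each occurrence."""
    target = phrase.split()
    n = len(target)
    sig = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != target:
                continue
            for k in range(1, window + 1):
                if i - k >= 0:
                    sig[f"L{k}={tokens[i - k]}"] += 1
                if i + n - 1 + k < len(tokens):
                    sig[f"R{k}={tokens[i + n - 1 + k]}"] += 1
    return sig


def cosine(s1, s2):
    """sim(e1, e2) = dot(s1, s2) / (|s1| * |s2|), or 0.0 if either signature is empty."""
    dot = sum(value * s2.get(key, 0) for key, value in s1.items())
    norm1 = math.sqrt(sum(v * v for v in s1.values()))
    norm2 = math.sqrt(sum(v * v for v in s2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Toy corpus (illustrative only).
corpus = [
    "the suspect was arrested yesterday",
    "the suspect was detained yesterday",
    "the thief was arrested today",
]

s_arrested = signature("arrested", corpus)
s_detained = signature("detained", corpus)
similarity = cosine(s_arrested, s_detained)
print(f"sim(arrested, detained) = {similarity:.3f}")
```

Because the signatures are aggregated over a monolingual corpus, the same score can be computed for any candidate pair, which is what allows it to rerank paraphrases extracted from bilingual text.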
- Paper mentions two instances:
- English paraphrases - 169.6 Million paraphrases
- Spanish paraphrases - 161.6 Million paraphrases
- The paper analyses the precision-recall tradeoff for coverage of PropBank predicates and predicate-argument tuples.
- Human evaluation was performed over a sample of 1900 paraphrases to establish the correlation of PPDB scores with human judgement.
- Segregation of data by domain or topic
- Support for more languages
- Improving paraphrase scores by using additional sources of information and better handling of paraphrase ambiguity.
Could you point me to the paper where it is mentioned? I can't find it on the website.