- Large-scale natural language understanding task: predict textual values of a knowledge base by reading the corresponding document.
- Accompanied by a large dataset generated using Wikipedia
- [Link to the paper](https://arxiv.org/abs/1608.03542)
- WikiReading dataset built using Wikidata and Wikipedia.
- Wikidata consists of statements of the form (property, value) about different items
- 80M statements, 16M items and 884 properties.
- These statements are grouped by items to get (item, property, answer) tuples where the answer is a set of values.
- Items are further replaced by their Wikipedia documents to generate 18.58M statements of the form (document, property, answer).
- Task is to predict answer given document and property.
- Properties are divided into 2 classes:
- Categorical properties - properties with a small number of possible answers, e.g. gender.
- Relational properties - properties whose answers are essentially unique per item, e.g. date of birth.
- This classification is based on the entropy of each property's answer distribution: properties with entropy below 0.7 are classified as categorical (a minimal sketch of this check is given below).
- Answer distribution has a small number of very high-frequency answers (head) and a large number of answers with very small frequency (tail).
- 30% of the answers do not appear in the training set and must be inferred from the document.
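A minimal sketch of the entropy check mentioned above. The function names, the toy data, and the use of natural-log entropy are assumptions; the paper only states the 0.7 threshold.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy of the empirical answer distribution (natural log assumed)."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def is_categorical(answers, threshold=0.7):
    """Properties whose answer distribution has low entropy are treated as categorical."""
    return answer_entropy(answers) < threshold

# Hypothetical usage: few distinct answers (gender-like) vs. mostly unique answers.
print(is_categorical(["male"] * 900 + ["female"] * 100))          # True
print(is_categorical([f"19{i:02d}-01-01" for i in range(100)]))    # False
```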
- Consider WikiReading as a classification task and treat each answer as a class label.
- Linear model over Bag of Words (BoW) features.
- Two BoW vectors are computed - one for the document and the other for the property - and concatenated into a single feature vector.
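A rough sketch of this baseline with scikit-learn. The toy data, the vectorizer choices, and the use of logistic regression as the linear classifier are assumptions, not the paper's exact setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from scipy.sparse import hstack

# Toy training data: (document, property) -> answer, with answers as class labels.
docs = ["barack obama was born in hawaii and later moved",
        "paris is the capital and largest city of france"]
props = ["place of birth", "country"]
answers = ["hawaii", "france"]

doc_vec = CountVectorizer().fit(docs)
prop_vec = CountVectorizer().fit(props)

# Concatenate the document BoW vector and the property BoW vector.
X = hstack([doc_vec.transform(docs), prop_vec.transform(props)])
clf = LogisticRegression(max_iter=1000).fit(X, answers)

X_test = hstack([doc_vec.transform(["emmanuel macron was born in amiens"]),
                 prop_vec.transform(["place of birth"])])
print(clf.predict(X_test))
```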
- The following models encode the property and document into a joint representation which is fed into a softmax layer.
- Average Embeddings BoW
- Average the word embeddings of the document and of the property, then concatenate the two averages to get the joint representation.
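A minimal PyTorch sketch of this encoder. The vocabulary size, embedding dimension, number of answer classes, and the module name are placeholders.

```python
import torch
import torch.nn as nn

class AverageEmbeddingReader(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=128, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Joint representation = [mean(doc embeddings); mean(property embeddings)]
        self.out = nn.Linear(2 * embed_dim, num_answers)

    def forward(self, doc_ids, prop_ids):
        doc_repr = self.embed(doc_ids).mean(dim=1)    # (batch, embed_dim)
        prop_repr = self.embed(prop_ids).mean(dim=1)  # (batch, embed_dim)
        joint = torch.cat([doc_repr, prop_repr], dim=-1)
        return self.out(joint)                        # logits over answer classes

model = AverageEmbeddingReader()
logits = model(torch.randint(0, 50000, (2, 300)), torch.randint(0, 50000, (2, 4)))
print(logits.shape)  # torch.Size([2, 1000])
```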
- Paragraph Vectors
- As a variant of the previous method, encode document as a paragraph vector.
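One way to realise this variant is sketched below with gensim's Doc2Vec; the hyper-parameters and the toy corpus are placeholders.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = ["barack obama was born in hawaii", "paris is the capital of france"]
corpus = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(raw_docs)]

model = Doc2Vec(vector_size=64, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# The inferred paragraph vector replaces the averaged document embedding;
# it would then be concatenated with the property representation as before.
doc_vector = model.infer_vector("emmanuel macron was born in amiens".split())
print(doc_vector.shape)  # (64,)
```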
- LSTM Reader
- An LSTM reads the property and document sequence word by word and uses its final state as the joint representation.
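A compact PyTorch sketch, assuming the property tokens are simply prepended to the document tokens; all sizes are placeholders.

```python
import torch
import torch.nn as nn

class LSTMReader(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=128, hidden_dim=256, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_answers)

    def forward(self, prop_ids, doc_ids):
        # Read property then document as one word sequence.
        seq = torch.cat([prop_ids, doc_ids], dim=1)
        _, (h_n, _) = self.lstm(self.embed(seq))
        return self.out(h_n[-1])  # final hidden state is the joint representation

model = LSTMReader()
logits = model(torch.randint(0, 50000, (2, 4)), torch.randint(0, 50000, (2, 300)))
print(logits.shape)  # torch.Size([2, 1000])
```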
- Attentive Reader
- Use an attention mechanism to focus on relevant parts of the document for a given property.
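A rough sketch of property-conditioned attention over document positions. The bilinear scoring function and all sizes are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class AttentiveReader(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=128, hidden_dim=256, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.doc_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.prop_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.score = nn.Bilinear(hidden_dim, hidden_dim, 1)
        self.out = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, prop_ids, doc_ids):
        doc_states, _ = self.doc_rnn(self.embed(doc_ids))   # (B, T, H)
        _, (prop_h, _) = self.prop_rnn(self.embed(prop_ids))
        prop_repr = prop_h[-1]                               # (B, H)

        # Attention weights: relevance of each document position to the property.
        query = prop_repr.unsqueeze(1).expand_as(doc_states).contiguous()
        scores = self.score(doc_states.contiguous(), query).squeeze(-1)  # (B, T)
        alpha = torch.softmax(scores, dim=1)
        doc_repr = (alpha.unsqueeze(-1) * doc_states).sum(dim=1)         # (B, H)

        return self.out(torch.cat([doc_repr, prop_repr], dim=-1))

model = AttentiveReader()
print(model(torch.randint(0, 50000, (2, 4)), torch.randint(0, 50000, (2, 300))).shape)
```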
- Memory Networks
- Maps a property p and the list of sentences x1, x2, ..., xn into a joint representation by attending over the sentences of the document.
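A simplified single-hop sketch over sentence bag-of-words embeddings; the single hop, the dot-product attention, and the sizes are assumptions.

```python
import torch
import torch.nn as nn

class SentenceMemoryNetwork(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=128, num_answers=1000):
        super().__init__()
        self.key_embed = nn.Embedding(vocab_size, embed_dim)    # memory keys
        self.value_embed = nn.Embedding(vocab_size, embed_dim)  # memory values
        self.prop_embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(embed_dim, num_answers)

    def forward(self, prop_ids, sent_ids):
        # sent_ids: (batch, n_sentences, sent_len) word ids for each sentence.
        keys = self.key_embed(sent_ids).mean(dim=2)      # (B, S, D)
        values = self.value_embed(sent_ids).mean(dim=2)  # (B, S, D)
        query = self.prop_embed(prop_ids).mean(dim=1)    # (B, D)

        # Attend over sentences with the property as the query.
        alpha = torch.softmax((keys * query.unsqueeze(1)).sum(-1), dim=1)  # (B, S)
        joint = query + (alpha.unsqueeze(-1) * values).sum(dim=1)
        return self.out(joint)

model = SentenceMemoryNetwork()
logits = model(torch.randint(0, 50000, (2, 4)), torch.randint(0, 50000, (2, 10, 30)))
print(logits.shape)  # torch.Size([2, 1000])
```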
- For relational properties, it makes more sense to model the problem as information extraction than as classification.
- RNNLabeler
- Use an RNN to read the sequence of words and estimate if a given word is part of the answer.
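A sketch of this sequence-labelling view; the bidirectional LSTM and the per-token sigmoid output are assumptions about one reasonable implementation.

```python
import torch
import torch.nn as nn

class RNNLabeler(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.tag = nn.Linear(2 * hidden_dim, 1)  # per-word score: part of the answer?

    def forward(self, prop_ids, doc_ids):
        seq = torch.cat([prop_ids, doc_ids], dim=1)
        states, _ = self.rnn(self.embed(seq))
        return self.tag(states).squeeze(-1)  # one logit per input word

model = RNNLabeler()
logits = model(torch.randint(0, 50000, (2, 4)), torch.randint(0, 50000, (2, 300)))
# Trained with binary cross-entropy against "is this word part of the answer" labels.
print(torch.sigmoid(logits).shape)  # torch.Size([2, 304])
```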
- Basic SeqToSeq (Sequence to Sequence)
- Similar to LSTM Reader but augmented with a second RNN to decode answer as a sequence of words.
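A compact encoder-decoder sketch with teacher-forced decoding over a shared word vocabulary; all sizes and the single-layer LSTMs are placeholders.

```python
import torch
import torch.nn as nn

class Seq2SeqReader(nn.Module):
    def __init__(self, vocab_size=50000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode property + document; decode the answer as a word sequence.
        _, state = self.encoder(self.embed(src_ids))
        dec_states, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_states)  # (batch, tgt_len, vocab) logits

model = Seq2SeqReader()
logits = model(torch.randint(0, 50000, (2, 304)), torch.randint(0, 50000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 50000])
```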
- Placeholder SeqToSeq
- Extends Basic SeqToSeq to handle OOV (Out of Vocabulary) words by adding placeholders to the vocabulary.
- OOV words in the document and answer are replaced by placeholders so that input and output sentences are a mixture of words and placeholders only.
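A sketch of this placeholder preprocessing; the placeholder naming scheme, the cap on placeholders, and the copy-back mapping are assumptions about one reasonable implementation.

```python
def add_placeholders(doc_tokens, answer_tokens, vocab, n_placeholders=10):
    """Replace OOV words with indexed placeholders shared by document and answer."""
    mapping = {}

    def convert(tokens):
        out = []
        for tok in tokens:
            if tok in vocab:
                out.append(tok)
            else:
                if tok not in mapping and len(mapping) < n_placeholders:
                    mapping[tok] = f"<PLH_{len(mapping)}>"
                out.append(mapping.get(tok, "<UNK>"))
        return out

    return convert(doc_tokens), convert(answer_tokens), mapping

vocab = {"was", "born", "in", "the", "singer"}
doc, ans, mapping = add_placeholders(
    "Freddie Mercury was born in Zanzibar".lower().split(),
    "zanzibar".split(), vocab)
print(doc)      # ['<PLH_0>', '<PLH_1>', 'was', 'born', 'in', '<PLH_2>']
print(ans)      # ['<PLH_2>']
print(mapping)  # lets a predicted placeholder be mapped back to its surface form
```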
- Basic Character SeqToSeq
- A property encoder RNN reads the property character by character and transforms it into a fixed-length vector.
- This becomes the initial hidden state for the second layer of a 2-layer document encoder RNN.
- Final state of this RNN is used by answer decoder RNN to generate answer as a character sequence.
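A sketch of this wiring in PyTorch, where the property encoder's final state initialises the second layer of the document encoder. The GRU cells, the sizes, and the zero-initialised first layer are assumptions.

```python
import torch
import torch.nn as nn

class CharSeq2Seq(nn.Module):
    def __init__(self, n_chars=128, embed_dim=32, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)
        self.prop_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.doc_encoder = nn.GRU(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_chars)

    def forward(self, prop_chars, doc_chars, ans_chars):
        # Fixed-length property vector from a character-level RNN.
        _, prop_h = self.prop_encoder(self.embed(prop_chars))      # (1, B, H)
        # Initial state of the 2-layer document encoder:
        # zeros for layer 1, the property vector for layer 2.
        h0 = torch.cat([torch.zeros_like(prop_h), prop_h], dim=0)  # (2, B, H)
        _, doc_h = self.doc_encoder(self.embed(doc_chars), h0)
        # Decode the answer character by character from the final document state.
        dec_states, _ = self.decoder(self.embed(ans_chars), doc_h[-1:].contiguous())
        return self.out(dec_states)

model = CharSeq2Seq()
logits = model(torch.randint(0, 128, (2, 20)),
               torch.randint(0, 128, (2, 500)),
               torch.randint(0, 128, (2, 15)))
print(logits.shape)  # torch.Size([2, 15, 128])
```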
- Character SeqToSeq with pretraining
- Train a character-level language model on the input character sequences from the training set and use its weights to initialize the first layer of the encoder and the decoder.
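A sketch of how such an initialisation could be done in PyTorch; the module names and sizes are placeholders standing in for the character seq2seq above, and only the recurrent weights are copied.

```python
import torch.nn as nn

# Hypothetical modules standing in for the character seq2seq sketch above.
char_lm = nn.GRU(32, 256, batch_first=True)                  # recurrent core of the char LM
doc_encoder = nn.GRU(32, 256, num_layers=2, batch_first=True)  # document encoder
decoder = nn.GRU(32, 256, batch_first=True)                  # answer decoder

# ... pretrain char_lm as a next-character predictor on the training text ...

# Copy the LM's recurrent weights into the first encoder layer and the decoder.
lm_state = char_lm.state_dict()  # weight_ih_l0, weight_hh_l0, bias_ih_l0, bias_hh_l0
doc_encoder.load_state_dict({**doc_encoder.state_dict(), **lm_state})
decoder.load_state_dict(lm_state)
```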
- Evaluation metric is the F1 score (harmonic mean of precision and recall).
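A minimal sketch of a set-based F1 for a single instance, assuming predicted and gold answers are compared as sets and per-instance scores are then averaged over the dataset.

```python
def f1(predicted, gold):
    """Set-based F1 between predicted and gold answer sets for one instance."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(f1({"france"}, {"france"}))           # 1.0
print(f1({"france", "spain"}, {"france"}))  # ~0.67
```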
- All models perform well on categorical properties with neural models outperforming others.
- In the case of relational properties, SeqToSeq models have a clear edge.
- SeqToSeq models also show a great deal of balance between relational and categorical properties.
- Language model pretraining enhances the performance of character SeqToSeq approach.
- Results demonstrate that end-to-end SeqToSeq models are the most promising approach for WikiReading-like tasks.