Notes for "NewsQA: A Machine Comprehension Dataset" paper

NewsQA: A Machine Comprehension Dataset

Introduction

  • The paper presents NewsQA, a machine comprehension dataset of 119,633 natural language questions obtained from 12,744 CNN articles.
  • Link to the paper

Issues With Existing Datasets

  • Too small, e.g. MCTest.
  • Use synthetically generated questions, e.g. the BookTest dataset.
  • SQuAD is similar to NewsQA but is not as challenging or as diverse.

Desired Characteristics Of A Machine Comprehension Dataset

  • Answers of arbitrary length instead of candidate answers to choose from.
  • Some questions should have no correct answer in the document.
  • Lexical and syntactic divergence between questions and answers.
  • Questions should require reasoning beyond simple word and context matching.

Collection Methodology

  • Article Curation

    • Retrieve and sample articles from CNN.
    • Partition data into a training set (90%), a development set (5%), and a test set (5%).
  • Question Sourcing

    • Questioners see only a news article's headline and its summary (not the full text) and use these to formulate questions about the article.
  • Answer Sourcing

    • Answerers receive the questions along with the full article.
    • They either mark the answer in the article, reject the question as nonsensical, or select the null answer if the article contains insufficient information.
  • Validation

    • Another set of crowd workers sees the full article, a question and the set of unique answers to that question.
    • They either choose the best answer among the candidates or reject all the answers. (A hypothetical record layout capturing this pipeline is sketched after this list.)
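To make the collection pipeline above concrete, here is a minimal sketch of what one collected example could look like. The class name (`NewsQAExample`) and all field names are illustrative assumptions and do not mirror the official NewsQA release format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class NewsQAExample:
    """Hypothetical record layout illustrating the collection pipeline above."""
    story_id: str                     # CNN article identifier
    article: str                      # full article text shown to answerers
    question: str                     # written from the headline + summary only
    # Each answerer marks a character span in the article, or None for
    # "question is nonsensical / article lacks the information".
    candidate_spans: List[Optional[Tuple[int, int]]] = field(default_factory=list)
    # Span chosen by the validators, or None if they rejected all candidates.
    validated_span: Optional[Tuple[int, int]] = None

    def answer_text(self) -> Optional[str]:
        """Return the validated answer string, or None for unanswerable questions."""
        if self.validated_span is None:
            return None
        start, end = self.validated_span
        return self.article[start:end]
```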

Data Analysis

  • Answer Types

    • Linguistically diverse answer set with the following distribution:
      • common noun phrases (22.2%), clause phrases (18.3%), person (14.8%), numeric (9.8%), and other (11.2%) types.
  • Reasoning Types

    • Type of reasoning required, in ascending order of difficulty, along with approx. percentage of questions:
      • Word Matching (32.7%)
      • Paraphrasing (27%)
      • Inference (13.2%)
      • Synthesis (20.7%)
      • Ambiguous/Insufficient (6.4%)

Baseline Models

  • match-LSTM

    • An LSTM network encodes the article and the question as sequences of hidden states.
    • An mLSTM network compares the article encodings with the question encodings.
    • A Pointer Network uses the hidden states of the mLSTM to select the boundaries of the answer span (a simplified sketch appears after this list).
  • Bilinear Annotation Re-encoding Boundary (BARB) Model

    • Encode all the words in the article and the question using GloVe embeddings, and further encode them into contextual states using a GRU.
    • Compare the document and the question encodings using C bilinear transformations to obtain a tensor of annotation scores.
    • Take the maximum over the question-token dimension to obtain an annotation vector over the document-word dimension.
    • For each document word, feed the document encoding, the annotation vector, and a binary feature (indicating whether the document word appears in the question) into a re-encoding RNN to obtain encodings for the boundary-pointing stage.
    • Use convolutional networks to determine the boundaries of the answer span (similar to edge detection).
    • For further details, refer to the paper. (Sketches of both baselines follow below.)
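Below is a heavily simplified sketch of the match-LSTM baseline described above: it keeps the three stages (encode, match, point) but replaces Wang & Jiang's full attention and pointer-decoder machinery with a single attention step and per-position start/end scorers. All layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MatchLSTMSketch(nn.Module):
    """Simplified encode / match / point pipeline; not the exact published model."""

    def __init__(self, emb_dim=300, hid=128):
        super().__init__()
        self.p_lstm = nn.LSTM(emb_dim, hid, batch_first=True)   # encodes the article
        self.q_lstm = nn.LSTM(emb_dim, hid, batch_first=True)   # encodes the question
        # Match layer: compare each article state with an attention-weighted
        # summary of the question states, then re-encode the pair.
        self.match_lstm = nn.LSTM(2 * hid, hid, batch_first=True)
        # Pointer layers score each article position as the span start / end.
        self.start_ptr = nn.Linear(hid, 1)
        self.end_ptr = nn.Linear(hid, 1)

    def forward(self, passage_emb, question_emb):
        hp, _ = self.p_lstm(passage_emb)      # (B, n, hid)
        hq, _ = self.q_lstm(question_emb)     # (B, m, hid)
        # Word-by-word attention of the question over each passage position.
        attn = torch.softmax(hp @ hq.transpose(1, 2), dim=-1)    # (B, n, m)
        q_summary = attn @ hq                                     # (B, n, hid)
        matched, _ = self.match_lstm(torch.cat([hp, q_summary], dim=-1))  # (B, n, hid)
        start_logits = self.start_ptr(matched).squeeze(-1)        # (B, n)
        end_logits = self.end_ptr(matched).squeeze(-1)            # (B, n)
        return start_logits, end_logits
```

At inference time, the predicted span would be the (start, end) pair with start ≤ end that maximises the sum of the two logits.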
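The following sketch mirrors the BARB steps listed above (bilinear annotation, max over question tokens, re-encoding, convolutional boundary pointing). Dimensions, layer sizes, and the number of bilinear maps C are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class BARBSketch(nn.Module):
    """Minimal sketch of the BARB annotation + boundary stages described above."""

    def __init__(self, emb_dim=300, hid=128, C=4):
        super().__init__()
        # Contextual encoders over (pre-trained) GloVe embeddings.
        self.doc_gru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        self.q_gru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        # C bilinear maps comparing document and question states.
        self.bilinear = nn.ModuleList(
            [nn.Bilinear(2 * hid, 2 * hid, 1) for _ in range(C)])
        # Re-encoding RNN over [doc state; annotation vector; in-question flag].
        self.reencode = nn.GRU(2 * hid + C + 1, hid, batch_first=True, bidirectional=True)
        # 1-D convolutions score the start and end of the answer span.
        self.start_conv = nn.Conv1d(2 * hid, 1, kernel_size=3, padding=1)
        self.end_conv = nn.Conv1d(2 * hid, 1, kernel_size=3, padding=1)

    def forward(self, doc_emb, q_emb, in_question):
        # doc_emb: (B, n, emb_dim), q_emb: (B, m, emb_dim), in_question: (B, n, 1)
        d, _ = self.doc_gru(doc_emb)           # (B, n, 2*hid)
        q, _ = self.q_gru(q_emb)               # (B, m, 2*hid)
        B, n, H = d.shape
        m = q.shape[1]
        # Annotation tensor: one (n, m) score matrix per bilinear map.
        scores = []
        for blin in self.bilinear:
            di = d.unsqueeze(2).expand(B, n, m, H).reshape(-1, H)
            qj = q.unsqueeze(1).expand(B, n, m, H).reshape(-1, H)
            scores.append(blin(di, qj).view(B, n, m))
        g = torch.stack(scores, dim=-1)         # (B, n, m, C)
        # Max over question tokens gives a per-document-word annotation vector.
        g = g.max(dim=2).values                 # (B, n, C)
        # Re-encode document words with annotations and the binary feature.
        h, _ = self.reencode(torch.cat([d, g, in_question], dim=-1))   # (B, n, 2*hid)
        h = h.transpose(1, 2)                   # (B, 2*hid, n) for Conv1d
        start_logits = self.start_conv(h).squeeze(1)   # (B, n)
        end_logits = self.end_conv(h).squeeze(1)       # (B, n)
        return start_logits, end_logits
```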

Observations

  • The gap between human and machine performance on NewsQA is much larger than that for SQuAD, probably because of the longer articles in NewsQA.
  • This suggests that NewsQA is a far more challenging dataset than SQuAD and presents a large scope for improvement for machine comprehension tasks.
  • Questions requiring inference and synthesis are more challenging for the model as compared to other kinds of questions.
  • Interestingly, BARB outperforms human annotators on SQuAD in terms of answering ambiguous questions or those with incomplete information.