Notes for "NewsQA: A Machine Comprehension Dataset" paper

NewsQA: A Machine Comprehension Dataset

Introduction

  • The paper presents NewsQA, a machine comprehension dataset of 119,633 natural language questions obtained from 12,744 CNN articles.
  • Link to the paper

Issues With Existing Datasets

  • Too small, e.g. MCTest.
  • Use synthetically generated questions, e.g. the BookTest dataset.
  • SQuAD is similar to NewsQA but is not as challenging or as diverse.

Desired Characteristics Of A Machine Comprehension Dataset

  • Answers of arbitrary length instead of candidate answers to choose from.
  • Some questions should have no correct answer in the document.
  • Lexical and syntactic divergence between questions and answers.
  • Questions should require reasoning beyond simple word and context matching.

Collection Methodology

  • Article Curation

    • Retrieve and sample articles from CNN.
    • Partition data into a training set (90%), a development set (5%), and a test set (5%).
  • Question Sourcing

    • Questioners see only a news article's headline and its summary (not the full text) and use these to formulate questions about the article.
  • Answer Sourcing

    • Answerers receive the questions along with the full article.
    • They either mark the answer in the article, reject the question as nonsensical, or select the null answer if the article contains insufficient information.
  • Validation

    • Another set of crowd workers sees the full article, a question and the set of unique answers to that question.
    • They either choose the best answer among the candidates or reject all the answers. (A hypothetical record layout capturing this pipeline is sketched after this list.)
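To make the collection pipeline above concrete, here is a minimal sketch of what one collected example could look like. The class name (`NewsQAExample`) and all field names are illustrative assumptions and do not mirror the official NewsQA release format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class NewsQAExample:
    """Hypothetical record layout illustrating the collection pipeline above."""
    story_id: str                     # CNN article identifier
    article: str                      # full article text shown to answerers
    question: str                     # written from the headline + summary only
    # Each answerer marks a character span in the article, or None for
    # "question is nonsensical / article lacks the information".
    candidate_spans: List[Optional[Tuple[int, int]]] = field(default_factory=list)
    # Span chosen by the validators, or None if they rejected all candidates.
    validated_span: Optional[Tuple[int, int]] = None

    def answer_text(self) -> Optional[str]:
        """Return the validated answer string, or None for unanswerable questions."""
        if self.validated_span is None:
            return None
        start, end = self.validated_span
        return self.article[start:end]
```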

Data Analysis

  • Answer Types

    • Linguistically diverse answer set with the following distribution:
      • common noun phrases (22.2%), clause phrases (18.3%), person (14.8%), numeric (9.8%), and other (11.2%) types.
  • Reasoning Types

    • Type of reasoning required, in ascending order of difficulty, along with approx. percentage of questions:
      • Word Matching (32.7%)
      • Paraphrasing (27%)
      • Inference (13.2%)
      • Synthesis (20.7%)
      • Ambiguous/Insufficient (6.4%)

Baseline Models

  • match-LSTM

    • An LSTM network encodes the article and the question as sequences of hidden states.
    • An mLSTM network compares the article encodings with the question encodings.
    • A Pointer Network uses the hidden states of the mLSTM to select the boundaries of the answer span (a simplified sketch appears after this list).
  • Bilinear Annotation Re-encoding Boundary (BARB) Model

    • Encode all the words in the article and the question using GloVe embeddings, and further encode them into contextual states using a GRU.
    • Compare the document and the question encodings using C bilinear transformations to obtain a tensor of annotation scores.
    • Take the maximum over the question-token dimension to obtain an annotation vector over the document-word dimension.
    • For each document word, feed the document encoding, the annotation vector, and a binary feature (indicating whether the document word appears in the question) into a re-encoding RNN to obtain encodings for the boundary-pointing stage.
    • Use convolutional networks to determine the boundaries of the answer span (similar to edge detection).
    • For further details, refer to the paper. (Sketches of both baselines follow below.)
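Below is a heavily simplified sketch of the match-LSTM baseline described above: it keeps the three stages (encode, match, point) but replaces Wang & Jiang's full attention and pointer-decoder machinery with a single attention step and per-position start/end scorers. All layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MatchLSTMSketch(nn.Module):
    """Simplified encode / match / point pipeline; not the exact published model."""

    def __init__(self, emb_dim=300, hid=128):
        super().__init__()
        self.p_lstm = nn.LSTM(emb_dim, hid, batch_first=True)   # encodes the article
        self.q_lstm = nn.LSTM(emb_dim, hid, batch_first=True)   # encodes the question
        # Match layer: compare each article state with an attention-weighted
        # summary of the question states, then re-encode the pair.
        self.match_lstm = nn.LSTM(2 * hid, hid, batch_first=True)
        # Pointer layers score each article position as the span start / end.
        self.start_ptr = nn.Linear(hid, 1)
        self.end_ptr = nn.Linear(hid, 1)

    def forward(self, passage_emb, question_emb):
        hp, _ = self.p_lstm(passage_emb)      # (B, n, hid)
        hq, _ = self.q_lstm(question_emb)     # (B, m, hid)
        # Word-by-word attention of the question over each passage position.
        attn = torch.softmax(hp @ hq.transpose(1, 2), dim=-1)    # (B, n, m)
        q_summary = attn @ hq                                     # (B, n, hid)
        matched, _ = self.match_lstm(torch.cat([hp, q_summary], dim=-1))  # (B, n, hid)
        start_logits = self.start_ptr(matched).squeeze(-1)        # (B, n)
        end_logits = self.end_ptr(matched).squeeze(-1)            # (B, n)
        return start_logits, end_logits
```

At inference time, the predicted span would be the (start, end) pair with start ≤ end that maximises the sum of the two logits.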
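The following sketch mirrors the BARB steps listed above (bilinear annotation, max over question tokens, re-encoding, convolutional boundary pointing). Dimensions, layer sizes, and the number of bilinear maps C are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class BARBSketch(nn.Module):
    """Minimal sketch of the BARB annotation + boundary stages described above."""

    def __init__(self, emb_dim=300, hid=128, C=4):
        super().__init__()
        # Contextual encoders over (pre-trained) GloVe embeddings.
        self.doc_gru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        self.q_gru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        # C bilinear maps comparing document and question states.
        self.bilinear = nn.ModuleList(
            [nn.Bilinear(2 * hid, 2 * hid, 1) for _ in range(C)])
        # Re-encoding RNN over [doc state; annotation vector; in-question flag].
        self.reencode = nn.GRU(2 * hid + C + 1, hid, batch_first=True, bidirectional=True)
        # 1-D convolutions score the start and end of the answer span.
        self.start_conv = nn.Conv1d(2 * hid, 1, kernel_size=3, padding=1)
        self.end_conv = nn.Conv1d(2 * hid, 1, kernel_size=3, padding=1)

    def forward(self, doc_emb, q_emb, in_question):
        # doc_emb: (B, n, emb_dim), q_emb: (B, m, emb_dim), in_question: (B, n, 1)
        d, _ = self.doc_gru(doc_emb)           # (B, n, 2*hid)
        q, _ = self.q_gru(q_emb)               # (B, m, 2*hid)
        B, n, H = d.shape
        m = q.shape[1]
        # Annotation tensor: one (n, m) score matrix per bilinear map.
        scores = []
        for blin in self.bilinear:
            di = d.unsqueeze(2).expand(B, n, m, H).reshape(-1, H)
            qj = q.unsqueeze(1).expand(B, n, m, H).reshape(-1, H)
            scores.append(blin(di, qj).view(B, n, m))
        g = torch.stack(scores, dim=-1)         # (B, n, m, C)
        # Max over question tokens gives a per-document-word annotation vector.
        g = g.max(dim=2).values                 # (B, n, C)
        # Re-encode document words with annotations and the binary feature.
        h, _ = self.reencode(torch.cat([d, g, in_question], dim=-1))   # (B, n, 2*hid)
        h = h.transpose(1, 2)                   # (B, 2*hid, n) for Conv1d
        start_logits = self.start_conv(h).squeeze(1)   # (B, n)
        end_logits = self.end_conv(h).squeeze(1)       # (B, n)
        return start_logits, end_logits
```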

Observations

  • The gap between human and machine performance on NewsQA is much larger than that for SQuAD, probably because of the longer articles in NewsQA.
  • This suggests that NewsQA is a far more challenging dataset than SQuAD and presents a large scope for improvement for machine comprehension tasks.
  • Questions requiring inference and synthesis are more challenging for the model as compared to other kinds of questions.
  • Interestingly, BARB outperforms human annotators on SQuAD in terms of answering ambiguous questions or those with incomplete information.