Notes for "WikiQA: A challenge dataset for open-domain question answering" paper

WikiQA: A challenge dataset for open-domain question answering

Introduction

  • Presents WikiQA - a publicly available set of question and sentence pairs for open-domain question answering.

Dataset

  • 3047 questions sampled from Bing query logs.
  • Each question is associated with a Wikipedia page.
  • All sentences in the summary paragraph of the page become the candidate answers.
  • Only about one-third of the questions have a correct answer in the candidate answer set.
  • Labels were crowdsourced through an MTurk-like platform.
  • Answer sentences are also annotated with answer phrases (the shortest substring of a sentence that answers the question), though this annotation is not used in the experiments reported in the paper.

Other Datasets

  • QASent dataset
    • Uses questions from the TREC-QA dataset (drawn from both query logs and human editors) and selects sentences that share at least one non-stopword with the question.
    • The resulting lexical overlap makes the QA task easier.
    • Does not support evaluating answer triggering (detecting whether a correct answer even exists among the candidate sentences).
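The QASent selection heuristic can be sketched in a few lines. (The stopword list and tokenizer here are illustrative assumptions, not the original preprocessing.)

```python
# Illustrative stopword list; the original pipeline's list is not specified here.
STOPWORDS = {"the", "a", "an", "is", "of", "in", "what", "who", "when"}

def tokenize(text):
    # Naive whitespace tokenizer with basic punctuation stripping (an assumption).
    return [w.strip(".,?!").lower() for w in text.split()]

def content_words(text):
    # Non-stopword tokens of the text.
    return {w for w in tokenize(text) if w not in STOPWORDS}

def select_candidates(question, sentences):
    # QASent-style heuristic: keep sentences sharing at least one
    # non-stopword with the question.
    q_words = content_words(question)
    return [s for s in sentences if content_words(s) & q_words]
```

Because candidates are selected by lexical overlap, every candidate already shares vocabulary with the question, which is exactly why simple word-matching baselines do well on QASent.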

Experiments

Baseline Systems

  • Word Count - Counts the number of non-stopwords common to question and answer sentences.
  • Weighted Word Count - Re-weights the word counts by the IDF values of the question words.
  • LCLR - Uses rich lexical semantic features like WordNet and vector-space lexical semantic models.
  • Paragraph Vectors - Considers cosine similarity between question vector and sentence vector.
  • Convolutional Neural Network (CNN) - Bigram CNN model with average pooling.
  • PV-Cnt and CNN-Cnt - Logistic regression classifiers combining the PV (respectively CNN) model score with the Word Count features.
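The two counting baselines can be sketched as follows (the tokenizer, stopword list, and IDF estimation are illustrative assumptions; the LCLR, PV, and CNN models are not reproduced here):

```python
import math

# Illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"the", "a", "an", "is", "of", "in", "what", "who", "when"}

def tokens(text):
    # Naive whitespace tokenizer with basic punctuation stripping.
    return [w.strip(".,?!").lower() for w in text.split()]

def word_count(question, sentence):
    # Word Count baseline: number of non-stopword question tokens
    # that also appear in the candidate sentence.
    s = set(tokens(sentence))
    return sum(1 for w in tokens(question) if w not in STOPWORDS and w in s)

def weighted_word_count(question, sentence, idf):
    # Weighted Word Count: the same matches, re-weighted by the IDF
    # of each matched question word.
    s = set(tokens(sentence))
    return sum(idf.get(w, 0.0) for w in tokens(question)
               if w not in STOPWORDS and w in s)

def idf_from_corpus(docs):
    # One common IDF estimate: log(N / df(w)) over a sentence corpus.
    n = len(docs)
    df = {}
    for d in docs:
        for w in set(tokens(d)):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / c) for w, c in df.items()}
```

Candidate sentences are then ranked per question by these scores, highest first.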

Metrics

  • MAP and MRR for answer selection problem.
  • Precision, recall and F1 scores for answer triggering problem.
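A minimal sketch of the ranking metrics, assuming each question comes with a model-ranked candidate list carrying binary relevance labels (these are the standard MAP/MRR/F1 definitions, not code from the paper's evaluation script):

```python
def reciprocal_rank(ranked_labels):
    # MRR component: reciprocal rank of the first correct answer, 0 if none.
    for i, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / i
    return 0.0

def average_precision(ranked_labels):
    # AP: mean of precision@k over the ranks k where a correct answer appears.
    hits, precisions = 0, []
    for i, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def map_mrr(questions):
    # MAP and MRR averaged over per-question ranked label lists
    # (conventionally computed only over questions that have at
    # least one correct answer).
    n = len(questions)
    return (sum(average_precision(q) for q in questions) / n,
            sum(reciprocal_rank(q) for q in questions) / n)

def f1(precision, recall):
    # F1 for the answer triggering problem, given question-level
    # precision and recall of "this question is answerable" decisions.
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```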

Observations

  • CNN-Cnt outperforms all other models on both tasks.
  • Three additional features, namely the length of the question (QLen), the length of sentence (SLen), and the class of the question (QClass) are added to track question hardness and sentence comprehensiveness.
  • Adding QLen improves performance significantly, adding SLen improves it marginally, and adding QClass degrades it marginally.
  • For the same model, the performance on the WikiQA dataset is inferior to that on the QASent dataset.
  • Note: The dataset is too small to train end-to-end networks.