Notes for "WikiQA: A challenge dataset for open-domain question answering" paper

WikiQA: A challenge dataset for open-domain question answering

Introduction

  • Presents WikiQA - a publicly available set of question and sentence pairs for open-domain question answering.

Dataset

  • 3047 questions sampled from Bing query logs.
  • Each question is associated with a Wikipedia page.
  • All sentences in the summary paragraph of the page become the candidate answers.
  • Only about one-third of the questions have a correct answer in the candidate answer set.
  • Labels were crowdsourced through an MTurk-like platform.
  • Answer sentences are also annotated with answer phrases (the shortest substring of a sentence that answers the question), though this annotation is not used in the experiments reported in the paper.

Other Datasets

  • QASent dataset
    • Uses questions from the TREC-QA dataset (drawn from both query logs and human editors) and selects sentences that share at least one non-stopword with the question.
    • The resulting lexical overlap makes the QA task easier.
    • Does not support evaluating answer triggering (detecting whether a correct answer even exists among the candidate sentences).
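The QASent selection heuristic can be sketched in a few lines. (The stopword list and tokenizer here are illustrative assumptions, not the original preprocessing.)

```python
# Illustrative stopword list; the original pipeline's list is not specified here.
STOPWORDS = {"the", "a", "an", "is", "of", "in", "what", "who", "when"}

def tokenize(text):
    # Naive whitespace tokenizer with basic punctuation stripping (an assumption).
    return [w.strip(".,?!").lower() for w in text.split()]

def content_words(text):
    # Non-stopword tokens of the text.
    return {w for w in tokenize(text) if w not in STOPWORDS}

def select_candidates(question, sentences):
    # QASent-style heuristic: keep sentences sharing at least one
    # non-stopword with the question.
    q_words = content_words(question)
    return [s for s in sentences if content_words(s) & q_words]
```

Because candidates are selected by lexical overlap, every candidate already shares vocabulary with the question, which is exactly why simple word-matching baselines do well on QASent.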

Experiments

Baseline Systems

  • Word Count - Counts the number of non-stopwords common to question and answer sentences.
  • Weighted Word Count - Re-weights the word counts by the IDF values of the question words.
  • LCLR - Uses rich lexical semantic features like WordNet and vector-space lexical semantic models.
  • Paragraph Vectors - Considers cosine similarity between question vector and sentence vector.
  • Convolutional Neural Network (CNN) - Bigram CNN model with average pooling.
  • PV-Cnt and CNN-Cnt - Logistic regression classifiers combining the PV (respectively CNN) model score with the Word Count features.
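The two counting baselines can be sketched as follows (the tokenizer, stopword list, and IDF estimation are illustrative assumptions; the LCLR, PV, and CNN models are not reproduced here):

```python
import math

# Illustrative stopword list (an assumption, not the paper's list).
STOPWORDS = {"the", "a", "an", "is", "of", "in", "what", "who", "when"}

def tokens(text):
    # Naive whitespace tokenizer with basic punctuation stripping.
    return [w.strip(".,?!").lower() for w in text.split()]

def word_count(question, sentence):
    # Word Count baseline: number of non-stopword question tokens
    # that also appear in the candidate sentence.
    s = set(tokens(sentence))
    return sum(1 for w in tokens(question) if w not in STOPWORDS and w in s)

def weighted_word_count(question, sentence, idf):
    # Weighted Word Count: the same matches, re-weighted by the IDF
    # of each matched question word.
    s = set(tokens(sentence))
    return sum(idf.get(w, 0.0) for w in tokens(question)
               if w not in STOPWORDS and w in s)

def idf_from_corpus(docs):
    # One common IDF estimate: log(N / df(w)) over a sentence corpus.
    n = len(docs)
    df = {}
    for d in docs:
        for w in set(tokens(d)):
            df[w] = df.get(w, 0) + 1
    return {w: math.log(n / c) for w, c in df.items()}
```

Candidate sentences are then ranked per question by these scores, highest first.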

Metrics

  • MAP and MRR for answer selection problem.
  • Precision, recall and F1 scores for answer triggering problem.
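A minimal sketch of the ranking metrics, assuming each question comes with a model-ranked candidate list carrying binary relevance labels (these are the standard MAP/MRR/F1 definitions, not code from the paper's evaluation script):

```python
def reciprocal_rank(ranked_labels):
    # MRR component: reciprocal rank of the first correct answer, 0 if none.
    for i, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / i
    return 0.0

def average_precision(ranked_labels):
    # AP: mean of precision@k over the ranks k where a correct answer appears.
    hits, precisions = 0, []
    for i, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def map_mrr(questions):
    # MAP and MRR averaged over per-question ranked label lists
    # (conventionally computed only over questions that have at
    # least one correct answer).
    n = len(questions)
    return (sum(average_precision(q) for q in questions) / n,
            sum(reciprocal_rank(q) for q in questions) / n)

def f1(precision, recall):
    # F1 for the answer triggering problem, given question-level
    # precision and recall of "this question is answerable" decisions.
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```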

Observations

  • CNN-Cnt outperforms all other models on both tasks.
  • Three additional features, namely the length of the question (QLen), the length of sentence (SLen), and the class of the question (QClass) are added to track question hardness and sentence comprehensiveness.
  • Adding QLen improves performance significantly, adding SLen improves it marginally, and adding QClass degrades it marginally.
  • For the same model, the performance on the WikiQA dataset is inferior to that on the QASent dataset.
  • Note: The dataset is too small to train end-to-end networks.