vihari/kaggle-ai-science.md

## kaggle-ai-science.md

      
    Raw
  

              kaggle-ai-science.md
            
          
    If you are an AI nerd, then you probably have heard about the AI²'s (Allen Institute for Artificial Intelligence)
Aristo project to train a computer to pass the standardized tests faced by an eighth grade student. 

In order to simplify the problem, it is narrowed down to just one subject (science, what else?) and the test involves selecting one of the four possible choices for every question.


A Kaggle challenge is launched with a first prize of $50,000 for the same and ended in Feb. 2016.
According to me, the problem is not an incremental easy-tweaky sort, it is a real forward leap (or is it? that
depends on the adapted solutions). 

Some interesting things about the challenge:

    
      The challenge is somewhat similar to the Jeopardy! game (but only easier) and looked like something that IBM Watson can easily take a hit at. 
      IBM Watson is openly challenged to participate by Oren through this tweet.
      Unfortunately, IBM Watson declined for a very weird reason.
    
    
      The challenge is much simpler than what Aristo (AI²'s in-house project to solve this problem) project can handle (or aims to handle). 
      Aristo handles questions with diagrams and does a proper logical deduction to answer 
      (for example see),
      catch a demo here. 
      A particularly interesting thing about Aristo is that it can also explain the answer rather than to just make a smart guess over the available options.
    

To better understand the difficulty of the challenge, I suggest you to download a small set of questions from here and look at it yourself. As a sample, consider the following example questions 

The students in an engineering class built a robot that stacks wooden blocks. A built-in computer controls the movement of the robot. The computer in the robot performs a function most similar to which part of the human body?
(A) lungs (B) heart (C) brain (D) arms


Base your answers on the information below. One hot, summer day it rained very heavily. After the rain, a plastic pan on a picnic table had 2 cm of rainwater in it. Four hours later, all the rainwater in the pan was gone. Which process caused the rainwater in the pan to disappear as it sat outside in the hot air?
(A) condensation (B) evaporation (C) precipitation (D) erosion


Frogs lay eggs that develop into tadpoles and then into adult frogs. This sequence of changes is an example of how living things
(A) go through a life cycle (B) form a food web (C) act as a source of food (D) affect other parts of the ecosystem

Refer this insightful paper ("A Study of the Knowledge Base Requirements for Passing an Elementary Science Test") to better understand the kind of questions and the knowledge required to solve this problem.
However, it is possible to partially solve the problem without any common-sense reasoning and with existing knowledge base construction and representation techniques.
This challenge puts an empirical bound on the best that can be achieved by the existing methods (which is around 60%, the performance of the best submitted model).


In this gist, I will make an effort to understand the best three solutions from the code that is available on Github, see leaderboard.


Note: I did not try to run any of the scripts on my local cluster since they are very computationally intensive and require huge downloads.
Also the training, test and validation datasets are taken down after the end (13/02/2016) of the challenge!
Although it could have been fun to run the scripts and make a deeper analysis of where the model(s) is(are) failing.


In order to draw comparisons between the three approaches, we will look at the features (lexical, semantic, search or whatever),
resources utilized and the model used.
These aspects of the program will hopefully help us in reasoning why a method performed or failed.


Alejandro Mosquera (standing: 3)

This is the only solution with a writeup. Find the writeup along with the code in the Git repo. 

Surprisingly this solution did not use Wikipedia corpus in preparing a knowledge base.
The various datsets that went into the knowledge base are

  Questions from Quizlet of specific tags related to science
  CK-12 workbooks
  Regents 4 grade QA dataset from Aristo
  Multiple choice type QA dataset from Aristo

The next crucial thing is features and is often the game changer in my experience. The features used by this technique are combination of IR (Information Retrieval) and embedding vector ([Word embedding](https://en.wikipedia.org/wiki/Word_embedding)) of words. The features are discussed in section 2 (Features selection) of the writeup. I will discuss these here anyway.


The features considered are:

	Features based on lookup of external data resources. Various text to document overlap scores are assigned to each question,answer pair.
	The strings in the following bullets are the programmatic name(s) given to the feature(s), included here for reference.
	
		Default score assigned by Elastic Search (ES) (ES_raw_lemma)
		Score assigned to the retrieved results based on character overlap between the query and the hits (ES_lemma_regex)
		LMJM (whatever it is) similarity measure over the retrieved documents and query (ES_raw_lemmaLMJM)
		and finally, a hybrid of character overlap based similarity and LMJM similarity (ES_lemma_regexLMJM)
	
	
	Word embeddings related features
	
		
		Word embeddings are trained over cleaned text from CK12 dataset, Quizlet flashcards and most common bigrams and trigrams from the training data. The vector representation of words so formed are used to compute the cosine similarity between the words in question and answer. (W2V_COSQA).
		
	
	Features that are explicitly related to either question or answer
	
		
		A question (that is the entire question+all the answers) is categorized based on the average number of words in the choices available and the type of question, i.e. if it is a multi-sentence question or a simple one. Three categories are identified and referred to as CAT_A1, CAT_A2, CAT_A3.  
		
		
		Based on if the choices contain one of "all the above", "none of the above" and "both X and Y", features such as ANS_all_above, ANS_none_above, ANS_both respectively are identified. 
		
	
It looks like the author has experimented with various features, cross validated and chose features based on weight assigned by the model.
A linear regression model is used to identify the best features and to model.
poweredByTalkwalker (standing: 2)

I suggest exploring this repository, it is well written and has some handy scripts, say to fetch data off Quizlet with its API.
Corpus

In comparison, this solution used a larger knowledge base than the solution that finished third.
The model uses the following external resources

	Simple Wiki English and Wikibooks, indexed such that every paragraph is a different document. (see scripts/downloadWiki.sh)
	CK 12 lessons and books (see scripts/downloadCK12-Lessons.sh and scripts/downloadCK12.sh) 
	Books, these are some collection of PDF docs. from openstaxcollege.org and Wikipedia Commons (see scripts/downloadBooks.sh)
	Quizlet, see https://github.com/bwilbertz/kaggle_allen_ai/tree/master/python i.e. python/Readme.md
 
Features

The features used by this solution are very extensive and interesting.
The authors classify the features into three broad categories

	
	Basic stat. features related to number of words and sentences in question and answer. Specifically, number of sentences in question and answer, number of words in question; number of words, min. words, max. words in the answers. 
	
	
	Query related features -- the external data files are indexed with an elastic installation and features are generated by querying either the question or the answer or both in the index. Various features are generated by varying query related parameters resulting in multiplication in number of features. The query parameters varied are: slope parameter (the tolerable distance among/between words when searching for a phrase like "Monster Beer"), number of search hits considered (top 1, 3, 5, 10, 50). The features are computed either over Quizlet index or a cumulative index (which includes CK12 and Wikipedia related datasets). A query for a phrase with more than one word is just the phrase without quotes, for example the query for the question Who is the current american president? is an OR of all the words in it. The basic features omitting the minor variations are: (The strings in the following bullets are the programmatic name(s) given to the feature(s), included here for reference.)
	
		Absolute number of hits of question, answer and their ratio over the Quizlet index and cumulative index. (df.hits.simple.And.quizlet.RData, df.hits.simple.And.seeccbq.RData)
		Sum of scores (assigned by elasticsearch) of top n hits for question and answer over Quizlet index and cumulative index. (df.maxScore.simple.And.quizlet.RData, df.maxScore.explain.simple.seeccbq.RData)
		Normalized sum of scores (see the above bullet) per word over cumulative index. (df.maxScore.perWord.explain.simple.seeccbq.RData)
		The proportion of words in question and answer found in the best matching document in cumulative index. (df.maxScore.matchRatio.explain.simple.seeccbq.RData)
		Further, fifty nearest neighbours to question and answer pair are identified by querying for the question and filtering with the answers. Sum of scores (elasticsearch assigned) of nearest n hits is emitted as a feature.
	
	
	Mutual information related feature -- which reflect the number of times a word in question has co-occurred with a word in answer normalized by their respective frequencies. (df.lpmi.mean.secbq2.tcm_50_xenglish.RData, df.coProb.mean.bm25.en-si-ck.NVAO_glove_50_500.RData)
	Hash of all the words in the full text including the question and answer (df.quizlet.hash12.neg3_10_5000.RData). (using FeatureHashing package, CRAN)

Model: A gradient boosting model is used to train and predict the answer. (See xgboost CRAN package)


Cardal (standing: 1)

Note that this is a very well-written solution (although it needs a lot of documentation) and contains many useful utility routines for example to read Wikipedia files or HTML stuff. The monolithic script is worth taking a look at. I have added a few comments for important methods to make it easier to read, find the script with added comments here.
Corpus

The resources used by the model are well listed in the README of the solution's Github page, I am copying the particular section here for reference.

   (a) AI2 - data downloaded from the AI2 web-site: http://allenai.org/data.html. 
I extracted all the text from these datasets and organized it into a single corpus file at corpus/AI2_data/ai2_corpus.txt
 
   (b) CK12 - I downloaded two types of resources from the http://www.ck12.org:

       First, I used the HTML files in the EPUB file "Concepts_b_v8_vdt.epub".

       Second, I downloaded the science-related books in pdf format ("FlexBooks"), and converted them into text files using MS Word (with the "Save as" option). The converted files are under corpus/CK12 with the ".text" suffix; the original pdf books are quite large, so they aren't included here (I kept the original name, only replaced the suffix).
       

       I prepared four corpus files from the CK12 books:

       corpus/CK12/OEBPS/ck12.txt - based on the HTML files in "Concepts_b_v8_vdt.epub";

       corpus/CK12/OEBPS/ck12_paragraphs.txt - as above, but using a different criterion for splitting the text into sections (paragraphs);

       corpus/CK12/ck12_text.txt - based on the pdf files;

       corpus/CK12/ck12_text_sentences.txt - as above, but using a different criterion for splitting the text into sections (sentences).
       

   (c) Quizlet - study cards from quizlet.com. I parsed the cards (e.g., in order to remove wrong answers from multiple-choice questions), and manually editted some of them (e.g., to fix format problems). The full corpus is in corpus/quizlet/quizlet_corpus.txt
   

   (d) Saylor and OpenStax - books in pdf format from Saylor Academy (http://www.saylor.org/books/) and OpenStax College (https://www.openstaxcollege.org/books). I extracted the text from the pdf files using MS Word or (for large files) the pdfminer tool ("pdf2txt.py").
       The full corpus is in corpus/Saylor/saylor_text.txt
       

   (e) SimpleWiki - I downloaded the SimpleWiki dump file "simplewiki-20151102-pages-articles.xml" (496MB) 
       from https://dumps.wikimedia.org/simplewiki/20151102/, and constructed three corpus files by extracting 
       some sections from subsets of the pages. The files are under corpus/simplewiki:

          simplewiki_1.0000_0.0500_0_5_True_True_True_corpus.txt (117MB)

          simplewiki_1.0000_0.1000_0_3_True_True_False_corpus.txt (114MB)

          simplewiki_1.0000_0.0100_0_3_True_True_False_pn59342_corpus.txt (4.4MB)

   (f) StudyStack - study cards from http://www.studystack.com. I parsed the cards (e.g., to
       remove sentences marked as False), and saved four versions under corpus/StudyStack, each 
       containing a different subset of the cards:
          studystack_corpus.txt  (72MB)

          studystack_corpus2.txt  (4MB)

          studystack_corpus3.txt  (9MB)

          studystack_corpus4.txt (17MB)

   (g) UtahOER - Utah Science Open Educational Resources textbooks downloaded from
       http://www.schools.utah.gov/CURR/science/OER.aspx. I extracted the text from the pdf
       files using the pdfminer tool. The entire corpus is in corpus/UtahOER/oer_text.txt 

   (h) Wiki - I downloaded the Wiki dump file "enwiki-20150901-pages-articles.xml" (50.6GB)
       from https://dumps.wikimedia.org/enwiki/20150901/. I created two corpus files, containing
       different sections from subsets of the pages, under corpus/wiki:

          wiki_1.0000_0.0200_0_5_True_True_False_corpus.txt (161MB)

          wiki_0.5000_0.1000_0_5_True_True_False_pn59342_corpus.txt (33MB)

   (i) Wikibooks - I downloaded the WikiBooks dump file "enwikibooks-20151102-pages-articles.xml" (560MB)
       from https://dumps.wikimedia.org/enwikibooks/20151102/, and created a corpus file with
       some sections from a subset of the pages:

          wikibooks_1.0000_0.0200_0_10_True_True_False_corpus.txt (126MB)
 
#### Features
Features used by this method are much more extensive than the other two solutions. There are 71 different pickle files in the funcs_train folder (folder containing features over the training data). The author broadly classifies the features in to 28 different types ranging from BASIC type which are features based on just the question and answers to features related to corpus statistics. I will concisely discuss the features used in order to understand how the system works, a more serious reader should consult the code for specifics. Features are considered over several combination of data resources, parsers and scoring parameters. 

The parser used to tokenize any text (either the question or answer) alternates with three different stemmers: PorterStemmer, LancasterStemmer and EnglishStemmer (Krovetz). The parsers also sometimes generate ngrams of size one, two or three.


The data resources are combined in an intriguing manner to generate different set of indices. The individual Lucene indices of the corpora are combined in seven different ways with a specific parser (stemmer), the first three of the seven are as shown below.

EnglishStemmer; corpora: Study Stack copus 1, Quizlet corpus
LancasterStemmer; corpora: Study Stack corpus 3, Quizlet corpus, CK12 corpus, Wiki corpus, SimpleWiki corpus
PorterStemmer; copora: Study Stack corpus 4, Quizlet corpus, UtahOER corpus, Saylor and OpenStax corpus, CK 12, AI² corpus
 
I fail to see any reason for the combinations or the parser, for the time being I have ignored this question. 

A parallel index is maintained to support queries that Lucene cannot. Several features (almost 20 types) are generated from lookup on non-Lucene index by varying scoring parameters and parser. The parser in the non-Lucene index case is sometimes customized to output tri-grams and bi-grams instead of uni-gram tokens, which affects the queries in turn. 


	The BASIC features, which depend only on the question and answer are (1) fraction of words in answer that overlap with the question (2) maximum word overlap of an option with the rest of options, for each option (3) number of times the words in the answer are mentioned in a question or an answer in the training set (4) a measure of number of times a token is mentioned in a correct answer to total mentions (5) A hypergeometric prob. measure for answer likely to be correct based on number of times a word appeared in correct answer to any answer (6) Z score of the prob. generated in the previous feature (7) fraction of times a word is mentioned in the correct answer to number of times it is mentioned in a question (8) length (character) of the answers (9) word length of the answers (10) is the answer one of B or C option (11) The label of the choice of answer.


	The features (10) and (11) are very specific to the task at hand. The author observed that the correct options over the training data are skewed towards the options B and C, hence these features capture the prior over the options.


The rest of the features (non-BASIC ones that depend on the corpus statistics) are indicative of how often the choice answer and question appear together in a single corpora document.
 

Also, the submitted model uses two features that capture the most similar choice to every choice in the available alternatives (st_qz_saylor_ck-t.a1_vs_a2, wk-pn_sw-pn_wb.a1_vs_a2.1). This feature captures the intuition that, often, the correct answer contains a similar choice which is as well likely to be correct so as to confuse the person answering. 
  
It came as no surprise to me that the best solution used a boosting model.
Conclusion

We have seen an interesting correlation between the amount of external data used and the model performance (standing). I think the importance of the features and the training model used follows in that order.
Frankly, I was disappointed by the winning solutions, they all have one thing in common.
All the solutions have nothing to do with Natural Language Processing (NLP) and like many systems that deal with symbols, they have no idea what the symbols actually mean.
Maybe it is meant to be this way, that the organizers want to set a benchmark for what can be achieved by answering based on simple lookup.