17 May, 2018 Discussion:

Pre-decided Objectives:

  • Come up with a way of evaluating models (in the form of a script)
  • Look for more datasets on which to evaluate the models

Datasets:

  • WikiQA [Ranking/Regression]
  • QuoraQP [Binary Classification]
  • The Stanford Natural Language Inference (SNLI) Corpus [Multi-Class Classification]
  • The Multi-Genre NLI Corpus [Multi-Class Classification]
  • Ubuntu Dialogue Corpus v2.0 [Binary Classification]
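
Since these datasets span different task types, the evaluation script will have to branch on the task when deciding which metrics to compute. The snippet below is only a sketch of that idea; the dictionary and function names are my own placeholders, not anything that exists in MatchZoo or gensim.

```python
# Sketch: task type per candidate dataset (taken from the list above), so the
# evaluation script can decide between ranking and classification metrics.
DATASET_TASKS = {
    "WikiQA": "ranking/regression",
    "QuoraQP": "binary classification",
    "SNLI": "multi-class classification",
    "MultiNLI": "multi-class classification",
    "Ubuntu Dialogue Corpus v2.0": "binary classification",
}

def is_ranking_task(dataset_name):
    """Ranking datasets get NDCG/MAP/MRR; classification datasets get accuracy-style metrics."""
    return DATASET_TASKS[dataset_name].startswith("ranking")
```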

Reference: the alternate GSoC proposals

Evaluation pipeline

For this part:

  • I went through a lot of material that I found online (links in references)
  • I tried to reproduce the results published by MatchZoo

I realized that the MatchZoo results aren't exactly reproducible. I ran the scripts they provide on the WikiQA dataset for several models and got the following numbers:

| Model   | Theirs (NDCG@3 / NDCG@5 / MAP) | Mine (NDCG@3 / NDCG@5 / MAP)   |
|---------|--------------------------------|--------------------------------|
| ANMM    | 0.6 / 0.6 / 0.6                | 0.366241 / 0.443056 / 0.411140 |
| CDSSM   | 0.6 / 0.5 / 0.6                | 0.379165 / 0.451163 / 0.416128 |
| MV LSTM | 0.6 / 0.5 / 0.6                | 0.341159 / 0.415373 / 0.391363 |

(More results to come after corrective measures are taken)

Action taken:

I have opened an issue and sent a mail to the mailing list regarding this, asking where I might be going wrong. Link to issue

After reading the Wikipedia material on document ranking and its evaluation, and the papers cited in the proposal for each of the models, I found that the following metrics were used consistently (a sketch of how they can be computed follows the list):

  • NDCG@3
  • NDCG@5
  • NDCG@10
  • Precision
  • Mean Average Precision (MAP)
  • Mean Reciprocal Rank (MRR)
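
To keep the evaluation script independent of MatchZoo's own scorers (and to cross-check numbers like the ones in the table above), these metrics can be computed directly from per-query relevance labels and predicted scores. The following is a minimal sketch in NumPy; the function names and the (labels, scores) calling convention are my own choices, not an existing API.

```python
import numpy as np

def ndcg_at_k(y_true, y_score, k):
    """NDCG@k for a single query: y_true are relevance labels, y_score are model scores."""
    order = np.argsort(y_score)[::-1]                      # rank candidates by predicted score
    gains = (2.0 ** np.asarray(y_true, dtype=float) - 1.0)[order][:k]
    dcg = np.sum(gains / np.log2(np.arange(2, gains.size + 2)))
    ideal = np.sort(2.0 ** np.asarray(y_true, dtype=float) - 1.0)[::-1][:k]
    idcg = np.sum(ideal / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(y_true, y_score):
    """AP for a single query with binary relevance labels."""
    rel = np.asarray(y_true)[np.argsort(y_score)[::-1]]
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / np.arange(1, rel.size + 1)  # precision at each rank
    return np.sum(precisions * rel) / rel.sum()

def reciprocal_rank(y_true, y_score):
    """RR for a single query: 1 / rank of the first relevant candidate."""
    rel = np.asarray(y_true)[np.argsort(y_score)[::-1]]
    hits = np.nonzero(rel)[0]
    return 1.0 / (hits[0] + 1) if hits.size else 0.0

# MAP and MRR are just the means over all queries, e.g.:
# map_score = np.mean([average_precision(t, s) for t, s in per_query_results])
# mrr_score = np.mean([reciprocal_rank(t, s) for t, s in per_query_results])
```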

Plan for the next few days:

  1. I need to make my results somewhat consistent with those of MatchZoo. I will either have to figure out the discrepancy myself or wait until I get a response.
  2. I will compile a comprehensive table for all the models with the following parameters:
  • Time taken for training
  • Memory consumed
  • With and without word embeddings
  • Different datasets
  • NDCG@1 (for QA purposes)
  • NDCG@3
  • NDCG@5
  • NDCG@10
  • Precision
  • Mean Average Precision (MAP)
  • Mean Reciprocal Rank (MRR)

Once this table is compiled, it will give a helicopter view of the whole scenario. I plan to continue using MatchZoo for this benchmarking; however, the reproducibility issue needs to be resolved first. A rough skeleton of the benchmarking loop is sketched below.
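
The skeleton reuses the metric functions from the earlier sketch. The build_model/fit/predict interface, the shape of eval_queries, and the use of the resource module for peak memory are all assumptions on my part, to be replaced by whatever the final script settles on.

```python
import time
import resource  # Unix-only; used here just to read the process's peak memory
import numpy as np

# ndcg_at_k, average_precision and reciprocal_rank are the functions
# from the metric sketch earlier in these notes.

def benchmark(model_name, build_model, train_data, eval_queries):
    """Train one model and collect a single row for the comparison table.

    eval_queries is assumed to be a list of (query_inputs, relevance_labels)
    pairs, and the model is assumed to expose fit()/predict().
    """
    model = build_model()

    start = time.time()
    model.fit(train_data)
    train_time = time.time() - start

    # Peak resident memory so far (kilobytes on Linux, bytes on macOS).
    peak_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    per_query = [(labels, model.predict(inputs)) for inputs, labels in eval_queries]

    row = {
        "model": model_name,
        "train_time_s": round(train_time, 2),
        "peak_mem": peak_mem,
        "map": float(np.mean([average_precision(t, s) for t, s in per_query])),
        "mrr": float(np.mean([reciprocal_rank(t, s) for t, s in per_query])),
    }
    for k in (1, 3, 5, 10):
        row["ndcg@%d" % k] = float(np.mean([ndcg_at_k(t, s, k) for t, s in per_query]))
    return row
```

Keeping the model behind a generic fit()/predict() pair means the same loop can be run with and without word embeddings and across the different datasets, simply by swapping the build_model and data arguments.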
