17 May, 2018 Discussion:

Pre-decided Objectives:

  • Come up with a way of evaluating models (in the form of a script)
  • Look for more datasets on which to evaluate the models

Datasets:

  • WikiQA [Ranking/Regression]
  • QuoraQP [Binary Classification]
  • The Stanford Natural Language Inference (SNLI) Corpus [Multi-Class Classification]
  • The Multi-Genre NLI Corpus [Multi-Class Classification]
  • Ubuntu Dialogue Corpus v2.0 [Binary Classification]
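
Since these datasets span different task types, the evaluation script will have to branch on the task when deciding which metrics to compute. The snippet below is only a sketch of that idea; the dictionary and function names are my own placeholders, not anything that exists in MatchZoo or gensim.

```python
# Sketch: task type per candidate dataset (taken from the list above), so the
# evaluation script can decide between ranking and classification metrics.
DATASET_TASKS = {
    "WikiQA": "ranking/regression",
    "QuoraQP": "binary classification",
    "SNLI": "multi-class classification",
    "MultiNLI": "multi-class classification",
    "Ubuntu Dialogue Corpus v2.0": "binary classification",
}

def is_ranking_task(dataset_name):
    """Ranking datasets get NDCG/MAP/MRR; classification datasets get accuracy-style metrics."""
    return DATASET_TASKS[dataset_name].startswith("ranking")
```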

Reference: the alternate GSoC proposals

Evaluation pipeline

For this part:

  • I went through a lot of material that I found online (links in references)
  • I tried to reproduce the results published by MatchZoo

I realized that the MatchZoo results aren't exactly reproducible. I ran the scripts they provide on the WikiQA dataset for several models and got the following numbers:

| Model   | Theirs (NDCG@3 / NDCG@5 / MAP) | Mine (NDCG@3 / NDCG@5 / MAP)   |
|---------|--------------------------------|--------------------------------|
| ANMM    | 0.6 / 0.6 / 0.6                | 0.366241 / 0.443056 / 0.411140 |
| CDSSM   | 0.6 / 0.5 / 0.6                | 0.379165 / 0.451163 / 0.416128 |
| MV LSTM | 0.6 / 0.5 / 0.6                | 0.341159 / 0.415373 / 0.391363 |

(More results to come after corrective measures are taken)

Action taken:

I have opened an issue and sent a mail to the mailing list regarding this, asking where I might be going wrong. Link to issue

After reading the Wikipedia material on document ranking and its evaluation, and the papers cited in the proposal for each of the models, I found that the following metrics were used consistently (a sketch of how they can be computed follows the list):

  • NDCG@3
  • NDCG@5
  • NDCG@10
  • Precision
  • Mean Average Precision (MAP)
  • Mean Reciprocal Rank (MRR)
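
To keep the evaluation script independent of MatchZoo's own scorers (and to cross-check numbers like the ones in the table above), these metrics can be computed directly from per-query relevance labels and predicted scores. The following is a minimal sketch in NumPy; the function names and the (labels, scores) calling convention are my own choices, not an existing API.

```python
import numpy as np

def ndcg_at_k(y_true, y_score, k):
    """NDCG@k for a single query: y_true are relevance labels, y_score are model scores."""
    order = np.argsort(y_score)[::-1]                      # rank candidates by predicted score
    gains = (2.0 ** np.asarray(y_true, dtype=float) - 1.0)[order][:k]
    dcg = np.sum(gains / np.log2(np.arange(2, gains.size + 2)))
    ideal = np.sort(2.0 ** np.asarray(y_true, dtype=float) - 1.0)[::-1][:k]
    idcg = np.sum(ideal / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(y_true, y_score):
    """AP for a single query with binary relevance labels."""
    rel = np.asarray(y_true)[np.argsort(y_score)[::-1]]
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / np.arange(1, rel.size + 1)  # precision at each rank
    return np.sum(precisions * rel) / rel.sum()

def reciprocal_rank(y_true, y_score):
    """RR for a single query: 1 / rank of the first relevant candidate."""
    rel = np.asarray(y_true)[np.argsort(y_score)[::-1]]
    hits = np.nonzero(rel)[0]
    return 1.0 / (hits[0] + 1) if hits.size else 0.0

# MAP and MRR are just the means over all queries, e.g.:
# map_score = np.mean([average_precision(t, s) for t, s in per_query_results])
# mrr_score = np.mean([reciprocal_rank(t, s) for t, s in per_query_results])
```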

Plan for the next few days:

  1. I need to make my results somewhat consistent with those of MatchZoo. I will either have to figure out the discrepancy myself or wait until I get a response.
  2. I will compile a comprehensive table for all the models with the following parameters:
  • Time taken for training
  • Memory consumed
  • With and without word embeddings
  • Different datasets
  • NDCG@1 (for QA purposes)
  • NDCG@3
  • NDCG@5
  • NDCG@10
  • Precision
  • Mean Average Precision (MAP)
  • Mean Reciprocal Rank (MRR)

Once this table is compiled, it will give a helicopter view of the whole scenario. I plan to continue using MatchZoo for this benchmarking; however, the reproducibility issue needs to be resolved first. A rough skeleton of the benchmarking loop is sketched below.
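
The skeleton reuses the metric functions from the earlier sketch. The build_model/fit/predict interface, the shape of eval_queries, and the use of the resource module for peak memory are all assumptions on my part, to be replaced by whatever the final script settles on.

```python
import time
import resource  # Unix-only; used here just to read the process's peak memory
import numpy as np

# ndcg_at_k, average_precision and reciprocal_rank are the functions
# from the metric sketch earlier in these notes.

def benchmark(model_name, build_model, train_data, eval_queries):
    """Train one model and collect a single row for the comparison table.

    eval_queries is assumed to be a list of (query_inputs, relevance_labels)
    pairs, and the model is assumed to expose fit()/predict().
    """
    model = build_model()

    start = time.time()
    model.fit(train_data)
    train_time = time.time() - start

    # Peak resident memory so far (kilobytes on Linux, bytes on macOS).
    peak_mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    per_query = [(labels, model.predict(inputs)) for inputs, labels in eval_queries]

    row = {
        "model": model_name,
        "train_time_s": round(train_time, 2),
        "peak_mem": peak_mem,
        "map": float(np.mean([average_precision(t, s) for t, s in per_query])),
        "mrr": float(np.mean([reciprocal_rank(t, s) for t, s in per_query])),
    }
    for k in (1, 3, 5, 10):
        row["ndcg@%d" % k] = float(np.mean([ndcg_at_k(t, s, k) for t, s in per_query]))
    return row
```

Keeping the model behind a generic fit()/predict() pair means the same loop can be run with and without word embeddings and across the different datasets, simply by swapping the build_model and data arguments.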
