Final Report

Similarity Learning using Neural Networks

Index

  1. Problem Statement
  2. Similarity Learning Tasks
  3. Evaluation Metrics
  4. Establishing Baselines
  5. About Datasets
  6. The journey
  7. Notes on Finetuning Models
  8. Thoughts on Deep Learning Models
  9. Conclusion
  10. Future Work

1. Problem Statement

The task of Similarity Learning is that of learning similarities or dissimilarities between sentences/paragraphs. The core idea is to find some sort of vector representation for a sentence/document/paragraph. This vector representation can then be used to compare a given sentence with other sentences. For example,
let's say we want to compare the sentences: "Hello there", "Hi there friend" and "Bye bye".
We have to develop a function, similarity_fn, such that

similarity_fn("Hello there", "Hi there friend") > similarity_fn("Hello there", "Bye bye")
similarity_fn("Hello there", "Hi there friend") = 0.9
similarity_fn("Hello there", "Bye bye") = 0.1

As such, Similarity Learning does not require Neural Networks (tf-idf and other such sentence representations can work as well). However, with the recent domination of Deep Learning in NLP, newer models have been performing the best on Similarity Learning tasks.

The idea of finding similarities can be used in the following ways [1]:
1. Regression Similarity Learning: given document1 and document2, we need a similarity measure as a float value.

2. Classification Similarity Learning: given document1 and document2, we need to classify them as similar or dissimilar.

3. Ranking Similarity Learning: given a document, a similar-document and a dissimilar-document, we want to learn a similarity_function such that similarity_function(document, similar-document) > similarity_function(document, dissimilar-document).

4. Locality Sensitive Hashing: hashes input items so that similar items map to the same "buckets" in memory with high probability.


Usually, the similarity is measured in one of two ways:

  1. A vector representation is calculated for each of the two documents and the cosine similarity between them is their similarity score
  2. The two documents are made to interact in several different ways (neural attention, matrix multiplication, LSTMs) and a softmax probability of them being similar or dissimilar is calculated. Example: [0.2, 0.8] -> [probability of being dissimilar, probability of being similar]

2. Similarity Learning Tasks

1. Question Answering/Information Retrieval
Given a question and a number of candidate answers, we want to rank the answers based on their likelihood of being the answer to the question. Example:

Q: When was Gandhi born?
A1: Gandhi was a freedom fighter
A2: He was born 2 October, 1869
A3: He was a freedom fighter

The answer selected by the model is
argmax_on_A(similarity_fn(Q, A1), similarity_fn(Q, A2), similarity_fn(Q, A3)) = A2

Conversation/Response Selection is similar to Question Answering: for a given question Q, we want to select the most appropriate response from a pool of responses (A1, A2, A3, ...).
Example datasets: WikiQA

2. Textual Entailment
Given a text and a hypothesis, the model must judge whether the text entails the hypothesis, contradicts it, or is neutral towards it.
Example from SNLI

- "A man selling donuts to a customer during a world exhibition event held in the city of Angeles" -- "A man selling donuts to a customer." --> Entailment
- "Two young boys of opposing teams play football, while wearing full protection uniforms and helmets." -- "boys scoring a touchdown" --> Neutral
- "Under a blue sky with white clouds, a child reaches up to touch the propeller of a plane standing parked on a field of grass." -- "A child is playing with a ball." --> Contradiction

3. Duplicate Document Detection/Paraphrase Detection
Given 2 docs/sentences/questions, we want to predict the probability of the two being duplicates.
Example:

- "What is the full form of AWS" -- "What does AWS stand for?" --> Duplicate
- "What is one plus one?" -- "Who is president?" --> Not Duplicate

Example Dataset: Quora Duplicate Question Dataset

4. Span Selection:
Given a context passage and a question, we have to find the span of words in the passage which answers the question.
Example:

passage: "Bob went to the doctor. He wasn't feeling too good"
question: "Who went to the doctor"
span: "*Bob* went to the doctor. He wasn't feeling too good"

Example Dataset: SQUAD

3. Evaluation Metrics

While some similarity tasks have simple metrics (Accuracy, Precision), it can get a bit more complicated for others.

Duplicate Question/Paraphrase Detection and Textual Entailment use Accuracy,
since it is easy to say that an answer is either correct or incorrect. Accuracy is simply the ratio of correct answers to the total number of questions.

Span Selection uses Exact Match and F1 score:
how many times did the model predict the exact correct span, and how much do the predicted and true spans overlap?

(A full list of datasets and metrics can be found here)

It gets a bit tricky when you come to evaluate Question Answering systems. The reason for this is that the result given by these systems is usually an ordering of the candidate documents.
For a query Q1 and a set of candidate answers (very-relevant-answer, slightly-relevant-answer, irrelevant-answer),
the evaluation metric should score the orderings such that

(1: very-relevant-answer, 2: slightly-relevant-answer, 3: irrelevant-answer) > (1: slightly-relevant-answer, 2: very-relevant-answer, 3: irrelevant-answer) > (1: irrelevant-answer, 2: slightly-relevant-answer, 3: very-relevant-answer)

Basically, the metric should give a higher score to an ordering where the more relevant documents come first and less relevant documents come later.

The metrics commonly used for this are Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (nDCG).

Mean Average Precision (MAP)

It is a measure calculated over a full dataset. A dataset like WikiQA will have a setup like:

q_1, (d1, d2, d3)
q_2, (d4, d5)
q_3, (d6, d7, d8, d9, d10)
.
.
.

We find the average precision for every q_i and then take the mean over all the queries:

Mean (over all the queries) of Average Precision (over all the docs for a query)

Some pseudocode to describe the idea:

num_queries = len(queries)
precision = 0
for q, candidate_docs in zip(queries, docs):
    precision += average_precision(q, candidate_docs)
mean_average_precision = precision / num_queries

I will solve an example here:

Model predictions: (q1-d1, 0.2), (q1-d2, 0.7) , (q1-d3, 0.3)
Ground Truth     : (q1-d1, 1),   (q1-d2, 1) ,     (q1-d3, 0)

We will sort our predictions based on our similarity score (since the ranking is induced by that score):

(d2, 0.7, 1), (d3, 0.3, 0), (d1, 0.2, 1)


Average Precision = (Sum (Precision@k * relevance@k) for k=1 to n) / Number of Relevant documents



where

n : the number of candidate docs
relevance@k : 1 if the kth doc is relevant, 0 if not
Precision@k : precision at the kth cutoff, where precision is (number of retrieved relevant docs / total number of retrieved docs)

So, Average Precision = (1/1 + 0 + 2/3)/2 = 0.83

Here's how:

At this point, we can discard our score (since we've already used it for ordering)

(d2, 1) (d3, 0) (d1, 1)

Number of Relevant Documents is (1 + 0 + 1) = 2

Precision at 1: We only look at (d2, 1)

p@1 = (number of retrieved relevant@1/total number of retrieved docs@1) * relevance@1
    = (1 / 1) * 1
    = 1

Precision at 2: We only look at (d2, 1) (d3, 0)

p@2 = (number of retrieved relevant@2/total number of retrieved docs@2) * relevance@2
    = (1 / 2) * 0
    = 0

Precision at 3: We only look at (d2, 1) (d3, 0) (d1, 1)

p@3 = (number of retrieved relevant@3/total number of retrieved docs@3) * relevance@3
    = (2 / 3) * 1
    = 2/3

Therefore, Average Precision for q1

    = (Precision@1 + Precision@2 + Precision@3) / Number of Relevant Documents
    = (1 + 0 + 2/3) / 2
    = 0.83

To get Mean Average Precision, we calculate the Average Precision for every query in the dataset and take the mean.
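
To tie the arithmetic together, here is a small, self-contained sketch (my own illustration, not the trec_eval implementation) that computes Average Precision exactly as in the worked example above:

def average_precision(relevances, scores):
    """Average Precision for one query.

    relevances : list of 0/1 ground-truth labels, one per candidate doc
    scores     : list of model scores, one per candidate doc
    """
    # Order the candidates by the model's score, highest first
    ranked = sorted(zip(relevances, scores), key=lambda x: x[1], reverse=True)
    num_relevant = sum(rel for rel, _ in ranked)
    if num_relevant == 0:
        return 0.0

    ap, relevant_seen = 0.0, 0
    for k, (rel, _) in enumerate(ranked, start=1):
        if rel == 1:
            relevant_seen += 1
            # Precision@k * relevance@k (relevance@k is 1 on this branch)
            ap += relevant_seen / k
    return ap / num_relevant


# The worked example: q1 with docs d1, d2, d3
print(average_precision([1, 1, 0], [0.2, 0.7, 0.3]))  # 0.8333...

Averaging this value over all queries gives MAP.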

Normalized Discounted Cumulative Gain

This Wikipedia Article is a great source for understanding NDCG. First, let's understand Discounted Cumulative Gain.

The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically in proportion to the position of the result.

DCG@P = Sum for i=1 to P (relevance_score(i) / log2(i + 1))

Let's take the previous example, keeping only the ground-truth relevance of each document in its predicted position:
(d2, rel=1), (d3, rel=0), (d1, rel=1)

DCG@3 = 1/log2(1 + 1) + 0/log2(2 + 1) + 1/log2(3 + 1)
      = 1 + 0 + 0.5
      = 1.5

As can be seen, a relevant document contributes less to the score the further down the ranking it appears.

To get the Normalized Discounted Cumulative Gain (nDCG), we divide DCG@P by the Ideal DCG@P (IDCG@P), the DCG of the best possible ordering. Here the ideal ordering is (d2, d1, d3), so IDCG@3 = 1/log2(2) + 1/log2(3) + 0/log2(4) ≈ 1.63 and nDCG@3 = 1.5/1.63 ≈ 0.92.
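
Here is a minimal sketch of the same calculation (my own helper functions, not a library API), using the binary relevances of the example:

import math

def dcg_at_k(relevances, k):
    """DCG@k; relevances must already be in predicted (ranked) order."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    """nDCG@k = DCG@k of the predicted order / DCG@k of the ideal order."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# (d2, rel=1), (d3, rel=0), (d1, rel=1) in predicted order
print(dcg_at_k([1, 0, 1], 3))   # 1.5
print(ndcg_at_k([1, 0, 1], 3))  # ~0.92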

Should I use MAP or NDCG?

Use MAP when a candidate document is either relevant or not relevant (1 or 0). Use nDCG when you have a range of values for relevance, for example when documents have graded relevance (d1: 1, d2: 2, d3: 3).

For a more detailed idea, refer to this issue on the MatchZoo repo.

As per @bwanglzu

@aneesh-joshi I'd say it can be evaluated with nDCG, but not meaningful or representative. For binary retrieval problem (say the labels are 0 or 1 indicates relevant or not), we use P, AP, MAP etc. For non-binary retrieval problem (say the labels are real numbers greater equal than 0 indicates the "relatedness"]), we use nDCG. In this situation, clearly, mAP@k is a better metric than nDCG@k. In some cases for non-binary retrieval problem, we use both nDCG & mAP such as microsoft learning to rank challenge link.

About implementing MAP

While the idea of the metric seems simple, there are varying implementations of it, which can be found below in the section "MatchZoo Baselines". Ideally, the most trusted metric implementation is the one provided by TREC, and I would imagine that's what people use when publishing results.

About TREC and trec_eval

The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. They provide a standard tool for evaluating QA systems called trec_eval. It's a C binary whose code can be found here.

Please refer to this blog to get a complete understanding of trec

Here's the short version:
As the above tutorial will tell you, you have to download the TREC evaluation code from https://trec.nist.gov/trec_eval/, run make in the folder, and it will produce the trec_eval binary.

The binary requires 2 inputs:

  1. qrels: the query relevance file, which holds the correct answers.
  2. pred: the prediction file, which holds the predictions made by your model.

It can be run like:
$ ./trec_eval qrels pred

The above will provide some standard metrics. You can also specify metrics on your own like:
$ ./trec_eval -m map -m ndcg_cut.1,3,5,10,20 qrels pred

The above will provide the MAP value and nDCG at cutoffs of 1, 3, 5, 10 and 20.

The qrels format is:

Format
------
<query_id>\t<0>\t<document_id>\t<relevance>

Note: the <0> column is ignored by trec_eval

Example
-------
Q1  0   D1-0    0
Q1  0   D1-1    0
Q1  0   D1-2    0
Q1  0   D1-3    1
Q1  0   D1-4    0
Q16 0   D16-0   1
Q16 0   D16-1   0
Q16 0   D16-2   0
Q16 0   D16-3   0
Q16 0   D16-4   0

The model's prediction, pred should be like:

Format
------
<query_id>\t<Q0>\t<document_id>\t<rank>\t<model_score>\t<STANDARD>

Note: the <Q0>, <rank> and <STANDARD> columns are ignored by trec_eval and can be set to anything.
I have chosen 99 as the rank; it has no meaning.

Example
-------
Q1  Q0  D1-0    99  0.64426434  STANDARD
Q1  Q0  D1-1    99  0.26972288  STANDARD
Q1  Q0  D1-2    99  0.6259719   STANDARD
Q1  Q0  D1-3    99  0.8891963   STANDARD
Q1  Q0  D1-4    99  1.7347554   STANDARD
Q16 Q0  D16-0   99  1.1078827   STANDARD
Q16 Q0  D16-1   99  0.22940424  STANDARD
Q16 Q0  D16-2   99  1.7198141   STANDARD
Q16 Q0  D16-3   99  1.7576259   STANDARD
Q16 Q0  D16-4   99  1.548423    STANDARD
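
For reference, here is a rough sketch of how the two files could be written from Python and fed to the binary; the query/document ids, scores and file names below are just placeholders:

import subprocess

queries = ["Q1", "Q16"]                      # hypothetical query ids
doc_ids = [["D1-0", "D1-1"], ["D16-0"]]      # candidate doc ids per query
y_true = [[0, 1], [1]]                       # ground-truth relevance per doc
y_score = [[0.64, 0.27], [1.11]]             # model scores per doc

with open("qrels", "w") as qrels, open("pred", "w") as pred:
    for q, docs, rels, scores in zip(queries, doc_ids, y_true, y_score):
        for d, rel, score in zip(docs, rels, scores):
            # qrels line: <query_id> 0 <document_id> <relevance>
            qrels.write("%s\t0\t%s\t%d\n" % (q, d, rel))
            # pred line: <query_id> Q0 <document_id> <rank> <score> STANDARD
            # (the rank is ignored, so 99 is fine, as noted above)
            pred.write("%s\tQ0\t%s\t99\t%f\tSTANDARD\n" % (q, d, score))

subprocess.run(["./trec_eval", "-m", "map", "-m", "ndcg_cut.1,3,5,10,20", "qrels", "pred"])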

4. Establishing Baselines

In our journey of developing a similarity learning model, we first decided to use the WikiQA dataset since it is a well studied and common baseline used in many papers like SeqMatchSeq, BiMPM and QA-Transfer.

Word2Vec Baseline

To understand the level of improvement we could make, we set up a baseline using word2vec. We use the average of the word vectors in a sentence as the vector for that sentence/document.

"Hello World" -> (vec("Hello") + vec("World"))/2

When 2 documents are to be compared for similarity/relevance, we take the Cosine Similarity between their vectors as their similarity. (300 dimensional vectors were seen to perform the best, so we chose them.)

The MAP scores of the 300-dim word2vec baseline on WikiQA:
  • full set (100%): 0.59
  • train split (80%): 0.57
  • test split (20%): 0.62
  • dev split (10%): 0.62
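
A minimal sketch of this baseline, assuming a gensim KeyedVectors model (here loaded through gensim's downloader as "word2vec-google-news-300"; any pretrained vectors would do) and simply skipping out-of-vocabulary words:

import numpy as np
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")  # any KeyedVectors object works

def sentence_vector(sentence):
    """Average the word vectors of the in-vocabulary words of a sentence."""
    vectors = [wv[word] for word in sentence.lower().split() if word in wv]
    if not vectors:
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)

def similarity_fn(sent1, sent2):
    """Cosine similarity between the averaged sentence vectors."""
    v1, v2 = sentence_vector(sent1), sentence_vector(sent2)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom > 0 else 0.0

print(similarity_fn("when was gandhi born", "he was born 2 october 1869"))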

MatchZoo Baselines

I had found a repo called MatchZoo which had almost 10 different Similarity Learning models implemented and benchmarked. According to their benchmarks, their best model scored around 0.65 MAP on WikiQA.

I tried to reproduce their results using their scripts but found that the scores didn't match what they advertised. I raised an issue. (There was a similar issue before mine as well.) They went about fixing it. I was later able to reproduce their results (by reverting to an earlier point in the commit history), but there was now a lack of trust in their numbers. I set out to write my own script in order to make my evaluation independent of theirs. The problem was that they had their own way of representing words and other such internal functions. Luckily, they were dumping their results in the TREC format. I was able to scrape that and evaluate it using this script.

I developed a table with my baselines as below:

[Image: table of my MatchZoo baseline scores]

Note: the values are different from the established baseline because there is some discrepancy in how MAP should be calculated. I initially used my own implementation of it for the table; later, I moved to the one provided by trec_eval. This is how I implement it:

import numpy as np


def mapk(Y_true, Y_pred):
    """Calculates Mean Average Precision(MAP) for a given set of Y_true, Y_pred

    Note: Currently doesn't support mapping at k. Couldn't use only map as it's a
    reserved word

    Parameters
    ----------
    Y_true : numpy array or list of ints either 1 or 0
        Contains the true, ground truth values of the relevance between a query and document
    Y_pred : numpy array or list of floats
        Contains the predicted similarity score between a query and document

    Examples
    --------
    >>> Y_true = [[0, 1, 0, 1], [0, 0, 0, 0, 1, 0], [0, 1, 0]]
    >>> Y_pred = [[0.1, 0.2, -0.01, 0.4], [0.12, -0.43, 0.2, 0.1, 0.99, 0.7], [0.5, 0.63, 0.92]]
    >>> print(mapk(Y_true, Y_pred))
    0.75
    """

    aps = []
    n_skipped = 0
    for y_true, y_pred in zip(Y_true, Y_pred):
        # skip datapoints where there is no solution
        if np.sum(y_true) < 1:
            n_skipped += 1
            continue

        pred_sorted = sorted(zip(y_true, y_pred), key=lambda x: x[1], reverse=True)
        avg = 0
        n_relevant = 0

        for i, val in enumerate(pred_sorted):
            if val[0] == 1:
                avg += 1. / (i + 1.)
                n_relevant += 1

        if n_relevant != 0:
            ap = avg / n_relevant
            aps.append(ap)
    return np.mean(np.array(aps))

This is how it's implemented in MatchZoo:

def map(y_true, y_pred, rel_threshold=0):
    s = 0.
    y_true = _to_list(np.squeeze(y_true).tolist())
    y_pred = _to_list(np.squeeze(y_pred).tolist())
    c = list(zip(y_true, y_pred))
    random.shuffle(c)
    c = sorted(c, key=lambda x:x[1], reverse=True)
    ipos = 0
    for j, (g, p) in enumerate(c):
        if g > rel_threshold:
            ipos += 1.
            s += ipos / ( j + 1.)
    if ipos == 0:
        s = 0.
    else:
        s /= ipos
    return s

The above has a random shuffle because Python's sort is stable: without it, documents with tied scores would keep their original (ground-truth-correlated) order and produce falsely high numbers. Refer to this issue.

After studying these values, we decided to implement the DRMM_TKS model since it had the best score with the fewest tradeoffs (like time to train and memory).

Retrospection Point: It should've hit us at this point that the best MAP score of 0.65, compared to the word2vec baseline of 0.58, isn't much of an improvement (only 0.07). In the later weeks it became clearer that we wanted a model which could score at least 0.75 (i.e., 0.15 more than the word2vec baseline). On changing the way we evaluated (shifting from my script to trec_eval), the w2v baseline went up to 0.62, which meant we needed a MAP of at least 0.78 to be practically useful. Moreover, the very paper which introduced WikiQA claimed a MAP of 0.65. We should've realized at this point that investigating these models was probably pointless (since the best they had so far was merely 0.65). Maybe we hoped that finetuning these models would lead to a higher score.

Non standard things about Similarity Learning Models

There are some things which are a bit non-standard when it comes to Similarity Learning. These are things one wouldn't know from working on standard Deep Learning problems.
The first one is evaluating with MAP, nDCG, etc., which I have already explained above. The second is the

loss function.

Ideally, we would like to use a loss function which directly improves our metric (MAP). Unfortunately, MAP isn't a differentiable function; it's a bit like saying "I want to increase Accuracy, let's make Accuracy the loss function". In the Similarity Learning context, we use the Rank Hinge Loss. This loss function is especially good for ranking problems. More details can be found here.

loss = maximum(0., margin + y_neg - y_pos) where margin is set to 1 or 0.5

Let's look at the Keras definitions to get a better idea of the intuition:

def hinge(y_true, y_pred):
    return K.mean(K.maximum(1. - y_true * y_pred, 0.), axis=-1)

In this case, y_true can be {1, -1}. For example, y_true = [1, -1, 1], y_pred = [0.6, 0.1, -0.9]

K.mean(K.maximum(1 - [1, -1, 1] * [0.6, 0.1, -0.9], 0))
= K.mean(K.maximum(1 - [0.6, -0.1, -0.9], 0))
= K.mean(K.maximum([0.4, 1.1, 1.9], 0))
= K.mean([0.4, 1.1, 1.9])
= 1.13

Effectively, the predictions that are more wrong contribute more to the loss.

def categorical_hinge(y_true, y_pred):
    pos = K.sum(y_true * y_pred, axis=-1)
    neg = K.max((1.0 - y_true) * y_pred, axis=-1)
    return K.mean(K.maximum(0.0, neg - pos + 1), axis=-1)

In this case, there is a categorization, so the true values are one hots
For example, y_true = [[1, 0, 0], [0, 0, 1]], y_pred = [[0.9, 0.1, 0.1], [0.7, 0.2, 0.1]]

pos = K.sum([[1, 0, 0], [0, 0, 1]] * [[0.9, 0.1, 0.1], [0.7, 0.2, 0.1]], axis=-1)
    = K.sum([[0.9, 0, 0], [0, 0, 0.1]], axis=-1)  # check the predictions for the correct answer
    = [0.9, 0.1]
neg = K.max((1.0 - [[1, 0, 0], [0, 0, 1]]) * [[0.9, 0.1, 0.1], [0.7, 0.2, 0.1]], axis=-1)
    = K.max([[0, 1, 1], [1, 1, 0]] * [[0.9, 0.1, 0.1], [0.7, 0.2, 0.1]], axis=-1)
    = K.max([[0, 0.1, 0.1], [0.7, 0.2, 0]], axis=-1)
    = [0.1, 0.7]
loss = K.mean(K.maximum(0.0, neg - pos + 1), axis=-1)
     = K.mean(K.maximum(0.0, [0.1, 0.7] - [0.9, 0.1] + 1), axis=-1)
     = K.mean(K.maximum(0.0, [0.2, 1.6]), axis=-1)
     = K.mean([0.2, 1.6])
     = 0.9

Again, effectively, the more wrong value (0.7) contributes more to the loss, whereas the correct value (0.9) reduces it greatly.

(K is the Keras backend, effectively K = tf; K.maximum is the elementwise maximum; K.max takes the maximum value along an axis of a tensor)
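
For completeness, here is a minimal sketch of the rank hinge loss itself in the pairwise form given above. It assumes the batch generator interleaves (relevant, irrelevant) pairs, which is a common convention in ranking models but not necessarily how every implementation lays out its batches; the function is my own, not the MatchZoo one:

from keras import backend as K

def rank_hinge_loss(y_true, y_pred, margin=1.0):
    # y_true is unused: the labels are encoded by the batch layout, where
    # even rows hold the score of the relevant doc and odd rows the irrelevant one.
    y_pos = y_pred[0::2]   # scores for (q, d_relevant)
    y_neg = y_pred[1::2]   # scores for (q, d_irrelevant)
    # Penalize whenever the irrelevant doc is not beaten by at least `margin`
    return K.mean(K.maximum(0., margin + y_neg - y_pos))

It can then be passed to model.compile(loss=rank_hinge_loss, ...) like any other Keras loss.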

The third "non-standard" idea is that of

Pairwise, Listwise and Pointwise.

Here, here and here are some good resources.

Here's my take on it:
These 3 represent different ways of training a ranking model by virtue of the loss function.

Pointwise

Here we take a single query-document pair and try to give it a score. For example, we take the sentence ("I like chicken wings") and try to give it a numerical score (0.34) for the question ("What do I like to eat?").

Pairwise

We take two docs and try to rank them relative to each other: "x1 is more relevant than x2".

Given a pair of documents, the model tries to come up with the optimal ordering for that pair and compares it to the ground truth. The goal of the ranker is to minimize the number of inversions in the ranking, i.e. cases where a pair of results is in the wrong order relative to the ground truth.
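
To make the pairwise idea concrete, here is a small sketch (my own illustration) of how one query with labelled candidates can be turned into (query, relevant-doc, irrelevant-doc) training triples:

def make_pairwise_examples(query, candidate_docs, labels):
    """Yield (query, d_pos, d_neg) triples from one query's labelled candidates."""
    positives = [d for d, y in zip(candidate_docs, labels) if y == 1]
    negatives = [d for d, y in zip(candidate_docs, labels) if y == 0]
    for d_pos in positives:
        for d_neg in negatives:
            yield query, d_pos, d_neg

# q1 with candidates d1, d2, d3 and relevance labels 1, 1, 0
triples = list(make_pairwise_examples("q1", ["d1", "d2", "d3"], [1, 1, 0]))
# [("q1", "d1", "d3"), ("q1", "d2", "d3")]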

ListWise

Listwise approaches directly look at the entire list of documents and try to come up with the optimal ordering for it

5. About Datasets

Here, I will give examples of different datasets.

| Dataset Name | Link | Suggested Metrics | Some Papers that use the dataset | Brief Description |
| --- | --- | --- | --- | --- |
| WikiQA | | MAP, MRR | SeqMatchSeq (MULT MAP=0.74, MRR=0.75); BiMPM (MAP=0.71, MRR=0.73); QA-Transfer (SQUAD* MAP=0.83, MRR=84.58, P@1=75.31) | Question-Candidate_Answer1_to_N-Relevance1_to_N |
| SQUAD 2.0 | Website | Exact Match, F1 | QA-Transfer (for pretraining) | Question-Context-Answer_Range_in_context |
| Quora Duplicate Question Pairs | | Accuracy, F1 | BiMPM (Acc=88.17%) | Q1-Q2-DuplicateProbability |
| SemEval 2016 Task 3A | gensim-data (semeval-2016-2017-task3-subtaskA-unannotated) | MAP, AvgRecall, MRR, P, R, F1, Acc | QA-Transfer (SQUAD* MAP=80.2, MRR=86.4, P@1=89.1) | Question-Comment-SimilarityProbability |
| MovieQA | | Accuracy | SeqMatchSeq (SUBMULT+NN test=72.9%, dev=72.1%) | Plot-Question-Candidate_Answers |
| InsuranceQA | Website | Accuracy | SeqMatchSeq (SUBMULT+NN test1=75.6%, test2=72.3%, dev=77%) | Question-Ground_Truth_Answer-Candidate_answer |
| SNLI | | Accuracy | QA-Transfer (for pretraining); SeqMatchSeq (SUBMULT+NN train=89.4%, test=86.8%); BiMPM (Ensemble Acc=88.8%) | Text-Hypothesis-Judgement |
| TRECQA | | MAP, MRR | BiMPM (MAP=0.802, MRR=0.875) | Question-Candidate_Answer1_to_N-Relevance1_to_N |
| SICK | Website | Accuracy | QA-Transfer (Acc=88.2) | sent1-sent2-entailment_label-relatedness_score |

More dataset info can be found at the SentEval repo.

| Model/Paper | Link |
| --- | --- |
| SeqMatchSeq | https://arxiv.org/pdf/1611.01747.pdf |
| QA-Transfer | http://aclweb.org/anthology/P17-2081 |
| BiMPM | https://arxiv.org/pdf/1702.03814.pdf |

Some useful examples

[Image: an example entry from SQUAD]

[Image: example entries from SQUAD, WikiQA, SemEval and SICK]

[Image: example entries from MovieQA and InsuranceQA]

6. The Journey

After going through the initial evaluation, we decided to implement the DRMM_TKS model, since it gave the best score (0.65 MAP on WikiQA) with acceptable memory usage and time for training. The first model was considerably important since it set the standard for all other models to follow. I had to learn and apply a lot of Software Engineering principles like doctests, docstrings, etc. This model also helped me set up the skeleton structure needed for any Similarity Learning model. The code can be found here.

This is generally what needs to be done:

1. Iterate through the train set's sentences and build your model's word2index dictionary (one index for each word). This is useful, especially when it comes to predicting on new data and we need to translate new sentences.
2. Translate all the words into indices using the above dict.
3. Load a pretrained Embedding Matrix (like Glove) and reorder it so that its indices match your word2index dictionary.
4. There will be some words in the embedding matrix which aren't in the train set. Although these extra embeddings aren't useful for training, they will be useful while testing and validating. So, append the word2index dict with these new words and reorder the embedding matrix accordingly.
5. Make 2 additional entries in the embedding matrix for the pad word and the unknown word. (A rough sketch of steps 1-5 follows this list.)
  • Since all sentences won't be of the same length, we will inevitably have to pad the sentences to the same length. You should decide at this point whether the pad vector is a zero vector or a random vector.
  • While testing and validating, you will inevitably encounter an unknown word (a word not in your word2index dictionary). All these unknown words should be mapped to the same vector. Again, you can choose between a zero vector and a random vector.
6. Once all the data is in the right format, we need to create a generator which will stream out batches of training examples (with padding added, of course). In the Question Answering context, there is a special consideration here. Usually, a training set will be of the form [(q1, (d1, d2, d3)), (q2, (d4, d5))] with relevance labels like [(1, 0, 0), (0, 1)]. While setting up training examples, it's a good idea to make training pairs like [(q, d_relevant), (q, d_not_relevant), (q, d_relevant), ...]. This "should" help the model learn the correct classification by contrasting relevant and non relevant answers.
7. Now that your batch generator is spitting out batches to train on, create a model, fix the loss, optimizer and other such parameters, and train.
8. The metrics usually reported by a library like Keras won't be very useful to evaluate your model as it trains. So, at the end of each epoch, you should set up your own callback as I have done in this line.
9. Once the model training is done, evaluate it on the test set. You can obviously use your own implementation, but as a standard, it would be better to write your results in the TREC format as I have done here and then use the trec_eval binary described before for evaluating. This standardizes evaluation and makes it easier for others to trust the result.
10. Saving models can get a bit tricky, especially if you're using Keras. Keras prefers its own way of saving over the standard gensim utils.SaveLoad. You will have to save the keras model separately from the non-keras parts and combine them while loading. Moreover, if you have any Lambda layer, you will have to replace it with a custom layer like I did here for the TopK layer.
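
Here is a rough sketch of steps 1-5 (vocabulary building and embedding-matrix reordering). The function and variable names are my own, and the choice of random vectors for unknown words is just one of the options mentioned above:

import numpy as np

def build_vocab_and_embeddings(train_sentences, pretrained, dim):
    """train_sentences: list of token lists; pretrained: dict-like word -> vector."""
    # Step 1: one index per word seen in the train set
    word2index = {}
    for sentence in train_sentences:
        for word in sentence:
            word2index.setdefault(word, len(word2index))

    # Step 4: also keep pretrained words unseen in training (useful at test time)
    for word in pretrained:
        word2index.setdefault(word, len(word2index))

    # Step 5: two extra rows for the pad word and the unknown word
    pad_index = len(word2index)
    unk_index = pad_index + 1

    # Step 3: reorder the pretrained matrix so that row i holds the vector of index i
    embeddings = np.random.uniform(-0.2, 0.2, (unk_index + 1, dim))
    for word, index in word2index.items():
        if word in pretrained:
            embeddings[index] = pretrained[word]
    embeddings[pad_index] = 0.0  # here the pad vector is a zero vector

    return word2index, embeddings, pad_index, unk_index

def translate(sentence, word2index, unk_index):
    # Step 2: words -> indices; unknown words all map to the same index
    return [word2index.get(word, unk_index) for word in sentence]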

Most of the above has been done in [my script] and you can mostly just use that skeleton while changing just the _get_keras_model function.

Fine-tuning DRMM_TKS and MatchPyramid

Once my DRMM_TKS model was set up, it was performing very poorly (0.54 MAP). After trying every idea I could imagine for fine tuning it, I ended up with a model which scored 0.63 MAP. Beyond this point, however, I didn't feel like any changes I made would make it perform any better. You are welcome to try.
Seeing that DRMM_TKS wasn't doing so well, I moved my attention to the next best performing model in my list of models, MatchPyramid. This model, after some tuning, managed to score 0.64-0.65 MAP.

It was around this point that we came to the realization that the score wasn't much more than the baseline, and that a score of the unsupervised baseline plus 0.15 would be needed to make it a model worth collecting supervised data for.

String of events leading to a change in models

I started looking for other datasets and happened to stumble upon the BiMPM paper, which claimed a score of 0.71 on WikiQA. In its score comparison table

[Image: score comparison table from the BiMPM paper]

it had cited another paper, SeqMatchSeq, which claimed an even higher score of 0.74. I went about implementing BiMPM here and here, which didn't perform as expected. I then ran the author's original code, where it unfortunately turned out that the author had reported the score on the dev set, not the test set (as per my run at least, with the author's random seeds). (On asking the author in this issue, the author claims the split was test.) The actual test score was 0.69, not much more than the baseline for such a complicated model. I also found a pytorch implementation of SeqMatchSeq here, in which there was an issue about MAP coming to 0.62 instead of 0.72 on the dev set. The author commented:

I am afraid you are right. I used to reach ~ 72% via the given random seed on an old version of pytorch, but now with the new version of pytorch, I wasn't able to reproduce the result. My personal opinion is that the model is neither deep or sophisticated, and usually for such kind of model, tuning hyper parameters will change the results a lot (although I don't think it's worthy to invest time tweaking an unstable model structure). If you want guaranteed decent accuracy on answer selection task, I suggest you take a look at those transfer learning methods from reading comprehension. One of them is here https://github.com/pcgreat/qa-transfer

About QA-Transfer and BiDAF

And thus, my hunt led me to the paper Question Answering through Transfer Learning from Large Fine-grained Supervision Data, which makes an even crazier claim on MAP: 0.83 (with an ensemble of 12).

The paper's authors provide the implementation here in tensorflow.

The authors make some notable claims in the abstract:

We show that the task of question answering (QA) can significantly benefit from the transfer learning of models trained on a different large, fine-grained QA dataset. We achieve the state of the art in two well-studied QA datasets, WikiQA and SemEval-2016 (Task 3A), through a basic transfer learning technique from SQuAD. For WikiQA, our model outperforms the previous best model by more than 8%. We demonstrate that finer supervision provides better guidance for learning lexical and syntactic information than coarser supervision, through quantitative results and visual analysis. We also show that a similar transfer learning procedure achieves the state of the art on an entailment task.

So, how this model works: it takes an existing model for QA called BiDirectional Attention Flow (BiDAF), which takes in a query and a context and predicts the range/span of words in the context which is relevant to the query. It was adapted to the SQUAD dataset.

Example of SQUAD:

[Image: SQUAD example]

QA-Transfer takes the BiDAF net and chops off the last layer to make it more QA-like, i.e., something more like WikiQA:

    q1 - d1 - 0
    q1 - d2 - 0
    q1 - d3 - 1
    q1 - d4 - 0
    

They call this modified network BiDAF-T.

Then, they take the SQUAD dataset, break each context into sentences, and label each sentence as relevant or irrelevant. This new dataset is called SQUAD-T.
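
Here is a rough sketch of the idea behind that conversion (my own illustration, not the authors' script; it assumes NLTK's sentence tokenizer and simplified field names):

from nltk.tokenize import sent_tokenize

def squad_to_squad_t(question, context, answer_start, answer_text):
    """Turn one SQuAD span annotation into sentence-level relevance labels."""
    answer_end = answer_start + len(answer_text)
    examples, offset = [], 0
    for sentence in sent_tokenize(context):
        start = context.find(sentence, offset)
        end = start + len(sentence)
        offset = end
        # A sentence is labelled relevant if it overlaps the answer span
        label = 1 if start < answer_end and answer_start < end else 0
        examples.append((question, sentence, label))
    return examples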

[TODO add table here]

The new model, BiDAF-T, is then trained on SQUAD-T. It gets MAP 0.75 on the WikiQA test set.
When BiDAF-T is finetuned on WikiQA, it gets MAP 0.76.
When BiDAF is first trained on SQUAD, the weights are then transferred to BiDAF-T (trained on SQUAD-T), and the model is further finetuned on WikiQA, it gets MAP 0.79. When this is done as an ensemble, it gets MAP 0.83.

They call it Transfer Learning.

So, as such, it's not exactly a new model. It's just an old model (BiDAF), trained on a modified dataset and then used on WikiQA and SentEval. However, the authors claim that the model does surprisingly well on both of them. The authors have provided their own repo.

Since 0.79 feels like a good score, significantly (0.17) above the w2v baseline, I went about implementing it. You can find my implementation here. Unfortunately, my scores don't come out as well.

Steps involved:
I had to first convert the SQUAD dataset, a span level dataset, into a QA dataset. I have uploaded the new dataset and the script for conversion on my repo. I then trained the BiDAF-T model on the SQUAD-T dataset to check if it did in fact reach the elusive score. I got 0.64 on average (getting 0.67 at one point), but this isn't more than MatchPyramid, and it has several drawbacks:

Things to consider regarding QA-Transfer/BiDAF:

  • The original BiDAF code on SQUAD takes 20 hours to converge [reference]
  • It is a very heavy model
  • Since there is a QA transfer from SQUAD, it won't work on non-English words. In that case, it's better to try an older method like BiMPM or SeqMatchSeq

List of existing implementations

Benchmarked Data

WordEmbedding Baselines and Models on WikiQA

| WikiQA test set | w2v 50 dim | w2v 100 dim | w2v 200 dim | w2v 300 dim | FT 300 dim | MatchPyramid | DRMM_TKS | MatchPyramid Random Baseline | DRMM_TKS Random Baseline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| map | 0.6016 | 0.6148 | 0.6285 | 0.6277 | 0.6199 | 0.6406 | 0.6311 | 0.5107 | 0.5394 |
| gm_map | 0.4642 | 0.4816 | 0.4972 | 0.4968 | 0.4763 | 0.5041 | 0.4928 | 0.3753 | 0.4111 |
| Rprec | 0.4318 | 0.4551 | 0.4709 | 0.4667 | 0.4715 | 0.4918 | 0.4752 | 0.3471 | 0.3512 |
| bpref | 0.4251 | 0.4457 | 0.4613 | 0.456 | 0.4642 | 0.4886 | 0.4768 | 0.3344 | 0.3469 |
| recip_rank | 0.6147 | 0.628 | 0.6419 | 0.6373 | 0.6336 | 0.6471 | 0.6401 | 0.519 | 0.5473 |
| iprec_at_recall_0.00 | 0.6194 | 0.6322 | 0.6469 | 0.6437 | 0.6375 | 0.6543 | 0.6447 | 0.5242 | 0.5534 |
| iprec_at_recall_0.10 | 0.6194 | 0.6322 | 0.6469 | 0.6437 | 0.6375 | 0.6543 | 0.6447 | 0.5242 | 0.5534 |
| iprec_at_recall_0.20 | 0.6194 | 0.6322 | 0.6469 | 0.6437 | 0.6375 | 0.6543 | 0.6447 | 0.5242 | 0.5534 |
| iprec_at_recall_0.30 | 0.6146 | 0.6269 | 0.6431 | 0.6401 | 0.6314 | 0.6505 | 0.6447 | 0.5213 | 0.5515 |
| iprec_at_recall_0.40 | 0.6125 | 0.6269 | 0.6404 | 0.6394 | 0.6293 | 0.6474 | 0.6425 | 0.5189 | 0.5488 |
| iprec_at_recall_0.50 | 0.6125 | 0.6269 | 0.6404 | 0.6394 | 0.6293 | 0.6474 | 0.6425 | 0.5189 | 0.5488 |
| iprec_at_recall_0.60 | 0.5937 | 0.6068 | 0.6196 | 0.6219 | 0.6115 | 0.6393 | 0.6255 | 0.5073 | 0.5348 |
| iprec_at_recall_0.70 | 0.5937 | 0.6068 | 0.6196 | 0.6219 | 0.6115 | 0.6393 | 0.6255 | 0.5073 | 0.5348 |
| iprec_at_recall_0.80 | 0.5914 | 0.6039 | 0.6175 | 0.619 | 0.6094 | 0.6368 | 0.6216 | 0.5049 | 0.5333 |
| iprec_at_recall_0.90 | 0.5914 | 0.6039 | 0.6175 | 0.619 | 0.6094 | 0.6368 | 0.6216 | 0.5049 | 0.5333 |
| iprec_at_recall_1.00 | 0.5914 | 0.6039 | 0.6175 | 0.619 | 0.6094 | 0.6368 | 0.6216 | 0.5049 | 0.5333 |
| P_5 | 0.1893 | 0.1951 | 0.1967 | 0.1975 | 0.1926 | 0.1984 | 0.1959 | 0.1704 | 0.1835 |
| P_10 | 0.1107 | 0.1111 | 0.1119 | 0.114 | 0.1119 | 0.1165 | 0.114 | 0.1095 | 0.1144 |
| P_15 | 0.0774 | 0.0776 | 0.0787 | 0.0787 | 0.0774 | 0.0787 | 0.0785 | 0.0787 | 0.0787 |
| P_20 | 0.0595 | 0.0597 | 0.0597 | 0.0597 | 0.0591 | 0.0599 | 0.0597 | 0.0603 | 0.0601 |
| ndcg_cut_1 | 0.4403 | 0.4486 | 0.4691 | 0.4587 | 0.4774 | 0.4938 | 0.4876 | 0.0402 | 0.0402 |
| ndcg_cut_3 | 0.5867 | 0.6077 | 0.6213 | 0.626 | 0.6033 | 0.6261 | 0.6209 | 0.0121 | 0.0121 |
| ndcg_cut_5 | 0.6417 | 0.6598 | 0.6722 | 0.6743 | 0.6593 | 0.6774 | 0.6684 | 0.006 | 0.006 |
| ndcg_cut_10 | 0.6825 | 0.693 | 0.7055 | 0.7102 | 0.6982 | 0.7228 | 0.7108 | 0.0024 | 0.0024 |
| ndcg_cut_20 | 0.6993 | 0.7101 | 0.7208 | 0.7211 | 0.7115 | 0.728 | 0.7223 | 0.0012 | 0.0012 |

Quora Duplicate Questions

| Model | Accuracy |
| --- | --- |
| MatchPyramid | 69.20% |
| DRMM_TKS | 67.05 |

SICK

| Model | Accuracy |
| --- | --- |
| MatchPyramid | 57.67 |
| DRMM_TKS | 56.81 |

SNLI

| Model | Accuracy |
| --- | --- |
| MatchPyramid | 53.49 |
| DRMM_TKS | 51.2 (Recheck) |

InsuranceQA

MatchPyramid

| Metric | Test1 | Test2 | Test1 Random Baseline | Test2 Random Baseline |
| --- | --- | --- | --- | --- |
| map | 0.8103 | 0.8003 | 0.4074 | 0.41 |
| gm_map | 0.727 | 0.7104 | 0.3118 | 0.313 |
| Rprec | 0.7124 | 0.7005 | 0.2274 | 0.2294 |
| bpref | 0.7177 | 0.7065 | 0.2175 | 0.2195 |
| recip_rank | 0.8348 | 0.8269 | 0.4295 | 0.4374 |
| iprec_at_recall_0.00 | 0.8413 | 0.8316 | 0.444 | 0.4501 |
| iprec_at_recall_0.10 | 0.8413 | 0.8316 | 0.444 | 0.4501 |
| iprec_at_recall_0.20 | 0.8412 | 0.8314 | 0.4437 | 0.4499 |
| iprec_at_recall_0.30 | 0.8398 | 0.8294 | 0.4411 | 0.4471 |
| iprec_at_recall_0.40 | 0.834 | 0.8225 | 0.4305 | 0.437 |
| iprec_at_recall_0.50 | 0.8336 | 0.8224 | 0.4303 | 0.4368 |
| iprec_at_recall_0.60 | 0.7967 | 0.7853 | 0.3963 | 0.3943 |
| iprec_at_recall_0.70 | 0.7964 | 0.7851 | 0.3961 | 0.3939 |
| iprec_at_recall_0.80 | 0.7847 | 0.7741 | 0.3874 | 0.3842 |
| iprec_at_recall_0.90 | 0.7841 | 0.7731 | 0.3868 | 0.3835 |
| iprec_at_recall_1.00 | 0.7841 | 0.7731 | 0.3868 | 0.3835 |
| P_5 | 0.2642 | 0.2596 | 0.174 | 0.1739 |
| P_10 | 0.1453 | 0.1441 | 0.1453 | 0.1441 |
| P_15 | 0.0969 | 0.096 | 0.0969 | 0.096 |
| P_20 | 0.0727 | 0.072 | 0.0727 | 0.072 |
| P_30 | 0.0484 | 0.048 | 0.0484 | 0.048 |
| P_100 | 0.0145 | 0.0144 | 0.0145 | 0.0144 |
| P_200 | 0.0073 | 0.0072 | 0.0073 | 0.0072 |
| P_500 | 0.0029 | 0.0029 | 0.0029 | 0.0029 |
| P_1000 | 0.0015 | 0.0014 | 0.0015 | 0.0014 |

DRMM_TKS

| Metric | Test1 | Test2 | Test1 Random Baseline | Test2 Random Baseline |
| --- | --- | --- | --- | --- |
| num_q | 1800 | 1800 | 1800 | 1800 |
| num_ret | 18000 | 18000 | 18000 | 18000 |
| num_rel | 2616 | 2593 | 2616 | 2593 |
| num_rel_ret | 2616 | 2593 | 2616 | 2593 |
| map | 0.617 | 0.5998 | 0.24 | 0.2414 |
| gm_map | 0.5024 | 0.4835 | 0.1879 | 0.1899 |
| Rprec | 0.4516 | 0.4372 | 0.0806 | 0.0829 |
| bpref | 0.4532 | 0.4377 | 0.0728 | 0.0743 |
| recip_rank | 0.6496 | 0.6301 | 0.2476 | 0.2488 |
| iprec_at_recall_0.00 | 0.6575 | 0.6396 | 0.2682 | 0.2698 |
| iprec_at_recall_0.10 | 0.6575 | 0.6396 | 0.2682 | 0.2698 |
| iprec_at_recall_0.20 | 0.6573 | 0.6394 | 0.2676 | 0.2697 |
| iprec_at_recall_0.30 | 0.6547 | 0.6379 | 0.2659 | 0.268 |
| iprec_at_recall_0.40 | 0.6475 | 0.6293 | 0.2589 | 0.2611 |
| iprec_at_recall_0.50 | 0.6473 | 0.6289 | 0.2589 | 0.2611 |
| iprec_at_recall_0.60 | 0.6 | 0.5825 | 0.2383 | 0.2391 |
| iprec_at_recall_0.70 | 0.5995 | 0.5822 | 0.2381 | 0.2391 |
| iprec_at_recall_0.80 | 0.5833 | 0.5694 | 0.2362 | 0.2364 |
| iprec_at_recall_0.90 | 0.5819 | 0.5688 | 0.2357 | 0.2363 |
| iprec_at_recall_1.00 | 0.5819 | 0.5688 | 0.2357 | 0.2363 |
| P_5 | 0.2352 | 0.2292 | 0.0804 | 0.0856 |
| P_10 | 0.1453 | 0.1441 | 0.1453 | 0.1441 |
| P_15 | 0.0969 | 0.096 | 0.0969 | 0.096 |
| P_20 | 0.0727 | 0.072 | 0.0727 | 0.072 |
| P_30 | 0.0484 | 0.048 | 0.0484 | 0.048 |
| P_100 | 0.0145 | 0.0144 | 0.0145 | 0.0144 |
| P_200 | 0.0073 | 0.0072 | 0.0073 | 0.0072 |
| P_500 | 0.0029 | 0.0029 | 0.0029 | 0.0029 |
| P_1000 | 0.0015 | 0.0014 | 0.0015 | 0.0014 |

7. Notes on Finetuning Models

Fine tuning deep learning models can be tough, considering the model runtimes and the number of parameters.
Here, I will list some parameters to consider while tuning:

  • The word vector dimensionality (50, 100, 200, 300) and source (Glove, GoogleNews)
  • Dropout on different layers
  • Making the Fully Connected/Dense layer deeper or wider
  • Running fewer vs more epochs
  • Bigger or smaller batch sizes

8. Thoughts on Deep Learning Models

While these models currently seem to be doing the best, they are a bit difficult to reproduce. There are several reasons I can imagine for that:

  • The Deep Learning libraries and versions used in implementations vary or become outdated. Different libraries can differ in small implementation details (like random initializations) which can lead to huge differences later on and are difficult to track.
  • The large number of parameters coupled with the stochastic nature of training means that a model's performance can vary greatly. Add to that the time it takes to get a single result, and tuning them becomes very difficult.
  • "When a model doesn't perform well, is it because the model is wrong or is my implementation wrong?" This is a difficult question to answer.

9. Conclusion

My time spent understanding, developing and testing Neural Networks for Similarity Learning has led me to the conclusions that:

  • the current methods are not significantly better than a simple unsupervised word2vec baseline.
  • the current methods are unreproducible/difficult to reproduce.

This work and effort has provided:

  • a method to evaluate models
  • implementations of some of the current SOTA models
  • a list of datasets to evaluate models on, and their readers

10. Future Work

If anybody were to carry on with this work, I would recommend looking at the QA-Transfer model, which claims the current "State Of The Art". Although my implementation couldn't reproduce the result, I feel there is some merit in the model. The reason for this belief is not QA-Transfer itself, but the underlying BiDAF model, which provides two demos which do really well. Since there are so many discrepancies between different implementations, it would be better to use the ported tf1.8 BiDAF.
If you are looking to do "newer research", I would recommend doing what the authors of QA-Transfer did: they looked at the best performing model on the SQUAD 1.1 dataset, pretrained on it, removed the last layer and evaluated. Things have changed since: there is a newer, better SQUAD (2.0) and that leaderboard has a new "best" model. In fact, the BiDAF model is no longer the best on SQUAD 1.1 or 2.0. Just take the current best model and do the pretraining! If there's anything to take away from QA-Transfer, it's that span supervision and pretraining can be good ideas. This, especially, seems to be a new trend regarding Language Models.

References

[1] Wikipedia article on Similarity Learning
