from sklearn.metrics.pairwise import cosine_similarity


def maximal_marginal_relevance(sentence_vector, phrases, embedding_matrix, lambda_constant=0.5, threshold_terms=10):
    """
    Return ranked phrases using MMR. Cosine similarity is used as similarity measure.
    :param sentence_vector: Query vector
    :param phrases: list of (phrase, score) pairs for the candidate phrases
    :param embedding_matrix: matrix having index as phrases and values as vector (rows are looked up with .loc)
    :param lambda_constant: 0.5 to balance diversity and accuracy. A higher lambda_constant favours accuracy; a lower one favours diversity.
    :param threshold_terms: number of terms to include in result set
    :return: Ranked phrases with score
    """
    # todo: Use cosine similarity matrix for lookup among phrases instead of making call every time.
    s = []  # phrases selected so far, stored as (phrase, mmr_score)
    r = sorted(phrases, key=lambda x: x[1], reverse=True)
    r = [i[0] for i in r]  # remaining candidate phrases
    while len(r) > 0:
        # Start from -inf so the best candidate is still picked when every score is <= 0.
        score = float('-inf')
        phrase_to_add = None
        for i in r:
            # Relevance of the candidate to the query.
            first_part = cosine_similarity([sentence_vector], [embedding_matrix.loc[i]])[0][0]
            # Redundancy: highest similarity to any phrase already selected.
            second_part = 0
            for j in s:
                cos_sim = cosine_similarity([embedding_matrix.loc[i]], [embedding_matrix.loc[j[0]]])[0][0]
                if cos_sim > second_part:
                    second_part = cos_sim
            equation_score = lambda_constant * first_part - (1 - lambda_constant) * second_part
            if equation_score > score:
                score = equation_score
                phrase_to_add = i
        r.remove(phrase_to_add)
        s.append((phrase_to_add, score))
    # Slicing already handles the case where fewer than threshold_terms phrases exist.
    return s[:threshold_terms]
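The todo in the function above suggests computing the similarities once and looking them up afterwards. A minimal sketch of that idea follows, assuming plain Python lists instead of the DataFrame and scikit-learn call used in the gist. The names `cosine`, `mmr_with_lookup`, `vectors` and `top_n` are hypothetical and exist only for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_with_lookup(query_vector, vectors, lambda_constant=0.5, top_n=10):
    """Rank phrases with MMR using precomputed similarities.

    vectors is a hypothetical dict mapping phrase -> vector; it stands in
    for embedding_matrix in the gist.
    """
    names = list(vectors)
    # Every similarity is computed once here and only looked up in the loop.
    relevance = {p: cosine(query_vector, vectors[p]) for p in names}
    pairwise = {(p, q): cosine(vectors[p], vectors[q]) for p in names for q in names}

    selected = []          # (phrase, mmr_score) in selection order
    remaining = names[:]   # candidates not yet selected

    def score(p):
        # Redundancy is the highest similarity to any selected phrase,
        # floored at 0 as in the gist (second_part starts at 0).
        redundancy = max([0.0] + [pairwise[p, q] for q, _ in selected])
        return lambda_constant * relevance[p] - (1 - lambda_constant) * redundancy

    while remaining and len(selected) < top_n:
        best = max(remaining, key=score)
        selected.append((best, score(best)))
        remaining.remove(best)
    return selected


# Toy data, invented for this example.
toy_vectors = {
    "apple pie": [1.0, 0.0],
    "apple tart": [0.9, 0.1],
    "car engine": [0.0, 1.0],
}
toy_query = [1.0, 0.2]
print(mmr_with_lookup(toy_query, toy_vectors, lambda_constant=0.5))
print(mmr_with_lookup(toy_query, toy_vectors, lambda_constant=1.0))
```

With a DataFrame of embeddings, the same lookup table could presumably be built by a single `cosine_similarity(embedding_matrix)` call, which returns all pairwise similarities at once. That variant is not shown or tested here.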
Hey @AnubhavCR7, yes, in the paper the formula is written as equation_score = lambda_constant * (first_part - (1 - lambda_constant) * second_part),
but I think there is a typo in that equation.
Why?
Because any value of λ between 0 and 1 should give a mix of diversity and accuracy in the result set, and the value can be chosen based on the use case and your dataset. If you take the equation as given in the paper and set λ = 1, it becomes equation_score = first_part, which is pure relevance.
But if you set λ = 0, the whole expression evaluates to 0 for every candidate, which should not be the case: at λ = 0 the ranking should depend only on diversity, i.e. on -second_part. That is why I have modified the equation as shown in the code. Hope this answers your question.
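To make the λ = 0 and λ = 1 cases concrete, here is a small sketch that evaluates both forms of the equation on made-up similarity values. The function names and the numbers 0.8 and 0.6 are assumptions chosen only for illustration.

```python
def mmr_score_gist(first_part, second_part, lambda_constant):
    """Form used in the gist: lambda * first - (1 - lambda) * second."""
    return lambda_constant * first_part - (1 - lambda_constant) * second_part


def mmr_score_as_printed(first_part, second_part, lambda_constant):
    """Form quoted from the paper in this thread: lambda * (first - (1 - lambda) * second)."""
    return lambda_constant * (first_part - (1 - lambda_constant) * second_part)


# Hypothetical values: relevance to the query and similarity to selected phrases.
first_part, second_part = 0.8, 0.6

for lam in (0.0, 0.5, 1.0):
    print(
        lam,
        mmr_score_gist(first_part, second_part, lam),
        mmr_score_as_printed(first_part, second_part, lam),
    )
```

At λ = 1 both forms reduce to first_part. At λ = 0 the gist's form reduces to -second_part, while the form as printed is 0 for every candidate, so it cannot rank anything.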
I have written a blog post on MMR on Medium; here is the link. Don't forget to check the comments section :).
Hello @aditya00kumar,
I referred to the original MMR paper (http://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf). According to the formula given in the paper, the code should be:
Please correct me if I am getting it wrong somewhere. Awaiting your response.
Regards.