@aditya00kumar
Last active May 13, 2021 12:13
from sklearn.metrics.pairwise import cosine_similarity


def maximal_marginal_relevance(sentence_vector, phrases, embedding_matrix, lambda_constant=0.5, threshold_terms=10):
    """
    Return ranked phrases using MMR. Cosine similarity is used as the similarity measure.
    :param sentence_vector: Query vector
    :param phrases: list of candidate (phrase, score) tuples
    :param embedding_matrix: pandas DataFrame with phrases as the index and their vectors as rows
    :param lambda_constant: balances accuracy and diversity (default 0.5). A high value favours accuracy; a low value favours diversity.
    :param threshold_terms: number of terms to include in the result set
    :return: Ranked phrases with score
    """
    # todo: Use a cosine similarity matrix for lookup among phrases instead of making a call every time.
    s = []  # phrases selected so far, as (phrase, mmr_score)
    r = sorted(phrases, key=lambda x: x[1], reverse=True)
    r = [i[0] for i in r]  # remaining candidates, phrase names only
    while len(r) > 0:
        # Start below any reachable score so that negative MMR scores are still compared.
        score = float('-inf')
        phrase_to_add = ''
        for i in r:
            # Relevance of the candidate to the query.
            first_part = cosine_similarity([sentence_vector], [embedding_matrix.loc[i]])[0][0]
            # Highest similarity of the candidate to any phrase already selected.
            second_part = 0
            for j in s:
                cos_sim = cosine_similarity([embedding_matrix.loc[i]], [embedding_matrix.loc[j[0]]])[0][0]
                if cos_sim > second_part:
                    second_part = cos_sim
            equation_score = lambda_constant * first_part - (1 - lambda_constant) * second_part
            if equation_score > score:
                score = equation_score
                phrase_to_add = i
        # Defensive fallback, only reached if no score compared greater (e.g. NaN scores).
        if phrase_to_add == '':
            phrase_to_add = i
        r.remove(phrase_to_add)
        s.append((phrase_to_add, score))
    # Slicing already handles threshold_terms > len(s).
    return s[:threshold_terms]
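The todo in the function suggests looking similarities up instead of recomputing them on every pass. Below is a minimal sketch of that idea, not a drop-in replacement: `cosine` and `mmr_cached` are hypothetical names, the candidates are assumed to arrive as a plain dict mapping each phrase to a non-zero vector (rather than a DataFrame), and only the standard library is used so that it runs without scikit-learn or pandas.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr_cached(query_vector, vectors, lambda_constant=0.5, top_n=10):
    """Rank the keys of `vectors` (assumed dict: phrase -> vector) by MMR."""
    names = list(vectors)
    # All similarities are computed up front; the selection loop only does lookups.
    relevance = {n: cosine(query_vector, vectors[n]) for n in names}
    pairwise = {(a, b): cosine(vectors[a], vectors[b]) for a in names for b in names}

    selected = []  # list of (phrase, mmr_score)
    remaining = names[:]
    while remaining and len(selected) < top_n:
        best_name, best_score = None, float("-inf")
        for cand in remaining:
            # Highest similarity to anything already selected, floored at 0
            # to mirror the `second_part = 0` starting value in the gist.
            redundancy = max([0.0] + [pairwise[cand, chosen] for chosen, _ in selected])
            score = lambda_constant * relevance[cand] - (1 - lambda_constant) * redundancy
            if score > best_score:
                best_name, best_score = cand, score
        remaining.remove(best_name)
        selected.append((best_name, best_score))
    return selected


# Toy vectors, made up for illustration: "a" and "b" are near duplicates.
vectors = {"a": [1, 0], "b": [0.99, 0.1], "c": [0, 1]}
ranked = mmr_cached([1, 1], vectors, lambda_constant=0.5, top_n=3)
print([name for name, _ in ranked])  # ['b', 'c', 'a']
```

In the toy run, "b" is picked first as the most relevant phrase, and "c" then outranks "a" because "a" is almost identical to the phrase already selected.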
@AnubhavCR7

AnubhavCR7 commented Apr 27, 2021

Hello @aditya00kumar,
I referred to the original MMR paper (http://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf). According to the formula given in the paper, the code should be:

    equation_score = lambda_constant * (first_part - (1 - lambda_constant) * second_part)

Please correct me if I am getting it wrong somewhere. Awaiting your response.
Regards.

@aditya00kumar
Author

Hey @AnubhavCR7, yes, the paper writes it as equation_score = lambda_constant * (first_part - (1 - lambda_constant) * second_part), but I think there is a typo in that equation.
Why?
Because λ is meant to trade off accuracy against diversity in the result set, and you set it based on your use case and dataset. If you take the equation as given in the paper and set λ = 1, it becomes equation_score = first_part, which is pure relevance, as expected. But if you set λ = 0, the whole expression equates to 0 for every candidate, which should not be the case: it should reduce to -second_part, so that the ranking is driven purely by diversity. That is why I have modified the equation above. Hope this answers your question.
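
A quick way to check the two endpoints is to evaluate both forms on a pair of candidates. This is only an illustrative sketch: `paper_form` and `gist_form` are hypothetical helper names, and the relevance and redundancy values are made up.

```python
def paper_form(relevance, redundancy, lam):
    # Parenthesization as quoted from the paper: lambda multiplies the whole difference.
    return lam * (relevance - (1 - lam) * redundancy)


def gist_form(relevance, redundancy, lam):
    # Form used in the gist: lambda weights relevance, (1 - lambda) weights redundancy.
    return lam * relevance - (1 - lam) * redundancy


# Made-up (relevance, redundancy) pairs for two candidates.
candidates = {"near_duplicate": (0.9, 0.8), "novel": (0.6, 0.1)}

for lam in (0.0, 0.5, 1.0):
    for name, (rel, red) in candidates.items():
        print(lam, name, paper_form(rel, red, lam), gist_form(rel, red, lam))
```

At λ = 1 both forms return the relevance alone. At λ = 0 the paper's form gives 0 for both candidates, so it cannot rank them, while the gist's form gives -0.8 and -0.1 and therefore prefers the novel candidate.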

I have also written a blog post on MMR on Medium; here is the link, and don't forget to check the comments section :).
