Last active April 19, 2023 15:34
Comparing Text Similarity Measures & Text Embedding Methods
# Train a Doc2Vec model; `data` is the tokenized corpus loaded in another snippet
import gensim

def tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])

training_data = list(tagged_document(data))
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
model.build_vocab(training_data)
model.train(training_data, total_examples=model.corpus_count, epochs=model.epochs)
# Cosine similarity; squared_sum (defined in another snippet) returns the rounded L2 norm
def cos_similarity(x, y):
    """ return cosine similarity between two lists """
    numerator = sum(a*b for a, b in zip(x, y))
    denominator = squared_sum(x)*squared_sum(y)
    return round(numerator/float(denominator), 3)

cos_similarity(embeddings[0], embeddings[1])
# OUTPUT
# 0.891
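As a cross-check on the snippet above, here is a minimal self-contained sketch of cosine similarity that keeps the vector norms unrounded. The helper name `cosine_sim` and the sample vectors are assumptions made for illustration; they are not part of the original gist.

```python
from math import sqrt

def cosine_sim(x, y):
    """Cosine similarity of two equal-length numeric sequences (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Illustrative vectors, not taken from the gist's embeddings
parallel = cosine_sim([1, 2, 3], [2, 4, 6])   # same direction
orthogonal = cosine_sim([1, 0], [0, 1])       # perpendicular
print(round(parallel, 6), round(orthogonal, 6))
```

Vectors pointing the same way score close to 1 and perpendicular vectors score 0, regardless of their lengths.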
# Bag-of-words counts for each headline, compared with cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(headlines)
arr = X.toarray()
create_heatmap(cosine_similarity(arr))
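To make the count vectors above more concrete, the following sketch builds a small document-term matrix by hand using only the standard library. It approximates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the two sample documents and the `tokenize` helper are assumptions for illustration.

```python
import re
from collections import Counter

# Illustrative documents, not the gist's headlines
docs = [
    "crypto funds see inflows",
    "crypto prices fall but crypto funds see inflows",
]

def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters (approximates the defaults)
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted alphabetically
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary term
counts = []
for doc in docs:
    freq = Counter(tokenize(doc))
    counts.append([freq[term] for term in vocab])

print(vocab)
for row in counts:
    print(row)
```

Each row is the kind of vector that the cosine similarity is then computed on.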
# Load the medium English spaCy model and parse each headline
import spacy

nlp = spacy.load('en_core_web_md')
docs = [nlp(headline) for headline in headlines]
# Load the downloaded ELMo model and define a sentence that uses "bank" in three contexts
from simple_elmo import ElmoModel

model = ElmoModel()
model.load("/content/209.zip")
sentence = "After stealing gold from the bank vault, the bank robber was seen fishing on the river bank."
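# Plot a similarity matrix as a heatmap, labelled with the first 20 characters of each headline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

labels = [headline[:20] for headline in headlines]

def create_heatmap(similarity, cmap="YlGnBu"):
    df = pd.DataFrame(similarity)
    df.columns = labels
    df.index = labels
    fig, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(df, cmap=cmap)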
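# Convert a distance into a similarity score
from math import exp

def distance_to_similarity(distance):
    return 1/exp(distance)

distance_to_similarity(distance)
# OUTPUT
# 0.8450570465624478
# Note: for the distance 1.8646982721454675 printed in another snippet,
# 1/exp(distance) is about 0.155; the value above equals 1 - 1/exp(distance).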
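The mapping 1/exp(distance) is the same as exp(-distance): a distance of 0 gives a similarity of 1, and larger distances decay toward 0. The sketch below restates the function in a self-contained form and prints a few values; the sample distances are assumptions chosen for illustration.

```python
from math import exp

def distance_to_similarity(distance):
    """Map a non-negative distance to a similarity in (0, 1] via exp(-distance)."""
    return exp(-distance)

# Illustrative distances, not values from the gist
samples = [0.0, 0.5, 1.0, 2.0]
scores = [distance_to_similarity(d) for d in samples]
for d, s in zip(samples, scores):
    print(d, round(s, 4))
```

Because the function is strictly decreasing, ranking items by this similarity is the reverse of ranking them by distance.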
wget http://vectors.nlpl.eu/repository/20/209.zip
python -m spacy download en_core_web_md
# Get ELMo vectors for the sentence, averaged over the model's layers
elmo_vectors = model.get_elmo_vectors(sentence, layers="average")
print(f"Tensor shape: {elmo_vectors.shape}")
# OUTPUT
# Tensor shape: (1, 92, 1024)
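# Each slice covers the four positions of one occurrence of "bank" in the sentence
# (bank vault, bank robber, river bank); average the four vectors in each slice
import numpy as np

vault = np.sum(elmo_vectors[0][29:33], axis=0)/4
robber = np.sum(elmo_vectors[0][45:49], axis=0)/4
river = np.sum(elmo_vectors[0][87:91], axis=0)/4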
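The slice bounds 29:33, 45:49 and 87:91 line up with the character offsets of the three occurrences of "bank" in the example sentence, and the sentence length matches the 92 in the printed tensor shape. The sketch below checks this with the standard `re` module only; it does not load or call ELMo, and the variable name `spans` is an assumption for illustration.

```python
import re

sentence = "After stealing gold from the bank vault, the bank robber was seen fishing on the river bank."

# (start, end) character offsets of every occurrence of "bank"
spans = [m.span() for m in re.finditer(r"bank", sentence)]

print(len(sentence))
print(spans)
```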
from math import sqrt, pow, exp

def squared_sum(x):
    """ return the square root of the sum of squares, rounded to 3 decimals """
    return round(sqrt(sum([a*a for a in x])), 3)

def euclidean_distance(x, y):
    """ return euclidean distance between two lists """
    return sqrt(sum(pow(a-b, 2) for a, b in zip(x, y)))
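One way to sanity-check a hand-written Euclidean distance is to compare it against `math.dist` from the standard library (available from Python 3.8). The sketch below does that on a 3-4-5 triangle; the points `p` and `q` are assumptions chosen so the answer is easy to verify by hand.

```python
from math import dist, sqrt

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Illustrative points forming a 3-4-5 right triangle
p, q = [0.0, 3.0], [4.0, 0.0]

manual = euclidean_distance(p, q)
builtin = dist(p, q)
print(manual, builtin)
```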
# Load a pretrained Sentence-Transformers model
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('stsb-roberta-large')
# Embed each sentence with spaCy and measure the Euclidean distance between the first two
embeddings = [nlp(sentence).vector for sentence in sentences]
distance = euclidean_distance(embeddings[0], embeddings[1])
print(distance)
# OUTPUT
# 1.8646982721454675
# Download the text8 corpus through gensim and materialize it as a list of token lists
import gensim
import gensim.downloader as api

dataset = api.load("text8")
data = list(dataset)
# Load the Universal Sentence Encoder from TensorFlow Hub
import tensorflow as tf
import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
headlines = [
    # Crypto
    'Investors unfazed by correction as crypto funds see $154 million inflows',
    'Bitcoin, Ethereum prices continue descent, but crypto funds see inflows',
    # Inflation
    'The surge in euro area inflation during the pandemic: transitory but with upside risks',
    "Inflation: why it's temporary and raising interest rates will do more harm than good",
    # Common
    'Will Cryptocurrency Protect Against Inflation?']
pip install transformers sentence-transformers
def jaccard_similarity(x, y):
    """ returns the jaccard similarity between two lists """
    intersection_cardinality = len(set.intersection(*[set(x), set(y)]))
    union_cardinality = len(set.union(*[set(x), set(y)]))
    return intersection_cardinality/float(union_cardinality)
# Infer a Doc2Vec vector per tokenized sentence, then fill the pairwise similarity matrix
vectors = [model.infer_vector([word for word in sent]).reshape(1, -1) for sent in sentences]
similarity = []
for i in range(len(sentences)):
    row = []
    for j in range(len(sentences)):
        row.append(cosine_similarity(vectors[i], vectors[j])[0][0])
    similarity.append(row)
create_heatmap(similarity)
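# Compare the three "bank" vectors. Assuming cosine_similarity is scikit-learn's,
# it expects 2D inputs and returns a matrix, so reshape to (1, n) and take [0][0].
diff_bank_1 = cosine_similarity(vault.reshape(1, -1), river.reshape(1, -1))[0][0]
diff_bank_2 = cosine_similarity(river.reshape(1, -1), robber.reshape(1, -1))[0][0]
same_bank = cosine_similarity(vault.reshape(1, -1), robber.reshape(1, -1))[0][0]
print('Vector similarity for *similar* meanings: %.2f' % same_bank)
print('Vector similarity for *different* meanings: %.2f' % diff_bank_1)
print('Vector similarity for *different* meanings: %.2f' % diff_bank_2)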
# Jaccard similarity of two short sentences, tokenized by lowercasing and splitting on spaces
sentences = ["The bottle is empty",
             "There is nothing in the bottle"]
sentences = [sent.lower().split(" ") for sent in sentences]
jaccard_similarity(sentences[0], sentences[1])
# OUTPUT
# 0.42857142857142855
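The value above is 3/7: the two sentences share three distinct words out of seven distinct words overall. The sketch below works that out step by step with plain set operations; the variable names are assumptions for illustration.

```python
a = "The bottle is empty".lower().split()
b = "There is nothing in the bottle".lower().split()

shared = set(a) & set(b)      # words that appear in both sentences
combined = set(a) | set(b)    # words that appear in either sentence

jaccard = len(shared) / len(combined)
print(sorted(shared))
print(len(shared), len(combined), jaccard)
```

Note that "the" and "there" count as different words, which is why Jaccard similarity on raw tokens is sensitive to wording rather than meaning.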
# Encode the sentences with Sentence-Transformers and fill the pairwise cosine similarity matrix
embeddings = model.encode(sentences, convert_to_tensor=True)
similarity = []
for i in range(len(sentences)):
    row = []
    for j in range(len(sentences)):
        row.append(util.pytorch_cos_sim(embeddings[i], embeddings[j]).item())
    similarity.append(row)
create_heatmap(similarity)
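# Embed the input strings with the Universal Sentence Encoder
# (`text` is the list of strings to embed; it is not defined in this snippet)
embeddings = model(text)
similarity = cosine_similarity(embeddings)
create_heatmap(similarity)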
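# TF-IDF weights for each headline, compared with cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(headlines)
arr = X.toarray()
create_heatmap(cosine_similarity(arr))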
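For intuition about what the TF-IDF vectors contain, here is a minimal standard-library sketch. It assumes the smoothed IDF that scikit-learn documents as the TfidfVectorizer default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The pre-tokenized sample documents and the `tfidf_row` helper are assumptions for illustration, and the real vectorizer also handles tokenization.

```python
from collections import Counter
from math import log, sqrt

# Illustrative pre-tokenized documents, not the gist's headlines
docs = [
    ["crypto", "funds", "inflows"],
    ["crypto", "prices", "fall"],
    ["inflation", "prices", "rise"],
]
n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: in how many documents each term appears
df = Counter(tok for doc in docs for tok in set(doc))

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {t: log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Raw term counts times IDF, scaled to unit L2 norm."""
    tf = Counter(doc)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that appear in fewer documents, such as "funds" here, receive a larger weight than terms shared across documents, such as "crypto".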
# Fill the pairwise similarity matrix using spaCy's built-in Doc.similarity
similarity = []
for i in range(len(docs)):
    row = []
    for j in range(len(docs)):
        row.append(docs[i].similarity(docs[j]))
    similarity.append(row)
create_heatmap(similarity)
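The nested loop that fills a similarity matrix recurs across several snippets in this gist. One possible refactoring, sketched here with the standard library only, is a small helper that takes any pairwise scoring function. The names `pairwise_matrix` and `cosine_sim`, along with the sample vectors, are assumptions for illustration and are not part of the gist.

```python
from math import sqrt

def cosine_sim(x, y):
    """Cosine similarity of two equal-length numeric sequences (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def pairwise_matrix(items, score):
    """Return an n x n list of lists holding score(items[i], items[j])."""
    return [[score(a, b) for b in items] for a in items]

# Illustrative vectors standing in for document embeddings
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
matrix = pairwise_matrix(vectors, cosine_sim)
for row in matrix:
    print([round(v, 3) for v in row])
```

The same helper would accept a scoring function such as `lambda a, b: a.similarity(b)` for spaCy Doc objects, so the matrix-building logic only needs to be written once.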
print(docs[0].vector) |
Thank you for your great article about text similarity. Working examples have become hard to find because of the many breaking changes in Python libraries. Some of the functions can be simplified using third-party libraries.
1- For the Euclidean distance we could use scipy:
2- For the cosine_similarity we could use:
In the test_elmo_word_vectors.py example, I get the following error using Colab: