@pgolding
Created May 27, 2017 20:26
Cosine Similarity Python Scikit Learn
# http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# cosine_similarity expects 2-D arrays of shape (n_samples, n_features),
# so plain 1-D arrays like these are rejected with a ValueError
x = np.array([2,3,1,0])
y = np.array([2,3,0,0])
# Reshape each into a single-row matrix of shape (1, 4)
x = x.reshape(1,-1)
y = y.reshape(1,-1)
# Or just create as a single row matrix
z = np.array([[1,1,1,1]])
# Now we can compute similarities
cosine_similarity(x,y) # = array([[ 0.96362411]]), most similar
cosine_similarity(x,z) # = array([[ 0.80178373]]), next most similar
cosine_similarity(y,z) # = array([[ 0.69337525]]), least similar
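
The three calls above can also be collapsed into one. As a minimal sketch, assuming the same three vectors are stacked as rows of a single 2-D array (the names `vectors` and `sim` are only illustrative), `cosine_similarity` with one argument returns every pairwise similarity at once:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stack the three example vectors as rows: 3 samples, 4 features.
vectors = np.array([
    [2, 3, 1, 0],  # x
    [2, 3, 0, 0],  # y
    [1, 1, 1, 1],  # z
])

# With a single argument, every row is compared against every row.
sim = cosine_similarity(vectors)

print(sim.shape)         # (3, 3)
print(np.round(sim, 4))  # sim[0, 1] is x vs y, sim[0, 2] is x vs z, sim[1, 2] is y vs z
```

The diagonal is 1 (each vector compared with itself) and the matrix is symmetric, so only the upper triangle carries new information.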
@aparnavarma123

Why do the arrays need to be reshaped?

x = x.reshape(1,-1)
What does this line change?

@shreyavshetty

Reshaping is necessary because passing 1-D arrays raises a ValueError:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
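
To see what the reshape actually changes, here is a small sketch that only inspects array shapes (the names `row` and `col` are illustrative). The `-1` tells NumPy to infer that dimension from the number of elements:

```python
import numpy as np

x = np.array([2, 3, 1, 0])
print(x.shape)    # (4,)   a 1-D array: no notion of rows or columns

row = x.reshape(1, -1)
print(row.shape)  # (1, 4) one sample with four features

col = x.reshape(-1, 1)
print(col.shape)  # (4, 1) four samples with one feature each
```

For comparing two vectors, the single-row form `(1, -1)` is the one that matches what the gist does.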

@ravigurnatham

Thanks a lot

@akshay172

Hi,
Instead of passing a single 1D array to the function, what if we have a large list to compare against another list?
E.g. checking for similarity between customer names present in two different lists.
How would cosine similarity be applied in that case? Would the result be a matrix of cosine similarities between every pair?
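
One possible approach, sketched under several assumptions: the names below are made up, and turning each name into a TF-IDF vector of character n-grams is just one reasonable choice of representation, not something the gist prescribes. Passing two matrices to `cosine_similarity` then yields one row per name in the first list and one column per name in the second:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical customer names, made up for illustration.
names_a = ["John Smith", "Maria Garcia"]
names_b = ["Jon Smith", "Maria Garcia", "Wei Chen"]

# Assumption: character 2- and 3-grams, which tolerate small spelling differences.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectorizer.fit(names_a + names_b)  # one shared vocabulary for both lists

matrix_a = vectorizer.transform(names_a)
matrix_b = vectorizer.transform(names_b)

# sim[i, j] is the similarity between names_a[i] and names_b[j].
sim = cosine_similarity(matrix_a, matrix_b)
print(sim.shape)  # (2, 3)

# Report the closest match in names_b for each name in names_a.
for i, name in enumerate(names_a):
    j = sim[i].argmax()
    print(name, "->", names_b[j], round(float(sim[i, j]), 3))
```

If distances are wanted rather than similarities, they are `1 - sim`; scikit-learn also provides `cosine_distances` in the same module.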

@djkpandian

djkpandian commented Nov 5, 2018

Hey,

There is another way to do the same thing without reshaping the data.

Say I take three sentences:
sentence_m = "Mason really loves food"
sentence_h = "Hannah loves food too"
sentence_w = "The whale is food"

Each sentence becomes a vector of word counts over the combined vocabulary:
sentence_m: Mason=1, really=1, loves=1, food=1, too=0, Hannah=0, The=0, whale=0, is=0
sentence_h: Mason=0, really=0, loves=1, food=1, too=1, Hannah=1, The=0, whale=0, is=0
sentence_w: Mason=0, really=0, loves=0, food=1, too=0, Hannah=0, The=1, whale=1, is=1

import numpy as np

def cos_sim(a, b):
    # Cosine similarity of two 1-D vectors:
    # dot product divided by the product of their Euclidean norms
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

# Word order: Mason, really, loves, food, too, Hannah, The, whale, is
sentence_m = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0])
sentence_h = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0])
sentence_w = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1])

print(cos_sim(sentence_m, sentence_h))  # 0.5
print(cos_sim(sentence_m, sentence_w))  # 0.25
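
The word-count vectors in this comment were written out by hand. As a rough sketch of how they could be derived from the sentences instead (the helper `to_vector` and the `sentences` dictionary are illustrative, and splitting on whitespace is an assumption that ignores punctuation and case):

```python
import numpy as np

sentences = {
    "sentence_m": "Mason really loves food",
    "sentence_h": "Hannah loves food too",
    "sentence_w": "The whale is food",
}

# Build the vocabulary in order of first appearance across all sentences.
vocab = []
for text in sentences.values():
    for word in text.split():
        if word not in vocab:
            vocab.append(word)

def to_vector(text):
    # Count how often each vocabulary word occurs in the sentence.
    words = text.split()
    return np.array([words.count(w) for w in vocab])

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vectors = {name: to_vector(text) for name, text in sentences.items()}

print(cos_sim(vectors["sentence_m"], vectors["sentence_h"]))  # 0.5
print(cos_sim(vectors["sentence_m"], vectors["sentence_w"]))  # 0.25
print(cos_sim(vectors["sentence_h"], vectors["sentence_w"]))  # 0.25
```

The vocabulary order here differs slightly from the hand-written one, but cosine similarity does not depend on the order of the dimensions as long as all vectors share it.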
