@tomericco
Created January 26, 2019 10:58
Cosine similarity implementation in JS
const str1 = 'This is an example to test cosine similarity between two strings';
const str2 = 'This example is testing cosine similatiry for given two strings';
//
// Preprocess strings and combine words to a unique collection
//
const str1Words = str1.trim().split(' ').map(omitPunctuations).map(toLowercase);
const str2Words = str2.trim().split(' ').map(omitPunctuations).map(toLowercase);
const allWordsUnique = Array.from(new Set(str1Words.concat(str2Words)));
//
// Calculate TF-IDF vectors
//
const str1Vector = calcTfIdfVectorForDoc(str1Words, [str2Words], allWordsUnique);
const str2Vector = calcTfIdfVectorForDoc(str2Words, [str1Words], allWordsUnique);
//
// Main: print the similarity score
//
console.log('Cosine similarity', cosineSimilarity(str1Vector, str2Vector));
//
// Main function
//
function cosineSimilarity(vec1, vec2) {
  const dotProduct = vec1
    .map((val, i) => val * vec2[i])
    .reduce((accum, curr) => accum + curr, 0);
  const vec1Size = calcVectorSize(vec1);
  const vec2Size = calcVectorSize(vec2);
  return dotProduct / (vec1Size * vec2Size);
}
//
// tf-idf algorithm implementation (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
//
function calcTfIdfVectorForDoc(doc, otherDocs, allWordsSet) {
  return Array.from(allWordsSet).map(word => {
    return tf(word, doc) * idf(word, doc, otherDocs);
  });
}
function tf(word, doc) {
  const wordOccurrences = doc.filter(w => w === word).length;
  return wordOccurrences / doc.length;
}
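function idf(word, doc, otherDocs) {
  // Note: this is the plain ratio N / df, without the logarithm of the usual idf definition.
  // includes() also counts empty-string tokens, which a truthiness check on find() would miss.
  const docsContainingWord = [doc].concat(otherDocs).filter(d => d.includes(word));
  return (1 + otherDocs.length) / docsContainingWord.length;
}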
//
// Helper functions
//
function omitPunctuations(word) {
  return word.replace(/[!.,?-]/g, '');
}
function toLowercase(word) {
  return word.toLowerCase();
}
function calcVectorSize(vec) {
  return Math.sqrt(vec.reduce((accum, curr) => accum + Math.pow(curr, 2), 0));
}
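The steps above can also be wrapped into one reusable function. What follows is a minimal sketch, not part of the original gist. The names `tokenize`, `weightVector`, and `textCosineSimilarity` are hypothetical, and returning 0 for empty input is an added assumption. The weighting mirrors the gist's unlogged tf × (N / df) scheme.

```javascript
// Hypothetical helper: split on whitespace, strip the same punctuation as the gist,
// lowercase, and drop tokens that end up empty.
function tokenize(text) {
  return text
    .trim()
    .split(/\s+/)
    .map(word => word.replace(/[!.,?-]/g, '').toLowerCase())
    .filter(word => word.length > 0);
}

// Weight every vocabulary word by tf * (N / df), where N is the number of documents
// and df is the number of documents containing the word (no logarithm, as in the gist).
function weightVector(doc, allDocs, vocabulary) {
  return vocabulary.map(word => {
    const tf = doc.filter(w => w === word).length / doc.length;
    const df = allDocs.filter(d => d.includes(word)).length;
    return tf * (allDocs.length / df);
  });
}

function textCosineSimilarity(textA, textB) {
  const docs = [tokenize(textA), tokenize(textB)];
  // Assumption: an input with no tokens has similarity 0 rather than NaN.
  if (docs[0].length === 0 || docs[1].length === 0) {
    return 0;
  }
  const vocabulary = Array.from(new Set(docs[0].concat(docs[1])));
  const [vecA, vecB] = docs.map(doc => weightVector(doc, docs, vocabulary));
  const dot = vecA.reduce((sum, val, i) => sum + val * vecB[i], 0);
  const norm = vec => Math.sqrt(vec.reduce((sum, val) => sum + val * val, 0));
  return dot / (norm(vecA) * norm(vecB));
}

console.log(textCosineSimilarity('the cat sat', 'the cat ran')); // roughly 0.33
```

Texts with no words in common score 0 and identical texts score 1 (up to floating-point rounding), which makes for a quick sanity check.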
@gabriel-aleixo

Thanks for sharing that! I used your implementation of cosine similarity to measure image similarity in my recent project for the CS50 course. It worked very well. Here's a link to the repo https://github.com/gabriel-aleixo/cs50-final-project. Thanks!

@tomericco
Author

Sure, Gabriel. I'd be honored :)