@w00lf
Created June 13, 2017 10:43
Computes the cosine similarity of two texts, matched by words.
# Inspired by: https://stackoverflow.com/questions/1746501/can-someone-give-an-example-of-cosine-similarity-in-a-very-simple-graphical-wa
# And: https://github.com/agarie/measurable/blob/8ff8efbab1a0892bdddf6e865dd8864956168a91/lib/measurable/cosine.rb
# https://github.com/agarie/measurable/blob/8ff8efbab1a0892bdddf6e865dd8864956168a91/lib/measurable/euclidean.rb
# Accepts two texts and calculates the cosine similarity of their word counts.
# Returns a Float between 0.0 (no words in common) and 1.0 (identical relative word frequencies).
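def cosine_similarity(one, two)
  # Tokenize each text into words (runs of ASCII letters; matching is case-sensitive).
  words = [one, two].map { |text| text.scan(/[a-zA-Z]+/) }
  # Shared vocabulary: every distinct word that appears in either text.
  indexes = words.flatten.uniq
  # Build one word-count vector per text, aligned to the shared vocabulary.
  counters = words.map do |tokens|
    counter = tokens.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
    indexes.map { |key| counter[key] }
  end
  dot_product = counters.first.zip(counters.last).reduce(0.0) { |acc, (a, b)| acc + a * b }
  # Product of the Euclidean norms: sqrt(sum(a ** 2)) * sqrt(sum(b ** 2))
  norm_product = counters.map { |vector| Math.sqrt(vector.reduce(0.0) { |acc, n| acc + n**2 }) }.reduce(:*)
  # A text without any words has a zero norm; return 0.0 instead of NaN from 0.0 / 0.0.
  return 0.0 if norm_product.zero?
  dot_product / norm_product
end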
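
To make the steps concrete, here is a small worked example that traces the same calculation by hand. It is only a sketch: the two input strings are made up for illustration, and the variable names (`vocabulary`, `vectors`, `norms`) are not part of the original function. It assumes Ruby 2.4 or later for `Array#sum`.

```ruby
# Two tiny, made-up texts used only for illustration.
one = "apple banana banana"
two = "banana cherry"

# Step 1: split each text into words and collect the shared vocabulary.
tokens = [one, two].map { |text| text.scan(/[a-zA-Z]+/) }
vocabulary = tokens.flatten.uniq # ["apple", "banana", "cherry"]

# Step 2: count how often each vocabulary word occurs in each text.
vectors = tokens.map do |words|
  counts = words.each_with_object(Hash.new(0)) { |word, acc| acc[word] += 1 }
  vocabulary.map { |word| counts[word] }
end
# vectors is [[1, 2, 0], [0, 1, 1]]

# Step 3: dot product divided by the product of the Euclidean norms.
dot = vectors.first.zip(vectors.last).sum { |a, b| a * b } # 1*0 + 2*1 + 0*1 = 2
norms = vectors.map { |v| Math.sqrt(v.sum { |n| n**2 }) }  # sqrt(5) and sqrt(2)
similarity = dot / (norms.first * norms.last)              # 2 / sqrt(10)

puts similarity.round(3) # prints 0.632
```

Only "banana" is shared, so the score lands between 0.0 and 1.0. Calling `cosine_similarity(one, two)` with the same strings should give the same value.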