Skip to content

Instantly share code, notes, and snippets.

@crashtech
Last active August 31, 2018 09:09
Show Gist options
  • Save crashtech/92acde79829dd696e17b36fa5ee76542 to your computer and use it in GitHub Desktop.
Save crashtech/92acde79829dd696e17b36fa5ee76542 to your computer and use it in GitHub Desktop.
Similarity algorithms
def cosine_distance(this, that)
return 1.0 if this === that
a_vector = Hash.new(0).tap{ |v| this.each{ |e| v[e] += 1 } }
b_vector = Hash.new(0).tap{ |v| that.each{ |e| v[e] += 1 } }
a_mag = 0.0
product = a_vector.inject(0.0) do |total, (key, value)|
a_mag += value ** 2
total += value * b_vector[key]
end
b_mag = b_vector.values.each_with_object(2).map(&:**).sum
product / (Math.sqrt(a_mag) * Math.sqrt(b_mag))
end
# http://www.informit.com/articles/article.aspx?p=683059&seqNum=36
def levenshtein_distance(this, that, ins = 2, del = 2, sub = 1)
# ins, del, sub are weighted costs
return nil if this.nil?
return nil if that.nil?
dm = [] # distance matrix
# Initialize first row values
dm[0] = (0..this.length).collect {|i| i * ins }
fill = [0] * (this.length - 1)
# Initialize first column values
(1..that.length).each do |i|
dm[i] = [i * del, fill.flatten]
end
# populate matrix
(1..that.length).each do |i|
(1..this.length).each do |j|
# critical comparison
dm[i][j] = [
dm[i - 1][j - 1] + (this[j - 1] == that[i - 1] ? 0 : sub),
dm[i][j - 1] + ins,
dm[i - 1][j] + del
].min
end
end
# The last value in matrix is the Levenshtein distance between the strings
dm[that.length][this.length]
end
# Get the worst possible value
def levenshtein_maximum(expected, options, ins = 2, del = 2, sub = 1)
diff = options.size - expected.size
adding = expected.size * ins
removing = (diff * del) + (expected.size * sub)
adding >= removing ? adding : removing
end
# Get how precise that is when comparing it with this given the list of options
def levenshtein_precision(this, that, options)
return 0.0 if that.length == 0 && this.length > 0
maximum = levenshtein_maximum(this, options)
(maximum - levenshtein_distance(this, that)).fdiv(maximum)
end
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment