Skip to content

Instantly share code, notes, and snippets.

@tomstuart
Last active May 26, 2023 23:12
Show Gist options
  • Star 3 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save tomstuart/9e4fd5cd96527debf7a685d0b5399635 to your computer and use it in GitHub Desktop.
Save tomstuart/9e4fd5cd96527debf7a685d0b5399635 to your computer and use it in GitHub Desktop.
Approximate substring matching with Levenshtein distance
module Levenshtein
def self.substring_distance(needle, haystack)
distances = Array.new(haystack.length.succ, 0)
needle.each_char.with_index do |needle_char, needle_index|
next_distances = [needle_index.succ]
haystack.each_char.with_index do |haystack_char, haystack_index|
deletion, insertion, substitution =
distances[haystack_index.succ].succ,
next_distances[haystack_index].succ,
distances[haystack_index] + (needle_char == haystack_char ? 0 : 1)
next_distances.push([deletion, insertion, substitution].min)
end
distances = next_distances
end
distances.min
end
end
require 'levenshtein'
RSpec.describe 'Levenshtein substring edit distance' do
[
'Machu Picchu', 'Machu Picchu', 0,
'Machu Picchu', 'Where is Machu Picchu?', 0,
'Machu Picchu', 'Where is Machu Pichu?', 1,
'Machu Picchu', 'Where is Macchu Picchu?', 1,
'Stonehenge', 'How do I get to Stone Henge?', 2,
'', '', 0,
'', 'Machu Picchu', 0,
'u', 'Machu Picchu', 0,
'z', 'Machu Picchu', 1,
'Uluru', '', 5
].each_slice(3) do |needle, haystack, distance|
specify "the shortest distance between “#{needle}” and part of “#{haystack}” is #{distance}" do
expect(Levenshtein.substring_distance(needle, haystack)).to eq distance
end
end
end
@tomstuart
Copy link
Author

Based on this information from https://en.wikipedia.org/wiki/Approximate_string_matching#Problem_formulation_and_algorithms:

Computing E(m, j) is very similar to computing the edit distance between two strings. In fact, we can use the Levenshtein distance computing algorithm for E(m, j), the only difference being that we must initialize the first row with zeros, and save the path of computation, that is, whether we used E(i − 1, j), E(i, j − 1) or E(i − 1, j − 1) in computing E(i, j).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment