Created
January 27, 2014 09:07
-
-
Save shaggyone/8645347 to your computer and use it in GitHub Desktop.
Ruby implementation of Sift3 strings distance much faster alternative to levenshtein distance
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Ruby version of Siderite's fast and accurate string distance algorith described here: | |
# | |
# http://siderite.blogspot.com/2007/04/super-fast-and-accurate-string-distance.html | |
class StringTools | |
def self.sift3distance(s1, s2, maxOffset, maxDistance) | |
if (s1.try(:length) || 0) == 0 | |
return s2.try(:length) || 0 | |
end | |
if (s2.try(:length) || 0) == 0 | |
return s1.length | |
end | |
if s1 == s2 | |
return 0 | |
end | |
c1 = c2 = lcs = 0 | |
temporaryDistance = nil | |
while (c1 < s1.length) && (c2 < s2.length) | |
if s1[c1] == s2[c2] | |
lcs += 1 | |
else | |
if (c1<c2) | |
c2=c1 | |
else | |
c1=c2 | |
end | |
maxOffset.times do |i| | |
if ((c1 + i < s1.length) && (s1[c1 + i] == s2[c2])) | |
c1 += i | |
break | |
end | |
if ((c2 + i < s2.length) && (s1[c1] == s2[c2 + i])) | |
c2 += i | |
break | |
end | |
end | |
end | |
c1 += 1 | |
c2 += 1 | |
if maxDistance.present? | |
temporaryDistance=(c1+c2)/1.5-lcs | |
if (temporaryDistance>=maxDistance) | |
return temporaryDistance.round | |
end | |
end | |
end | |
return ((s1.length + s2.length) / 1.5 - lcs).round | |
end | |
end |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment