Instantly share code, notes, and snippets.

Embed
What would you like to do?
--[[
Function: EditDistance
Finds the edit distance between two strings or tables. Edit distance is the minimum number of
edits needed to transform one string or table into the other.
Parameters:
s - A *string* or *table*.
t - Another *string* or *table* to compare against s.
lim - An *optional number* to limit the function to a maximum edit distance. If specified
and the function detects that the edit distance is going to be larger than limit, limit
is returned immediately.
Returns:
A *number* specifying the minimum edits it takes to transform s into t or vice versa. Will
not return a higher number than lim, if specified.
Example:
:EditDistance( "Tuesday", "Teusday" ) -- One transposition.
:EditDistance( "kitten", "sitting" ) -- Two substitutions and a deletion.
returns...
:1
:3
Notes:
* Complexity is O( (#t+1) * (#s+1) ) when lim isn't specified.
* This function can be used to compare array-like tables as easily as strings.
* The algorithm used is Damerau–Levenshtein distance, which calculates edit distance based
off number of subsitutions, additions, deletions, and transpositions.
* Source code for this function is based off the Wikipedia article for the algorithm
<http://en.wikipedia.org/w/index.php?title=Damerau%E2%80%93Levenshtein_distance&oldid=351641537>.
* This function is case sensitive when comparing strings.
* If this function is being used several times a second, you should be taking advantage of
the lim parameter.
* Using this function to compare against a dictionary of 250,000 words took about 0.6
seconds on my machine for the word "Teusday", around 10 seconds for very poorly
spelled words. Both tests used lim.
Revisions:
v1.00 - Initial.
]]
function EditDistance( s, t, lim )
local s_len, t_len = #s, #t -- Calculate the sizes of the strings or arrays
if lim and math.abs( s_len - t_len ) >= lim then -- If sizes differ by lim, we can stop here
return lim
end
-- Convert string arguments to arrays of ints (ASCII values)
if type( s ) == "string" then
s = { string.byte( s, 1, s_len ) }
end
if type( t ) == "string" then
t = { string.byte( t, 1, t_len ) }
end
local min = math.min -- Localize for performance
local num_columns = t_len + 1 -- We use this a lot
local d = {} -- (s_len+1) * (t_len+1) is going to be the size of this array
-- This is technically a 2D array, but we're treating it as 1D. Remember that 2D access in the
-- form my_2d_array[ i, j ] can be converted to my_1d_array[ i * num_columns + j ], where
-- num_columns is the number of columns you had in the 2D array assuming row-major order and
-- that row and column indices start at 0 (we're starting at 0).
for i=0, s_len do
d[ i * num_columns ] = i -- Initialize cost of deletion
end
for j=0, t_len do
d[ j ] = j -- Initialize cost of insertion
end
for i=1, s_len do
local i_pos = i * num_columns
local best = lim -- Check to make sure something in this row will be below the limit
for j=1, t_len do
local add_cost = (s[ i ] ~= t[ j ] and 1 or 0)
local val = min(
d[ i_pos - num_columns + j ] + 1, -- Cost of deletion
d[ i_pos + j - 1 ] + 1, -- Cost of insertion
d[ i_pos - num_columns + j - 1 ] + add_cost -- Cost of substitution, it might not cost anything if it's the same
)
d[ i_pos + j ] = val
-- Is this eligible for tranposition?
if i > 1 and j > 1 and s[ i ] == t[ j - 1 ] and s[ i - 1 ] == t[ j ] then
d[ i_pos + j ] = min(
val, -- Current cost
d[ i_pos - num_columns - num_columns + j - 2 ] + add_cost -- Cost of transposition
)
end
if lim and val < best then
best = val
end
end
if lim and best >= lim then
return lim
end
end
return d[ #d ]
end
@reidoda

This comment has been minimized.

reidoda commented Oct 19, 2018

Hi I'd like to use this code for a project at my company. Could you choose an open source license so that I can know whether or not you will permit this? I'm happy to speak directly with you about how it will be used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment