Skip to content

Instantly share code, notes, and snippets.

@phpenterprise
Created June 25, 2019 15:21
Show Gist options
  • Star 19 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save phpenterprise/be9560412824e5b05e1c1b3f38266b57 to your computer and use it in GitHub Desktop.
Save phpenterprise/be9560412824e5b05e1c1b3f38266b57 to your computer and use it in GitHub Desktop.
String Similarity with MySQL
DELIMITER $$
CREATE DEFINER=`root`@`localhost` FUNCTION `COMPARE_STRING`( s1 text, s2 text) RETURNS int(11)
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
DECLARE cv0, cv1 text;
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END$$
DELIMITER ;
DELIMITER $$
CREATE DEFINER=`root`@`localhost` FUNCTION `SIMILARITY_STRING`(a text, b text) RETURNS double
BEGIN
RETURN ABS(((COMPARE_STRING(a, b) / length(b)) * 100) - 100);
END$$
DELIMITER ;
@phpenterprise
Copy link
Author

phpenterprise commented Jun 25, 2019

String Similarity with MySQL

For use to compare two strings/database collumns

image

The need is the mother of the invention: with this thinking and lack of resources of MySQL, unlike other databases like Postgres and Microsoft SQL Server, the MySQL, now the giant Oracle, does not have a native function in its kernel to calculate the similarity between two strings.

So I decided to bring an algorithm I built into another language for MySQL that brings a certain degree of efficiency and precision in the comparison, but as the whole comparative algorithm has between 1-3% margin of imprecision.

Come on!

@InI4
Copy link

InI4 commented Nov 7, 2019

I prefer the following expresstion for the second function:

RETURN ABS(((COMPARE_STRING(a, b) / greatest(length(a),length(b)) ) * 100) - 100);

@phpenterprise
Copy link
Author

I prefer the following expresstion for the second function:

RETURN ABS(((COMPARE_STRING(a, b) / greatest(length(a),length(b)) ) * 100) - 100);

Thanks for sharing. If this is more efficient, surely we have a more reliable version for use in production environment.

@InI4
Copy link

InI4 commented Nov 12, 2019

It is a symmetric solution, if you want the metric to be symmetric might depend on the use case, but if you know nothing beforehand, you should start symmetric.

Major advantage, it does not crash when length(b) = 0. (Still crashes, when both lengths are zero though).

@rodoo9
Copy link

rodoo9 commented Mar 24, 2020

Hi,

Thank you so much for the funcion, I have a little problem i hope you can solve

If I execute this works.
SELECT p.id, SIMILARITY_STRING(p.name, "HELLO2") AS score
FROM products p

If I execute this DOES NOT work.
SELECT p.id, SIMILARITY_STRING(p.name, "HELLO2") AS score
FROM products p
HAVING score > 80

#1366 - Incorrect string value: '\x80' for column 'cv0' at row 1

It seems that the problem are special chracters:

SELECT SIMILARITY_STRING("SPECIALCHARACTER😊", "SPECIALCHARACTER😊");
#1366 - Incorrect string value: '\xF0\x9F\x98\x8A' for column 'a' at row 1

Any idea?

Best,

@jamesgathu
Copy link

Hello friends,

while at it, you may also want to take a look at Levenshtein-MySQL-UDF if you require to use sim1 for ordering or filtering

@editsons
Copy link

How can 'JT Mini Pizza ´¼ü%[ag$%;*' and 'AKTIONSPREIS' return a similarity of 91.66 % ?

@ggb667
Copy link

ggb667 commented Oct 14, 2022

What kind of licensing applies to these gists? Can I use the MIT license or creative commons?

@ggb667
Copy link

ggb667 commented Oct 21, 2022

What exactly is UNHEX(HEX(j)) doing? I thought those were inverse operations, but they aren't. I can't just replace them with j and have it work. I was hoping to speed this up, but adding and index for my column, and fully specifying the table and fields made no difference. Is there any way to make this faster?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment