Skip to content

Instantly share code, notes, and snippets.

@lispc

lispc/report.md Secret

Last active May 1, 2023 22:04
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save lispc/d4ee8de81e7bfa6bc352 to your computer and use it in GitHub Desktop.
Save lispc/d4ee8de81e7bfa6bc352 to your computer and use it in GitHub Desktop.
A brief comparison on edit distance based string clustering between simile-vicino and a passjoin-based library

I am trying to integrate PassJoin into OpenRefine . Here is some documentation and experiment reports.

Summary :
(1) PassJoin is a extremely efficient algorithm for joining and clustering strings based ONLY edit-distance-similarity, without any use of BLOCKING like simile-vicino so it can provide accurate results.
(2) EditDistanceClusterer is a java lib based on PassJoin. It can finish calculating clusters among 0.6M strings with average length of 13.8 in 1 minute while simile-vicino takes 64 minutes on a 4-core machine (tested in Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz).
(3) Simile-vicino uses blocking technique, so it will miss some right results. Experiment in my dataset shows that simile-vicino can only get 73% results compared with EditDistanceClusterer(PassJoin).

Detail/Notes:
(1) Rightness Problem on Simile-vicino
Simile-vicino cannot get right clusters sometimes. Here is one example, say we are calculating clusters from this string set:

abcdeabcde
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde

with edit distance 1 ( which is called 'radius' in simile-vicino ), the logic of Vicino is to iterate all strings and then find all strings AFTER it with edit distance less than threshold and then make all these strings a cluster, after the loop, sort all these clusters by size and return to the caller. Vicino will result in the following result:

=== new cluster === 
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde
=== end cluster === 
=== new cluster === 
abcdeabcde
bbcdeabcde
dbcdeabcde
ebcdeabcde
=== end cluster === 
=== new cluster === 
bbcdeabcde
dbcdeabcde
ebcdeabcde
=== end cluster === 
=== new cluster === 
bbcdeabcdf
bbcdeabcdg
=== end cluster === 
=== new cluster === 
dbcdeabcde
ebcdeabcde
=== end cluster === 

In face, all the six strings should be grouped into one big cluster because edit distances from 'bbcdeabcde' to the other five are all 1. I fix the bug and it can give right results like below. The fixed vicino will be called vicino-fixed in this gist.

=== new cluster ===
abcdeabcde
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde
=== end cluster ===
=== new cluster ===
abcdeabcde
bbcdeabcde
dbcdeabcde
ebcdeabcde
=== end cluster ===
=== new cluster ===
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
=== end cluster ===

(2) Subsets problem
From above, even the result of vicino-fixed is not 'that good'. The last two clusters are subsets of the first one. Maybe the last two should be trimmed? I write a script to erase subsets among sets of sets. But this depends on how 'clustering' is defined so I have not integrate the logic in the java lib. The results trimmed should be:

=== new cluster ===
abcdeabcde
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde
=== end cluster ===

(3) Since we have discussed the above two problems, I will list the cluster numbers found by three algorithms.

Algorithm Vicino original Vicino fixed EditDistanceClusterer
cluster number 214264 256007 237926
cluster number(subsets trimmed) 129969 130563 178408
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment