I am trying to integrate PassJoin into OpenRefine . Here is some documentation and experiment reports.
Summary :
(1) PassJoin is a extremely efficient algorithm for joining and clustering strings based ONLY edit-distance-similarity, without any use of BLOCKING like simile-vicino so it can provide accurate results.
(2) EditDistanceClusterer is a java lib based on PassJoin. It can finish calculating clusters among 0.6M strings with average length of 13.8 in 1 minute while simile-vicino takes 64 minutes on a 4-core machine (tested in Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz).
(3) Simile-vicino uses blocking technique, so it will miss some right results. Experiment in my dataset shows that simile-vicino can only get 73% results compared with EditDistanceClusterer(PassJoin).
Detail/Notes:
(1) Rightness Problem on Simile-vicino
Simile-vicino cannot get right clusters sometimes. Here is one example, say we are calculating clusters from this string set:
abcdeabcde
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde
with edit distance 1 ( which is called 'radius' in simile-vicino ), the logic of Vicino is to iterate all strings and then find all strings AFTER it with edit distance less than threshold and then make all these strings a cluster, after the loop, sort all these clusters by size and return to the caller. Vicino will result in the following result:
=== new cluster ===
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde
=== end cluster ===
=== new cluster ===
abcdeabcde
bbcdeabcde
dbcdeabcde
ebcdeabcde
=== end cluster ===
=== new cluster ===
bbcdeabcde
dbcdeabcde
ebcdeabcde
=== end cluster ===
=== new cluster ===
bbcdeabcdf
bbcdeabcdg
=== end cluster ===
=== new cluster ===
dbcdeabcde
ebcdeabcde
=== end cluster ===
In face, all the six strings should be grouped into one big cluster because edit distances from 'bbcdeabcde' to the other five are all 1. I fix the bug and it can give right results like below. The fixed vicino will be called vicino-fixed in this gist.
=== new cluster ===
abcdeabcde
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde
=== end cluster ===
=== new cluster ===
abcdeabcde
bbcdeabcde
dbcdeabcde
ebcdeabcde
=== end cluster ===
=== new cluster ===
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
=== end cluster ===
(2) Subsets problem
From above, even the result of vicino-fixed is not 'that good'. The last two clusters are subsets of the first one. Maybe the last two should be trimmed? I write a script to erase subsets among sets of sets. But this depends on how 'clustering' is defined so I have not integrate the logic in the java lib.
The results trimmed should be:
=== new cluster ===
abcdeabcde
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde
=== end cluster ===
(3) Since we have discussed the above two problems, I will list the cluster numbers found by three algorithms.
Algorithm | Vicino original | Vicino fixed | EditDistanceClusterer |
---|---|---|---|
cluster number | 214264 | 256007 | 237926 |
cluster number(subsets trimmed) | 129969 | 130563 | 178408 |