lispc/report.md Secret

## report.md

      
    Raw
  

              report.md
            
          
    I am trying to integrate PassJoin into OpenRefine . Here is some documentation and experiment reports.
Summary :

(1) PassJoin is a extremely efficient algorithm for joining and clustering strings based ONLY edit-distance-similarity, without any use of BLOCKING like simile-vicino so it can provide accurate results.

(2) EditDistanceClusterer is a java lib based on PassJoin. It can finish calculating clusters among 0.6M strings with average length of 13.8 in 1 minute while simile-vicino takes 64 minutes on a 4-core machine (tested in Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz).

(3) Simile-vicino uses blocking technique, so it will miss some right results. Experiment in my dataset shows that simile-vicino can only get 73% results compared with EditDistanceClusterer(PassJoin).
Detail/Notes:

(1) Rightness Problem on Simile-vicino

Simile-vicino cannot get right clusters sometimes. Here is one example, say we are calculating clusters from this string set:
abcdeabcde
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde

with edit distance 1 ( which is called 'radius' in simile-vicino ),  the logic of Vicino is to iterate all strings and then find all strings AFTER it with edit distance less than threshold and then make all these strings a cluster, after the loop, sort all these clusters by size and return to the caller.  Vicino will result in the following result:
=== new cluster === 
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde
=== end cluster === 
=== new cluster === 
abcdeabcde
bbcdeabcde
dbcdeabcde
ebcdeabcde
=== end cluster === 
=== new cluster === 
bbcdeabcde
dbcdeabcde
ebcdeabcde
=== end cluster === 
=== new cluster === 
bbcdeabcdf
bbcdeabcdg
=== end cluster === 
=== new cluster === 
dbcdeabcde
ebcdeabcde
=== end cluster === 

In face, all the six strings should be grouped into one big cluster because edit distances from 'bbcdeabcde' to the other five are all 1. I fix the bug and it can give right results like below. The fixed vicino will be called vicino-fixed in this gist.
=== new cluster ===
abcdeabcde
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde
=== end cluster ===
=== new cluster ===
abcdeabcde
bbcdeabcde
dbcdeabcde
ebcdeabcde
=== end cluster ===
=== new cluster ===
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
=== end cluster ===

(2) Subsets problem

From above, even the result of vicino-fixed is not 'that good'. The last two clusters are subsets of the first one. Maybe the last two should be trimmed? I write a script to erase subsets among sets of sets. But this depends on how 'clustering' is defined so I have not integrate the logic in the java lib.
The results trimmed should be:
=== new cluster ===
abcdeabcde
bbcdeabcde
bbcdeabcdf
bbcdeabcdg
dbcdeabcde
ebcdeabcde
=== end cluster ===

(3)
Since we have discussed the above two problems, I will list the cluster numbers found by three algorithms.


Algorithm
Vicino original
Vicino fixed
EditDistanceClusterer


cluster number
214264
256007
237926


cluster number(subsets trimmed)
129969
130563
178408
Algorithm	Vicino original	Vicino fixed	EditDistanceClusterer
cluster number	214264	256007	237926
cluster number(subsets trimmed)	129969	130563	178408