I became interested in string clustering through developing fpos, a set of Python scripts for graphing my spending habits. In a clear failure of research I missed the existence of libraries like difflib and python-Levenshtein. Instead, I dove down the path of developing my own string clusterer.
I wanted the approach to be quick: fpos is a command-line tool, and I wanted it to respond to me now, dammit, not a few minutes down the track when I was trying to understand how my spending was tracking against my budget.
String clustering is a fairly computationally expensive problem: picking the best match across clusters using a similarity measure like longest common subsequence or Levenshtein distance can yield a complexity in the order of O(m * n * o^2), where m represents the longest known string length, n represents the input string length, and o represents the number of strings across all clusters.
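For a back-of-the-envelope sense of scale, take the test data described below (3500 strings averaging around 47 characters): m * n * o^2 comes out at roughly 47 * 47 * 3500^2 ≈ 2.7 × 10^10 elementary character comparisons, which is well past the budget for a tool that should respond instantly.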
strgrp is what I've cooked up as a result. Sure, it's cheating by dropping to C for its implementation, but it does have a Python wrapper, and if you're already using difflib then with a small amount of effort you can leverage strgrp's speed from your Python scripts. You can grab the wrapper from the fpos repository.
Having discovered the alternatives after the fact, I was interested in comparing their performance with strgrp. Whilst strgrp doesn't use much CS research to underpin its approach (it really reflects my own exploration of the problem), as shown below the implementation is a large improvement over some comparable (at a high level) difflib- or python-levenshtein-based approaches.
Now, having said that, the research indicates that approximate string matching is at best quadratic, and points to Wagner-Fischer (Levenshtein) as the best algorithm we have. If that's the case, then tricks are all we have left. strgrp hasn't yet got to the point of implementing the optimisations suggested by Berghel and Roach in An Extension of Ukkonen's Enhanced Dynamic Programming ASM Algorithm, or by Allison in Lazy Dynamic-Programming can be Eager, but as outlined in the CCAN info page it implements some alternative measures to cut down on the runtime. The tricks include:
- Only comparing input strings with a well-defined subset of the currently clustered strings
- Caching clusters in a map for exact match lookups
- Using functions with cheaper worst-case complexities to filter out groups prior to computing the longest common subsequence, and
- Distributing comparisons across multiple threads using OpenMP
On this last point, Python's reference implementation (CPython) is hamstrung by the Global Interpreter Lock, and only multi-process concurrency will help. From the threading library documentation:
> CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
I haven't fully investigated implementing such an approach for comparison, so in the testing below the difflib and python-levenshtein solutions are single-threaded. This is possibly unfair, as the strgrp solutions benefit from unintrusive multi-threading via OpenMP, but it might also be seen as a benefit of dropping to C for more control. Finally, the abstraction provided by difflib's get_close_matches() prevents the user from opting into multi-threading: the task that can be executed in parallel is calculating the match scores of the input string against each cluster, so it is up to difflib to implement a multi-threaded approach, and it appears this is not the case.
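To give a sense of what a fair multi-process comparison might involve, here is a minimal, hypothetical sketch (not used in the benchmarks below) of fanning the per-cluster scoring out over worker processes with concurrent.futures; best_in_chunk and best_match are illustrative names, and the cluster keys are assumed to be held in a list:

    import difflib
    from concurrent.futures import ProcessPoolExecutor

    def best_in_chunk(args):
        string, keys = args
        # Score the candidate string against one chunk of cluster keys
        return max(((difflib.SequenceMatcher(None, string, k).ratio(), k)
                    for k in keys), default=(0.0, None))

    def best_match(string, keys, pool, nchunks=4):
        # Fan the per-cluster comparisons out across the worker processes
        size = max(1, len(keys) // nchunks)
        chunks = [(string, keys[i:i + size]) for i in range(0, len(keys), size)]
        return max(pool.map(best_in_chunk, chunks), default=(0.0, None))

The pool would be created once (pool = ProcessPoolExecutor()) and reused across inputs to amortise the process start-up cost; even so, pickling the keys on every call is overhead that the OpenMP approach doesn't pay.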
Test hardware: Thinkpad X201s (2010)
- RAM: 3813MiB
- CPU: x86_64 Intel(R) Core(TM) i7 CPU L 640 @ 2.13GHz (4 cores)
Test software:
- GCC (Gentoo 4.9.3 p1.0, pie-0.6.2) 4.9.3
- Python 3.4.3
The sections below analyse six clustering implementations and their absolute and relative performance:
- Naive clustering using difflib (gcm0) and python-levenshtein (pylev0)
- Slightly less naive clustering using difflib (gcm1) and python-levenshtein (pylev1)
- Native strgrp clustering (strgrp)
- strgrp clustering through the Python wrapper (pystrgrp)
The shape of the test data according to Python's Pandas is:
>>> df.describe()
count 3500
unique 1957
...
>>> df.applymap(len).describe()
count 3500.000000
mean 46.896571
std 14.662562
min 4.000000
25% 37.000000
50% 44.000000
75% 60.000000
max 97.000000
The test strategy is to execute each approach with a number of different input set sizes from the data set described above, executing ten runs for each input set. Input sizes increase in steps of 500 from 500 to 3500. The runtimes are all in seconds and were calculated using the 'real' output of time(1).
The table below compares the mean run times in seconds for each approach across the different input sizes:
N | gcm0 | gcm1 | pylev0 | pylev1 | pystrgrp | strgrp |
---|---|---|---|---|---|---|
500 | 1.005 | 0.384 | 0.66 | 0.31 | 0.05 | 0.01 |
1000 | 3.266 | 1.355 | 2.653 | 1.191 | 0.09 | 0.041 |
1500 | 6.953 | 3.295 | 6.009 | 3.021 | 0.17 | 0.105 |
2000 | 11.055 | 5.842 | 9.658 | 5.36 | 0.253 | 0.173 |
2500 | 16.2 | 9.712 | 14.34 | 8.166 | 0.389 | 0.29 |
3000 | 22.917 | 12.745 | 19.46 | 11.191 | 0.524 | 0.396 |
3500 | 30.935 | 18.289 | 25.801 | 16.233 | 0.684 | 0.521 |
Similarly, the speedup that strgrp provides, taken as the ratio of mean runtimes between each pair of approaches, is shown below:
Count | gcm0:pystrgrp | gcm1:pystrgrp | pylev0:pystrgrp | pylev1:pystrgrp | pystrgrp:strgrp |
---|---|---|---|---|---|
500 | 20.1 | 7.7 | 13.2 | 6.2 | 5.0 |
1000 | 36.3 | 15.1 | 29.5 | 13.2 | 2.2 |
1500 | 40.9 | 19.4 | 35.3 | 17.8 | 1.6 |
2000 | 43.7 | 23.1 | 38.2 | 21.2 | 1.5 |
2500 | 41.6 | 25.0 | 36.9 | 21.0 | 1.3 |
3000 | 43.7 | 24.3 | 37.1 | 21.4 | 1.3 |
3500 | 45.2 | 26.7 | 37.7 | 23.7 | 1.3 |
- strgrp is much faster than both difflib and python-levenshtein for string clustering
- strgrp is easy to integrate into existing Python scripts via its wrapper
Naive clustering implementation using get_close_matches():
import difflib

def gcm0(strings):
    clusters = {}
    for string in (x.strip() for x in strings):
        # Find at most one existing cluster key scoring at least 0.85
        match = difflib.get_close_matches(string, clusters.keys(), 1, 0.85)
        if match:
            clusters[match[0]].append(string)
        else:
            clusters[string] = [ string ]
    return clusters
gcm0 raw runtimes in seconds, ten runs per input size:
N | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
500 | 1.60 | 0.92 | 0.96 | 0.92 | 0.95 | 0.93 | 0.93 | 0.96 | 0.96 | 0.92 |
1000 | 3.21 | 3.20 | 3.27 | 3.22 | 3.31 | 3.29 | 3.23 | 3.26 | 3.33 | 3.34 |
1500 | 6.98 | 6.86 | 6.91 | 6.89 | 6.92 | 6.92 | 6.87 | 7.05 | 7.12 | 7.01 |
2000 | 11.31 | 10.75 | 10.64 | 10.96 | 10.90 | 10.96 | 11.12 | 10.85 | 11.20 | 11.86 |
2500 | 15.52 | 15.88 | 15.27 | 15.74 | 15.70 | 15.68 | 15.31 | 20.96 | 16.34 | 15.60 |
3000 | 19.99 | 26.32 | 20.14 | 20.06 | 26.84 | 26.58 | 20.60 | 20.00 | 28.33 | 20.31 |
3500 | 33.28 | 35.14 | 30.04 | 28.28 | 30.76 | 25.92 | 33.86 | 26.17 | 33.62 | 32.28 |
The corresponding naive python-levenshtein implementation, where grpscore is a record pairing a group key with its score:

import collections
import Levenshtein

# Pair a group key with its similarity score against the input string
grpscore = collections.namedtuple("grpscore", ["grp", "score"])

def pylev0(strings):
    # Levenshtein.ratio is python-levenshtein's difflib-style similarity
    f = Levenshtein.ratio
    clusters = {}
    for s in (x.strip() for x in strings):
        if not clusters.keys():
            scoreMax = grpscore(None, 0)
        else:
            # Score the input against every cluster key and keep the best
            scoreMax = max([ grpscore(k, f(k, s)) for k in clusters.keys() ],
                           key = lambda x: x.score)
        if 0.85 <= scoreMax.score:
            clusters[scoreMax.grp].append(s)
        else:
            clusters[s] = [ s ]
    return clusters
The slightly less naive difflib clustering implementation accounts for exact matches before invoking get_close_matches():
import difflib

def gcm1(strings):
    clusters = {}
    for string in (x.strip() for x in strings):
        # Cheap exact-match lookup before the expensive fuzzy search
        if string in clusters:
            clusters[string].append(string)
        else:
            match = difflib.get_close_matches(string, clusters.keys(), 1, 0.85)
            if match:
                clusters[match[0]].append(string)
            else:
                clusters[string] = [ string ]
    return clusters
gcm1 raw runtimes in seconds, ten runs per input size:
N | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
500 | 0.38 | 0.39 | 0.40 | 0.38 | 0.39 | 0.38 | 0.38 | 0.38 | 0.38 | 0.38 |
1000 | 1.34 | 1.35 | 1.33 | 1.38 | 1.35 | 1.37 | 1.34 | 1.33 | 1.37 | 1.39 |
1500 | 3.26 | 3.32 | 3.34 | 3.22 | 3.27 | 3.37 | 3.31 | 3.29 | 3.29 | 3.28 |
2000 | 5.81 | 5.78 | 5.84 | 5.90 | 5.80 | 5.99 | 5.82 | 5.79 | 5.86 | 5.83 |
2500 | 8.58 | 8.52 | 8.51 | 8.47 | 13.74 | 8.45 | 8.33 | 8.39 | 8.49 | 15.64 |
3000 | 11.26 | 11.34 | 11.08 | 14.36 | 14.13 | 11.44 | 12.06 | 17.21 | 11.04 | 13.53 |
3500 | 18.88 | 19.74 | 18.00 | 15.81 | 15.97 | 23.27 | 15.99 | 16.03 | 23.35 | 15.85 |
The slightly less naive python-levenshtein implementation applies the same exact-match shortcut:

def pylev1(strings):
    f = Levenshtein.ratio
    clusters = {}
    for s in (x.strip() for x in strings):
        # Cheap exact-match lookup before scoring every cluster key
        if s in clusters:
            clusters[s].append(s)
        else:
            if not clusters.keys():
                scoreMax = grpscore(None, 0)
            else:
                scoreMax = max([ grpscore(k, f(k, s)) for k in clusters.keys() ],
                               key = lambda x: x.score)
            if 0.85 <= scoreMax.score:
                clusters[scoreMax.grp].append(s)
            else:
                clusters[s] = [ s ]
    return clusters
The driver used for timing strgrp's implementation was as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "ccan/strgrp/strgrp.h"

#define BUF_SIZE 512

int main(void) {
    FILE *f;
    char *buf;
    struct strgrp *ctx;
    f = fdopen(0, "r");
    buf = malloc(BUF_SIZE);
    ctx = strgrp_new(0.85);
    while (fgets(buf, BUF_SIZE, f)) {
        /* Strip the line ending before clustering */
        buf[strcspn(buf, "\r\n")] = '\0';
        if (!strgrp_add(ctx, buf, NULL)) {
            printf("Failed to classify %s\n", buf);
        }
    }
    strgrp_free(ctx);
    free(buf);
    fclose(f);
    return 0;
}
strgrp raw runtimes in seconds, ten runs per input size:
N | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
500 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
1000 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 |
1500 | 0.11 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 |
2000 | 0.18 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 |
2500 | 0.29 | 0.28 | 0.28 | 0.28 | 0.28 | 0.28 | 0.28 | 0.28 | 0.28 | 0.28 |
3000 | 0.39 | 0.38 | 0.38 | 0.38 | 0.38 | 0.39 | 0.38 | 0.38 | 0.38 | 0.39 |
3500 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 |
Using the strgrp Python wrapper:
from pystrgrp import Strgrp

def pystrgrp(strings):
    clusters = Strgrp(0.85)
    for string in (x.strip() for x in strings):
        clusters.add(string, None)
    return clusters
pystrgrp raw runtimes in seconds, ten runs per input size:
N | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
500 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 |
1000 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 |
1500 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 |
2000 | 0.25 | 0.25 | 0.25 | 0.26 | 0.25 | 0.26 | 0.26 | 0.25 | 0.25 | 0.25 |
2500 | 0.39 | 0.39 | 0.38 | 0.39 | 0.39 | 0.39 | 0.39 | 0.39 | 0.39 | 0.39 |
3000 | 0.53 | 0.53 | 0.52 | 0.52 | 0.52 | 0.52 | 0.52 | 0.53 | 0.53 | 0.52 |
3500 | 0.68 | 0.68 | 0.68 | 0.68 | 0.68 | 0.68 | 0.69 | 0.68 | 0.69 | 0.7 |
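The wrapper timings above were gathered by piping the data set in over stdin, as with the other implementations; a minimal driver for that might look like the following, reusing the Strgrp(0.85) and add() calls from the listing above:

    import sys
    from pystrgrp import Strgrp

    if __name__ == "__main__":
        # Mirror the C driver: cluster whatever arrives on stdin
        clusters = Strgrp(0.85)
        for line in sys.stdin:
            clusters.add(line.strip(), None)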
- Correctness: sort each implementation's output with the script below, then diff the results across implementations
#!/usr/bin/python3
import sys

def printgrps(txtgrps):
    # Parse "key" lines followed by tab-indented member lines into a dict
    grps = {}
    grp = []
    for l in txtgrps:
        sl = l.strip()
        if "" == sl:
            pass
        elif l.startswith("\t"):
            grp.append(sl)
        else:
            grps[sl] = []
            grp = grps[sl]
    # Emit groups and members in sorted order for a stable diff
    for k in sorted(grps.keys()):
        print(k)
        for v in sorted(grps[k]):
            print("\t{}".format(v))
        print()

if __name__ == "__main__":
    printgrps(sys.stdin)
- Performance measurement: all of the timing tables were made with the help of the rough-and-ready Bash script below
#!/bin/bash
MAX_VALS=$1
DATA=$2
shift 2
# awk script to collate the `time -p` output into one row per sample size
read -r -d '' AWK_TIMER <<EOF
BEGIN { samples = 0; timer = "off"; rd[0] = 0; rl = 0 }
/sampling start/ { samples = \$3 }
/time start/ { timer = "on" }
/time stop/ { timer = "off" }
/real/ { if (timer == "on") { rd[rl++] = \$2 } }
/sampling stop/ { printf("%s, ", samples); for (e in rd) printf("%s, ", rd[e]); printf("\n"); rd[0] = 0; rl = 0; }
EOF
for i in $( seq 500 500 $MAX_VALS )
do
    echo sampling start $i
    for j in $( seq 1 10 )
    do
        echo time start
        tail -n $i $DATA | time -p $@ > /dev/null
        echo time stop
    done;
    echo sampling stop $i
done 2>&1 | awk "$AWK_TIMER"
Sorry, I only just saw your comment @bacloud14. I had forgotten I had omitted that function. What we want, given some candidate string and the current set of groups, is to calculate a percentage similarity for each group against the candidate string and find the best match. So the question is: how do we generate a representative value in [0.0, 1.0] from the length of the longest common subsequence? First, some basic rules:

- grpscore() must return 1.0 for strings that are exactly similar, and
- grpscore() must return 0.0 for strings that have no common subsequence.

Assuming lcs() is a function that returns the actual subsequence string, len(lcs(key, candidate)) is the interesting input property that we want to convert into a percentage. The way to do this is somehow via the length of the input strings. A naive approach is simply something like:
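    def grpscore(key, lcs_len):
        # lcs_len is len(lcs(key, candidate)): score the candidate by the
        # proportion of the key that the common subsequence covers
        return lcs_len / len(key)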
where we then do:
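    score = grpscore(key, len(lcs(key, candidate)))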
as in the code above. LCS is commutative, so lcs(key, candidate) == lcs(candidate, key), but it may simultaneously be true that len(key) != len(candidate), which means that the naive implementation of grpscore() is not commutative. This is bad.

To demonstrate, consider the case where key is a complete subsequence of candidate and len(key) < len(candidate) (e.g. key is a prefix of candidate): this will produce len(lcs(key, candidate)) == len(key), and therefore grpscore(key, len(lcs(key, candidate))) == 1.0. Now, assume that we swap key and candidate such that keyp, candidatep = candidate, key: in this case, len(lcs(keyp, candidatep)) < len(keyp), and therefore grpscore(keyp, len(lcs(keyp, candidatep))) < 1.0, despite the fact that we're still dealing with the same two strings.

A non-commutative grpscore() means that the clustering is unstable with respect to input order, so we need to make it commutative to avoid insanity. We defined two properties for grpscore() up the top; now we'll add a third:

- grpscore() must be commutative.

One way to associate two lengths in two dimensions is with Pythagoras' theorem. With a remodeling of the grpscore() function signature from what's used in the code listings in the gist, what we can do here is take a ratio of the hypotenuses:
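    import math

    def grpscore(key, candidate):
        lcs_len = len(lcs(key, candidate))
        # Hypotenuse of the LCS length against itself, over the hypotenuse
        # of the two input lengths
        return math.sqrt(2 * lcs_len * lcs_len) / \
            math.sqrt(len(key) ** 2 + len(candidate) ** 2)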
Does our new grpscore() above satisfy our three requirements?

For the key == candidate case, it's trivial that len(key) == len(candidate) and len(lcs(key, candidate)) == len(key), as key is trivially a complete subsequence of candidate by definition. By substitution it's obvious that under these conditions grpscore(key, candidate) == 1.0.

For the no-common-subsequence case, len(lcs(key, candidate)) == 0, which makes the numerator of the division zero, therefore grpscore(key, candidate) == 0.0.

For the commutative case, the numerator is already commutative because lcs() is, and the denominator performs addition of a function of both of the input lengths; addition is commutative, therefore grpscore() is now commutative.
The remaining question is whether it's linear (i.e. whether we have a proper percentage, or something warped). The strings 'aa' and 'ab' are 50% similar. Our improved grpscore() gives us:

a. math.sqrt(2 * 1 * 1) / math.sqrt(2 * 2 + 2 * 2)
b. = math.sqrt(2) / math.sqrt(8)

Per mathwarehouse, the quotient of two square roots is the square root of the quotient (math.sqrt(a) / math.sqrt(b) == math.sqrt(a / b)), so:

c. = math.sqrt(2 / 8)
d. = 0.5
So from a stupidly brief inspection it looks reasonable. However, given execution time is also a concern, by backing up slightly we can apply the mathwarehouse optimisation to our implementation to reduce the number of invocations of math.sqrt():
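    def grpscore(key, candidate):
        lcs_len = len(lcs(key, candidate))
        # math.sqrt(a) / math.sqrt(b) == math.sqrt(a / b): one sqrt, not two
        return math.sqrt((2 * lcs_len * lcs_len) /
                         (len(key) ** 2 + len(candidate) ** 2))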
Hope that helps, even if it is a bit of a late reply 🙂