@amboar
Last active June 8, 2022 08:39
String Clustering Performance

String Clustering with Python (and C)

I became interested in string clustering through developing fpos, a set of Python scripts for graphing my spending habits. In a clear failure of research I missed the existence of libraries that already tackled the problem.

Rather, I dove off down the path of developing my own string clusterer. I wanted the approach to be quick; fpos is a commandline tool and I wanted it to respond to me now, dammit, not a few minutes down the track when I was trying to understand how my spending was tracking my budget.

String clustering is a fairly computationally expensive problem: picking the best match across clusters using a similarity measure like longest common subsequence or Levenshtein distance can yield a complexity on the order of O(m * n * o^2), where m is the longest known string length, n is the input string length and o is the number of strings across all clusters. Each new string has to be scored against every cluster formed so far, and each of those scores is itself an O(m * n) dynamic-programming computation, which is where the o^2 factor comes from.
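To put rough numbers on that, here's a back-of-envelope estimate using the shape of the test data described later (about 3500 strings averaging 47 characters). The 1/2 factor reflects that the i-th string is only compared against the clusters formed so far:

# Clustering o strings of average length L costs on the order of o^2 / 2
# pairwise comparisons, each an LCS of roughly L^2 character operations.
o, L = 3500, 47
print("%.1e elementary comparisons" % (o * o / 2 * L * L))  # ~1.4e10

That order of magnitude is roughly consistent with the tens of seconds the naive implementations take at the largest input size below.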

strgrp is what I've cooked up as a result. Sure, it's cheating by dropping to C for its implementation, but it does have a Python wrapper, and with a small amount of effort, if you're already using difflib, you can leverage strgrp's speed from your Python scripts. You can grab the wrapper from the fpos repository.

Having discovered the alternatives after the fact, I was interested in comparing their performance with strgrp. Whilst strgrp doesn't use much CS research to underpin its approach (and really reflects my own exploration of the problem), as shown below the implementation is a large improvement over some comparable (at a high level) difflib- or python-levenshtein-based approaches.

Now, having said that, the research indicates that approximate string matching is at best quadratic, and points to Wagner-Fischer (Levenshtein) as the best algorithm we have. If that is the case, then tricks are all we have left. strgrp's tricks haven't yet got to the point of implementing the optimisations suggested by Berghel and Roach in An Extension of Ukkonen's Enhanced Dynamic Programming ASM Algorithm, or by Allison in Lazy Dynamic-Programming can be Eager, but as outlined in the CCAN info page it implements some alternative measures to cut down on the runtime. The tricks include:

  • Only comparing input strings with a well-defined subset of the currently clustered strings
  • Caching clusters in a map for exact match lookups
  • Using functions with cheaper worst-case complexities to filter out groups prior to computing the longest common subsequence (a sketch follows this list), and
  • Distributing comparisons across multiple threads using OpenMP
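To illustrate the third trick, here is a sketch only, not strgrp's actual heuristics, and it assumes a similarity score of len(lcs(a, b)) / max(len(a), len(b)). Both checks below are genuine upper bounds on that score, so any group that fails them can be skipped without running the expensive LCS computation:

from collections import Counter

def might_match(a, b, threshold):
    # The LCS can be no longer than the shorter string...
    la, lb = len(a), len(b)
    if min(la, lb) / max(la, lb) < threshold:
        return False
    # ...and no longer than the number of characters the two strings share
    # (counted with multiplicity), which is cheap to compute.
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(la, lb) >= threshold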

On this last point, Python's reference implementation (CPython) is hamstrung and only multi-process concurrency will help. From the threading library documentation:

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

I haven't fully investigated implementing such an approach for comparison, so in the testing below the difflib and python-levenshtein solutions are single-threaded. This is possibly unfair, as the strgrp solutions benefit from unintrusive multi-threading via OpenMP, but it might also be seen as a benefit of dropping to C for more control. Finally, the abstraction provided by difflib's get_close_matches() prevents the user from choosing multi-threading: the task that can be executed in parallel is calculating the match scores of the input string against each cluster, so it is up to difflib to implement a multi-threaded approach, and it appears it does not.
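Should one want to experiment anyway, a process-pool variant of the scoring step might look something like the sketch below. This is an illustration only: best_match and _score are hypothetical helpers, and per-call pool start-up plus pickling the key list would likely eat much of the gain.

from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher
from functools import partial

def _score(candidate, key):
    # Module-level so the process pool can pickle it.
    return key, SequenceMatcher(None, key, candidate).ratio()

def best_match(candidate, keys, workers=4):
    # Returns the (key, score) pair of the best-scoring cluster key.
    keys = list(keys)
    if not keys:
        return None, 0.0
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return max(pool.map(partial(_score, candidate), keys),
                   key=lambda pair: pair[1])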

Hardware, Tests and Data

Test hardware: Thinkpad X201s (2010)

  • RAM: 3813MiB
  • CPU: x86_64 Intel(R) Core(TM) i7 CPU L 640 @ 2.13GHz (4 cores)

Test software:

  • GCC (Gentoo 4.9.3 p1.0, pie-0.6.2) 4.9.3
  • Python 3.4.3

The sections below analyse six clustering implementations and their absolute and relative performance:

  • Naive clustering using difflib and python-levenshtein
  • Slightly less naive clustering using difflib and python-levenshtein
  • Native strgrp clustering
  • strgrp clustering through the Python wrapper

The shape of the test data according to Python's Pandas is:

    >>> df.describe()
    count                   3500
    unique                  1957
    ...
    >>> df.applymap(len).describe()
    count  3500.000000
    mean     46.896571
    std      14.662562
    min       4.000000
    25%      37.000000
    50%      44.000000
    75%      60.000000
    max      97.000000

The test strategy is to execute each approach against a number of different input set sizes drawn from the data set described above, running each input size ten times. Input sizes increase in steps of 500, from 500 to 3500. The runtimes are all in seconds and were taken from the 'real' line of time(1) output (the driver script is listed under Methodology below).

Performance Comparison Charts

The graph and table below compare the average run times (in seconds) for each approach across the different input sizes:

[Chart: gcm0 vs gcm1 vs pystrgrp vs strgrp]

N       gcm0     gcm1    pylev0   pylev1  pystrgrp  strgrp
500     1.005    0.384    0.660    0.310     0.050   0.010
1000    3.266    1.355    2.653    1.191     0.090   0.041
1500    6.953    3.295    6.009    3.021     0.170   0.105
2000   11.055    5.842    9.658    5.360     0.253   0.173
2500   16.200    9.712   14.340    8.166     0.389   0.290
3000   22.917   12.745   19.460   11.191     0.524   0.396
3500   30.935   18.289   25.801   16.233     0.684   0.521

Similarly, the performance improvement that strgrp provides is shown below:

Implementation Performance Ratios

Count  gcm0:pystrgrp  gcm1:pystrgrp  pylev0:pystrgrp  pylev1:pystrgrp  pystrgrp:strgrp
500             20.1            7.7             13.2              6.2              5.0
1000            36.3           15.1             29.5             13.2              2.2
1500            40.9           19.4             35.3             17.8              1.6
2000            43.7           23.1             38.2             21.2              1.5
2500            41.6           25.0             36.9             21.0              1.3
3000            43.7           24.3             37.1             21.4              1.3
3500            45.2           26.7             37.7             23.7              1.3
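For reference, averages and ratios like those above can be recomputed from the raw per-run tables in the measurement sections below with a few lines of pandas (the CSV file names here are hypothetical; each is assumed to hold an N column plus the ten run-time columns):

import pandas as pd

gcm0 = pd.read_csv("gcm0.csv", index_col="N")          # hypothetical raw table
pystrgrp = pd.read_csv("pystrgrp.csv", index_col="N")  # hypothetical raw table

means = pd.DataFrame({
    "gcm0": gcm0.mean(axis=1),         # mean of the ten runs for each N
    "pystrgrp": pystrgrp.mean(axis=1),
})
means["gcm0:pystrgrp"] = means["gcm0"] / means["pystrgrp"]
print(means.round(3))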

Conclusion

  • strgrp is much faster than both difflib and python-levenshtein for string clustering
  • strgrp is easy to integrate into existing Python scripts via its wrapper

Test Implementations and Measurements

Naive Python Clustering

Naive clustering implementation using get_close_matches():

import difflib

def gcm0(strings):
    clusters = {}
    for string in (x.strip() for x in strings):
        match = difflib.get_close_matches(string, clusters.keys(), 1, 0.85)
        if match:
            clusters[match[0]].append(string)
        else:
            clusters[string] = [ string ]
    return clusters

Raw gcm0 run times (seconds; columns are the ten runs for each input size N):

N 0 1 2 3 4 5 6 7 8 9
500 1.60 0.92 0.96 0.92 0.95 0.93 0.93 0.96 0.96 0.92
1000 3.21 3.20 3.27 3.22 3.31 3.29 3.23 3.26 3.33 3.34
1500 6.98 6.86 6.91 6.89 6.92 6.92 6.87 7.05 7.12 7.01
2000 11.31 10.75 10.64 10.96 10.90 10.96 11.12 10.85 11.20 11.86
2500 15.52 15.88 15.27 15.74 15.70 15.68 15.31 20.96 16.34 15.60
3000 19.99 26.32 20.14 20.06 26.84 26.58 20.60 20.00 28.33 20.31
3500 33.28 35.14 30.04 28.28 30.76 25.92 33.86 26.17 33.62 32.28

The equivalent naive clustering using python-levenshtein is below. The original listing omitted the imports and the grpscore() helper; grpscore is presumably a simple (grp, score) container, and StringMatcher.ratio a two-argument similarity function from python-levenshtein (see the comments at the bottom for the scoring discussion):

from collections import namedtuple

grpscore = namedtuple('grpscore', ['grp', 'score'])

def pylev0(strings):
    f = StringMatcher.ratio
    clusters = {}
    for s in (x.strip() for x in strings):
        if not clusters.keys():
            scoreMax = grpscore(None, 0)
        else:
            scoreMax = max([ grpscore(k, f(k, s)) for k in clusters.keys() ],
                    key = lambda x: x.score)
        if 0.85 <= scoreMax.score:
            clusters[scoreMax.grp].append(s)
        else:
            clusters[s] = [ s ]
    return clusters

Slightly Less Naive Python Clustering

The slightly less naive difflib clustering implementation accounts for exact matches before invoking get_close_matches():

import difflib

def gcm1(strings):
    clusters = {}
    for string in (x.strip() for x in strings):
        if string in clusters:
            clusters[string].append(string)
        else:
            match = difflib.get_close_matches(string, clusters.keys(), 1, 0.85)
            if match:
                clusters[match[0]].append(string)
            else:
                clusters[string] = [ string ]
    return clusters

Raw gcm1 run times (seconds; columns are the ten runs for each input size N):

N 0 1 2 3 4 5 6 7 8 9
500 0.38 0.39 0.40 0.38 0.39 0.38 0.38 0.38 0.38 0.38
1000 1.34 1.35 1.33 1.38 1.35 1.37 1.34 1.33 1.37 1.39
1500 3.26 3.32 3.34 3.22 3.27 3.37 3.31 3.29 3.29 3.28
2000 5.81 5.78 5.84 5.90 5.80 5.99 5.82 5.79 5.86 5.83
2500 8.58 8.52 8.51 8.47 13.74 8.45 8.33 8.39 8.49 15.64
3000 11.26 11.34 11.08 14.36 14.13 11.44 12.06 17.21 11.04 13.53
3500 18.88 19.74 18.00 15.81 15.97 23.27 15.99 16.03 23.35 15.85

And the python-levenshtein variant with the exact-match shortcut (same grpscore container and StringMatcher assumptions as pylev0):

def pylev1(strings):
    f = StringMatcher.ratio
    clusters = {}
    for s in (x.strip() for x in strings):
        if s in clusters:
            clusters[s].append(s)
        else:
            if not clusters.keys():
                scoreMax = grpscore(None, 0)
            else:
                scoreMax = max([ grpscore(k, f(k, s)) for k in clusters.keys() ],
                        key = lambda x: x.score)
            if 0.85 <= scoreMax.score:
                clusters[scoreMax.grp].append(s)
            else:
                clusters[s] = [ s ]
    return clusters

strgrp Clustering

The driver used for timing strgrp's implementation was as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "ccan/strgrp/strgrp.h"

int main(void) {
    FILE *f;
    char *buf;
    struct strgrp *ctx;
    f = fdopen(0, "r");
#define BUF_SIZE 512
    buf = malloc(BUF_SIZE);
    ctx = strgrp_new(0.85);
    while(fgets(buf, BUF_SIZE, f)) {
        buf[strcspn(buf, "\r\n")] = '\0';
        if (!strgrp_add(ctx, buf, NULL)) {
            printf("Failed to classify %s\n", buf);
        }
    }
    strgrp_free(ctx);
    free(buf);
    fclose(f);
    return 0;
}

Raw strgrp run times (seconds; columns are the ten runs for each input size N):

N 0 1 2 3 4 5 6 7 8 9
500 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
1000 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
1500 0.11 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10
2000 0.18 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17
2500 0.29 0.28 0.28 0.28 0.28 0.28 0.28 0.28 0.28 0.28
3000 0.39 0.38 0.38 0.38 0.38 0.39 0.38 0.38 0.38 0.39
3500 0.51 0.51 0.51 0.51 0.51 0.51 0.51 0.51 0.51 0.51

pystrgrp

Using the strgrp Python wrapper:

from pystrgrp import Strgrp

def pystrgrp(strings):
    clusters = Strgrp(0.85)
    for string in (x.strip() for x in strings):
        clusters.add(string, None)
    return clusters
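
Usage mirrors the C driver above; for example, clustering a newline-delimited file of transaction descriptions (the file name is hypothetical):

with open("descriptions.txt") as f:  # hypothetical input file
    clusters = pystrgrp(f)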

Raw pystrgrp run times (seconds; columns are the ten runs for each input size N):

N 0 1 2 3 4 5 6 7 8 9
500 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
1000 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09 0.09
1500 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17 0.17
2000 0.25 0.25 0.25 0.26 0.25 0.26 0.26 0.25 0.25 0.25
2500 0.39 0.39 0.38 0.39 0.39 0.39 0.39 0.39 0.39 0.39
3000 0.53 0.53 0.52 0.52 0.52 0.52 0.52 0.53 0.53 0.52
3500 0.68 0.68 0.68 0.68 0.68 0.68 0.69 0.68 0.69 0.7

Methodology

  • Correctness: sort each implementation's output into a canonical form and diff the results. The normalisation script:
#!/usr/bin/python3

import sys

def printgrps(txtgrps):
    # Group keys are unindented lines; group members are tab-indented beneath them.
    grps = {}
    grp = []
    for l in txtgrps:
        sl = l.strip()
        if "" == sl:
            pass
        elif l.startswith("\t"):
            grp.append(sl)
        else:
            grps[sl] = []
            grp = grps[sl]
    for k in sorted(grps.keys()):
        print(k)
        for v in sorted(grps[k]):
            print("\t{}".format(v))
        print()

if __name__ == "__main__":
    printgrps(sys.stdin)
  • Performance measurement: all of the timing tables were made with the help of the rough-and-ready Bash script below.
#!/bin/bash

MAX_VALS=$1
DATA=$2
shift 2

read -r -d '' AWK_TIMER <<EOF
BEGIN { samples = 0; timer = "off"; rd[0] = 0; rl = 0 }
/sampling start/ { samples = \$3 }
/time start/ { timer = "on" }
/time stop/ { timer = "off" }
/real/ { if (timer == "on") { rd[rl++] = \$2 } }
/sampling stop/ { printf("%s, ", samples); for (e in rd) printf("%s, ", rd[e]); printf("\n"); rd[0] = 0; rl = 0; }
EOF

for i in $( seq 500 500 $MAX_VALS )
do
    echo sampling start $i
    for j in $( seq 1 10 )
    do
        echo time start
        tail -n $i $DATA | time -p $@ > /dev/null
        echo time stop
    done;
    echo sampling stop $i
done 2>&1 | awk "$AWK_TIMER"
Comments

ghost commented Sep 9, 2018

What's the grpscore function please, what's package it belongs to, I want to get pylev1 running :)

amboar (author) commented Feb 9, 2019

What's the grpscore function please, what's package it belongs to, I want to get pylev1 running :)

Sorry, I only just saw your comment @bacloud14. I had forgotten that I had omitted that function. What we want, given some candidate string and the current set of groups, is to calculate a percentage similarity for each group against the candidate string and find the best match. So the question is: how do we generate a representative value in [0.0, 1.0] from the length of the longest common subsequence? First, some basic rules:

  1. grpscore() must return 1.0 for strings that are identical, and
  2. grpscore() must return 0.0 for strings that have no common subsequence.

Assuming lcs() is a function that returns the actual subsequence string, len(lcs(key, candidate)) is the interesting input property that we want to convert into a percentage. The way to do this is somehow via the lengths of the input strings.

A naive approach is simply something like:

def grpscore(key, score):
    return score / len(key)

where we then do:

grpscore(key, len(lcs(key, candidate)))

as in the code above. LCS is commutative, so lcs(key, candidate) = lcs(candidate, key), but it may simultaneously be true that len(key) != len(candidate), which means that the naive implementation of grpscore() is not commutative. This is bad.

To demonstrate, consider the case where key is a complete subsequence of candidate and len(key) < len(candidate) (e.g. key is a prefix of candidate): this will produce len(lcs(key, candidate)) == len(key), and therefore grpscore(key, len(lcs(key, candidate))) == 1.0. Now, assume that we swap key and candidate such that keyp, candidatep = candidate, key: In this case, len(lcs(keyp, candidatep)) < len(keyp), and therefore grpscore(keyp, len(lcs(keyp, candidatep))) < 1.0 despite the fact that we're still dealing with the same two strings.
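To make this concrete, here is the naive grpscore() run both ways over a prefix pair, using a throwaway dynamic-programming helper for the LCS length (for illustration only):

def lcs_len(a, b):
    # Classic O(len(a) * len(b)) dynamic program for the LCS length.
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def grpscore(key, score):
    return score / len(key)

key, candidate = "abc", "abcdef"  # key is a prefix of candidate
print(grpscore(key, lcs_len(key, candidate)))        # 1.0
print(grpscore(candidate, lcs_len(candidate, key)))  # 0.5, same pair of strings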

Non-commutative grpscore() means that the clustering is unstable with respect to input order, so we need to make it commutative to avoid insanity. We defined two properties for grpscore() up the top; now we'll add a third:

  3. grpscore() must be commutative.

One way to associate two lengths in two dimensions is with Pythagoras' theorem. Remodelling the grpscore() function signature from what's used in the code listings in the gist, what we can do here is take a ratio of hypotenuses:

import math

def grpscore(key, candidate):
    llcs = len(lcs(key, candidate))
    lk, lc = len(key), len(candidate)
    return math.sqrt(2 * llcs * llcs) / math.sqrt(lk * lk + lc * lc)

Does our new grpscore() above satisfy our three requirements?

  1. For the key == candidate case, it's trivial that len(key) == len(candidate) and len(lcs(key, candidate)) == len(key), as the two strings are identical and any string is a subsequence of itself. By substitution it's obvious that under these conditions grpscore(key, candidate) == 1.0.

  2. For the no common subsequence case, then len(lcs(key, candidate)) == 0, which makes the numerator of the division zero, therefore grpscore(key, candidate) == 0.0

  3. For the commutative case, the denominator performs addition of a function of both of the input lengths, and addition is commutative, therefore grpscore is now commutative.

The remaining question is whether it's linear (i.e. whether we have a proper percentage, or something warped). The strings 'aa' and 'ab' are 50% similar (their LCS is 'a', half the length of either string). Our improved grpscore() gives us:

a. math.sqrt(2 * 1 * 1) / math.sqrt(2 * 2 + 2 * 2)
b. = math.sqrt(2) / math.sqrt(8)

By the quotient rule for square roots (per mathwarehouse, sqrt(a) / sqrt(b) == sqrt(a / b)), so:

c. = math.sqrt(2/8)
d. = 0.5

So from a stupidly brief inspection it looks reasonable. However, given execution time is also a concern, backing up slightly we can apply the same quotient rule to our implementation to reduce the number of invocations of math.sqrt():

def grpscore(key, candidate):
    llcs = len(lcs(key, candidate))
    lk, lc = len(key), len(candidate)
    return math.sqrt((2 * llcs * llcs) / (lk * lk + lc * lc))

Hope that helps, even if it is a bit of a late reply 🙂

amboar (author) commented Feb 9, 2019

@bacloud14 Having said all that, I'm trying to recall why the current implementation isn't something like

def grpscore(key, candidate):
    return len(lcs(key, candidate)) / max(len(key), len(candidate))

I think I was doing that at one stage and switched away, not sure why.
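
For what it's worth, the two formulations don't agree in general. On the prefix example from earlier, with the LCS length hard-coded for brevity:

import math

lk, lc, llcs = 3, 6, 3  # key = 'abc', candidate = 'abcdef', lcs = 'abc'
print(math.sqrt((2 * llcs * llcs) / (lk * lk + lc * lc)))  # ~0.632
print(llcs / max(lk, lc))                                  # 0.5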
