I became interested in string clustering through developing fpos, a set of Python scripts for graphing my spending habits. In a clear failure of research I missed the existence of libraries like difflib and python-Levenshtein. Instead, I dove down the path of developing my own string clusterer.
I wanted the approach to be quick: fpos is a command-line tool, and I wanted it to respond to me now, dammit, not a few minutes down the track when I was trying to understand how my spending was tracking against my budget.
String clustering is a fairly computationally expensive problem: picking the best match across clusters using a similarity measure like longest common subsequence or Levenshtein distance can yield a complexity in the order of O(m * n * o^2), where m represents the longest known string length, n represents the input string length, and o represents the number of strings across all clusters.
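For a back-of-the-envelope sense of scale, take the test data described below (3500 strings averaging around 47 characters): m * n * o^2 comes out at roughly 47 * 47 * 3500^2 ≈ 2.7 × 10^10 elementary character comparisons, which is well past the budget for a tool that should respond instantly.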
strgrp is what I've cooked up as a result. Sure, it's cheating by dropping to C for its implementation, but it does have a Python wrapper, and if you're already using difflib then with a small amount of effort you can leverage strgrp's speed from your Python scripts. You can grab the wrapper from the fpos repository.
Having discovered the alternatives after the fact, I was interested in comparing their performance with strgrp. Whilst strgrp doesn't use much CS research to underpin its approach (it really reflects my own exploration of the problem), as shown below the implementation is a large improvement over some comparable (at a high level) difflib- or python-levenshtein-based approaches.
Now, having said that, the research indicates that approximate string matching is at best quadratic, and points to Wagner-Fischer (Levenshtein) as the best algorithm we have. If that's the case, then tricks are all we have left. strgrp hasn't yet got to the point of implementing the optimisations suggested by Berghel and Roach in An Extension of Ukkonen's Enhanced Dynamic Programming ASM Algorithm, or by Allison in Lazy Dynamic-Programming can be Eager, but as outlined in the CCAN info page it implements some alternative measures to cut down on the runtime. The tricks include:
- Only comparing input strings with a well-defined subset of the currently clustered strings
- Caching clusters in a map for exact match lookups
- Using functions with cheaper worst-case complexities to filter out groups prior to computing the longest common subsequence, and
- Distributing comparisons across multiple threads using OpenMP
On this last point, Python's reference implementation (CPython) is hamstrung by the Global Interpreter Lock, and only multi-process concurrency will help. From the threading library documentation:
> CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
I haven't fully investigated implementing such an approach for comparison, so in the testing below the difflib and python-levenshtein solutions are single-threaded. This is possibly unfair, as the strgrp solutions benefit from unintrusive multi-threading via OpenMP, but it might also be seen as a benefit of dropping to C for more control. Finally, the abstraction provided by difflib's get_close_matches() prevents the user from opting into multi-threading: the task that can be executed in parallel is calculating the match scores of the input string against each cluster, so it is up to difflib to implement a multi-threaded approach, and it appears this is not the case.
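To give a sense of what a fair multi-process comparison might involve, here is a minimal, hypothetical sketch (not used in the benchmarks below) of fanning the per-cluster scoring out over worker processes with concurrent.futures; best_in_chunk and best_match are illustrative names, and the cluster keys are assumed to be held in a list:

    import difflib
    from concurrent.futures import ProcessPoolExecutor

    def best_in_chunk(args):
        string, keys = args
        # Score the candidate string against one chunk of cluster keys
        return max(((difflib.SequenceMatcher(None, string, k).ratio(), k)
                    for k in keys), default=(0.0, None))

    def best_match(string, keys, pool, nchunks=4):
        # Fan the per-cluster comparisons out across the worker processes
        size = max(1, len(keys) // nchunks)
        chunks = [(string, keys[i:i + size]) for i in range(0, len(keys), size)]
        return max(pool.map(best_in_chunk, chunks), default=(0.0, None))

The pool would be created once (pool = ProcessPoolExecutor()) and reused across inputs to amortise the process start-up cost; even so, pickling the keys on every call is overhead that the OpenMP approach doesn't pay.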
Test hardware: Thinkpad X201s (2010)
- RAM: 3813MiB
- CPU: x86_64 Intel(R) Core(TM) i7 CPU L 640 @ 2.13GHz (4 cores)
Test software:
- GCC (Gentoo 4.9.3 p1.0, pie-0.6.2) 4.9.3
- Python 3.4.3
The sections below analyse six clustering implementations and their absolute and relative performance:
- Naive clustering using difflib (gcm0) and python-levenshtein (pylev0)
- Slightly less naive clustering using difflib (gcm1) and python-levenshtein (pylev1)
- Native strgrp clustering (strgrp)
- strgrp clustering through the Python wrapper (pystrgrp)
The shape of the test data according to Python's Pandas is:
>>> df.describe()
count 3500
unique 1957
...
>>> df.applymap(len).describe()
count 3500.000000
mean 46.896571
std 14.662562
min 4.000000
25% 37.000000
50% 44.000000
75% 60.000000
max 97.000000
The test strategy is to execute each approach with a number of different input set sizes from the data set described above, executing ten runs for each input set. Input sizes increase in steps of 500 from 500 to 3500. The runtimes are all in seconds and were calculated using the 'real' output of time(1).
The table below compares the mean run times in seconds for each approach across the different input sizes:
N | gcm0 | gcm1 | pylev0 | pylev1 | pystrgrp | strgrp |
---|---|---|---|---|---|---|
500 | 1.005 | 0.384 | 0.66 | 0.31 | 0.05 | 0.01 |
1000 | 3.266 | 1.355 | 2.653 | 1.191 | 0.09 | 0.041 |
1500 | 6.953 | 3.295 | 6.009 | 3.021 | 0.17 | 0.105 |
2000 | 11.055 | 5.842 | 9.658 | 5.36 | 0.253 | 0.173 |
2500 | 16.2 | 9.712 | 14.34 | 8.166 | 0.389 | 0.29 |
3000 | 22.917 | 12.745 | 19.46 | 11.191 | 0.524 | 0.396 |
3500 | 30.935 | 18.289 | 25.801 | 16.233 | 0.684 | 0.521 |
Similarly, the speedup that strgrp provides, taken as the ratio of mean runtimes between each pair of approaches, is shown below:
Count | gcm0:pystrgrp | gcm1:pystrgrp | pylev0:pystrgrp | pylev1:pystrgrp | pystrgrp:strgrp |
---|---|---|---|---|---|
500 | 20.1 | 7.7 | 13.2 | 6.2 | 5.0 |
1000 | 36.3 | 15.1 | 29.5 | 13.2 | 2.2 |
1500 | 40.9 | 19.4 | 35.3 | 17.8 | 1.6 |
2000 | 43.7 | 23.1 | 38.2 | 21.2 | 1.5 |
2500 | 41.6 | 25.0 | 36.9 | 21.0 | 1.3 |
3000 | 43.7 | 24.3 | 37.1 | 21.4 | 1.3 |
3500 | 45.2 | 26.7 | 37.7 | 23.7 | 1.3 |
- strgrp is much faster than both difflib and python-levenshtein for string clustering
- strgrp is easy to integrate into existing Python scripts via its wrapper
Naive clustering implementation using get_close_matches():
import difflib

def gcm0(strings):
    clusters = {}
    for string in (x.strip() for x in strings):
        # Find at most one existing cluster key scoring at least 0.85
        match = difflib.get_close_matches(string, clusters.keys(), 1, 0.85)
        if match:
            clusters[match[0]].append(string)
        else:
            clusters[string] = [ string ]
    return clusters
gcm0 raw runtimes in seconds, ten runs per input size:
N | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
500 | 1.60 | 0.92 | 0.96 | 0.92 | 0.95 | 0.93 | 0.93 | 0.96 | 0.96 | 0.92 |
1000 | 3.21 | 3.20 | 3.27 | 3.22 | 3.31 | 3.29 | 3.23 | 3.26 | 3.33 | 3.34 |
1500 | 6.98 | 6.86 | 6.91 | 6.89 | 6.92 | 6.92 | 6.87 | 7.05 | 7.12 | 7.01 |
2000 | 11.31 | 10.75 | 10.64 | 10.96 | 10.90 | 10.96 | 11.12 | 10.85 | 11.20 | 11.86 |
2500 | 15.52 | 15.88 | 15.27 | 15.74 | 15.70 | 15.68 | 15.31 | 20.96 | 16.34 | 15.60 |
3000 | 19.99 | 26.32 | 20.14 | 20.06 | 26.84 | 26.58 | 20.60 | 20.00 | 28.33 | 20.31 |
3500 | 33.28 | 35.14 | 30.04 | 28.28 | 30.76 | 25.92 | 33.86 | 26.17 | 33.62 | 32.28 |
The corresponding naive python-levenshtein implementation, where grpscore is a record pairing a group key with its score:

import collections
import Levenshtein

# Pair a group key with its similarity score against the input string
grpscore = collections.namedtuple("grpscore", ["grp", "score"])

def pylev0(strings):
    # Levenshtein.ratio is python-levenshtein's difflib-style similarity
    f = Levenshtein.ratio
    clusters = {}
    for s in (x.strip() for x in strings):
        if not clusters.keys():
            scoreMax = grpscore(None, 0)
        else:
            # Score the input against every cluster key and keep the best
            scoreMax = max([ grpscore(k, f(k, s)) for k in clusters.keys() ],
                           key = lambda x: x.score)
        if 0.85 <= scoreMax.score:
            clusters[scoreMax.grp].append(s)
        else:
            clusters[s] = [ s ]
    return clusters
The slightly less naive difflib clustering implementation accounts for exact matches before invoking get_close_matches():
import difflib

def gcm1(strings):
    clusters = {}
    for string in (x.strip() for x in strings):
        # Cheap exact-match lookup before the expensive fuzzy search
        if string in clusters:
            clusters[string].append(string)
        else:
            match = difflib.get_close_matches(string, clusters.keys(), 1, 0.85)
            if match:
                clusters[match[0]].append(string)
            else:
                clusters[string] = [ string ]
    return clusters
gcm1 raw runtimes in seconds, ten runs per input size:
N | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
500 | 0.38 | 0.39 | 0.40 | 0.38 | 0.39 | 0.38 | 0.38 | 0.38 | 0.38 | 0.38 |
1000 | 1.34 | 1.35 | 1.33 | 1.38 | 1.35 | 1.37 | 1.34 | 1.33 | 1.37 | 1.39 |
1500 | 3.26 | 3.32 | 3.34 | 3.22 | 3.27 | 3.37 | 3.31 | 3.29 | 3.29 | 3.28 |
2000 | 5.81 | 5.78 | 5.84 | 5.90 | 5.80 | 5.99 | 5.82 | 5.79 | 5.86 | 5.83 |
2500 | 8.58 | 8.52 | 8.51 | 8.47 | 13.74 | 8.45 | 8.33 | 8.39 | 8.49 | 15.64 |
3000 | 11.26 | 11.34 | 11.08 | 14.36 | 14.13 | 11.44 | 12.06 | 17.21 | 11.04 | 13.53 |
3500 | 18.88 | 19.74 | 18.00 | 15.81 | 15.97 | 23.27 | 15.99 | 16.03 | 23.35 | 15.85 |
The slightly less naive python-levenshtein implementation applies the same exact-match shortcut:

def pylev1(strings):
    f = Levenshtein.ratio
    clusters = {}
    for s in (x.strip() for x in strings):
        # Cheap exact-match lookup before scoring every cluster key
        if s in clusters:
            clusters[s].append(s)
        else:
            if not clusters.keys():
                scoreMax = grpscore(None, 0)
            else:
                scoreMax = max([ grpscore(k, f(k, s)) for k in clusters.keys() ],
                               key = lambda x: x.score)
            if 0.85 <= scoreMax.score:
                clusters[scoreMax.grp].append(s)
            else:
                clusters[s] = [ s ]
    return clusters
The driver used for timing strgrp's implementation was as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "ccan/strgrp/strgrp.h"

#define BUF_SIZE 512

int main(void) {
    FILE *f;
    char *buf;
    struct strgrp *ctx;
    f = fdopen(0, "r");
    buf = malloc(BUF_SIZE);
    ctx = strgrp_new(0.85);
    while (fgets(buf, BUF_SIZE, f)) {
        /* Strip the line ending before clustering */
        buf[strcspn(buf, "\r\n")] = '\0';
        if (!strgrp_add(ctx, buf, NULL)) {
            printf("Failed to classify %s\n", buf);
        }
    }
    strgrp_free(ctx);
    free(buf);
    fclose(f);
    return 0;
}
strgrp raw runtimes in seconds, ten runs per input size:
N | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
500 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
1000 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 |
1500 | 0.11 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 | 0.10 |
2000 | 0.18 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 |
2500 | 0.29 | 0.28 | 0.28 | 0.28 | 0.28 | 0.28 | 0.28 | 0.28 | 0.28 | 0.28 |
3000 | 0.39 | 0.38 | 0.38 | 0.38 | 0.38 | 0.39 | 0.38 | 0.38 | 0.38 | 0.39 |
3500 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 |
Using the strgrp Python wrapper:
from pystrgrp import Strgrp

def pystrgrp(strings):
    clusters = Strgrp(0.85)
    for string in (x.strip() for x in strings):
        clusters.add(string, None)
    return clusters
pystrgrp raw runtimes in seconds, ten runs per input size:
N | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
500 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 |
1000 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 |
1500 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 | 0.17 |
2000 | 0.25 | 0.25 | 0.25 | 0.26 | 0.25 | 0.26 | 0.26 | 0.25 | 0.25 | 0.25 |
2500 | 0.39 | 0.39 | 0.38 | 0.39 | 0.39 | 0.39 | 0.39 | 0.39 | 0.39 | 0.39 |
3000 | 0.53 | 0.53 | 0.52 | 0.52 | 0.52 | 0.52 | 0.52 | 0.53 | 0.53 | 0.52 |
3500 | 0.68 | 0.68 | 0.68 | 0.68 | 0.68 | 0.68 | 0.69 | 0.68 | 0.69 | 0.7 |
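The wrapper timings above were gathered by piping the data set in over stdin, as with the other implementations; a minimal driver for that might look like the following, reusing the Strgrp(0.85) and add() calls from the listing above:

    import sys
    from pystrgrp import Strgrp

    if __name__ == "__main__":
        # Mirror the C driver: cluster whatever arrives on stdin
        clusters = Strgrp(0.85)
        for line in sys.stdin:
            clusters.add(line.strip(), None)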
- Correctness: sort each implementation's output with the script below, then diff the results across implementations
#!/usr/bin/python3
import sys

def printgrps(txtgrps):
    # Parse "key" lines followed by tab-indented member lines into a dict
    grps = {}
    grp = []
    for l in txtgrps:
        sl = l.strip()
        if "" == sl:
            pass
        elif l.startswith("\t"):
            grp.append(sl)
        else:
            grps[sl] = []
            grp = grps[sl]
    # Emit groups and members in sorted order for a stable diff
    for k in sorted(grps.keys()):
        print(k)
        for v in sorted(grps[k]):
            print("\t{}".format(v))
        print()

if __name__ == "__main__":
    printgrps(sys.stdin)
- Performance measurement: all of the timing tables were made with the help of the rough-and-ready Bash script below
#!/bin/bash
MAX_VALS=$1
DATA=$2
shift 2
# awk script to collate the `time -p` output into one row per sample size
read -r -d '' AWK_TIMER <<EOF
BEGIN { samples = 0; timer = "off"; rd[0] = 0; rl = 0 }
/sampling start/ { samples = \$3 }
/time start/ { timer = "on" }
/time stop/ { timer = "off" }
/real/ { if (timer == "on") { rd[rl++] = \$2 } }
/sampling stop/ { printf("%s, ", samples); for (e in rd) printf("%s, ", rd[e]); printf("\n"); rd[0] = 0; rl = 0; }
EOF
for i in $( seq 500 500 $MAX_VALS )
do
    echo sampling start $i
    for j in $( seq 1 10 )
    do
        echo time start
        tail -n $i $DATA | time -p $@ > /dev/null
        echo time stop
    done;
    echo sampling stop $i
done 2>&1 | awk "$AWK_TIMER"
Sorry, I only just saw your comment @bacloud14. I had forgotten I had omitted that function. What we want, given some candidate string and the current set of groups, is to calculate a percentage similarity for each group against the candidate string and find the best match. So the question is: how do we generate a representative value in [0.0, 1.0] from the length of the longest common subsequence? First, some basic rules:

- grpscore() must return 1.0 for strings that are exactly similar, and
- grpscore() must return 0.0 for strings that have no common subsequence.

Assuming lcs() is a function that returns the actual subsequence string, len(lcs(key, candidate)) is the interesting input property that we want to convert into a percentage. The way to do this is somehow via the length of the input strings. A naive approach is simply something like:
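    def grpscore(key, lcs_len):
        # lcs_len is len(lcs(key, candidate)): score the candidate by the
        # proportion of the key that the common subsequence covers
        return lcs_len / len(key)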
where we then do:
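    score = grpscore(key, len(lcs(key, candidate)))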
as in the code above. LCS is commutative, so lcs(key, candidate) == lcs(candidate, key), but it may simultaneously be true that len(key) != len(candidate), which means that the naive implementation of grpscore() is not commutative. This is bad.

To demonstrate, consider the case where key is a complete subsequence of candidate and len(key) < len(candidate) (e.g. key is a prefix of candidate): this will produce len(lcs(key, candidate)) == len(key), and therefore grpscore(key, len(lcs(key, candidate))) == 1.0. Now, assume that we swap key and candidate such that keyp, candidatep = candidate, key: in this case, len(lcs(keyp, candidatep)) < len(keyp), and therefore grpscore(keyp, len(lcs(keyp, candidatep))) < 1.0, despite the fact that we're still dealing with the same two strings.

A non-commutative grpscore() means that the clustering is unstable with respect to input order, so we need to make it commutative to avoid insanity. We defined two properties for grpscore() up the top; now we'll add a third:

- grpscore() must be commutative.

One way to associate two lengths in two dimensions is with Pythagoras' theorem. With a remodeling of the grpscore() function signature from what's used in the code listings in the gist, what we can do here is take a ratio of the hypotenuses:
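    import math

    def grpscore(key, candidate):
        lcs_len = len(lcs(key, candidate))
        # Hypotenuse of the LCS length against itself, over the hypotenuse
        # of the two input lengths
        return math.sqrt(2 * lcs_len * lcs_len) / \
            math.sqrt(len(key) ** 2 + len(candidate) ** 2)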
Does our new grpscore() above satisfy our three requirements?

For the key == candidate case, it's trivial that len(key) == len(candidate) and len(lcs(key, candidate)) == len(key), as key is trivially a complete subsequence of candidate by definition. By substitution it's obvious that under these conditions grpscore(key, candidate) == 1.0.

For the no-common-subsequence case, len(lcs(key, candidate)) == 0, which makes the numerator of the division zero, therefore grpscore(key, candidate) == 0.0.

For the commutative case, the numerator is already commutative because lcs() is, and the denominator performs addition of a function of both of the input lengths; addition is commutative, therefore grpscore() is now commutative.
The remaining question is whether it's linear (i.e. whether we have a proper percentage, or something warped). The strings 'aa' and 'ab' are 50% similar. Our improved grpscore() gives us:

a. math.sqrt(2 * 1 * 1) / math.sqrt(2 * 2 + 2 * 2)
b. = math.sqrt(2) / math.sqrt(8)

Per mathwarehouse, the quotient of two square roots is the square root of the quotient (math.sqrt(a) / math.sqrt(b) == math.sqrt(a / b)), so:

c. = math.sqrt(2 / 8)
d. = 0.5
So from a stupidly brief inspection it looks reasonable. However, given execution time is also a concern, by backing up slightly we can apply the mathwarehouse optimisation to our implementation to reduce the number of invocations of math.sqrt():
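    def grpscore(key, candidate):
        lcs_len = len(lcs(key, candidate))
        # math.sqrt(a) / math.sqrt(b) == math.sqrt(a / b): one sqrt, not two
        return math.sqrt((2 * lcs_len * lcs_len) /
                         (len(key) ** 2 + len(candidate) ** 2))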
Hope that helps, even if it is a bit of a late reply 🙂