Hierarchical clustering php vs python

Cluster 500 two-dimensional Euclidean points using hierarchical clustering with group-average linkage and cosine similarity as the distance metric. The Python implementation comes from the nltk library and the PHP one from NlpTools.

To run the test you need to install both libraries; you will find installation instructions on their respective sites.
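The data file below holds 500 integer points. The original generator is not part of the gist, but judging by the value ranges the points look uniformly sampled from [-100, 100] x [-100, 100]; a sketch that regenerates a dataset of the same shape (same format the two scripts read, not the exact original points):

```python
import random

# Hypothetical regeneration of the test data: 500 integer points drawn
# uniformly from [-100, 100] x [-100, 100]. This is an assumption inferred
# from the ranges in the data file, not the original generator.
random.seed(0)
with open('data', 'w') as f:
    for _ in range(500):
        f.write('%d,%d\n' % (random.randint(-100, 100),
                             random.randint(-100, 100)))
```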

Results

Speed

PHP: 7.9s (best of 4)
Python: 198.8s (best of 4)

The PHP implementation is ~25 times faster on this dataset of 500 points. Moreover, the PHP implementation is pure PHP, with no C extensions for matrix manipulation (such as numpy).
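The "best of 4" figures above can be collected with a small harness that reruns each script and keeps the fastest wall-clock time (a sketch; the script names in the comments are hypothetical):

```python
import subprocess
import time

def best_of(cmd, runs=4):
    """Run cmd several times and return the fastest wall-clock time."""
    best = float('inf')
    for _ in range(runs):
        start = time.time()
        subprocess.call(cmd)
        best = min(best, time.time() - start)
    return best

# Hypothetical invocations; adjust the file names to your copies of the scripts:
# print(best_of(['php', 'hierarchical.php']))
# print(best_of(['python', 'hierarchical.py']))
```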

In fact, nltk is asymptotically slower: the algorithm it uses for hierarchical clustering is O(n³), while the PHP implementation is O(n² log n).
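One standard way to reach O(n² log n) is to cache all pairwise similarities and drive the merges with a max-heap, updating group-average similarities incrementally instead of rescanning a similarity matrix on every merge. The following is a sketch of that technique, not the actual NlpTools code:

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two 2-d points."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(a[0], a[1]) * math.hypot(b[0], b[1]))

def gaa_merges(points):
    """Group-average agglomerative clustering driven by a max-heap.

    Caching all pairwise similarities costs O(n^2); each of the n-1
    merges then pushes O(n) updated similarities onto a heap of size
    O(n^2), giving O(n^2 log n) overall instead of the O(n^3) of
    repeatedly rescanning a full similarity matrix.
    """
    n = len(points)
    active = set(range(n))
    size = {i: 1 for i in range(n)}
    sim = {}
    heap = []
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(points[i], points[j])
            sim[(i, j)] = s
            heap.append((-s, i, j))  # negate: heapq is a min-heap
    heapq.heapify(heap)

    merges = []
    nxt = n  # id assigned to the next merged cluster
    while len(active) > 1:
        # Lazy deletion: skip pairs whose clusters were already merged away
        while True:
            s, i, j = heapq.heappop(heap)
            if i in active and j in active:
                break
        active.discard(i)
        active.discard(j)
        size[nxt] = size[i] + size[j]
        merges.append((i, j))
        # Group-average update:
        # sim(A+B, C) = (|A| sim(A,C) + |B| sim(B,C)) / (|A| + |B|)
        for k in active:
            s_ik = sim[(min(i, k), max(i, k))]
            s_jk = sim[(min(j, k), max(j, k))]
            s_new = (size[i] * s_ik + size[j] * s_jk) / size[nxt]
            sim[(k, nxt)] = s_new
            heapq.heappush(heap, (-s_new, k, nxt))
        active.add(nxt)
        nxt += 1
    return merges
```

On four toy points, for example, the two collinear pairs (similarity 1) merge first, then the two resulting clusters merge last.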

Memory

PHP: 60M (as reported by top under RES)
Python: 34M (as reported by top under RES)

To achieve the above time complexity there is a trade-off in memory consumption; you can read more about the implementation in an article describing it.
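Assuming the trade-off is a cache of all pairwise similarities plus the heap entries built from them (a sketch of the likely bookkeeping, not a statement about the actual NlpTools internals), a quick back-of-the-envelope count for n = 500 shows why pure-PHP arrays weigh more here than numpy's packed storage:

```python
n = 500
cached = n * (n - 1) // 2                     # 124750 similarities cached up front
# each of the n-1 merges pushes one updated entry per remaining cluster
pushed = sum(n - 1 - m for m in range(1, n))  # 124251 additional heap entries
print(cached + pushed)                        # 249001 entries in total
```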

-6,-4
67,-79
78,94
46,-40
13,-8
46,50
98,64
-45,-37
94,93
78,-46
78,5
-80,69
-5,33
-1,-84
71,45
-28,-13
1,-15
82,52
-1,-88
-14,-25
-77,-19
92,20
35,34
91,-37
54,-8
-67,25
87,-15
-47,92
-79,96
98,-5
-18,79
100,11
-6,-10
73,100
61,-12
9,-87
-22,95
-39,-80
65,98
41,-59
48,-86
79,-11
-82,-50
53,33
-40,56
-2,-94
-26,65
48,74
-72,99
-23,53
90,-73
71,86
7,80
-44,80
-72,50
7,9
-68,-99
-65,-37
-62,72
-18,24
23,-36
-78,17
65,17
51,-90
-86,36
15,-76
-28,45
-1,37
-76,-70
-67,-10
86,45
-23,93
-65,82
-70,9
-4,95
36,73
99,79
-91,-42
-51,57
4,-28
-85,7
98,71
-99,13
-64,63
-10,4
45,18
-35,-71
-44,-50
-22,-45
26,-96
100,-90
43,67
34,-39
-38,-21
-23,95
-39,-15
90,-50
97,86
-48,25
-78,-75
-31,-61
-49,-10
60,90
44,-19
-78,86
43,43
-93,-96
86,7
51,-26
-8,-27
-92,-1
-15,55
-44,85
-40,95
-26,-98
-39,-96
-77,53
31,27
-60,-77
40,-87
76,76
13,-36
10,-63
-99,-12
-52,-12
25,66
-84,-19
90,77
1,82
-51,-73
84,65
59,5
35,33
63,-91
-50,59
-72,-71
89,-79
-7,22
62,6
-58,-42
11,-46
-24,-18
-48,55
96,-13
6,29
-97,-36
-6,-41
85,77
-44,11
70,-47
-64,-100
54,-32
58,90
-17,99
-44,-25
78,-30
-67,2
-32,-76
-41,-47
71,-28
99,94
-23,-27
-61,72
-19,-75
-86,-38
45,-48
-28,-97
67,-80
-19,-1
47,-37
10,-62
50,74
-10,72
88,23
-37,2
10,10
67,-30
61,77
-92,26
98,-95
100,-15
-72,32
5,-42
75,59
-84,-1
43,89
-82,69
11,57
-76,98
50,30
-4,85
-44,-35
17,64
20,24
-83,74
75,-51
-56,33
44,-19
-71,91
-57,3
-2,-68
73,-64
-49,53
-81,-12
-28,-22
30,64
21,-20
75,-67
-94,66
-12,82
-34,-100
58,-41
7,54
-44,-13
-6,49
15,-94
-20,32
52,-9
85,-12
34,68
-80,-8
77,-83
41,35
62,-33
-52,0
-43,52
36,55
35,38
97,28
70,21
30,-32
45,-97
23,-33
13,41
30,-59
-5,53
62,36
29,-8
-67,8
29,69
88,61
-86,93
-60,-20
-31,64
39,43
46,-24
-48,6
6,26
16,25
-15,-26
89,-75
59,42
19,-28
39,57
-4,-66
1,-39
36,10
21,-80
-99,18
0,-52
20,93
54,20
81,-21
76,53
-5,-13
-83,-18
92,94
24,-59
-89,-4
-88,-18
-5,32
-38,-79
-98,42
0,16
49,45
46,53
77,-21
-80,8
-55,-49
-98,-100
-22,37
-99,68
-62,-25
-64,-73
-33,87
17,-28
78,79
-62,7
29,-29
-78,-94
98,-57
66,65
87,73
87,-66
-27,-95
58,95
72,-47
-77,-6
37,-62
28,49
81,54
33,78
31,-73
-77,27
91,1
75,15
-72,92
-15,-58
-48,-92
61,-35
85,-37
3,-46
-23,-70
-65,-77
93,34
53,-27
61,66
-34,69
-78,-64
93,-9
94,67
-83,84
90,90
-73,21
68,-73
6,-79
39,-69
-10,-48
-97,71
-29,98
70,-93
-80,-100
15,-82
-10,9
-33,39
100,-92
40,-65
44,-85
72,-27
-16,-53
72,-20
-45,11
-68,62
1,25
72,-23
75,-36
0,-1
-71,94
53,-24
9,74
-90,-77
35,-14
98,-11
-6,-93
22,72
15,-3
70,-97
-32,-33
19,-82
67,100
59,73
-59,-6
81,-53
-55,-70
66,-78
-18,2
-88,-89
96,46
-90,45
9,45
-51,33
60,-85
-99,-55
-67,38
-58,-52
-62,-23
10,38
-87,-36
81,89
38,50
-54,-95
-54,-15
-44,-10
94,26
-6,38
-10,99
32,27
17,11
82,-12
-22,-18
-25,-38
51,63
-3,-19
25,-100
45,-80
-67,-1
-6,42
45,-72
75,53
45,40
48,-23
36,-49
51,-26
36,1
-53,65
15,-86
-49,-42
-32,-90
-35,87
66,69
12,-96
-5,100
67,-49
16,-15
38,-85
80,-14
-1,23
-50,-54
12,-36
27,-29
10,21
-9,-7
-97,90
22,-30
-79,-81
52,29
49,-32
-100,-40
58,-86
43,-31
53,-79
-15,-95
54,-25
-28,44
-66,-42
-56,73
-68,-10
-68,-39
-25,20
-6,4
-37,-50
-92,66
-27,-52
-65,89
42,3
50,57
-12,-47
98,-57
63,4
-49,71
-15,75
-85,58
-39,-51
92,-27
-8,95
-30,27
18,69
-15,7
97,-11
84,-44
-33,47
-42,-52
-32,-15
-60,47
87,54
91,-86
95,-94
-2,-3
-52,-94
56,-81
-23,-25
-92,10
-66,30
-38,-70
-28,7
-36,41
-50,-89
11,-65
-80,0
1,-97
17,97
-100,27
-98,-30
57,16
-26,-93
75,-9
24,84
-89,-60
28,-29
16,11
78,-44
62,-57
-32,-69
5,33
95,-34
2,72
28,-93
93,50
-33,49
76,33
<?php

include('../nlp/autoloader.php');

use NlpTools\Clustering\Hierarchical;
use NlpTools\Clustering\MergeStrategies\GroupAverage;
use NlpTools\Similarity\CosineSimilarity;
use NlpTools\FeatureFactories\DataAsFeatures;
use NlpTools\Documents\TokensDocument;
use NlpTools\Documents\TrainingSet;

// Load the 500 points from the data file into a training set
$tset = new TrainingSet();
foreach (file('data') as $line) {
    list($x, $y) = explode(',', $line);
    $tset->addDocument(
        '', // the class label is not used for clustering
        new TokensDocument(array('x' => (float)$x, 'y' => (float)$y))
    );
}

// Group-average agglomerative clustering with cosine similarity
$clusterer = new Hierarchical(
    new GroupAverage(),
    new CosineSimilarity()
);
$dg = $clusterer->cluster($tset, new DataAsFeatures());
from nltk.cluster import GAAClusterer
import numpy

# Load the 500 points from the data file
with open('data') as f:
    vectors = [numpy.array([int(v) for v in l.split(',')]) for l in f]

# Group-average agglomerative clustering
clusterer = GAAClusterer()
clusterer.cluster(vectors, False)  # False: build the dendrogram without assigning clusters
dg = clusterer.dendrogram()