@gdbassett
Created December 12, 2014 21:59
Efficient Python implementation of canopy clustering. (A method for efficiently generating centroids and clusters, most commonly used as input to a more robust clustering algorithm.)
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np

# X should be a numpy matrix, very likely a sparse matrix: http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix
# T1 = distance to centroid point to not include in other clusters
# T2 = distance to centroid point to include in cluster
# T1 > T2 for overlapping clusters
# T1 < T2 will have points which reside in no clusters
# T1 == T2 will cause all points to reside in mutually exclusive clusters
# The distance metric can be any from here: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html
# filemap may be a list of point names in their order in X. If included, row numbers from X will be replaced with names from filemap.
def canopy(X, T1, T2, distance_metric='euclidean', filemap=None):
    canopies = dict()
    X1_dist = pairwise_distances(X, metric=distance_metric)
    canopy_points = set(range(X.shape[0]))
    while canopy_points:
        point = canopy_points.pop()
        i = len(canopies)
        # Points within T2 of the chosen point form the canopy.
        canopies[i] = {"c": point, "points": list(np.where(X1_dist[point] < T2)[0])}
        # Points within T1 are no longer eligible to seed new canopies.
        canopy_points = canopy_points.difference(set(np.where(X1_dist[point] < T1)[0]))
    if filemap:
        # Replace row indices with the supplied point names.
        for canopy_id in list(canopies.keys()):  # list() so the dict can be modified while iterating
            canopy = canopies.pop(canopy_id)
            canopy2 = {"c": filemap[canopy['c']], "points": list()}
            for point in canopy['points']:
                canopy2["points"].append(filemap[point])
            canopies[canopy_id] = canopy2
    return canopies
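
A minimal usage sketch of the function above; the toy data and point names are invented purely for illustration:

# Usage sketch (made-up data): two tight groups of 2-D points.
import numpy as np

X = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [0.2, 0.1],
              [5.0, 5.0],
              [5.1, 4.9]])
names = ["a", "b", "c", "d", "e"]

# T1=1.0, T2=0.5: points within 0.5 of a chosen point join its canopy,
# points within 1.0 are no longer eligible to seed new canopies.
result = canopy(X, T1=1.0, T2=0.5, distance_metric='euclidean', filemap=names)
print(result)  # dict mapping canopy id -> {"c": centroid name, "points": [member names]}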
@danielgarr

Hello gdbassett.

Thanks for the implementation. It is efficient indeed. However, in "standard" canopy, points inside T2 are not eligible as centers or as points for new canopies.

Because this implementation always looks up canopy members in X1_dist and only removes points from canopy_points, it adds extra points to new canopies.

You can try it with T1 = T2 (which makes the algorithm useless :) ). In that scenario each point is supposed to belong to one and only one canopy...
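
For illustration, a minimal sketch of the "standard" behaviour described above, using the textbook convention where T1 > T2, points within T1 join the canopy, and points within T2 are dropped from the candidate pool entirely. The function name canopy_strict is made up and this is not part of the original gist:

# Hedged sketch: points within T2 of a chosen center are removed from the pool,
# so they can neither seed new canopies nor be assigned to one again.
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np

def canopy_strict(X, T1, T2, distance_metric='euclidean'):
    canopies = dict()
    X1_dist = pairwise_distances(X, metric=distance_metric)
    candidates = set(range(X.shape[0]))
    while candidates:
        point = candidates.pop()
        # Still-unassigned candidates within the loose threshold T1 join this canopy.
        members = [p for p in candidates if X1_dist[point][p] < T1]
        canopies[len(canopies)] = {"c": point, "points": [point] + members}
        # Candidates within the tight threshold T2 are dropped for good;
        # members between T2 and T1 remain available to other canopies.
        candidates -= {p for p in candidates if X1_dist[point][p] < T2}
    return canopies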

@kowc

kowc commented Jan 28, 2017

Hi gdbassett,
I am new to Python; can you please tell me what data is stored in the numpy matrix X?
I came to know that canopy clustering works on a single attribute, so what is the matrix X about?
Thank you in advance.

@mohammadreza-gim

Is the clustering result influenced by the values of T1 and T2?

@wesleymsmith

wesleymsmith commented May 24, 2021

One potential problem here is that this method relies on computing the distance matrix directly. This is essentially a brute-force approach, so the time and memory requirements will be at best O(n^2). That may be fine for smallish data sets, since computing the pairwise distances may be quite fast. However, for application to large datasets, you might want to consider using something like a ball tree or KD tree. That is, instead of computing the distance matrix directly, build an sklearn.neighbors.BallTree, then use its query_radius function to obtain a 'sparse' distance matrix. This can backfire if T1 and T2 are too large, but if T1 and T2 are small relative to the volume / density of your clustering, this could yield considerable savings.
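
For illustration, a minimal sketch of the BallTree approach suggested above, mirroring the gist's T1/T2 roles. It assumes a dense X and a metric supported by BallTree (e.g. euclidean); the function name canopy_balltree is made up:

# Hedged sketch: query neighborhoods on demand instead of materializing the
# full n x n distance matrix.
from sklearn.neighbors import BallTree
import numpy as np

def canopy_balltree(X, T1, T2, metric='euclidean'):
    tree = BallTree(X, metric=metric)
    canopies = dict()
    canopy_points = set(range(X.shape[0]))
    while canopy_points:
        point = canopy_points.pop()
        # Indices of all points within T2 / T1 of the chosen center.
        inside_T2 = tree.query_radius(X[point:point + 1], r=T2)[0]
        inside_T1 = tree.query_radius(X[point:point + 1], r=T1)[0]
        canopies[len(canopies)] = {"c": point, "points": list(inside_T2)}
        # Points within T1 are no longer eligible to seed new canopies.
        canopy_points -= set(inside_T1)
    return canopies

Memory drops from O(n^2) to roughly the size of the returned neighborhoods, at the cost of two radius queries per canopy.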

@MitraMitraMitra

I think that in lines 5, 8 and 9 the signs should be reversed.

@EmersonHernly

I agree with MitraMitraMitra; those lines seem backwards. Also, if you are using this on larger datasets, you can use pairwise_distances_chunked to avoid the need to compute the distance matrix all at once.
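
For illustration, a minimal sketch of the chunked approach suggested above, using sklearn.metrics.pairwise_distances_chunked with a reduce_func so the full distance matrix never has to be held in memory at once. The helper name neighbor_lists is made up:

# Hedged sketch: collect, for each point, the indices of neighbors within T1 and T2
# by streaming row-blocks of the distance matrix.
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

def neighbor_lists(X, T1, T2, metric='euclidean'):
    within_T1, within_T2 = [], []

    def reduce_func(D_chunk, start):
        # For each row of this chunk, keep only the column indices inside each radius.
        t1 = [np.flatnonzero(row < T1) for row in D_chunk]
        t2 = [np.flatnonzero(row < T2) for row in D_chunk]
        return t1, t2

    for t1, t2 in pairwise_distances_chunked(X, reduce_func=reduce_func, metric=metric):
        within_T1.extend(t1)
        within_T2.extend(t2)
    return within_T1, within_T2

# These per-point neighbor lists could then drive the same canopy loop as the gist,
# replacing the X1_dist lookups.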
