Clojure source code for Part 1 of "Implementing the k-means jump method" — see
(ns jump-sugar-james.irises
  (:require [incanter.core :as i]
            [incanter.stats :as incs]
            [incanter.charts :as incc]
            [incanter.datasets :as incd]
            [clj-ml.clusterers :as mlc]
            [clj-ml.data :as mld]))
;;;; This paper:
;;
;;   "Finding the number of clusters in a data set: An information
;;   theoretic approach", by Catherine A. Sugar & Gareth M. James
;;   (Marshall School of Business, University of Southern California)
;;
;; ...is fascinating. Let's work through its ideas by implementing its
;; methods and replicating examples.
;; You should read the paper, or at the very least the abstract, but
;; for those who refuse to do so, the gist I'll repeat here is that
;; we can use ideas from information theory to evaluate choices for
;; the number of clusters (which we refer to as k) in data sets we
;; apply the k-means clustering algorithm to.
;; Don't try to read the paper alongside this document: the narrative
;; here does *not* follow the order in which Sugar & James introduce
;; concepts and data sets. This document is intended not as a
;; companion to the paper but as a different path through the ideas it
;; introduces, so I recommend reading or at least skimming the paper
;; before working through this document.
;; So, what does it mean to apply information theory, specifically
;; rate distortion theory, to clustering applications? The core idea
;; is actually quite concise. The authors describe it in section 2.1:
;; "One can characterize cluster analysis as an attempt to find the
;; best possible representation of a population [that is, of a data
;; set --DL] using a fixed number of points. This can be thought of as
;; performing data compression or quantization on
;; i.i.d. [independently and identically distributed] draws from a
;; given distribution."
;; Awesome! Clustering a set of data is like summarizing it. And this
;; tells us directly what it means (like, philosophically) to select
;; the number of clusters: we're looking for that ideal balance
;; between compressing the original information and retaining its
;; meaning. For more on this general concept (as Sugar & James note),
;; see the work of Claude Shannon.
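;; To make the compression analogy concrete, here's a toy sketch of my
;; own (not from the paper): compress six 1-D points down to k
;; representative values, and measure what we lose as the mean squared
;; distance from each point to its nearest representative. Fewer
;; representatives means more compression but more distortion. (The
;; representative values here are hand-picked for illustration.)
(let [points [1.0 1.1 0.9 5.0 5.2 4.8]
      mse    (fn [centers]
               ;; mean squared distance from each point to its nearest center
               (/ (reduce + (map (fn [p]
                                   (apply min (map #(Math/pow (- p %) 2.0)
                                                   centers)))
                                 points))
                  (count points)))]
  [(mse [3.0])        ;; k=1: maximal compression, distortion ≈ 4.02
   (mse [1.0 5.0])])  ;; k=2: one center per clump, distortion ≈ 0.017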
;; By modeling clustering as a kind of compression, Sugar & James
;; provide a tool to better determine which choice of k is best suited
;; to a given data set. That is, they propose a method to determine
;; how many clusters best describe a data set.
;; With that conceptual understanding in hand, let's look at some
;; concrete applications. Please have a REPL handy and evaluate as you
;; read.
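;; (A note on setup -- my assumption, not part of the original: you'll
;; need Incanter and clj-ml on the classpath. With Leiningen, the
;; dependencies would look something like the following; the exact
;; versions are my guess for the 2017 era.)
;;
;;   [incanter "1.5.7"]
;;   [cc.artifice/clj-ml "0.8.5"]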
;; Let us begin with that old classic, Fisher's iris data. We direct
;; Incanter to grab it for us. (The next few lines about PCA come
;; directly from the 'Data Sorcery with Clojure' blog, modified only
;; to :require Incanter rather than `use` it.)
(def iris (i/to-matrix (incd/get-dataset :iris)))
;; Since the iris data is multivariate, to plot it as an xy graph we
;; must condense its attributes. For that we perform principal
;; component analysis (PCA):
(def X (i/sel iris :cols (range 4)))
(def components (:rotation (incs/principal-components X)))
(def pc1 (i/sel components :cols 0))
(def pc2 (i/sel components :cols 1))
;; And now we can view the four-dimensional iris data in two
;; dimensions. We group by species using color.
(i/view (incc/scatter-plot (i/mmult X pc1) (i/mmult X pc2)
                           :group-by (i/sel iris :cols 4)
                           :x-label "PC1"
                           :y-label "PC2"
                           :title "Iris PCA grouped by species [Figure 5c]"))
;; The function output should be something like:
;; => #object[org.jfree.chart.ChartFrame 0x4aa3d1e4 "org.jfree.chart.ChartFrame[frame6,0,23,500x400,layout=java.awt.BorderLayout,title=Incanter Plot,resizable,normal,defaultCloseOperation=DISPOSE_ON_CLOSE,rootPane=javax.swing.JRootPane[,0,22,500x378,layout=javax.swing.JRootPane$RootLayout,alignmentX=0.0,alignmentY=0.0,border=,flags=16777673,maximumSize=,minimumSize=,preferredSize=],rootPaneCheckingEnabled=true]"]
;; ...but the goods are in the Java window, which you may have to
;; alt-tab to go see. Go ahead and look; I'll wait here.
;; So, we know that this data forms three groups by species, but
;; visualizing it with PCA reveals that while one species is well
;; differentiated, the other two have overlapping
;; attributes. Therefore this data could legitimately be interpreted
;; as consisting of either two or three groups.
;; Let's now explore what k-means analysis and Sugar & James' approach
;; say about the iris data. We'll use WEKA's k-means implementation
;; via its Clojure wrapper, clj-ml. I've already aliased its
;; 'clusterers' and 'data' namespaces as mlc and mld respectively, so
;; we can get started.
(def iris-kmeans-2 (mlc/make-clusterer :k-means {:number-clusters 2}))
;; What we have here is a 'clusterer' object, which one uses in WEKA
;; to perform cluster analysis. (It's a rather object-oriented
;; approach.) This clusterer doesn't say much, because while it's been
;; *created*, it hasn't yet been *built*:
;; => #object[weka.clusterers.SimpleKMeans 0x3b416f01 "No clusterer built yet!"]
;; So let's build it!
(doto iris-kmeans-2
  (mlc/clusterer-build (mld/make-dataset "Fisher's Irises"
                                         [:sepal-length :sepal-width
                                          :petal-length :petal-width]
                                         ; Withhold the species column
                                         ; because not to would be
                                         ; cheating:
                                         (i/to-vect (i/sel iris :cols (range 4))))))
;; => #object[weka.clusterers.SimpleKMeans 0x3b416f01 "\nkMeans\n======\n\nNumber of iterations: 5\nWithin cluster sum of squared errors: 18.37779075053819\n\nInitial staring points (random):\n\nCluster 0: 6.1,2.9,4.7,1.4,1\nCluster 1: 6.2,2.9,4.3,1.3,1\n\nMissing values globally replaced with mean/mode\n\nFinal cluster centroids:\n Cluster#\nAttribute Full Data 0 1\n (150) (100) (50)\n===============================================\nsepal-length 5.8433 6.262 5.006\nsepal-width 3.0573 2.872 3.428\npetal-length 3.758 4.906 1.462\npetal-width 1.1993 1.676 0.246\nspecies 1 1.5 0\n\n\n"]
;; Man that's hard to read. This evaluates to `nil` but pretty-prints
;; in the REPL:
(println iris-kmeans-2)
;; #object[weka.clusterers.SimpleKMeans 0x1b5c3bc
;; kMeans
;; ======
;;
;; Number of iterations: 7
;; Within cluster sum of squared errors: 12.127790750538196
;;
;; Initial staring points (random):
;;
;; Cluster 0: 6.1,2.9,4.7,1.4
;; Cluster 1: 6.2,2.9,4.3,1.3
;;
;; Missing values globally replaced with mean/mode
;;
;; Final cluster centroids:
;;                           Cluster#
;; Attribute     Full Data          0          1
;;                   (150)      (100)       (50)
;; ===============================================
;; sepal-length     5.8433      6.262      5.006
;; sepal-width      3.0573      2.872      3.428
;; petal-length      3.758      4.906      1.462
;; petal-width      1.1993      1.676      0.246
;; ]
;; So WEKA's k-means function dutifully fulfilled its task and gave us
;; two clusters. As the Turks say, çok iyi! ("Very good!") Let's try
;; with 3:
(def iris-kmeans-3 (mlc/make-clusterer :k-means {:number-clusters 3}))

(doto iris-kmeans-3
  (mlc/clusterer-build (mld/make-dataset "Fisher's Irises"
                                         [:sepal-length :sepal-width
                                          :petal-length :petal-width]
                                         (i/to-vect (i/sel iris :cols (range 4))))))
(println iris-kmeans-3)
;; #object[weka.clusterers.SimpleKMeans 0x6a3536e9
;; kMeans
;; ======
;;
;; Number of iterations: 6
;; Within cluster sum of squared errors: 6.982216473785234
;;
;; Initial staring points (random):
;;
;; Cluster 0: 6.1,2.9,4.7,1.4
;; Cluster 1: 6.2,2.9,4.3,1.3
;; Cluster 2: 6.9,3.1,5.1,2.3
;;
;; Missing values globally replaced with mean/mode
;;
;; Final cluster centroids:
;;                           Cluster#
;; Attribute     Full Data          0          1          2
;;                   (150)       (61)       (50)       (39)
;; ==========================================================
;; sepal-length     5.8433     5.8885      5.006     6.8462
;; sepal-width      3.0573     2.7377      3.428     3.0821
;; petal-length      3.758     4.3967      1.462     5.7026
;; petal-width      1.1993      1.418      0.246     2.0795]
;; Again, WEKA dutifully gives us three clusters. But how do we know
;; which choice of k is more accurate? More true? More illuminating?
;; We could visually compare the cluster results to each other and to
;; the species results we already know. So, let's look at the cluster
;; assignments WEKA made.
(.getAssignments iris-kmeans-2)
;; Ach nein! We got an unhandled java.lang.Exception. From the
;; stacktrace:
;; "The assignments are only available when order of instances is
;; preserved (-O)"
;; OK, so we have to re-run our k-means analysis, this time setting a
;; new parameter:
(def iris-kmeans-2 (mlc/make-clusterer :k-means {:number-clusters 2
                                                 :preserve-instances-order true}))
(def iris-kmeans-3 (mlc/make-clusterer :k-means {:number-clusters 3
                                                 :preserve-instances-order true}))
;; Before, we used clj-ml's (clusterer-build), which simply wraps
;; .buildClusterer. But since we're already Java-aware with (doto),
;; and the wrapper adds no functionality, I see no advantage in using
;; it. To me, calling .buildClusterer directly is clearer:
(doto iris-kmeans-2
  (.buildClusterer (mld/make-dataset "Fisher's Irises"
                                     [:sepal-length :sepal-width
                                      :petal-length :petal-width]
                                     (i/to-vect (i/sel iris :cols (range 4))))))

(doto iris-kmeans-3
  (.buildClusterer (mld/make-dataset "Fisher's Irises"
                                     [:sepal-length :sepal-width
                                      :petal-length :petal-width]
                                     (i/to-vect (i/sel iris :cols (range 4))))))
;; And now?
(.getAssignments iris-kmeans-2)
;; => #object["[I" 0x2b78a89b "[I@2b78a89b"]
;; Ah. Bien sûr. So much better. o_O
;; It turns out we need to iterate over those assignments. And
;; manually fold them back into our data. Oh, Java.
(def iris-2-clusters (map conj (i/to-vect iris) (.getAssignments iris-kmeans-2)))
;; Let's take a quick look to see if we're going in the right
;; direction:
(take 10 iris-2-clusters)
;; => ([5.1 3.5 1.4 0.2 0.0 1]
;; [4.9 3.0 1.4 0.2 0.0 1]
;; [4.7 3.2 1.3 0.2 0.0 1]
;; [4.6 3.1 1.5 0.2 0.0 1]
;; [5.0 3.6 1.4 0.2 0.0 1]
;; [5.4 3.9 1.7 0.4 0.0 1]
;; [4.6 3.4 1.4 0.3 0.0 1]
;; [5.0 3.4 1.5 0.2 0.0 1]
;; [4.4 2.9 1.4 0.2 0.0 1]
;; [4.9 3.1 1.5 0.1 0.0 1])
(def iris-3-clusters (map conj (i/to-vect iris) (.getAssignments iris-kmeans-3)))
;; OK. We're all set to compare k=2 and k=3 for the iris data. We
;; reach for PCA again, re-using our existing definitions of X, pc1,
;; and pc2, but grouping now not by species but by cluster
;; assignment. (I also like to have the original species chart open
;; for comparison, so if you closed it, I recommend re-evaluating it.)
(i/view (incc/scatter-plot (i/mmult X pc1) (i/mmult X pc2)
                           :group-by (i/sel (i/to-matrix (i/to-dataset iris-2-clusters))
                                            :cols 5)
                           :x-label "PC1"
                           :y-label "PC2"
                           :title "Iris PCA grouped by 2 k-means clusters"))

(i/view (incc/scatter-plot (i/mmult X pc1) (i/mmult X pc2)
                           :group-by (i/sel (i/to-matrix (i/to-dataset iris-3-clusters))
                                            :cols 5)
                           :x-label "PC1"
                           :y-label "PC2"
                           :title "Iris PCA grouped by 3 k-means clusters"))
;; From a purely visual standpoint, I'd say both of these are
;; excellent clustering attempts. With k=3, the overlapping border
;; between species is reasonably close to the true species groupings.
;; ...But what if we didn't know the species data? How, then, would we
;; differentiate k=2 from k=3? For that matter, how would we know k=4
;; isn't more right than either one? It looks relatively convincing:
(def iris-kmeans-4 (mlc/make-clusterer :k-means {:number-clusters 4
                                                 :preserve-instances-order true}))

(doto iris-kmeans-4
  (.buildClusterer (mld/make-dataset "Fisher's Irises"
                                     [:sepal-length :sepal-width
                                      :petal-length :petal-width]
                                     (i/to-vect (i/sel iris :cols (range 4))))))

(i/view (incc/scatter-plot (i/mmult X pc1) (i/mmult X pc2)
                           :group-by (-> (map conj (i/to-vect iris)
                                              (.getAssignments iris-kmeans-4))
                                         (i/sel :cols 5))
                           :x-label "PC1"
                           :y-label "PC2"
                           :title "Iris PCA grouped by 4 k-means clusters"))
;; k=4 certainly looks as defensible as k=3, just visually. What we
;; need here is some sort of concrete way to gauge accuracy.
;; The most common technique for gauging the accuracy of a clustering
;; attempt is the elbow method, in which one looks for a sharp turn or
;; "elbow" in the graph of the within-cluster sum of squared errors
;; (SSE) across clusterings at each k. Let's try that.
;; Here I map our cluster-analysis process across a range of k all the
;; way up to 10. Then I chart the resulting SSEs.
(let [clusterers (map #(mlc/clusterer-info
                        (doto (mlc/make-clusterer :k-means {:number-clusters %})
                          (.buildClusterer (mld/make-dataset
                                            "Fisher's Irises"
                                            [:sepal-length :sepal-width
                                             :petal-length :petal-width]
                                            (i/to-vect (i/sel iris :cols (range 4)))))))
                      (range 1 11))]
  (i/view (incc/line-chart (map :number-clusters clusterers)
                           (map :squared-error clusterers)
                           :title "Fisher's irises: SSE by selection of k"
                           :x-label "Number of clusters, k"
                           :y-label "Within-cluster sum of squared errors")))
;; In the case of the iris data, the elbow technique is informative
;; but leaves us with unanswered questions. There's an elbow at both 2
;; *and* 3, and maybe even 4. With this technique we're able to reject
;; k>4 and focus on k=2, k=3, and k=4, but I want more certainty. I
;; would feel more comfortable basing my conclusions on something
;; numerical. Wouldn't you?
;; Well, that's precisely what Sugar & James provide. They propose
;; calculating the distortion in each clustering analysis, and then
;; looking at the change or 'jump' in distortion from one k to the
;; next. I've created the following functions to do just that.
;; First, they calculate the 'distortion' for each cluster analysis
;; given k. This means raising the squared error, which we already
;; know, to a negative power that we can manipulate but generally set
;; to one-half of the number of parameters in our data set. Since we
;; want to look at this value across a range of k values, this
;; function applies that calculation to each k-means clusterer in a
;; seq of clusterers:
(defn assoc-distortions
  "Given a seq of maps `clusterers`, where each map describes a k-means
  clusterer, and a number `transformation-power-y`, return a seq of
  maps describing the same clusterers, now with key :distortion mapped
  to the value of the cluster analysis' squared error raised to the
  negative `transformation-power-y`.

  Based on Sugar & James' paper, which defines distortion (denoted
  d̂_k) for a given k as the mean squared error when covariance is the
  identity matrix. This distortion is then transformed by raising it
  to the negative of the transformation power y, which is generally
  set to the number of attributes divided by two."
  [clusterers transformation-power-y]
  (map #(assoc % :distortion (Math/pow (:squared-error %) (- transformation-power-y)))
       clusterers))
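;; In symbols (my paraphrase): with transformation power Y (typically
;; p/2 for p attributes), the transformed distortion for a given k is
;; d̂_k^(-Y). Note that we substitute WEKA's within-cluster SSE for the
;; paper's mean squared error; that only rescales the transformed
;; distortions by a constant factor, and it's the *location* of the
;; jumps we care about. A quick sanity check with a stand-in clusterer
;; map (numbers made up for illustration) -- with four attributes,
;; Y = 4/2 = 2, a squared error of 12.13 transforms to
;; 12.13^-2 ≈ 0.0068:
(assoc-distortions [{:number-clusters 2 :squared-error 12.13}] 2)
;; => a seq of the same map, now with :distortion ≈ 0.0068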
;; Once you know the distortions, Sugar & James write, you calculate
;; the change in those transformed distortions as one changes k. They
;; call this the "jump", which we calculate by walking our seq of
;; clusterers in a pairwise fashion using `partition`. Notice that we
;; prepend a dummy clusterer (`{:distortion 0}`) so that we can
;; calculate a jump for k=1.
(defn assoc-jumps
  "Given a seq of maps `clusterers`, each describing a k-means analysis
  including the transformed :distortion of that analysis, returns a
  sequence of maps 'updated' with the calculated jumps in transformed
  distortion.

  Based on Sugar & James' paper, which defines the jump for a given k
  (denoted J_k) as simply the transformed distortion for k minus the
  transformed distortion for k-1."
  [clusterers]
  (map (fn [[a b]] (assoc b :jump (- (:distortion b) (:distortion a))))
       (partition 2 1 (conj clusterers {:distortion 0}))))
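;; A sanity check with made-up distortion values: each jump is the
;; difference from the previous k's transformed distortion, with the
;; k=1 jump measured against the prepended dummy zero. (We pass a
;; list because `conj` prepends to lists and lazy seqs -- as it does
;; to the seq returned by `assoc-distortions` -- but would append to
;; a vector.)
(assoc-jumps (list {:k 1 :distortion 0.25}
                   {:k 2 :distortion 0.5}
                   {:k 3 :distortion 1.0}))
;; => ({:k 1, :distortion 0.25, :jump 0.25}
;;     {:k 2, :distortion 0.5, :jump 0.25}
;;     {:k 3, :distortion 1.0, :jump 0.5})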
;; Let's put these distortion and jump tools to work. I want to see
;; them plotted across a selection of k values for our iris data. We
;; also need to try out some different values for the transformation
;; power, Y. (Notice that this turns clustering from a one-dimensional
;; problem of selecting k into a two-dimensional problem of evaluating
;; different combinations of Y and k.) To do that, let's graph the
;; distortion jumps for different values of k across three selections
;; of Y.
(let [clusterers (map #(mlc/clusterer-info
                        (doto (mlc/make-clusterer :k-means {:number-clusters %})
                          (.buildClusterer (mld/make-dataset
                                            "Fisher's Irises"
                                            [:sepal-length :sepal-width
                                             :petal-length :petal-width]
                                            (i/to-vect (i/sel iris :cols (range 4)))))))
                      (range 1 11))]
  ;; k from 1 to 10 using transformation power of 2/3:
  (i/view (incc/line-chart (map :number-clusters (assoc-jumps (assoc-distortions clusterers (/ 2 3))))
                           (map :jump (assoc-jumps (assoc-distortions clusterers (/ 2 3))))
                           :title "Fisher's irises: Jumps according to k, with y=2/3 [Figure 5a]"
                           :x-label "Number of clusters, k"
                           :y-label "Jumps in transformed distortions, J"))
  ;; k from 1 to 10 using transformation power of 1:
  (i/view (incc/line-chart (map :number-clusters (assoc-jumps (assoc-distortions clusterers 1)))
                           (map :jump (assoc-jumps (assoc-distortions clusterers 1)))
                           :title "Fisher's irises: Jumps according to k, with y=1 [Figure 5b]"
                           :x-label "Number of clusters, k"
                           :y-label "Jumps in transformed distortions, J"))
  ;; k from 1 to 10 using transformation power of 2:
  (i/view (incc/line-chart (map :number-clusters (assoc-jumps (assoc-distortions clusterers 2)))
                           (map :jump (assoc-jumps (assoc-distortions clusterers 2)))
                           :title "Fisher's irises: Jumps according to k, with y=2 (p/2)"
                           :x-label "Number of clusters, k"
                           :y-label "Jumps in transformed distortions, J")))
;; Sugar and James use Y=1 and Y=2/3; I added Y=2 to see how the
;; default recommendation of half the number of attributes (here
;; p/2 = 4/2 = 2) performs. Their choice of Y seems based on trial
;; and error:
;; "[Various theoretical reasons and observations] suggest that the
;; further the cluster distributions are from Normal, the smaller the
;; transformation power should be. However, it is not obvious how to
;; choose the optimal value of Y."
;; ...
;; "When squared error distortion is used and strong correlations
;; exist between dimensions, values of Y somewhat less that p = 2 may
;; produce superior results... Empirically, a promising approach
;; involves estimating the “effective” number of dimensions in the
;; data and transforming accordingly. For example, the iris data is
;; four-dimensional which suggests using Y = 2. However, several of
;; the variables are highly correlated. As a result, the effective
;; dimension of this data set is closer to 2, implying that a
;; transformation power near Y = 1 may be more appropriate.
;; With that understanding, what do these three graphs show? With Y=2,
;; I see a clear spike at k=7 and a secondary spike at k=3. With Y=1,
;; the spike at 7 is greatly diminished, and k=3 barely edges out
;; k=2. With Y=2/3, k=2 is the clear winner. (Your results may vary
;; based on WEKA seeding. More on this later.) You may find it
;; interesting to run these again with dramatically higher maximum k.
;; Taking the authors' recommendations on selecting a transformation
;; power, I would interpret these results (were I ignorant of the
;; actual number of classes) as suggesting either 2, 3, or 7
;; clusters. Sugar and James note that either 2 or 3 is a reasonable
;; interpretation of the data given its known classes. I would suggest
;; that k=7 is leaning into overfitting territory given the number of
;; samples per cluster (~21 rather than ~50), but would be a
;; reasonable path to investigate (again, if I were ignorant of the
;; true classes).
;; Overall, examining jumps in distortion seems like a relatively
;; robust method for this particular data set. Looking at other
;; example data sets can show us more dramatically how this method is
;; superior to the elbow method. We'll do that in the next section.