@daveliepmann
Created May 27, 2016 08:11
Implementing Sugar & James’ paper, "Finding the number of clusters in a data set: An information theoretic approach" in Clojure — Part 4

Implementing the k-means jump method: Part Four

This fourth and final part of our exploration into the jump method wraps things up by following a basic application of the technique through to a search for results we can be confident in.

We'll reuse functions from the rest of the series (Part One, Part Two, Part Three). You'll also want the original paper handy so you can refer back to its charts.

First, we're going to return to several prior data sets:

(def iris (i/to-matrix (incd/get-dataset :iris)))

(def wbcd (->> "data/breast-cancer-wisconsin-no-missing-attributes.data"
               slurp
               s/split-lines
               (map #(s/split % #","))
               i/to-dataset
               i/to-matrix))

(def auto-mpg (->> "data/auto-mpg.data"
                   slurp
                   s/split-lines
                   (map #(s/split % #"[\t]"))
                   (map #(apply conj (vector (remove empty? (s/split (first %) #" "))
                                             (s/replace (second %) "\"" ""))))))

(def auto-mpg-pca
  (let [attrs      (i/sel (i/to-matrix (i/to-dataset auto-mpg))
                          :cols (range 9))
        components (:rotation (incs/principal-components attrs))]
    (map vector
         (i/mmult attrs (i/sel components :cols 0))
         (i/mmult attrs (i/sel components :cols 1)))))

We'll also reuse functions that implement the jump method and various visualizations:

(defn assoc-distortions
  "Given a number `transformation-power-y` and a seq of maps
  `clusterers`, where each map describes a k-means clusterer, return a
  seq of maps describing the same clusterers, now with key :distortion
  mapped to the value of the cluster analysis' squared error raised to
  the negative `transformation-power-y`.
  
  Based on Sugar & James' paper at
  http://www-bcf.usc.edu/~gareth/research/ratedist.pdf which defines
  distortion (denoted d[hat][sub k]) for a given k as the mean squared error when covariance
  is the identity matrix. This distortion is then transformed by
  raising it to the negative of the transformation power y, which is
  generally set to the number of attributes divided by two."
  [clusterers transformation-power-y]
  (map #(assoc % :distortion (Math/pow (:squared-error %) (- transformation-power-y)))
       clusterers))

(defn assoc-jumps
  "Given a seq of maps `clusterers` each describing a k-means analysis
  including the transformed :distortion of that analysis, returns a
  sequence of maps 'updated' with the calculated jumps in transformed
  distortions.

  Based on Sugar & James' paper at
  http://www-bcf.usc.edu/~gareth/research/ratedist.pdf which defines
  the jump for a given k (denoted J[sub k]) as simply the transformed distortion for k
  minus the transformed distortion for k-1."
  [clusterers]
  (map (fn [[a b]] (assoc b :jump (- (:distortion b) (:distortion a))))
       (partition 2 1 (conj clusterers {:distortion 0}))))
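
To recall how these two transformations fit together, here's a tiny worked example of my own (not from the paper or the earlier parts), using hand-made squared errors and y=1:

(-> [{:number-clusters 1 :squared-error 40.0}
     {:number-clusters 2 :squared-error 10.0}
     {:number-clusters 3 :squared-error 8.0}]
    (assoc-distortions 1) ;; distortion = squared-error ^ -1 => 0.025, 0.1, 0.125
    assoc-jumps)          ;; jump = difference from previous distortion => 0.025, 0.075, ~0.025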

(def abc-keys
  "Returns a function that, given integer `n`, returns a seq of
  alphabetic keys. Only valid for n<=26."
  (let [ls (mapv (comp keyword str) "abcdefghijklmnopqrstuvwxyz")]
    (fn [n] (take n ls))))

(def init-codes
  {:random 0
   :k-means++ 1
   :canopy 2
   :farthest-first 3})

(defn clusterer-over-k
  "Returns a k-means clusterer built over the `dataset` (which we call
  `ds-name`), using `k` number of clusters."
  [dataset ds-name k init-method]
  (mlc/clusterer-info
   (doto (mlc/make-clusterer :k-means
                             {:number-clusters k
                              :initialization-method (init-codes init-method)})
     (.buildClusterer (mld/make-dataset ds-name
                                        (abc-keys (count (first dataset)))
                                        dataset)))))

(defn clusterers-over-ks
  "Returns a seq of k-means clusterer objects that have been built on
  the given `dataset` that we call `ds-name`, where for each k in
  0<k<`max-k` one clusterer attempts to fit that number of clusters to
  the data. Optionally takes a map of `options` according to clj-ml's
  `make-clusterer`."
  [dataset ds-name max-k init-method]
  (map #(clusterer-over-k dataset ds-name % init-method)
       (range 1 (inc max-k))))

(defn graph-k-analysis
  "Performs cluster analyses using k=1 to max-k, graphs a selection of
  comparisons between those choices of k, and returns a JFree chart
  object. Requires the `dataset` as a vector of vectors, the name of
  the data set `ds-name`, the greatest k desired for analysis `max-k`,
  the transformation power to use `y`, the desired initial centroid
  placement method `init-method` as a keyword from the set
  #{:random :k-means++ :canopy :farthest-first}, and a vector of keys
  describing the `analyses` desired, which must be among the set
  #{:squared-errors, :distortions, :jumps}."
  [dataset ds-name max-k y init-method analyses]
  {:pre [(contains? init-codes init-method)]}
  (let [clusterers (clusterers-over-ks dataset ds-name max-k init-method)
        clusterer-distortions (assoc-distortions clusterers y)]
    (when (some #{:squared-errors} analyses)
      (i/view (incc/line-chart (map :number-clusters clusterers)
                               (map :squared-error clusterers)
                               :title (str ds-name ": SSE according to selection of k")
                               :x-label "Number of clusters, k"
                               :y-label "Within-cluster sum of squared errors")))
    (when (some #{:distortions} analyses)
      (i/view (incc/line-chart (map :number-clusters clusterer-distortions)
                               (map :distortion clusterer-distortions)
                               :title (str ds-name ": Distortions according to k")
                               :x-label "Number of clusters, k"
                               :y-label (str "Transformed distortions, J, using y=" y))))
    (when (some #{:jumps} analyses)
      (i/view (incc/line-chart (map :number-clusters (assoc-jumps clusterer-distortions))
                               (map :jump (assoc-jumps clusterer-distortions))
                               :title (str ds-name ": Jumps according to k")
                               :x-label "Number of clusters, k"
                               :y-label (str "Jumps in transformed distortions, J, using y=" y))))))

(defn graph-pca
  "Given `data` as a matrix, a seq containing indices of data columns
  in the data `cols`, and optionally the index of the class attribute
  `class-idx`, this function graphs the first principal component of
  the data on the x-axis, the second PCA on the y-axis, and groups the
  data points by class."
  ([data cols]
   (let [attrs (i/sel data :cols cols)
         components (:rotation (incs/principal-components attrs))]
     (i/view (incc/scatter-plot (i/mmult attrs (i/sel components :cols 0))
                                (i/mmult attrs (i/sel components :cols 1))
                                :x-label "PC1"
                                :y-label "PC2"))))
  ([data cols class-idx]
   (let [attrs (i/sel data :cols cols)
         components (:rotation (incs/principal-components attrs))]
     (i/view (incc/scatter-plot (i/mmult attrs (i/sel components :cols 0))
                                (i/mmult attrs (i/sel components :cols 1))
                                :group-by (i/sel data :cols class-idx)
                                :x-label "PC1"
                                :y-label "PC2")))))

These should all be familiar to you from reading previous installments. We press on.

Four Gaussians

Let's start by fiddling with the degree of separation between clusters, to see how the jump method performs as clusters become less easily distinguished. Sugar & James explore this with three variations on four evenly-placed Gaussian points, modifying only the standard deviation:

Figure 3 [with nine Gaussian clusters] illustrates a situation in which the groups are well separated. However, the jump and broken line methods also perform well when the clusters overlap to a large extent. Figure 4 shows three data sets, each a mixture of four Gaussians. The data set of Figure 4(a) contains well separated clusters, that of Figure 4(b) has some overlap and that of Figure 4(c) is almost indistinguishable from a single cluster. The corresponding plots of transformed distortion reflect this decreasing level of separation.

To replicate their Figure 4, we first create a data set of four well-separated clusters:

(def four-points [[-4 -4] [-4 4] [4 -4] [4 4]])

(def four-separate
  (mapcat (fn [coords] (repeatedly 100 #(gaussian-point coords 1))) four-points))

Then we create two more data sets, also with four clusters, but less clearly grouped. Note that the only difference is the standard-deviation parameter to gaussian-point.

(def four-with-overlap
  (mapcat (fn [coords] (repeatedly 100 #(gaussian-point coords (i/sqrt 3))))
          four-points))

(def four-mingled
  (mapcat (fn [coords] (repeatedly 100 #(gaussian-point coords (i/sqrt 7))))
          four-points))

Now LET'S DO SOME CLUSTER ANALYSIIIIIIIIIIIIIIIIIIIS!

By now our process is fairly straightforward. The only major decision point is which initialization method to use. Although I experimented with all four initialization methods, I chose k-means++ here because it's nice to have some variety (compared to showing farthest-first again), and k-means++ gave me consistently good results in the few times I ran these distributions.

Anyway, we can faithfully and independently reproduce Figures 4a(i), 4a(ii) and 4a(iii)—plus the SSE graphs for good measure:

(i/view (incc/scatter-plot (map first four-separate)
                           (map second four-separate)
                           :title "A simulated data set with four well-separated Gaussian clusters [Figure 4a(i)]"))

Four well-separated gaussian points

(graph-k-analysis four-separate "Four well-separated Gaussian distributions"
                  10 1 :k-means++ [:jumps :distortions :squared-errors])

Squared errors for four well-separated gaussian points

Distortions for four well-separated gaussian points

Jumps for four well-separated gaussian points

So, a relatively clear squared-errors chart, a noisy distortions chart, and a crystal-clear jump chart. Compare to our reproductions of Figures 4b(i), 4b(ii) and 4b(iii):

(i/view (incc/scatter-plot (map first four-with-overlap)
                           (map second four-with-overlap)
                           :title "A simulated data set with four somewhat overlapping Gaussian clusters [Figure 4b(i)]"))

Four gaussian points with overlap

(graph-k-analysis four-with-overlap "Four somewhat-overlapping Gaussian distributions"
                  10 1 :k-means++ [:jumps :distortions :squared-errors])

Squared errors for four gaussian points with overlap

Distortions for four gaussian points with overlap

Jumps for four gaussian points with overlap

...and compare all six of those with our reproductions of Figures 4c(i), 4c(ii) and 4c(iii):

(i/view (incc/scatter-plot (map first four-mingled)
                           (map second four-mingled)
                           :title "A simulated data set with four barely-distinguishable Gaussian clusters [Figure 4b(i)]"))

Four mingled gaussian points

(graph-k-analysis four-mingled "Four barely-distinguishable Gaussian distributions"
                  10 1 :k-means++ [:jumps :distortions :squared-errors])

Squared errors for four mingled gaussian points

Distortions for four mingled gaussian points

Jumps for four mingled gaussian points

Don't forget to compare the maximum magnitude of jumps for the well-mingled versus the well-separated.
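
If you'd rather compare those maxima numerically than eyeball the charts, here's a quick one-off sketch of mine (not one of the series' functions) that reuses the pieces defined above. It's a single k-means run per data set, so the exact numbers will vary with the random draws and initialization:

(doseq [[label data] [["well-separated" four-separate]
                      ["overlapping"    four-with-overlap]
                      ["mingled"        four-mingled]]]
  ;; build clusterers for k=1..10, transform with y=1, and report the largest jump
  (println label
           (apply max (map :jump (assoc-jumps
                                  (assoc-distortions
                                   (clusterers-over-ks data label 10 :k-means++)
                                   1))))))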

What I find insightful about the juxtaposition of these six subfigures is the lasting utility of the jump method even as other methods diminish in power. This is the point at which I sat up and took notice when first reading the paper. It may be the clearest argument in favor of the jump method. See how the squared-errors graph for 4c is almost unreadable, whereas the jump graph is so confident? See how the transformed-distortions graphs are simply the wrong measure?

I did however find one strangeness. My evaluation of initialization methods was not rigorous, but I occasionally (on the order of 1 in 10) found a convergence on k=5 when using random initialization. To explore this phenomenon in more depth, I re-defined the data set as a function (for easier re-randomization), and defined a function that graphs a 2-dimensional data set along with its jump graph under each of the four initialization methods:

(defn four-separate-fresh []
  (mapcat (fn [coords] (repeatedly 100 #(gaussian-point coords 1))) four-points))

(defn see-graph-and-jumps
  "Graphs the 2-dimensional `data` as a scatter plot, then graphs its
  jumps from k=1 to `max-k` under each of the four initialization
  methods."
  [data ds-name max-k]
  (i/view (incc/scatter-plot (map first data)
                             (map second data)))
  (graph-k-analysis data (str ds-name ", random init") max-k 1 :random [:jumps])
  (graph-k-analysis data (str ds-name ", k-means++ init") max-k 1 :k-means++ [:jumps])
  (graph-k-analysis data (str ds-name ", canopy init") max-k 1 :canopy [:jumps])
  (graph-k-analysis data (str ds-name ", farthest-first init") max-k 1 :farthest-first [:jumps]))

(repeatedly 10 #(see-graph-and-jumps (four-separate-fresh)
                                     "Four well-separated Gaussians" 10))

(I'm not going to clutter this document with 40 graphs, but I encourage you to eval this in your own REPL.)

It doesn't happen often, but for some random distributions, a fifth cluster is detected by random initialization. When interpreting this, I try to keep my skeptic's hat on: maybe random initialization is not a great method, or maybe the data is just distributed in such a way that, look, there really are five clusters, and the other initialization methods just aren't finding them. It's not always cut and dried. For instance, the magnitude of the max jump when the minority report said k=5 didn't seem any different. And in some instances, k-means++ said k=5 on its own, or in agreement with random initialization! Yet again we find ourselves wanting a more rigorous judgment tool.

Part of the problem is that we're looking at a small number of iterations of a random distribution. Sometimes the roll of the Gaussian dice will be weird. But even when the data set isn't made of Gaussian points we still have a reasonable lack of confidence in any single k-means run. First, we'd like to hedge against the random factors (such as initialization) that affect k-means results. Second, we'd like to hedge against the failings of our data. Remember, any data is merely an approximation of reality. We never have more than a mere sample of the underlying mechanisms, which are what we are really after.

The path to a better read on the underlying reality is to get more data. "But we don't have more data," you cry. True, but we can create it. We can create new sample data sets from the data set we already have by drawing data points from it at random, with replacement. This technique is called resampling, specifically bootstrapping. Each resampled data set (called a "bootstrap sample") is the same size as the original, but because we draw with replacement, some original points are left out and others appear more than once.
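
To make the idea concrete before reaching for Incanter, here's a minimal sketch of mine, using a toy vector in place of a real data set, of a single bootstrap draw:

(let [original [:a :b :c :d :e]]
  ;; draw with replacement: some values will repeat, others will be left out
  (repeatedly (count original) #(rand-nth original)))
;; => e.g. (:b :e :b :a :d)

Incanter's (bootstrap), used below with rand-nth, gives us the same kind of with-replacement sample without the hand-rolling.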

Sugar and James introduce the bootstrapping process like so:

...for the breast cancer data the jumps at K = 1 and 2 are by far the largest, strongly indicating that there are no more than two clusters in the data. However, for the auto data there appear to be many reasonable choices for the estimate of G. Next we develop some more formal approaches for assessing the certainty in the choice of the number of clusters.

Ideally one wishes to estimate the variability associated with each jump in order to test for statistical significance. A natural approach to this problem is to use the bootstrap (Efron and Tibshirani, 1993). Simply draw with replacement from the given data set to produce a bootstrap sample with the same number of observations as the original and calculate the jumps associated with this new data set. Repeat this process [some number of] times.

Bootstrapping uses the law of large numbers to reduce the meddling power of randomness in our analysis, and lets us look at what the reality underlying our data set is probably like. We'll give bootstrapping a shot, first with a numeric approach, then visually.

Bootstrapping a numeric answer

From section 5 of Sugar & James:

[One] approach is to calculate, for each value of K, the fraction of bootstrap data sets that have their maximum jump at K. One can then take as a (1-a)*100% confidence interval the smallest collection of K’s that account for at least 1-a of the total. For example, for the iris data 99% of all bootstrapped data sets had their maximum jump at either K = 2 or 3 so a 99% confidence interval would consist of these two numbers. For the breast cancer data the jump method selected K = 2 for all 100 bootstrap data sets so any confidence interval for this data would contain just the value two.

First we'll refactor some prior code, for concision's sake. For instance, this function doesn't do anything new, but hides some implementation details:

(defn run-clusters
  "Returns a seq of built clusterers with distortions and jumps
  calculated for the given `dataset` matrix, named `ds-name`, from k=1
  to `max-k`, using transformation power `y` and initializing the
  k-means clusterers with `init-method`."
  [dataset ds-name max-k y init-method]
  (-> (clusterers-over-ks dataset ds-name max-k init-method)
      (assoc-distortions y)
      assoc-jumps))

To peek inside the internals of the process, I'll eval run-clusters on just a few values of k for the iris data:

(run-clusters (i/sel iris :cols (range 4)) "iris" 3 1 :farthest-first)
=> ({:number-clusters 1, :centroids {0 #object[weka.core.DenseInstance 0x5c9238d3 "5.843333,3.057333,3.758,1.199333"]}, :cluster-sizes {0 150}, :squared-error 41.166110421373254, :distortion 0.024291826207627442, :jump 0.024291826207627442} 
    {:number-clusters 2, :centroids {0 #object[weka.core.DenseInstance 0x4ac5c3ed "6.262,2.872,4.906,1.676"], 1 #object[weka.core.DenseInstance 0x1ca6a139 "5.006,3.428,1.462,0.246"]}, :cluster-sizes {0 100, 1 50}, :squared-error 12.127790750538196, :distortion 0.08245524849244475, :jump 0.0581634222848173}
    {:number-clusters 3, :centroids {0 #object[weka.core.DenseInstance 0x2e6a94e5 "5.888525,2.737705,4.396721,1.418033"], 1 #object[weka.core.DenseInstance 0x63f3755b "5.006,3.428,1.462,0.246"], 2 #object[weka.core.DenseInstance 0x7cd0863 "6.846154,3.082051,5.702564,2.079487"]}, :cluster-sizes {0 61, 1 50, 2 39}, :squared-error 6.982216473785234, :distortion 0.14322099633468896, :jump 0.060765747842244214})

We'll also need a small function to calculate the number of clusters associated with the maximum jump value:

(defn best-k
  "Returns the number of clusters associated with the greatest jump
  value in a seq of WEKA k-means clusterers, each of which has had its
  jumps calculated. Intended for use in determining the best k among a
  range of k values."
  [clusterers]
  (:number-clusters (first (sort-by :jump > clusterers))))

...which will, if we pass in that short subset of iris clusters we made above, tell us which k won:

(best-k (run-clusters (i/sel iris :cols (range 4)) "iris" 3 1 :farthest-first))
=> 3

Notice what this function does not tell us: the shape of the jump curve, by how much this value of k won, or what other values of k were runners-up.
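
If we do want those runners-up, a quick one-off (a sketch of mine, not one of the series' functions) is to sort the same clusterers by jump and keep both the k and the jump value:

(->> (run-clusters (i/sel iris :cols (range 4)) "iris" 3 1 :farthest-first)
     (sort-by :jump >)
     (map (juxt :number-clusters :jump)))
;; => e.g. ([3 0.0608] [2 0.0582] [1 0.0243]), matching the run shown above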

Now we create a wee little iris bootstrap data set. This will create a list of 150 data points, each drawn at random (with replacement) from the actual iris data points. (Fun fact: leaving out the to-vect call will slow down later computations by at least one order of magnitude.)

(defn iris-bootstrap []
  (incs/bootstrap (i/to-vect (i/sel iris :cols (range 4))) rand-nth :size 150))

Just one possible run:

(iris-bootstrap)
=> ([5.0 3.2 1.2 0.2] [7.4 2.8 6.1 1.9] [6.0 2.9 4.5 1.5] [5.5 2.3 4.0 1.3] [6.7 3.1 4.4 1.4]
    [4.6 3.4 1.4 0.3] [5.7 2.5 5.0 2.0] [6.1 2.8 4.7 1.2] [5.5 4.2 1.4 0.2] [4.7 3.2 1.3 0.2]
    [5.1 3.5 1.4 0.3] [6.6 2.9 4.6 1.3] [6.5 3.0 5.5 1.8] [5.1 3.5 1.4 0.2] [7.2 3.6 6.1 2.5]
    [4.6 3.1 1.5 0.2] [6.3 2.5 4.9 1.5] [6.6 3.0 4.4 1.4] [5.0 3.0 1.6 0.2] [6.3 2.7 4.9 1.8]
    [5.0 3.6 1.4 0.2] [6.8 3.2 5.9 2.3] [5.8 2.6 4.0 1.2] [6.7 3.0 5.0 1.7] [6.8 3.2 5.9 2.3]
    [5.6 3.0 4.1 1.3] [6.7 2.5 5.8 1.8] [6.8 3.2 5.9 2.3] [6.4 2.7 5.3 1.9] [5.7 2.8 4.5 1.3]
    [6.2 2.9 4.3 1.3] [5.1 3.8 1.5 0.3] [6.7 3.3 5.7 2.1] [4.7 3.2 1.6 0.2] [5.5 2.4 3.7 1.0]
    [6.4 2.8 5.6 2.2] [4.8 3.4 1.6 0.2] [5.4 3.9 1.7 0.4] [4.9 3.0 1.4 0.2] [4.6 3.2 1.4 0.2]
    [6.0 2.7 5.1 1.6] [5.4 3.4 1.5 0.4] [5.8 2.7 3.9 1.2] [4.4 3.0 1.3 0.2] [5.1 3.8 1.5 0.3]
    [5.8 2.7 4.1 1.0] [6.5 2.8 4.6 1.5] [6.0 3.4 4.5 1.6] [6.4 2.8 5.6 2.1] [4.6 3.2 1.4 0.2]
    [7.0 3.2 4.7 1.4] [6.7 3.1 4.4 1.4] [4.8 3.1 1.6 0.2] [7.2 3.0 5.8 1.6] [5.1 3.5 1.4 0.2]
    [5.5 2.4 3.7 1.0] [6.6 2.9 4.6 1.3] [7.7 3.8 6.7 2.2] [6.3 2.3 4.4 1.3] [6.7 3.0 5.2 2.3]
    [5.7 2.6 3.5 1.0] [5.1 3.8 1.6 0.2] [6.5 3.2 5.1 2.0] [6.0 3.0 4.8 1.8] [6.6 2.9 4.6 1.3]
    [6.1 2.8 4.0 1.3] [5.2 2.7 3.9 1.4] [6.6 3.0 4.4 1.4] [6.7 3.0 5.0 1.7] [5.0 3.2 1.2 0.2]
    [6.4 2.7 5.3 1.9] [6.1 3.0 4.9 1.8] [5.7 2.8 4.5 1.3] [7.7 3.8 6.7 2.2] [4.7 3.2 1.6 0.2]
    [4.9 3.1 1.5 0.2] [6.0 2.2 5.0 1.5] [6.0 2.9 4.5 1.5] [5.0 3.4 1.6 0.4] [6.0 2.7 5.1 1.6]
    [7.4 2.8 6.1 1.9] [5.8 4.0 1.2 0.2] [6.0 2.7 5.1 1.6] [6.3 2.5 5.0 1.9] [4.8 3.0 1.4 0.3]
    [5.0 2.3 3.3 1.0] [5.8 2.6 4.0 1.2] [6.3 2.7 4.9 1.8] [7.4 2.8 6.1 1.9] [6.3 2.3 4.4 1.3]
    [5.1 3.8 1.9 0.4] [6.4 2.9 4.3 1.3] [4.8 3.1 1.6 0.2] [5.3 3.7 1.5 0.2] [4.6 3.2 1.4 0.2]
    [5.9 3.2 4.8 1.8] [6.5 3.0 5.8 2.2] [6.5 3.0 5.2 2.0] [6.0 2.2 4.0 1.0] [5.1 3.5 1.4 0.3]
    [4.6 3.4 1.4 0.3] [7.2 3.0 5.8 1.6] [6.7 2.5 5.8 1.8] [6.4 2.7 5.3 1.9] [5.4 3.9 1.3 0.4]
    [7.7 2.8 6.7 2.0] [5.0 3.2 1.2 0.2] [5.8 2.7 4.1 1.0] [5.1 3.5 1.4 0.3] [4.4 3.2 1.3 0.2]
    [5.6 2.8 4.9 2.0] [5.2 2.7 3.9 1.4] [5.5 2.6 4.4 1.2] [5.6 2.9 3.6 1.3] [6.8 3.0 5.5 2.1]
    [6.0 2.7 5.1 1.6] [5.8 2.7 5.1 1.9] [4.6 3.4 1.4 0.3] [7.2 3.2 6.0 1.8] [4.8 3.0 1.4 0.3]
    [7.2 3.0 5.8 1.6] [5.5 2.6 4.4 1.2] [5.2 2.7 3.9 1.4] [5.3 3.7 1.5 0.2] [6.8 3.2 5.9 2.3]
    [5.1 2.5 3.0 1.1] [7.0 3.2 4.7 1.4] [5.7 2.8 4.5 1.3] [5.1 3.5 1.4 0.2] [5.4 3.9 1.7 0.4]
    [5.6 2.7 4.2 1.3] [5.7 3.8 1.7 0.3] [5.7 2.5 5.0 2.0] [4.9 2.5 4.5 1.7] [6.7 3.3 5.7 2.5]
    [5.8 2.7 4.1 1.0] [6.8 3.2 5.9 2.3] [5.1 3.8 1.5 0.3] [6.7 3.1 5.6 2.4] [4.6 3.1 1.5 0.2]
    [5.6 2.9 3.6 1.3] [5.7 4.4 1.5 0.4] [6.3 2.9 5.6 1.8] [6.0 2.2 5.0 1.5] [6.3 3.3 6.0 2.5]
    [6.7 3.0 5.0 1.7] [5.0 3.2 1.2 0.2] [6.3 2.3 4.4 1.3] [4.9 3.6 1.4 0.1] [6.4 2.8 5.6 2.1])

Compare that to the actual iris data:

(i/to-vect (i/sel iris :cols (range 4)))
=> [[5.1 3.5 1.4 0.2] [4.9 3.0 1.4 0.2] [4.7 3.2 1.3 0.2] [4.6 3.1 1.5 0.2] [5.0 3.6 1.4 0.2]
    [5.4 3.9 1.7 0.4] [4.6 3.4 1.4 0.3] [5.0 3.4 1.5 0.2] [4.4 2.9 1.4 0.2] [4.9 3.1 1.5 0.1]
    [5.4 3.7 1.5 0.2] [4.8 3.4 1.6 0.2] [4.8 3.0 1.4 0.1] [4.3 3.0 1.1 0.1] [5.8 4.0 1.2 0.2]
    [5.7 4.4 1.5 0.4] [5.4 3.9 1.3 0.4] [5.1 3.5 1.4 0.3] [5.7 3.8 1.7 0.3] [5.1 3.8 1.5 0.3]
    [5.4 3.4 1.7 0.2] [5.1 3.7 1.5 0.4] [4.6 3.6 1.0 0.2] [5.1 3.3 1.7 0.5] [4.8 3.4 1.9 0.2]
    [5.0 3.0 1.6 0.2] [5.0 3.4 1.6 0.4] [5.2 3.5 1.5 0.2] [5.2 3.4 1.4 0.2] [4.7 3.2 1.6 0.2]
    [4.8 3.1 1.6 0.2] [5.4 3.4 1.5 0.4] [5.2 4.1 1.5 0.1] [5.5 4.2 1.4 0.2] [4.9 3.1 1.5 0.2]
    [5.0 3.2 1.2 0.2] [5.5 3.5 1.3 0.2] [4.9 3.6 1.4 0.1] [4.4 3.0 1.3 0.2] [5.1 3.4 1.5 0.2]
    [5.0 3.5 1.3 0.3] [4.5 2.3 1.3 0.3] [4.4 3.2 1.3 0.2] [5.0 3.5 1.6 0.6] [5.1 3.8 1.9 0.4]
    [4.8 3.0 1.4 0.3] [5.1 3.8 1.6 0.2] [4.6 3.2 1.4 0.2] [5.3 3.7 1.5 0.2] [5.0 3.3 1.4 0.2]
    [7.0 3.2 4.7 1.4] [6.4 3.2 4.5 1.5] [6.9 3.1 4.9 1.5] [5.5 2.3 4.0 1.3] [6.5 2.8 4.6 1.5]
    [5.7 2.8 4.5 1.3] [6.3 3.3 4.7 1.6] [4.9 2.4 3.3 1.0] [6.6 2.9 4.6 1.3] [5.2 2.7 3.9 1.4]
    [5.0 2.0 3.5 1.0] [5.9 3.0 4.2 1.5] [6.0 2.2 4.0 1.0] [6.1 2.9 4.7 1.4] [5.6 2.9 3.6 1.3]
    [6.7 3.1 4.4 1.4] [5.6 3.0 4.5 1.5] [5.8 2.7 4.1 1.0] [6.2 2.2 4.5 1.5] [5.6 2.5 3.9 1.1]
    [5.9 3.2 4.8 1.8] [6.1 2.8 4.0 1.3] [6.3 2.5 4.9 1.5] [6.1 2.8 4.7 1.2] [6.4 2.9 4.3 1.3]
    [6.6 3.0 4.4 1.4] [6.8 2.8 4.8 1.4] [6.7 3.0 5.0 1.7] [6.0 2.9 4.5 1.5] [5.7 2.6 3.5 1.0]
    [5.5 2.4 3.8 1.1] [5.5 2.4 3.7 1.0] [5.8 2.7 3.9 1.2] [6.0 2.7 5.1 1.6] [5.4 3.0 4.5 1.5]
    [6.0 3.4 4.5 1.6] [6.7 3.1 4.7 1.5] [6.3 2.3 4.4 1.3] [5.6 3.0 4.1 1.3] [5.5 2.5 4.0 1.3]
    [5.5 2.6 4.4 1.2] [6.1 3.0 4.6 1.4] [5.8 2.6 4.0 1.2] [5.0 2.3 3.3 1.0] [5.6 2.7 4.2 1.3]
    [5.7 3.0 4.2 1.2] [5.7 2.9 4.2 1.3] [6.2 2.9 4.3 1.3] [5.1 2.5 3.0 1.1] [5.7 2.8 4.1 1.3]
    [6.3 3.3 6.0 2.5] [5.8 2.7 5.1 1.9] [7.1 3.0 5.9 2.1] [6.3 2.9 5.6 1.8] [6.5 3.0 5.8 2.2]
    [7.6 3.0 6.6 2.1] [4.9 2.5 4.5 1.7] [7.3 2.9 6.3 1.8] [6.7 2.5 5.8 1.8] [7.2 3.6 6.1 2.5]
    [6.5 3.2 5.1 2.0] [6.4 2.7 5.3 1.9] [6.8 3.0 5.5 2.1] [5.7 2.5 5.0 2.0] [5.8 2.8 5.1 2.4]
    [6.4 3.2 5.3 2.3] [6.5 3.0 5.5 1.8] [7.7 3.8 6.7 2.2] [7.7 2.6 6.9 2.3] [6.0 2.2 5.0 1.5]
    [6.9 3.2 5.7 2.3] [5.6 2.8 4.9 2.0] [7.7 2.8 6.7 2.0] [6.3 2.7 4.9 1.8] [6.7 3.3 5.7 2.1]
    [7.2 3.2 6.0 1.8] [6.2 2.8 4.8 1.8] [6.1 3.0 4.9 1.8] [6.4 2.8 5.6 2.1] [7.2 3.0 5.8 1.6]
    [7.4 2.8 6.1 1.9] [7.9 3.8 6.4 2.0] [6.4 2.8 5.6 2.2] [6.3 2.8 5.1 1.5] [6.1 2.6 5.6 1.4]
    [7.7 3.0 6.1 2.3] [6.3 3.4 5.6 2.4] [6.4 3.1 5.5 1.8] [6.0 3.0 4.8 1.8] [6.9 3.1 5.4 2.1]
    [6.7 3.1 5.6 2.4] [6.9 3.1 5.1 2.3] [5.8 2.7 5.1 1.9] [6.8 3.2 5.9 2.3] [6.7 3.3 5.7 2.5]
    [6.7 3.0 5.2 2.3] [6.3 2.5 5.0 1.9] [6.5 3.0 5.2 2.0] [6.2 3.4 5.4 2.3] [5.9 3.0 5.1 1.8]]

The bootstrap is the same size (150 data points), and each data-point is randomly selected from the actual iris data. So the first bootstrapped data point:

[5.0 3.2 1.2 0.2]

...is repeated in the bootstrap sample, and is also the 36th data point in the original iris sample. So we have a kind of subset of the real data, with some original values missing and others doubled.
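
As a quick sanity check (a sketch of mine, not part of the original walkthrough), every point in a bootstrap sample should be an exact copy of some row in the original data, since we only draw with replacement and never invent values:

(let [original-rows (set (i/to-vect (i/sel iris :cols (range 4))))]
  ;; every bootstrapped row must appear verbatim among the original iris rows
  (every? original-rows (iris-bootstrap)))
;; => true

With that confirmed, let's take a quick look: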

(graph-pca (i/to-matrix (i/to-dataset (iris-bootstrap))) (range 4))

A data set bootstrapped from Fisher's irises

All in order: basically the same as the original iris data, but with a few points missing, and a few points doubled. Let's put together the functions we made above so we can run clusters over some bootstrapped iris data sets, then find the best ks from those runs.

(defn best-ks
  "Runs k-means clustering analysis over `b` number of bootstrapped
  versions of the dataset named `ds-name`, evaluating number of
  clusters from 1 to `max-k`, using transformation power `y`,
  bootstrapping the data using `bootstrap-fn` (see
  Incanter's (bootstrap), in the statistics namespace, for more on the
  desired shape of that function), initializing k-means using
  `init-method`. Returns the best ks from those runs."
  [ds-name max-k y init-method b bootstrap-fn]
  (map best-k (map (fn [ds] (run-clusters ds ds-name max-k y init-method))
                   (repeatedly b bootstrap-fn))))

We def the best-ks of our iris bootstraps, since running a hundred of them is just computationally expensive enough to break my concentration:

(def iris-ks (best-ks "Bootstrapped Irises" 10 1 :farthest-first
                      100 iris-bootstrap))

iris-ks
=> (5 4 2 7 2 5 3 3 5 8 6 2 2 9 2 2 10 3 2 10 10 3 9 6 2 3 7 3 5 10 3 10 6 9 3 5 3 10 10 8 10 5 3 2 3 10 2 3 3 2 5 5 3 9 10 3 10 4 9 4 2 8 2 8 2 8 7 5 4 4 3 3 3 3 4 2 6 8 2 3 5 4 9 3 4 3 3 2 10 3 3 3 8 2 3 2 3 9 2 10)

To get a handle on that result, I recommend the stalwart (frequencies):

(frequencies iris-ks)
=> {7 3, 4 8, 6 4, 3 28, 2 20, 9 7, 5 10, 10 13, 8 7}

So k=3 wins, but k=2 makes a strong showing. Vying for third are k=10 and k=4. See?

(i/view (incc/histogram iris-ks))

Histogram of bootstrapped iris frequencies
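
Sugar & James phrase their criterion in terms of the fraction of bootstrap data sets whose maximum jump lands at each K. A throwaway helper (hypothetical, not one of the series' functions) turns the raw counts into those fractions:

(defn k-fractions
  "Returns a map from each k to the fraction of runs in `ks` that chose it."
  [ks]
  (let [n (count ks)]
    (into {} (for [[k c] (frequencies ks)] [k (/ c n)]))))

(k-fractions iris-ks)
;; => e.g. {3 7/25, 2 1/5, 10 13/100, 5 1/10, 4 2/25, ...}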

I find it useful to look at the raw frequencies across different analytic choices (i.e. y and initialization). (You may find you prefer to skip the raw numbers and just look at the histograms.) Let's try a different initialization method:

(frequencies (best-ks "Fisher's Irises" 10 1 :canopy 100 iris-bootstrap))
=> {7 8, 4 6, 6 2, 3 28, 2 20, 9 12, 5 3, 10 14, 8 7}

(frequencies (best-ks "Fisher's Irises" 10 1 :k-means++ 100 iris-bootstrap))
=> {7 9, 4 11, 6 5, 3 27, 2 9, 9 15, 5 4, 10 12, 8 8}

k=3 seems to be the consistent frontrunner across init methods. (And remember, we have a degree of confidence here because we're running 100 different bootstrap samplings of the original data set.) Let's try another value for y that we found useful earlier:

(frequencies (best-ks "Fisher's Irises" 10 (/ 2 3) :farthest-first
                      100 iris-bootstrap))
=> {3 4, 2 91, 1 2, 4 3}

(frequencies (best-ks "Fisher's Irises" 10 (/ 2 3) :canopy
                      100 iris-bootstrap))
=> {2 90, 3 7, 8 1, 4 1, 10 1}

(frequencies (best-ks "Fisher's Irises" 10 (/ 2 3) :k-means++
                      100 iris-bootstrap))
=> {2 77, 3 8, 10 2, 1 4, 4 4, 5 3, 8 1, 7 1}

Interestingly, k=2 wins even more handily than k=3 did with y=1.

Though I like to see the raw frequency data, to replicate Sugar & James' bootstrap calculation we need a function that will select from our best-ks result only those values within a given range:

(defn take-until-sum
  "Given a seq of `integers`, returns the smallest collection of
  integers that is equal or greater than `sum`, taking in order from
  most frequent to least frequent."
  [integers sum]
  (map first (reduce (fn [acc int-with-freq]
                       (if (>= (apply + (map second acc)) sum)
                         acc
                         (conj acc int-with-freq)))
                     []
                     (reverse (sort-by second (frequencies integers))))))

Remember that our last iris-ks run gave us k=3 in 28 out of 100 runs, so we should expect only 3 if we set our confidence threshold to 28:

(take-until-sum iris-ks 28)
=> (3)

Good. What about a 90% threshold?

(take-until-sum iris-ks 90)
=> (3 2 10 5 4 8 9)

Note that this differs significantly from Sugar & James' results:

for the iris data 99% of all bootstrapped data sets had their maximum jump at either K = 2 or 3 so a 99% confidence interval would consist of these two numbers.

We have only 48% confidence in k=[2,3]:

(take-until-sum iris-ks 48) => (3 2)
(take-until-sum iris-ks 49) => (3 2 10)

The cause of this discrepancy is unknown: a different bootstrap function, or transformation power, or initialization method? None of these details is specified in the paper.

What about the breast cancer data? Sugar & James got a maximum of k=2 for all 100 of their bootstrap runs.

So, we define a bootstrap function:

(defn wbcd-bootstrap []
  (incs/bootstrap (i/to-vect (i/sel wbcd :cols (range 1 10))) rand-nth
                  :size (count wbcd)))

(def wbcd-ks (best-ks "Wisconsin Breast Cancer Data" 10 2
                      :farthest-first 100 wbcd-bootstrap))

(frequencies wbcd-ks)
=> {2 98, 10 1, 9 1}

Wow! That kind of makes the next expression superfluous:

(take-until-sum wbcd-ks 90)
=> (2)

While Sugar & James got k=2 for all their bootstrap runs, we have to be satisfied with only 98% confidence. But I call those results congruent.

Next, the auto data. Sugar & James saw 97% confidence in k values 7 through 10, and 87% for k of 8, 9, and 10.

(defn auto-mpg-pca-bootstrap []
  (incs/bootstrap auto-mpg-pca rand-nth :size (count auto-mpg-pca)))

(def auto-mpg-ks (best-ks "Quinlan Auto MPG Data" 10 (/ 2 3) :farthest-first
                          100 auto-mpg-pca-bootstrap))

Not surprisingly, considering our earlier failure to reproduce the authors' results with the auto data, we get quite different confidence intervals. Instead of k >= 7, we see k=5 or 1:

(take-until-sum auto-mpg-ks 90)
=> (5 1)

Those results right there are a good example of why I prefer to see the frequencies themselves:

(frequencies auto-mpg-ks)
=> {5 55, 1 41, 6 1, 9 3}

(i/view (incc/histogram auto-mpg-ks))

Histogram of auto MPG K values

However, if we allow for a different transformation power—arguably justified by the strong non-normality of the data—we get a very different answer:

(frequencies (best-ks "Quinlan Auto MPG Data" 10 0.5 :farthest-first
                      100 auto-mpg-pca-bootstrap))
=> {1 100}

I find it difficult to draw any conclusions from these results. Those that I do see are divergent from Sugar & James: either there are no well-differentiated clusters to find, or there are 5. I do not see the high-k results centered at k=8 that they found.

That wraps up our numeric overview of bootstrapping these data sets.

Even Further Up By Our Own Bootstraps: Graphing Confidence Intervals

Let's now look at our bootstrapped runs visually. What we want is to graph the jump values found for each value of k across all the bootstrap runs. So instead of using best-ks, we need a function that runs a bootstrap function, extracts the relevant information (that is, the jump and k values for each run), and then rearranges the data to "pivot" around the values of k:

(defn bootstrap-jumps
  "Runs k-means from k=1 to `max-k` over `b` bootstrap samples drawn by
  `bootstrap-fn`, using transformation power `y` and initialization
  method `init-method`. Returns a map from each k to a map of the
  :jumps observed for that k across all runs."
  [max-k y init-method b bootstrap-fn]
  (reduce (fn [acc m]
            (if (map? (get acc (:number-clusters m)))
              (assoc acc (:number-clusters m)
                     {:jumps (conj (:jumps (get acc (:number-clusters m)))
                                   (:jump m))})
              (assoc acc (:number-clusters m)
                     {:jumps [(:jump m)]})))
          {}
          (map (fn [c] (select-keys c [:number-clusters :jump]))
               (flatten (map (fn [ds] (run-clusters ds "" max-k y
                                                   init-method))
                             (repeatedly b bootstrap-fn))))))

For brevity, to test it we run only 3 bootstraps with a maximum k of 5:

(bootstrap-jumps 5 1 :farthest-first 3 iris-bootstrap)

=> {1 {:jumps [0.02313369849903442 0.02395268938256462 0.023295688222614658]},
    2 {:jumps [0.06185816874358191 0.05159555503290575 0.0491510179008848]},
    3 {:jumps [0.010201591025008297 0.06090357682566086 0.061169713475631465]},
    4 {:jumps [0.07463034789608436 0.023000205324643724 0.018466556442923415]},
    5 {:jumps [0.050654060248076616 0.032282155650261796 0.07174310326625652]}}

With our bootstrap data in the right shape, we now do some basic statistical calculations. Instead of plotting a line for every result, we want to plot quantiles to show the gist of our bootstrap results. In addition to a median line, we'll also have an upper and lower bound, which are simple to calculate using Incanter:

(defn assoc-quantiles
  "Given `cluster-data` in the shape returned by `bootstrap-jumps`,
  assocs into each k's map the `lower-quantile`, the median, and the
  `upper-quantile` of its :jumps."
  [cluster-data lower-quantile upper-quantile]
  (reduce (fn [acc [k k-jumps]]
            (assoc acc k
                   (assoc k-jumps
                          :lower-quantile (incs/quantile (:jumps k-jumps)
                                                         :probs lower-quantile)
                          :median-quantile (incs/quantile (:jumps k-jumps)
                                                          :probs 0.5)
                          :upper-quantile (incs/quantile (:jumps k-jumps)
                                                         :probs upper-quantile))))
          {}
          cluster-data))

We take it for a spin using a 90% confidence interval:

(assoc-quantiles (bootstrap-jumps 5 1 :farthest-first 3 iris-bootstrap)
                 0.05 0.95)
                 
=> {1 {:jumps [0.025104190791114183 0.02300347386388632 0.02389335800806178],
       :lower-quantile 0.023092462278303867, :median-quantile 0.02389335800806178,
       :upper-quantile 0.024983107512808942},
    2 {:jumps [0.051177757986150305 0.06132583175883856 0.055262971561601607],
       :lower-quantile 0.05158627934369544, :median-quantile 0.055262971561601607,
       :upper-quantile 0.060719545739114866},
    3 {:jumps [0.061462546297355924 0.06314867004133833 0.059280288636742964],
       :lower-quantile 0.059498514402804265, :median-quantile 0.061462546297355924,
       :upper-quantile 0.06298005766694008},
    4 {:jumps [0.05190180424959376 0.02373785616340965 0.022886043695459518],
       :lower-quantile 0.02297122494225453, :median-quantile 0.02373785616340965,
       :upper-quantile 0.04908540944097534},
    5 {:jumps [0.033767342040476545 0.015875098436963497 0.023484122838645954],
       :lower-quantile 0.016636000877131744, :median-quantile 0.023484122838645954,
       :upper-quantile 0.032739020120293484}}

That gives us everything we need to plot our graph, which will use confidence intervals to illustrate where the majority of k-means runs landed. Now we need a function that draws confidence intervals, which is merely a wrapper around Incanter's xy-plot:

(defn confidence-intervals-plot
  "Plots a line graph with confidence intervals, using the given
  `title`, median x and y points as seqs `xs` and `ys`, the lower
  confidence bound as x and y points `min-xs` and `min-ys`, and the
  upper confidence bound as `max-xs` and `max-ys`."
  [title xs medians mins maxes]
  (doto (incc/xy-plot xs medians :title title)
    (incc/set-stroke-color java.awt.Color/gray :series 0)
    (incc/set-stroke :width 2 :series 0)
    (incc/add-lines xs mins)
    (incc/add-lines xs maxes)
    (incc/set-stroke :dataset 1 :dash 5)
    (incc/set-stroke :dataset 2 :dash 5)
    (i/view)))

Now we put it all together: this function runs the bootstraps and gives that data to confidence-intervals-plot in the expected format:

(defn graph-bootstrapped-jumps
  "Produce a graph, with confidence intervals, of the jumps in k-means
  distortion across values of k for `b` bootstrappings of the original
  data set. `ds-name` must be a string describing the data set;
  `max-k` the maximum integer number of clusters to evaluate; the
  transformation power `y` must be a number; the `init-method` must be
  a keyword belonging to `init-codes`; the integer number of data sets
  to create using bootstrap with replacements is `b`; the upper and
  lower quantiles for the confidence intervals must be passed as a
  two-element vector."
  [ds-name max-k y init-method b bootstrap-fn [lower-quantile upper-quantile]]
  (let [data (into (sorted-map)
                   (assoc-quantiles (bootstrap-jumps max-k y init-method
                                                     b bootstrap-fn)
                                    lower-quantile upper-quantile))]
    (confidence-intervals-plot (str "Clusters in " ds-name ": "
                                    (int (- (* 100 upper-quantile)
                                            (* 100 lower-quantile)))
                                    "% confidence intervals")
                               (keys data)
                               (map :median-quantile (vals data))
                               (map :lower-quantile (vals data))
                               (map :upper-quantile (vals data)))))

We'll try it out by replicating Sugar & James' figure 6a:

(graph-bootstrapped-jumps "Fisher's Irises bootstrapped: y=1, farthest-first"
                          10 1 :farthest-first 100 iris-bootstrap [0.05 0.95])

Fisher's Irises bootstrapped: y=1, farthest-first

(graph-bootstrapped-jumps "Fisher's Irises bootstrapped: y=1, canopy"
                          10 1 :canopy 100 iris-bootstrap [0.05 0.95])

Fisher's Irises bootstrapped: y=1, canopy

(graph-bootstrapped-jumps "Fisher's Irises bootstrapped: y=1, k-means++"
                          10 1 :k-means++ 100 iris-bootstrap [0.05 0.95])

Fisher's Irises bootstrapped: y=1, k-means++

(graph-bootstrapped-jumps "Fisher's Irises bootstrapped: y=2/3, farthest-first"
                          10 (/ 2 3) :farthest-first 100 iris-bootstrap [0.05 0.95])

Fisher's Irises bootstrapped: y=2/3, farthest-first

(graph-bootstrapped-jumps "Fisher's Irises bootstrapped: y=2/3, canopy"
                          10 (/ 2 3) :canopy 100 iris-bootstrap [0.05 0.95])

Fisher's Irises bootstrapped: y=2/3, canopy

(graph-bootstrapped-jumps "Fisher's Irises bootstrapped: y=2/3, k-means++"
                          10 (/ 2 3) :k-means++ 100 iris-bootstrap [0.05 0.95])

Fisher's Irises bootstrapped: y=2/3, k-means++

I get two major differences from Sugar & James' results: a plateau between k=2 and k=3 instead of a peak at k=2 (when using y=1, at least), and a flagrantly wider confidence interval for k>=3. There are a number of possible causes for these differences, but generally speaking we're in the same ballpark. Notice as well that there is slight but not tremendous variation across the (non-random) initialization methods. That will not necessarily be true for all data sets.

Moving on, we can attempt to recreate Figure 6b. (The initialization method doesn't make much difference here. Experiment yourself if you don't believe me.)

(graph-bootstrapped-jumps "Wisconsin Breast Cancer bootstrapped: y=2, farthest-first"
                          10 2 :farthest-first 100 wbcd-bootstrap [0.05 0.95])

Wisconsin Breast Cancer bootstrapped: y=2, farthest-first

Using y=2 gives strong k=2 results across initialization methods, with great confidence. The results of trying y=3 agree:

(graph-bootstrapped-jumps "Wisconsin Breast Cancer bootstrapped: y=3, farthest-first"
                          10 3 :farthest-first 100 wbcd-bootstrap [0.05 0.95])

Wisconsin Breast Cancer bootstrapped: y=3, farthest-first

...whereas y=4 gives, as before, basically a flat line:

(graph-bootstrapped-jumps "Wisconsin Breast Cancer bootstrapped: y=4, farthest-first"
                          10 4 :farthest-first 100 wbcd-bootstrap [0.05 0.95])

Wisconsin Breast Cancer bootstrapped: y=4, farthest-first

Using y=1, following the authors, tells us k=1 just as before:

(graph-bootstrapped-jumps "Wisconsin Breast Cancer bootstrapped: y=1, farthest-first"
                          10 1 :farthest-first 100 wbcd-bootstrap [0.05 0.95])

Wisconsin Breast Cancer bootstrapped: y=1, farthest-first

I am not sure what configuration differences cause such disparity between the choices of y I find useful and the ones Sugar & James do. Regardless, given what we know about selecting y, a value between 1 and 4 is a reasonable range to use, and in the end our results are not too different.
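
If you want to scan that range without copy-pasting, a quick sketch (mine, not from the original) sweeps several transformation powers in one pass:

(doseq [y [1 2 3 4]]
  ;; one bootstrapped jump chart per transformation power
  (graph-bootstrapped-jumps (str "Wisconsin Breast Cancer bootstrapped: y=" y
                                 ", farthest-first")
                            10 y :farthest-first 100 wbcd-bootstrap [0.05 0.95]))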

Now to figure 6c, the bootstrapped auto data. Across a variety of init methods:

(graph-bootstrapped-jumps "Quinlan Auto MPG Data Set, y=2/3, k-means++"
                          10 (/ 2 3) :k-means++ 100 auto-mpg-pca-bootstrap [0.05 0.95])

Quinlan Auto MPG Data Set, y=2/3, k-means++

(graph-bootstrapped-jumps "Quinlan Auto MPG Data Set, y=1, k-means++"
                          10 1 :k-means++ 100 auto-mpg-pca-bootstrap [0.05 0.95])

Quinlan Auto MPG Data Set, y=1, k-means++

(graph-bootstrapped-jumps "Quinlan Auto MPG Data Set, y=2/3, canopy"
                          10 (/ 2 3) :canopy 100 auto-mpg-pca-bootstrap [0.05 0.95])

Quinlan Auto MPG Data Set, y=2/3, canopy

(graph-bootstrapped-jumps "Quinlan Auto MPG Data Set, y=1, canopy"
                          10 1 :canopy 100 auto-mpg-pca-bootstrap [0.05 0.95])

Quinlan Auto MPG Data Set, y=1, canopy

(graph-bootstrapped-jumps "Quinlan Auto MPG Data Set, y=2/3, farthest-first"
                          10 (/ 2 3) :farthest-first 100 auto-mpg-pca-bootstrap [0.05 0.95])

Quinlan Auto MPG Data Set, y=2/3, farthest-first

(graph-bootstrapped-jumps "Quinlan Auto MPG Data Set, y=1, farthest-first"
                          10 1 :farthest-first 100 auto-mpg-pca-bootstrap [0.05 0.95])

Quinlan Auto MPG Data Set, y=1, farthest-first

None of these gives us the k=8 spike that Sugar & James got. Instead, we see evidence for k=5, or the possibility that k=1, or that whatever clusters are found are only weakly grouped.

I'd like to know why these results differ from the authors', but I don't want to chase after their specific results. That way lies un-science. Instead, I am satisfied with seeing their techniques validated in the general case as an analytic tool.

EOF

By now, we have the jump method, bootstraps, and confidence intervals, both numeric and graphed, secured in our tool belt. That is to me the actualized promise of Sugar & James' method: running a large number of k-means clustering attempts across possible values of k, then both numerically and visually evaluating the jumps in distortion. One is able to fiddle with knobs like the transformation power and initialization method while maintaining relative confidence in each run, since each one consists of many clustering attempts. Notice, too, that the visualization and the numerical result buttress each other: the graph gives one a sense of the shape of the jump data, whereas calculating the maximum jump for X proportion of Y runs gives the sometimes-vague confidence interval visualization a concrete anchor.

I've explored Sugar & James' innovative information-theory-based technique for analyzing k-means clustering attempts by implementing it in Clojure and replicating their examples. I applied progressive complexity according to the different approaches that each data set made necessary. (There is plenty of room for refactoring the functions I've built for more general purposes, but I leave that for another project.) Along the way I hope we've learned some things about the nature of clustering, visualization, and analysis of machine learning results. Thanks for reading.
