Skip to content

Instantly share code, notes, and snippets.

@mahito-sugiyama
Last active January 2, 2021 10:20
Show Gist options
  • Star 5 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save mahito-sugiyama/ef54a3b17fff4629f106 to your computer and use it in GitHub Desktop.
Save mahito-sugiyama/ef54a3b17fff4629f106 to your computer and use it in GitHub Desktop.
k-means++ (written in R; with the Euclidean distance; distance computation is vectorized)
kmeansp2 <- function(x, k, iter.max = 10, nstart = 1, ...) {
n <- nrow(x) # number of data points
centers <- numeric(k) # IDs of centers
distances <- matrix(numeric(n * (k - 1)), ncol = k - 1) # distances[i, j]: The distance between x[i,] and x[centers[j],]
res.best <- list(tot.withinss = Inf) # the best result among <nstart> iterations
for (rep in 1:nstart) {
pr <- rep(1, n) # probability for sampling centers
for (i in 1:(k - 1)) {
centers[i] <- sample.int(n, 1, prob = pr) # Pick up the ith center
distances[, i] <- colSums((t(x) - x[centers[i], ])^2) # Compute (the square of) distances to the center
pr <- distances[cbind(1:n, max.col(-distances[, 1:i, drop = FALSE]))] # Compute probaiblity for the next sampling
}
centers[k] <- sample.int(n, 1, prob = pr)
## Perform k-means with the obtained centers
res <- kmeans(x, x[centers, ], iter.max = iter.max, nstart = 1, ...)
res$inicial.centers <- x[centers, ]
## Store the best result
if (res$tot.withinss < res.best$tot.withinss) {
res.best <- res
}
}
res.best
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment