H2O K-Means Auto-estimate K (wine data demo)
# H2O's K-Means algo can estimate the optimal number of clusters (method by Leland Wilkinson)
# http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html#estimating-k-in-k-means
#
# This demo is an extension of Kasia's blog post here:
# https://kkulma.github.io/2017-04-24-determining-optimal-number-of-clusters-in-your-data/
library(rattle) # wine data
library(h2o)
h2o.init() # Start a local H2O cluster (must be running before as.h2o() is called)
# Remove the factor col & convert to an H2O Frame
# Note: You can skip the scale() here since H2O K-Means standardizes automatically
data(wine)
wine <- as.h2o(wine[,-1])
# Train an H2O K-Means model and auto-estimate the best value for k
# (note: with estimate_k = TRUE, k is the maximum number of clusters to consider)
fit <- h2o.kmeans(training_frame = wine, k = 20, estimate_k = TRUE)
# It found 3 clusters to be optimal
print(fit)
# Model Details:
# ==============
#
# H2OClusteringModel: kmeans
# Model ID: KMeans_model_R_1503103871072_10
# Model Summary:
#   number_of_rows number_of_clusters number_of_categorical_columns
# 1            178                  3                             0
#   number_of_iterations within_cluster_sum_of_squares
# 1                   19                    1270.74912
#   total_sum_of_squares between_cluster_sum_of_squares
# 1           2301.00000                     1030.25088
#
#
# H2OClusteringMetrics: kmeans
# ** Reported on training data. **
#
#
# Total Within SS: 1270.749
# Between SS: 1030.251
# Total SS: 2301
# Centroid Statistics:
#   centroid     size within_cluster_sum_of_squares
# 1        1 51.00000                     326.35370
# 2        2 65.00000                     558.69710
# 3        3 62.00000                     385.69830
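
# A possible follow-up (a sketch, not part of the original demo): the values shown
# in the printed summary can also be pulled out of the model programmatically.
# Assumes `fit` and the H2O Frame `wine` from above, and a running H2O cluster.
centers <- h2o.centers(fit)        # one row per cluster centroid
k_found <- nrow(centers)           # number of clusters the model settled on
sizes   <- h2o.cluster_sizes(fit)  # rows assigned to each cluster (training data)
# Share of the total sum of squares explained by the clustering
explained <- h2o.betweenss(fit) / h2o.totss(fit)
# Assign each row to a cluster; the result is an H2O Frame with a `predict` column
clusters <- h2o.predict(fit, wine)
h2o.table(clusters$predict)        # cluster membership counts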