@pavlov99
Created February 22, 2017 02:53
Disjoint (deduplicated) object counts with Apache Spark.
// This method uses a Window function to eliminate double counting of objects
// that belong to multiple groups.
// `groups` is a Dataset with two columns: `id` and `group`. The first column
// identifies the object, the second is a group name.
// One use of this method is customer segmentation.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{countDistinct, dense_rank}
import spark.implicits._  // enables the $"..." column syntax

// Rank each object's groups alphabetically and keep only the first,
// so every object is assigned to exactly one group.
val disjointGroups = groups
  .withColumn("_rank", dense_rank().over(Window.partitionBy("id").orderBy("group")))
  .filter($"_rank" === 1)
  .drop("_rank")
// Show the disjoint groups with their distinct object counts.
disjointGroups
  .groupBy("group")
  .agg(countDistinct("id") as "number_records")
  .orderBy("group")
  .show()
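The same deduplication logic can be sketched on plain Scala collections, without a Spark cluster, to show what the window function computes. This is a minimal illustration with hypothetical sample data (objects "a" and "b", where "a" belongs to two groups); the names `Row`, `disjoint`, and `counts` are made up for the example.

```scala
case class Row(id: String, group: String)

// Hypothetical input: object "a" belongs to two groups, "b" to one.
val groups = Seq(
  Row("a", "premium"),
  Row("a", "trial"),
  Row("b", "trial")
)

// Keep the alphabetically first group per id, mirroring
// dense_rank().over(Window.partitionBy("id").orderBy("group")) === 1.
val disjoint: Seq[Row] = groups
  .groupBy(_.id)
  .map { case (_, rows) => rows.minBy(_.group) }
  .toSeq

// Count distinct ids per group, mirroring groupBy("group") + countDistinct("id").
val counts: Map[String, Int] = disjoint
  .groupBy(_.group)
  .map { case (g, rows) => g -> rows.map(_.id).distinct.size }

// "a" is counted only once, under "premium"; without the window step,
// summing per-group counts would count it twice.
println(counts)  // Map(premium -> 1, trial -> 1)
```

Note that `dense_rank` keeps every row tied for the first group; `countDistinct("id")` in the aggregation absorbs such duplicates, so the final counts are still correct.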