Skip to content

Instantly share code, notes, and snippets.

@cjauvin
Last active April 18, 2017 22:20
Show Gist options
  • Save cjauvin/68314066fa674c6c2550 to your computer and use it in GitHub Desktop.
Save cjauvin/68314066fa674c6c2550 to your computer and use it in GitHub Desktop.
val pairs = sc.parallelize(List(("aa", 1), ("bb", 2),
("aa", 10), ("bb", 20),
("aa", 100), ("bb", 200)))
/* aggregateByKey takes an initial accumulator (here an empty list),
a first lambda function to merge a value to an accumulator, and a
second lambda function to merge two accumulators */
pairs.aggregateByKey(List[Any]())(
(aggr, value) => aggr ::: (value :: Nil),
(aggr1, aggr2) => aggr1 ::: aggr2
).collect().toMap
// scala.collection.immutable.Map[String,List[Any]] =
// Map(aa -> List(1, 10, 100), bb -> List(2, 20, 200))
/* combineByKey is even more general in that it adds an initial lambda
function to create the initial accumulator */
pairs.combineByKey(
(value) => List(value),
(aggr: List[Any], value) => aggr ::: (value :: Nil),
(aggr1: List[Any], aggr2: List[Any]) => aggr1 ::: aggr2
).collect().toMap
// scala.collection.immutable.Map[String,List[Any]] =
// Map(aa -> List(1, 10, 100), bb -> List(2, 20, 200))
@kindlychung
Copy link

I don't combineByKey is more general, it's more like a generalized version of reduce, while aggregateByKey is a generalized version of fold.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment