sambos/spark_aggregate_functions.md

## spark_aggregate_functions.md

      
    Raw
  

              spark_aggregate_functions.md
            
          
    aggregateByKey

A general form of reduceByKey, that can be used when the return value is different type than input. Takes an initial value of accumulator.
     val initialList = scala.collection.mutable.ListBuffer[Row]()
     val addToList = (acc:scala.collection.mutable.ListBuffer[Row], x:Row) => x +: acc
     val mergePartitionLists = (acc1: scala.collection.mutable.ListBuffer[Row],acc2: scala.collection.mutable.ListBuffer[Row]) => acc1 ++ acc2
    
     val gbyKey = rdd.map(x => (x.getAs[String]("xtransId"), x)).aggregateByKey(initialList)(addToList,         mergePartitionLists).map(x => (x._1, x._2.toList))
                          
     val initialSet = scala.collection.mutable.HashSet.empty[Row]
     val addToSet = (acc: scala.collection.mutable.HashSet[Row], v: Row) => acc += v
     val mergePartitionsSets = (acc1: scala.collection.mutable.HashSet[Row], acc2: scala.collection.mutable.HashSet[Row]) => acc1 ++= acc2
    val gbyKey = rdd.map(x => (x._1, x._2)).aggregateByKey(initialSet)(addToSet, mergePartitionSets).map(x => (x._1, x._2.toList))
    
combineByKey

A more general form of reduceByKey that provides a function for creating an initial accumulator, can be used when the return value is different type than input. Example below shows immutable List but its recommended to use scala.collection.mutable collection
val gbyKey =  rdd.map(x => (x._1, x._2))
                  .combineByKey(
                          (x: Row) => List(x),              
                          (acc: List[Row], x) => x :: acc,  
                          (acc1: List[Row], acc2: List[Row]) => acc1 ::: acc2)