Skip to content

Instantly share code, notes, and snippets.

@tzachz
tzachz / CombineMaps.scala
Last active January 26, 2023 04:31
Apache Spark UserDefinedAggregateFunction combining maps
import org.apache.spark.SparkContext
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, Row, SQLContext}
/***
* UDAF combining maps, overriding any duplicate key with "latest" value
* @param keyType DataType of Map key
* @param valueType DataType of Value key
* @param merge function to merge values of identical keys
@MLnick
MLnick / HyperLogLogStoreUDAF.scala
Last active March 16, 2022 05:31
Experimenting with Spark SQL UDAF - HyperLogLog UDAF for distinct counts, that stores the actual HLL for each row to allow further aggregation
class HyperLogLogStoreUDAF extends UserDefinedAggregateFunction {
override def inputSchema = new StructType()
.add("stringInput", BinaryType)
override def update(buffer: MutableAggregationBuffer, input: Row) = {
// This input Row only has a single column storing the input value in String (or other Binary data).
// We only update the buffer when the input value is not null.
if (!input.isNullAt(0)) {
if (buffer.isNullAt(0)) {