Skip to content

Instantly share code, notes, and snippets.

@johnynek
Last active Aug 29, 2015
Embed
What would you like to do?
How to do data cubing in typed scalding?

Suppose you have a key like (page, geo, day) and you want to make rollups/datacube so you can query for all pages, or all geos or all days.

Here is how you do it:

def opts[T](t: T): Seq[Option[T]] = Seq(Some(t), None)

val p: TypedPipe[(String, String, Int)] = ...

p.sumByLocalKeys
.flatMap { case ((page, geo, day), v) =>
  for {
    op <- opts(page)
    og <- opts(geo)
    od <- opts(day)
  } yield ((op, og, od), v)
}
.group
.sum

The v can be a tuple of integers, doubles, the value 1L (to count), HyperLogLogs, whatever your monoidal heart desires.

The sumByLocalKeys is optional, but it aggregates the values on the mappers before expanding the keys, which can be a big performance win (especially if done before).

TODO: it would be great to have a macro in scala 2.10 that does this for you. It has to be a macro (I think) since it is dynamic on the arity of the Tuple.

@danosipov
Copy link

danosipov commented Jul 29, 2014

Would it be possible to get all rollups just as easily (similar to how pig does it with ROLLUP)? Ex: page & day for all geos?

@johnynek
Copy link
Author

johnynek commented Jul 29, 2014

@danosipov yes, that is how this code works, the description is a bit incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment