Skip to content

Instantly share code, notes, and snippets.

@johnynek
Last active August 29, 2015 14:04
Show Gist options
  • Star 4 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save johnynek/dfa319c55934b8a38524 to your computer and use it in GitHub Desktop.
Save johnynek/dfa319c55934b8a38524 to your computer and use it in GitHub Desktop.
How to do data cubing in typed scalding?

Suppose you have a key like (page, geo, day) and you want to make rollups/datacube so you can query for all pages, or all geos or all days.

Here is how you do it:

def opts[T](t: T): Seq[Option[T]] = Seq(Some(t), None)

val p: TypedPipe[(String, String, Int)] = ...

p.sumByLocalKeys
.flatMap { case ((page, geo, day), v) =>
  for {
    op <- opts(page)
    og <- opts(geo)
    od <- opts(day)
  } yield ((op, og, od), v)
}
.group
.sum

The v can be a tuple of integers, doubles, the value 1L (to count), HyperLogLogs, whatever your monoidal heart desires.

The sumByLocalKeys is optional, but it aggregates the values on the mappers before expanding the keys, which can be a big performance win (especially if done before).

TODO: it would be great to have a macro in scala 2.10 that does this for you. It has to be a macro (I think) since it is dynamic on the arity of the Tuple.

@danosipov
Copy link

Would it be possible to get all rollups just as easily (similar to how pig does it with ROLLUP)? Ex: page & day for all geos?

@johnynek
Copy link
Author

@danosipov yes, that is how this code works, the description is a bit incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment