Vectorized grouped ops in plyrmr

The goal is to expose rmr2's vectorized reduce feature, which lets a single reduce call process many groups at once, in a plyrmr way.
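
Below is an illustrative, untested sketch of the rmr2 feature this refers to: with vectorized.reduce = TRUE, the reduce function receives several complete groups per call and has to aggregate within each group itself. The toy data and the use of the local backend are assumptions made for illustration only.

```r
library(rmr2)
rmr.options(backend = "local")                 # run locally for illustration

# a handful of (word, 1) pairs spanning three groups
words <- to.dfs(keyval(c("a", "b", "a", "c", "a"), rep(1L, 5)))

counts <-
  mapreduce(
    input  = words,
    map    = function(k, v) keyval(k, v),      # pass key-value pairs through
    reduce = function(k, v) {
      # k and v span several groups at once: aggregate per key in one call
      sums <- tapply(v, k, sum)
      keyval(names(sums), as.integer(sums))
    },
    vectorized.reduce = TRUE)

from.dfs(counts)
```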

What

  1. Operations should encapsulate the knowledge of whether they can handle multiple groups, and vectorized.reduce should be set accordingly.
  2. vectorized.reduce should be propagated along a pipe when possible; the exact rules are TBD.
  3. A repertoire of vectorized reduce ops should be made available, and adding more should be easy (no C++ required).
  4. Wordcount is our guiding app here; a sketch of it in plyrmr terms follows this list.
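
A hedged reconstruction of the guiding wordcount using the plyrmr verbs mentioned in this note (transmute, group). Running on a plain data frame, the column name `line`, and pulling results with as.data.frame are assumptions for illustration, not the package's documented example.

```r
library(plyrmr)

# a tiny corpus: one text line per row (column name `line` is assumed)
docs <- data.frame(line = c("a b a", "c a b"), stringsAsFactors = FALSE)

words  <- transmute(input(docs), word = unlist(strsplit(line, " ")))  # split lines into words
counts <- transmute(group(words, word), count = length(word))         # count per word group

as.data.frame(counts)   # pull the result back into memory
```

The final transmute is the step a vectorized reduce would speed up: with vectorized.reduce set, one reduce call could count many word groups at once.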

How

  1. Leverage dplyr and its handler system for fast aggregation.
  2. group_by the incoming data frames before passing them to the aggregator (see the dplyr sketch after this list).
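
A minimal dplyr sketch of this approach (illustrative, not plyrmr internals): the data frame handed to a vectorized reduce contains rows from many groups, and a single group_by plus summarize call aggregates all of them on dplyr's fast path.

```r
library(dplyr)

# what one vectorized reduce call might receive: rows from several groups
chunk <- data.frame(word = c("a", "b", "a", "c", "a", "b"),
                    stringsAsFactors = FALSE)

chunk %>%
  group_by(word) %>%        # one grouped table covering every key in the chunk
  summarize(count = n())    # all groups aggregated in a single call
```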

Problems

  1. Then we can only use dplyr as the aggregator. Are there alternatives?
  2. How do we do this for ops that don't have a dplyr equivalent, such as bind.cols, transmute, etc.? We can try to simulate them with dplyr operations, but there is no equivalent for transmute. We can try to simulate it with do, but that is difficult and slow (5k rows/s). We could introduce summarize, and so we did (see the do/summarize sketch after this list).
  3. How do we do this in SparkR? Maybe with a lapplyPartition after a groupByKey?
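
A small dplyr comparison behind problem 2 (illustrative; the 5k rows/s figure comes from the note above, not from this snippet): do() can emulate an arbitrary per-group transmute but calls back into R once per group, while summarize() covers the common aggregations on dplyr's fast path.

```r
library(dplyr)

df <- data.frame(word = c("a", "b", "a"), stringsAsFactors = FALSE)

# general but slow: do() runs one R callback per group
df %>% group_by(word) %>% do(data.frame(count = nrow(.)))

# restricted but fast: summarize() handles simple aggregations directly
df %>% group_by(word) %>% summarize(count = n())
```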