Vectorized grouped ops in plyrmr

The goal is to expose rmr2's vectorized reduce feature, which lets a single reduce call process many groups at once, in a plyrmr way.
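
Below is an illustrative, untested sketch of the rmr2 feature this refers to: with vectorized.reduce = TRUE, the reduce function receives several complete groups per call and has to aggregate within each group itself. The toy data and the use of the local backend are assumptions made for illustration only.

```r
library(rmr2)
rmr.options(backend = "local")                 # run locally for illustration

# a handful of (word, 1) pairs spanning three groups
words <- to.dfs(keyval(c("a", "b", "a", "c", "a"), rep(1L, 5)))

counts <-
  mapreduce(
    input  = words,
    map    = function(k, v) keyval(k, v),      # pass key-value pairs through
    reduce = function(k, v) {
      # k and v span several groups at once: aggregate per key in one call
      sums <- tapply(v, k, sum)
      keyval(names(sums), as.integer(sums))
    },
    vectorized.reduce = TRUE)

from.dfs(counts)
```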

What

  1. Operations should encapsulate the knowledge of whether they can handle multiple groups, and vectorized.reduce should be set accordingly.
  2. vectorized.reduce should be propagated along a pipe when possible; the exact rules are TBD.
  3. A repertoire of vectorized reduce ops should be made available, and adding more should be easy (no C++ required).
  4. Wordcount is our guiding app here; a sketch of it in plyrmr terms follows this list.
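
A hedged reconstruction of the guiding wordcount using the plyrmr verbs mentioned in this note (transmute, group). Running on a plain data frame, the column name `line`, and pulling results with as.data.frame are assumptions for illustration, not the package's documented example.

```r
library(plyrmr)

# a tiny corpus: one text line per row (column name `line` is assumed)
docs <- data.frame(line = c("a b a", "c a b"), stringsAsFactors = FALSE)

words  <- transmute(input(docs), word = unlist(strsplit(line, " ")))  # split lines into words
counts <- transmute(group(words, word), count = length(word))         # count per word group

as.data.frame(counts)   # pull the result back into memory
```

The final transmute is the step a vectorized reduce would speed up: with vectorized.reduce set, one reduce call could count many word groups at once.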

How

  1. Leverage dplyr and its handler system for fast aggregation.
  2. group_by the incoming data frames before passing them to the aggregator (see the dplyr sketch after this list).
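
A minimal dplyr sketch of this approach (illustrative, not plyrmr internals): the data frame handed to a vectorized reduce contains rows from many groups, and a single group_by plus summarize call aggregates all of them on dplyr's fast path.

```r
library(dplyr)

# what one vectorized reduce call might receive: rows from several groups
chunk <- data.frame(word = c("a", "b", "a", "c", "a", "b"),
                    stringsAsFactors = FALSE)

chunk %>%
  group_by(word) %>%        # one grouped table covering every key in the chunk
  summarize(count = n())    # all groups aggregated in a single call
```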

Problems

  1. Then we can only use dplyr as the aggregator. Are there alternatives?
  2. How do we do this for ops that don't have a dplyr equivalent, such as bind.cols, transmute, etc.? We can try to simulate them with dplyr operations, but there is no equivalent for transmute. We can try to simulate it with do, but that is difficult and slow (5k rows/s). We could introduce summarize, and so we did (see the do/summarize sketch after this list).
  3. How do we do this in SparkR? Maybe with a lapplyPartition after a groupByKey?
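
A small dplyr comparison behind problem 2 (illustrative; the 5k rows/s figure comes from the note above, not from this snippet): do() can emulate an arbitrary per-group transmute but calls back into R once per group, while summarize() covers the common aggregations on dplyr's fast path.

```r
library(dplyr)

df <- data.frame(word = c("a", "b", "a"), stringsAsFactors = FALSE)

# general but slow: do() runs one R callback per group
df %>% group_by(word) %>% do(data.frame(count = nrow(.)))

# restricted but fast: summarize() handles simple aggregations directly
df %>% group_by(word) %>% summarize(count = n())
```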