@piccolbo
piccolbo / rmr-vec-api-compromise.R
Created March 27, 2012 18:46
Sketch of rmr vector API, answer to the devil
#predicate, group, select and aggregate are user defined functions
#it is assumed a vectorized version is used when needed
#pass through
mapreduce(input,
map = function(k,v) keyval(k,v))
#vec version
#vectorized.map says how many records to process in one map, default 1
mapreduce(input,
map = function(k,v) keyval(k,v, vec = TRUE),
@piccolbo
piccolbo / rmr-vec-api-devil.R
Created March 22, 2012 20:01
Sketch of rmr future vector API, devil's version
#predicate, group and aggregate are user defined functions
#it is assumed a vectorized version is used when needed
#vectorized input format
native.1000 = make.input.format(nrecs = 1000)
#pass through
mapreduce(input,
map = function(k,v) keyval(k,v))
#vec version
@piccolbo
piccolbo / rmr-vec-api.R
Created March 21, 2012 04:30
Code sketched for rmr vector API
#predicate, group and aggregate are user defined functions
#it is assumed a vectorized version is used when needed
#pass through
mapreduce(input,
map = function(k,v) keyval(k,v))
#vec version
#nrecs says how many records to process in one map
mapreduce(input,
map = function(k,v) vec.keyval(k,v),
```{r, error=TRUE}
ff = function(){}
names(ff) = "abc"
# Error in names(ff) = "abc" : names() applied to a non-vector
is.vector(mtcars)
#[1] FALSE
names(mtcars) = LETTERS[1:11]
names(mtcars)
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
```
  • merge into master
  • update version #
  • update date
  • update Rd help()
  • push master
  • Repeat until tests pass
    • test local and debug
    • test remote and debug
    • test additional platforms
  • apply necessary fixes
@piccolbo
piccolbo / vecgroup.md
Last active August 29, 2015 14:04
Vectorized grouped ops in plyrmr

The goal is to expose the vectorized grouping feature of rmr2 in a plyrmr way.

What

  1. Operations should encapsulate the knowledge of whether they can handle multiple groups. vectorized.reduce should be set accordingly.
  2. vectorized.reduce should be propagated along a pipe when possible. Rules TBD
  3. A repertoire of vectorized reduce ops should be made available, and adding more should be easy (no C++)
  4. Wordcount is our guiding app here.
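As a rough illustration of what "handling multiple groups" means for wordcount, here is a minimal base R sketch. It does not use the rmr2 or plyrmr APIs; the `words` and `counts` vectors are made-up toy data, and `rowsum()` merely stands in for a reduce operation that can process many groups in one call.

```r
# Toy map output: one (word, 1) pair per occurrence (hypothetical data).
words  <- c("to", "be", "or", "not", "to", "be")
counts <- rep(1L, length(words))

# Non-vectorized reduce: sum() is called once per group.
per_group <- vapply(split(counts, words), sum, integer(1))

# Vectorized reduce: a single rowsum() call aggregates every group at once.
all_groups <- rowsum(counts, words)[, 1]

print(per_group)
print(all_groups)
```

Both forms should give the same counts; the second avoids one function call per key, which is the kind of saving a vectorized reduce is after.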

How

@piccolbo
piccolbo / gist:58a69cdc80fb8e4f6dc7
Last active August 29, 2015 14:03
Problems using R serialization to communicate with MR or Spark
  • Slow. Slow even at the C level, for small objects. Non-vectorized.
  • Serialized representation is sensitive to changes that should not affect key equality or grouping, such as the order of attributes, or even attributes like row names, which cannot be removed.
  • Serialized representation does not preserve the order of the represented items. This has been the source of some of the worst bugs in rmr, particularly one whereby groups were incorrectly split.
  • Some features that require the Java side to understand the field structure, such as joins, are lost. They can be re-implemented in R at the cost of speed, duplication of effort, inconsistency, etc. Having a nice type mapping between languages is almost always an advantage; the only problem is that the mapping is difficult. Mapping everything in R to bytes in Java is an admission of defeat.
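The attribute-order point can be sketched in base R. This is only an illustration under assumptions: `x` and `y` are made-up objects, and the comparison of raw vectors stands in for whatever byte-level key comparison the MR or Spark side would perform.

```r
# Two integer vectors with the same attributes, attached in a different order.
x <- structure(c(10L, 20L, 30L), a = 1, b = 2)
y <- structure(c(10L, 20L, 30L), b = 2, a = 1)

# identical() compares attributes as a set by default (attrib.as.set = TRUE).
same_value <- identical(x, y)

# serialize(obj, NULL) returns the serialized form as a raw vector.
same_bytes <- identical(serialize(x, NULL), serialize(y, NULL))

print(c(same_value = same_value, same_bytes = same_bytes))
```

If keys are grouped by their serialized bytes, two objects that R itself considers identical can end up in different groups.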