@piccolbo
piccolbo / vecgroup.md
Last active August 29, 2015 14:04
Vectorized grouped ops in plyrmr

The goal is to expose the vectorized-group feature of rmr2 in a plyrmr way.

What

  1. Each operation should encapsulate the knowledge of whether it can handle multiple groups, and vectorized.reduce should be set accordingly.
  2. vectorized.reduce should be propagated along a pipe when possible. Rules TBD.
  3. A repertoire of vectorized reduce ops should be made available, and adding more should be easy (no C++).
  4. Wordcount is our guiding app here.
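
To make the distinction in points 1 and 3 concrete, here is a minimal base-R sketch of a word-count reduce written two ways: once per group, and once over many groups in a single call. It does not call rmr2 or plyrmr; the function names `vec.sum.reduce` and `per.group.reduce` and the data frame output shape are assumptions made for illustration only.

```r
# Hypothetical "vectorized" reduce: receives keys and values for many
# groups at once and returns one row per distinct key.
vec.sum.reduce = function(keys, values) {
  sums = rowsum(values, group = keys, reorder = TRUE)  # one row per key
  data.frame(key = rownames(sums), value = as.vector(sums),
             stringsAsFactors = FALSE)
}

# Non-vectorized baseline: an ordinary reduce (sum) applied once per group.
per.group.reduce = function(keys, values) {
  groups = split(values, keys)
  data.frame(key = names(groups),
             value = vapply(groups, sum, numeric(1), USE.NAMES = FALSE),
             stringsAsFactors = FALSE)
}

# Word count: every word is a key with a count of 1.
words = c("a", "b", "a", "c", "b", "a")
counts = rep(1, length(words))
wc.vec = vec.sum.reduce(words, counts)
wc.loop = per.group.reduce(words, counts)
```

Both versions should produce the same counts; the vectorized one avoids an R-level function call per group, which is where the cost sits when groups are many and small.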

How

piccolbo / gist:58a69cdc80fb8e4f6dc7
Last active August 29, 2015 14:03
Problems using R serialization to communicate with MR or Spark
  • Slow. Slow even at the C level, for small objects. Non-vectorized.
  • Serialized representation is sensitive to changes that should not affect key equality or grouping, such as the order of attributes, or even attributes like row names, which cannot be removed.
  • Serialized representation does not preserve order of represented items. This has been the source of some of the worst bugs in rmr, particularly one whereby groups were incorrectly split.
  • Some features that require the Java side to understand the field structure, such as joins, are lost. They can be re-implemented in R at the cost of speed, duplication of effort, inconsistency, etc. Having a nice type mapping between languages is almost always an advantage; the only problem is that mapping is difficult. Mapping everything in R to bytes in Java is an admission of defeat.
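
The attribute-order point can be checked with a small base-R sketch. The vectors `x` and `y` and the attribute names `a` and `b` are made up for illustration; the expectation is that the two objects compare equal as R values while their serialized bytes differ, which is what breaks byte-level key comparison.

```r
# Two vectors with the same attributes, attached in a different order.
x = c(1, 2, 3)
attr(x, "a") = 1
attr(x, "b") = 2

y = c(1, 2, 3)
attr(y, "b") = 2
attr(y, "a") = 1

# identical() treats attributes as a set by default, so order is ignored.
same.value = identical(x, y)

# serialize() writes attributes in stored order, so the raw bytes
# are expected to differ even though the values compare equal.
same.bytes = identical(serialize(x, NULL), serialize(y, NULL))
```

If keys are grouped by comparing serialized bytes, `x` and `y` would land in different groups whenever `same.value` is TRUE and `same.bytes` is FALSE.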
piccolbo / named-vectors-quadratic.R
Last active December 17, 2015 13:19
Assignment to R named arrays is quadratic (when extending at the same time)
# Start from an empty vector; each assignment below extends it by name.
name.me = c()
system.time({name.me[as.character(1:10^3)] = TRUE})
# user system elapsed
# 0.004 0.000 0.004
system.time({name.me[as.character(1:10^4)] = TRUE})
# user system elapsed
# 0.369 0.000 0.369
system.time({name.me[as.character(1:10^5)] = TRUE})
# user system elapsed
# 48.187 0.055 48.235
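
The timings above grow roughly a hundredfold for each tenfold increase in size. One way to sidestep the cost, sketched here under the assumption that all names are known up front, is to allocate the values once and attach the names in a single step instead of extending by name. The variable names `n` and `named` are illustrative.

```r
# Build the full named vector in one shot: allocate values, then set names.
n = 10^5
named = rep(TRUE, n)
names(named) = as.character(seq_len(n))

# Lookup by name works the same way as with the incrementally built vector.
first.three = named[c("1", "2", "3")]
```

This avoids extending the vector while assigning, which is the condition the gist title singles out.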
#predicate, group, select and aggregate are user defined functions
#it is assumed a vectorized version is used when needed
#pass through
mapreduce(input,
map = function(k,v) keyval(k,v))
#vec version
#vectorized.map says how many records to process in one map, default 1
mapreduce(input,
map = function(k,v) keyval(k,v, vectorized = TRUE),
piccolbo / rmr-vec-api-compromise.R
Created March 27, 2012 18:46
Sketch of rmr vector API, answer to the devil
#predicate, group, select and aggregate are user defined functions
#it is assumed a vectorized version is used when needed
#pass through
mapreduce(input,
map = function(k,v) keyval(k,v))
#vec version
#vectorized.map says how many records to process in one map, default 1
mapreduce(input,
map = function(k,v) keyval(k,v, vec = TRUE),
piccolbo / rmr-vec-api-devil.R
Created March 22, 2012 20:01
Sketch of rmr future vector API, devil's version
#predicate, group and aggregate are user defined functions
#it is assumed a vectorized version is used when needed
#vectorized input format
native.1000 = make.input.format(nrecs = 1000)
#pass through
mapreduce(input,
map = function(k,v) keyval(k,v))
#vec version
piccolbo / rmr-vec-api.R
Created March 21, 2012 04:30
Code sketched for rmr vector API
#predicate, group and aggregate are user defined functions
#it is assumed a vectorized version is used when needed
#pass through
mapreduce(input,
map = function(k,v) keyval(k,v))
#vec version
#nrecs says how many records to process in one map
mapreduce(input,
map = function(k,v) vec.keyval(k,v),