piccolbo / rmr-vec-api.R
Created March 21, 2012 04:30
Code sketched for rmr vector API
#predicate, group and aggregate are user defined functions
#it is assumed a vectorized version is used when needed
#pass through
mapreduce(input,
          map = function(k,v) keyval(k,v))
#vec version
#nrecs says how many records to process in one map
mapreduce(input,
          map = function(k,v) vec.keyval(k,v),
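To make the intent of the nrecs comment concrete, here is a plain-R illustration (ordinary R, not the rmr API) of why handing the map function a whole batch of records can pay off:

```r
# Plain R only, not rmr: contrast one function call per record
# with a single vectorized call over a whole batch.
v = runif(10^6)
system.time({per.record = vapply(v, function(x) x * 2, numeric(1))})  # one call per record
system.time({in.batch = v * 2})                                       # one call for the whole batch
identical(per.record, in.batch)
# [1] TRUE
```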
piccolbo / rmr-vec-api-devil.R
Created March 22, 2012 20:01
Sketch of rmr future vector API, devil's version
#predicate, group and aggregate are user defined functions
#it is assumed a vectorized version is used when needed
#vectorized input format
native.1000 = make.input.format(nrecs = 1000)
#pass through
mapreduce(input,
          map = function(k,v) keyval(k,v))
#vec version
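Going by the native.1000 definition and the comments above, the idea of the devil's version seems to be that the input format, rather than each mapreduce call, carries the batch size; a hypothetical sketch (against the API sketched here, not released rmr) of the vectorized pass-through would then leave the map body untouched:

```r
# Hypothetical sketch against the API sketched above, not the released rmr API:
# the input format decides how many records arrive per map call,
# so the pass-through map is written exactly like the non-vectorized one.
mapreduce(input,
          input.format = native.1000,
          map = function(k,v) keyval(k,v))
```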
piccolbo / rmr-vec-api-compromise.R
Created March 27, 2012 18:46
Sketch of rmr vector API, answer to the devil
#predicate, group, select and aggregate are user defined functions
#it is assumed a vectorized version is used when needed
#pass through
mapreduce(input,
          map = function(k,v) keyval(k,v))
#vec version
#vectorized.map says how many records to process in one map, default 1
mapreduce(input,
          map = function(k,v) keyval(k,v, vec = TRUE),
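As a purely hypothetical example of how one of the user-defined functions named in the comments, predicate, might fit this compromise API (vectorized.map and the vec argument come from the sketch, not from released rmr):

```r
# Hypothetical sketch only, based on the comments above:
# predicate() is assumed to be user-defined and vectorized,
# returning one logical per record.
mapreduce(input,
          map = function(k,v) keyval(k[predicate(v)], v[predicate(v)], vec = TRUE),
          vectorized.map = 1000)
```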
piccolbo / named-vectors-quadratic.R
Last active December 17, 2015 13:19
Assignment to R named arrays is quadratic (when extending at the same time)
name.me = c()
system.time({name.me[as.character(1:10^3)] = T})
# user system elapsed
# 0.004 0.000 0.004
system.time({name.me[as.character(1:10^4)] = T})
# user system elapsed
# 0.369 0.000 0.369
system.time({name.me[as.character(1:10^5)] = T})
# user system elapsed
# 48.187 0.055 48.235
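A sketch, not part of the gist, of one way around the quadratic behaviour: build the full set of names first and attach them with a single assignment, instead of growing the named vector one element at a time.

```r
# Sketch: bulk assignment avoids repeatedly extending and copying the names attribute.
system.time({
  nm = as.character(1:10^5)
  bulk.named = rep(TRUE, length(nm))
  names(bulk.named) = nm
})
# finishes in a small fraction of a second, versus ~48 seconds above
```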
piccolbo / gist:58a69cdc80fb8e4f6dc7
Last active August 29, 2015 14:03
Problems using R serialization to communicate with MR or Spark
  • Slow. Slow even at the C level, for small objects. Non-vectorized.
  • The serialized representation is sensitive to changes that should not affect key equality or grouping, such as the order of attributes, or even attributes like row names, which cannot be removed.
  • The serialized representation does not preserve the order of the represented items. This has been the source of some of the worst bugs in rmr, particularly one whereby groups were incorrectly split.
  • Some features that require the Java side to understand the field structure, such as joins, are lost. They can be re-implemented in R, at the cost of speed, duplication of effort, inconsistency, etc. Having a nice type mapping between languages is almost always an advantage; the only problem is that the mapping is difficult. Mapping everything in R to bytes in Java is an admission of defeat.
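A small illustration, not from the gist, of the attribute-sensitivity point: two data frames holding the same data can serialize to different bytes merely because of how row names are stored, so comparing serialized blobs is not a safe test of key equality or grouping.

```r
# Illustration of attribute sensitivity (not from the gist).
a = data.frame(x = 1:2)
b = a
rownames(b) = c("1", "2")   # same data; row names are now stored as character
identical(a$x, b$x)
# [1] TRUE
identical(serialize(a, NULL), serialize(b, NULL))
# [1] FALSE -- the bytes differ only because of the row.names attribute
```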
piccolbo / vecgroup.md
Last active August 29, 2015 14:04
Vectorized grouped ops in plyrmr

The goal is to expose the vectorized group feature of rmr2 in a plyrmr way.

What

  1. Operations should encapsulate the knowledge of whether they can handle multiple groups. vectorized.reduce should be set accordingly.
  2. vectorized.reduce should be propagated along a pipe when possible. Rules TBD.
  3. A repertoire of vectorized reduce ops should be made available, and adding more should be easy (no C++).
  4. Wordcount is our guiding app here (see the sketch after this list).
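The sketch below is plain R, not plyrmr code; it only illustrates what a vectorized reduce buys for wordcount: one call aggregates every group, rather than one reduce call per distinct key.

```r
# Plain R illustration, not plyrmr code.
words = c("a", "rose", "is", "a", "rose", "is", "a", "rose")
# non-vectorized flavour: one aggregation call per distinct word
sapply(split(words, words), length)
#    a   is rose
#    3    2    3
# vectorized flavour: a single call covering all the groups at once
rowsum(rep(1L, length(words)), group = words)
```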

How

  • merge into master
  • update version #
  • update date
  • update Rd help()
  • push master
  • Repeat until tests pass
    • test local and debug
    • test remote and debug
    • test additional platforms
  • apply necessary fixes
```{r}
# names<- refuses a function, because a closure is not a vector
ff = function(){}
names(ff) = "abc"
# Error in names(ff) = "abc" : names() applied to a non-vector
# yet a data frame, which is not a vector either by is.vector(), accepts names
is.vector(mtcars)
#[1] FALSE
names(mtcars) = LETTERS[1:11]
names(mtcars)
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
```
piccolbo / dplyr-backends.md
Last active June 23, 2018 03:58
Dplyr backends: the ultimate collection

Dplyr is a well-known R package for working with structured data, whether in memory, in a database or, more recently, on a cluster. The in-memory implementations generally have capabilities that are not found in the others, so the notion of backend is used with a bit of poetic license. Even the different database and cluster backends differ in subtle ways. But it sure is better than writing SQL directly! Here I provide a list of backends, with links to the packages that implement them when necessary. I've done my best to link to active projects, but I am not endorsing any of them: do your own testing. Enjoy, and please contribute any corrections or additions in the comments.
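A minimal sketch of the backend idea, assuming the dbplyr and RSQLite packages: the same verbs that run on an in-memory data frame are translated to SQL when the data lives in a database.

```r
library(dplyr)

# in memory
mtcars %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))

# the same verbs against a database backend (SQLite here, via dbplyr)
con = DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "mtcars")
tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%  # translated to SQL, runs in the database
  collect()                                         # bring the result back as a local data frame
DBI::dbDisconnect(con)
```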

| Backend | Package |
|---|---|
| data.frame | builtin |
| data.table | builtin |
| arrays | builtin |
| SQLite | builtin |
| PostgreSQL/Redshift | builtin |