- Slow, even at the C level, for small objects; non-vectorized.
- The serialized representation is sensitive to changes that should not affect key equality or grouping, such as the order of attributes, or attributes like row names that cannot be removed.
- The serialized representation does not preserve the order of represented items. This has been the source of some of the worst bugs in rmr, particularly one whereby groups were incorrectly split.
- Features that require the Java side to understand the field structure, such as joins, are lost. They can be re-implemented in R, at the cost of speed, duplication of effort, inconsistency, etc. A clean type mapping between languages is almost always an advantage; the only problem is that the mapping is difficult. Mapping everything in R to bytes in Java is an admission of defeat.
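The attribute-order problem is easy to demonstrate in plain R: two objects that compare as identical can still serialize to different byte streams when their attributes were set in a different order, which breaks any grouping keyed on the serialized bytes.

```{r}
# two objects with identical content, attributes set in a different order
a <- structure(list(x = 1), class = "foo", extra = 1)
b <- structure(list(x = 1), extra = 1, class = "foo")
identical(a, b)
# TRUE: identical() treats attributes as a set by default
identical(serialize(a, NULL), serialize(b, NULL))
# FALSE: the byte streams record the attributes in assignment order
```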
The goal is to expose the vectorized group feature of rmr2 in a plyrmr style.
- Operations should encapsulate the knowledge of whether they can handle multiple groups. vectorized.reduce should be set accordingly.
- vectorized.reduce should be propagated along a pipe when possible. The propagation rules are TBD.
- A repertoire of vectorized reduce ops should be made available, and adding more should be easy (no C++ required).
- Wordcount is our guiding app here.
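As a sketch of what one such vectorized reduce op could look like for wordcount (the name `vec.wordcount.reduce` and its keys/values interface are hypothetical, not existing plyrmr API), a single call can aggregate all groups at once instead of being invoked once per group:

```{r}
# hypothetical vectorized reduce for wordcount: receives the keys and values
# for many groups at once and aggregates them all in one vectorized call
vec.wordcount.reduce <- function(keys, vals) {
  counts <- tapply(vals, keys, sum)
  data.frame(word = names(counts), count = as.vector(counts),
             stringsAsFactors = FALSE)
}
vec.wordcount.reduce(c("a", "b", "a"), c(1, 1, 2))
#   word count
# 1    a     3
# 2    b     1
```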
- merge into master
- update version #
- update date
- update Rd help()
- push master
- Repeat until tests pass
- test local and debug
- test remote and debug
- test additional platforms
- apply necessary fixes
```{r}
ff = function(){}
names(ff) = "abc"
# Error in names(ff) = "abc" : names() applied to a non-vector
is.vector(mtcars)
# [1] FALSE
names(mtcars) = LETTERS[1:11]
names(mtcars)
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
```
```{r}
# predicate, group and aggregate are user-defined functions
# it is assumed a vectorized version is used when needed
# pass through
mapreduce(input,
          map = function(k, v) keyval(k, v))
# vec version
# nrecs says how many records to process in one map
mapreduce(input,
          map = function(k, v) vec.keyval(k, v),
          nrecs = 1000)
```
```{r}
# predicate, group and aggregate are user-defined functions
# it is assumed a vectorized version is used when needed
# vectorized input format
native.1000 = make.input.format(nrecs = 1000)
# pass through
mapreduce(input,
          map = function(k, v) keyval(k, v))
# vec version
```
```{r}
# predicate, group, select and aggregate are user-defined functions
# it is assumed a vectorized version is used when needed
# pass through
mapreduce(input,
          map = function(k, v) keyval(k, v))
# vec version
# vectorized.map says how many records to process in one map, default 1
mapreduce(input,
          map = function(k, v) keyval(k, v, vec = TRUE),
          vectorized.map = 1000)
```
```{r}
# predicate, group, select and aggregate are user-defined functions
# it is assumed a vectorized version is used when needed
# pass through
mapreduce(input,
          map = function(k, v) keyval(k, v))
# vec version
# vectorized.map says how many records to process in one map, default 1
mapreduce(input,
          map = function(k, v) keyval(k, v, vectorized = TRUE),
          vectorized.map = 1000)
```
```{r}
name.me = c()
system.time({name.me[as.character(1:10^3)] = T})
#   user  system elapsed
#  0.004   0.000   0.004
system.time({name.me[as.character(1:10^4)] = T})
#   user  system elapsed
#  0.369   0.000   0.369
system.time({name.me[as.character(1:10^5)] = T})
#   user  system elapsed
# 48.187   0.055  48.235
```
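The superlinear growth above comes from extending a named vector one batch of elements at a time, which copies the whole vector and its names on each assignment. A sketch of two common workarounds (not rmr code): build the names in one vectorized call, or use a hashed environment as a mutable map.

```{r}
# vectorized alternative: allocate the whole named vector in one call
name.me = rep(TRUE, 10^5)
names(name.me) = as.character(1:10^5)

# mutable-map alternative: a hashed environment has amortized O(1) inserts
e = new.env(hash = TRUE)
for (k in as.character(1:10^4)) assign(k, TRUE, envir = e)
```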
```sh
# on cluster: start the Spark thrift server
/spark/sbin/start-thriftserver.sh --master yarn-client
# ssh tunnel: forward unused local port 8157 to remote port 10000
ssh -i ~/caserta-1.pem -N -L 8157:ec2-54-221-27-21.compute-1.amazonaws.com:10000 hadoop@ec2-54-221-27-21.compute-1.amazonaws.com
# see this for JDBC config on the client:
# http://blogs.aws.amazon.com/bigdata/post/TxT7CJ0E7CRX88/Using-Amazon-EMR-with-SQL-Workbench-and-other-BI-Tools
```
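With the tunnel up, any JDBC client on the workstation can point at the local end of the tunnel; for example, assuming beeline is installed locally (the exact URL below is an assumption based on the 8157 port mapping above, not taken from the cluster config):

```shell
# connect a local beeline client through the tunnel's local end (port 8157)
beeline -u jdbc:hive2://localhost:8157
```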