- Slow, even at the C level, for small objects; non-vectorized.
- The serialized representation is sensitive to changes that should not affect key equality or grouping, such as the order of attributes, or attributes like row names that cannot be removed.
- The serialized representation does not preserve the order of represented items. This has been the source of some of the worst bugs in rmr, particularly one whereby groups were incorrectly split.
- Features that require the Java side to understand the field structure, such as joins, are lost. They can be re-implemented in R, at the cost of speed, duplication of effort, inconsistency, etc. A clean type mapping between languages is almost always an advantage; the only problem is that the mapping is difficult. Mapping everything in R to bytes in Java is an admission of defeat.
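The attribute-order problem is easy to demonstrate in plain R: two objects that compare as identical can still serialize to different byte streams when their attributes were set in a different order, which breaks any grouping keyed on the serialized bytes.

```{r}
# two objects with identical content, attributes set in a different order
a <- structure(list(x = 1), class = "foo", extra = 1)
b <- structure(list(x = 1), extra = 1, class = "foo")
identical(a, b)
# TRUE: identical() treats attributes as a set by default
identical(serialize(a, NULL), serialize(b, NULL))
# FALSE: the byte streams record the attributes in assignment order
```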
The goal is to expose the vectorized group feature of rmr2 in a plyrmr style.
- Operations should encapsulate the knowledge of whether they can handle multiple groups. vectorized.reduce should be set accordingly.
- vectorized.reduce should be propagated along a pipe when possible. The propagation rules are TBD.
- A repertoire of vectorized reduce ops should be made available, and adding more should be easy (no C++ required).
- Wordcount is our guiding app here.
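As a sketch of what one such vectorized reduce op could look like for wordcount (the name `vec.wordcount.reduce` and its keys/values interface are hypothetical, not existing plyrmr API), a single call can aggregate all groups at once instead of being invoked once per group:

```{r}
# hypothetical vectorized reduce for wordcount: receives the keys and values
# for many groups at once and aggregates them all in one vectorized call
vec.wordcount.reduce <- function(keys, vals) {
  counts <- tapply(vals, keys, sum)
  data.frame(word = names(counts), count = as.vector(counts),
             stringsAsFactors = FALSE)
}
vec.wordcount.reduce(c("a", "b", "a"), c(1, 1, 2))
#   word count
# 1    a     3
# 2    b     1
```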
- merge into master
- update version #
- update date
- update Rd help()
- push master
- Repeat until tests pass
- test local and debug
- test remote and debug
- test additional platforms
- apply necessary fixes
```{r}
ff = function(){}
names(ff) = "abc"
# Error in names(ff) = "abc" : names() applied to a non-vector
is.vector(mtcars)
# [1] FALSE
names(mtcars) = LETTERS[1:11]
names(mtcars)
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"
```
```{r}
# predicate, group and aggregate are user-defined functions
# it is assumed a vectorized version is used when needed
# pass through
mapreduce(input,
          map = function(k, v) keyval(k, v))
# vec version
# nrecs says how many records to process in one map
mapreduce(input,
          map = function(k, v) vec.keyval(k, v),
          nrecs = 1000)
```
```{r}
# predicate, group and aggregate are user-defined functions
# it is assumed a vectorized version is used when needed
# vectorized input format
native.1000 = make.input.format(nrecs = 1000)
# pass through
mapreduce(input,
          map = function(k, v) keyval(k, v))
# vec version
```
```{r}
# predicate, group, select and aggregate are user-defined functions
# it is assumed a vectorized version is used when needed
# pass through
mapreduce(input,
          map = function(k, v) keyval(k, v))
# vec version
# vectorized.map says how many records to process in one map, default 1
mapreduce(input,
          map = function(k, v) keyval(k, v, vec = TRUE),
          vectorized.map = 1000)
```
```{r}
# predicate, group, select and aggregate are user-defined functions
# it is assumed a vectorized version is used when needed
# pass through
mapreduce(input,
          map = function(k, v) keyval(k, v))
# vec version
# vectorized.map says how many records to process in one map, default 1
mapreduce(input,
          map = function(k, v) keyval(k, v, vectorized = TRUE),
          vectorized.map = 1000)
```
```{r}
name.me = c()
system.time({name.me[as.character(1:10^3)] = T})
#   user  system elapsed
#  0.004   0.000   0.004
system.time({name.me[as.character(1:10^4)] = T})
#   user  system elapsed
#  0.369   0.000   0.369
system.time({name.me[as.character(1:10^5)] = T})
#   user  system elapsed
# 48.187   0.055  48.235
```
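The superlinear growth above comes from extending a named vector one batch of elements at a time, which copies the whole vector and its names on each assignment. A sketch of two common workarounds (not rmr code): build the names in one vectorized call, or use a hashed environment as a mutable map.

```{r}
# vectorized alternative: allocate the whole named vector in one call
name.me = rep(TRUE, 10^5)
names(name.me) = as.character(1:10^5)

# mutable-map alternative: a hashed environment has amortized O(1) inserts
e = new.env(hash = TRUE)
for (k in as.character(1:10^4)) assign(k, TRUE, envir = e)
```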
```sh
# on cluster: start the Spark thrift server
/spark/sbin/start-thriftserver.sh --master yarn-client
# ssh tunnel: forward unused local port 8157 to remote port 10000
ssh -i ~/caserta-1.pem -N -L 8157:ec2-54-221-27-21.compute-1.amazonaws.com:10000 hadoop@ec2-54-221-27-21.compute-1.amazonaws.com
# see this for JDBC config on the client:
# http://blogs.aws.amazon.com/bigdata/post/TxT7CJ0E7CRX88/Using-Amazon-EMR-with-SQL-Workbench-and-other-BI-Tools
```
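With the tunnel up, any JDBC client on the workstation can point at the local end of the tunnel; for example, assuming beeline is installed locally (the exact URL below is an assumption based on the 8157 port mapping above, not taken from the cluster config):

```shell
# connect a local beeline client through the tunnel's local end (port 8157)
beeline -u jdbc:hive2://localhost:8157
```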