@piccolbo
Last active August 29, 2015 14:03
Problems using R serialization to communicate with MR or Spark
  • Slow. Slow even at the C level, for small objects. Non-vectorized.
  • The serialized representation is sensitive to changes that should not affect key equality or grouping, such as the order of attributes, or even attributes like row names, which cannot be removed.
  • The serialized representation does not preserve the order of the represented items. This has been the source of some of the worst bugs in rmr, particularly one in which groups were incorrectly split.
  • Some features that require the Java side to understand the field structure, such as joins, are lost. They can be re-implemented in R at the cost of speed, duplication of effort, inconsistency etc. Having a nice type mapping between languages is almost always an advantage; the only problem is that the mapping is difficult. Mapping everything in R to bytes in Java is an admission of defeat.
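
The sensitivity to attribute order and row names can be reproduced in a few lines of base R. This is only a minimal sketch, not code from rmr: the attribute names (`unit`, `source`), the column name `id` and the variable names are made up for illustration. It assumes that keys are compared by their serialized bytes, as would happen when the Java side sees only opaque byte arrays.

```r
# Two numeric vectors carrying the same attributes, attached in a different order.
# (The attribute names are hypothetical, chosen only for this example.)
a <- c(1, 2, 3)
attr(a, "unit") <- "cm"
attr(a, "source") <- "sensor"

b <- c(1, 2, 3)
attr(b, "source") <- "sensor"
attr(b, "unit") <- "cm"

# identical() treats attributes as an unordered set by default,
# so R itself considers the two values equal.
same_value <- identical(a, b)

# serialize() writes the attributes in the order they are stored,
# so the raw byte vectors differ.
same_bytes <- identical(serialize(a, NULL), serialize(b, NULL))

# Row names have a similar effect on data frames used as keys:
# the key column is unchanged, but the serialized bytes are not.
k1 <- data.frame(id = c(1, 2))
k2 <- k1
rownames(k2) <- c("x", "y")
same_key_column <- identical(k1$id, k2$id)
same_key_bytes <- identical(serialize(k1, NULL), serialize(k2, NULL))

print(c(same_value = same_value, same_bytes = same_bytes))
print(c(same_key_column = same_key_column, same_key_bytes = same_key_bytes))
```

Under that assumption, a framework that groups records by comparing serialized keys would put `a` and `b`, or `k1` and `k2`, into different groups even though their key content matches.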