Problems with Pig / Ideas for PigRomance
- Tuples take up too much memory
The default tuple class takes up a minimum of 96 bytes even for an empty tuple, and stores fields as boxed Integer, Float, etc. objects instead of primitives. I think you had done some work on code gen for efficient tuple classes, but it didn't seem to be fully integrated into all parts of the pipeline. If you could get this working for PigRomance, it could avoid a lot of spills and make local mode in particular much faster.
One place where it might be worth spending extra memory, though, would be appending an int/long field to every schema to store a precomputed hashcode (taking advantage of immutability: the hash can be computed once at construction and never has to be recomputed).
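To make the two ideas above concrete, here is a rough sketch of what a generated tuple class for an (int, long) schema might look like. It is hand-written for illustration only: the class name IntLongTuple and its accessors are made up, it is not what Pig's code gen actually emits, and it does not implement Pig's Tuple interface. The fields are primitives rather than boxed objects, and one extra int holds the hashcode, computed once in the constructor since the tuple is immutable.

```java
// Hypothetical stand-in for a code-generated tuple class for the schema
// (int, long). Names are illustrative; this is not part of Pig's API.
final class IntLongTuple {
    // Primitive fields: no per-field Integer/Long object headers.
    private final int f0;
    private final long f1;
    // Extra field holding the precomputed hashcode. Safe to cache because
    // the tuple is immutable, so the hash can never go stale.
    private final int hash;

    IntLongTuple(int f0, long f1) {
        this.f0 = f0;
        this.f1 = f1;
        this.hash = 31 * Integer.hashCode(f0) + Long.hashCode(f1);
    }

    int getInt0() { return f0; }

    long getLong1() { return f1; }

    @Override
    public int hashCode() {
        // Constant-time lookup instead of rehashing every field.
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof IntLongTuple)) return false;
        IntLongTuple t = (IntLongTuple) o;
        // Comparing the cached hashes first rejects most unequal tuples cheaply.
        return hash == t.hash && f0 == t.f0 && f1 == t.f1;
    }
}
```

The cached hash costs 4 bytes per tuple, but it pays off wherever the same tuple is hashed repeatedly, for example as a key in hash-based grouping or joins, and it also gives equals a cheap early exit.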
- Even if everything can fit into memory, MapReduce spills to disk after each job