Problems with Pig / Ideas for PigRomance
- Tuples take up too much memory
The default tuple class takes up a minimum of 96 bytes even for an empty tuple, and uses inefficient Integer, Float, etc. objects instead of primitives. I think you'd done some work on code generation for efficient tuple classes, but it didn't seem to be fully integrated into all parts of the pipeline. If you could get this working for PigRomance, it could avoid a lot of spills and make local mode in particular much faster.
One case where it might be worth spending extra memory, though: append an int/long field to every schema that stores a precomputed hashcode (taking advantage of tuple immutability).
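To make the idea concrete, here's a sketch of what a code-generated, schema-specific tuple might look like, combining primitive fields with the precomputed hashcode. The class and accessor names are invented for illustration; this is not an actual Pig API.

```java
// Hypothetical generated class for schema (id:int, score:double).
// Illustrative only -- names and layout are assumptions, not Pig code.
public final class IdScoreTuple {
    private final int id;        // primitive int, not a boxed Integer
    private final double score;  // primitive double, not a boxed Double
    private final int hash;      // computed once; safe because all fields are final

    public IdScoreTuple(int id, double score) {
        this.id = id;
        this.score = score;
        this.hash = 31 * Integer.hashCode(id) + Double.hashCode(score);
    }

    public int getId() { return id; }
    public double getScore() { return score; }

    @Override
    public int hashCode() { return hash; } // O(1), never recomputed during a shuffle

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof IdScoreTuple)) return false;
        IdScoreTuple t = (IdScoreTuple) o;
        return id == t.id && score == t.score;
    }
}
```

Two ints and a double here fit in roughly 32 bytes of object, versus ~96+ for the generic tuple with boxed fields.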
- Even if everything can fit into memory, MapReduce spills to disk after each job
I think some people are working on porting Pig to Tez, which should avoid this. I'd much rather see a Pig that compiles to Tez or Spark than to vanilla MR.
- The Pig unit tests are slow, redundant, confusing, and leave files lying around
They take literally hours to run. Ugh.
- It'd be nice to have optional fast local mode that's not tied to MR/Tez/whatever
I've been using Pig a lot for code that I'll never need to run on a cluster, because I find it easier than working with libraries like pandas in Python. It'd be nice if there were a lightweight local mode (in addition to the faux-distributed local mode) that doesn't bother simulating the distributed system and just runs as fast as possible on one machine. It'd also be a good test of how well the physical layer is abstracted: if a fast local mode is non-trivial to implement, that's a sign the abstraction is leaky.
- Sparse aggregations are very slow
By "sparse aggregation", I mean group-by where most of the groups have only one record, but a few have more than one, and some aggregation needs to be done on those. A "sparse group-by" operator that acts more like distinct would be cool to have.
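One way such an operator could avoid the overhead: only allocate a bag for a key once a second record actually shows up. A rough sketch (plain Java, not Pig internals; the class and types are invented for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of a "sparse group-by": singleton groups (the common
// case) store the bare record with no bag allocation; a bag is created only
// when a key sees its second record.
public final class SparseGroupBy {
    @SuppressWarnings("unchecked")
    public static Map<String, Object> group(Iterable<String[]> records) {
        Map<String, Object> groups = new HashMap<>();
        for (String[] rec : records) {
            String key = rec[0];
            Object existing = groups.get(key);
            if (existing == null) {
                groups.put(key, rec);               // common case: no bag
            } else if (existing instanceof List) {
                ((List<String[]>) existing).add(rec);
            } else {
                List<String[]> bag = new ArrayList<>();
                bag.add((String[]) existing);       // promote singleton to a bag
                bag.add(rec);
                groups.put(key, bag);
            }
        }
        return groups;
    }
}
```

Downstream, the aggregation step would treat a bare record as a trivially aggregated group and only run the real aggregate over the (rare) bags.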
- It's hard to use the distributed cache (and impossible to test in local mode)
I'd like syntax where you could say "cache my_relation using my.func()" where my.func() takes the relation as a DataBag and returns an Object to be cached. Then it'd be made accessible to UDFs somehow.
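The "made accessible somehow" part could be as simple as a registry keyed by the relation's alias. Everything below is hypothetical, none of it exists in Pig; it's just one way the runtime could hand the cached Object to UDFs:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed "cache my_relation using my.func()"
// feature. CacheRegistry and the lookup-by-alias scheme are invented here
// for illustration; they are not part of Pig.
public final class CacheRegistry {
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    // The runtime would call this once per task, after my.func() has turned
    // the relation (as a DataBag) into an Object.
    public static void put(String alias, Object value) {
        CACHE.put(alias, value);
    }

    // A UDF would call this from exec() to retrieve the cached object.
    public static Object get(String alias) {
        return CACHE.get(alias);
    }
}
```

A side benefit of routing it through an in-process registry like this is that it would work identically in local mode, which would fix the testability problem.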