Shuffler IR Design

UUID is probably a string

ShuffleStart(
  keyFields Array<String>,
  rowType (virtual) Struct,
  codecAndEType TypedCodecSpec
): id UUID

ShuffleWrite(
  id UUID,
  partitionId PInt64,
  attemptId PInt64,
  rows Stream<Struct>
): Unit

ShuffleWritingFinished(
  id UUID,
  outPartitions PInt64
): Array<rowType.select(keyFields)>

ShuffleRead(
  id UUID,
  partitionId PInt64
): Stream<Struct>

ShuffleDelete(
  id UUID
): Unit

How does the IR manage resources? Should Shuffle be a new type that performs a network call when it is freed? My current thinking is that ShuffleDelete can be inserted by the TableIR lowerer, since the lowerer sees all potential reads of the shuffle (effectively, the transitive closure of the parents of the TableKeyBy or TableOrderBy being lowered into a pair of stages via a ShuffleWrite and a ShuffleRead).
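
As a rough sketch of how this might play out (the let-binding syntax and readInputPartition are illustrative, not actual IR syntax), a TableKeyBy or TableOrderBy lowered through the shuffler could sequence the nodes like this, with ShuffleDelete appended by the lowerer once every read has been emitted:

let id     = ShuffleStart(keyFields, rowType, codecAndEType)
// stage 1: one ShuffleWrite per input partition p, attempt a, run on workers
ShuffleWrite(id, p, a, readInputPartition(p))
let bounds = ShuffleWritingFinished(id, outPartitions)
// stage 2: one ShuffleRead per output partition i in [0, outPartitions), run on workers
ShuffleRead(id, i)    // rows with keys in [bounds[i], bounds[i+1])
ShuffleDelete(id)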

ShuffleWritingFinished returns an array of length outPartitions + 1. Partition i's bounds are [result[i], result[i+1]).
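
For example (the keys shown are purely illustrative), with outPartitions = 3 the result might be [k0, k1, k2, k3]: partition 0 holds keys in [k0, k1), partition 1 holds [k1, k2), and partition 2 holds [k2, k3).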

A ShuffleRead may pass any partitionId in [0, outPartitions). The same partition may be read multiple times, possibly by multiple actors. The stream returned for each partition contains only records whose keys fall within that partition's bounds (left-inclusive, right-exclusive) as returned by ShuffleWritingFinished. Each partition will hold roughly the same number of records.
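
For illustration only (partitionFor is not one of the IR nodes above), the bounds array returned by ShuffleWritingFinished lets a client map a key to its partition by binary search, consistent with the left-inclusive, right-exclusive convention:

partitionFor(bounds, key):          // assumes bounds[0] <= key < bounds[last]
  lo = 0; hi = length(bounds) - 2   // candidate partitions 0 .. outPartitions - 1
  while lo < hi:                    // find the largest i with bounds[i] <= key
    mid = (lo + hi + 1) / 2         // integer division
    if bounds[mid] <= key: lo = mid
    else: hi = mid - 1
  return lo                         // key falls in [bounds[lo], bounds[lo+1])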

danking commented Mar 2, 2020

  • can the leader ask the shuffler for statistics & metadata about the shuffle?
  • can I read the shuffle twice?

danking commented Mar 2, 2020

  • should we have a separate "start read" node that permits multiple reads (without permitting more writes)?

danking commented Mar 2, 2020

  • a shuffle read that accepts a key interval instead of a partition number (one possible shape is sketched below)
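
One possible shape for such a node, in the notation above (the name ShuffleReadInterval and the Interval type are illustrative, not settled):

ShuffleReadInterval(
  id UUID,
  keyRange Interval<rowType.select(keyFields)>   // left-inclusive, right-exclusive
): Stream<Struct>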

danking commented Mar 2, 2020

  • Q: what are the shuffle primitives we want? How much should the lowerer specify the partitioner? Not at all? Number of partitions? The whole partitioning?

danking commented Mar 2, 2020

  • maybe the leader should specify the successful attempt ids? Cotton: no, we should make it so that which attempt succeeds doesn't matter; only guarantee atomicity at the level of a partition.

danking commented Mar 2, 2020

Tim TODO:

  • try to implement a Spark version of this IR or formulate a way to do both
  • draft the structure of the lowerer for the shuffler
