Last active November 11, 2017

Massively Parallel Processing of Partitioned Streams

by @paniq

We wish to perform sum/max/filter operations on streams that have been partitioned into groups of varying size, and do this in a massively parallel fashion.

Finding Partitions

Our example is a grouped bit range; the groups can be marked either with an extra flag per entry denoting borders, or with a partition index stored for each entry:

1 0 1|0 0 0 1|0|0 1 1 1 0|1 0|1 0|1|0|1 0 1 1 0 1 0 1 0 0|0 1 1 1|1 0 1|0 1 0 1

We mark the borders and create a tree map using histopyramids:

1 0 1|0 0 0 1|0|0 1 1 1 0|1 0|1 0|1|0|1 0 1 1 0 1 0 1 0 0|0 1 1 1|1 0 1|0 1 0 1
1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0
2       1       1       2       3       0       0       1       1       1      
6                               4                               2
12

From these buffers, we know

  • The total number of partitions
  • The offset of each partition
  • The size of each partition (size(i) = offset(i+1) - offset(i))
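As a sketch of how these quantities fall out of the reduction, here is a minimal CPU-side stand-in (the names `build_histopyramid` and `partition_offset` are assumptions, not from the original): it reduces the border-flag buffer with a fanout of 4, as in the diagram above, and walks the pyramid top-down to recover the offset of each partition.

```python
def build_histopyramid(flags, fanout=4):
    """Bottom-up reduction: each parent cell stores the sum of `fanout` children."""
    levels = [list(flags)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        while len(cur) % fanout:   # zero-pad so the level divides evenly
            cur.append(0)
        levels.append([sum(cur[i:i + fanout]) for i in range(0, len(cur), fanout)])
    return levels  # levels[-1][0] is the total number of partitions

def partition_offset(levels, k, fanout=4):
    """Top-down traversal: index of the k-th set border flag,
    i.e. the offset of partition k."""
    idx = 0
    for level in levels[-2::-1]:
        base = idx * fanout
        for c in range(base, base + fanout):
            if k < level[c]:
                idx = c
                break
            k -= level[c]
    return idx

flags = [1, 0, 0, 0, 1, 0, 1, 0]   # border flags for three partitions
levels = build_histopyramid(flags)
offsets = [partition_offset(levels, k) for k in range(levels[-1][0])]
```

With size(i) = offset(i+1) - offset(i) (the stream length closes the last partition), this recovers all three quantities listed above.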

Partitioned Reduction

We can, e.g., count the bits in each partition using JFA (jump flooding). We start with a step size equal to the largest power of two less than N, where N is the largest partition size (8 in the example below, since N = 10), and halve the step size after each pass until a final pass with step size 1. Each index is updated so that

new_count(i) = count(i) + 
    max (is_neighbor(i,i+step)?count(i+step):0, 
         is_neighbor(i,i-step)?count(i-step):0)
1 0 1|0 0 0 1|0|0 1 1 1 0|1 0|1 0|1|0|1 0 1 1 0 1 0 1 0 0|0 1 1 1|1 0 1|0 1 0 1
1 0 1|0 0 0 1|0|0 1 1 1 0|1 0|1 0|1|0|1 0 1 1 0 1 0 1 1 0|0 1 1 1|1 0 1|0 1 0 1 (8)
1 0 1|0 0 0 1|0|0 1 1 1 0|1 0|1 0|1|0|1 1 1 2 1 1 1 2 1 1|0 1 1 1|1 0 1|0 1 0 1 (4)
2 0 2|0 1 0 1|0|1 2 1 2 1|1 0|1 0|1|0|2 3 2 3 2 3 2 3 2 3|1 2 1 2|2 0 2|0 2 0 2 (2)
2 2 2|1 1 1 1|0|3 3 3 3 3|1 1|1 0|1|0|5 5 5 5 5 5 5 5 5 5|3 3 3 3|2 2 2|2 2 2 2 (1)

We can use the partition offset map created earlier to index into this structure and obtain the bit count for each partition. As with histopyramids, the individual JFA stages can be used to find the index of each set bit; alternatively, one can ignore the borders for this step and build a regular histopyramid over the whole stream.
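The counting pass above can be sketched on the CPU as follows (assuming a per-entry partition index `part` as described under "Finding Partitions"; the function name is made up):

```python
def jfa_partition_count(bits, part):
    """Jump flooding: every element ends up holding the bit count of its
    partition. Step sizes run 2^ceil(log2 N)/2, ..., 2, 1 for largest size N."""
    n = len(bits)
    counts = list(bits)
    max_size = max(part.count(p) for p in set(part))
    step = 1
    while step * 2 < max_size:   # largest power of two below max_size
        step *= 2
    while step >= 1:
        nxt = []
        for i in range(n):
            def nb(j):  # sample j only if it is a same-partition neighbor
                return counts[j] if 0 <= j < n and part[j] == part[i] else 0
            nxt.append(counts[i] + max(nb(i - step), nb(i + step)))
        counts = nxt
        step //= 2
    return counts

bits = [1, 0, 1, 0, 0, 0, 1, 0]   # stream  1 0 1 | 0 0 0 1 | 0
part = [0, 0, 0, 1, 1, 1, 1, 2]   # partition index per entry
counts = jfa_partition_count(bits, part)
```

Reading `counts` at each partition's offset then yields the per-partition totals (here 2, 1 and 0).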

JFA can also be used to find the largest or smallest value in each group. Because we can't shrink the buffers, the cost is O(n * log(m)), where m is the maximum number of entries in a partition, which is still pretty good; we also get optimal occupancy, no divergence, and excellent caching.
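The max case is the same sweep with a pure max combine (which, unlike the add-and-max counting update, is idempotent); a minimal sketch under the same assumptions as before:

```python
def jfa_partition_max(values, part):
    """Per-partition maximum via the same jump-flooding schedule:
    combine with max() instead of add-and-max."""
    n = len(values)
    cur = list(values)
    max_size = max(part.count(p) for p in set(part))
    step = 1
    while step * 2 < max_size:
        step *= 2
    while step >= 1:
        nxt = []
        for i in range(n):
            best = cur[i]
            for j in (i - step, i + step):
                if 0 <= j < n and part[j] == part[i]:
                    best = max(best, cur[j])
            nxt.append(best)
        cur = nxt
        step //= 2
    return cur

maxima = jfa_partition_max([3, 1, 4, 1, 5, 9, 2, 6],
                           [0, 0, 0, 0, 1, 1, 1, 1])
```

The per-partition minimum is the same code with `min`.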

Reducing JFA Passes

When a large 1D buffer is jump-flooded with 3 samples per index, it takes log2(n) passes to complete the flooding. It is possible to trade sample count for pass count and complete the work in log4(n) passes by chaining two steps into one, performing 7 samples at the offsets i, i +- step, i +- step*2, i +- step*3 per pass.
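A sketch of this fused variant for the counting pass above (names are hypothetical): each fused pass computes what the step-2s pass would have written at the three inner samples, then applies the step-s update on top, touching the 7 offsets listed. Because the add-and-max combine is not idempotent, the pass count must stay exact, so a leftover odd pass is executed unfused.

```python
def jfa_count_log4(bits, part):
    """Per-partition bit count in ~log4(N) fused passes of 7 samples each."""
    n = len(bits)
    counts = list(bits)
    max_size = max(part.count(p) for p in set(part))
    steps, s = [], 1
    while s * 2 < max_size:
        s *= 2
    while s >= 1:               # the plain binary schedule: 2^(k-1), ..., 2, 1
        steps.append(s)
        s //= 2
    def nb(c, i, j):            # sample j only if same partition as i
        return c[j] if 0 <= j < n and part[j] == part[i] else 0
    while steps:
        if len(steps) == 1:     # odd pass left over: run it unfused
            s = steps.pop(0)
            counts = [counts[i] + max(nb(counts, i, i - s), nb(counts, i, i + s))
                      for i in range(n)]
        else:
            s2, s = steps.pop(0), steps.pop(0)   # fuse steps 2s and s
            def av(i, j):       # what the step-s2 pass would have written at j
                if not (0 <= j < n and part[j] == part[i]):
                    return 0
                return counts[j] + max(nb(counts, j, j - s2), nb(counts, j, j + s2))
            counts = [av(i, i) + max(av(i, i - s), av(i, i + s)) for i in range(n)]
    return counts

stream = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]   # the 10-wide partition from the trace
result = jfa_count_log4(stream, [0] * 10)
```

On the 10-wide partition from the trace above this finishes in two fused passes (fusing steps 8+4 and 2+1) instead of four, with every element holding the partition total of 5.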
