Last active November 11, 2017

Massively Parallel Processing of Partitioned Streams

by @paniq

We wish to perform sum/max/filter operations on streams that have been partitioned into groups of varying size, and do this in a massively parallel fashion.

Finding Partitions

Our example is a grouped bit range; the groups can be marked either with an extra flag per entry denoting borders, or with a partition index stored for each entry:

1 0 1|0 0 0 1|0|0 1 1 1 0|1 0|1 0|1|0|1 0 1 1 0 1 0 1 0 0|0 1 1 1|1 0 1|0 1 0 1

We mark the borders and create a tree map using histopyramids:

1 0 1|0 0 0 1|0|0 1 1 1 0|1 0|1 0|1|0|1 0 1 1 0 1 0 1 0 0|0 1 1 1|1 0 1|0 1 0 1
1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0
2       1       1       2       3       0       0       1       1       1      
6                               4                               2
12

From these buffers, we know

  • The total number of partitions
  • The offset of each partition
  • The size of each partition (size(i) = offset(i+1) - offset(i))
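As a sketch of how these quantities fall out of the reduction, here is a minimal CPU-side stand-in (the names `build_histopyramid` and `partition_offset` are assumptions, not from the original): it reduces the border-flag buffer with a fanout of 4, as in the diagram above, and walks the pyramid top-down to recover the offset of each partition.

```python
def build_histopyramid(flags, fanout=4):
    """Bottom-up reduction: each parent cell stores the sum of `fanout` children."""
    levels = [list(flags)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        while len(cur) % fanout:   # zero-pad so the level divides evenly
            cur.append(0)
        levels.append([sum(cur[i:i + fanout]) for i in range(0, len(cur), fanout)])
    return levels  # levels[-1][0] is the total number of partitions

def partition_offset(levels, k, fanout=4):
    """Top-down traversal: index of the k-th set border flag,
    i.e. the offset of partition k."""
    idx = 0
    for level in levels[-2::-1]:
        base = idx * fanout
        for c in range(base, base + fanout):
            if k < level[c]:
                idx = c
                break
            k -= level[c]
    return idx

flags = [1, 0, 0, 0, 1, 0, 1, 0]   # border flags for three partitions
levels = build_histopyramid(flags)
offsets = [partition_offset(levels, k) for k in range(levels[-1][0])]
```

With size(i) = offset(i+1) - offset(i) (the stream length closes the last partition), this recovers all three quantities listed above.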

Partitioned Reduction

We can, e.g., count the bits in each partition using JFA (jump flooding). We start with a step size equal to the largest power of two less than N, where N is the largest partition size (8 in the example below, since N = 10), and halve the step size after each pass until a final pass with step size 1. Each index is updated so that

new_count(i) = count(i) + 
    max (is_neighbor(i,i+step)?count(i+step):0, 
         is_neighbor(i,i-step)?count(i-step):0)
1 0 1|0 0 0 1|0|0 1 1 1 0|1 0|1 0|1|0|1 0 1 1 0 1 0 1 0 0|0 1 1 1|1 0 1|0 1 0 1
1 0 1|0 0 0 1|0|0 1 1 1 0|1 0|1 0|1|0|1 0 1 1 0 1 0 1 1 0|0 1 1 1|1 0 1|0 1 0 1 (8)
1 0 1|0 0 0 1|0|0 1 1 1 0|1 0|1 0|1|0|1 1 1 2 1 1 1 2 1 1|0 1 1 1|1 0 1|0 1 0 1 (4)
2 0 2|0 1 0 1|0|1 2 1 2 1|1 0|1 0|1|0|2 3 2 3 2 3 2 3 2 3|1 2 1 2|2 0 2|0 2 0 2 (2)
2 2 2|1 1 1 1|0|3 3 3 3 3|1 1|1 0|1|0|5 5 5 5 5 5 5 5 5 5|3 3 3 3|2 2 2|2 2 2 2 (1)

We can use the partition offset map created earlier to index into this structure and obtain the bit count for each partition. As with histopyramids, the individual JFA stages can be used to find the index of each set bit; alternatively, one can ignore the borders for this step and build a regular histopyramid over the whole stream.
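The counting pass above can be sketched on the CPU as follows (assuming a per-entry partition index `part` as described under "Finding Partitions"; the function name is made up):

```python
def jfa_partition_count(bits, part):
    """Jump flooding: every element ends up holding the bit count of its
    partition. Step sizes run 2^ceil(log2 N)/2, ..., 2, 1 for largest size N."""
    n = len(bits)
    counts = list(bits)
    max_size = max(part.count(p) for p in set(part))
    step = 1
    while step * 2 < max_size:   # largest power of two below max_size
        step *= 2
    while step >= 1:
        nxt = []
        for i in range(n):
            def nb(j):  # sample j only if it is a same-partition neighbor
                return counts[j] if 0 <= j < n and part[j] == part[i] else 0
            nxt.append(counts[i] + max(nb(i - step), nb(i + step)))
        counts = nxt
        step //= 2
    return counts

bits = [1, 0, 1, 0, 0, 0, 1, 0]   # stream  1 0 1 | 0 0 0 1 | 0
part = [0, 0, 0, 1, 1, 1, 1, 2]   # partition index per entry
counts = jfa_partition_count(bits, part)
```

Reading `counts` at each partition's offset then yields the per-partition totals (here 2, 1 and 0).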

JFA can also be used to find the largest or smallest value in each group. Because we can't shrink the buffers, the cost is O(n * log(m)), where m is the maximum number of entries in a partition, which is still pretty good; we also get optimal occupancy, no divergence, and excellent caching.
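The max case is the same sweep with a pure max combine (which, unlike the add-and-max counting update, is idempotent); a minimal sketch under the same assumptions as before:

```python
def jfa_partition_max(values, part):
    """Per-partition maximum via the same jump-flooding schedule:
    combine with max() instead of add-and-max."""
    n = len(values)
    cur = list(values)
    max_size = max(part.count(p) for p in set(part))
    step = 1
    while step * 2 < max_size:
        step *= 2
    while step >= 1:
        nxt = []
        for i in range(n):
            best = cur[i]
            for j in (i - step, i + step):
                if 0 <= j < n and part[j] == part[i]:
                    best = max(best, cur[j])
            nxt.append(best)
        cur = nxt
        step //= 2
    return cur

maxima = jfa_partition_max([3, 1, 4, 1, 5, 9, 2, 6],
                           [0, 0, 0, 0, 1, 1, 1, 1])
```

The per-partition minimum is the same code with `min`.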

Reducing JFA Passes

When a large 1D buffer is jump-flooded with 3 samples per index, it takes log2(n) passes to complete the flooding. It is possible to trade sample count for pass count and complete the work in log4(n) passes by chaining two steps into one, performing 7 samples at the offsets i, i +- step, i +- step*2, i +- step*3 per pass.
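A sketch of this fused variant for the counting pass above (names are hypothetical): each fused pass computes what the step-2s pass would have written at the three inner samples, then applies the step-s update on top, touching the 7 offsets listed. Because the add-and-max combine is not idempotent, the pass count must stay exact, so a leftover odd pass is executed unfused.

```python
def jfa_count_log4(bits, part):
    """Per-partition bit count in ~log4(N) fused passes of 7 samples each."""
    n = len(bits)
    counts = list(bits)
    max_size = max(part.count(p) for p in set(part))
    steps, s = [], 1
    while s * 2 < max_size:
        s *= 2
    while s >= 1:               # the plain binary schedule: 2^(k-1), ..., 2, 1
        steps.append(s)
        s //= 2
    def nb(c, i, j):            # sample j only if same partition as i
        return c[j] if 0 <= j < n and part[j] == part[i] else 0
    while steps:
        if len(steps) == 1:     # odd pass left over: run it unfused
            s = steps.pop(0)
            counts = [counts[i] + max(nb(counts, i, i - s), nb(counts, i, i + s))
                      for i in range(n)]
        else:
            s2, s = steps.pop(0), steps.pop(0)   # fuse steps 2s and s
            def av(i, j):       # what the step-s2 pass would have written at j
                if not (0 <= j < n and part[j] == part[i]):
                    return 0
                return counts[j] + max(nb(counts, j, j - s2), nb(counts, j, j + s2))
            counts = [av(i, i) + max(av(i, i - s), av(i, i + s)) for i in range(n)]
    return counts

stream = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]   # the 10-wide partition from the trace
result = jfa_count_log4(stream, [0] * 10)
```

On the 10-wide partition from the trace above this finishes in two fused passes (fusing steps 8+4 and 2+1) instead of four, with every element holding the partition total of 5.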
