There might be a collection of tasks, and a collection of processors, and we want to associate tasks with a particular processor depending on some rule. Or, in an N-body physics simulation, we may want to figure out which of the N particles are close enough to interact with some particular particle. These problems, and many others, have a similar shape: determine whether some objects in one collection are related in some way to objects in another collection.
It is easy to write parallelized solutions to this problem. Suppose we have a collection of objects `A`, and a collection of objects `B`, related by a boolean function `p: (A, B) -> bool`, and for each object in `A` we want to determine which objects in `B` are related to it via `p`. We have a GPU, with many individual processors, so we can associate each object in `A`, `a_i`, with the `i`th processor, and each such processor can loop over the elements in `B`, indexed `b_j`, and store the result `p(a_i, b_j)` to output.
Data on a GPU should be stored as compactly as possible, in order to alleviate read/write times and register pressure. For boolean data, bitmaps are the way to go, so it makes sense to store `p(a_i, b_j)` in a bitmap. For example, if there were 32 elements in collection `A`, and 64 elements in `B`, we could store the result in 32 `uvec2`s of 32-bit unsigned integers: a `uvec2[32]`. To look up whether element `a_4` in `A` is related to element `b_32` in `B`, we fetch the `uvec2` with index 4, and look at the first bit of the second unsigned integer in the vec (the first `uint` in the `uvec2` stores data for elements 0 to 31 in `B`, and the second `uint` stores data for elements 32 to 63). The details of the number of elements in `A` and `B` don't matter, so for the sake of simplicity, let's assume that there are 32 elements in both. We can thus neatly store the relationship results in `uint[32] bm;`, where `a_i` in `A` is related to `b_j` in `B` if `bm[i] & (1 << j) != 0`.
There could be many reasons why we are interested in the transpose of `bm`: `uint[32] tbm;`, where `tbm[i] & (1 << j)` stores whether element `b_i` in `B` is related to element `a_j` in `A`. For example, it might be less expensive to compute `tbm`, and then transpose it, rather than calculating `bm` directly. We were faced with having to transpose `bm` when writing an experimental GPU accelerated implementation of the 2D graphics API piet.
Transposing a `uint[32]` using 32 processors on a GPU can be approached in two ways: one is commonly used, and the other not so much. The commonly used method is the obvious one: use threadgroup shared memory to calculate the transpose. Or, you can be fancy, and use subgroup operations, which take advantage of the ultimately SIMD structure of the GPU, to calculate the transpose by shuffling data between the 32 threads directly, without resorting to threadgroup shared memory as an intermediate.
Which one you should use depends on the performance criteria that matter to you, and how these two methods measure up against those criteria. Our criteria are twofold: the method should be reasonably easy to implement on a variety of recent hardware (from the last 5 years), and it should perform well (the faster, the better). Given that there aren't any existing resources we're aware of that answer these questions, we decided to make such a resource.