Outline

  • Make it relevant to the audience.

    • share a couple of specific examples, and identify the general shape of the problem
  • Describe how this problem is easy to parallelize, particularly on a GPU.

  • Introduce the specific shape of our problem: why are we using bitmaps?

    • Storing data compactly on a GPU (see the packing sketch below).
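
For concreteness, here is a sketch of one possible packing, which the later sketches in this outline also assume: one uint per 32-element row, so a 32x32 boolean matrix occupies 32 words instead of 1024. The buffer name `bit_matrices`, the binding numbers, and the bit-c-is-column-c convention are illustrative assumptions, not necessarily what the benchmark kernels use.

```glsl
#version 450

layout(local_size_x = 32) in;

// One 32x32 boolean matrix = 32 rows, each packed into one uint:
// element (r, c) of matrix g is bit c of rows[g * 32u + r].
// That is 1 bit per element, instead of 32 bits if each bool were a uint.
layout(set = 0, binding = 0) buffer BitMatrices {
    uint rows[];
} bit_matrices;

bool get_element(uint matrix_index, uint r, uint c) {
    return ((bit_matrices.rows[matrix_index * 32u + r] >> c) & 1u) != 0u;
}

void main() {
    // Each 32-thread workgroup owns one matrix; each thread owns one packed row.
    uint row = bit_matrices.rows[gl_GlobalInvocationID.x];
    // The transpose sketches later in this outline operate on values like `row`.
}
```
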
  • Introduce the two ways in which we can approach our specific problem: threadgroups and subgroups (a minimal threadgroup-based sketch follows this item).

    • (Unsure) Should we mention here that the subgroup method is more involved, because APIs and shader languages are "threadgroup first"?
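
As a point of reference for the threadgroup approach mentioned above, here is a minimal GLSL compute shader that transposes one 32x32 bit matrix per 32-thread workgroup through shared memory, using the packing sketched earlier with one packed row per thread. The kernels actually benchmarked are presumably more elaborate (e.g. batching many matrices per dispatch), so treat this as an illustration rather than the measured code.

```glsl
#version 450

layout(local_size_x = 32) in;

layout(set = 0, binding = 0) buffer BitMatrices {
    uint rows[];
} bit_matrices;

// One 32x32 bit matrix per workgroup, one packed row per thread.
shared uint sh_rows[32];

void main() {
    uint i = gl_LocalInvocationID.x;
    // Dispatch one 32-thread workgroup per matrix, so this indexes row i of
    // the workgroup's matrix.
    uint row = bit_matrices.rows[gl_GlobalInvocationID.x];

    sh_rows[i] = row;
    barrier();

    // Row i of the transpose is column i of the input:
    // gather bit i from every stored row.
    uint out_row = 0u;
    for (uint k = 0u; k < 32u; k++) {
        out_row |= ((sh_rows[k] >> i) & 1u) << k;
    }

    bit_matrices.rows[gl_GlobalInvocationID.x] = out_row;
}
```
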
  • Want to choose a method depending on its time performance/ease of implementation on a variety of recent hardware. There are no resources that directly provide performance/ease-of-implementation data, so we made one.

  • Introduce the discrete GPU data first, comparing threadgroup transposes with subgroup shuffle transposes (a shuffle sketch follows this item). Note the large gains from subgroup ops on Nvidia GPUs, but the relatively minor gain on the AMD GPU.

    • Briefly discuss why we expected subgroups to do better: show the memory hierarchy picture.
    • Note our surprise regarding the performance of threadgroup shared memory and subgroup memory on the AMD GPU, given this memory hierarchy picture.
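
For reference, a sketch of the shuffle approach mentioned above, assuming a 32-wide (or wider) subgroup with each lane holding one packed row: it is the classic block-recursive bit-matrix transpose (as in Hacker's Delight's transpose32, adapted here to the bit-c-is-column-c packing), with the cross-row exchanges done by subgroupShuffleXor instead of shared memory. The function name is ours, and it is meant to drop into the same shader shell as the previous sketch; it is not necessarily the benchmarked kernel.

```glsl
// Add at the top of the shader:
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_shuffle : enable

// Subgroup-shuffle transpose of a 32x32 bit matrix, one packed row per lane
// (bit c = column c). Every shuffle distance is <= 16, so each aligned group
// of 32 lanes transposes its own matrix; this works at any subgroup width of
// at least 32.
uint shuffle_transpose(uint row) {
    uint lane = gl_SubgroupInvocationID;
    uint x = row;
    uint j = 16u;            // half-size of the blocks being swapped
    uint m = 0xFFFF0000u;    // mask of the column half being exchanged
    while (j != 0u) {
        // Fetch the row held by the lane whose index differs only in bit j.
        uint y = subgroupShuffleXor(x, j);
        if ((lane & j) == 0u) {
            x ^= (x ^ (y << j)) & m;        // lower-index lane of the pair
        } else {
            x ^= ((y ^ (x << j)) & m) >> j; // higher-index lane of the pair
        }
        j >>= 1u;
        m ^= (m >> j);
    }
    return x;
}
```
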
  • Discuss how HLSL + DX12 don't have access to shuffle ops, and in this context introduce ballot transposes (sketched below). Note the poor performance of this method on discrete GPUs.
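
A sketch of the ballot approach under the same assumptions (gl_SubgroupSize == 32, one packed row per lane), written in GLSL for consistency with the other sketches; HLSL's WaveActiveBallot is the analogous op. Note that it needs one ballot per output row, 32 in total, which plausibly contributes to its poor showing on discrete GPUs.

```glsl
// Add at the top of the shader:
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_ballot : enable

// Ballot transpose of a 32x32 bit matrix, one packed row per lane
// (bit c = column c). Assumes gl_SubgroupSize == 32.
uint ballot_transpose(uint row) {
    uint lane = gl_SubgroupInvocationID;
    uint result = 0u;
    for (uint c = 0u; c < 32u; c++) {
        // Bit j of the ballot is lane j's vote, i.e. element (j, c) of the
        // input, so the ballot of column c is row c of the transpose.
        uvec4 votes = subgroupBallot(((row >> c) & 1u) != 0u);
        if (lane == c) {
            result = votes.x; // only the low 32 ballot bits matter at width 32
        }
    }
    return result;
}
```
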

  • Discuss the challenge of implementing subgroup shuffle transposition on integrated Intel GPUs:

    • introduce hybrid shuffle in this context. Note our surprise in finding that, unlike on the discrete GPUs, hybrid shuffle does worse than threadgroups. Why is this? We don't know.
    • introduce 8x8 bit matrix transposition in this context (sketched below). Note that the performance of Intel GPUs on this task is very comparable to that of the discrete GPUs.
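
One hedged way to see why the 8x8 case is attractive on Intel: with one 8-bit row per lane, every exchange is a shuffle of distance less than 8, so each aligned group of 8 lanes transposes its own matrix and the code is valid at any subgroup width of at least 8, regardless of which logical SIMD width the compiler picked. As before, this is an illustrative sketch in the style of the earlier shuffle transpose, not necessarily the benchmarked kernel.

```glsl
// Add at the top of the shader:
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_shuffle : enable

// Transpose of an 8x8 bit matrix, one 8-bit row per lane (in the low byte of
// a uint, bit c = column c). Only shuffle distances < 8 are used, so each
// aligned group of 8 lanes handles its own matrix at any subgroup width >= 8
// (e.g. Intel's SIMD8/SIMD16/SIMD32).
uint shuffle_transpose_8x8(uint row) {
    uint lane = gl_SubgroupInvocationID;
    uint x = row & 0xFFu;
    uint j = 4u;      // half-size of the blocks being swapped
    uint m = 0xF0u;   // mask of the column half being exchanged
    while (j != 0u) {
        uint y = subgroupShuffleXor(x, j);
        if ((lane & j) == 0u) {
            x ^= (x ^ (y << j)) & m;        // lower-index lane of the pair
        } else {
            x ^= ((y ^ (x << j)) & m) >> j; // higher-index lane of the pair
        }
        j >>= 1u;
        m ^= (m >> j);
    }
    return x;
}
```
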
  • Discuss the surprising result that "2D threadgroups" do worse with increasing threadgroup size, compared to "1D threadgroups".

    • Surprising, because we'd expect 2D threadgroups to be nothing more than syntactic sugar over threadgroups, which are ultimately always 1D (see the sketch below)?
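
For reference, the kind of difference being compared (illustrative GLSL, not the benchmark shaders): naively, the 2D declaration looks like pure index arithmetic over the 1D one.

```glsl
#version 450

// "1D" threadgroup: 256 threads in a row.
layout(local_size_x = 256, local_size_y = 1, local_size_z = 1) in;

// "2D" threadgroup: also 256 threads, declared as a 16x16 grid.
// (Only one local_size declaration can be active in a shader.)
// layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

void main() {
    // Either way, gl_LocalInvocationIndex runs over 0..255; for the 16x16
    // layout it equals gl_LocalInvocationID.y * 16 + gl_LocalInvocationID.x,
    // so naively the 2D form is just index arithmetic over the 1D form.
    uint flat_index = gl_LocalInvocationIndex;
}
```
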

(Unsure: in each one of these bullet points, should we discuss our understanding of the hardware internals, as in the "detailed" write-up? Should we discuss it much more briefly? Should we just avoid it altogether, and link to a separate document which discusses it?)

  • Conclusion:
    • HLSL + DX12 don't support all the subgroup ops available in GLSL + Vulkan. This makes it impossible to use the subgroup method if you are based on HLSL + DX12 (as the piet-dx12 library is).
    • We are interested in supporting Intel GPUs, and were challenged by the fact that we're unable to determine exactly which of the possible logical SIMD widths the compiler chooses (GL_SUBGROUP_SIZE always returns the maximum logical SIMD width). More importantly, hybrid shuffle just doesn't work as well as pure threadgroups. This is surprising. Why is this? Does anyone know?
    • On AMD GPUs, we think (based on our N=1 sample size) that subgroups don't provide much of a benefit; we suppose this is because of the amazing performance of threadgroup shared memory. Does anyone know the actual reasons (e.g. some sort of automatic SIMD fallback), or why threadgroup shared memory is so performant?
    • Ultimate moral of the story: since Nvidia GPUs are the only devices which benefit strongly from subgroup ops, and given all the other downsides of subgroup ops, we interpret our data as recommending that it's not worth using them.
    • (Unsure) link to the much more boring/detailed write-up, for the sake of beginners?