Benchmarks of global reductions with varying numbers of SMs (streaming multiprocessors) and warps per SM. One notable finding: running one warp on each of several SMs is more efficient than running several warps on a single SM.
All measurements were taken on an NVIDIA Titan X GPU.
Code for the benchmark is here and the kernel is here.
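For readers without the linked sources at hand, here is a minimal sketch of what such a benchmark could look like. It is not the author's kernel: the names `reduce_sum` and `time_reduce`, the input size, and the swept configurations are all hypothetical. It also assumes that launching one block per SM is a reasonable proxy for "number of SMs"; the hardware scheduler decides actual block placement, so this is an approximation rather than a guarantee.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread accumulates a grid-strided slice of `in`,
// each warp combines its lanes with shuffles, and lane 0 adds the warp's
// partial sum to the global result. blockDim.x must be a multiple of 32.
__global__ void reduce_sum(const float* in, float* out, size_t n) {
    float acc = 0.0f;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        acc += in[i];

    // Warp-level tree reduction; afterwards lane 0 holds the warp's sum.
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_down_sync(0xffffffffu, acc, offset);

    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, acc);
}

// Times one launch with `blocks` blocks (assumed: roughly one per SM)
// and `warps` warps per block. Returns elapsed milliseconds.
float time_reduce(int blocks, int warps, const float* d_in, float* d_out, size_t n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaMemset(d_out, 0, sizeof(float));

    cudaEventRecord(start);
    reduce_sum<<<blocks, warps * 32>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t n = 1 << 24;            // hypothetical input size
    std::vector<float> h_in(n, 1.0f);    // all ones, so the exact sum is n

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    time_reduce(1, 1, d_in, d_out, n);   // warm-up launch, result discarded

    // Sweep a few (blocks, warps-per-block) configurations.
    for (int blocks = 1; blocks <= 16; blocks *= 2) {
        for (int warps = 1; warps <= 16; warps *= 2) {
            float ms = time_reduce(blocks, warps, d_in, d_out, n);
            float sum = 0.0f;
            cudaMemcpy(&sum, d_out, sizeof(float), cudaMemcpyDeviceToHost);
            printf("blocks=%2d warps=%2d time=%8.3f ms sum=%.0f\n", blocks, warps, ms, sum);
        }
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Compiled with `nvcc` (CUDA 9 or later for `__shfl_down_sync`), this prints one timing per configuration. Comparing, say, 8 blocks of 1 warp against 1 block of 8 warps is the kind of contrast the finding above describes. A single timed launch is noisy, so a real benchmark would average over many repetitions.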
1 warp