@unixpickle
Last active October 30, 2023 21:28
Global reduction speed

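Benchmarking global reductions with varying numbers of SMs and warps per SM. We find some interesting facts, such as that, for the same total number of warps, spreading the warps across more SMs can be much faster than packing them into a single SM. For example, 32 SMs with 1 warp each reach 180.08 GiB/s, while 1 SM with 32 warps reaches only 98.77 GiB/s.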

Unless noted otherwise, these results were measured on an NVIDIA Titan X GPU.

Code for the benchmark is here and the kernel is here.

| | 1 warp | 2 warps | 4 warps | 8 warps | 16 warps | 32 warps |
|---|---|---|---|---|---|---|
| 1 SM | 5.74 GiB/s | 12.20 GiB/s | 24.26 GiB/s | 47.81 GiB/s | 82.55 GiB/s | 98.77 GiB/s |
| 2 SMs | 12.11 GiB/s | 24.39 GiB/s | 48.48 GiB/s | 92.71 GiB/s | 153.96 GiB/s | 189.01 GiB/s |
| 4 SMs | 24.21 GiB/s | 48.55 GiB/s | 96.13 GiB/s | 176.56 GiB/s | 257.26 GiB/s | 303.50 GiB/s |
| 8 SMs | 48.36 GiB/s | 96.50 GiB/s | 181.74 GiB/s | 294.65 GiB/s | 335.24 GiB/s | 337.33 GiB/s |
| 16 SMs | 96.92 GiB/s | 182.65 GiB/s | 309.08 GiB/s | 334.78 GiB/s | 336.86 GiB/s | 338.26 GiB/s |
| 32 SMs | 180.08 GiB/s | 304.42 GiB/s | 329.64 GiB/s | 336.86 GiB/s | 333.21 GiB/s | 331.60 GiB/s |
| 64 SMs | 179.45 GiB/s | 257.40 GiB/s | 304.99 GiB/s | 330.89 GiB/s | 334.68 GiB/s | 329.70 GiB/s |
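
The claim that spreading warps across SMs beats packing them into one SM can be checked directly against the table above. Here is a minimal sketch in Python, with the numbers copied from the table; the names `titan_x`, `throughput`, and `ratios` are illustrative and not part of the benchmark code.

```python
# Throughput in GiB/s, copied from the Titan X table above.
# Rows: number of SMs. Columns: warps per SM.
sms = [1, 2, 4, 8, 16, 32, 64]
warps = [1, 2, 4, 8, 16, 32]
titan_x = [
    [5.74, 12.20, 24.26, 47.81, 82.55, 98.77],
    [12.11, 24.39, 48.48, 92.71, 153.96, 189.01],
    [24.21, 48.55, 96.13, 176.56, 257.26, 303.50],
    [48.36, 96.50, 181.74, 294.65, 335.24, 337.33],
    [96.92, 182.65, 309.08, 334.78, 336.86, 338.26],
    [180.08, 304.42, 329.64, 336.86, 333.21, 331.60],
    [179.45, 257.40, 304.99, 330.89, 334.68, 329.70],
]


def throughput(num_sms, warps_per_sm):
    """Look up one table cell by its row and column labels."""
    return titan_x[sms.index(num_sms)][warps.index(warps_per_sm)]


# Compare "n SMs x 1 warp" against "1 SM x n warps" (same total warp count).
ratios = {}
for n in [2, 4, 8, 16, 32]:
    spread = throughput(n, 1)   # one warp on each of n SMs
    packed = throughput(1, n)   # n warps on a single SM
    ratios[n] = spread / packed
    print(f"{n:2d} warps: spread {spread:7.2f}  packed {packed:7.2f}  "
          f"ratio {ratios[n]:.2f}")
```

With these numbers the two layouts are nearly tied up to 8 total warps, and the spread layout only pulls clearly ahead at 16 and 32 warps.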

With more instructions

We can take a reduction over sin(x) instead of x to throw more compute instructions into the mix. I've tried that here. Here's the resulting table:

| | 1 warp | 2 warps | 4 warps | 8 warps | 16 warps | 32 warps |
|---|---|---|---|---|---|---|
| 1 SM | 5.24 GiB/s | 11.76 GiB/s | 24.22 GiB/s | 45.89 GiB/s | 78.95 GiB/s | 100.73 GiB/s |
| 2 SMs | 11.21 GiB/s | 24.19 GiB/s | 47.97 GiB/s | 89.09 GiB/s | 145.61 GiB/s | 191.27 GiB/s |
| 4 SMs | 23.05 GiB/s | 48.07 GiB/s | 93.10 GiB/s | 170.31 GiB/s | 245.83 GiB/s | 303.49 GiB/s |
| 8 SMs | 45.72 GiB/s | 93.73 GiB/s | 174.85 GiB/s | 290.67 GiB/s | 331.86 GiB/s | 338.39 GiB/s |
| 16 SMs | 88.97 GiB/s | 175.29 GiB/s | 303.97 GiB/s | 334.67 GiB/s | 337.68 GiB/s | 338.43 GiB/s |
| 32 SMs | 166.40 GiB/s | 300.94 GiB/s | 328.26 GiB/s | 335.70 GiB/s | 333.65 GiB/s | 331.95 GiB/s |
| 64 SMs | 168.66 GiB/s | 254.25 GiB/s | 302.75 GiB/s | 329.47 GiB/s | 334.93 GiB/s | 332.62 GiB/s |
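
To see how much the extra sin(x) work costs, each entry of this table can be divided by the matching entry of the first Titan X table. Here is a rough sketch in Python with the values copied from both tables; the names `plain`, `with_sin`, and `ratios` are illustrative. Reading a ratio close to 1 as "the kernel is limited by memory traffic rather than arithmetic" is an assumption on my part, not something the benchmark measures directly.

```python
# Throughput in GiB/s on the Titan X.
# Rows: 1, 2, 4, 8, 16, 32, 64 SMs. Columns: 1, 2, 4, 8, 16, 32 warps per SM.
plain = [  # reduction over x
    [5.74, 12.20, 24.26, 47.81, 82.55, 98.77],
    [12.11, 24.39, 48.48, 92.71, 153.96, 189.01],
    [24.21, 48.55, 96.13, 176.56, 257.26, 303.50],
    [48.36, 96.50, 181.74, 294.65, 335.24, 337.33],
    [96.92, 182.65, 309.08, 334.78, 336.86, 338.26],
    [180.08, 304.42, 329.64, 336.86, 333.21, 331.60],
    [179.45, 257.40, 304.99, 330.89, 334.68, 329.70],
]
with_sin = [  # reduction over sin(x)
    [5.24, 11.76, 24.22, 45.89, 78.95, 100.73],
    [11.21, 24.19, 47.97, 89.09, 145.61, 191.27],
    [23.05, 48.07, 93.10, 170.31, 245.83, 303.49],
    [45.72, 93.73, 174.85, 290.67, 331.86, 338.39],
    [88.97, 175.29, 303.97, 334.67, 337.68, 338.43],
    [166.40, 300.94, 328.26, 335.70, 333.65, 331.95],
    [168.66, 254.25, 302.75, 329.47, 334.93, 332.62],
]

# Element-wise ratio: sin(x) throughput relative to plain throughput.
ratios = [
    [s / p for s, p in zip(sin_row, plain_row)]
    for sin_row, plain_row in zip(with_sin, plain)
]

flat = [r for row in ratios for r in row]
print(f"lowest ratio:  {min(flat):.3f}")
print(f"highest ratio: {max(flat):.3f}")
```

Every configuration keeps more than 90% of the plain reduction's throughput, and the differences in the saturated cells are small enough that they may just be run-to-run noise.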

sin(x) reduction on an H100

Here's the same code as above, but on an H100:

| | 1 warp | 2 warps | 4 warps | 8 warps | 16 warps | 32 warps |
|---|---|---|---|---|---|---|
| 1 SM | 2.94 GiB/s | 5.57 GiB/s | 11.02 GiB/s | 21.26 GiB/s | 41.59 GiB/s | 81.47 GiB/s |
| 2 SMs | 5.85 GiB/s | 11.04 GiB/s | 21.90 GiB/s | 42.12 GiB/s | 82.52 GiB/s | 162.32 GiB/s |
| 4 SMs | 11.71 GiB/s | 22.07 GiB/s | 43.71 GiB/s | 84.23 GiB/s | 165.47 GiB/s | 323.75 GiB/s |
| 8 SMs | 23.40 GiB/s | 44.03 GiB/s | 87.35 GiB/s | 168.77 GiB/s | 330.24 GiB/s | 634.83 GiB/s |
| 16 SMs | 46.70 GiB/s | 88.15 GiB/s | 174.68 GiB/s | 336.76 GiB/s | 646.65 GiB/s | 1202.34 GiB/s |
| 32 SMs | 91.74 GiB/s | 173.64 GiB/s | 344.12 GiB/s | 651.42 GiB/s | 1200.77 GiB/s | 2073.66 GiB/s |
| 64 SMs | 183.51 GiB/s | 347.69 GiB/s | 676.22 GiB/s | 1228.53 GiB/s | 2076.83 GiB/s | 2713.38 GiB/s |
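
Unlike the Titan X, the H100 does not visibly plateau within this grid. One way to see where scaling starts to taper is to look at the speedup from each doubling of the SM count at 32 warps per SM. Here is a small sketch in Python using the last column of the table above; the names `h100_32_warps` and `doubling` are illustrative.

```python
# H100 throughput in GiB/s at 32 warps per SM, copied from the table above.
sms = [1, 2, 4, 8, 16, 32, 64]
h100_32_warps = [81.47, 162.32, 323.75, 634.83, 1202.34, 2073.66, 2713.38]

# Speedup obtained each time the SM count is doubled.
# A value of 2.0 would be perfect linear scaling.
doubling = [
    after / before
    for before, after in zip(h100_32_warps, h100_32_warps[1:])
]

for n, factor in zip(sms, doubling):
    print(f"{n:2d} -> {2 * n:2d} SMs: x{factor:.2f}")
```

The factor stays close to 2 for the first few doublings and drops noticeably over the last two, which suggests the configuration is starting to approach a bandwidth limit at the bottom of the table.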