Beyond3D CUDA Throughput Thingy
This tool runs simple kernels consisting almost exclusively of math instructions, in order to test ALU throughput on compatible CUDA devices.
In particular, it was designed to test FP16 throughput on devices of architecture sm_53 and newer, where native FP16 support is present. FP16, FP32 and FP64 kernels are run and measured.
The kernels themselves are simple, executing little more than long chains of dependent FMA instructions: there is some setup at the top of each kernel to set up a thread index, and a small amount of math at the end to consume the result of the FMA chain and write it out to memory.
In the current version, each thread runs a chain of 1024 FMAs, executed as 8 loops of 128. For the FP16 and FP32 kernels, 8192x4096 threads are run, split into blocks sized to the maximum number of threads per block the device reports it can handle. The FP64 kernels run one quarter of that thread count, mostly to keep the test's runtime down.
Each test uses device intrinsics to feed the compiler front-end, giving it the best chance at good code generation.
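As an illustration of that structure (a sketch, not the tool's actual source; the kernel name and parameters are hypothetical), the FP16 case might look like this, with the `__hfma2` intrinsic driving the 8x128 dependent-FMA chain described above:

```cuda
#include <cuda_fp16.h>

// Hypothetical sketch of the FP16 test kernel: 8 loops of 128
// dependent vec2 FMAs per thread, using the __hfma2 intrinsic so the
// front-end can emit fma.f16x2 PTX directly (requires sm_53+).
__global__ void fp16FmaChain(__half2 *out, __half2 a, __half2 b)
{
    // Per-thread setup: a global index for the final store.
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;

    __half2 acc = b;

    for (int i = 0; i < 8; ++i) {
        #pragma unroll
        for (int j = 0; j < 128; ++j) {
            // Each FMA depends on the previous one through acc.
            acc = __hfma2(a, acc, acc);
        }
    }

    // Consume the chain's result so the compiler can't dead-code it.
    out[tid] = acc;
}
```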
Kernel Details for sm_61 targets
For the FP32 kernel, the generated PTX is mostly fma.f32 instructions, which assemble to the FFMA hardware instruction.
For the FP64 kernel, the generated PTX is mostly fma.f64 instructions, which assemble to the DFMA hardware instruction.
For the FP16 kernel, the generated PTX is mostly fma.f16x2 instructions, which assemble to the vec2 HFMA2 hardware instruction.
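One way to verify these mappings yourself (assuming a CUDA toolkit install; the source file name here is hypothetical) is to inspect the compiler output for an sm_61 target directly:

```shell
# Emit PTX and look for fma.f32 / fma.f64 / fma.f16x2 instructions.
nvcc -arch=sm_61 -ptx fma_test.cu -o fma_test.ptx

# Build a cubin and disassemble the SASS to look for FFMA / DFMA / HFMA2.
nvcc -arch=sm_61 -cubin fma_test.cu -o fma_test.cubin
cuobjdump --dump-sass fma_test.cubin
```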
There are no barriers between instructions in any kernel, so the limiting factors are ALU throughput and scheduling. The tests aren't designed to achieve peak throughput, given the crude tuning of the kernel dimensions and loop count; rather, they exist to show the rough relative throughput of each instruction type.
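The kind of host-side measurement implied here can be sketched as follows (a hedged example, not the tool's code; the launch callback and operation counts are assumptions). Each FMA counts as two floating-point operations, and each vec2 HFMA2 performs two FMAs:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical timing harness: time one kernel launch with CUDA events
// and convert the elapsed time to GFLOPS.
void measure(void (*launch)(void), double opsPerThread, long long threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch();                      // launches the kernel under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // e.g. 1024 FMAs/thread * 2 ops/FMA = 2048 ops/thread for FP32.
    const double gflops = (opsPerThread * threads) / (ms * 1e6);
    printf("%.1f GFLOPS\n", gflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

Comparing the GFLOPS figures between the FP16, FP32 and FP64 runs gives the relative throughput ratios the tool reports.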
On a GeForce GTX 1080, vec2 FP16 throughput is around 1/64th of FP32.