sebbbi/FastUniformLoadWithWaveOps.txt

## FastUniformLoadWithWaveOps.txt
In shader programming, you often run into a problem where you want to iterate an array in memory over all pixels in a compute shader
group (tile). Tiled deferred lighting is the most common case. 8x8 tile loops over a light list culled for that tile.

Simplified HLSL code looks like this:

Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
RWTexture2D<float4> output;

[numthreads(8, 8, 1)]
void CSMain(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID)
{
    uint2 lightStartCount = lightStartCounts[gid.xy];
    float4 lightAccumulator = 0;

    for (uint i = 0; i < lightStartCount.y; ++i)
    {
        uint lightIndex = lightStartCount.x + i;

        // In real impl light data would be position, radius, color, etc.
        float4 lightData = lightDatas[lightIndex];	// 64-wide mem load. All lanes load from same address.
        lightAccumulator += lightData;
    }

    output[tid.xy] = lightAccumulator;
}

The problem with this code is that we load the same light data for each lane. If you would have used constant buffer instead of
Buffer<float4>, the compiler would have generated special code utilizing GPU specific uniform load facility like constant buffer
hardware (Nvidia) or scalar unit (AMD). But this can be problematic, since constant buffers have 64KB size limit, and constant buffer
array elements must be float4/int4 (16 byte alignment). Not even enough space to hold 1080p tile light lists.

Fortunately with SM 6.0 wave intrinsics we can do better. We can load 32 (Nvidia) or 64 (AMD) ligths at once using a single load
instruction and then use WaveReadLaneAt to broadcast light data from one lane to all lanes, one lane at a time. This reduces the number
of load instructions by 32x / 64x. Also there's no register bloat, since each loaded lane has now unique data, instead of replicated
value. Register file usage is as good as with AMD scalar loads.

Optimized code looks like this:

[numthreads(8, 8, 1)]
void CSMainWave(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID, uint gix : SV_GroupIndex)
{
    const uint WAVE_LANES = WaveGetLaneCount(); // Usually 64 or 32. Compile time constant.

    uint2 lightStartCount = lightStartCounts[gid.xy];
    float4 lightAccumulator = 0;

    for (uint i = 0; i < divRoundUp(lightStartCount.y, WAVE_LANES); ++i)
    {
        // Load 32/64 unique items at once in outer loop. 32x/64x reduction in mem loads.
        float4 lightData64 = lightDatas[lightStartCount.x + gix + i * WAVE_LANES];

        uint loopIters = min(WAVE_LANES, lightStartCount.y - WAVE_LANES * i);
        for (uint i2 = 0; i2 < loopIters; ++i2)
        {
            // In real impl light data would be position, radius, color, etc.
            float4 lightData = WaveReadLaneAt(lightData64, i2);	// Broadcast current light to all lanes
            lightAccumulator += lightData;
        }
    }

    output[tid.xy] = lightAccumulator;
}

This code actually will beat scalar optimized code on AMD, since one vector load is faster than 64 scalar loads. Another advantage is
that this optimization works for all buffer and texture types, not just for constant/raw/structured buffers.

According to my GPU buffer benchmark (https://github.com/sebbbi/perftest), new Nvidia Maxwell/Pascal/Volta/Turing drivers employ
uniform address optimization for all buffer and texture types. I have seen 20x-27x performance increases for cases where all threads
inside a loop load from uniform address. My guess is this Nvidia optimization employed in their new compiler is similar to the
optimization I described above. I don't have access to their PIX plugin to validate my claim. Maybe someone with access to that plugin
can validate my assumption (if NDA is not a problem).

Recent AMD drivers employ a similar optimization for global atomic add. This can be validated with publicly available tools such as
RenderDoc. If all lanes add C or 0, the compiler first does WavePrefixCountBits (MBCNT) and then a single lane global atomic, reducing
serialization to atomic counter by 64x.

I hope that AMD also implements the optimization above to their shader compiler. In AMDs case, this optimization is actually super
good, since in most cases AMD gets also full memory coalescing gain from this optimizationg, resulting in 4x load issue rate.

Shader playground link:
http://shader-playground.timjones.io/04cb6af828fe0e1a48d651e25d06cffb
	In shader programming, you often run into a problem where you want to iterate an array in memory over all pixels in a compute shader
	group (tile). Tiled deferred lighting is the most common case. 8x8 tile loops over a light list culled for that tile.

	Simplified HLSL code looks like this:

	Buffer<float4> lightDatas;
	Texture2D<uint2> lightStartCounts;
	RWTexture2D<float4> output;

	[numthreads(8, 8, 1)]
	void CSMain(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID)
	{
	uint2 lightStartCount = lightStartCounts[gid.xy];
	float4 lightAccumulator = 0;

	for (uint i = 0; i < lightStartCount.y; ++i)
	{
	uint lightIndex = lightStartCount.x + i;

	// In real impl light data would be position, radius, color, etc.
	float4 lightData = lightDatas[lightIndex]; // 64-wide mem load. All lanes load from same address.
	lightAccumulator += lightData;
	}

	output[tid.xy] = lightAccumulator;
	}

	The problem with this code is that we load the same light data for each lane. If you would have used constant buffer instead of
	Buffer<float4>, the compiler would have generated special code utilizing GPU specific uniform load facility like constant buffer
	hardware (Nvidia) or scalar unit (AMD). But this can be problematic, since constant buffers have 64KB size limit, and constant buffer
	array elements must be float4/int4 (16 byte alignment). Not even enough space to hold 1080p tile light lists.

	Fortunately with SM 6.0 wave intrinsics we can do better. We can load 32 (Nvidia) or 64 (AMD) ligths at once using a single load
	instruction and then use WaveReadLaneAt to broadcast light data from one lane to all lanes, one lane at a time. This reduces the number
	of load instructions by 32x / 64x. Also there's no register bloat, since each loaded lane has now unique data, instead of replicated
	value. Register file usage is as good as with AMD scalar loads.

	Optimized code looks like this:

	[numthreads(8, 8, 1)]
	void CSMainWave(uint3 tid : SV_DispatchThreadID, uint3 gid : SV_GroupID, uint gix : SV_GroupIndex)
	{
	const uint WAVE_LANES = WaveGetLaneCount(); // Usually 64 or 32. Compile time constant.

	uint2 lightStartCount = lightStartCounts[gid.xy];
	float4 lightAccumulator = 0;

	for (uint i = 0; i < divRoundUp(lightStartCount.y, WAVE_LANES); ++i)
	{
	// Load 32/64 unique items at once in outer loop. 32x/64x reduction in mem loads.
	float4 lightData64 = lightDatas[lightStartCount.x + gix + i * WAVE_LANES];

	uint loopIters = min(WAVE_LANES, lightStartCount.y - WAVE_LANES * i);
	for (uint i2 = 0; i2 < loopIters; ++i2)
	{
	// In real impl light data would be position, radius, color, etc.
	float4 lightData = WaveReadLaneAt(lightData64, i2); // Broadcast current light to all lanes
	lightAccumulator += lightData;
	}
	}

	output[tid.xy] = lightAccumulator;
	}

	This code actually will beat scalar optimized code on AMD, since one vector load is faster than 64 scalar loads. Another advantage is
	that this optimization works for all buffer and texture types, not just for constant/raw/structured buffers.

	According to my GPU buffer benchmark (https://github.com/sebbbi/perftest), new Nvidia Maxwell/Pascal/Volta/Turing drivers employ
	uniform address optimization for all buffer and texture types. I have seen 20x-27x performance increases for cases where all threads
	inside a loop load from uniform address. My guess is this Nvidia optimization employed in their new compiler is similar to the
	optimization I described above. I don't have access to their PIX plugin to validate my claim. Maybe someone with access to that plugin
	can validate my assumption (if NDA is not a problem).

	Recent AMD drivers employ a similar optimization for global atomic add. This can be validated with publicly available tools such as
	RenderDoc. If all lanes add C or 0, the compiler first does WavePrefixCountBits (MBCNT) and then a single lane global atomic, reducing
	serialization to atomic counter by 64x.

	I hope that AMD also implements the optimization above to their shader compiler. In AMDs case, this optimization is actually super
	good, since in most cases AMD gets also full memory coalescing gain from this optimizationg, resulting in 4x load issue rate.

	Shader playground link:
	http://shader-playground.timjones.io/04cb6af828fe0e1a48d651e25d06cffb