@sebbbi
Last active January 12, 2024 07:16
Single pass globallycoherent mip pyramid generation
// NOTE: Must bind 8 single-mip RWTexture views, because HLSL doesn't have a .mips member for RWTexture2D (only SRVs have .mips).
// NOTE: The globallycoherent attribute is needed. Without it, writes aren't guaranteed to be visible to other groups.
globallycoherent RWTexture2D<float> MipTextures[8];
RWTexture2D<uint> Counters[8];

groupshared uint CounterReturnLDS;

[numthreads(16, 16, 1)]
void GenerateMipPyramid(uint3 Tid : SV_DispatchThreadID, uint3 Group : SV_GroupID, uint Gix : SV_GroupIndex)
{
    [unroll]
    for (int Mip = 0; Mip < 8 - 1; ++Mip)
    {
        // 2x2 box filter downsample.
        float Sum =
            MipTextures[Mip][Tid.xy * 2 + uint2(0, 0)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(1, 0)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(0, 1)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(1, 1)];
        MipTextures[Mip + 1][Tid.xy] = Sum * 0.25;

        // The four groups in a 2x2 tile of groups increment the same counter.
        if (Gix == 0)
        {
            InterlockedAdd(Counters[Mip][Group.xy / 2], 1, CounterReturnLDS);
        }

        // Full memory barrier: at the next mip, the surviving thread group reads
        // data generated by the 3 other thread groups, so it must be visible.
        AllMemoryBarrierWithGroupSync();

        // Kill all groups except the last one to finish in the 2x2 tile. This
        // branch is allowed because CounterReturnLDS is group invariant.
        if (CounterReturnLDS < 3)
        {
            return;
        }

        // Ensure that all threads in the group have read CounterReturnLDS
        // before it is modified in the next loop iteration.
        GroupMemoryBarrierWithGroupSync();

        Tid.xy /= 2;
        Group.xy /= 2;
    }
}
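For context (my addition, not part of the gist): each MipTextures[i] is a UAV created for a single mip slice, the dispatch is presumably sized at one thread per mip 1 texel, and the kill logic only works if the Counters textures start at zero. A minimal clear pass for the counters could look like the sketch below; the name ClearCounters and the 8x8 group size are my choices, and it relies on D3D dropping out-of-bounds UAV writes.

// Hypothetical companion pass (a sketch, not part of the original gist): the
// per-mip counters must be zeroed before each GenerateMipPyramid dispatch,
// otherwise InterlockedAdd results from the previous run leak into this one.
RWTexture2D<uint> CountersToClear[8];

[numthreads(8, 8, 1)]
void ClearCounters(uint3 Tid : SV_DispatchThreadID)
{
    // Size the dispatch for the largest counter texture (one counter per 2x2
    // tile of mip 0 thread groups). Out-of-bounds UAV writes are defined to
    // be dropped in D3D, so the smaller mips need no bounds check.
    [unroll]
    for (int Mip = 0; Mip < 8; ++Mip)
    {
        CountersToClear[Mip][Tid.xy] = 0;
    }
}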
@tcantenot commented Jun 25, 2018

Hi,

First of all, really neat idea :)!

I tried it on Tim Jones's Shader Playground (cs_5_0 target profile): http://shader-playground.timjones.io/
But the fxc compiler seems to disagree: it fails to realize that CounterReturnLDS is group invariant:

error X4026: thread sync operation must be in non-varying flow control, due to a potential race condition this sync is illegal, consider adding a sync after reading any values controlling shader execution at this point
error X4026: this memory access dependent on potentially varying data

Also, there is an out-of-bounds access at line 20:

MipTextures[Mip+1][Tid.xy] = Sum * 0.25;

Either we have to increase the size of the MipTextures array by 1 or reduce the loop count by 1 (and reduce the Counters array size by 1).

Do you have an idea of how to solve the first issue? Or is it a compiler bug, or is it truly illegal?

@kingofthebongo2008

Regarding the first error message: the loop should be unrolled.

@sebbbi (Author) commented Oct 4, 2018

Killing the whole group based on CounterReturnLDS is allowed. AllMemoryBarrierWithGroupSync ensures that all threads in the group are in sync and that groupshared memory writes have finished, so each thread is guaranteed to see the same value when reading CounterReturnLDS. However, there is a bug in this algorithm: I forgot to do a GroupMemoryBarrierWithGroupSync after the return branch. The error message says exactly this. It is a race condition, because some waves in the group might start the next loop iteration before others and execute the InterlockedAdd that overwrites CounterReturnLDS before all waves have been able to read it.

I have used similar ways to kill the whole group in other shaders in shipping code. This is definitely allowed. In DX12 you can also use single-wave groups and use WaveReadLaneFirst instead of groupshared memory to broadcast the value to all lanes in the group (to ensure branch coherency).

Added GroupMemoryBarrierWithGroupSync() to the code to make it valid.
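
As an illustration of that wave-intrinsic variant, here is a hedged sketch (my addition, not sebbbi's shipped code; it assumes SM 6.0 and 64-wide waves, so that an 8x8 group is exactly one wave):

// Sketch of the DX12 single-wave-group variant described above. Assumptions:
// SM 6.0+, 64-lane waves, so the 8x8 group below is a single wave.
globallycoherent RWTexture2D<float> MipTextures[8];
RWTexture2D<uint> Counters[8];

[numthreads(8, 8, 1)]
void GenerateMipPyramidWave(uint3 Tid : SV_DispatchThreadID, uint3 Group : SV_GroupID)
{
    [unroll]
    for (int Mip = 0; Mip < 8 - 1; ++Mip)
    {
        // Same 2x2 downsample as in the gist.
        float Sum =
            MipTextures[Mip][Tid.xy * 2 + uint2(0, 0)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(1, 0)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(0, 1)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(1, 1)];
        MipTextures[Mip + 1][Tid.xy] = Sum * 0.25;

        // One atomic per group, as before, but the return value lives in a
        // register instead of groupshared memory.
        uint CounterReturn = 0;
        if (WaveIsFirstLane())
        {
            InterlockedAdd(Counters[Mip][Group.xy / 2], 1, CounterReturn);
        }

        // Make the other groups' texture writes visible before the surviving
        // group reads them at the next mip.
        DeviceMemoryBarrierWithGroupSync();

        // Broadcast lane 0's atomic return value to the whole wave. The kill
        // branch is then wave-uniform, so no groupshared memory and no second
        // group sync are needed.
        CounterReturn = WaveReadLaneFirst(CounterReturn);
        if (CounterReturn < 3)
        {
            return;
        }

        Tid.xy /= 2;
        Group.xy /= 2;
    }
}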

@sebbbi (Author) commented Oct 4, 2018


Fixed both bugs. See note below

@mankeli commented Nov 25, 2018

There's probably a lot about compute shaders I don't know, but I wonder why this can't be done with just:

#version 450

layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;

layout(rgba8) uniform image2D u_texmip[12];
uniform uint u_mipcount;

void main()
{
	uvec2 pos = gl_GlobalInvocationID.xy;
	ivec2 pos1 = ivec2(pos);
	ivec2 pos2 = ivec2(pos)*2;

	uvec2 siz = (gl_NumWorkGroups.xy*gl_WorkGroupSize.xy) >> 1;
	for (uint i = 0; i < u_mipcount-1; i++)
	{
		vec4 c1 = imageLoad(u_texmip[i], pos2+ivec2(0,0));
		vec4 c2 = imageLoad(u_texmip[i], pos2+ivec2(1,0));
		vec4 c3 = imageLoad(u_texmip[i], pos2+ivec2(0,1));
		vec4 c4 = imageLoad(u_texmip[i], pos2+ivec2(1,1));

		vec4 cc = (c1+c2+c3+c4)*0.25;
		imageStore(u_texmip[i+1], pos1,cc);

		if (any(greaterThan(pos, siz)))
			return;
		siz >>= 1;

		memoryBarrierImage();
	}
}

Removing that return makes it go from 6 ms to 23 ms, so something happens to the thread groups. Why is the LDS necessary?

@mankeli commented Nov 27, 2018

Ah, of course it's necessary because there's no other way to know when a 2x2 tile of groups has finished. But if those groups finish in random order, how does that Tid.xy /= 2; produce correct coordinates for the next iteration?

@emoon commented Feb 13, 2019

Any updates on getting a fixed version of this? :)

@kingofthebongo2008

Hey, we have tried this version of the gist, and it is definitely slower on a Radeon RX 580 and an Nvidia 2060 than the version in the MiniEngine for DirectX.

The DirectX MiniEngine version uses LDS for now; if it used wave intrinsics, it would be faster.

Can you comment?

On 4096x4096 on the RX 580:

Gist: 1806240 ns
MiniEngine: 875680 ns

Are we doing something wrong?

Measured with PIX 1908.02.
