@sebbbi
Last active January 12, 2024 07:16
Single pass globallycoherent mip pyramid generation
// NOTE: Must bind 8x single mip RWTexture views, because HLSL doesn't have .mips member for RWTexture2D. (SRVs only have .mips member)
// NOTE: globallycoherent attribute is needed. Without it writes aren't guaranteed to be seen by other groups
globallycoherent RWTexture2D<float> MipTextures[8];
RWTexture2D<uint> Counters[8];
groupshared uint CounterReturnLDS;
[numthreads(16, 16, 1)]
void GenerateMipPyramid(uint3 Tid : SV_DispatchThreadID, uint3 Group : SV_GroupID, uint Gix : SV_GroupIndex)
{
    [unroll]
    for (int Mip = 0; Mip < 8-1; ++Mip)
    {
        // 2x2 downsample
        float Sum =
            MipTextures[Mip][Tid.xy * 2 + uint2(0, 0)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(1, 0)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(0, 1)] +
            MipTextures[Mip][Tid.xy * 2 + uint2(1, 1)];
        MipTextures[Mip+1][Tid.xy] = Sum * 0.25;

        // Four groups in a 2x2 tile of groups increment the same counter.
        // Only the first thread of each group does the increment; the previous counter value is returned through LDS.
        if (Gix == 0)
        {
            InterlockedAdd(Counters[Mip][Group.xy / 2], 1, CounterReturnLDS);
        }

        // We do a full memory barrier here. In the next mip the surviving thread group will read data generated by 3 other thread groups. Data needs to be visible.
        AllMemoryBarrierWithGroupSync();

        // Kill all groups except the last one to finish in the 2x2 tile. This branch is allowed because CounterReturnLDS is group invariant.
        if (CounterReturnLDS < 3)
        {
            return;
        }

        // Needed to ensure that all threads in the group read CounterReturnLDS before it is modified in the next loop iteration
        GroupMemoryBarrierWithGroupSync();

        Tid.xy /= 2;
        Group.xy /= 2;
    }
}
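The "last group to finish survives" step can be modeled on the CPU. The sketch below is not part of the gist: `SurvivingGroups` is a hypothetical helper, the groups run serially in a shuffled order instead of concurrently, and a plain post-increment stands in for `InterlockedAdd`. It only illustrates that exactly one group per 2x2 tile sees a previous counter value of 3, whatever the arrival order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Hypothetical CPU model of one mip step of the counter election.
// groupsX/groupsY: number of thread groups alive at this step (assumed even).
// Returns the halved coordinates (like Group.xy /= 2) of the surviving groups.
std::vector<std::pair<uint32_t, uint32_t>> SurvivingGroups(
    uint32_t groupsX, uint32_t groupsY, uint32_t seed)
{
    assert(groupsX % 2 == 0 && groupsY % 2 == 0);
    const uint32_t tilesX = groupsX / 2;
    const uint32_t tilesY = groupsY / 2;

    // One counter per 2x2 tile of groups, assumed cleared to zero beforehand.
    std::vector<uint32_t> counters(tilesX * tilesY, 0u);

    // Groups finish their downsample in an arbitrary order; model that with a shuffle.
    std::vector<std::pair<uint32_t, uint32_t>> order;
    for (uint32_t y = 0; y < groupsY; ++y)
        for (uint32_t x = 0; x < groupsX; ++x)
            order.emplace_back(x, y);
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);

    std::vector<std::pair<uint32_t, uint32_t>> survivors;
    for (const auto& group : order)
    {
        const uint32_t tileX = group.first / 2;
        const uint32_t tileY = group.second / 2;

        // Stand-in for InterlockedAdd: read the previous value, then add 1.
        const uint32_t previous = counters[tileY * tilesX + tileX]++;

        // Same test as `if (CounterReturnLDS < 3) return;` in the shader.
        if (previous < 3)
            continue;

        survivors.emplace_back(tileX, tileY);
    }
    return survivors;
}
```

Calling `SurvivingGroups(8, 8, seed)` with any seed yields 16 survivors, one per tile. Feeding the halved grid back in gives 4 and then 1, which is how the number of live groups shrinks from mip to mip.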
@kingofthebongo2008

Hey, we tried this version of the gist and it is definitely slower on a
Radeon RX 580 and an NVIDIA 2060 than the version in MiniEngine for DirectX.

The MiniEngine version uses LDS for now; if it used wave intrinsics it would be faster.

Can you comment?

On a 4096x4096 texture on the RX 580:

Gist: 1806240 ns
MiniEngine: 875680 ns

Are we doing something wrong?

Measured with PIX 1908.02.
