@JuanDiegoMontoya
Last active June 23, 2024 10:57
How to do bindless in OpenGL without blowing your legs off.

The Definitive Guide to Non-Uniform Resource Indexing in OpenGL

Code

For those short on time:

#extension GL_NV_gpu_shader5 : enable
#extension GL_EXT_nonuniform_qualifier : enable

#ifdef GL_EXT_nonuniform_qualifier
#define NonUniformIndex(x) nonuniformEXT(x)
#else
#define NonUniformIndex(x) (x)
#endif
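
As a rough sketch of how the macro might be used, the fragment shader below indexes a regular sampler array with an index that can differ between fragments. The names u_textures, v_uv, v_textureIndex, and o_color, the array size of 16, and the #version line are assumptions made for illustration and are not part of the snippet above.

```glsl
#version 460 core
#extension GL_NV_gpu_shader5 : enable
#extension GL_EXT_nonuniform_qualifier : enable

#ifdef GL_EXT_nonuniform_qualifier
#define NonUniformIndex(x) nonuniformEXT(x)
#else
#define NonUniformIndex(x) (x)
#endif

// Hypothetical resource and varying names; the array size is arbitrary.
layout(binding = 0) uniform sampler2D u_textures[16];

layout(location = 0) in vec2 v_uv;
layout(location = 1) flat in uint v_textureIndex; // may differ per instance or per fragment

layout(location = 0) out vec4 o_color;

void main()
{
  // Where the qualifier exists, mark the index as non-uniform.
  // Elsewhere the macro expands to the bare index.
  o_color = texture(u_textures[NonUniformIndex(v_textureIndex)], v_uv);
}
```

Because both extensions are requested with enable rather than require, a driver that lacks one of them should only warn instead of failing to compile.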

The Problem

Resources can only be legally accessed in a dynamically uniform fashion in GLSL. In other words, when a sampling instruction (or image/buffer access) is executed, all invocations in the invocation group (dispatch, draw, or sub-draw in a multidraw) must access the same resource or the results are undefined. In the GLSL specification, the relevant sections are 3.8.2. Dynamically Uniform Expressions and Uniform Control Flow and 4.1.7. Opaque Types.

The motivation for having this restriction in GLSL is that some GPUs simply cannot access a different resource per lane in a single instruction (e.g. descriptors may be placed in scalar registers, which are shared by all lanes rather than unique to each one). If we investigate the fields used for the IMAGE_SAMPLE instruction on RDNA 3, we see this:

SSAMP: SGPR to supply S# (sampler constant) in 4 consecutive SGPRs. ...

SRSRC: SGPR to supply T# (resource constant) in 8 consecutive SGPRs. ...

Definitions:

SGPR: Scalar General Purpose Registers. 32-bit registers that are shared by work-items in each wave.

Wave: A collection of 32 or 64 work-items that execute in parallel on a single RDNA3 processor.

Work-item: A single element of work: one element from the dispatch grid, or in graphics a pixel, vertex or primitive.

Why do we care? Often, to minimize the number of draw calls issued, we reach for ARB_bindless_texture. This extension allows us to access samplers and images from a shader without needing to explicitly bind them (and thus issue multiple draw calls). We may then wish to associate texture indices with individual mesh instances and draw them all with one call. Of course, those indices will not necessarily be dynamically uniform values. We want to leverage the awesome ergonomics that bindless textures give us!

Note that this issue applies to regular resource arrays too (e.g. uniform sampler2D textures[N];), not just bindless textures and images. It's just that use cases requiring bindless textures often require non-uniform resource indexing as well.

The Solution

We need to inform the compiler of our intentions to allow it to generate the correct code.

AMD

As of writing, EXT_nonuniform_qualifier is supported on modern AMD drivers: https://opengl.gpuinfo.org/listreports.php?extension=GL_EXT_nonuniform_qualifier. Using this extension is shrimple. Enable it in your shader:

#extension GL_EXT_nonuniform_qualifier : enable

then apply nonuniformEXT to the expression that is used to index an array of resources, or to the indexed resource itself:

vec4 col = texture(mySampler2Ds[nonuniformEXT(i)], uv);

Note that in Vulkan GLSL, you can construct a samplerND from separate samplers and textures. In this case, the nonuniform qualifier must be put on the samplerND and not the individual array indices:

vec4 col = texture(nonuniformEXT(sampler2D(myTexture2Ds[texIdx], mySamplers[samplerIdx])), uv);

Nvidia

Nvidia does not support EXT_nonuniform_qualifier in OpenGL (despite having it in Vulkan). However, they do have NV_gpu_shader5, which lifts the requirement that sampler array indices be uniform across invocations. Simply enable that and you're good to go!

#extension GL_NV_gpu_shader5 : enable

Other platforms (Intel, older AMD)

For these platforms, a handwritten waterfall loop is needed. This is essentially what the driver will do for us if we use nonuniformEXT. This only works if your driver supports ARB_shader_ballot (which must be enabled first):

vec4 col;
for (;;) {
  // Pick the index held by the first active invocation in the subgroup
  uint currentIdx = readFirstInvocationARB(i);
  // Only invocations that share that index sample now, so the index is uniform for every execution of texture()
  if (currentIdx == i) {
    // Note that because the control flow leading to this texture() call is not uniform, the implicit derivatives are technically undefined.
    // In practice, it will work. If you're still worried, use textureGrad instead.
    col = texture(mySampler2Ds[currentIdx], uv);
    break;
  }
}
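
If the undefined implicit derivatives are a concern, one possible variation is to take the derivatives before entering the loop and pass them to textureGrad. The sketch below wraps the loop in a hypothetical helper, SampleNonUniform. The resource and varying names, the array size, and the #version line are assumptions, and the helper is assumed to be called from uniform control flow in a fragment shader.

```glsl
#version 460 core
#extension GL_ARB_shader_ballot : require

// Hypothetical resource and varying names; the array size is arbitrary.
layout(binding = 0) uniform sampler2D u_textures[16];

layout(location = 0) in vec2 v_uv;
layout(location = 1) flat in uint v_textureIndex;

layout(location = 0) out vec4 o_color;

// Hypothetical helper: waterfall loop with explicit gradients.
vec4 SampleNonUniform(uint index, vec2 uv)
{
  // Take the derivatives here, while control flow is still uniform.
  vec2 dx = dFdx(uv);
  vec2 dy = dFdy(uv);

  vec4 color = vec4(0.0);
  for (;;)
  {
    // Index held by the first active invocation in the subgroup.
    uint currentIndex = readFirstInvocationARB(index);
    if (currentIndex == index)
    {
      // Explicit gradients, so no implicit derivatives are needed here.
      color = textureGrad(u_textures[currentIndex], uv, dx, dy);
      break;
    }
  }
  return color;
}

void main()
{
  o_color = SampleNonUniform(v_textureIndex, v_uv);
}
```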

Note: I don't know what's actually needed for Intel hardware since I have never worked with it. If you know, please comment.

FAQ

Do I need this if I'm non-uniformly selecting a layer from an array texture (e.g. sampler2DArray)?

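No. The layer is just part of the texture coordinate, not a resource index, so it may vary freely between invocations.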
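
A minimal sketch of that case, with assumed names (u_textureArray, v_uv, v_layer, o_color): only one resource is accessed, and the per-fragment layer is passed as the third texture coordinate.

```glsl
#version 460 core

// Hypothetical names. A single array texture is bound, so every invocation
// accesses the same resource.
layout(binding = 0) uniform sampler2DArray u_textureArray;

layout(location = 0) in vec2 v_uv;
layout(location = 1) flat in uint v_layer; // may differ per instance or per fragment

layout(location = 0) out vec4 o_color;

void main()
{
  // The layer is the third component of the texture coordinate.
  o_color = texture(u_textureArray, vec3(v_uv, float(v_layer)));
}
```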

Will my implicit derivatives be preserved if I do that ARB_shader_ballot thing?

No. Yes. Maybe. Technically the derivatives will be undefined, but an implementation probably won't trash those registers for the invocations that didn't follow.

The heck are implicit derivatives?

https://www.khronos.org/opengl/wiki/Sampler_(GLSL)#Texture_lookup_in_shader_stages

http://www.aclockworkberry.com/shader-derivative-functions/
