GPU Driven Renderer design for Godot 4.x

Goals:

The main goal is to implement a GPU driven renderer for Godot 4.x: a renderer that runs entirely on the GPU (no CPU dispatches during the opaque pass).

Additionally, this is a renderer that relies exclusively on raytracing (plus a base raster pass aided by raytracing).

It is important to note that we don't want to implement a GPU driven renderer similar to those of AAA engines such as Unreal. We want to implement it in a way that retains full and complete flexibility in the rendering pipeline, and in a way that is simple and easy to maintain.

Overview

Roughly, a GPU driven render pass would work like this:

  1. Frustum/Occlusion cull depth: This would be done using raytracing on a small screen buffer, casting a small number of rays to generate a depth buffer.
  2. Occlusion cull lists: All objects will be culled against the small depth buffer. Objects that pass are placed in a special list per material.
  3. Opaque render: Objects are rendered in multiple passes to a G-Buffer (deferred) by executing every shader in the scene together with their specific indirect draw list. This follows the logic of the Rendering Compositor, providing the same flexibility for different kind of effects, stencil, custom buffers, etc.
  4. Light cull: Lights are culled against the depth buffer to determine which lights are rendered per pixel and which need shadows.
  5. Shadow tracing: Shadows are traced using raytracing to the respective pixels on screen.
  6. GI Pass: Reflections and GI are processed also using raytracing. GI of objects off-screen is done with material textures (material rendered to a low res texture).
  7. Decal Pass: Decals are rendered into the G-Buffer.
  8. Volumetric Fog: Volumetric fog is processed like in the current renderer, except that instead of tapping shadow maps, raytracing is used.
  9. Light Pass: Finally, a pass adding lights is applied, reading the proper shadows.
  10. Subsurface Scatter Pass: A pass to post-process subsurface scatter must be done after the light pass.
  11. Alpha Pass: Transparency pass is done at the end. This is done using regular Z-Sorted draw calls, CPU driven.

FAQ

Q: Why do we use smaller resolution raytracing for occlusion culling and not visibility lists?
A: Visibility lists take away flexibility from the opaque passes. They require rendering objects in specific order, while opaque passes do not.

Q: Do we not have the small occluder problem when using raytraced occlusion?
A: Yes, but in practice this does not really matter; well over 99% of scenes work fine.

Q: Why use only raytraced shadows? Is it not better to also support shadow mapping?
A: We need to evaluate this depending on performance, but worst case a hybrid technique can be explored.

In Detail

Frustum/Occlusion cull depth

The first pass discards objects based on visibility. Frustum culling will discard objects not visible to the camera. A depth buffer will be created using raytracing and used for base occlusion; objects will be tested against it and discarded as well.

Keep in mind that a relatively large depth buffer can be used thanks to raytracing, because the depth buffer can be reprojected from the previous frame and rays only need to be re-cast in the places with missing information.
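As an illustration, the per-object occlusion test could be a compute shader along these lines. This is a sketch only: the buffer names, the max-reduced mip chain on the small depth buffer, and a conventional 0-to-1 depth range are all assumptions, not an existing Godot API.

#version 450

layout(local_size_x = 64) in;

struct ObjectData {
	vec4 aabb_min; // world-space AABB min (xyz)
	vec4 aabb_max; // world-space AABB max (xyz)
};

layout(std430, set = 0, binding = 0) restrict readonly buffer Objects { ObjectData objects[]; };
layout(std430, set = 0, binding = 1) restrict writeonly buffer Visibility { uint visible[]; };
// Small raytraced depth buffer, assumed to have a max-reduced mip chain.
layout(set = 0, binding = 2) uniform sampler2D small_depth;

layout(push_constant) uniform Params {
	mat4 view_projection;
	uint object_count;
} params;

void main() {
	uint idx = gl_GlobalInvocationID.x;
	if (idx >= params.object_count) {
		return;
	}

	// Project the 8 AABB corners; track the screen rect and the nearest depth.
	vec2 rect_min = vec2(1.0);
	vec2 rect_max = vec2(-1.0);
	float nearest = 1.0;
	for (uint i = 0; i < 8; i++) {
		vec3 corner = mix(objects[idx].aabb_min.xyz, objects[idx].aabb_max.xyz, vec3(uvec3(i, i >> 1, i >> 2) & uvec3(1)));
		vec4 clip = params.view_projection * vec4(corner, 1.0);
		if (clip.w <= 0.0) {
			visible[idx] = 1; // Crosses the near plane; keep it.
			return;
		}
		vec3 ndc = clip.xyz / clip.w;
		rect_min = min(rect_min, ndc.xy);
		rect_max = max(rect_max, ndc.xy);
		nearest = min(nearest, ndc.z);
	}

	// Frustum cull: the rect is entirely outside the NDC range.
	if (any(greaterThan(rect_min, vec2(1.0))) || any(lessThan(rect_max, vec2(-1.0)))) {
		visible[idx] = 0;
		return;
	}

	// Occlusion cull: pick a mip where the rect spans about one texel, take the
	// farthest of the 4 corner samples, and keep the object if it can be nearer.
	vec2 uv_min = rect_min * 0.5 + 0.5;
	vec2 uv_max = rect_max * 0.5 + 0.5;
	vec2 texels = (uv_max - uv_min) * vec2(textureSize(small_depth, 0));
	float level = ceil(log2(max(max(texels.x, texels.y), 1.0)));
	float farthest = max(
			max(textureLod(small_depth, uv_min, level).r, textureLod(small_depth, uv_max, level).r),
			max(textureLod(small_depth, vec2(uv_min.x, uv_max.y), level).r, textureLod(small_depth, vec2(uv_max.x, uv_min.y), level).r));
	visible[idx] = nearest <= farthest ? 1 : 0;
}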

Occlusion cull lists

The list of objects that pass must be sorted by shader type, in an indirect draw list fashion. Materials for all shaders of a given type will be found in an array, and textures will be indices into a large texture array instead of actual textures.

This can be achieved relatively easily using a compute shader that first counts all the objects of each shader type, then assigns their offsets in a large array, then creates the indirect draw lists for each shader.
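A sketch of how those binning passes could look (all buffer and variable names here are illustrative, not an existing API); the prefix-sum pass and the scatter pass are outlined in comments:

// --- Pass 1: count visible objects per shader type ---
#version 450

layout(local_size_x = 64) in;

layout(std430, set = 0, binding = 0) restrict readonly buffer VisibleObjects { uint visible_object[]; };
layout(std430, set = 0, binding = 1) restrict readonly buffer ObjectShader { uint object_shader[]; };
layout(std430, set = 0, binding = 2) restrict buffer ShaderCounts { uint shader_count[]; };

layout(push_constant) uniform Params {
	uint visible_count;
} params;

void main() {
	uint i = gl_GlobalInvocationID.x;
	if (i >= params.visible_count) {
		return;
	}
	atomicAdd(shader_count[object_shader[visible_object[i]]], 1u);
}

// --- Pass 2 (not shown): a small prefix sum over shader_count[] produces
// shader_offset[], the first slot of each shader's range in the large array,
// and clears shader_count[] so it can be reused as a write cursor.

// --- Pass 3 (outline): scatter objects and build the indirect commands ---
// uint obj = visible_object[i];
// uint slot = shader_offset[object_shader[obj]] + atomicAdd(shader_count[object_shader[obj]], 1u);
// sorted_object[slot] = obj;
// After this pass, each shader's indirect draw list covers the range
// [shader_offset, shader_offset + shader_count) of sorted_object[].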

My thinking here is that, as we will eventually have mesh streaming and streamed meshes will by default be separated into meshlets anyway (for the purpose of streaming), those could be rendered with a special shader that culls them in more detail (maybe a mesh shader), while regular (non-streamed) objects go via the regular vertex shader path.

Opaque render

Opaque rendering happens by executing all shaders with indirect rendering. Because shaders are assigned to subpasses in the compositor proposal, it is easy to have a system where the compositor still works despite the GPU driven nature.

Additionally, depending on what a material renders (visibility mask, emission, custom lighting, etc.), we can take advantage of this and render in multiple render passes to different G-Buffer configurations.

Bindless implementation

The bindless implementation should be relatively simple to do. Textures can go in a simple:

uniform texture2D textures[MAX_TEXTURES];

For vertex arrays, vertex pulling can be implemented using a utextureBuffer per vertex buffer:

uniform utextureBuffer vertex_buffers[MAX_VERTEX_BUFFERS];

The vertex format is then decoded on demand. Vertex pulling of a custom format would probably not be super efficient, but the following needs to be taken into consideration:

  • Most meshes will be compressed (meaning they use only one format, hence vertex pulling will be very efficient).
  • In a larger game, most static meshes would most likely also be streamed, and the format will be fixed, so the vertex pulling code for most of the vertex buffers should be very efficient too.
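Tying the above together, a vertex shader pulling a simple fixed format (position as three raw floats plus an octahedral normal packed in one uint) could look like the sketch below; the format, the names, and MAX_VERTEX_BUFFERS are assumptions for illustration only.

#version 450
#extension GL_EXT_samplerless_texture_functions : require

#define MAX_VERTEX_BUFFERS 4096

layout(set = 1, binding = 0) uniform utextureBuffer vertex_buffers[MAX_VERTEX_BUFFERS];

layout(push_constant) uniform DrawData {
	uint vertex_buffer_index; // which buffer this draw pulls from
	uint vertex_stride_words; // vertex stride, in 32-bit words
	mat4 mvp;
} draw;

layout(location = 0) out vec3 out_normal;

vec3 octahedron_decode(vec2 f) {
	vec3 n = vec3(f.x, f.y, 1.0 - abs(f.x) - abs(f.y));
	float t = max(-n.z, 0.0);
	n.xy += mix(vec2(t), vec2(-t), greaterThanEqual(n.xy, vec2(0.0)));
	return normalize(n);
}

void main() {
	int base = int(uint(gl_VertexIndex) * draw.vertex_stride_words);
	// Decode the position: three raw 32-bit floats.
	vec3 position = vec3(
			uintBitsToFloat(texelFetch(vertex_buffers[draw.vertex_buffer_index], base + 0).r),
			uintBitsToFloat(texelFetch(vertex_buffers[draw.vertex_buffer_index], base + 1).r),
			uintBitsToFloat(texelFetch(vertex_buffers[draw.vertex_buffer_index], base + 2).r));
	// Decode the normal: an snorm16 octahedral pair packed in one word.
	vec3 normal = octahedron_decode(unpackSnorm2x16(texelFetch(vertex_buffers[draw.vertex_buffer_index], base + 3).r));

	out_normal = normal;
	gl_Position = draw.mvp * vec4(position, 1.0);
}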

Light cull

With the depth buffer completed, it is possible to do light culling and assignment of visible lights. This can be done using the current clustering code. Alternatively, a unified structure like that used for raytracing can be used (possibly a hash grid?).

Shadow Tracing

Shadows can be traced in this pass. It is possible that not all positional lights with shadows need to be processed every frame, as temporal supersampling can help improve performance here.
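A minimal sketch of such a shadow ray using GL_EXT_ray_query (the acceleration structure binding and the bias value are assumptions):

#version 460
#extension GL_EXT_ray_query : require

layout(set = 0, binding = 0) uniform accelerationStructureEXT scene_tlas;

// Returns 1.0 if the point is lit by the light, 0.0 if it is shadowed.
float trace_shadow(vec3 world_pos, vec3 world_normal, vec3 light_dir, float light_dist) {
	rayQueryEXT rq;
	// Opaque-only and terminate-on-first-hit: all we need is "blocked or not".
	rayQueryInitializeEXT(rq, scene_tlas,
			gl_RayFlagsOpaqueEXT | gl_RayFlagsTerminateOnFirstHitEXT, 0xFF,
			world_pos + world_normal * 0.01 /* bias against self-intersection */,
			0.0, light_dir, light_dist);
	while (rayQueryProceedEXT(rq)) {
	}
	return rayQueryGetIntersectionTypeEXT(rq, true) == gl_RayQueryCommittedIntersectionNoneEXT ? 1.0 : 0.0;
}

With temporal supersampling, only a subset of the shadowed lights would call this each frame, accumulating the result over time.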

GI Pass

We need to choose a GI technique, or offer the user different GI techniques based on their performance/quality ratio, ranging from GI 1.0 to full path tracing.

For materials, we should probably just render most materials to a small texture (128x128) and use this information for GI bounces.

Decal Pass

As we are using a G-Buffer, a decal pass is probably a lot more optimal done by just rasterizing the decals into it.

Volumetric Fog

Volumetric fog should work identically to what we have now, except instead of using shadow mapping, we can just raytrace from the points to test occlusion.

Light Pass

The light pass should be almost the same as in our (future) deferred renderer.

Subsurface Scatter Pass

This should work similarly to how it works now.

Alpha Pass with OIT

Because we don't have shadow mapping, the alpha pass needs to happen a bit differently. The idea here is to use light pre-pass style rendering on the alpha side.

Basically, a 64-bit G-Buffer is used for alpha that looks like this:

uint obj_index : 18;
uint metallic : 7;
uint roughness : 7;
rg16 obj_normal; // encoded as octahedral

Added to this, an "increment_texture" image texture in uint format, at half resolution.
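As a hedged sketch, packing helpers for this layout could look as follows; the bit order follows the pass code below (object index in bits 0-17, roughness in 18-24, metallic in 25-31), and the helper names are assumptions:

const uint OBJ_ID_MASK = (1u << 18) - 1u;

uint pack_obj_rough_metal(uint obj_index, float roughness, float metallic) {
	uint packed_v = obj_index & OBJ_ID_MASK;
	packed_v |= clamp(uint(roughness * 127.0), 0u, 127u) << 18;
	packed_v |= clamp(uint(metallic * 127.0), 0u, 127u) << 25;
	return packed_v;
}

void unpack_obj_rough_metal(uint packed_v, out uint obj_index, out float roughness, out float metallic) {
	obj_index = packed_v & OBJ_ID_MASK;
	roughness = float((packed_v >> 18) & 0x7Fu) / 127.0;
	metallic = float((packed_v >> 25) & 0x7Fu) / 127.0;
}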

Alpha is done in two passes. The first pass renders objects that are lit, sorted from back to front. Unshaded objects are skipped.

The following code is run in the shader:

// This piece of code ensures that G-Buffer writes are rotated across a block of 2x2 pixels.

uvec2 pixel = uvec2(gl_FragCoord.xy);
uvec2 group = pixel >> uvec2(1); // The group is the block of 2x2 pixels this fragment belongs to.
uvec2 store_coord;
vec2 combined_roughness_metallic;
vec3 combined_normal;
bool store = false;
while (true) {
      // Of all active invocations in the subgroup, find the first one's ID.
      uvec2 first = subgroupBroadcastFirst(pixel);
      // Get the group (block of 2x2 pixels) of the first.
      uvec2 first_group = first >> uvec2(1);
      if (first == pixel) {
            // If this is the first invocation, increment the atomic counter and get the value.
            // The index is a value from 0 to 3, representing the pixel in the 2x2 block.
            // increment_texture is half resolution, hence indexed by the group coordinate.
            uint index = imageAtomicAdd(increment_texture, ivec2(first_group), 1u) & 0x3;
            store_coord = (first_group << uvec2(1)) + uvec2(index & 1, index >> 1);
      }
      // Broadcast the store coordinate.
      store_coord = subgroupBroadcastFirst(store_coord);

      if (first_group == group) {
             // If this pixel is part of the group being stored, store only the relevant one
             // and discard the rest. This ensures that every write rotates the pixel in the 2x2 block.
             // Combined roughness/metallic and normal of all pixels in the group.
             vec3 crm = subgroupAdd(vec3(roughness, metallic, 1.0));
             combined_roughness_metallic = crm.rg / crm.b;
             combined_normal = normalize(subgroupAdd(normal));
             // Determine if the pixel that needs to be written is actually present (it may not be part of the primitive).
             bool write_exists = bool(subgroupAdd(uint(pixel == store_coord)));

             if (write_exists) {
                store = store_coord == pixel;
             } else {
                store = first == pixel;
             }

             break;
      }
}
// Store the G-Buffer.
// It is important to _not_ use discard in this shader, to ensure early Z works and gets rid of unwanted writes.
if (store) {
   uint store_obj_rough_metallic = object_id;
   store_obj_rough_metallic |= clamp(uint(combined_roughness_metallic.r * 127.0), 0u, 127u) << 18;
   store_obj_rough_metallic |= clamp(uint(combined_roughness_metallic.g * 127.0), 0u, 127u) << 25;

   imageStore(obj_id_metal_roughness_tex, ivec2(store_coord), uvec4(store_obj_rough_metallic));
   imageStore(normal_tex, ivec2(store_coord), vec4(octahedron_encode(combined_normal), 0.0, 0.0));
}

After this, a compute pass is run that computes the lighting of all transparent objects (obj_index == 0 means nothing to do). Light is written to an rgba16f G-buffer. To accelerate the lookups in the next pass, the compute shader also writes, for every pixel, a u32 containing the following neighbouring info:

A table of nine 3-bit values, one per neighbouring 2x2 pixel group (bit ranges shown per group, columns at x - 2, x, x + 2 and rows at y - 2, y, y + 2):

        x - 2     x         x + 2
y - 2   00 - 02   03 - 05   06 - 08
y       09 - 11   12 - 14   15 - 17
y + 2   18 - 20   21 - 23   24 - 26

Each 3-bit value represents:

0x7: no neighbour

else, the position of the matching pixel within that 2x2 block:

        x   x + 1
y       0   1
y + 1   2   3
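For illustration, the lighting compute pass could build this table roughly as follows. This is a sketch matching the lookup code further below; obj_id_metal_roughness_tex is assumed bound as a usampler2D here, and OBJ_ID_MASK comes from the packing helpers above.

// For each of the 9 surrounding 2x2 groups, record which of its 4 pixels (0..3)
// holds lighting for the same object, or 0x7 if none does.
uint build_neighbour_table(uvec2 pixel, uint obj_id) {
	uint table = 0u;
	ivec2 base = (ivec2(pixel) & ~ivec2(1)) - ivec2(1); // same origin as the lookup below
	for (int cell = 0; cell < 9; cell++) {
		uint found = 0x7u;
		ivec2 cell_origin = base + ivec2((cell % 3) * 2, (cell / 3) * 2);
		for (int p = 0; p < 4; p++) {
			ivec2 coord = cell_origin + ivec2(p & 1, p >> 1);
			uint other = texelFetch(obj_id_metal_roughness_tex, coord, 0).x;
			if ((other & OBJ_ID_MASK) == obj_id) {
				found = uint(p);
				break;
			}
		}
		table |= found << (cell * 3);
	}
	return table;
}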

Finally, a second alpha pass is run, again from back to front. For shaded objects, lighting information is searched for across the surrounding 36 pixels, looking for objects that match; it is then interpolated and multiplied by the albedo.

The algorithm would look somewhat like this:

uvec2 base_lookup = uvec2(gl_FragCoord.xy) & ~uvec2(1, 1);

uvec2 light_pos = uvec2(0xFFFF, 0xFFFF);

for (uint i = 0; i < 4; i++) {
   uvec2 lookup_pos = base_lookup + uvec2(i & 1, (i >> 1) & 1);
   uint obj_id = texelFetch(obj_id_metal_roughness_tex, ivec2(lookup_pos), 0).x;
   if ((obj_id & OBJ_ID_MASK) == current_obj_id) {
      light_pos = lookup_pos;
      break;
   }
}

if (light_pos == uvec2(0xFFFF, 0xFFFF)) {
    discard; // Could not find any info to look up; discard the pixel.
}

uint neighbour_positions = texelFetch(neighbours, ivec2(light_pos), 0).r;

vec4 light_accum = vec4(0.0); // rgb: weighted light sum, a: weight sum.
ivec2 neighbour_base = ivec2(base_lookup) - ivec2(1, 1);
for (int i = 0; i < 9; i++) {
   uint neighbour = (neighbour_positions >> (i * 3)) & 0x7;
   if (neighbour == 0x7) {
      continue; // No neighbour stored for this cell.
   }

   ivec2 neighbour_ofs = neighbour_base;
   neighbour_ofs.x += (i % 3) * 2 + int(neighbour & 1);
   neighbour_ofs.y += (i / 3) * 2 + int(neighbour >> 1);
   float gauss = gauss_map(length(vec2(neighbour_ofs) - gl_FragCoord.xy)); // Use some gauss curve based on distance to the pixel.
   light_accum += vec4(texelFetch(alpha_light, neighbour_ofs, 0).rgb * gauss, gauss);
}

vec3 light = light_accum.rgb / light_accum.a;

light *= albedo;

// Store light with alpha blending.
...
@reduz (Author) commented Oct 25, 2023

@AttackButton @patwork Having a renderer that looks like Unreal's is not a problem. It is not the graphics that define these kinds of engines.

An "AAA engine" pretty much means that your whole content workflow is designed for a team of hundreds of people pushing assets and logic into the game at the same time, or that everything can be tweaked to the millimeter to accommodate a single game. None of these things are of concerns to Godot users (or even Unity users), It's an entirely different territory.

@AttackButton commented Oct 25, 2023

> or that everything can be tweaked to the millimeter to accommodate a single game

I agree with that part, and it's clear that Godot's design exists to avoid something like this. However, Unreal's default renderer is already incredibly powerful, and not only do AAA studios use it (tweaking everything); indie devs are using the engine as well (Blueprint).

Anyway, I don't see a conflict between being "easy to use" and having an AAA renderer. What's the concern, mobile games/apps? Even so, couldn't the user tweak some options in the editor to reduce the "potential" of the renderer?

@reduz (Author) commented Oct 25, 2023

@AttackButton Oh, the goal is to continue working on improving rendering during this year; it's just that currently the effort is more focused on performance, which is what has the highest demand.

@cshenton commented Nov 3, 2023

Could someone please summarise the reasons not to use the standard two-pass occlusion culling technique?

It's very straightforward to implement, can be dispatched at any granularity (per object, per meshlet, etc.), works with dynamic (even vertex-animated, with tolerances) objects, changing LODs, and different render architectures, both compute and hardware raster, and can be used in the shadow passes, etc.

The assertion that dynamic geometry isn't a meaningful occluder assumes a very specific approach to game development (a big static level with a small number of dynamic characters) that I don't think a general purpose engine should make. Much of the work you do for two-pass is work you'd do anyway for a Z-prepass, and it provides significant speed boosts on anything from more optimised game assets to film-quality geometry.

@Saul2022 commented

I know the idea is already set, and that there are priorities, but it could be good to have more compatibility with older devices in the future, like this recent implementation from Unity: https://forum.unity.com/threads/gpu-driven-rendering-in-unity.1502702/ https://t.co/srz2yNciNt

@MaxLykoS commented

In a GPU driven pipeline, how do you record draw calls on the CPU side? Do you cull them on the CPU (frustum/occlusion, etc.) to reduce empty draw calls, or simply dispatch all the indirect draw calls ignoring visibility?

@ODtian commented Apr 29, 2024

@devshgraphicsprogramming I'm new to this: is two-step culling using a visibility buffer feasible with forward shading and the current opaque rendering? Can this be combined with a software rasterizer like Nanite's to support pixel-level triangles (would this mean we must shade in compute)? It would be great if you have any reference docs or repos. Thank you!

@devshgraphicsprogramming commented

> @devshgraphicsprogramming I'm new to this: is two-step culling using a visibility buffer feasible with forward shading and the current opaque rendering? Can this be combined with a software rasterizer like Nanite's to support pixel-level triangles (would this mean we must shade in compute)? It would be great if you have any reference docs or repos. Thank you!

As long as you have the DrawID handy in some pass and there's little to no overdraw, i.e. when doing Forward+, this will work.

A visbuffer is not necessary, but what is necessary is each pixel casting a ballot about visible objects.

You can have false positives (i.e. overdraw), but that will make it more costly to cast and less efficient at culling.

@ODtian commented Apr 29, 2024

> As long as you have the DrawID handy in some pass and there's little to no overdraw, i.e. when doing Forward+, this will work.
>
> A visbuffer is not necessary, but what is necessary is each pixel casting a ballot about visible objects.
>
> You can have false positives (i.e. overdraw), but that will make it more costly to cast and less efficient at culling.

So in forward rendering, it's basically like this: cull geometry using the last frame's buffers (in this case just instance ID and depth) and generate depth for this frame, use it as a depth pre-pass and do forward shading, and lastly generate the instance IDs for the next frame. Is that right?

@devshgraphicsprogramming commented

For Forward+:

  1. draw what was visible last frame in your z-prepass
  2. cull what wasn't visible last frame against your partial z-prepass
  3. draw whatever passed (2) into your z-prepass, now it's complete
  4. do forward shading & ballot/record what was truly visible this frame
  5. [optional] draw transparent things

Transparent things you always treat as in (2), because you don't want to murder your fillrate with per-pixel ballots of transparent pixels.

@ODtian commented Apr 29, 2024

@devshgraphicsprogramming I get it. Thanks!
