superdump/bevy_render_v2.5.md

## bevy_render_v2.5.md

      
Design Goals


Avoid lookups in hot paths, i.e. per-entity work

'Random' lookups cause cache misses, whereas linear scans are much more cache-coherent and so often perform much better


How things work on main at a5a457c3c82cc2e501bf6e85a2e2d07b5b7838a9


extract_meshes:

iterates an ECS query (good)
filters out non-visible entities based on the ViewVisibility component (good)
reads:

Entity (8 bytes)
ViewVisibility (1 byte)
Handle<Mesh> (24 bytes)
GlobalTransform (64 bytes)
Option<PreviousGlobalTransform> (80 bytes)
three zero-sized types for whether the entity should do automatic batching, cast shadows, receive shadows


writes:

MeshTransforms (100 bytes including previous and current transforms, and mesh flags)
AssetId<Mesh> (20 bytes)
Shadow caster bool (1 byte)
MaterialBindGroupId (4 bytes)
Automatic batching bool (1 byte)


This is mostly quite good, but the PreviousGlobalTransform could instead be double-buffered in the render world, saving 80 bytes being read and 48 bytes being written. The Handle<Mesh> read and AssetId<Mesh> write could instead be a small u32 + u32 generational index, making it an 8-byte read and write.
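To make the two suggestions above concrete, here is a minimal sketch in plain Rust using only the standard library. `MeshId` and `DoubleBufferedTransforms` are hypothetical names, not Bevy types. The sketch assumes a 3x4 affine transform stored as 12 `f32` values (48 bytes) and assumes instance indices stay stable from one frame to the next, which a real implementation would have to guarantee or work around.

```rust
use std::mem::size_of;

/// Hypothetical compact asset id: slot index plus generation counter.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct MeshId {
    pub index: u32,
    pub generation: u32,
}

// Compile-time check that the id really is an 8-byte read and write.
const _: () = assert!(size_of::<MeshId>() == 8);

/// Assumed transform layout: 3x4 affine matrix, 12 floats, 48 bytes.
pub type Transform3x4 = [f32; 12];

/// Hypothetical render-world storage that keeps last frame's transforms,
/// so only the current transform has to be extracted each frame.
pub struct DoubleBufferedTransforms {
    buffers: [Vec<Transform3x4>; 2],
    current: usize,
}

impl DoubleBufferedTransforms {
    pub fn new() -> Self {
        Self { buffers: [Vec::new(), Vec::new()], current: 0 }
    }

    /// Start a new frame: the old "current" buffer becomes "previous",
    /// and the old "previous" buffer is cleared for reuse.
    pub fn begin_frame(&mut self) {
        self.current = 1 - self.current;
        self.buffers[self.current].clear();
    }

    /// Append this frame's transform and return its instance index.
    pub fn push(&mut self, transform: Transform3x4) -> usize {
        let buffer = &mut self.buffers[self.current];
        buffer.push(transform);
        buffer.len() - 1
    }

    pub fn current(&self) -> &[Transform3x4] {
        &self.buffers[self.current]
    }

    pub fn previous(&self) -> &[Transform3x4] {
        &self.buffers[1 - self.current]
    }
}
```

With this shape, extraction writes one 48-byte transform per instance, and the previous transform is whatever was written to the other buffer last frame.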


queue_material_meshes

Iterates the visible entities and looks up:

The Entity itself, which is read rather than looked up (8 bytes)
The AssetId<M> using an EntityHashMap<Entity, AssetId<M>> (20 bytes)
The prepared M using a HashMap<AssetId<M>, M> (depends on the material, but roughly 30 bytes at minimum, likely more)
The RenderMeshInstance using an EntityHashMap<Entity, RenderMeshInstance> (128 bytes)
The GpuMesh using a HashMap<AssetId<Mesh>, GpuMesh> (136 bytes)


This is bad. 4 HashMap lookups to get information in order to be able to:

apply pipeline specialisation to get the pipeline id
queue the entity to the phases:

calculating view z from the view and model transforms
applying a per-material depth bias
using the material's alpha mode to select the correct phase to queue to


then writing an entity, draw function, pipeline id, distance, batch range, and dynamic offset to the phase, totalling 40 bytes


If the AssetId<Mesh> and AssetId<M> were small generational indices and the HashMap<AssetId<Mesh>, GpuMesh> and HashMap<AssetId<M>, M> were instead slotmaps, those lookups would be faster.
Part of the reason for the lookup is to check that the assets have been prepared. If the presence of the small generational indices had an invariant on the assets having been prepared, that would no longer be necessary.
If pipeline specialisation were done async and the pipeline id cached on the entity, then:

The GpuMesh would not need to be looked up at all and the AssetId<Mesh> would not be needed here.


Then, if the material properties AlphaMode (8 bytes) and depth bias (4 bytes; why is depth bias per material?) were also cached on the main world entity, no material lookup would be necessary. That leaves the RenderMeshInstance and
If the entities were queued to phases in the main app, no lookups would be needed at all as one could just iterate the ECS query, and avoid having to extract some of the information at all.
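As a rough illustration of the slotmap idea above, a minimal generational-index store could look like the following. `Key`, `Slot`, and `SlotMap` are hypothetical names; this is neither a Bevy type nor the API of the `slotmap` crate. A lookup is a bounds-checked Vec index plus one generation comparison, with no hashing.

```rust
/// Hypothetical 8-byte key: slot index plus generation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Key {
    index: u32,
    generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

/// Minimal slotmap sketch: dense Vec storage with a free list.
pub struct SlotMap<T> {
    slots: Vec<Slot<T>>,
    free: Vec<u32>,
}

impl<T> SlotMap<T> {
    pub fn new() -> Self {
        Self { slots: Vec::new(), free: Vec::new() }
    }

    /// Insert a value, reusing a freed slot if one is available.
    pub fn insert(&mut self, value: T) -> Key {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.value = Some(value);
            Key { index, generation: slot.generation }
        } else {
            self.slots.push(Slot { generation: 0, value: Some(value) });
            Key { index: (self.slots.len() - 1) as u32, generation: 0 }
        }
    }

    /// Remove a value. Bumping the generation invalidates old keys.
    pub fn remove(&mut self, key: Key) -> Option<T> {
        let slot = self.slots.get_mut(key.index as usize)?;
        if slot.generation != key.generation {
            return None;
        }
        let value = slot.value.take()?;
        slot.generation = slot.generation.wrapping_add(1);
        self.free.push(key.index);
        Some(value)
    }

    /// Look up a value: one Vec index and one generation comparison.
    pub fn get(&self, key: Key) -> Option<&T> {
        let slot = self.slots.get(key.index as usize)?;
        if slot.generation == key.generation {
            slot.value.as_ref()
        } else {
            None
        }
    }
}
```

If keys were only ever handed out for prepared assets, a successful `get` would double as the "has this asset been prepared" check that the HashMap lookups perform today.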


batch_and_prepare_render_phase for 3D

Iterates phase items:

Looks up the query item in the provided query using the phase item entity
Looks up the RenderMeshInstance to get the MeshTransforms (100 bytes), MaterialBindGroupId (4 bytes), and AssetId<Mesh> (20 bytes)


This is bad. 2 lookups, though the query lookup could just be removed at the moment as it is unused in core bevy.
If the MeshTransforms and batch comparison data (MaterialBindGroupId, AssetId<Mesh>) were in a Vec in RenderPhase order, then it could be iterated directly with no lookups.
If the batch comparison data were part of the PhaseItem then it could have been sorted when sorting the RenderPhase.
However, it is possible that sorting this data and moving it around as part of the sorting process could be slower than necessary. If the RenderPhase contained only an Entity and a SortKey, then the sorting might be faster. If that were done in the main app, then if we were to create an Entity -> usize mapping where the usize is the phase item index, we could iterate an ECS query to extract the small ids and dynamic offsets etc directly into a Vec<GetBatchData::CompareData>. This achieves the necessary data being in a Vec in sorted RenderPhase order.
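The last idea above (sort slim phase items, build an Entity -> index mapping, then scatter the batch comparison data into phase order while iterating a query) can be sketched as follows. All names here are stand-ins: `Entity` is a plain `u64`, `CompareData` holds two assumed small ids, and the "query" is any iterator of `(Entity, CompareData)` pairs. Note that the mapping in this sketch is still a `HashMap`, so one lookup per entity remains.

```rust
use std::collections::HashMap;

/// Stand-in for an ECS entity id.
pub type Entity = u64;

/// Slim phase item: only what sorting needs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct PhaseItem {
    pub entity: Entity,
    pub sort_key: u32,
}

/// Assumed batch comparison data made of small ids.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct CompareData {
    pub material_bind_group: u32,
    pub mesh: u32,
}

/// Sort the phase items, then record each entity's index in the phase.
pub fn sort_and_index(items: &mut [PhaseItem]) -> HashMap<Entity, usize> {
    items.sort_unstable_by_key(|item| item.sort_key);
    items
        .iter()
        .enumerate()
        .map(|(index, item)| (item.entity, index))
        .collect()
}

/// Visit entities in arbitrary (query) order and write their data
/// into a Vec laid out in sorted phase order.
pub fn extract_in_phase_order(
    query: impl IntoIterator<Item = (Entity, CompareData)>,
    index_of: &HashMap<Entity, usize>,
) -> Vec<CompareData> {
    let mut out = vec![CompareData::default(); index_of.len()];
    for (entity, data) in query {
        // Entities that were not queued to this phase are skipped.
        if let Some(&index) = index_of.get(&entity) {
            out[index] = data;
        }
    }
    out
}
```

The batching step can then walk the returned Vec linearly and compare neighbouring entries.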


MainOpaquePass3dNode::run() -> RenderPhase::render()

Iterates the phase items and for each batch:

Looks up the Draw from a Vec using the phase item's DrawFunctionId
For each RenderCommand in the Draw (for DrawMaterial<M> this is 5 RenderCommands):

Looks up the SystemParam
Looks up the WorldQuery for the view
Looks up the WorldQuery for the item's entity
Calls the RenderCommand's render()
SetItemPipeline

Looks up the pipeline using the pipeline id, which involves indexing into a Vec - good


SetMeshViewBindGroup<0>

Has the data it needs from the WorldQuery for the view


SetMaterialBindGroup<M, 1>

Looks up the AssetId<M> in an EntityHashMap<Entity, AssetId<M>> using the item's entity
Looks up the prepared M in a HashMap<AssetId<M>, prepared M> using the AssetId<M>
Sets the material bind group


SetMeshBindGroup<2>

Looks up the RenderMeshInstance in an EntityHashMap<Entity, RenderMeshInstance> using the item's entity
Looks up the SkinIndex in an EntityHashMap<Entity, SkinIndex> using the item's entity
Looks up the MorphIndex in an EntityHashMap<Entity, MorphIndex> using the item's entity
Gets the mesh bind group from a MeshBindGroups resource. For non-morphed the bind group is a direct member. For morphed, this causes a lookup in a HashMap<AssetId<Mesh>, BindGroup>


DrawMesh

Looks up the RenderMeshInstance in an EntityHashMap<Entity, RenderMeshInstance> using the item's entity
Looks up the GpuMesh in a HashMap<AssetId<Mesh>, GpuMesh> using the AssetId<Mesh> from the RenderMeshInstance


This is bad. That's a whole lot of lookups - 11 per drawn thing in total, 12 if the mesh is morphed!!! 2 or 3 HashMap<AssetId<T>, U> lookups, 5 EntityHashMap<Entity, T> lookups, 2 WorldQuery lookups, and 1 SystemParam lookup

The pipeline, bind groups, and mesh buffer lookups should all be very close to single lookups by indexing into a Vec each. That would be a total of 5 or 6 Vec lookups that are necessary.


BUT, that's not all! At the moment, all these lookups are done for every phase item, whether or not the result ends up being used. The results are passed to TrackedRenderPass, which tracks which resources have already been bound, and if the resources being requested match that state, the operation is skipped. Otherwise it is passed on to wgpu's RenderPass API. This means that a lot of lookups could have been avoided if they were only done when actually needed!

Aaltonen made a design for a renderer for HypeHype where the 'draw struct' information (all the ids, offsets, counts, etc) is never written, only used to encode a 'draw stream'. The 'draw stream' is a Vec<u32> that contains a sequence of operations like setting a pipeline, bind group, index/vertex buffer, or drawing encoded as a bit field of the operations that follow, followed by the ids for those operations. For example, if the first u32 contains a bit that indicates that the pipeline needs to be set, then the next u32 would be the pipeline id. This compresses the operations down to only those that need to be applied, and the lookups could be avoided. We could generate this draw stream as the output of the batching step.
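A minimal sketch of such a draw stream, under simplifying assumptions: only three kinds of state (pipeline, one bind group, one vertex buffer), every draw ends with a vertex count, and all ids are plain `u32` values. The bit assignments and the `DrawStruct` and `Command` types are made up for illustration and are not taken from HypeHype or Bevy.

```rust
/// Assumed bit assignments for the per-draw bit field.
pub const SET_PIPELINE: u32 = 1 << 0;
pub const SET_BIND_GROUP: u32 = 1 << 1;
pub const SET_VERTEX_BUFFER: u32 = 1 << 2;

/// Hypothetical draw struct made of small ids.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DrawStruct {
    pub pipeline: u32,
    pub bind_group: u32,
    pub vertex_buffer: u32,
    pub vertex_count: u32,
}

/// Encode draws into a stream, emitting only state that changed
/// relative to the previous draw.
pub fn encode(draws: &[DrawStruct]) -> Vec<u32> {
    let mut stream = Vec::new();
    let mut prev: Option<DrawStruct> = None;
    for draw in draws {
        let mut mask = 0u32;
        if prev.map_or(true, |p| p.pipeline != draw.pipeline) {
            mask |= SET_PIPELINE;
        }
        if prev.map_or(true, |p| p.bind_group != draw.bind_group) {
            mask |= SET_BIND_GROUP;
        }
        if prev.map_or(true, |p| p.vertex_buffer != draw.vertex_buffer) {
            mask |= SET_VERTEX_BUFFER;
        }
        stream.push(mask);
        if (mask & SET_PIPELINE) != 0 {
            stream.push(draw.pipeline);
        }
        if (mask & SET_BIND_GROUP) != 0 {
            stream.push(draw.bind_group);
        }
        if (mask & SET_VERTEX_BUFFER) != 0 {
            stream.push(draw.vertex_buffer);
        }
        stream.push(draw.vertex_count);
        prev = Some(*draw);
    }
    stream
}

/// Commands a decoder would hand to the graphics API.
#[derive(Debug, PartialEq, Eq)]
pub enum Command {
    SetPipeline(u32),
    SetBindGroup(u32),
    SetVertexBuffer(u32),
    Draw(u32),
}

/// Decode the stream. Ids are read only for the bits that are set,
/// so unchanged state costs no lookup at all.
pub fn decode(stream: &[u32]) -> Vec<Command> {
    let mut out = Vec::new();
    let mut words = stream.iter().copied();
    while let Some(mask) = words.next() {
        if (mask & SET_PIPELINE) != 0 {
            out.push(Command::SetPipeline(words.next().expect("truncated stream")));
        }
        if (mask & SET_BIND_GROUP) != 0 {
            out.push(Command::SetBindGroup(words.next().expect("truncated stream")));
        }
        if (mask & SET_VERTEX_BUFFER) != 0 {
            out.push(Command::SetVertexBuffer(words.next().expect("truncated stream")));
        }
        out.push(Command::Draw(words.next().expect("truncated stream")));
    }
    out
}
```

In this sketch, two draws that differ only in their bind group encode to 8 words, whereas re-emitting all state for both draws would take 10.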


Design Summary


Prepare pipelines, view bind groups, and assets in main app when they change and cache on entities as components. Store them in some storage that has fast lookups. Probably generational-index-based, but check the generation once at extraction time, and then only store the index part?
Phase item is an Entity and sort key
Queue to phases in main app
Sort phases in main app
Make Entity -> extraction index by iterating and enumerating the phase items
Make DrawFunctions into ExtractFunctions that gather ids and offsets
Extract by iterating ECS query, gathering a draw struct with the extract function, looking up the extract index for the entity and writing the draw struct to Vec

A draw struct is a struct containing small ids to pipelines, bind groups, and index/vertex buffers, and also dynamic offsets for bind groups, vertex offsets and counts, etc.


Store dynamic offsets in a Vec, and store offset + count in the draw struct as a single u32 with 4 bits for the count and 28 bits for the offset. This allows for 2^28 dynamic offsets in the Vec.
Batch from draw structs into draw stream

A draw stream is a Vec<u32> that for each draw contains a u32 bit field that encodes what operations need to be carried out to produce the draw command, followed by the data necessary for each of those operations. For example, the first bit may be a pipeline bit, meaning that the next u32 is a pipeline id. This means that TrackedRenderPass is no longer needed, as diffing and avoiding unnecessary operations would already be done by the batching code, which has to compare draw state anyway.


Encode commands from draw stream using ids to look up as necessary
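The 4-bit count / 28-bit offset packing from the summary above could be sketched like this. `DynamicOffsetRange` is a hypothetical name, and placing the count in the top 4 bits is an arbitrary choice made for this example.

```rust
const COUNT_BITS: u32 = 4;
const OFFSET_BITS: u32 = 32 - COUNT_BITS; // 28
const OFFSET_MASK: u32 = (1 << OFFSET_BITS) - 1;
const MAX_COUNT: u32 = (1 << COUNT_BITS) - 1; // 15

/// Hypothetical packed reference to a run of dynamic offsets:
/// count in the top 4 bits, start index in the low 28 bits.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct DynamicOffsetRange(u32);

impl DynamicOffsetRange {
    /// Returns None if either value does not fit in its bit range.
    pub fn new(offset: u32, count: u32) -> Option<Self> {
        if offset > OFFSET_MASK || count > MAX_COUNT {
            return None;
        }
        Some(Self((count << OFFSET_BITS) | offset))
    }

    /// Index of the first dynamic offset in the shared Vec.
    pub fn offset(self) -> u32 {
        self.0 & OFFSET_MASK
    }

    /// Number of dynamic offsets in this run.
    pub fn count(self) -> u32 {
        self.0 >> OFFSET_BITS
    }

    /// Borrow this draw's dynamic offsets from the shared storage.
    /// Panics if the range lies outside `all`.
    pub fn slice(self, all: &[u32]) -> &[u32] {
        let start = self.offset() as usize;
        &all[start..start + self.count() as usize]
    }
}
```

Each draw struct then carries 4 bytes for its dynamic offsets regardless of how many it uses, up to 15 per draw in this layout.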

Implementation Steps

It may be desirable to avoid moving render schedule sets to the main app. As such, when breaking down work toward this design, it could be good to implement things in a way that keeps all render sets where they are to start with, and later evaluate moving them if there are clear arguments to do so.

Cache ids on main world entities, maintaining a fast lookup mechanism (e.g. a slotmap for index/vertex buffers and bind groups instead of HashMap<AssetId<T>, T>, where AssetId<T> is 20 bytes and has to be hashed).

Use a channel to send ids back to the main world
If Handle has not changed and id is present, extract id, else extract AssetId. In render app, manage asset preparation, send updates to the main world, and update instances that have AssetId but not id


Make PhaseItem be a DrawStruct, rework DrawFunctions to build DrawStructs or remove and do whatever works well for that, sort and batch DrawStructs, and draw from DrawStructs
See if PhaseItem being only Entity + SortKey, sorting, and then gathering DrawStruct data is faster than sorting full DrawStructs
Batch DrawStructs into DrawStream, remove TrackedRenderPass and draw from DrawStream
Test the idea of preparing materials and meshes, queuing, and sorting in the main app, and extracting directly into Vec
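The first step (cache ids on main world entities, using a channel to send ids back) might be prototyped along these lines, using `std::sync::mpsc` rather than any Bevy API. Everything here is hypothetical: `AssetId` is a plain `u64`, `RenderId` is a bare index, and change detection on the Handle is left out, so an instance keeps its cached id until something else resets it.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{Receiver, Sender};

/// Stand-in for the large, hashed asset id.
pub type AssetId = u64;

/// Hypothetical small id assigned by the render app.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct RenderId(pub u32);

/// Main-world instance with an optional cached render id.
pub struct MeshInstance {
    pub asset: AssetId,
    pub render_id: Option<RenderId>,
}

/// What extraction sends to the render app for one instance.
pub enum Extracted {
    Id(RenderId),
    Asset(AssetId),
}

/// Extract the small id if it is cached, else fall back to the AssetId.
pub fn extract(instance: &MeshInstance) -> Extracted {
    match instance.render_id {
        Some(id) => Extracted::Id(id),
        None => Extracted::Asset(instance.asset),
    }
}

/// Render-app side: assigns ids and reports them to the main world.
pub struct RenderAssets {
    prepared: HashMap<AssetId, RenderId>,
    to_main: Sender<(AssetId, RenderId)>,
}

impl RenderAssets {
    pub fn new(to_main: Sender<(AssetId, RenderId)>) -> Self {
        Self { prepared: HashMap::new(), to_main }
    }

    /// Fast path: the id is already known. Slow path: hash the AssetId
    /// once, then send the id back so later frames take the fast path.
    pub fn resolve(&mut self, extracted: Extracted) -> RenderId {
        match extracted {
            Extracted::Id(id) => id,
            Extracted::Asset(asset) => {
                let next = RenderId(self.prepared.len() as u32);
                let id = *self.prepared.entry(asset).or_insert(next);
                // Ignore the error if the main world has gone away.
                let _ = self.to_main.send((asset, id));
                id
            }
        }
    }
}

/// Main-world side: drain pending updates and cache ids on instances.
pub fn apply_updates(
    instances: &mut [MeshInstance],
    from_render: &Receiver<(AssetId, RenderId)>,
) {
    for (asset, id) in from_render.try_iter() {
        for instance in instances.iter_mut().filter(|i| i.asset == asset) {
            instance.render_id = Some(id);
        }
    }
}
```

The linear scan in `apply_updates` is only there to keep the sketch short; how updates are matched to entities in practice is an open design question.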