- Avoid lookups in hot paths, i.e. per-entity work
- 'Random' lookups cause cache misses whereas linear scans are much more cache coherent and so often perform much better
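To make the contrast concrete, here is a minimal sketch of the two access patterns. The `Instance` type and both function names are hypothetical and not part of Bevy; the point is only that the first function hashes and jumps to an unpredictable location per entity, while the second walks contiguous memory.

```rust
use std::collections::HashMap;

/// Hypothetical per-instance data; the name and fields are illustrative only.
#[derive(Clone, Copy)]
pub struct Instance {
    pub entity: u64,
    pub depth: f32,
}

/// Hot path with one lookup per entity: every `get` hashes the key and jumps
/// to an unpredictable bucket, which tends to miss the cache.
pub fn sum_depth_by_lookup(visible: &[u64], instances: &HashMap<u64, Instance>) -> f32 {
    visible
        .iter()
        .filter_map(|entity| instances.get(entity))
        .map(|instance| instance.depth)
        .sum()
}

/// Same result from a linear scan over densely packed data: memory is read
/// front to back, which hardware prefetching handles well.
pub fn sum_depth_by_scan(instances: &[Instance]) -> f32 {
    instances.iter().map(|instance| instance.depth).sum()
}
```

Both functions compute the same value; only the memory access pattern differs, and that is what the rest of these notes try to exploit.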
- `extract_meshes`:
  - iterates an ECS query (good)
  - filters out non-visible entities based on the `ViewVisibility` component (good)
  - reads:
    - `Entity` (8 bytes)
    - `ViewVisibility` (1 byte)
    - `Handle<Mesh>` (24 bytes)
    - `GlobalTransform` (64 bytes)
    - `Option<PreviousGlobalTransform>` (80 bytes)
    - three zero-sized types for whether the entity should do automatic batching, cast shadows, receive shadows
  - writes:
    - `MeshTransforms` (100 bytes including previous and current transforms, and mesh flags)
    - `AssetId<Mesh>` (20 bytes)
    - Shadow caster bool (1 byte)
    - `MaterialBindGroupId` (4 bytes)
    - Automatic batching bool (1 byte)
  - This is mostly quite good, but the `PreviousGlobalTransform` could instead be double-buffered in the render world, saving 80 bytes being read and 48 bytes being written. The `Handle<Mesh>` read and `AssetId<Mesh>` write could instead be a small `u32` + `u32` generational index that would be an 8-byte read and write.
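The double-buffering idea for `extract_meshes` could look roughly like the sketch below. `DoubleBufferedTransforms` and `Affine` are hypothetical names, and the sketch assumes that an instance keeps the same slot from one frame to the next, which a real implementation would have to guarantee somehow.

```rust
/// A 3x4 affine transform stored as 12 floats (48 bytes). Illustrative only.
pub type Affine = [f32; 12];

/// Hypothetical render-world storage holding two frames of transforms.
/// Assumption: an instance occupies the same slot in consecutive frames.
#[derive(Default)]
pub struct DoubleBufferedTransforms {
    current: Vec<Affine>,
    previous: Vec<Affine>,
}

impl DoubleBufferedTransforms {
    /// Call once at the start of extraction: last frame's `current` becomes
    /// `previous`, and the old `previous` allocation is reused for new data.
    pub fn begin_frame(&mut self) {
        std::mem::swap(&mut self.current, &mut self.previous);
        self.current.clear();
    }

    /// Only the current transform is extracted; the previous one is never
    /// read from the main world or written again.
    pub fn push(&mut self, transform: Affine) -> usize {
        self.current.push(transform);
        self.current.len() - 1
    }

    /// Previous-frame transform for a slot, falling back to the current one
    /// for instances that did not exist last frame.
    pub fn previous(&self, slot: usize) -> Affine {
        self.previous.get(slot).copied().unwrap_or(self.current[slot])
    }
}
```

The swap is a pointer exchange, so keeping the previous frame's transforms costs no per-entity copying.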
- `queue_material_meshes`
  - Iterates the visible entities and looks up:
    - The `Entity` (an 8-byte read)
    - The `AssetId<M>` using an `EntityHashMap<Entity, AssetId<M>>` (20 bytes)
    - The prepared `M` using a `HashMap<AssetId<M>, M>` (depends on the material, but roughly 30 bytes at least, likely more)
    - The `RenderMeshInstance` using an `EntityHashMap<Entity, RenderMeshInstance>` (128 bytes)
    - The `GpuMesh` using a `HashMap<AssetId<Mesh>, GpuMesh>` (136 bytes)
  - This is bad. 4 HashMap lookups to get information in order to be able to:
    - apply pipeline specialisation to get the pipeline id
    - queue the entity to the phases:
      - calculating view z from the view and model transforms
      - applying a per-material depth bias
      - using the material's alpha mode to select the correct phase to queue to
      - then writing an entity, draw function, pipeline id, distance, batch range, and dynamic offset to the phase, totalling 40 bytes
  - If the `AssetId<Mesh>` and `AssetId<M>` were small generational indices and the `HashMap<AssetId<Mesh>, GpuMesh>` and `HashMap<AssetId<M>, M>` were instead slotmaps, those lookups would be faster.
  - Part of the reason for the lookup is to check that the assets have been prepared. If the presence of the small generational indices had an invariant on the assets having been prepared, that would no longer be necessary.
  - If pipeline specialisation were done async and the pipeline id cached on the entity, then:
    - The `GpuMesh` would not need to be looked up at all and the `AssetId<Mesh>` would not be needed here.
  - Then, if also the material properties AlphaMode (8 bytes) and depth bias (why is depth bias per material?) (4 bytes) were cached on the main world entity, then no material lookup would be necessary. That leaves the RenderMeshInstance and
  - If the entities were queued to phases in the main app, no lookups would be needed at all as one could just iterate the ECS query, and avoid having to extract some of the information at all.
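A slotmap keyed by a small generational index, as suggested for the mesh and material storage, can be sketched as follows. This is a minimal illustration and not the API of any existing crate; `GenId` and `SlotMap` are hypothetical names. An id is only handed out by `insert`, so holding a valid id implies the value has been stored (prepared).

```rust
/// 8-byte id: a slot index plus the generation it was issued for.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct GenId {
    pub index: u32,
    pub generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

/// Minimal slotmap sketch (hypothetical, not a drop-in for any crate).
pub struct SlotMap<T> {
    slots: Vec<Slot<T>>,
    free: Vec<u32>,
}

impl<T> SlotMap<T> {
    pub fn new() -> Self {
        Self { slots: Vec::new(), free: Vec::new() }
    }

    /// Store a prepared value and hand out its id, reusing a freed slot if any.
    pub fn insert(&mut self, value: T) -> GenId {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.value = Some(value);
            GenId { index, generation: slot.generation }
        } else {
            self.slots.push(Slot { generation: 0, value: Some(value) });
            GenId { index: (self.slots.len() - 1) as u32, generation: 0 }
        }
    }

    /// One bounds-checked index plus a generation compare; no hashing.
    pub fn get(&self, id: GenId) -> Option<&T> {
        let slot = self.slots.get(id.index as usize)?;
        if slot.generation == id.generation { slot.value.as_ref() } else { None }
    }

    /// Free the slot and bump its generation so stale ids stop resolving.
    pub fn remove(&mut self, id: GenId) -> Option<T> {
        let slot = self.slots.get_mut(id.index as usize)?;
        if slot.generation != id.generation {
            return None;
        }
        let value = slot.value.take()?;
        slot.generation = slot.generation.wrapping_add(1);
        self.free.push(id.index);
        Some(value)
    }
}
```

Compared with hashing a 20-byte `AssetId`, the lookup here is an array index and one integer comparison.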
- `batch_and_prepare_render_phase` for 3D
  - Iterates phase items:
    - Looks up the query item in the provided query using the phase item entity
    - Looks up the `RenderMeshInstance` to get the `MeshTransforms` (100 bytes), `MaterialBindGroupId` (4 bytes), and `AssetId<Mesh>` (20 bytes)
  - This is bad. 2 lookups, though the query lookup could just be removed at the moment as it is unused in core bevy.
  - If the `MeshTransforms` and batch comparison data `(MaterialBindGroupId, AssetId<Mesh>)` were in a `Vec` in `RenderPhase` order, then it could be iterated directly with no lookups.
  - If the batch comparison data were part of the `PhaseItem` then it could have been sorted when sorting the `RenderPhase`.
  - However, it is possible that sorting this data and moving it around as part of the sorting process could be slower than necessary. If the `RenderPhase` contained only an `Entity` and a `SortKey`, then the sorting might be faster. If that were done in the main app, then if we were to create an `Entity` -> `usize` mapping where the `usize` is the phase item index, we could iterate an ECS query to extract the small ids and dynamic offsets etc directly into a `Vec<GetBatchData::CompareData>`. This achieves the necessary data being in a `Vec` in sorted `RenderPhase` order.
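The `Entity` -> `usize` idea can be sketched as below. All types here are stand-ins: `Entity` is a plain `u64`, `CompareData` is a hypothetical substitute for `GetBatchData::CompareData`, and the mapping is a `HashMap` purely for brevity (a real version might prefer a denser structure).

```rust
use std::collections::HashMap;

/// Stand-in for Bevy's `Entity`.
pub type Entity = u64;

/// Lean phase item: only what sorting needs (hypothetical layout).
#[derive(Clone, Copy)]
pub struct PhaseItem {
    pub entity: Entity,
    pub sort_key: u32,
}

/// Stand-in for `GetBatchData::CompareData`: small ids compared when batching.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct CompareData {
    pub material_bind_group: u32,
    pub mesh: u32,
}

/// Sort the lean items, then record each entity's position in the phase.
pub fn sort_and_index(phase: &mut [PhaseItem]) -> HashMap<Entity, usize> {
    phase.sort_unstable_by_key(|item| item.sort_key);
    phase
        .iter()
        .enumerate()
        .map(|(index, item)| (item.entity, index))
        .collect()
}

/// Iterate the "query" in whatever order the ECS yields and scatter each
/// entry into its phase slot, so the output is in sorted phase order.
pub fn extract_in_phase_order(
    query: &[(Entity, CompareData)],
    index_of: &HashMap<Entity, usize>,
) -> Vec<CompareData> {
    let mut out = vec![CompareData::default(); index_of.len()];
    for (entity, data) in query {
        if let Some(&index) = index_of.get(entity) {
            out[index] = *data;
        }
    }
    out
}
```

Sorting only moves 16-byte items here, and batching can then walk the resulting `Vec` linearly.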
- `MainOpaquePass3dNode::run()` -> `RenderPhase::render()`
  - Iterates the phase items and for each batch:
    - Looks up the `Draw` from a `Vec` using the phase item's `DrawFunctionId`
    - For each `RenderCommand` in the `Draw` (for `DrawMaterial<M>` this is 5 `RenderCommand`s):
      - Looks up the `SystemParam`
      - Looks up the `WorldQuery` for the view
      - Looks up the `WorldQuery` for the item's entity
      - Calls the `RenderCommand`'s `render()`:
        - `SetItemPipeline`
          - Looks up the pipeline using the pipeline id, which involves indexing into a `Vec` (good)
        - `SetMeshViewBindGroup<0>`
          - Has the data it needs from the `WorldQuery` for the view
        - `SetMaterialBindGroup<M, 1>`
          - Looks up the `AssetId<M>` in an `EntityHashMap<Entity, AssetId<M>>` using the item's entity
          - Looks up the prepared `M` in a `HashMap<AssetId<M>, prepared M>` using the `AssetId<M>`
          - Sets the material bind group
        - `SetMeshBindGroup<2>`
          - Looks up the `RenderMeshInstance` in an `EntityHashMap<Entity, RenderMeshInstance>` using the item's entity
          - Looks up the `SkinIndex` in an `EntityHashMap<Entity, SkinIndex>` using the item's entity
          - Looks up the `MorphIndex` in an `EntityHashMap<Entity, MorphIndex>` using the item's entity
          - Gets the mesh bind group from a `MeshBindGroups` resource. For non-morphed meshes the bind group is a direct member. For morphed meshes, this causes a lookup in a `HashMap<AssetId<Mesh>, BindGroup>`
        - `DrawMesh`
          - Looks up the `RenderMeshInstance` in an `EntityHashMap<Entity, RenderMeshInstance>` using the item's entity
          - Looks up the `GpuMesh` in a `HashMap<AssetId<Mesh>, GpuMesh>` using the `AssetId<Mesh>` from the `RenderMeshInstance`
  - This is bad. That's a whole lot of lookups - 11 per drawn thing in total, 12 if the mesh is morphed!!! 2 or 3 `HashMap<AssetId<T>, U>` lookups, 5 `EntityHashMap<Entity, T>` lookups, 2 `WorldQuery` lookups, and 1 `SystemParam` lookup
    - The pipeline, bind groups, and mesh buffer lookups should all be very close to single lookups by indexing into a `Vec` each. That would be a total of 5 or 6 `Vec` lookups that are necessary.
  - BUT, that's not all! All these lookups are currently done unconditionally for each phase item. The results are passed to `TrackedRenderPass`, which tracks which resources have already been bound. If the resources being requested to be bound match that state, the operation is skipped. Otherwise it is passed to wgpu's `RenderPass` API. This means that a lot of lookups could have been avoided, if they were only done when actually needed!
    - Aaltonen made a design for a renderer for HypeHype where the 'draw struct' information (all the ids, offsets, counts, etc) is never written, only used to encode a 'draw stream'. The 'draw stream' is a `Vec<u32>` that contains a sequence of operations like setting a pipeline, bind group, index/vertex buffer, or drawing, encoded as a bit field of the operations that follow, followed by the ids for those operations. For example, if the first `u32` contains a bit that indicates that the pipeline needs to be set, then the next `u32` would be the pipeline id. This compresses the operations down to only those that need to be applied, and the lookups could be avoided. We could generate this draw stream as the output of the batching step.
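A draw stream along the lines described above might be encoded as in this sketch. The bit assignments, the three-field `DrawStruct`, and the `Command` enum are all assumptions made for illustration, and each record ends in an implicit draw to keep the example short. The actual HypeHype encoding is not reproduced here.

```rust
/// Bit flags in the leading word of each record (assumed assignment).
pub const SET_PIPELINE: u32 = 1 << 0;
pub const SET_MATERIAL: u32 = 1 << 1;
pub const SET_MESH: u32 = 1 << 2;

/// Hypothetical draw struct: small ids only.
#[derive(Clone, Copy, PartialEq, Eq)]
pub struct DrawStruct {
    pub pipeline: u32,
    pub material: u32,
    pub mesh: u32,
}

/// Encode draws as [mask, id, id, ...] records, emitting an id only when it
/// differs from the previous draw. The state diffing happens here, once.
pub fn encode(draws: &[DrawStruct]) -> Vec<u32> {
    let mut stream = Vec::new();
    let mut prev: Option<DrawStruct> = None;
    for draw in draws {
        let mask_at = stream.len();
        stream.push(0); // placeholder for the mask word
        let mut mask = 0;
        if prev.map_or(true, |p| p.pipeline != draw.pipeline) {
            mask |= SET_PIPELINE;
            stream.push(draw.pipeline);
        }
        if prev.map_or(true, |p| p.material != draw.material) {
            mask |= SET_MATERIAL;
            stream.push(draw.material);
        }
        if prev.map_or(true, |p| p.mesh != draw.mesh) {
            mask |= SET_MESH;
            stream.push(draw.mesh);
        }
        stream[mask_at] = mask;
        prev = Some(*draw);
    }
    stream
}

#[derive(Debug, PartialEq, Eq)]
pub enum Command {
    SetPipeline(u32),
    SetMaterial(u32),
    SetMesh(u32),
    Draw,
}

/// Decode the stream into the commands a render pass would issue.
pub fn decode(stream: &[u32]) -> Vec<Command> {
    let mut commands = Vec::new();
    let mut words = stream.iter().copied();
    while let Some(mask) = words.next() {
        if (mask & SET_PIPELINE) != 0 {
            commands.push(Command::SetPipeline(words.next().expect("truncated stream")));
        }
        if (mask & SET_MATERIAL) != 0 {
            commands.push(Command::SetMaterial(words.next().expect("truncated stream")));
        }
        if (mask & SET_MESH) != 0 {
            commands.push(Command::SetMesh(words.next().expect("truncated stream")));
        }
        commands.push(Command::Draw);
    }
    commands
}
```

Two consecutive draws that share a pipeline and material cost six words in total, and the second draw triggers only a mesh lookup at decode time.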
- Prepare pipelines, view bind groups, and assets in main app when they change and cache on entities as components. Store them in some storage that has fast lookups. Probably generational-index-based, but check the generation once at extraction time, and then only store the index part?
- Phase item is an Entity and sort key
- Queue to phases in main app
- Sort phases in main app
- Make Entity -> extraction index by iterating and enumerating the phase items
- Make DrawFunctions into ExtractFunctions that gather ids and offsets
- Extract by iterating ECS query, gathering a draw struct with the extract function, looking up the extract index for the entity and writing the draw struct to Vec
- A draw struct is a struct containing small ids to pipelines, bind groups, and index/vertex buffers, and also dynamic offsets for bind groups, vertex offsets and counts, etc.
- Store dynamic offsets in Vec and store offset + count in draw struct as 4 bits for count and 28 for offset. Allows for 2^28 dynamic offsets in the Vec.
- Batch from draw structs into draw stream
- A draw stream is a Vec that for each draw contains a u32 bit field that encodes what operations need to be carried out to produce the draw command, followed by the data necessary for each of those operations. For example, the first bit may be a pipeline bit, meaning that the next u32 is a pipeline id. This means that TrackedRenderPass is no longer needed as diffing and avoiding unnecessary operations would be already done by batching code that anyway has to compare draw state.
- Encode commands from draw stream using ids to look up as necessary
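The 4-bit count plus 28-bit offset packing mentioned in the list above can be sketched as follows. Placing the count in the top four bits is an assumption; the notes only fix the bit widths. The function names are hypothetical.

```rust
const COUNT_BITS: u32 = 4;
const OFFSET_BITS: u32 = 28;
const OFFSET_MASK: u32 = (1 << OFFSET_BITS) - 1;

/// Pack a start offset into the shared dynamic-offset Vec (28 bits) and the
/// number of offsets (4 bits) into one u32. Returns None if either overflows.
pub fn pack(offset: u32, count: u32) -> Option<u32> {
    if offset > OFFSET_MASK || count >= (1 << COUNT_BITS) {
        return None;
    }
    Some((count << OFFSET_BITS) | offset)
}

/// Inverse of `pack`: returns (offset, count).
pub fn unpack(packed: u32) -> (u32, u32) {
    (packed & OFFSET_MASK, packed >> OFFSET_BITS)
}

/// Slice of the shared Vec that belongs to one draw struct.
pub fn dynamic_offsets(all: &[u32], packed: u32) -> &[u32] {
    let (offset, count) = unpack(packed);
    &all[offset as usize..(offset + count) as usize]
}
```

With this layout a draw struct can reference up to 15 dynamic offsets starting anywhere in the first 2^28 entries of the shared `Vec`.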
It may be desirable to avoid moving render schedule sets to the main app. As such, when breaking down work toward this design, it could be good to implement things in a way that keeps all render sets where they are to start with, and later evaluate moving them if there are clear arguments to do so.
- Cache ids on main world entities, maintaining a fast lookup mechanism (e.g. slotmap for index/vertex buffers and bind groups instead of `HashMap<AssetId<T>, T>` where `AssetId<T>` is 20 bytes large and has to be hashed.)
- Use a channel to send ids back to the main world
- If Handle has not changed and id is present, extract id, else extract AssetId. In render app, manage asset preparation, send updates to the main world, and update instances that have AssetId but not id
- Make PhaseItem be a DrawStruct, rework DrawFunctions to build DrawStructs or remove and do whatever works well for that, sort and batch DrawStructs, and draw from DrawStructs
- See if PhaseItem being only Entity + SortKey, sorting, and then gathering DrawStruct data is faster than sorting full DrawStructs
- Batch DrawStructs into DrawStream, remove TrackedRenderPass and draw from DrawStream
- Test the idea of preparing materials and meshes, queuing, and sorting in the main app, and extracting directly into Vec
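The first steps of this breakdown (send ids back over a channel, then extract the id when it is valid and the `AssetId` otherwise) could be prototyped roughly like this. Everything here is a stand-in: `Entity` is a `u64`, the asset id is a `u128`, and `MainWorldCache` uses a `HashMap` only to keep the sketch free of dependencies. In Bevy the cached id would more likely live in a component on the entity.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Stand-in for Bevy's `Entity`.
pub type Entity = u64;

/// What extraction hands to the render world for a mesh instance.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ExtractedMesh {
    /// Small id of an asset the render world has already prepared.
    Id(u32),
    /// Full asset id (u128 stand-in); the render world must resolve it.
    AssetId(u128),
}

/// Hypothetical main-world side of the id cache.
pub struct MainWorldCache {
    ids: HashMap<Entity, u32>,
    updates: Receiver<(Entity, u32)>,
}

impl MainWorldCache {
    /// Returns the cache plus the sender that the render app would hold.
    pub fn new() -> (Self, Sender<(Entity, u32)>) {
        let (sender, updates) = channel();
        (Self { ids: HashMap::new(), updates }, sender)
    }

    /// Drain ids sent back by the render app without blocking.
    pub fn apply_updates(&mut self) {
        for (entity, id) in self.updates.try_iter() {
            self.ids.insert(entity, id);
        }
    }

    /// Extract the small id when it is present and the handle is unchanged,
    /// otherwise fall back to the full asset id.
    pub fn extract(&self, entity: Entity, asset_id: u128, handle_changed: bool) -> ExtractedMesh {
        match self.ids.get(&entity) {
            Some(&id) if !handle_changed => ExtractedMesh::Id(id),
            _ => ExtractedMesh::AssetId(asset_id),
        }
    }
}
```

The render app would call `send` after preparing an asset; until that id arrives, instances keep extracting the full asset id and the render app resolves it the slow way.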