So you want to use deko3d for maximum speed and minimum bloat?
Great!
deko3d is fantastic, working on the same layer of abstraction or even lower than Vulkan, but (even without the C++ wrapper) you tend to only need half as much code to do the same thing.
I've used it for a while now and kind of done the same thing wrong one too many times, so here's a list of those things.
Some things listed here are also documented in the primer while others are not documented there.
If some of these questions seem contrived, they aren't; this is the result of a lot of trial and error.
How can I barrier stuff which uses the copy engine (dkCmdBufCopyBuffer, dkCmdBufCopyBufferToImage, dkCmdBufCopyImageToBuffer) or 2D engine (dkCmdBufBlitImage, dkCmdBufResolveImage)?
You don't need to. In fact, you don't even need a barrier before it if you want everything on the 3D engine to finish first, or after it if you want the 3D engine to start only once the 2D/copy engine is done. You also don't need to worry about cache coherency. GPU cmdbufs have a concept of subchannels, and each engine (3D, 2D, compute, copy, GPFIFO) has one. When you switch from or to 3D or compute, a wait is inserted and the cache is flushed.
More in-depth source brought up by xerpi: https://github.com/NVIDIA/open-gpu-doc/blob/master/manuals/turing/tu104/dev_ram.ref.txt#L998-L1007
When doing a readback from the GPU (e.g. dkCmdBufCopyImageToBuffer) it is necessary to clear the L2 cache. When a fence is inserted directly afterwards (even with a cache flush in between), no switch to the 3D engine is performed, so the readback is not safe.
not_he helped with this one and suggested the following code, which inserts a 3D engine nop that should synchronise the transfer operation.
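dkCmdBufCopyBufferToImage(...);
// Replaying a single 3D engine nop forces a switch to the 3D subchannel.
std::uint32_t threed_nop = 0x80000040;
dkCmdBufReplayCmds(cmdbuf, &threed_nop, 1);
// No ordering barrier needed here, only the L2 cache invalidation.
dkCmdBufBarrier(cmdbuf, DkBarrier_None, DkInvalidateFlags_L2Cache);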
Before that I had been using primitive barriers, which probably worked because they are a 3D channel command. A full barrier would probably work as well, without the explicit switch to the 3D engine.
Can 2D engine operations have the same source and destination, like glCopyImageSubData or glBlitFramebuffer?
Yes, but overlapping source and destination regions should be treated as undefined behaviour, just like in OpenGL.
For fun's sake: what seems to happen is that it actually picks up the in-progress result, producing fractal-like images with a few glitches in the lowest levels.
Thanks to Pharynx for helping with the last two.
I push data with dkCmdBufPushData to descriptors/samplers/whatever similar to dkCmdBufPushConstants (i.e. multiple times between draw calls with no barrier in between), why doesn't it update the data?
Contrary to its name, dkCmdBufPushData doesn't seem to have push semantics, but it also doesn't use the copy engine (which would result in everything being ordered correctly). It uses the 3D engine and thus needs explicit fences and cache flushes for correct ordering.
Though it's probably more advisable to just use a larger descriptor buffer containing all the descriptors than to insert a bunch of slow barriers.
DkImageView objects contain a pointer to the DkImage which was used to create them, so they must not outlive it.
The memory address of a fence is written into the command list and can't be changed afterwards, so make sure that memory stays where it is until the fence is signaled.
Secret deko3d info: Internally there are external and internal fences. Almost all of your fences will be internal ones, where this applies, but e.g. the fences from dkSwapchainAcquireImage are not; their memory is external (you can see this in action in the implementation of dkQueueAcquireImage, where the advice given here about the fence having to live long enough is just ignored).
You probably misconfigured the drawing pipeline (the error should go away if you remove the draw calls). Double-check all of it, especially the vertex buffer setup.
Happened to me twice: passing DkStage_Vertex|DkStage_Fragment to dkCmdBufBindShaders when it should be DkStageFlag_Vertex|DkStageFlag_Fragment.
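To see why that mix-up compiles but binds the wrong stages, here is a minimal sketch. It assumes the stage values are consecutive indices starting at zero and each flag is one bit shifted by that index. The enums below are hypothetical stand-ins written for this illustration, not the real deko3d definitions, so check the header for the actual values.

```cpp
#include <cstdint>

// Hypothetical stand-ins (assumption): stage indices count up from zero...
enum Stage : std::uint32_t {
    Stage_Vertex   = 0,
    Stage_TessCtrl = 1,
    Stage_TessEval = 2,
    Stage_Geometry = 3,
    Stage_Fragment = 4,
};

// ...and each flag is a single bit at the position of its stage index.
enum StageFlag : std::uint32_t {
    StageFlag_Vertex   = 1u << Stage_Vertex,
    StageFlag_TessCtrl = 1u << Stage_TessCtrl,
    StageFlag_TessEval = 1u << Stage_TessEval,
    StageFlag_Geometry = 1u << Stage_Geometry,
    StageFlag_Fragment = 1u << Stage_Fragment,
};

// OR-ing indices instead of flags: compiles, but is not the intended mask.
constexpr std::uint32_t wrong_mask() { return Stage_Vertex | Stage_Fragment; }

// OR-ing the flags gives one bit per requested stage.
constexpr std::uint32_t right_mask() { return StageFlag_Vertex | StageFlag_Fragment; }
```

Under these assumptions the wrong mask has a single bit set, and that bit belongs to a stage that was never requested, while the vertex and fragment bits are both clear.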
The queue needs to be flushed, otherwise the already submitted command buffers will never start executing and a deadlock occurs. dkQueuePresentImage flushes the queue.
The flush parameter in dkCmdBufSignalFence doesn't flush the queue, it only flushes the cache. Only calling dkQueueFlush (and some other operations like presenting an image) actually flushes the queue.
Also happened to me twice.
I have a cmdbuf with a callback to allocate further memory. Why doesn't it call it after calling dkCmdBufClear?
dkCmdBufClear does not remove all the memory associated with the command buffer, it only moves back to the beginning of the last appended slice.
I want to resolve only part of a multisampled (MSAA) framebuffer into another part of an image like in Vulkan, how can I do this?
Good news: resolving framebuffers seems to be just an image blit with a linear filter, and dkCmdBufBlitImage with DkBlitFlag_FilterLinear seems to work just as well while also allowing you to crop and even scale!
I'm protecting per-frame resources (cmdbuf memory, upload buffers, etc.) with dkQueueAcquireImage and dkQueuePresentImage (no other fences). Why do I get flickering/other glitches and instabilities?
dkQueueAcquireImage inserts a fence wait into the queue to wait for the framebuffer to be writeable again, but the framebuffer (and the associated resources) isn't necessarily writeable yet when the call returns. You either need to call dkSwapchainAcquireImage (instead of dkQueueAcquireImage) yourself and then wait on the fence yourself (which would work but is stupid) or use your own (possibly finer grained) fences.
dkCmdBufBindShaders always replaces the entire shader pipeline. Unspecified stages are disabled.
Re the deferred shading deko3d example: is it possible to render "subpasses" (via tile barrier) without using separate framebuffers for the different "subpasses" like in the example?
Looks like it works! Unlike Vulkan (where the special subpassLoad function in the shader has to be used), the same image is bound as render target and texture and then accessed with texelFetch. Accesses outside the same tile are funny.
GPU method error (irq 0x00100000), result 0
[b197:458] = 0x78800000, result 0
A reason for an error like the one above might be the padding bits in DkVtxAttribState, declared as uint32_t : 1;. When a vertex attribute state struct is not cleared before it is filled in and bound, these bits may be in an undefined state. Contrary to expectation, these values are not ignored and can lead to this crash!
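A minimal sketch of the fix: clear the whole struct before setting any named field, so the unnamed padding bits are zero as well. The struct below is a hypothetical stand-in with made-up field widths, not the real DkVtxAttribState layout, and `init_cleared` and `raw_bits` are helper names invented for this example.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in (assumption): two named fields and unnamed padding bits.
struct AttribStateExample {
    std::uint32_t bufferId : 5;
    std::uint32_t          : 1;   // unnamed padding bit, cannot be assigned by name
    std::uint32_t offset   : 14;
    std::uint32_t          : 12;  // more unnamed padding bits
};

// Reads back the raw storage so the padding bits can be inspected.
std::uint32_t raw_bits(const AttribStateExample& s) {
    std::uint32_t raw = 0;
    std::memcpy(&raw, &s, sizeof raw < sizeof s ? sizeof raw : sizeof s);
    return raw;
}

// Clears every bit first (named fields and padding), then fills in the named fields.
void init_cleared(AttribStateExample& s, std::uint32_t bufferId, std::uint32_t offset) {
    std::memset(&s, 0, sizeof s);
    s.bufferId = bufferId;
    s.offset   = offset;
}
```

Without the `std::memset`, a struct living on the stack would keep whatever garbage was in the padding bits, and assigning the named fields never touches them.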
For understanding the crash message or similar, see https://gitlab.freedesktop.org/mesa/mesa/-/raw/main/src/nouveau/headers/nvidia/classes/clb197.h?ref_type=heads
#define MAXWELL_B 0xB197
And
#define NVB197_SET_VERTEX_ATTRIBUTE_A(i) (0x1160+(i)*4)
0x1160/4 = 0x458, which matches the second number in the [b197:458] part of the error message.
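The same arithmetic as a small sketch, useful for mapping other error messages back to the class header. It assumes, as the calculation above suggests, that the number after the colon is the method's byte offset divided by four. The function names are made up for this example; only the 0x1160 base and the 4-byte stride come from the quoted #define.

```cpp
#include <cstdint>

// Byte offset of SET_VERTEX_ATTRIBUTE_A(i), taken from the #define quoted above.
constexpr std::uint32_t set_vertex_attribute_a(std::uint32_t i) {
    return 0x1160 + i * 4;
}

// Assumption: the error message prints the byte offset divided by four.
constexpr std::uint32_t to_method_index(std::uint32_t byteOffset) {
    return byteOffset / 4;
}

// Reverse direction: take the number from the error message and get the
// byte offset to search for in the class header.
constexpr std::uint32_t to_byte_offset(std::uint32_t methodIndex) {
    return methodIndex * 4;
}

static_assert(to_method_index(set_vertex_attribute_a(0)) == 0x458,
              "matches the [b197:458] error above");
```

To decode a different message, multiply the number after the colon by four and search the header of the class named before the colon for that offset.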
Credit to @xerpi for having the problem and figuring it out.