melonDS compute renderer testing notes
Cheep Cheep Beach in MKDS will display garbage even at 5x if you go into the
water. (This happened once; I can't reproduce it.)
I tried using a debug context, enabling debug output from the compute
renderer (even though that shouldn't be needed, since debug contexts enable it
by default) and giving it a callback to print debug messages. It didn't print
any more info than when I tested with a regular context.
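For reference, the setup was roughly the following (a minimal sketch assuming
a GL 4.3+ debug context and a function loader; the callback is mine, not
melonDS's actual code):

    // Sketch of the debug-output setup; assumes <cstdio> and a GL 4.3+ loader.
    // APIENTRY matters for the calling convention on Windows; no-op elsewhere.
    static void APIENTRY DebugCallback(GLenum source, GLenum type, GLuint id,
                                       GLenum severity, GLsizei length,
                                       const GLchar* message, const void* user)
    {
        fprintf(stderr, "GL debug [type 0x%X]: %s\n", type, message);
    }

    void EnableDebugOutput()
    {
        glEnable(GL_DEBUG_OUTPUT);
        glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS); // report on the offending call
        glDebugMessageCallback(DebugCallback, nullptr);
        // Opt into everything, including low-severity notifications.
        glDebugMessageControl(GL_DONT_CARE, GL_DONT_CARE, GL_DONT_CARE,
                              0, nullptr, GL_TRUE);
    }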
I also tested with Mesa. It prints even less than Nvidia's driver (which only
printed messages about successful buffer allocations).
Newer versions of Mesa have the red and blue channels swapped. A fresh
install of Debian 12.2 works fine; it breaks after dist-upgrading to testing
(Trixie), which at the time of writing ships Mesa 24.0.8. I suppose it's a
small behavior change in newer Mesa releases. It still doesn't log any
errors.
I checked all compute shaders with the glslang validator and got no complaints
from it.
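(That check is just the standalone validator run over each stage; the
filename below is a placeholder, and the stage is inferred from the .comp
extension.)

    glslangValidator SomeComputeShader.comp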
Tried increasing maxYSpanIndices (there's a comment saying the current values
are a bad guess). It didn't change anything.
There are a couple of glMemoryBarrier calls using GL_SHADER_STORAGE_BUFFER,
which looks wrong to me. The documentation doesn't list
GL_SHADER_STORAGE_BUFFER as a glMemoryBarrier flag, but it does list
GL_SHADER_STORAGE_BARRIER_BIT, which later glMemoryBarrier calls use.
Replacing GL_SHADER_STORAGE_BUFFER with GL_SHADER_STORAGE_BARRIER_BIT did not
get rid of the garbage. Neither did adding barriers after the first two
glDispatchCompute calls, nor after the glDispatchCompute call in the loop.
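The fix I tried looks roughly like this (a sketch with placeholder group
counts, not the actual melonDS code). GL_SHADER_STORAGE_BUFFER is a
buffer-binding target, not a barrier bit, but it compiles anyway because
glMemoryBarrier takes a plain GLbitfield, so the wrong constant just silently
requests a different set of barriers:

    // Before: not a valid barrier bit, silently requests the wrong barriers.
    // glMemoryBarrier(GL_SHADER_STORAGE_BUFFER);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);

    // Extra barrier after a dispatch, so its SSBO writes are visible to the
    // next dispatch's reads (the additional barriers I tried adding).
    glDispatchCompute(groupsX, groupsY, groupsZ);
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);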
I checked that the vast majority of the pipeline wasn't exceeding
GL_MAX_COMPUTE_WORK_GROUP_COUNT. For the stages using glDispatchCompute,
OpenGL would log errors if the counts were too high, and it doesn't, so they
must be fine. glDispatchComputeIndirect, however, will not log OpenGL errors
even if the group counts are too high, and checking those would be annoying
since I'd have to read the parameter buffer back from VRAM. RenderDoc did
capture the values used for these, though. At 6x none of the indirect calls
ever goes above the spec-guaranteed minimum of 65535. At 7x one call does
exceed it (an indirect dispatch with group counts (1, 1, 79414)). That
exceeds the maximum supported by my 1080 Ti with the 552 driver (it sticks to
the minimum of 65535 on the Z axis). I tried replacing the
glDispatchComputeIndirect call in the loop with glDispatchCompute(1, 1,
65535). This DID have an effect on the garbage, but it did not fix it. It's
likely this is the problem on Nvidia. The local sizes all seem fine.
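Checking the indirect dispatches on the CPU would go roughly like this (a
sketch; indirectBuffer and offset are placeholders for wherever melonDS
stores its dispatch parameters):

    // Query the per-axis group-count limits (indexed query, one per axis).
    GLint maxCount[3];
    for (int i = 0; i < 3; i++)
        glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, i, &maxCount[i]);

    // The indirect buffer holds num_groups_x/y/z as three consecutive
    // GLuints; glDispatchComputeIndirect never validates them, which is why
    // exceeding the limit is silent undefined behavior rather than an error.
    GLuint groups[3];
    glBindBuffer(GL_DISPATCH_INDIRECT_BUFFER, indirectBuffer);
    glGetBufferSubData(GL_DISPATCH_INDIRECT_BUFFER, offset,
                       sizeof(groups), groups);
    for (int i = 0; i < 3; i++)
        if (groups[i] > (GLuint)maxCount[i])
            fprintf(stderr, "axis %d: %u > max %d\n",
                    i, groups[i], maxCount[i]);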
Tried grabbing a couple of captures with RenderDoc while running Phantom
Hourglass. The low-res framebuffer seems to be affected in a similar way to
the high-res one. The high-res framebuffer also looks like it has the red and
blue channels swapped before presentation (much like what I was seeing on
Mesa).