Stefan Dyulgerov (kingofthebongo2008) · Sofia, Bulgaria
@Marc-B-Reynolds
Marc-B-Reynolds / cbrt.c
Last active February 23, 2023 18:03
bit-exact portable cube root and reciprocal thereof
// Public Domain under http://unlicense.org, see link for details.
// except:
// * core-math function `cr_cbrtf` (see license below)
// * musl flavored fdlib function `fdlibm_cbrtf` (see license below)
// code and test driver for cube root and its reciprocal based on:
// "Fast Calculation of Cube and Inverse Cube Roots Using
// a Magic Constant and Its Implementation on Microcontrollers"
// Moroz, Samotyy, Walczyk, Cieslinski, 2021
// (PDF: https://www.mdpi.com/1996-1073/14/4/1058)
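
For context, the magic-constant approach reinterprets the float's bits as an integer, forms a cheap initial guess for x^(-1/3) directly from that bit pattern, and then polishes it with division-free Newton steps; cbrt(x) is recovered as x*r*r. Below is a minimal C sketch of the idea, not the gist's bit-exact code: the constant 0x54AAAAAA is just the naive (4/3)*0x3F800000 exponent-bias value (the paper derives an optimized constant), and the function name is mine.

#include <stdint.h>
#include <string.h>

/* Rough sketch of the magic-constant estimate for r ~ x^(-1/3), x > 0.
   Two division-free Newton steps r <- r*(4 - x*r^3)/3 refine the guess;
   the cube root itself is then x*r*r. */
static float rcbrtf_sketch(float x)
{
  uint32_t i;
  memcpy(&i, &x, sizeof i);          /* portable type pun */
  i = 0x54AAAAAAu - i / 3u;          /* magic-constant initial guess */
  float r;
  memcpy(&r, &i, sizeof r);
  r = r * (4.0f - x * r * r * r) * (1.0f / 3.0f);
  r = r * (4.0f - x * r * r * r) * (1.0f / 3.0f);
  return r;
}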
@h3r2tic
h3r2tic / restir-meets-surfel-lighting-breakdown.md
Created November 23, 2021 02:15
A quick breakdown of lighting in the `restir-meets-surfel` branch of my renderer

A quick breakdown of lighting in the restir-meets-surfel branch of my renderer, where I revive some olde surfel experiments, and generously sprinkle ReSTIR on top.

General remarks

Please note that this is all based on work-in-progress experimental software, and represents a single snapshot in development history. Things will certainly change 😛

Due to how I'm capturing this, there's frame-to-frame variability, e.g. different rays being shot, TAA shimmering slightly. Some of the images come from a dedicated visualization pass, and are anti-aliased, and some show internal buffers which are not anti-aliased.

Final images

@dondragmer
dondragmer / PrefixSort.compute
Created January 20, 2021 23:32
An optimized GPU counting sort
#pragma use_dxc //enable SM 6.0 features, in Unity this is only supported on version 2020.2.0a8 or later with D3D12 enabled
#pragma kernel CountTotalsInBlock
#pragma kernel BlockCountPostfixSum
#pragma kernel CalculateOffsetsForEachKey
#pragma kernel FinalSort
uint _FirstBitToSort;
int _NumElements;
int _NumBlocks;
bool _ShouldSortPayload;
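
The kernel names above trace the standard counting-sort pipeline: per-block digit histograms, a prefix sum over the block counts, per-key offsets, then a stable scatter. To make the data flow concrete, here is a single-threaded C reference of one such pass; the 4-bit digit width and all names are my choice, and the GPU version does the counting and scattering per block in parallel rather than in one loop.

#include <stdint.h>
#include <stddef.h>

#define RADIX 16u  /* sort 4 bits per pass */

/* One counting-sort pass over the digit starting at firstBit
   (the role _FirstBitToSort plays above): count, prefix-sum, scatter. */
static void counting_sort_pass(const uint32_t *keysIn, uint32_t *keysOut,
                               size_t numElements, uint32_t firstBit)
{
    size_t count[RADIX] = {0};

    /* 1. histogram of digit values (CountTotalsInBlock) */
    for (size_t i = 0; i < numElements; ++i)
        count[(keysIn[i] >> firstBit) & (RADIX - 1u)]++;

    /* 2. exclusive prefix sum turns counts into start offsets
          (BlockCountPostfixSum / CalculateOffsetsForEachKey) */
    size_t offset[RADIX], running = 0;
    for (uint32_t d = 0; d < RADIX; ++d) { offset[d] = running; running += count[d]; }

    /* 3. stable scatter into the output (FinalSort) */
    for (size_t i = 0; i < numElements; ++i) {
        uint32_t d = (keysIn[i] >> firstBit) & (RADIX - 1u);
        keysOut[offset[d]++] = keysIn[i];
    }
}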
#pragma once
#include <stdint.h>
#include <string.h>
#define WAVELET_DIM 512
extern void wavelet_forward_2d(uint8_t *mat, size_t N);
extern void wavelet_inverse_2d(uint8_t *mat, size_t N);
@binaryfoundry
binaryfoundry / wavelets.js
Last active April 25, 2020 17:45
Haar Wavelets
// http://bearcave.com/misl/misl_tech/wavelets/index.html
class WaveletBase {
constructor() {
this.forward = 1;
this.inverse = 2;
}
split(vec, N) {
var half = N >> 1;
var vc = vec.slice();
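
For readers new to Haar wavelets: one analysis step replaces each pair of samples with their average (the smooth half) and half their difference (the detail half), and the forward transform repeats this on the shrinking smooth half. A minimal C sketch of one such step (my own names and normalization), separate from the truncated JS class above:

#include <stddef.h>

/* One Haar analysis step on the first n elements of vec (n a power of two).
   tmp must hold at least n doubles. */
static void haar_forward_step(double *vec, double *tmp, size_t n)
{
    size_t half = n >> 1;
    for (size_t i = 0; i < half; ++i) {
        tmp[i]        = 0.5 * (vec[2 * i] + vec[2 * i + 1]);  /* smooth */
        tmp[half + i] = 0.5 * (vec[2 * i] - vec[2 * i + 1]);  /* detail */
    }
    for (size_t i = 0; i < n; ++i)
        vec[i] = tmp[i];
}

/* Full forward transform: for (size_t n = N; n >= 2; n >>= 1)
       haar_forward_step(vec, tmp, n);   */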
@sebbbi
sebbbi / BetterBuffers.txt
Created February 28, 2019 05:04
Better buffers
All current buffer types in shading languages are slightly different ways to present homogeneous arrays (single struct or type repeating N times in memory).
DirectX has raw buffers (RWByteAddressBuffer), but those are limited to 32-bit integer types, and the implementation doesn't require natural alignment for wide loads, resulting in suboptimal codegen on Nvidia GPUs.
Complex use cases, such as tree traversal in spatial data structures (physics, ray tracing, etc.), require a data structure that is non-homogeneous. You want different node payloads and a tight memory layout.
The ability to mix 8/16/32-bit data types and 1d/2d/4d vectors in the same data structure, to facilitate GPU wide loads (max bandwidth), is crucial for complex use cases like this.
On the other hand we want better, more readable/maintainable code syntax than DirectX raw buffers, without manual bit packing/extracting and reinterpret casting. The goal should be to allow modern GPUs to use sub-register addressing (SDWA on AMD hardware). Saving both ALU and register
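
To make the kind of layout being argued for concrete, here is a hypothetical non-homogeneous tree node with mixed 8/16/32-bit fields written as a plain C struct: the whole node is 16 naturally aligned bytes, so it could be fetched as a single wide load, but reading it through today's RWByteAddressBuffer would require manual uint loads plus bit extraction, which is exactly the ergonomics problem described above.

#include <stdint.h>

/* Hypothetical BVH-style node, 16 bytes, naturally aligned for a single
   128-bit load. Field widths are illustrative, not from any real engine. */
typedef struct {
    uint32_t child_or_prim;   /* 32-bit: first child index or primitive offset */
    uint8_t  bbox_min[3];     /* 3 x 8-bit quantized box minimum */
    uint8_t  bbox_max[3];     /* 3 x 8-bit quantized box maximum */
    uint16_t prim_count;      /* 16-bit: primitive count for leaves */
    uint32_t parent;          /* 32-bit: parent index */
} TreeNode;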
@syoyo
syoyo / gist:831c4b1926aa88c0da9221211723da2d
Created December 17, 2018 13:12
C++ implementation of "A simple method to construct isotropic quasirandom blue noise point sequences"
//
// C++ implementation of "A simple method to construct isotropic quasirandom blue
// noise point sequences"
//
// http://extremelearning.com.au/a-simple-method-to-construct-isotropic-quasirandom-blue-noise-point-sequences/
//
// Assume 0 <= x
static double myfmod(double x) { return x - std::floor(x); }
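
The construction in the linked post starts from the R2 additive recurrence built on the plastic number and then adds random jitter whose magnitude shrinks roughly like 1/sqrt(n) to get blue-noise spacing. A minimal C sketch of the unjittered R2 part (constants from the post; the function name is mine, and the fractional part plays the role of myfmod above):

#include <math.h>

/* n-th point of the R2 low-discrepancy sequence in the unit square.
   g ~ 1.32471795724474602596 is the plastic number, the real root of
   g^3 = g + 1; a1 = 1/g, a2 = 1/g^2. */
static void r2_point(unsigned n, double *x, double *y)
{
    const double a1 = 0.7548776662466927;
    const double a2 = 0.5698402909980532;
    *x = 0.5 + n * a1; *x -= floor(*x);
    *y = 0.5 + n * a2; *y -= floor(*y);
}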
@sebbbi
sebbbi / FastUniformLoadWithWaveOps.txt
Last active May 11, 2024 07:37
Fast uniform load with wave ops (up to 64x speedup)
In shader programming, you often run into a problem where you want to iterate an array in memory over all pixels in a compute shader
group (tile). Tiled deferred lighting is the most common case: an 8x8 tile loops over a light list culled for that tile.
Simplified HLSL code looks like this:
Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
RWTexture2D<float4> output;
[numthreads(8, 8, 1)]
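
The trick the title refers to: every thread in the 8x8 tile iterates the same per-tile light list, so lights can be pulled in wave-sized chunks, with each lane loading one light and the whole wave then walking the chunk via broadcasts (WaveReadLaneAt in SM 6.0 HLSL); identical per-lane vector loads become one scalar load per light, which is where the up-to-64x figure comes from on 64-wide waves. Below is a CPU-side C sketch of just that access pattern, with the wave modeled as a plain array and all names mine:

#define WAVE 64   /* lanes per wave (64 on GCN, hence the 64x bound) */

typedef struct { float x, y, z, w; } LightData;

/* Chunked iteration over a tile's light list. On the GPU, chunk[lane] is a
   per-lane register filled by lane "lane", and chunk[i] inside the inner
   loop is a wave broadcast (WaveReadLaneAt), not a memory access. */
static void shade_tile(const LightData *lights, int lightStart, int lightCount)
{
    for (int base = 0; base < lightCount; base += WAVE) {
        int n = lightCount - base < WAVE ? lightCount - base : WAVE;
        LightData chunk[WAVE];
        for (int lane = 0; lane < n; ++lane)          /* cooperative load */
            chunk[lane] = lights[lightStart + base + lane];
        for (int i = 0; i < n; ++i) {
            LightData light = chunk[i];               /* broadcast on GPU */
            (void)light;                              /* ...accumulate lighting... */
        }
    }
}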
@zeux
zeux / cone-culling-experiments.log
Last active February 19, 2024 08:38
Comparison of backface culling efficiency for cluster cone culling with 64-triangle clusters and triangle mask culling (6 64-bit masks per cluster).
Algorithms used for Cone* preprocess the mesh in some way, then split sequentially into 64-triangle clusters:
ConeBase: optimize mesh for transform cache
ConeSort: split mesh into large planar connected clusters, bin clusters into 6 buckets by cardinal axes, optimize each bucket for transform cache
ConeAcmr: optimize mesh for transform cache, split sequentially into variable length clusters that are relatively planar, sort clusters by avg normal
ConeCash: optimize mesh for transform cache, picking triangles that reduce ACMR but prioritizing those that keep current cluster planar
MaskBase: split sequentially into 64-triangle clusters, store a 64-bit conservative triangle mask for 6 frustums (cube faces)
ManyConeN: split sequentially into 64-triangle clusters, store N (up to 4) cones for each cluster and a cone id per triangle (2 bit)
Note that all Cone* solutions get significantly worse results with 128 or 256 triangle clusters; it doesn't matter much for Mask.
The biggest challenge with Cone* algorithms is t
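
For reference, the per-cluster rejection that all the Cone* variants feed is a common cone-vs-camera test: a cluster can be skipped when its normal cone faces away from the camera across its whole bounding sphere. A minimal C sketch of that test in the formulation meshoptimizer documents (names mine, not the log's code):

#include <math.h>

typedef struct { float x, y, z; } vec3;

/* axis/cutoff describe the cluster's normal cone (cutoff = cos of the cone
   half-angle, made conservative in preprocessing); center/radius bound the
   cluster in space. Returns nonzero when the cluster can be culled. */
static int cluster_cone_cull(vec3 center, float radius, vec3 axis, float cutoff, vec3 camera)
{
    vec3  d    = { center.x - camera.x, center.y - camera.y, center.z - camera.z };
    float dist = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
    float dotp = d.x * axis.x + d.y * axis.y + d.z * axis.z;
    return dotp >= cutoff * dist + radius;
}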