reinsteam

## FastUniformLoadWithWaveOps.txt
In shader programming, you often need every pixel in a compute shader group (tile) to iterate over the same array in memory.
Tiled deferred lighting is the most common case: each 8x8 tile loops over a light list culled for that tile.

Simplified HLSL code looks like this:

Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
RWTexture2D<float4> output;

[numthreads(8, 8, 1)]
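
The excerpt stops at the thread-group attribute, so the shader body is not shown here. As a rough CPU-side illustration of the access pattern described above (not the gist's shader), the sketch below has every pixel of an 8x8 tile walk the same slice of a light array. The names `accumulate_tile`, `light_start` and `light_count` are hypothetical, and the "lighting" is just a sum so the example stays small.

```c
/* Minimal stand-in for HLSL's float4. */
typedef struct { float x, y, z, w; } float4;

enum { TILE_DIM = 8, TILE_PIXELS = TILE_DIM * TILE_DIM };

/* Hypothetical CPU analogue of a tiled light loop: all 64 pixels of the
   tile read the same range [light_start, light_start + light_count). */
static void accumulate_tile(const float4 *light_datas,
                            unsigned light_start, unsigned light_count,
                            float4 out_tile[TILE_PIXELS])
{
    for (int pixel = 0; pixel < TILE_PIXELS; ++pixel) {
        float4 sum = { 0.0f, 0.0f, 0.0f, 0.0f };
        for (unsigned i = 0; i < light_count; ++i) {
            /* The address depends only on the tile, never on the pixel. */
            const float4 light = light_datas[light_start + i];
            sum.x += light.x;
            sum.y += light.y;
            sum.z += light.z;
            sum.w += light.w;
        }
        out_tile[pixel] = sum;
    }
}
```

Because the load address is identical for every pixel in the tile, the load is uniform across the group, which is the property the gist's title refers to.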

## gist:6ce04569f213f3dc987b9274cdd677c8
http://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/6

"The High-Level Zen Overview"

- "Features such as the micro-op cache help most instruction streams improve in performance and bypass parts of potentially
  long-cycle repetitive operations, but also the larger dispatch, larger retire, larger schedulers and better branch
  prediction means that higher throughput can be maintained longer and in the fastest order possible."

  Micro-op caches have nothing to do with "bypassing parts of potentially long-cycle repetitive operations" (what does
  that even mean?). They reduce decode bottlenecks and decrease power consumption. Depending on the implementation,

## bluenoise.md

      
pixelmager / bluenoise.md, last active October 11, 2023

Blue Noise links

Use cases


Bluenoise in the game INSIDE (dithering, raymarching, reflections)
Dithering, Ray marching, shadows etc
A Survey of Blue Noise and Its Applications

Textures/Matrices for direct use (data!)


Moments In Graphics (void-and-cluster)

2D
3D and 4D


Bart Wronski Implementation of Solid Angle algorithm


## Tex2DCatmullRom.hlsl
// The following code is licensed under the MIT license: https://gist.github.com/TheRealMJP/bc503b0b87b643d3505d41eab8b332ae

// Samples a texture with Catmull-Rom filtering, using 9 texture fetches instead of 16.
// See http://vec3.ca/bicubic-filtering-in-fewer-taps/ for more details
float4 SampleTextureCatmullRom(in Texture2D<float4> tex, in SamplerState linearSampler, in float2 uv, in float2 texSize)
{
    // We're going to sample a 4x4 grid of texels surrounding the target UV coordinate. We'll do this by rounding
    // down the sample location to get the exact center of our "starting" texel. The starting texel will be at
    // location [1, 1] in the grid, where [0, 0] is the top left corner.
    float2 samplePos = uv * texSize;
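
The preview ends before the filter weights are computed. As a hedged, one-dimensional sketch of the two steps the comments describe, the C code below finds the center of the "starting" texel and evaluates the standard Catmull-Rom weights for the four taps around it. The function names are hypothetical, and the truncation-based rounding assumes a sample position of at least 0.5; it is not taken from the gist.

```c
/* Center of the "starting" texel (grid location 1) for one axis.
   Assumption for this sketch: sample_pos >= 0.5, so that truncation
   toward zero behaves like floor(). */
static float starting_texel_center(float sample_pos)
{
    return (float)(int)(sample_pos - 0.5f) + 0.5f;
}

/* Standard Catmull-Rom weights for taps at offsets -1, 0, +1, +2 from
   the starting texel, where f in [0, 1) is the fractional offset from
   its center. The four weights always sum to 1. */
static void catmull_rom_weights(float f, float w[4])
{
    w[0] = f * (-0.5f + f * (1.0f - 0.5f * f));
    w[1] = 1.0f + f * f * (-2.5f + 1.5f * f);
    w[2] = f * (0.5f + f * (2.0f - 1.5f * f));
    w[3] = f * f * (-0.5f + 0.5f * f);
}
```

In 2D the same weights are evaluated once per axis and multiplied together, which gives the 4x4 grid mentioned in the comment.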

## abi_glue.inc
; This is intended for long-running leaf funcs that don't use XMM registers,
; and just saves all callee-save registers regardless of whether they're used
; or not.

; detect some parameters from output format
%ifidn __OUTPUT_FORMAT__,win32
        %define resp resd
        %define LEADING_UNDERSCORES
        %define CALLEE_SAVE_GPRS ebp,ebx,esi,edi
        %define BYTES_PER_ARG 4

## gist:2144712
// half->float variants.
// by Fabian "ryg" Giesen.
//
// I hereby place this code in the public domain.
//
// half_to_float_fast: table based
// tables could be done in a more compact fashion (in particular, can store tab2 in low word of tab1!)
// but something of a dead end since not very SIMD-friendly. pretty much abandoned at this point.
//
// half_to_float_fast2: use FP adder hardware to deal with denormals.
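
The listing is cut off before any code appears. As an illustration of what "use FP adder hardware to deal with denormals" can look like, here is a small C sketch: a half denormal with mantissa m has the value m * 2^-24, so placing m in the mantissa of 0.5f and then subtracting 0.5f lets the floating-point subtract do the renormalization. The function name `half_to_float_sketch` and the helper names are hypothetical; this is written from the description above, not copied from the gist.

```c
#include <stdint.h>
#include <string.h>

/* Bit casts via memcpy to stay within defined C behavior. */
static float bits_to_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }
static uint32_t float_to_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }

/* Hypothetical sketch: convert IEEE 754 binary16 bits to a float. */
static float half_to_float_sketch(uint16_t h)
{
    const uint32_t sign     = (uint32_t)(h & 0x8000u) << 16;
    const uint32_t exponent = (h >> 10) & 0x1fu;
    const uint32_t mantissa = h & 0x3ffu;
    uint32_t bits;

    if (exponent == 0) {
        /* Zero or denormal: 0.5f with m in its low mantissa bits equals
           0.5 + m * 2^-24, so subtracting 0.5f leaves m * 2^-24. */
        const uint32_t magic = 126u << 23; /* bit pattern of 0.5f */
        bits = float_to_bits(bits_to_float(magic + mantissa) - bits_to_float(magic));
    } else if (exponent == 0x1fu) {
        /* Inf or NaN: all-ones exponent, keep the mantissa payload. */
        bits = (255u << 23) | (mantissa << 13);
    } else {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = ((exponent + (127u - 15u)) << 23) | (mantissa << 13);
    }
    return bits_to_float(bits | sign);
}
```

The subtraction result is always a normal float or zero, so this path does not depend on the FPU handling float denormals.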

## fp16_to_32.asm
; input: 4x F16 in XMM0 (low words of each DWord)
; original idea+implementation by Dean Macri

; WARNING: copy & pasted together from other code, this ver is untested!!
; though the original version was definitely correct.

bits 32

section .data