Sebastian Aaltonen sebbbi

## FastUniformLoadWithWaveOps.txt
In shader programming, you often run into a problem where every pixel in a compute shader
group (tile) needs to iterate the same array in memory. Tiled deferred lighting is the most common case: each 8x8 tile loops over a light list culled for that tile.

Simplified HLSL code looks like this:

Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
RWTexture2D<float4> output;

[numthreads(8, 8, 1)]

## SinglePassMipPyramid.hlsl
// NOTE: Must bind 8x single mip RWTexture views, because HLSL doesn't have .mips member for RWTexture2D. (SRVs only have .mips member)
// NOTE: globallycoherent attribute is needed. Without it writes aren't guaranteed to be seen by other groups
globallycoherent RWTexture2D<float> MipTextures[8];
RWTexture2D<uint> Counters[8];
groupshared uint CounterReturnLDS;

[numthreads(16, 16, 1)]
void GenerateMipPyramid(uint3 Tid : SV_DispatchThreadID, uint3 Group : SV_GroupId, uint Gix : SV_GroupIndex)
{
	[unroll]

## ConeTraceAnalytic.txt
The analytic solution for a spherical cap cone is a 1d problem, since the cone cap sphere slides along the ray. The intersection point with the empty space sphere is always on the ray.

S : radius of cone cap sphere at t=1
r(d) : cone cap sphere radius at distance d

r(d) = d*S

p = distance of current SDF sample
SDF(p) = sdf function result at location p
x = distance after conservative step

## BetterBuffers.txt
All current buffer types in shading languages are slightly different ways to present homogeneous arrays (single struct or type repeating N times in memory).

DirectX has raw buffers (RWByteAddressBuffer), but they are limited to 32-bit integer types, and the implementation doesn't require natural alignment for wide loads, resulting in suboptimal codegen on Nvidia GPUs.

Complex use cases, such as tree traversal in spatial data structures (physics, ray tracing, etc.), require a non-homogeneous data structure. You want different node payloads and a tight memory layout.

The ability to mix 8/16/32-bit data types and 1d/2d/4d vectors in the same data structure, to facilitate GPU wide loads (max bandwidth), is crucial for complex use cases like this.
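
To make the cost of mixed-width data concrete, here is a minimal C++ sketch of the manual bit packing that a 32-bit-only raw load forces on you. The node layout (a 16-bit child offset, an 8-bit child count and 8-bit flags sharing one 32-bit word) and the names Node, UnpackNode and PackNode are hypothetical, chosen only for illustration; they are not from the original notes.

```cpp
#include <cstdint>

// Hypothetical tree node header packed into a single 32-bit word:
// bits 0..15 = childOffset, bits 16..23 = childCount, bits 24..31 = flags.
struct Node {
    uint16_t childOffset;
    uint8_t  childCount;
    uint8_t  flags;
};

// Manual extraction: one 32-bit load, then shift and mask per field.
constexpr Node UnpackNode(uint32_t word) {
    return Node{ uint16_t(word & 0xFFFFu),
                 uint8_t((word >> 16) & 0xFFu),
                 uint8_t(word >> 24) };
}

// Manual packing: the inverse of UnpackNode.
constexpr uint32_t PackNode(Node n) {
    return uint32_t(n.childOffset)
         | (uint32_t(n.childCount) << 16)
         | (uint32_t(n.flags) << 24);
}
```

Every field access costs extra shift and mask instructions here, which is the kind of boilerplate a buffer type with native 8/16-bit members could express directly.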

On the other hand, we want better, more readable and maintainable code syntax than DirectX raw buffers, without manual bit packing/extracting and reinterpret casting. The goal should be to allow modern GPUs to use sub-register addressing (SDWA on AMD hardware). Saving both ALU and register

## lidia.txt
i10 jab / punish:
1,2,2: mid NC, -10 block, +8 CFT hit
1,2,4: low, -13 block, +3 hit. CH: ff3 followup = 34 dmg

+8 CFT mixup:
1: mid i20, -9 block, hit KD, CH ff1+2 followup = 59 dmg
4: low i19, -26 block, ff3 followup = 26 dmg

ff2 i14 long range mid:
block: -2 pushback -> backdash

## jack.txt
Top moves:
2: i11 high, +1 block, +9 hit, 10 dmg, followup (NC): (1) mid, -2 block, +3 hit, NC = 22 dmg
f2: i10 high, -12 block, +5 hit, 17 dmg, CH followups: ff1 = 40 dmg, ff3 = 48 dmg, f3+4 = 42 dmg
f1: i14 mid, -6 block, +5 hit, 15 dmg, followup (NC): (1) high, -7 block, NC = 40 dmg
df1: i14 mid, -4 block, +3 hit, 12 dmg, followups (CH NC): (2,1) delayable high,high, launch, (1) mid, -12 block. CH NC = 55 dmg
db1: i12 low, -12 block, +2 hit, 13 dmg
df2: i15 mid, -14 block (safe tip range), launch
f1+2: i15 mid, -19 block (pushback), wall bounce
db,d,df1: i24 low, high crush, -37 block, 30 dmg
FC db1: i12 low, high crush, -8 block, +6 hit, 15 dmg

## asuka.txt
Top moves (close):
1,2,4/3: i10 high, -2 block, followups: mid(-8 push), mid(-12 push), low(-11 / 0 hit)
1+2: i16 mid, -9 block, launch
df1,2/4: i13 mid,high/mid, -3 block, followups: high(-1), mid(-12)
df2: i15 mid, -6 block, launch (no crouch)
d1+2: i20 low, -18 block, high crush, 36 damage minicombo (d2, f2)
d3+4: i14 low,high, -6 block (push), low crush, CH launch
db1,2: i14 mid,high, -9 block, high crush, followup CH launch
db2: i20 mid, -11 block, high crush, launch
b2,1,4/4/1+2,4: i15 mid, -4 block, followups: mid,low/high (-7,-6 push), low(-11), high,mid (-9,-13)

## leroy.txt
Parry:

b2 mid+high:
3 frame startup and can interrupt many (non-NC) strings.
1 or 2 followup = 30 damage
Against slow recovery moves can launch with b3 or uf4.

3+4 (Hermit stance) low:
3 frame startup and can interrupt many (non-NC) strings. Hermit string transitions parry dick jab even at -9.
4,1+2 followup = 56 damage

## 5600x.txt
Source: https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested

Format:
TestName (lower = better): 3700X -> 5600X (performance difference)
Less than 1% difference = tie

Office and Science
Agisoft Photoscan (lower = better): 2377 -> 2133 (+11.4%)
GIMP (lower = better): 20.72 -> 17.15 (+20.8%)
3D particle movement non-AVX: 2768 -> 2452 (-11.4%)
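
The percentages above can be reproduced with a small helper. This is a sketch under an assumption: for "lower = better" tests the figures match old/new - 1, while the unlabeled 3D particle movement line matches new/old - 1, which suggests it is a higher-is-better score. The function name PercentDifference and the lowerIsBetter flag are hypothetical.

```cpp
// Relative performance difference in percent (positive = the new CPU is faster).
// lowerIsBetter: true for time-like results, false for score-like results.
constexpr double PercentDifference(double oldResult, double newResult, bool lowerIsBetter) {
    return (lowerIsBetter ? oldResult / newResult
                          : newResult / oldResult) * 100.0 - 100.0;
}
```

For example, PercentDifference(2377.0, 2133.0, true) evaluates to roughly 11.4, matching the Agisoft Photoscan line.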

## fast_spheres.txt
Setup:
1. Index buffer containing N quads (each 2 triangles), where N is the maximum number of spheres. Repeating pattern of {0,1,2,1,3,2} + K*4, where K is the quad index.
2. No vertex buffer.

Render N*2 triangles, where N is the number of spheres you have.

Vertex shader:
1. Sphere index = V/4 (V = SV_VertexId)
2. Quad coord: Q = float2(V%2, (V%4)/2) * 2.0 - 1.0
3. Transform sphere center -> pos
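
The index pattern and the first two vertex shader steps can be checked on the CPU. The following is a minimal C++ sketch that mirrors the integer math only; the names QuadIndex, SphereIndex, QuadCoord and Float2 are hypothetical, and the actual shader would do the same arithmetic in HLSL on SV_VertexId.

```cpp
#include <cstdint>

struct Float2 { float x, y; };

// Per-quad index pattern: two triangles sharing the 1-2 edge.
constexpr uint32_t kQuadPattern[6] = { 0, 1, 2, 1, 3, 2 };

// Value stored at position i of the index buffer: pattern + K*4, K = quad index.
constexpr uint32_t QuadIndex(uint32_t i) {
    return kQuadPattern[i % 6] + (i / 6) * 4;
}

// Step 1: four vertices per sphere, so the sphere index is vertexId / 4.
constexpr uint32_t SphereIndex(uint32_t vertexId) {
    return vertexId / 4;
}

// Step 2: map the corner number (vertexId % 4) to a [-1, 1] quad coordinate.
constexpr Float2 QuadCoord(uint32_t vertexId) {
    return Float2{ float(vertexId % 2) * 2.0f - 1.0f,
                   float((vertexId % 4) / 2) * 2.0f - 1.0f };
}
```

Corners 0 to 3 of each quad map to (-1,-1), (1,-1), (-1,1) and (1,1), so the {0,1,2} and {1,3,2} triangles together cover the full quad.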