@sebbbi
sebbbi / FastUniformLoadWithWaveOps.txt
Last active February 15, 2024 08:41
Fast uniform load with wave ops (up to 64x speedup)
In shader programming, you often run into a problem where every pixel in a compute shader group (tile) needs to iterate
over the same array in memory. Tiled deferred lighting is the most common case: each 8x8 tile loops over a light list culled for that tile.
Simplified HLSL code looks like this:
Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
RWTexture2D<float4> output;
[numthreads(8, 8, 1)]
http://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/6
"The High-Level Zen Overview"
- "Features such as the micro-op cache help most instruction streams improve in performance and bypass parts of potentially
long-cycle repetitive operations, but also the larger dispatch, larger retire, larger schedulers and better branch
prediction means that higher throughput can be maintained longer and in the fastest order possible."
Micro-op caches have nothing to do with "bypassing parts of potentially long-cycle repetitive operations" (what does
that even mean?). They reduce decode bottlenecks and decrease power consumption. Depending on the implementation,
@TheRealMJP
TheRealMJP / Tex2DCatmullRom.hlsl
Last active April 9, 2024 08:41
An HLSL function for sampling a 2D texture with Catmull-Rom filtering, using 9 texture samples instead of 16
// The following code is licensed under the MIT license: https://gist.github.com/TheRealMJP/bc503b0b87b643d3505d41eab8b332ae
// Samples a texture with Catmull-Rom filtering, using 9 texture fetches instead of 16.
// See http://vec3.ca/bicubic-filtering-in-fewer-taps/ for more details
float4 SampleTextureCatmullRom(in Texture2D<float4> tex, in SamplerState linearSampler, in float2 uv, in float2 texSize)
{
// We're going to sample a 4x4 grid of texels surrounding the target UV coordinate. We'll do this by rounding
// down the sample location to get the exact center of our "starting" texel. The starting texel will be at
// location [1, 1] in the grid, where [0, 0] is the top left corner.
float2 samplePos = uv * texSize;
@rygorous
rygorous / abi_glue.inc
Created May 2, 2016 22:11
NASM glue to deal with ABI diffs (in particular, Win64 UNWIND_INFO)
; This is intended for long-running leaf funcs that don't use XMM registers,
; and just saves all callee-save registers regardless of whether they're used
; or not.
; detect some parameters from output format
%ifidn __OUTPUT_FORMAT__,win32
%define resp resd
%define LEADING_UNDERSCORES
%define CALLEE_SAVE_GPRS ebp,ebx,esi,edi
%define BYTES_PER_ARG 4
@rygorous
rygorous / gist:2144712
Created March 21, 2012 05:20
half->float variants
// half->float variants.
// by Fabian "ryg" Giesen.
//
// I hereby place this code in the public domain.
//
// half_to_float_fast: table based
// tables could be done in a more compact fashion (in particular, can store tab2 in low word of tab1!)
// but something of a dead end since not very SIMD-friendly. pretty much abandoned at this point.
//
// half_to_float_fast2: use FP adder hardware to deal with denormals.
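The "use FP adder hardware to deal with denormals" idea can be sketched in C as below. This is an illustrative sketch, not the code from the gist: the helper names `bits_to_float` and `float_to_bits` and the exact structure are assumptions made for this example. Normal halfs only need an exponent re-bias; for zero and denormal inputs, the bits are placed under an exponent of 2^-14 and one floating-point subtraction of 2^-14 renormalizes the result.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: reinterpret bits without violating aliasing rules. */
static float bits_to_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }
static uint32_t float_to_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }

/* Convert an IEEE 754 binary16 value (given as raw bits) to float. */
static float half_to_float(uint16_t h)
{
    const uint32_t shifted_exp = 0x7c00u << 13;  /* half exponent mask, in float position */
    uint32_t o = (uint32_t)(h & 0x7fffu) << 13;  /* exponent and mantissa bits */
    uint32_t exp = o & shifted_exp;              /* just the exponent */

    o += (127u - 15u) << 23;                     /* re-bias exponent from 15 to 127 */

    if (exp == shifted_exp) {
        /* Inf/NaN: push the exponent all the way up to 255. */
        o += (128u - 16u) << 23;
    } else if (exp == 0) {
        /* Zero/denormal: bump the exponent to 113 (value 2^-14 * (1 + m)),
           then subtract 2^-14 so the FP adder does the renormalization. */
        o += 1u << 23;
        o = float_to_bits(bits_to_float(o) - bits_to_float(113u << 23));
    }

    o |= (uint32_t)(h & 0x8000u) << 16;          /* sign bit */
    return bits_to_float(o);
}
```

The branches make this scalar version easy to follow; a SIMD variant would typically replace them with masks and selects.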
@rygorous
rygorous / fp16_to_32.asm
Created March 21, 2012 04:37
half->float using SSE2
; input: 4x F16 in XMM0 (low words of each DWord)
; original idea+implementation by Dean Macri
; WARNING: copy & pasted together from other code, this ver is untested!!
; though the original version was definitely correct.
bits 32
section .data