
# TheRealMJP/Tex2DCatmullRom.hlsl

Last active April 9, 2024 08:41
An HLSL function for sampling a 2D texture with Catmull-Rom filtering, using 9 texture samples instead of 16
// The following code is licensed under the MIT license: https://gist.github.com/TheRealMJP/bc503b0b87b643d3505d41eab8b332ae

// Samples a texture with Catmull-Rom filtering, using 9 texture fetches instead of 16.
// See http://vec3.ca/bicubic-filtering-in-fewer-taps/ for more details
float4 SampleTextureCatmullRom(in Texture2D tex, in SamplerState linearSampler, in float2 uv, in float2 texSize)
{
    // We're going to sample a 4x4 grid of texels surrounding the target UV coordinate. We'll do this by rounding
    // down the sample location to get the exact center of our "starting" texel. The starting texel will be at
    // location [1, 1] in the grid, where [0, 0] is the top left corner.
    float2 samplePos = uv * texSize;
    float2 texPos1 = floor(samplePos - 0.5f) + 0.5f;

    // Compute the fractional offset from our starting texel to our original sample location, which we'll
    // feed into the Catmull-Rom spline function to get our filter weights.
    float2 f = samplePos - texPos1;

    // Compute the Catmull-Rom weights using the fractional offset that we calculated earlier.
    // These equations are pre-expanded based on our knowledge of where the texels will be located,
    // which lets us avoid having to evaluate a piece-wise function.
    float2 w0 = f * (-0.5f + f * (1.0f - 0.5f * f));
    float2 w1 = 1.0f + f * f * (-2.5f + 1.5f * f);
    float2 w2 = f * (0.5f + f * (2.0f - 1.5f * f));
    float2 w3 = f * f * (-0.5f + 0.5f * f);

    // Work out weighting factors and sampling offsets that will let us use bilinear filtering to
    // simultaneously evaluate the middle 2 samples from the 4x4 grid.
    float2 w12 = w1 + w2;
    float2 offset12 = w2 / (w1 + w2);

    // Compute the final UV coordinates we'll use for sampling the texture
    float2 texPos0 = texPos1 - 1;
    float2 texPos3 = texPos1 + 2;
    float2 texPos12 = texPos1 + offset12;

    texPos0 /= texSize;
    texPos3 /= texSize;
    texPos12 /= texSize;

    float4 result = 0.0f;
    result += tex.SampleLevel(linearSampler, float2(texPos0.x, texPos0.y), 0.0f) * w0.x * w0.y;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos0.y), 0.0f) * w12.x * w0.y;
    result += tex.SampleLevel(linearSampler, float2(texPos3.x, texPos0.y), 0.0f) * w3.x * w0.y;

    result += tex.SampleLevel(linearSampler, float2(texPos0.x, texPos12.y), 0.0f) * w0.x * w12.y;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos12.y), 0.0f) * w12.x * w12.y;
    result += tex.SampleLevel(linearSampler, float2(texPos3.x, texPos12.y), 0.0f) * w3.x * w12.y;

    result += tex.SampleLevel(linearSampler, float2(texPos0.x, texPos3.y), 0.0f) * w0.x * w3.y;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos3.y), 0.0f) * w12.x * w3.y;
    result += tex.SampleLevel(linearSampler, float2(texPos3.x, texPos3.y), 0.0f) * w3.x * w3.y;

    return result;
}
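
A short note on why `w12` and `offset12` let one bilinear fetch stand in for the two middle texels, written out in one dimension. The symbols $T_1$ and $T_2$ (the two middle texels of a row) and $t$ (the fractional distance of the fetch from the center of $T_1$) are introduced only for this note:

```latex
\begin{aligned}
\mathrm{bilinear}(t) &= (1 - t)\,T_1 + t\,T_2, \qquad 0 \le t \le 1 \\
t = \frac{w_2}{w_1 + w_2} \;&\Rightarrow\; (w_1 + w_2)\,\mathrm{bilinear}(t) = w_1 T_1 + w_2 T_2
\end{aligned}
```

For $f \in [0, 1)$ both $w_1$ and $w_2$ are non-negative and $w_1 + w_2 \ge 1$, so $t$ stays in range and the division is safe. The outer positions `texPos0` and `texPos3` sit exactly on texel centers, where a linear sampler returns the texel unchanged, so each axis needs 3 fetch positions instead of 4, giving 3 × 3 = 9 fetches. The equality holds up to the precision of the hardware's filtering weights.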

### TheRealMJP commented Sep 21, 2016 • edited

Some quick benchmark results with an R9 380:

All of these numbers were gathered by using the above code for reprojecting the previous frame's result for the purpose of TAA, using the following shader: https://github.com/TheRealMJP/MSAAFilter/blob/master/MSAAFilter/Resolve.hlsl.

The 9-tap version uses the above code as-is. The 1-tap version just uses bilinear filtering and is included for reference. The 16-tap version uses a modified version of the above function that performs 16 texture loads, with no sampling or filtering (I didn't use the filtering code that's checked in for that file, which has several branches for choosing the filter kernel and a few other options). The 5-tap version is the same as the above except that it omits the corner taps, as suggested by Jorge Jimenez in his SIGGRAPH 2016 presentation about Filmic SMAA: http://advances.realtimerendering.com/s2016/Filmic%20SMAA%20v7.pptx

| Resolution | MSAA Level | 1 tap | 9 taps | 16 taps | 5 taps |
|---|---|---|---|---|---|
| 1600x1200 | 1x MSAA | 0.40ms | 0.53ms | 0.58ms | 0.47ms |
| 1600x1200 | 4x MSAA | 2.40ms | 2.47ms | 2.50ms | 2.45ms |

Here's some more timings captured with a GTX 980:

| Resolution | MSAA Level | 1 tap | 9 taps | 16 taps | 5 taps |
|---|---|---|---|---|---|
| 1920x1080 | 1x MSAA | 0.30ms | 0.32ms | 0.34ms | 0.32ms |
| 1920x1080 | 4x MSAA | 0.92ms | 1.12ms | 1.18ms | 1.07ms |
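
The 16-tap variant used for these timings is not included in the gist. The following is only a sketch of what a version with 16 unfiltered loads might look like. The function name `SampleTextureCatmullRom16Tap` is hypothetical, and the clamping of coordinates at the texture edges is an assumption rather than something the benchmark description specifies.

```hlsl
// Hypothetical 16-tap reference: one unfiltered Load per texel of the 4x4 grid.
// Assumes mip level 0 and clamp-to-edge behavior at the texture borders.
float4 SampleTextureCatmullRom16Tap(in Texture2D tex, in float2 uv, in float2 texSize)
{
    float2 samplePos = uv * texSize;
    float2 texPos1 = floor(samplePos - 0.5f) + 0.5f;
    float2 f = samplePos - texPos1;

    // Same Catmull-Rom weights as the 9-tap function, indexed by grid column/row.
    float2 w[4];
    w[0] = f * (-0.5f + f * (1.0f - 0.5f * f));
    w[1] = 1.0f + f * f * (-2.5f + 1.5f * f);
    w[2] = f * (0.5f + f * (2.0f - 1.5f * f));
    w[3] = f * f * (-0.5f + 0.5f * f);

    // texPos1 is the center of grid texel [1, 1], so its integer coordinate is
    // texPos1 - 0.5. Step back by one texel to reach grid texel [0, 0].
    int2 basePos = int2(texPos1 - 0.5f) - 1;
    int2 maxPos = int2(texSize) - 1;

    float4 result = 0.0f;
    [unroll]
    for (int y = 0; y < 4; ++y)
    {
        [unroll]
        for (int x = 0; x < 4; ++x)
        {
            int2 coord = clamp(basePos + int2(x, y), int2(0, 0), maxPos);
            result += tex.Load(int3(coord, 0)) * w[x].x * w[y].y;
        }
    }

    return result;
}
```

Because it uses `Load` instead of a sampler, this variant takes no `SamplerState` argument and should match the 9-tap result up to the precision of the hardware's bilinear weights.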

### aras-p commented Sep 21, 2016

btw a coworker suggested this small optimization:

// get rid of f3, and:
float2 w0 = (1.0f / 2.0f) * f * (-1.0f + f * (2.0f - f));
float2 w1 = (1.0f / 6.0f) * f2 * (-15.0f + 9.0f * f) + 1.0f;
float2 w2 = (1.0f / 6.0f) * f * (3.0f + f * (12.0f - f * 9.0f));
float2 w3 = (1.0f / 2.0f) * f2 * (f - 1.0f);

Checking with Pyramid using AMDDXX for Bonaire target:
VGPRs: 51 -> 49
VALU: 147 -> 146

### pixelmager commented Sep 21, 2016 • edited

Alternatively, putting the polynomials straight into Horner form:

float2 w0 = f * ( -0.5 + f * (1.0 - 0.5*f));
float2 w1 = 1.0 + f * f * (-2.5 + 1.5*f );
float2 w2 = f * ( 0.5 + f * (2.0 - 1.5*f) );
float2 w3 = f * f * (-0.5 + 0.5 * f);

Pyramid, AMDDXX, Bonaire ( http://pastebin.com/12ccE9Lk )
VGPRs: 55 -> 47
VALU: 146 -> 135
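
For reference, multiplying out the Horner-form expressions above gives the following cubic polynomials in the fractional offset $f$, and the four weights sum to 1 for any $f$:

```latex
\begin{aligned}
w_0 &= -\tfrac{1}{2}f + f^2 - \tfrac{1}{2}f^3 \\
w_1 &= 1 - \tfrac{5}{2}f^2 + \tfrac{3}{2}f^3 \\
w_2 &= \tfrac{1}{2}f + 2f^2 - \tfrac{3}{2}f^3 \\
w_3 &= -\tfrac{1}{2}f^2 + \tfrac{1}{2}f^3 \\
w_0 + w_1 + w_2 + w_3 &= 1
\end{aligned}
```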

### TheRealMJP commented Sep 22, 2016

Thanks guys! I updated the code with the optimizations.

### dwulive commented Aug 22, 2022

If you are doing the filtering yourself and you want to use a linear buffer, you can use rawBuffer0.Load4().
Coherency might or might not be worse; it depends. Dynamic updates are usually easier.
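
A minimal sketch of what fetching a texel from such a linear buffer might look like. It assumes `rawBuffer0` is a `ByteAddressBuffer` holding tightly packed, row-major `float4` texels (16 bytes each). The helper name `LoadTexelFromBuffer` and the edge clamping are hypothetical and not taken from the comment above.

```hlsl
// Assumed layout: row-major float4 texels, 16 bytes each, no row padding.
ByteAddressBuffer rawBuffer0;

// Hypothetical helper: loads one texel by integer coordinate.
float4 LoadTexelFromBuffer(in int2 coord, in int2 texSize)
{
    // Clamp so that out-of-range grid texels reuse the nearest edge texel.
    coord = clamp(coord, int2(0, 0), texSize - 1);

    // Byte offset of the texel within the buffer.
    uint byteOffset = (uint(coord.y) * uint(texSize.x) + uint(coord.x)) * 16u;

    // Load4 returns four raw 32-bit values; reinterpret them as floats.
    return asfloat(rawBuffer0.Load4(byteOffset));
}
```

With a helper like this, the filter weights from `SampleTextureCatmullRom` would be applied manually to each loaded texel, since a buffer load involves no sampler and therefore no hardware bilinear filtering.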

### foxmalderalex commented Nov 6, 2022

For the 5-tap version, should we renormalize the weights?
float weight = w12.x * w0.y + w0.x * w12.y + w12.x * w12.y + w3.x * w12.y + w12.x * w3.y;
result /= weight;
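
One way the renormalized 5-tap variant might be written, as a sketch only. The function name `SampleTextureCatmullRom5Tap` and the local weight names are hypothetical, and this is not the code that was benchmarked above. Dropping the four corner taps removes their weights from the total, so the remaining five no longer sum to exactly 1. Dividing by their sum restores that, which keeps a constant-colored texture constant.

```hlsl
// Hypothetical 5-tap variant: the 9-tap function minus its four corner taps,
// with the remaining weights renormalized to sum to 1.
float4 SampleTextureCatmullRom5Tap(in Texture2D tex, in SamplerState linearSampler, in float2 uv, in float2 texSize)
{
    float2 samplePos = uv * texSize;
    float2 texPos1 = floor(samplePos - 0.5f) + 0.5f;
    float2 f = samplePos - texPos1;

    float2 w0 = f * (-0.5f + f * (1.0f - 0.5f * f));
    float2 w1 = 1.0f + f * f * (-2.5f + 1.5f * f);
    float2 w2 = f * (0.5f + f * (2.0f - 1.5f * f));
    float2 w3 = f * f * (-0.5f + 0.5f * f);

    float2 w12 = w1 + w2;
    float2 offset12 = w2 / w12;

    float2 texPos0 = (texPos1 - 1.0f) / texSize;
    float2 texPos3 = (texPos1 + 2.0f) / texSize;
    float2 texPos12 = (texPos1 + offset12) / texSize;

    // Weights of the five taps that form a plus shape around the center.
    float wTop = w12.x * w0.y;
    float wLeft = w0.x * w12.y;
    float wCenter = w12.x * w12.y;
    float wRight = w3.x * w12.y;
    float wBottom = w12.x * w3.y;

    float4 result = 0.0f;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos0.y), 0.0f) * wTop;
    result += tex.SampleLevel(linearSampler, float2(texPos0.x, texPos12.y), 0.0f) * wLeft;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos12.y), 0.0f) * wCenter;
    result += tex.SampleLevel(linearSampler, float2(texPos3.x, texPos12.y), 0.0f) * wRight;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos3.y), 0.0f) * wBottom;

    // Renormalize so the five weights sum to 1.
    return result / (wTop + wLeft + wCenter + wRight + wBottom);
}
```

Since the four weights sum to 1 along each axis, the five-tap total works out to `w12.x + w12.y - w12.x * w12.y`, which stays between roughly 0.984 and 1 for offsets in [0, 1). The division is therefore safe, and the correction it applies is small.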