An HLSL function for sampling a 2D texture with Catmull-Rom filtering, using 9 texture samples instead of 16
// The following code is licensed under the MIT license: https://gist.github.com/TheRealMJP/bc503b0b87b643d3505d41eab8b332ae
// Samples a texture with Catmull-Rom filtering, using 9 texture fetches instead of 16.
// See http://vec3.ca/bicubic-filtering-in-fewer-taps/ for more details
float4 SampleTextureCatmullRom(in Texture2D<float4> tex, in SamplerState linearSampler, in float2 uv, in float2 texSize)
{
    // We're going to sample a 4x4 grid of texels surrounding the target UV coordinate. We'll do this by rounding
    // down the sample location to get the exact center of our "starting" texel. The starting texel will be at
    // location [1, 1] in the grid, where [0, 0] is the top left corner.
    float2 samplePos = uv * texSize;
    float2 texPos1 = floor(samplePos - 0.5f) + 0.5f;

    // Compute the fractional offset from our starting texel to our original sample location, which we'll
    // feed into the Catmull-Rom spline function to get our filter weights.
    float2 f = samplePos - texPos1;

    // Compute the Catmull-Rom weights using the fractional offset that we calculated earlier.
    // These equations are pre-expanded based on our knowledge of where the texels will be located,
    // which lets us avoid having to evaluate a piece-wise function.
    float2 w0 = f * (-0.5f + f * (1.0f - 0.5f * f));
    float2 w1 = 1.0f + f * f * (-2.5f + 1.5f * f);
    float2 w2 = f * (0.5f + f * (2.0f - 1.5f * f));
    float2 w3 = f * f * (-0.5f + 0.5f * f);

    // Work out weighting factors and sampling offsets that will let us use bilinear filtering to
    // simultaneously evaluate the middle 2 samples from the 4x4 grid.
    float2 w12 = w1 + w2;
    float2 offset12 = w2 / (w1 + w2);

    // Compute the final UV coordinates we'll use for sampling the texture
    float2 texPos0 = texPos1 - 1;
    float2 texPos3 = texPos1 + 2;
    float2 texPos12 = texPos1 + offset12;

    texPos0 /= texSize;
    texPos3 /= texSize;
    texPos12 /= texSize;

    float4 result = 0.0f;
    result += tex.SampleLevel(linearSampler, float2(texPos0.x, texPos0.y), 0.0f) * w0.x * w0.y;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos0.y), 0.0f) * w12.x * w0.y;
    result += tex.SampleLevel(linearSampler, float2(texPos3.x, texPos0.y), 0.0f) * w3.x * w0.y;

    result += tex.SampleLevel(linearSampler, float2(texPos0.x, texPos12.y), 0.0f) * w0.x * w12.y;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos12.y), 0.0f) * w12.x * w12.y;
    result += tex.SampleLevel(linearSampler, float2(texPos3.x, texPos12.y), 0.0f) * w3.x * w12.y;

    result += tex.SampleLevel(linearSampler, float2(texPos0.x, texPos3.y), 0.0f) * w0.x * w3.y;
    result += tex.SampleLevel(linearSampler, float2(texPos12.x, texPos3.y), 0.0f) * w12.x * w3.y;
    result += tex.SampleLevel(linearSampler, float2(texPos3.x, texPos3.y), 0.0f) * w3.x * w3.y;

    return result;
}
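
The trick that merges the two middle taps can be checked on the CPU. The sketch below is a 1D version only (the 2D filter is separable, so one axis is enough to show the idea) and is written in Python so it runs without a GPU. The helper names `bilinear_1d`, `sample_4_taps` and `sample_3_taps` are made up for this illustration and are not part of the gist; `bilinear_1d` is a simplified stand-in for a clamped linear sampler, not a model of any particular hardware.

```python
import math

def catmull_rom_weights(f):
    """Weights for the four texels around the sample, same Horner form as the HLSL above."""
    w0 = f * (-0.5 + f * (1.0 - 0.5 * f))
    w1 = 1.0 + f * f * (-2.5 + 1.5 * f)
    w2 = f * (0.5 + f * (2.0 - 1.5 * f))
    w3 = f * f * (-0.5 + 0.5 * f)
    return w0, w1, w2, w3

def bilinear_1d(texels, pos):
    """Stand-in for a linear-filtered fetch at texel-space position pos.

    Texel i is centred at i + 0.5; out-of-range indices are clamped.
    """
    x = pos - 0.5
    i = math.floor(x)
    t = x - i
    last = len(texels) - 1
    a = texels[min(max(i, 0), last)]
    b = texels[min(max(i + 1, 0), last)]
    return a + (b - a) * t

def sample_4_taps(texels, sample_pos):
    """Reference: one fetch per texel, each exactly at a texel centre."""
    tex_pos1 = math.floor(sample_pos - 0.5) + 0.5
    f = sample_pos - tex_pos1
    w = catmull_rom_weights(f)
    return sum(w[k] * bilinear_1d(texels, tex_pos1 + (k - 1)) for k in range(4))

def sample_3_taps(texels, sample_pos):
    """Same result with the two middle texels fetched in one linear tap."""
    tex_pos1 = math.floor(sample_pos - 0.5) + 0.5
    f = sample_pos - tex_pos1
    w0, w1, w2, w3 = catmull_rom_weights(f)
    w12 = w1 + w2
    offset12 = w2 / w12  # linear filtering at this offset blends texel 1 and 2 as w1 : w2
    return (w0 * bilinear_1d(texels, tex_pos1 - 1.0)
            + w12 * bilinear_1d(texels, tex_pos1 + offset12)
            + w3 * bilinear_1d(texels, tex_pos1 + 2.0))

if __name__ == "__main__":
    row = [0.0, 1.0, 4.0, 2.0, 7.0, 3.0, 5.0, 1.0]
    print(sample_4_taps(row, 3.3), sample_3_taps(row, 3.3))
```

The merge works because `w12 * lerp(a, b, w2 / w12)` expands to `w1 * a + w2 * b`, and `w1 + w2 = 1 + 0.5 * f * (1 - f)` is never zero for `f` in [0, 1). Applying the same merge on both axes turns the 4x4 grid into the 3x3 set of taps used above.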
TheRealMJP (author) commented Sep 21, 2016

Some quick benchmark results with an R9 380:

All of these numbers were gathered by using the above code for reprojecting the previous frame's result for the purpose of TAA, using the following shader: https://github.com/TheRealMJP/MSAAFilter/blob/master/MSAAFilter/Resolve.hlsl.

The 9-tap version used the above code as-is. The 1-tap version just uses bilinear filtering, and is there for reference. The 16-tap version used a modified version of the above function that performs 16 texture loads, with no sampling or filtering (I didn't use the filtering code that's checked in for that file, which has several branches for choosing the filter kernel and a few other options). The 5-tap version is the same as the above except that it omits the corner taps, as suggested by Jorge Jimenez in his SIGGRAPH 2016 presentation about Filmic SMAA: http://advances.realtimerendering.com/s2016/Filmic%20SMAA%20v7.pptx

| Resolution | MSAA Level | 1 tap | 9 taps | 16 taps | 5 taps |
|---|---|---|---|---|---|
| 1600x1200 | 1x MSAA | 0.40ms | 0.53ms | 0.58ms | 0.47ms |
| 1600x1200 | 4x MSAA | 2.40ms | 2.47ms | 2.50ms | 2.45ms |

Here's some more timings captured with a GTX 980:

| Resolution | MSAA Level | 1 tap | 9 taps | 16 taps | 5 taps |
|---|---|---|---|---|---|
| 1920x1080 | 1x MSAA | 0.30ms | 0.32ms | 0.34ms | 0.32ms |
| 1920x1080 | 4x MSAA | 0.92ms | 1.12ms | 1.18ms | 1.07ms |
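
For the 5-tap variant, it may help to see how much weight the omitted corner taps carry. The sketch below is a CPU-side illustration in Python, not the shader used for the timings. The function names and the `renormalize` option are assumptions made for this example: the comment above only says the corner taps are omitted, and does not say whether the remaining weights are rescaled.

```python
def catmull_rom_weights(f):
    """Catmull-Rom weights in the same Horner form as the HLSL function."""
    w0 = f * (-0.5 + f * (1.0 - 0.5 * f))
    w1 = 1.0 + f * f * (-2.5 + 1.5 * f)
    w2 = f * (0.5 + f * (2.0 - 1.5 * f))
    w3 = f * f * (-0.5 + 0.5 * f)
    return w0, w1, w2, w3

def corner_weight(fx, fy):
    """Total weight of the four corner taps of the 9-tap pattern."""
    x0, _, _, x3 = catmull_rom_weights(fx)
    y0, _, _, y3 = catmull_rom_weights(fy)
    # w0x*w0y + w3x*w0y + w0x*w3y + w3x*w3y factors into this product.
    return (x0 + x3) * (y0 + y3)

def five_tap_weights(fx, fy, renormalize=True):
    """Weights of the five taps left after dropping the corners.

    renormalize is an assumption of this sketch: it rescales the
    remaining weights so that they sum to 1 again.
    """
    x0, x1, x2, x3 = catmull_rom_weights(fx)
    y0, y1, y2, y3 = catmull_rom_weights(fy)
    x12, y12 = x1 + x2, y1 + y2
    weights = {
        "top": x12 * y0,
        "left": x0 * y12,
        "center": x12 * y12,
        "right": x3 * y12,
        "bottom": x12 * y3,
    }
    if renormalize:
        total = sum(weights.values())
        weights = {name: w / total for name, w in weights.items()}
    return weights

if __name__ == "__main__":
    print(corner_weight(0.5, 0.5))
    print(five_tap_weights(0.5, 0.5))
```

Since `w0 + w3 = -0.5 * f * (1 - f)` lies between -0.125 and 0, the combined corner weight is never more than 1/64, and it reaches that value only when the sample sits halfway between texels on both axes.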

aras-p commented Sep 21, 2016

btw a coworker suggested this small optimization:

// get rid of f3 (f2 is still f * f), and:
float2 w0 = (1.0f / 2.0f) * f * (-1.0f + f * (2.0f - f));
float2 w1 = (1.0f / 6.0f) * f2 * (-15.0f + 9.0f * f) + 1.0f;
float2 w2 = (1.0f / 6.0f) * f * (3.0f + f * (12.0f - f * 9.0f));
float2 w3 = (1.0f / 2.0f) * f2 * (f - 1.0f);

Checking with Pyramid using AMDDXX for Bonaire target:
VGPRs: 51 -> 49
VALU: 147 -> 146

pixelmager commented Sep 21, 2016

Alternatively, putting the polynomials straight into Horner form:

float2 w0 = f * ( -0.5 + f * (1.0 - 0.5*f));
float2 w1 = 1.0 + f * f * (-2.5 + 1.5*f );
float2 w2 = f * ( 0.5 + f * (2.0 - 1.5*f) );
float2 w3 = f * f * (-0.5 + 0.5 * f);

Pyramid, AMDDXX, Bonaire ( http://pastebin.com/12ccE9Lk )
VGPRs: 55 -> 47
VALU: 146 -> 135
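
The two rewrites in this thread should give the same weights, differing only in how the compiler schedules the arithmetic. A quick numeric check, sketched in Python with made-up function names (`weights_factored` for the first suggestion, `weights_horner` for the Horner form), assuming `f2 = f * f`:

```python
def weights_factored(f):
    """First suggestion: common factors pulled out, f2 assumed to be f * f."""
    f2 = f * f
    w0 = (1.0 / 2.0) * f * (-1.0 + f * (2.0 - f))
    w1 = (1.0 / 6.0) * f2 * (-15.0 + 9.0 * f) + 1.0
    w2 = (1.0 / 6.0) * f * (3.0 + f * (12.0 - f * 9.0))
    w3 = (1.0 / 2.0) * f2 * (f - 1.0)
    return w0, w1, w2, w3

def weights_horner(f):
    """Second suggestion: polynomials written directly in Horner form."""
    w0 = f * (-0.5 + f * (1.0 - 0.5 * f))
    w1 = 1.0 + f * f * (-2.5 + 1.5 * f)
    w2 = f * (0.5 + f * (2.0 - 1.5 * f))
    w3 = f * f * (-0.5 + 0.5 * f)
    return w0, w1, w2, w3

def max_difference(steps=1000):
    """Largest absolute gap between the two forms over f in [0, 1]."""
    worst = 0.0
    for i in range(steps + 1):
        f = i / steps
        for a, b in zip(weights_factored(f), weights_horner(f)):
            worst = max(worst, abs(a - b))
    return worst

if __name__ == "__main__":
    print(max_difference())
```

Any gap reported here comes from floating-point rounding only. The check says nothing about the register or instruction counts quoted above, which depend on the shader compiler.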

TheRealMJP (author) commented Sep 22, 2016

Thanks guys! I updated the code with the optimizations.

jamesford42 commented Feb 19, 2019