@atyuwen
Last active January 8, 2024 06:07
An optimized AMD FSR implementation for Mobiles
//==============================================================================================================================
// An optimized AMD FSR's EASU implementation for Mobiles
// Based on https://github.com/GPUOpen-Effects/FidelityFX-FSR/blob/master/ffx-fsr/ffx_fsr1.h
// Details can be found: https://atyuwen.github.io/posts/optimizing-fsr/
// Distributed under the MIT License. Copyright (c) 2021 atyuwen.
// -- FsrEasuSampleH should be implemented by the calling shader, like the following:
// AH3 FsrEasuSampleH(AF2 p) { return MyTex.SampleLevel(LinearSampler, p, 0).xyz; }
//==============================================================================================================================
void FsrEasuL(
out AH3 pix,
AF2 ip,
AF4 con0,
AF4 con1,
AF4 con2,
AF4 con3){
//------------------------------------------------------------------------------------------------------------------------------
// Direction is the '+' diff.
//    A
//  B C D
//    E
AF2 pp=(ip)*(con0.xy)+(con0.zw);
AF2 tc=(pp+AF2_(0.5))*con1.xy;
AH3 sA=FsrEasuSampleH(tc-AF2(0, con1.y));
AH3 sB=FsrEasuSampleH(tc-AF2(con1.x, 0));
AH3 sC=FsrEasuSampleH(tc);
AH3 sD=FsrEasuSampleH(tc+AF2(con1.x, 0));
AH3 sE=FsrEasuSampleH(tc+AF2(0, con1.y));
AH1 lA=sA.r*AH1_(0.5)+sA.g;
AH1 lB=sB.r*AH1_(0.5)+sB.g;
AH1 lC=sC.r*AH1_(0.5)+sC.g;
AH1 lD=sD.r*AH1_(0.5)+sD.g;
AH1 lE=sE.r*AH1_(0.5)+sE.g;
// Then takes magnitude from abs average of both sides of 'C'.
// Length converts gradient reversal to 0, smoothly to non-reversal at 1, shaped, then adding horz and vert terms.
AH1 dc=lD-lC;
AH1 cb=lC-lB;
AH1 lenX=max(abs(dc),abs(cb));
lenX=ARcpH1(lenX);
AH1 dirX=lD-lB;
lenX=ASatH1(abs(dirX)*lenX);
lenX*=lenX;
// Repeat for the y axis.
AH1 ec=lE-lC;
AH1 ca=lC-lA;
AH1 lenY=max(abs(ec),abs(ca));
lenY=ARcpH1(lenY);
AH1 dirY=lE-lA;
lenY=ASatH1(abs(dirY)*lenY);
AH1 len = lenY * lenY + lenX;
AH2 dir = AH2(dirX, dirY);
//------------------------------------------------------------------------------------------------------------------------------
AH2 dir2=dir*dir;
AH1 dirR=dir2.x+dir2.y;
if (dirR<AH1_(1.0/64.0)) {
  pix = sC;
  return;
}
dirR=ARsqH1(dirR);
dir*=AH2_(dirR);
len=len*AH1_(0.5);
len*=len;
AH1 stretch=(dir.x*dir.x+dir.y*dir.y)*ARcpH1(max(abs(dir.x),abs(dir.y)));
AH2 len2=AH2(AH1_(1.0)+(stretch-AH1_(1.0))*len,AH1_(1.0)+AH1_(-0.5)*len);
AH1 lob=AH1_(0.5)+AH1_((1.0/4.0-0.04)-0.5)*len;
AH1 clp=ARcpH1(lob);
//------------------------------------------------------------------------------------------------------------------------------
AF2 fp=floor(pp);
pp-=fp;
AH2 ppp=AH2(pp);
AF2 p0=fp*(con1.xy)+(con1.zw);
AF2 p1=p0+(con2.xy);
AF2 p2=p0+(con2.zw);
AF2 p3=p0+(con3.xy);
p0.y-=con1.w; p3.y+=con1.w;
AH4 fgcbR=FsrEasuRH(p0);
AH4 fgcbG=FsrEasuGH(p0);
AH4 fgcbB=FsrEasuBH(p0);
AH4 ijfeR=FsrEasuRH(p1);
AH4 ijfeG=FsrEasuGH(p1);
AH4 ijfeB=FsrEasuBH(p1);
AH4 klhgR=FsrEasuRH(p2);
AH4 klhgG=FsrEasuGH(p2);
AH4 klhgB=FsrEasuBH(p2);
AH4 nokjR=FsrEasuRH(p3);
AH4 nokjG=FsrEasuGH(p3);
AH4 nokjB=FsrEasuBH(p3);
//------------------------------------------------------------------------------------------------------------------------------
// This part is different for FP16, working pairs of taps at a time.
AH2 pR=AH2_(0.0);
AH2 pG=AH2_(0.0);
AH2 pB=AH2_(0.0);
AH2 pW=AH2_(0.0);
FsrEasuTapH(pR,pG,pB,pW,AH2( 1.0, 0.0)-ppp.xx,AH2(-1.0,-1.0)-ppp.yy,dir,len2,lob,clp,fgcbR.zw,fgcbG.zw,fgcbB.zw);
FsrEasuTapH(pR,pG,pB,pW,AH2(-1.0, 0.0)-ppp.xx,AH2( 1.0, 1.0)-ppp.yy,dir,len2,lob,clp,ijfeR.xy,ijfeG.xy,ijfeB.xy);
FsrEasuTapH(pR,pG,pB,pW,AH2( 0.0,-1.0)-ppp.xx,AH2( 0.0, 0.0)-ppp.yy,dir,len2,lob,clp,ijfeR.zw,ijfeG.zw,ijfeB.zw);
FsrEasuTapH(pR,pG,pB,pW,AH2( 1.0, 2.0)-ppp.xx,AH2( 1.0, 1.0)-ppp.yy,dir,len2,lob,clp,klhgR.xy,klhgG.xy,klhgB.xy);
FsrEasuTapH(pR,pG,pB,pW,AH2( 2.0, 1.0)-ppp.xx,AH2( 0.0, 0.0)-ppp.yy,dir,len2,lob,clp,klhgR.zw,klhgG.zw,klhgB.zw);
FsrEasuTapH(pR,pG,pB,pW,AH2( 0.0, 1.0)-ppp.xx,AH2( 2.0, 2.0)-ppp.yy,dir,len2,lob,clp,nokjR.xy,nokjG.xy,nokjB.xy);
AH3 aC=AH3(pR.x+pR.y,pG.x+pG.y,pB.x+pB.y);
AH1 aW=pW.x+pW.y;
//------------------------------------------------------------------------------------------------------------------------------
pix=aC*AH3_(ARcpH1(aW));}
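
For readers who want to check the edge analysis outside a shader, the first half of FsrEasuL (the '+' shaped luma taps, the per-axis length, and the early-out test) can be mirrored on the CPU. The following is a minimal C++ sketch and not part of the gist: `Rgb`, `Luma`, `AxisLength`, `EdgeEstimate` and `EstimateEdge` are hypothetical names, plain float stands in for the half types, and the explicit zero check is an assumption that replaces whatever ARcpH1 yields for a zero input on a given GPU.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical CPU-side mirror of the direction estimate in FsrEasuL (illustration only).
struct Rgb { float r, g, b; };

// Reduced luma used in the listing: 0.5*R + G (blue is ignored).
inline float Luma(const Rgb& s) { return s.r * 0.5f + s.g; }

struct EdgeEstimate {
    float dirX, dirY;  // unnormalized gradient across the centre tap C
    float len;         // lenX^2 + lenY^2, each term in [0, 1]
    bool  earlyOut;    // true where the shader would simply return sC
};

// One axis of the '+' pattern: lo and hi are the neighbours on either side of c.
inline float AxisLength(float lo, float c, float hi) {
    float m = std::max(std::fabs(hi - c), std::fabs(c - lo));
    if (m == 0.0f) return 0.0f;  // assumption: CPU guard, the shader has no such branch
    float t = std::fabs(hi - lo) / m;
    t = std::min(std::max(t, 0.0f), 1.0f);  // same role as ASatH1
    return t * t;
}

// Taps are laid out as in the listing: A above, B left, C centre, D right, E below.
inline EdgeEstimate EstimateEdge(const Rgb& a, const Rgb& b, const Rgb& c,
                                 const Rgb& d, const Rgb& e) {
    float lA = Luma(a), lB = Luma(b), lC = Luma(c), lD = Luma(d), lE = Luma(e);
    EdgeEstimate out;
    out.dirX = lD - lB;
    out.dirY = lE - lA;
    out.len = AxisLength(lB, lC, lD) + AxisLength(lA, lC, lE);
    out.earlyOut = (out.dirX * out.dirX + out.dirY * out.dirY) < (1.0f / 64.0f);
    return out;
}
```

With black at A, B, C, E and white at D, this gives dirX = 1.5, dirY = 0 and len = 1. A reversal along one axis (bright, dark, bright) contributes 0 to len, which is what the "gradient reversal" comment in the listing describes.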
@gnif

gnif commented Aug 26, 2021

You say this is for mobiles but we are interested in using it for the FOSS Looking Glass project (https://github.com/gnif/LookingGlass) where latency is crucial and I have two questions:

  1. Are there any compromises made here or is this simply optimization?
  2. What license is this under? We need something that is GPLv2 compatible if we are to use it.

@sorasoras

> You say this is for mobiles but we are interested in using it for the FOSS Looking Glass project (https://github.com/gnif/LookingGlass) where latency is crucial and I have two questions:
>
>   1. Are there any compromises made here or is this simply optimization?
>   2. What license is this under? We need something that is GPLv2 compatible if we are to use it.

https://twitter.com/atyuwen/status/1430459990561607689

@gnif

gnif commented Aug 27, 2021

> > You say this is for mobiles but we are interested in using it for the FOSS Looking Glass project (https://github.com/gnif/LookingGlass) where latency is crucial and I have two questions:
> >
> >   1. Are there any compromises made here or is this simply optimization?
> >   2. What license is this under? We need something that is GPLv2 compatible if we are to use it.
>
> https://twitter.com/atyuwen/status/1430459990561607689

Thanks mate but this is not what I asked.

@atyuwen
Author

atyuwen commented Aug 27, 2021

> You say this is for mobiles but we are interested in using it for the FOSS Looking Glass project (https://github.com/gnif/LookingGlass) where latency is crucial and I have two questions:
>
>   1. Are there any compromises made here or is this simply optimization?
>   2. What license is this under? We need something that is GPLv2 compatible if we are to use it.

  1. It's a slimmed-down version of the original algorithm, though I tried my best to preserve as much upscaling quality as possible.
  2. It's under the MIT license.

@gnif

gnif commented Aug 27, 2021

> > You say this is for mobiles but we are interested in using it for the FOSS Looking Glass project (https://github.com/gnif/LookingGlass) where latency is crucial and I have two questions:
> >
> >   1. Are there any compromises made here or is this simply optimization?
> >   2. What license is this under? We need something that is GPLv2 compatible if we are to use it.
>
>   1. It's a slimmed-down version of the original algorithm, though I tried my best to preserve as much upscaling quality as possible.
>   2. It's under the MIT license.

Thank you sir! :)

@atyuwen
Author

atyuwen commented Sep 2, 2021

Please note that the signature of FsrEasuL here is different from that of the original FsrEasuH:

FsrEasuL(out AH3 pix, AF2 ip, AF4 con0, AF4 con1, AF4 con2, AF4 con3);    // FsrEasuL expects float inputs.
FsrEasuH(out AH3 pix, AU2 ip, AU4 con0, AU4 con1, AU4 con2, AU4 con3);    // FsrEasuH expects uint inputs.

Here is another version of FsrEasuL that has the same signature as FsrEasuH:
FSR Mobile Demo (based on the official FSR Demo)
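
The two signatures carry the same data in different encodings. In the upstream ffx_fsr1.h, as far as I can tell, the AU4 constants passed to FsrEasuH hold raw float bit patterns that the shader reinterprets, whereas FsrEasuL in this gist takes the floats directly. A minimal C++ sketch of the host-side conversion, assuming 32-bit IEEE floats; `Float4`, `Uint4`, `PackAsUint` and `UnpackAsFloat` are hypothetical helper names, not part of either header:

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as a 32-bit unsigned integer and back.
// This is the CPU counterpart of asuint/asfloat (HLSL) or floatBitsToUint/uintBitsToFloat (GLSL).
inline std::uint32_t FloatBitsToUint(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

inline float UintBitsToFloat(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

struct Float4 { float x, y, z, w; };          // hypothetical stand-in for AF4
struct Uint4  { std::uint32_t x, y, z, w; };  // hypothetical stand-in for AU4

// Constants in the form a uint-based entry point would expect.
inline Uint4 PackAsUint(const Float4& c) {
    return { FloatBitsToUint(c.x), FloatBitsToUint(c.y),
             FloatBitsToUint(c.z), FloatBitsToUint(c.w) };
}

// Constants in the form the float-based FsrEasuL in this gist expects.
inline Float4 UnpackAsFloat(const Uint4& c) {
    return { UintBitsToFloat(c.x), UintBitsToFloat(c.y),
             UintBitsToFloat(c.z), UintBitsToFloat(c.w) };
}
```

If existing host code already fills uint constant buffers for FsrEasuH, the same buffer contents can probably be bound as floats for FsrEasuL without changing any values; only the pixel position `ip` needs an explicit integer-to-float conversion.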

@terboz

terboz commented Sep 9, 2021

Thank you for sharing this awesome code. Can I ask you two questions?

  • Do you plan to submit your code as a pull request to the original repository?

  • I can't find a license.txt for opt_fxr in your repository below. Could you add one to the repo?
    https://github.com/atyuwen/FidelityFX-FSR

@atyuwen
Author

atyuwen commented Sep 13, 2021

> Thank you for sharing this awesome code. Can I ask you two questions?
>
>   • Do you plan to submit your code as a pull request to the original repository?
>   • I can't find a license.txt for opt_fxr in your repository below. Could you add one to the repo?
>     https://github.com/atyuwen/FidelityFX-FSR

No plans for a pull request; I guess AMD might not be very interested in porting FSR to mobile.
Now the copyright is added to "ffx_fsr1.h".

@terboz

terboz commented Oct 5, 2021

> Now the copyright is added to "ffx_fsr1.h".

Thanks for adding the license statement!

@hafuxiaoguaishou

Thanks for sharing.
May I ask one question?
Why doesn't the official version use LDS to store the texture colors and reduce the number of texture samples?

@allomancerMac

It seems to me that, when checking the luminance direction, the AMD version computes 0.5*B+0.5*R+1.0*G, whereas your version seems to drop blue entirely: AH1 lA=sA.r*AH1_(0.5)+sA.g; which is 0.5*R+1.0*G.
Is this intentional or am I misreading the code?

@atyuwen
Author

atyuwen commented Apr 19, 2022

> It seems to me that, when checking the luminance direction, the AMD version computes 0.5*B+0.5*R+1.0*G, whereas your version seems to drop blue entirely: AH1 lA=sA.r*AH1_(0.5)+sA.g; which is 0.5*R+1.0*G. Is this intentional or am I misreading the code?

Yes, the blue channel is dropped intentionally to save some ALUs.
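
To make the trade-off concrete, here is a small C++ sketch comparing the two luma approximations quoted above. `Rgb`, `LumaAmd` and `LumaFast` are hypothetical names chosen for illustration; the formulas are the ones given in this thread.

```cpp
// Illustration only: the two luma approximations discussed in this thread.
struct Rgb { float r, g, b; };

// Formula quoted for the AMD version: 0.5*B + 0.5*R + 1.0*G.
inline float LumaAmd(const Rgb& c) { return 0.5f * c.b + 0.5f * c.r + c.g; }

// Formula used in this gist: 0.5*R + 1.0*G (one multiply-add fewer per tap).
inline float LumaFast(const Rgb& c) { return 0.5f * c.r + c.g; }
```

For a pure blue pixel next to a black one, `LumaAmd` differs by 0.5 while `LumaFast` reports 0 for both, so the direction estimate cannot see an edge that exists only in the blue channel. For content where edges also show up in red or green, the two should behave similarly.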

@Tanshaydar

Tanshaydar commented Jun 17, 2022

Would this work on Android VR?
Neither the OpenGL ES 3 implementation nor the half-precision conversions worked (i.e. only the lower-left corner was rendered instead of the full upscaled image).

Has anyone tried this, or does anyone know more?

@atyuwen
Author

atyuwen commented Jun 20, 2022

> Would this work on Android VR? Neither the OpenGL ES 3 implementation nor the half-precision conversions worked (i.e. only the lower-left corner was rendered instead of the full upscaled image).
>
> Has anyone tried this, or does anyone know more?

We have tested this code on Android (GLES3). No bugs found. Did you use fp16 for texcoord-related values?
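
The question about fp16 texture coordinates is worth a quick numeric check: an IEEE half keeps only 11 significant bits, which is fewer than a normalized coordinate needs to address texels in a typical render target. The C++ sketch below is an illustration under stated assumptions, not a diagnosis of the problem reported above: `QuantizeToHalfPrecision` and `MaxTexelErrorWithHalf` are hypothetical helpers, and the quantizer only models the significand width (it ignores the half exponent range, subnormals, and ties-to-even rounding).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Round a positive value to 11 significant bits, the precision of a normal IEEE half.
inline double QuantizeToHalfPrecision(double x) {
    assert(x > 0.0);
    int exp = 0;
    double m = std::frexp(x, &exp);            // x = m * 2^exp with m in [0.5, 1)
    double q = std::round(std::ldexp(m, 11));  // keep 11 significant bits
    return std::ldexp(q, exp - 11);
}

// Largest error, in texels, when the texel-centre coordinates of a row that is
// `width` texels wide are stored at half precision.
inline double MaxTexelErrorWithHalf(int width) {
    double worst = 0.0;
    for (int i = 0; i < width; ++i) {
        double u = (i + 0.5) / width;  // normalized texel centre
        double err = std::fabs(QuantizeToHalfPrecision(u) - u) * width;
        worst = std::max(worst, err);
    }
    return worst;
}
```

For a 1920-texel row the worst case comes out at a little under half a texel, which is enough to shift gather footprints, while a 256-texel row is still represented exactly. This is why the listing keeps `pp`, `tc` and `p0` to `p3` in AF2 and only converts the sub-texel fraction `ppp` to AH2.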

@Andraw-sue

AF2 fp=floor(pp);
pp-=fp;
AH2 ppp=AH2(pp);
AF2 p0=fp*(con1.xy)+(con1.zw);
AF2 p1=p0+(con2.xy);
AF2 p2=p0+(con2.zw);
AF2 p3=p0+(con3.xy);
p0.y-=con1.w; p3.y+=con1.w;
AH4 fgcbR=FsrEasuRH(p0);
AH4 fgcbG=FsrEasuGH(p0);
AH4 fgcbB=FsrEasuBH(p0);
AH4 ijfeR=FsrEasuRH(p1);
AH4 ijfeG=FsrEasuGH(p1);
AH4 ijfeB=FsrEasuBH(p1);
AH4 klhgR=FsrEasuRH(p2);
AH4 klhgG=FsrEasuGH(p2);
AH4 klhgB=FsrEasuBH(p2);
AH4 nokjR=FsrEasuRH(p3);
AH4 nokjG=FsrEasuGH(p3);
AH4 nokjB=FsrEasuBH(p3);

This differs from the AMD version: f, g, j, and k will be processed four times. Why did you change the positions of p0~p3 and modify the gather samples?

@atyuwen
Author

atyuwen commented Oct 24, 2023

> AF2 fp=floor(pp);
> pp-=fp;
> AH2 ppp=AH2(pp);
> AF2 p0=fp*(con1.xy)+(con1.zw);
> AF2 p1=p0+(con2.xy);
> AF2 p2=p0+(con2.zw);
> AF2 p3=p0+(con3.xy);
> p0.y-=con1.w; p3.y+=con1.w;
> AH4 fgcbR=FsrEasuRH(p0);
> AH4 fgcbG=FsrEasuGH(p0);
> AH4 fgcbB=FsrEasuBH(p0);
> AH4 ijfeR=FsrEasuRH(p1);
> AH4 ijfeG=FsrEasuGH(p1);
> AH4 ijfeB=FsrEasuBH(p1);
> AH4 klhgR=FsrEasuRH(p2);
> AH4 klhgG=FsrEasuGH(p2);
> AH4 klhgB=FsrEasuBH(p2);
> AH4 nokjR=FsrEasuRH(p3);
> AH4 nokjG=FsrEasuGH(p3);
> AH4 nokjB=FsrEasuBH(p3);
>
> This differs from the AMD version: f, g, j, and k will be processed four times. Why did you change the positions of p0~p3 and modify the gather samples?

It's been a while and I don't remember exactly, but it was probably an attempt to optimize TMU usage, though there was no performance gain in the end.
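
For reference, the six FsrEasuTapH calls in the listing can be tabulated to see which gathered texel feeds which tap. The C++ sketch below reads the offsets straight off those calls; the letter attached to each offset assumes that the variable names (fgcb, ijfe, klhg, nokj) list the texels in .xyzw order, which is the upstream naming convention as I understand it. `Tap`, `kTaps` and `TapsAreUnique` are hypothetical names.

```cpp
#include <array>
#include <set>
#include <utility>

// Texel offsets (x, y) relative to 'f', in the order the listing passes them to FsrEasuTapH.
// Assumed layout of the 12-tap kernel:
//     b c
//   e f g h
//   i j k l
//     n o
struct Tap { char name; int x, y; };

constexpr std::array<Tap, 12> kTaps = {{
    {'c',  1, -1}, {'b',  0, -1},  // fgcb.zw
    {'i', -1,  1}, {'j',  0,  1},  // ijfe.xy
    {'f',  0,  0}, {'e', -1,  0},  // ijfe.zw
    {'k',  1,  1}, {'l',  2,  1},  // klhg.xy
    {'h',  2,  0}, {'g',  1,  0},  // klhg.zw
    {'n',  0,  2}, {'o',  1,  2},  // nokj.xy
}};

// True when no texel offset is passed to FsrEasuTapH more than once.
inline bool TapsAreUnique() {
    std::set<std::pair<int, int>> seen;
    for (const Tap& t : kTaps) {
        if (!seen.insert({t.x, t.y}).second) return false;
    }
    return true;
}
```

Under that naming assumption, each of the 12 texels reaches FsrEasuTapH exactly once. The overlap is in the fetches: f, g, j and k each appear in two of the four gathers, and the fgcb.xy and nokj.zw components are fetched but not used as taps.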

@zhangbaochong

I want to use half precision on Android (310 es), but I get an error that the extension GL_EXT_shader_16bit_storage is not supported. How should I handle this? Or should I just use mediump float instead of float16_t?

@atyuwen
Author

atyuwen commented Oct 27, 2023 via email

@zhangbaochong

> It depends on your shader cross-compiler. Some have a flag to convert half to mediump.

Do you mean that you tested on Android through HLSL cross-platform compilation? I am currently using the original GLSL, and I find that the float16_t extension is not supported. I will try using mediump float/mediump vec2 instead of float16_t/f16vec2, although some code may need to change. Thanks.

@Andraw-sue

For the early-out condition "dirR<AH1_(1.0/64.0)", how was the threshold value AH1_(1.0/64.0) determined?
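
The thread does not record how the value was chosen, but what the test means can be read off the listing: dirR is the squared length of (dirX, dirY), so comparing it against 1/64 is the same as comparing the gradient length against 1/8, in units of the reduced luma 0.5*R+G (which spans 0 to 1.5 for inputs in 0 to 1). A minimal C++ sketch of that equivalence, with `TakesEarlyOut` and `GradientShorterThanEighth` as hypothetical names:

```cpp
#include <cmath>

// Squared-gradient test as written in the listing: dirR = dirX^2 + dirY^2 < 1/64.
inline bool TakesEarlyOut(float dirX, float dirY) {
    return dirX * dirX + dirY * dirY < 1.0f / 64.0f;
}

// Same condition expressed on the gradient length itself.
inline bool GradientShorterThanEighth(float dirX, float dirY) {
    return std::hypot(dirX, dirY) < 0.125f;
}
```

Pixels that pass this test skip the 12-tap filter and return the bilinear centre sample sC, so a larger threshold would presumably trade sharpness in low-contrast regions for fewer gathers.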

@atyuwen
Author

atyuwen commented Nov 1, 2023 via email

@zhangbaochong

FSR currently has two passes, EASU and RCAS. Can they be merged into one pass? I tried sharpening with the 12 texels in EASU, but the image quality was very bad. Do you have any good ideas for this? Thanks.

@atyuwen
Author

atyuwen commented Nov 2, 2023 via email
