rygorous/avx_sigh.md

## avx_sigh.md

      
why doesn't radfft support AVX on PC?

So there are two separate issues here: using instructions added in AVX and using 256-bit wide vectors. The former turns out to be much easier than the latter for our use case.
Problem number 1 was that you positively need to put AVX code in a separate file with different compiler settings (/arch:AVX for VC++, -mavx for GCC/Clang) that make all SSE code emitted also use VEX encoding, and at the time radfft was written there was no way in CDep to set compiler flags for just one file, only for the overall build.
[There's the GCC "target" annotations on individual funcs, which in principle fix this, but I ran into nasty problems with this for several compiler versions, and VC++ has no equivalent, so we're not currently using that and just sticking with different compilation units.]
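As a rough illustration of the "different compilation units" layout, here is a minimal sketch assuming a runtime dispatcher picks the kernel. Every name in it (`scale_fn`, `scale_sse`, `scale_avx`, `pick_scale_kernel`) is hypothetical and not radfft's actual API. Both kernels sit in one file only so the sketch compiles on its own; `__builtin_cpu_supports` is a GCC/Clang builtin on x86, and other compilers fall back to the SSE path here.

```c
#include <stddef.h>

/* Hypothetical kernel signature, for illustration only. */
typedef void (*scale_fn)(float *data, size_t n, float s);

/* In a real build this would live in e.g. kernel_sse.c,
   compiled with the project's default flags. */
static void scale_sse(float *data, size_t n, float s) {
    for (size_t i = 0; i < n; i++) data[i] *= s;
}

/* In a real build this would live in its own file, e.g. kernel_avx.c,
   compiled with -mavx (GCC/Clang) or /arch:AVX (VC++), so that everything
   the compiler emits for it uses VEX encoding. */
static void scale_avx(float *data, size_t n, float s) {
    for (size_t i = 0; i < n; i++) data[i] *= s;
}

/* Returns nonzero if the AVX path may be used. */
static int cpu_has_avx(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return 0; /* unknown compiler or architecture: stay on the safe path */
#endif
}

/* Pick a kernel once; callers keep the returned pointer around. */
static scale_fn pick_scale_kernel(void) {
    return cpu_has_avx() ? scale_avx : scale_sse;
}
```

The point of the split is that code built with the AVX flags is only ever reached through the pointer, after the CPU check has passed.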
The other issue is to do with CPU power management.
https://en.wikichip.org/wiki/intel/frequency_behavior#Base.2C_Non-AVX_Turbo.2C_and_AVX_Turbo
https://computing.llnl.gov/tutorials/linux_clusters/intelAVXperformanceWhitePaper.pdf (Broadwell era).
New style first (SKX and later, i.e. brand new): there are three power bins, and they are per core.

- "Regular" lets you use any 128-bit wide instructions and "light" 256-bit instructions (shuffles and integer adds/logic/shifts OK, no FP, nothing that powers on the 256-bit multiplier whether int or FP).
- "AVX2 heavy" lets you use all 256-bit wide instructions and "light" 512-bit wide ones. Base and turbo frequencies are about 15% lower than baseline.
- "AVX512 heavy" lets you use anything. Forces the frequencies to be about 25-30% lower than regular.

Using any instructions outside the current power bin makes the core request a higher power license, which takes a long time. Quoting the Intel optimization guide:

> When the core requests a higher license level than its current one, it takes the PCU up to 500 microseconds to grant the new license. Until then the core operates at a lower peak capability. During this time period the PCU evaluates how many cores are executing at the new license level and adjusts their frequency as necessary, potentially lowering the frequency. Cores that execute at other license levels are not affected.
>
> A timer of approximately 2ms is applied before going back to a higher frequency level. Any condition that would have requested a new license resets the timer.

Millisecond scale, so we're talking millions of cycles. Until you're granted the higher power license level, your code runs on the narrower data path by issuing multiple internal uops for narrower slices.
[In case you're wondering why: they literally need to give the voltage regulators time to adjust because if they start powering up the wide datapaths directly, the chip browns out. This is not theoretical, we had crashes from this when we got the new machines, because the Fucking Gamer BIOS of course defaults to turning this off, "FULL TURBO FOR EVERYTHING ALL THE TIME!!!!".]
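To make the license mechanics concrete, here is a toy model of a single core. It is a sketch under loud assumptions, not a description of what the PCU actually does: the type and function names are made up, the grant always takes the full 500 microseconds (the guide only gives that as an upper bound), and the core drops straight back to the regular level once the roughly 2ms timer expires.

```c
#include <assert.h>

typedef enum {
    LICENSE_REGULAR = 0,
    LICENSE_AVX2_HEAVY = 1,
    LICENSE_AVX512_HEAVY = 2
} license_level;

/* Bin needed by an instruction, following the three bins described above.
   width_bits is 128, 256 or 512; heavy is nonzero for FP/multiplier ops. */
static license_level required_license(int width_bits, int heavy) {
    assert(width_bits == 128 || width_bits == 256 || width_bits == 512);
    if (width_bits == 128) return LICENSE_REGULAR;
    if (width_bits == 256) return heavy ? LICENSE_AVX2_HEAVY : LICENSE_REGULAR;
    return heavy ? LICENSE_AVX512_HEAVY : LICENSE_AVX2_HEAVY;
}

#define GRANT_LATENCY_US 500.0  /* assumed: always the quoted upper bound */
#define RELAX_TIMER_US  2000.0  /* assumed: exactly 2ms */

typedef struct {
    license_level granted;  /* level the core currently runs at */
    license_level pending;  /* highest level requested so far */
    double grant_at_us;     /* time at which the pending request is granted */
    double last_demand_us;  /* last time code needed more than "regular" */
} core_state;

/* Apply any grant or timer expiry that is due at time now_us. */
static void core_advance(core_state *c, double now_us) {
    if (c->pending > c->granted && now_us >= c->grant_at_us)
        c->granted = c->pending;
    if (c->granted > LICENSE_REGULAR &&
        now_us - c->last_demand_us >= RELAX_TIMER_US) {
        c->granted = LICENSE_REGULAR;
        c->pending = LICENSE_REGULAR;
    }
}

/* Returns 1 if the instruction runs at full width, 0 if it has to run
   on the narrower data path while the license request is outstanding. */
static int core_execute(core_state *c, double now_us, license_level need) {
    core_advance(c, now_us);
    if (need > LICENSE_REGULAR)
        c->last_demand_us = now_us;  /* any such demand resets the timer */
    if (need > c->granted) {
        if (need > c->pending) {     /* first request for this level */
            c->pending = need;
            c->grant_at_us = now_us + GRANT_LATENCY_US;
        }
        return 0;
    }
    return 1;
}
```

In this model a short burst of 256-bit float code at time 0 runs narrow until the grant at 500 microseconds, and the core then stays at the lower-frequency level until 2ms after the last wide instruction, whatever code happens to be running by then.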
The short version is that unless you're planning to run AVX-intensive code on that core for at least the next 10ms or so, you are completely shooting yourself in the foot by using AVX float.
A complex RADFFT at N=2048 (relevant size for Bink Audio, Miles sometimes uses larger FFTs) takes about 19k cycles when computed 128b-wide without FMAs. That means that the actual FFT runs and completes long before we ever get the higher power license grant, and then when we do get the higher power license, all we've done is docked the core frequency by about 15% (25%+ when using AVX-512) for the next couple milliseconds, when somebody else's code runs.
That's a Really Bad Thing for middleware to be doing, so we don't.
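The arithmetic behind that claim is easy to check. The sketch below assumes a 3 GHz core clock, which is a round number picked for illustration and not a figure from any measurement; the helper names are hypothetical too.

```c
#include <assert.h>

/* Convert a duration in microseconds to core cycles at a clock given in GHz.
   1 microsecond at 1 GHz is 1000 cycles. */
static double us_to_cycles(double microseconds, double ghz) {
    assert(microseconds >= 0.0 && ghz > 0.0);
    return microseconds * ghz * 1000.0;
}

/* How many back-to-back runs of a kernel fit into a time window. */
static double runs_per_window(double window_us, double ghz, double kernel_cycles) {
    assert(kernel_cycles > 0.0);
    return us_to_cycles(window_us, ghz) / kernel_cycles;
}
```

At the assumed 3 GHz, `us_to_cycles(500.0, 3.0)` is 1.5 million cycles, so `runs_per_window(500.0, 3.0, 19000.0)` comes out at roughly 79 complete N=2048 FFTs before a license grant that takes the full 500 microseconds arrives. The 2ms timer is another 6 million cycles of reduced frequency after that.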

What I just described is the new power management, which is actually a lot better than the old power management on Client Haswell through Kaby Lake. [I don't know what Coffee Lake does; I fear the "new" logic might be confined to the Core i9s based on the Skylake Server silicon for now.]
These older CPUs are somewhat faster to grant the higher power license level (but still on the order of 150k cycles), but if there is even one core using AVX code, all cores (well, everything on the same package if you're in a multi-socket CPU system) get limited to the max AVX frequency.
And they don't seem to have the "light" vs. "heavy" distinction either. Use anything 256b wide, even a single instruction, and you're docking the max turbo for all cores for the next couple milliseconds.
That's an even worse thing for middleware to be doing, so again, we try not to.

So why do we have AVX in RADFFT enabled on the consoles with Jaguar cores of all places, which don't even have any 256b wide SIMD execution?
Well, because they don't have any 256b wide SIMD execution, the 256b AVX ops just turn into two 128b ops each, so there's no frequency wonkiness or anything to worry about. And they're already using VEX encoding for all SSE instructions since it saves on moves (helps a lot with the Jaguar's low front-end throughput).
Using AVX instrs there basically ends up being an extra 2x unroll of the core loops, but without increasing the code size, and the code changes in RADFFT were completely trivial.
It's about 4% faster, mostly from reducing loop overhead. That's not world-shaking, but hey, 4% is better than nothing, and unlike the desktop PC variant it doesn't come with strings attached.
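The "extra 2x unroll" effect can be modelled in plain C without any intrinsics. This is only an illustration of the idea: `mul4` stands in for one 128-bit SIMD multiply, `mul8_split` for one 256-bit instruction that the hardware executes as two 128-bit halves, and none of the names come from RADFFT.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for one 128-bit SIMD multiply: 4 float lanes. */
static void mul4(float *dst, const float *a, const float *b) {
    for (int lane = 0; lane < 4; lane++) dst[lane] = a[lane] * b[lane];
}

/* 128-bit version: one 4-lane op per loop iteration, n/4 iterations. */
static void mul_128(float *dst, const float *a, const float *b, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) mul4(dst + i, a + i, b + i);
}

/* Stand-in for one 256-bit op on a core with 128-bit execution units:
   a single instruction in the code stream, executed as two 4-lane halves. */
static void mul8_split(float *dst, const float *a, const float *b) {
    mul4(dst, a, b);
    mul4(dst + 4, a + 4, b + 4);
}

/* 256-bit version: same arithmetic, but only n/8 loop iterations. */
static void mul_256(float *dst, const float *a, const float *b, size_t n) {
    assert(n % 8 == 0);
    for (size_t i = 0; i < n; i += 8) mul8_split(dst + i, a + i, b + i);
}
```

Both loops do exactly the same multiplies; the 256-bit one just pays the loop counter, compare and branch half as often, which is where a small gain from reduced loop overhead would come from.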