"Taming the Jaguar x86 Optimization at Insomniac Games" Notes

Notes on "Taming the Jaguar: x86 Optimization at Insomniac Games"

  • When the branch predictor mispredicts and the CPU speculatively executes code from the path that is not actually taken, that speculative work can pollute the caches, causing much worse performance than just the wasted fetch/decode/ALU cycles.
  • Retiring: all instructions retire (commit) in program order, at a maximum rate of 2 instructions/cycle.
    • i.e. the visible side-effects of an instruction are committed in order, even if executed out of order.
  • An L1 hit takes 3 cycles, an L2 hit takes 25 cycles, i.e. L2 is ~8x slower.
  • Main memory is ~200 cycles, i.e. around 66x slower than L1.
  • Retire control unit (RCU) can only store 64 instructions.
  • L2 miss + full RCU can be a recipe for disaster:
    • An L2 miss will not retire for 200+ cycles, while the frontend is (almost) always fetching 2 instructions/cycle, so after ~32 cycles (64 instructions) the RCU is full and the entire pipeline stalls. The CPU can no longer execute instructions past the memory op out of order to hide its 200+ cycle latency, because instructions must commit in order and the RCU is maxed out. Everything must wait for the memory op to commit before anything further can execute out of order.
    • To micro-optimize, move independent long-latency instructions (e.g. square roots, divides, reciprocals, other loads) to right after the load so that the two long-latency instructions execute in parallel, reducing their combined latency (see the sketch after this list).
      • Can overlap up to 8 L2 misses.
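A small source-level sketch of that idea (my own illustration, not code from the talk; the compiler and the OoO core may already schedule this way, so treat it as a conceptual picture):

```cpp
#include <cmath>
#include <cstddef>

struct Enemy { float health; };  // hypothetical cold data, scattered in memory

// Each iteration does a scattered load that will often miss L2 plus an
// independent long-latency ALU op (the sqrt). Keeping the two adjacent in
// program order lets both be in flight inside the ~64-entry RCU window, so
// the sqrt's latency hides under the ~200-cycle miss instead of being paid
// after it.
float total_damage(const Enemy* enemies, const int* idx, const float* dist2, std::size_t n) {
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        float h = enemies[idx[i]].health;                     // likely L2 miss
        float falloff = 1.0f / (1.0f + std::sqrt(dist2[i]));  // independent: overlaps the miss
        total += h * falloff;                                 // depends on both
    }
    return total;
}
```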
  • Loop unrolling:
    • Unrolling by a factor of 2x-3x is usually a win, but unrolling beyond that doesn’t help (see the sketch after this list).
    • Before unrolling, you may be bound by the frontend only being able to issue 2 instructions/cycle. If the ratio of ALU instructions in the loop body to the non-ALU loop bookkeeping instructions is low (i.e. bookkeeping dominates), ALU utilization is poor and the frontend issue rate is the bottleneck.
    • As you start to unroll, the bottleneck shifts to the load unit. The core can only go out to cache once per cycle, so depending on how many loads (or stores) the loop body has, memory access usually becomes the bottleneck as the body grows relative to the bookkeeping.
    • If you use SIMD, you can usually get a bigger win and increase the unrolling factor: with a SIMD load you still only touch the cache once per cycle, but you are now fetching 4x (for 4-wide SIMD) the amount of data per memory transaction.
      • So using SIMD is a win not necessarily because of ALU parallelization, but because it increases the number of bytes per memory transaction.
    • Once your loops are more complicated and contain long-latency instructions that let the front-end issue instructions from further ahead, unrolling doesn’t help much: the nature of OOO means the hardware can already start speculatively executing future iterations of the loop.
    • Thus, unrolling only helps when the ops are simple and all memory ops hit the cache, so that the front-end issue rate dominates loop performance (i.e. instructions are executed/retired as fast as they are issued).
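To make the unrolling point concrete, a minimal sketch with a trivial sum-of-products loop (my example, not the talk's):

```cpp
#include <cstddef>

// Simple loop: per element there is little ALU work relative to the
// bookkeeping (increment, compare, branch), so the 2 instructions/cycle
// frontend issue rate is the bottleneck.
float dot_simple(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

// Unrolled 2x: the same bookkeeping now covers two elements' worth of ALU
// work, improving the ratio of useful work to issued instructions. Beyond
// 2x-3x the bottleneck shifts to the load port (one cache access per
// cycle), so further unrolling stops helping.
float dot_unrolled2(const float* a, const float* b, std::size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    if (i < n)                       // odd-length tail
        acc0 += a[i] * b[i];
    return acc0 + acc1;
}
```

An SSE version of the same loop would additionally move 16 bytes per load, which is the "more bytes per memory transaction" win described above.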
  • Prefetching
    • Utterly useless to try to prefetch indirect memory (e.g. the pointer to the next node in a linked list): the address isn’t known until the current node has been loaded.
    • When looping through an array, if the loop body just does math on the array elements, prefetching is also useless: the OOO core has probably already internally unrolled and executed the next N iterations, effectively prefetching the next N elements, so the prefetch instruction is just extra overhead (an extra instruction) telling the CPU something it already knows.
    • If the loop body’s computation depends on chasing a pointer fetched from the array being looped over, and that dependent computation is compute-intensive, then prefetching the next array elements can help: the prefetch latency is hidden by the heavy compute workload (or possibly by the dereference of the fetched pointer).
    • Guidelines summary:
      1. Never prefetch basic arrays
      2. Prefetch only heavy array/pointer workloads (see the sketch after this list)
        • There must be enough independent work to overlap the latency of the prefetch.
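A hedged sketch of guideline 2 using the `_mm_prefetch` intrinsic (the data layout, names, and prefetch distance of one element are my assumptions, not values from the talk):

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#include <cmath>
#include <cstddef>

// Hypothetical heavy object reached through a pointer.
struct Body {
    float pos[3];
    float vel[3];
};

// Stand-in for a compute-heavy per-object update.
inline float expensive_update(Body* b) {
    float speed = std::sqrt(b->vel[0] * b->vel[0] +
                            b->vel[1] * b->vel[1] +
                            b->vel[2] * b->vel[2]);
    for (int k = 0; k < 3; ++k)
        b->pos[k] += b->vel[k] / (1.0f + speed);
    return speed;
}

// Array of pointers: each iteration chases a pointer and then does heavy
// compute. Prefetching the next pointee overlaps its likely cache miss with
// the current iteration's compute. For a plain data array this would be pure
// overhead: the OoO core is already running ahead of the loop.
float update_all(Body** bodies, std::size_t count) {
    float total = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        if (i + 1 < count)
            _mm_prefetch(reinterpret_cast<const char*>(bodies[i + 1]), _MM_HINT_T0);
        total += expensive_update(bodies[i]);
    }
    return total;
}
```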
  • When things are in cache (i.e. everything is low latency), instruction counts matter a lot.
  • There are actually 64 physical integer registers and 72 physical SSE registers. Register renaming in the micro-architecture lets the CPU use these extra registers for out-of-order execution.