Processor optimizations

Key takeaways from Intel and Agner Fog

Also see Performance Speed Limits from Travis Downs.

32/64 bit hints

  • Use the 32-bit versions of instructions in 64-bit mode to reduce code size unless the 64-bit version is necessary to access 64-bit data or additional registers

Calling conventions

  • win-x64: No more than four parameters can be transferred in registers, regardless of type
  • linux-x64: The first six integer parameters are transferred in RDI, RSI, RDX, RCX, R8, R9, respectively. The first eight floating point parameters are transferred in XMM0-XMM7. All these registers can be used, so that a maximum of fourteen parameters can be transferred in registers

Dependencies / out of order

  • Prevent false dependencies by writing to a full register rather than a partial register.
  • Any pointer or index registers should be calculated as early as possible so that the CPU can verify that the addresses of memory operands are different.
  • A memory write may not be able to execute before a preceding write, but the write buffers can hold a number of pending writes, typically four or more.
  • Organize your code so that some independent instructions, which don't depend on the result of long-latency operations, are available to execute after those operations.
  • The optimal number of accumulators is the latency of the instruction divided by its reciprocal throughput (see the sketch after this list).
  • Keep loops small so that they are likely to fit into the μop cache or loopback buffer. Avoid unnecessary loop unrolling.
  • y = a + b + c + d can be changed to y = (a + b) + (c + d) so that the two parenthesized sums can be calculated in parallel.
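
A minimal C# sketch of the multiple-accumulator idea, assuming a plain sum reduction; the count of four accumulators is illustrative and should really be latency / reciprocal throughput of the add instruction on the target CPU. Note that reassociating floating-point sums can change the result in the last bits.

```csharp
// Sketch: sum reduction with four independent accumulators, so the loop is
// not limited by a single loop-carried add chain.
static double Sum(ReadOnlySpan<double> data)
{
    double a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    int i = 0;

    for (; i <= data.Length - 4; i += 4)
    {
        a0 += data[i + 0];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }

    for (; i < data.Length; ++i)        // remainder
    {
        a0 += data[i];
    }

    return (a0 + a1) + (a2 + a3);       // pairwise, cf. the (a + b) + (c + d) hint
}
```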

Branch prediction

  • Forward conditional branches are predicted not taken (fall through is taken)
  • Backward conditional branches are predicted taken
  • Unconditional branches (jmp) are taken
  • A conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%
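
A hedged C# illustration of the jump-vs-conditional-move trade-off; whether RyuJIT actually emits a branch for the ternary or a cmov/branchless sequence is not guaranteed, so treat both variants as assumptions.

```csharp
// Branchy: cheap when the branch predictor is right most of the time (> ~75%).
static int MaxBranchy(int a, int b) => a > b ? a : b;

// Branchless: constant cost, no misprediction penalty, but it joins the
// dependency chain of both inputs.
static int MaxBranchless(int a, int b)
{
    int mask = (a - b) >> 31;          // 0 if a >= b, -1 otherwise (assumes a - b does not overflow)
    return (a & ~mask) | (b & mask);
}
```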

Macro fusion

  • TEST can be fused with all conditional jumps
  • Prefer TEST over CMP if possible
  • Use unsigned variables and unsigned jumps when possible
  • Unsigned iteration counts are compatible with macro fusion --> use nuint in loops (see the sketch after this list)
  • TEST is better than CMP ..., 0 because the instruction size is smaller
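
A small sketch of the nuint hint; whether the resulting cmp/jb pair actually macro-fuses depends on the JIT version and the microarchitecture.

```csharp
// Sketch: unsigned (nuint) loop counter, per the hint above.
static unsafe double SumPtr(double* ptr, nuint length)
{
    double sum = 0;

    for (nuint i = 0; i < length; ++i)
    {
        sum += ptr[i];
    }

    return sum;
}
```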

Integer/SIMD Operands

  • (note necessary) The PALIGNR instruction is most valuable when loading or storing unaligned data where the address is shifted by a few bytes.

  • Replace Shuffles with Blends: AVX shuffles are executed only on port 5, while blends are also executed on port 0

  • Gather can be useful for non-sequential access and can be faster than creating the vector from scalars (see the sketch below).
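
A hedged AVX2 gather sketch (the method name and parameters are illustrative): it builds a Vector256<float> from eight non-contiguous elements instead of assembling it from scalars.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe Vector256<float> Gather8(float* data, ReadOnlySpan<int> indices)
{
    // assumes indices.Length >= 8 and that all indices are in bounds of data
    fixed (int* idx = indices)
    {
        Vector256<int> vindex = Avx.LoadVector256(idx);   // eight indices
        return Avx2.GatherVector256(data, vindex, 4);     // scale = sizeof(float)
    }
}
```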

  • Generating constants per table from Agner Fog

Floating-Point/SIMD Operands

  • The MOVSD REG, REG instruction writes only the bottom 64 bits of a register, not all 128 bits. This introduces a dependence on the preceding instruction that produces the upper 64 bits (even if those bits are no longer wanted). The dependence inhibits register renaming and thereby reduces parallelism.

  • Approximate divisions by using one of the fast reciprocal instructions on the divisor and then multiplying by the dividend. The precision can be increased from 12 to 23 bits by using the principle of Newton-Raphson iteration: x0 = rcpss(d); x1 = x0 * (2 - d * x0) = 2 * x0 - d * x0 * x0; where x0 is the first approximation to the reciprocal of the divisor d, and x1 is a better approximation.
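
A sketch of that recipe with SSE intrinsics in C# (names are illustrative): rcpps gives ~12 bits, and one Newton-Raphson step x1 = x0 * (2 - d * x0) raises it to ~23 bits.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector128<float> FastDivide(Vector128<float> dividend, Vector128<float> divisor)
{
    Vector128<float> x0  = Sse.Reciprocal(divisor);                 // rcpps, ~12-bit approximation
    Vector128<float> two = Vector128.Create(2.0f);
    Vector128<float> x1  = Sse.Multiply(x0, Sse.Subtract(two, Sse.Multiply(divisor, x0)));
    return Sse.Multiply(dividend, x1);                              // dividend * (1 / divisor)
}
```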

  • A fast way of calculating an approximate square root is to multiply the reciprocal square root of x by x: sqrt(x) = x * rsqrt(x). The reciprocal square root instructions give the reciprocal square root with a precision of at least 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson iteration formula: x0 = rsqrtss(a); x1 = 0.5 * x0 * (3 - (a * x0) * x0) where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation.

  • Warm-up period for YMM and ZMM vector instructions: The processor turns off the upper parts of the vector execution units when they are not used, in order to save power. Instructions with 256-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 56,000 clock cycles or 14 μs. A sequence of code containing 256-bit vector operations will run at full speed after this warm-up period. The processor returns to the mode of slow 256-bit execution 2.7 million clock cycles, or 675 μs, after the last 256-bit instruction... It is possible to prepare the processor for executing YMM or ZMM instructions by giving it a dummy YMM or ZMM instruction at least 56,000 clock cycles before a critical sequence of vector instructions.

Memory Access

  • If there is a blend of reads and writes on the bus, changing the code to separate these bus transactions into read phases and write phases can help performance.

  • A cache works most efficiently when the data are accessed sequentially. It works somewhat less efficiently when data are accessed backwards and much less efficiently when data are accessed in a random manner. This applies to reading as well as writing data

  • The number of arrays accessed in a loop should not exceed the number of read/write buffers in the microprocessor.

Alignment

  • AVX: Align data to 32-byte boundary when possible. Prefer store alignment over load alignment.

Load

  • Loads can execute on port 2 and port 3 --> two loads per clock cycle
  • to exploit this bandwidth, the data has to stay in the L1 data cache or it should be accessed sequentially, enabling the hardware prefetchers to bring the data to the L1 data cache in time
  • Load latency from L1D cache...best case is 4 cycles, which applies to load operations to general purpose registers using one of the following:
    • one register
    • a base register plus an offset that is smaller than 2048
  • A bank conflict happens when two simultaneous load operations have the same bits 2-5 of their linear address but they are not from the same set in the cache (bits 6-12)
  • If it is necessary to extract a non-aligned portion of stored data, read out the smallest aligned portion that completely contains the data and shift/mask the data as necessary
  • Avoid several small loads after large stores to the same area of memory by using a single large read and register copies as needed.

Prefetching

  • If the data is arranged into a set of streams, the automatic hardware prefetcher can prefetch data

  • If the data is accessed in a non-sequential manner, the automatic hardware prefetcher cannot prefetch the data

  • The prefetcher can recognize up to 8 concurrent streams.

  • Streamer: loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line.

Hardware prefetch

  • Automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data misses. It will attempt to prefetch two cache lines ahead of the prefetch stream.
  • The automatic hardware prefetcher is most effective if the strides of two successive cache misses remain less than the trigger threshold distance and close to 64 bytes
  • The hardware prefetcher requires a couple of misses before it starts operating
  • It will not prefetch across a 4-KByte page boundary

Software prefetch

  • Interleave prefetches with computation instructions. For best performance, software prefetch instructions must be interspersed with computational instructions in the instruction sequence (rather than clustered together).
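
A hedged sketch of interleaving a software prefetch with the computation; the prefetch distance of 16 elements is an assumption and has to be tuned for the target machine and access pattern.

```csharp
using System.Runtime.Intrinsics.X86;

static unsafe double SumWithPrefetch(double* data, nuint length)
{
    const nuint prefetchDistance = 16;   // illustrative; tune per workload
    double sum = 0;

    for (nuint i = 0; i < length; ++i)
    {
        if (i + prefetchDistance < length)
        {
            Sse.Prefetch0(data + i + prefetchDistance);   // prefetcht0 a fixed distance ahead
        }

        sum += data[i];
    }

    return sum;
}
```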

  • PREFETCH is mainly designed to improve application performance by hiding memory latency in the background.

  • Do strip-mining: partition loops so that the dataset fits into second-level cache

  • Use PREFETCHNTA if the data is only used once or the dataset fits into 32 KBytes (one way of second-level cache). Use PREFETCHT0 if the dataset exceeds 32 KBytes.

  • Prefetching is useful for inner loops that have heavy computations, or are close to the boundary between being compute-bound and memory-bandwidth-bound. It is probably not very useful for loops which are predominately memory bandwidth-bound.

  • When data is already located in the first level cache, prefetching can be useless and could even slow down the performance because the extra μops either back up waiting for outstanding memory accesses or may be dropped altogether.

  • Using software prefetches delays the hardware prefetcher from starting to fetch data needed by the processor core

Store

  • The memory system only sends stores to memory (including cache) after store retirement
  • Store data can be forwarded from a store to a subsequent load from the same address to give a much shorter store-load latency
  • The data of a load which is forwarded from a store must be completely contained within the store data.
  • Resolve cache management issues. Minimize the disturbance of temporal data held within processor’s caches by using streaming store instructions.
  • Write combining -- successive writes to the same cache line are combined.
  • Port 7 can only be used for stores, and only if the store addressing is simple (i.e. [base_reg + offset] where offset is in [0, 2048))

Streaming stores / non-temporal

  • must be aligned to a 16-byte boundary or a general protection exception will occur
  • store data from a register to memory. They are implicitly weakly-ordered, do not write-allocate, and so minimize cache pollution
  • Because streaming stores are weakly ordered, a fencing operation is required to ensure that the stored data is flushed from the processor to memory. Failure to use an appropriate fence may result in data being “trapped” within the processor and will prevent visibility of this data by other processors or system agents.
  • As a rule of thumb, it can be good to use non-temporal writes when moving data blocks bigger than half the size of the last-level cache.
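
A sketch combining the points above, assuming the destination is 32-byte aligned and the length is a multiple of the vector width (the helper name is illustrative).

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe void FillNonTemporal(float* ptr, nuint length, float value)
{
    Vector256<float> v = Vector256.Create(value);

    for (nuint i = 0; i < length; i += 8)
    {
        Avx.StoreAlignedNonTemporal(ptr + i, v);   // vmovntps, bypasses the cache
    }

    Sse.StoreFence();   // fence so the weakly-ordered stores become globally visible
}
```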

Cache organization

  • Organized in sets of cachelines. This means that a particular memory address cannot be loaded into an arbitrary cache line. Only one of the sets can be used, but any of the cachelines in the set.
  • (set) = (memory address) / (line size) % (number of sets)
  • (critical stride) = (number of sets) x (line size) = (total cache size) / (number of ways)
  • Variables whose distance in memory is a multiple of the critical stride will contend for the same cache lines.
|     | Ways | Cache size kiB | Cache size bytes | Line size bytes | # of lines | Sets | Critical stride |
|-----|-----:|---------------:|-----------------:|----------------:|-----------:|-----:|----------------:|
| L1  |    8 |             32 |            32768 |              64 |        512 |   64 |            4096 |
| L2  |    4 |            256 |           262144 |              64 |       4096 | 1024 |           65536 |
| L3  |   12 |           6144 |          6291456 |              64 |      98304 | 8192 |          524288 |

So strides that are high powers of 2 should be avoided. See the demo (below) and the calculation sheet.
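
Separately from that demo, here is a minimal sketch of how the critical stride shows up: a column-wise traversal whose stride is exactly 4096 bytes (512 doubles, the L1 critical stride from the table) maps every access of a column to the same L1 set, while padding the row length (e.g. to 520 doubles) avoids the conflicts.

```csharp
// Sketch: column-wise traversal of a row-major matrix.
// cols = 512 --> stride of 512 * 8 B = 4096 B = L1 critical stride (bad)
// cols = 520 --> same data, padded rows, stride is no longer a power of two (better)
static double SumColumn(double[] matrix, int cols, int rows, int column)
{
    double sum = 0;

    for (int r = 0; r < rows; ++r)
    {
        sum += matrix[r * cols + column];   // stride of cols * sizeof(double) bytes
    }

    return sum;
}
```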

Multi-threading

False sharing

  • On processors based on Intel NetBurst microarchitecture, bus reads typically fetch 128 bytes into the cache, so the optimal spacing to minimize eviction of cached data is 128 bytes. To prevent false sharing, synchronization variables and system objects (such as a critical section) should be allocated to reside alone in a 128-byte region and aligned to a 128-byte boundary.
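
A hedged C# sketch of such padding, using 128 bytes as in the note above (64 bytes is the common figure on current microarchitectures); the type and field names are illustrative.

```csharp
using System;
using System.Runtime.InteropServices;

// Each counter occupies its own 128-byte region, so per-thread updates
// don't invalidate each other's cache lines.
[StructLayout(LayoutKind.Explicit, Size = 128)]
public struct PaddedCounter
{
    [FieldOffset(0)] public long Value;
}

public static class Counters
{
    // one padded slot per thread; neighbouring slots no longer share a cache line
    public static readonly PaddedCounter[] PerThread = new PaddedCounter[Environment.ProcessorCount];
}
```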

gfoidl commented Nov 13, 2020

(Loop) Induction variables

If the floating-point value of the loop counter is needed for some other purpose, then it is better to have both an integer counter and a floating-point counter.
It's much faster to add to a floating-point counter than to convert the integer counter to a float.

code
using System;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
#if !DEBUG
using BenchmarkDotNet.Running;
#endif

// By default, methods with loops are jitted directly at Tier-1, so there will be a check for static initialization.
// This env-var tells the JIT to use tiered compilation (TC) for loops too.
Environment.SetEnvironmentVariable("COMPlus_TC_QuickJitForLoops", "1");

Bench bench = new();
bench.A();
double[] res0 = bench.SinTable.ToArray();
bench.B();
double[] res1 = bench.SinTable.ToArray();

for (int i = 0; i < res0.Length; ++i)
{
    double r0 = res0[i];
    double r1 = res1[i];

    if (Math.Abs(r0 - r1) > 1e-10)
    {
        Console.WriteLine($"error at {i}");
    }
}

#if !DEBUG
BenchmarkRunner.Run<Bench>();
#endif

[DisassemblyDiagnoser]
//[HardwareCounters(HardwareCounter.CacheMisses, HardwareCounter.LlcMisses, HardwareCounter.LlcReference)]
public unsafe class Bench
{
    private const int N = 1_000;
    private readonly double[] _array = GC.AllocateArray<double>(N, pinned: true);
    private readonly double* _ptr;

    public Bench()
    {
        _ptr = (double*)Marshal.UnsafeAddrOfPinnedArrayElement(_array, 0);
    }

    public Span<double> SinTable => _array;

    [Benchmark(Baseline = true)]
    public void A()
    {
        double* ptr = _ptr;

        for (nuint i = 0; i < N; ++i)
        {
            ptr[i] = Math.Sin(0.01 * i);
        }
    }

    [Benchmark]
    public void A1()
    {
        double* ptr = _ptr;
        double scale = 0.01;

        for (nuint i = 0; i < N; ++i)
        {
            ptr[i] = Math.Sin(scale * i);
        }
    }

    [Benchmark]
    public void B()
    {
        double* ptr = _ptr;
        double x = 0d;

        for (nuint i = 0; i < N; ++i, x += 0.01)
        {
            ptr[i] = Math.Sin(x);
        }
    }

    [Benchmark]
    public void B1()
    {
        double* ptr = _ptr;
        double x = 0d;
        double inc = 0.01;

        for (nuint i = 0; i < N; ++i, x += inc)
        {
            ptr[i] = Math.Sin(x);
        }
    }

    private static readonly long s_inc = BitConverter.DoubleToInt64Bits(0.01);
    private static double Inc => BitConverter.Int64BitsToDouble(s_inc);

    [Benchmark]
    public void B2()
    {
        double* ptr = _ptr;
        double x = 0d;

        for (nuint i = 0; i < N; ++i, x += Inc)
        {
            ptr[i] = Math.Sin(x);
        }
    }
}
asm
; Bench.A()
       push      rdi
       push      rsi
       sub       rsp,28
       vzeroupper
       mov       rsi,[rcx+10]
       xor       edi,edi
M00_L00:
       vxorps    xmm0,xmm0,xmm0
       vcvtsi2sd xmm0,xmm0,rdi
       vmulsd    xmm0,xmm0,qword ptr [7FFEBB6D47A8]
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       cmp       rdi,3E8
       jb        short M00_L00
       add       rsp,28
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 61

; Bench.A1()
       push      rdi
       push      rsi
       sub       rsp,28
       vzeroupper
       mov       rsi,[rcx+10]
       xor       edi,edi
M00_L00:
       vxorps    xmm0,xmm0,xmm0
       vcvtsi2sd xmm0,xmm0,rdi
       vmulsd    xmm0,xmm0,qword ptr [7FFEBB6E48E8]
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       cmp       rdi,3E8
       jb        short M00_L00
       add       rsp,28
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 61

; Bench.B()
       push      rdi
       push      rsi
       sub       rsp,38
       vzeroupper
       vmovaps   [rsp+20],xmm6
       mov       rsi,[rcx+10]
       vxorps    xmm6,xmm6,xmm6
       xor       edi,edi
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vaddsd    xmm6,xmm6,qword ptr [7FFEBB6C48F8]
       cmp       rdi,3E8
       jb        short M00_L00
       vmovaps   xmm6,[rsp+20]
       add       rsp,38
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 72

; Bench.B1()
       push      rdi
       push      rsi
       sub       rsp,38
       vzeroupper
       vmovaps   [rsp+20],xmm6
       mov       rsi,[rcx+10]
       vxorps    xmm6,xmm6,xmm6
       xor       edi,edi
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vaddsd    xmm6,xmm6,qword ptr [7FFEBB6C48F8]
       cmp       rdi,3E8
       jb        short M00_L00
       vmovaps   xmm6,[rsp+20]
       add       rsp,38
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 72

; Bench.B2()
       push      rdi
       push      rsi
       sub       rsp,48
       vzeroupper
       vmovaps   [rsp+30],xmm6
       vmovaps   [rsp+20],xmm7
       mov       rsi,[rcx+10]
       vxorps    xmm6,xmm6,xmm6
       xor       edi,edi
       mov       rax,7AE147AE147B
       vmovq     xmm7,rax
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vmovaps   xmm0,xmm7
       vaddsd    xmm6,xmm6,xmm0
       cmp       rdi,3E8
       jb        short M00_L00
       vmovaps   xmm6,[rsp+30]
       vmovaps   xmm7,[rsp+20]
       add       rsp,48
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 99
machine info
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.630 (2004/?/20H1)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
| Method |     Mean |     Error |    StdDev | Ratio | Code Size |
|------- |---------:|----------:|----------:|------:|----------:|
|      A | 9.173 μs | 0.0917 μs | 0.0858 μs |  1.00 |      61 B |
|     A1 | 9.223 μs | 0.1036 μs | 0.0969 μs |  1.01 |      61 B |
|      B | 8.928 μs | 0.0758 μs | 0.0633 μs |  0.97 |      72 B |
|     B1 | 8.887 μs | 0.0723 μs | 0.0641 μs |  0.97 |      72 B |
|     B2 | 8.838 μs | 0.0746 μs | 0.0661 μs |  0.96 |      99 B |

Note: The calculation of a function of the loop counter by the induction variable method makes a loop-carried dependency chain. If this chain is too long then it may be advantageous to calculate each value from a value that is two or more iterations back.
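
A hedged sketch of that idea applied to B above (it would sit in the Bench class, but is not part of the measured results; the values may differ from B in the last bits because of the different summation order): two induction variables, each advanced by 2 * inc, halve the length of each loop-carried add chain.

```csharp
[Benchmark]
public void B3()
{
    double* ptr = _ptr;
    double inc  = 0.01;
    double x0   = 0d;        // values for even indices
    double x1   = inc;       // values for odd indices

    // assumes N is even (N = 1_000 here)
    for (nuint i = 0; i < N; i += 2, x0 += 2 * inc, x1 += 2 * inc)
    {
        ptr[i]     = Math.Sin(x0);
        ptr[i + 1] = Math.Sin(x1);
    }
}
```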

Note 2: Once the dotnet/runtime issue "Double constants usage in a loop can be CSEed" is resolved, the codegen for some methods will be better and the trick used for B2 (storing the double constant as a long) won't be necessary.
