Processor optimizations

Key takeaways from Intel and Agner Fog

Also see Performance Speed Limits from Travis Downs.

32/64 bit hints

  • Use the 32-bit versions of instructions in 64-bit mode to reduce code size unless the 64-bit version is necessary to access 64-bit data or additional registers

Calling conventions

  • win-x64: No more than four parameters can be transferred in registers, regardless of type
  • linux-x64: The first six integer parameters are transferred in RDI, RSI, RDX, RCX, R8, R9, respectively. The first eight floating point parameters are transferred in XMM0-XMM7. All these registers can be used, so that a maximum of fourteen parameters can be transferred in registers

Dependencies / out of order

  • Prevent false dependencies by writing to a full register rather than a partial register.
  • Any pointer or index registers should be calculated as early as possible so that the CPU can verify that the addresses of memory operands are different.
  • A memory write may not be able to execute before a preceding write, but the write buffers can hold a number of pending writes, typically four or more.
  • Organize your code so that some independent instructions, which don't depend on the result of long-latency operations, are available to execute after those operations.
  • The optimal number of accumulators is the latency of the instruction divided by its reciprocal throughput (see the sketch after this list).
  • Keep loops small so that they are likely to fit into the μop cache or loopback buffer. Avoid unnecessary loop unrolling.
  • y = a + b + c + d can be changed to y = (a + b) + (c + d) so that the two parenthesized sums can be calculated in parallel.
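
A minimal C# sketch of the multiple-accumulator idea, assuming a plain sum reduction; the count of four accumulators is illustrative and should really be latency / reciprocal throughput of the add instruction on the target CPU. Note that reassociating floating-point sums can change the result in the last bits.

```csharp
// Sketch: sum reduction with four independent accumulators, so the loop is
// not limited by a single loop-carried add chain.
static double Sum(ReadOnlySpan<double> data)
{
    double a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    int i = 0;

    for (; i <= data.Length - 4; i += 4)
    {
        a0 += data[i + 0];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }

    for (; i < data.Length; ++i)        // remainder
    {
        a0 += data[i];
    }

    return (a0 + a1) + (a2 + a3);       // pairwise, cf. the (a + b) + (c + d) hint
}
```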

Branch prediction

  • Forward conditional branches are predicted not taken (fall through is taken)
  • Backward conditional branches are predicted taken
  • Unconditional branches (jmp) are taken
  • A conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%
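
A hedged C# illustration of the jump-vs-conditional-move trade-off; whether RyuJIT actually emits a branch for the ternary or a cmov/branchless sequence is not guaranteed, so treat both variants as assumptions.

```csharp
// Branchy: cheap when the branch predictor is right most of the time (> ~75%).
static int MaxBranchy(int a, int b) => a > b ? a : b;

// Branchless: constant cost, no misprediction penalty, but it joins the
// dependency chain of both inputs.
static int MaxBranchless(int a, int b)
{
    int mask = (a - b) >> 31;          // 0 if a >= b, -1 otherwise (assumes a - b does not overflow)
    return (a & ~mask) | (b & mask);
}
```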

Macro fusion

  • TEST can be fused with all conditional jumps
  • Prefer TEST over CMP if possible
  • Use unsigned variables and unsigned jumps when possible
  • Unsigned iteration counts are compatible with macro fusion --> use nuint in loops (see the sketch after this list)
  • TEST is better than CMP ..., 0 because the instruction size is smaller
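
A small sketch of the nuint hint; whether the resulting cmp/jb pair actually macro-fuses depends on the JIT version and the microarchitecture.

```csharp
// Sketch: unsigned (nuint) loop counter, per the hint above.
static unsafe double SumPtr(double* ptr, nuint length)
{
    double sum = 0;

    for (nuint i = 0; i < length; ++i)
    {
        sum += ptr[i];
    }

    return sum;
}
```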

Integer/SIMD Operands

  • (note necessary) The PALIGNR instruction is most valuable when loading or storing unaligned data where the address is shifted by a few bytes.

  • Replace Shuffles with Blends: AVX shuffles are executed only on port 5, while blends are also executed on port 0

  • Gather can be useful for non-sequential access and can be faster than creating the vector from scalars (see the sketch below).
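
A hedged AVX2 gather sketch (the method name and parameters are illustrative): it builds a Vector256<float> from eight non-contiguous elements instead of assembling it from scalars.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe Vector256<float> Gather8(float* data, ReadOnlySpan<int> indices)
{
    // assumes indices.Length >= 8 and that all indices are in bounds of data
    fixed (int* idx = indices)
    {
        Vector256<int> vindex = Avx.LoadVector256(idx);   // eight indices
        return Avx2.GatherVector256(data, vindex, 4);     // scale = sizeof(float)
    }
}
```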

  • Generating constants per table from Agner Fog

Floating-Point/SIMD Operands

  • The MOVSD REG, REG instruction writes only the bottom 64 bits of a register, not all 128 bits. This introduces a dependence on the preceding instruction that produces the upper 64 bits (even if those bits are no longer wanted). The dependence inhibits register renaming and thereby reduces parallelism.

  • Approximate divisions by using one of the fast reciprocal instructions on the divisor and then multiplying by the dividend. The precision can be increased from 12 to 23 bits by using the principle of Newton-Raphson iteration: x0 = rcpss(d); x1 = x0 * (2 - d * x0) = 2 * x0 - d * x0 * x0; where x0 is the first approximation to the reciprocal of the divisor d, and x1 is a better approximation.
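
A sketch of that recipe with SSE intrinsics in C# (names are illustrative): rcpps gives ~12 bits, and one Newton-Raphson step x1 = x0 * (2 - d * x0) raises it to ~23 bits.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static Vector128<float> FastDivide(Vector128<float> dividend, Vector128<float> divisor)
{
    Vector128<float> x0  = Sse.Reciprocal(divisor);                 // rcpps, ~12-bit approximation
    Vector128<float> two = Vector128.Create(2.0f);
    Vector128<float> x1  = Sse.Multiply(x0, Sse.Subtract(two, Sse.Multiply(divisor, x0)));
    return Sse.Multiply(dividend, x1);                              // dividend * (1 / divisor)
}
```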

  • A fast way of calculating an approximate square root is to multiply the reciprocal square root of x by x: sqrt(x) = x * rsqrt(x). The reciprocal square root instructions give the reciprocal square root with a precision of at least 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson iteration formula: x0 = rsqrtss(a); x1 = 0.5 * x0 * (3 - (a * x0) * x0) where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation.

  • Warm-up period for YMM and ZMM vector instructions: The processor turns off the upper parts of the vector execution units when they are not used, in order to save power. Instructions with 256-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 56,000 clock cycles or 14 μs. A sequence of code containing 256-bit vector operations will run at full speed after this warm-up period. The processor returns to the mode of slow 256-bit execution 2.7 million clock cycles, or 675 μs, after the last 256-bit instruction... It is possible to prepare the processor for executing YMM or ZMM instructions by giving it a dummy YMM or ZMM instruction at least 56,000 clock cycles before a critical sequence of vector instructions.

Memory Access

  • If there is a blend of reads and writes on the bus, changing the code to separate these bus transactions into read phases and write phases can help performance.

  • A cache works most efficiently when the data are accessed sequentially. It works somewhat less efficiently when data are accessed backwards and much less efficiently when data are accessed in a random manner. This applies to reading as well as writing data

  • The number of arrays accessed in a loop should not exceed the number of read/write buffers in the microprocessor.

Alignment

  • AVX: Align data to 32-byte boundary when possible. Prefer store alignment over load alignment.

Load

  • Loads can execute on port 2 and port 3 --> two loads per clock cycle
  • to exploit this bandwidth, the data has to stay in the L1 data cache or it should be accessed sequentially, enabling the hardware prefetchers to bring the data to the L1 data cache in time
  • Load latency from L1D cache...best case is 4 cycles, which applies to load operations to general purpose registers using one of the following:
    • one register
    • a base register plus an offset that is smaller than 2048
  • A bank conflict happens when two simultaneous load operations have the same bits 2-5 of their linear address but they are not from the same set in the cache (bits 6-12)
  • If it is necessary to extract a non-aligned portion of stored data, read out the smallest aligned portion that completely contains the data and shift/mask the data as necessary
  • Avoid several small loads after large stores to the same area of memory by using a single large read and register copies as needed.

Prefetching

  • If the data is arranged into a set of streams, the automatic hardware prefetcher can prefetch data

  • If the data is accessed in a non-sequential manner, the automatic hardware prefetcher cannot prefetch the data

  • The prefetcher can recognize up to 8 concurrent streams.

  • Streamer: loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line.

Hardware prefetch

  • Automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data misses. It will attempt to prefetch two cache lines ahead of the prefetch stream.
  • The automatic hardware prefetcher is most effective if the strides of two successive cache misses remain less than the trigger threshold distance and close to 64 bytes
  • The hardware prefetcher requires a couple of misses before it starts operating
  • It will not prefetch across a 4-KByte page boundary

Software prefetch

  • Interleave prefetches with computation instructions. For best performance, software prefetch instructions must be interspersed with computational instructions in the instruction sequence (rather than clustered together).
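
A hedged sketch of interleaving a software prefetch with the computation; the prefetch distance of 16 elements is an assumption and has to be tuned for the target machine and access pattern.

```csharp
using System.Runtime.Intrinsics.X86;

static unsafe double SumWithPrefetch(double* data, nuint length)
{
    const nuint prefetchDistance = 16;   // illustrative; tune per workload
    double sum = 0;

    for (nuint i = 0; i < length; ++i)
    {
        if (i + prefetchDistance < length)
        {
            Sse.Prefetch0(data + i + prefetchDistance);   // prefetcht0 a fixed distance ahead
        }

        sum += data[i];
    }

    return sum;
}
```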

  • PREFETCH is mainly designed to improve application performance by hiding memory latency in the background.

  • Do strip-mining: partition loops so that the dataset fits into second-level cache

  • Use PREFETCHNTA if the data is only used once or the dataset fits into 32 KBytes (one way of second-level cache). Use PREFETCHT0 if the dataset exceeds 32 KBytes.

  • Prefetching is useful for inner loops that have heavy computations, or are close to the boundary between being compute-bound and memory-bandwidth-bound. It is probably not very useful for loops which are predominately memory bandwidth-bound.

  • When data is already located in the first level cache, prefetching can be useless and could even slow down the performance because the extra μops either back up waiting for outstanding memory accesses or may be dropped altogether.

  • Using software prefetches delays the hardware prefetcher from starting to fetch data needed by the processor core

Store

  • The memory system only sends stores to memory (including cache) after store retirement
  • Store data can be forwarded from a store to a subsequent load from the same address to give a much shorter store-load latency
  • The data of a load which is forwarded from a store must be completely contained within the store data.
  • Resolve cache management issues. Minimize the disturbance of temporal data held within processor’s caches by using streaming store instructions.
  • Write combining -- successive writes to the same cache line are combined.
  • Port 7 can only be used for stores, and only if the store addressing is simple (i.e. [base_reg + offset] where offset is in [0, 2048))

Streaming stores / non-temporal

  • must be aligned to a 16-byte boundary or a general protection exception will occur
  • store data from a register to memory. They are implicitly weakly-ordered, do not write-allocate, and so minimize cache pollution
  • Because streaming stores are weakly ordered, a fencing operation is required to ensure that the stored data is flushed from the processor to memory. Failure to use an appropriate fence may result in data being “trapped” within the processor and will prevent visibility of this data by other processors or system agents.
  • As a rule of thumb, it can be good to use non-temporal writes when moving data blocks bigger than half the size of the last-level cache.
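
A sketch combining the points above, assuming the destination is 32-byte aligned and the length is a multiple of the vector width (the helper name is illustrative).

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe void FillNonTemporal(float* ptr, nuint length, float value)
{
    Vector256<float> v = Vector256.Create(value);

    for (nuint i = 0; i < length; i += 8)
    {
        Avx.StoreAlignedNonTemporal(ptr + i, v);   // vmovntps, bypasses the cache
    }

    Sse.StoreFence();   // fence so the weakly-ordered stores become globally visible
}
```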

Cache organization

  • Organized in sets of cachelines. This means that a particular memory address cannot be loaded into an arbitrary cache line. Only one of the sets can be used, but any of the cachelines in the set.
  • (set) = (memory address) / (line size) % (number of sets)
  • (critical stride) = (number of sets) x (line size) = (total cache size) / (number of ways)
  • Variables whose distance in memory is a multiple of the critical stride will contend for the same cache lines.
|     | Ways | Cache size kiB | Cache size bytes | Line size bytes | # of lines | Sets | Critical stride |
|-----|-----:|---------------:|-----------------:|----------------:|-----------:|-----:|----------------:|
| L1  |    8 |             32 |            32768 |              64 |        512 |   64 |            4096 |
| L2  |    4 |            256 |           262144 |              64 |       4096 | 1024 |           65536 |
| L3  |   12 |           6144 |          6291456 |              64 |      98304 | 8192 |          524288 |

So strides that are high powers of 2 should be avoided. See the demo (below) and the calculation sheet.
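
Separately from that demo, here is a minimal sketch of how the critical stride shows up: a column-wise traversal whose stride is exactly 4096 bytes (512 doubles, the L1 critical stride from the table) maps every access of a column to the same L1 set, while padding the row length (e.g. to 520 doubles) avoids the conflicts.

```csharp
// Sketch: column-wise traversal of a row-major matrix.
// cols = 512 --> stride of 512 * 8 B = 4096 B = L1 critical stride (bad)
// cols = 520 --> same data, padded rows, stride is no longer a power of two (better)
static double SumColumn(double[] matrix, int cols, int rows, int column)
{
    double sum = 0;

    for (int r = 0; r < rows; ++r)
    {
        sum += matrix[r * cols + column];   // stride of cols * sizeof(double) bytes
    }

    return sum;
}
```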

Multi-threading

False sharing

  • On processors based on Intel NetBurst microarchitecture, bus reads typically fetch 128 bytes into the cache, so the optimal spacing to minimize eviction of cached data is 128 bytes. To prevent false sharing, synchronization variables and system objects (such as a critical section) should be allocated to reside alone in a 128-byte region and aligned to a 128-byte boundary.
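
A hedged C# sketch of such padding, using 128 bytes as in the note above (64 bytes is the common figure on current microarchitectures); the type and field names are illustrative.

```csharp
using System;
using System.Runtime.InteropServices;

// Each counter occupies its own 128-byte region, so per-thread updates
// don't invalidate each other's cache lines.
[StructLayout(LayoutKind.Explicit, Size = 128)]
public struct PaddedCounter
{
    [FieldOffset(0)] public long Value;
}

public static class Counters
{
    // one padded slot per thread; neighbouring slots no longer share a cache line
    public static readonly PaddedCounter[] PerThread = new PaddedCounter[Environment.ProcessorCount];
}
```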

gfoidl commented Nov 13, 2020

(Loop) Induction variables

If the floating-point value of the loop counter is needed for some other purpose, then it is better to have both an integer counter and a floating-point counter.
It's much faster to add to a floating-point counter than to convert the integer counter to a float.

code
using System;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
#if !DEBUG
using BenchmarkDotNet.Running;
#endif

// By default, methods with loops are jitted directly at Tier-1, so there will be a check for static initialization.
// This env-var tells the JIT to use tiered compilation (TC) for loops too.
Environment.SetEnvironmentVariable("COMPlus_TC_QuickJitForLoops", "1");

Bench bench = new();
bench.A();
double[] res0 = bench.SinTable.ToArray();
bench.B();
double[] res1 = bench.SinTable.ToArray();

for (int i = 0; i < res0.Length; ++i)
{
    double r0 = res0[i];
    double r1 = res1[i];

    if (Math.Abs(r0 - r1) > 1e-10)
    {
        Console.WriteLine($"error at {i}");
    }
}

#if !DEBUG
BenchmarkRunner.Run<Bench>();
#endif

[DisassemblyDiagnoser]
//[HardwareCounters(HardwareCounter.CacheMisses, HardwareCounter.LlcMisses, HardwareCounter.LlcReference)]
public unsafe class Bench
{
    private const int N = 1_000;
    private readonly double[] _array = GC.AllocateArray<double>(N, pinned: true);
    private readonly double* _ptr;

    public Bench()
    {
        _ptr = (double*)Marshal.UnsafeAddrOfPinnedArrayElement(_array, 0);
    }

    public Span<double> SinTable => _array;

    [Benchmark(Baseline = true)]
    public void A()
    {
        double* ptr = _ptr;

        for (nuint i = 0; i < N; ++i)
        {
            ptr[i] = Math.Sin(0.01 * i);
        }
    }

    [Benchmark]
    public void A1()
    {
        double* ptr = _ptr;
        double scale = 0.01;

        for (nuint i = 0; i < N; ++i)
        {
            ptr[i] = Math.Sin(scale * i);
        }
    }

    [Benchmark]
    public void B()
    {
        double* ptr = _ptr;
        double x = 0d;

        for (nuint i = 0; i < N; ++i, x += 0.01)
        {
            ptr[i] = Math.Sin(x);
        }
    }

    [Benchmark]
    public void B1()
    {
        double* ptr = _ptr;
        double x = 0d;
        double inc = 0.01;

        for (nuint i = 0; i < N; ++i, x += inc)
        {
            ptr[i] = Math.Sin(x);
        }
    }

    private static readonly long s_inc = BitConverter.DoubleToInt64Bits(0.01);
    private static double Inc => BitConverter.Int64BitsToDouble(s_inc);

    [Benchmark]
    public void B2()
    {
        double* ptr = _ptr;
        double x = 0d;

        for (nuint i = 0; i < N; ++i, x += Inc)
        {
            ptr[i] = Math.Sin(x);
        }
    }
}
asm
; Bench.A()
       push      rdi
       push      rsi
       sub       rsp,28
       vzeroupper
       mov       rsi,[rcx+10]
       xor       edi,edi
M00_L00:
       vxorps    xmm0,xmm0,xmm0
       vcvtsi2sd xmm0,xmm0,rdi
       vmulsd    xmm0,xmm0,qword ptr [7FFEBB6D47A8]
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       cmp       rdi,3E8
       jb        short M00_L00
       add       rsp,28
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 61

; Bench.A1()
       push      rdi
       push      rsi
       sub       rsp,28
       vzeroupper
       mov       rsi,[rcx+10]
       xor       edi,edi
M00_L00:
       vxorps    xmm0,xmm0,xmm0
       vcvtsi2sd xmm0,xmm0,rdi
       vmulsd    xmm0,xmm0,qword ptr [7FFEBB6E48E8]
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       cmp       rdi,3E8
       jb        short M00_L00
       add       rsp,28
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 61

; Bench.B()
       push      rdi
       push      rsi
       sub       rsp,38
       vzeroupper
       vmovaps   [rsp+20],xmm6
       mov       rsi,[rcx+10]
       vxorps    xmm6,xmm6,xmm6
       xor       edi,edi
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vaddsd    xmm6,xmm6,qword ptr [7FFEBB6C48F8]
       cmp       rdi,3E8
       jb        short M00_L00
       vmovaps   xmm6,[rsp+20]
       add       rsp,38
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 72

; Bench.B1()
       push      rdi
       push      rsi
       sub       rsp,38
       vzeroupper
       vmovaps   [rsp+20],xmm6
       mov       rsi,[rcx+10]
       vxorps    xmm6,xmm6,xmm6
       xor       edi,edi
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vaddsd    xmm6,xmm6,qword ptr [7FFEBB6C48F8]
       cmp       rdi,3E8
       jb        short M00_L00
       vmovaps   xmm6,[rsp+20]
       add       rsp,38
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 72

; Bench.B2()
       push      rdi
       push      rsi
       sub       rsp,48
       vzeroupper
       vmovaps   [rsp+30],xmm6
       vmovaps   [rsp+20],xmm7
       mov       rsi,[rcx+10]
       vxorps    xmm6,xmm6,xmm6
       xor       edi,edi
       mov       rax,7AE147AE147B
       vmovq     xmm7,rax
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vmovaps   xmm0,xmm7
       vaddsd    xmm6,xmm6,xmm0
       cmp       rdi,3E8
       jb        short M00_L00
       vmovaps   xmm6,[rsp+30]
       vmovaps   xmm7,[rsp+20]
       add       rsp,48
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 99
machine info
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.630 (2004/?/20H1)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
| Method |     Mean |     Error |    StdDev | Ratio | Code Size |
|------- |---------:|----------:|----------:|------:|----------:|
|      A | 9.173 μs | 0.0917 μs | 0.0858 μs |  1.00 |      61 B |
|     A1 | 9.223 μs | 0.1036 μs | 0.0969 μs |  1.01 |      61 B |
|      B | 8.928 μs | 0.0758 μs | 0.0633 μs |  0.97 |      72 B |
|     B1 | 8.887 μs | 0.0723 μs | 0.0641 μs |  0.97 |      72 B |
|     B2 | 8.838 μs | 0.0746 μs | 0.0661 μs |  0.96 |      99 B |

Note: The calculation of a function of the loop counter by the induction variable method makes a loop-carried dependency chain. If this chain is too long then it may be advantageous to calculate each value from a value that is two or more iterations back.
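
A hedged sketch of that idea applied to B above (it would sit in the Bench class, but is not part of the measured results; the values may differ from B in the last bits because of the different summation order): two induction variables, each advanced by 2 * inc, halve the length of each loop-carried add chain.

```csharp
[Benchmark]
public void B3()
{
    double* ptr = _ptr;
    double inc  = 0.01;
    double x0   = 0d;        // values for even indices
    double x1   = inc;       // values for odd indices

    // assumes N is even (N = 1_000 here)
    for (nuint i = 0; i < N; i += 2, x0 += 2 * inc, x1 += 2 * inc)
    {
        ptr[i]     = Math.Sin(x0);
        ptr[i + 1] = Math.Sin(x1);
    }
}
```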

Note 2: Once the dotnet/runtime issue "Double constants usage in a loop can be CSEed" is resolved, the codegen for some methods will be better and the trick used for B2 (storing the double constant as a long) won't be necessary.
