Processor optimizations

Key takeaways from the Intel and Agner Fog optimization manuals

Also see Performance Speed Limits by Travis Downs.

32/64 bit hints

  • Use the 32-bit versions of instructions in 64-bit mode to reduce code size unless the 64-bit version is necessary to access 64-bit data or additional registers

Calling conventions

  • win-x64: No more than four parameters can be transferred in registers, regardless of type
  • linux-x64: The first six integer parameters are transferred in RDI, RSI, RDX, RCX, R8, R9, respectively. The first eight floating point parameters are transferred in XMM0-XMM7. All these registers can be used, so that a maximum of fourteen parameters can be transferred in registers

Dependencies / out of order

  • Prevent false dependencies by writing to a full register rather than a partial register.
  • Calculate pointer and index registers as early as possible, so that the CPU can verify that the addresses of memory operands are different.
  • A memory write may not be able to execute before a preceding write, but the write buffers can hold a number of pending writes, typically four or more.
  • Organize code so that some independent instructions, which don't depend on the result of a long-latency operation, can execute while that operation is in flight.
  • The optimal number of accumulators is the latency of the instruction divided by its reciprocal throughput (see the sketch after this list).
  • Keep loops small so that they are likely to fit into the μop cache or loopback buffer. Avoid unnecessary loop unrolling.
  • y = a + b + c + d can be changed to y = (a + b) + (c + d) so that the two parenthesized sums can be calculated in parallel.
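A minimal C# sketch of the accumulator rule (hypothetical helpers): on a Skylake-class core a float add has ~4 cycles latency and 0.5 cycles reciprocal throughput, so the rule above suggests about eight accumulators; four are used here for brevity.

```csharp
using System;

static class SumDemo
{
    // Single accumulator: every add depends on the previous one,
    // so the loop is limited by the add latency.
    public static float SumSingle(ReadOnlySpan<float> data)
    {
        float sum = 0f;
        for (int i = 0; i < data.Length; i++) sum += data[i];
        return sum;
    }

    // Multiple accumulators: four independent dependency chains can be in flight.
    // Note: floating-point reassociation can change the result slightly.
    public static float SumMultiple(ReadOnlySpan<float> data)
    {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        int i = 0;
        for (; i <= data.Length - 4; i += 4)
        {
            s0 += data[i + 0];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        for (; i < data.Length; i++) s0 += data[i];
        return (s0 + s1) + (s2 + s3);   // pairwise, as in the reassociation hint above
    }
}
```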

Branch prediction

  • Forward conditional branches are predicted not taken (fall-through is assumed)
  • Backward conditional branches are predicted taken
  • Unconditional branches (jmp) are taken
  • A conditional jump is faster than a conditional move if the code is part of a dependency chain and the prediction rate is better than 75%
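A minimal, hypothetical C# sketch of that trade-off: a branchy select is fine when the condition is predictable, while a branch-free arithmetic select avoids mispredictions when the outcome is essentially random (whether the JIT emits a cmov for the ternary depends on the runtime version).

```csharp
static class SelectDemo
{
    // Branchy select: cheap when the branch predictor is right most of the time.
    public static int MaxBranchy(int a, int b) => a > b ? a : b;

    // Branch-free select via a sign mask: no misprediction penalty, but always
    // executes the full dependency chain (and a - b must not overflow).
    public static int MaxBranchless(int a, int b)
    {
        int diff = a - b;
        int mask = diff >> 31;      // 0 if a >= b, -1 if a < b
        return a - (diff & mask);   // a if a >= b, otherwise b
    }
}
```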

Macro fusion

  • TEST can fuse with all conditional jumps
  • Prefer TEST over CMP if possible
  • Use unsigned variables and unsigned jumps when possible
  • An unsigned iteration count is compatible with macro fusion --> prefer nuint for loop counters (see the sketch after this list)
  • TEST is better than CMP ..., 0 because the instruction size is smaller
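A minimal sketch of the nuint hint (the exact codegen depends on the JIT version): with an unsigned native-sized counter the loop-exit test is an unsigned compare/jump pair, which is macro-fusion friendly and needs no sign extension of the index.

```csharp
static class LoopDemo
{
    public static int Sum(int[] data)
    {
        int sum = 0;

        // nuint counter: unsigned compare for the loop exit, index used directly
        // as a native-sized offset.
        for (nuint i = 0; i < (nuint)data.Length; i++)
        {
            sum += data[i];
        }

        return sum;
    }
}
```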

Integer/SIMD Operands

  • (not necessary) The PALIGNR instruction is most valuable when loading or storing unaligned data where the address shifts by a few bytes.

  • Replace Shuffles with Blends: AVX shuffles are executed only on port 5, while blends are also executed on port 0

  • Gather can be useful for non-sequential access and can be faster than creating the vector from scalars (see the sketch after this list).

  • Constants can be generated on the fly, per the table in Agner Fog's optimization manual
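A minimal AVX2 gather sketch in C# (hypothetical helper; assumes the caller supplies a pinned/fixed table and in-range indices): eight floats are loaded from non-sequential positions in one instruction instead of eight scalar loads plus inserts.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe class GatherDemo
{
    // Gathers table[indices[0]], ..., table[indices[7]] into one Vector256<float>.
    // Scale 4 = sizeof(float); requires Avx2.IsSupported.
    public static Vector256<float> Gather(float* table, Vector256<int> indices)
        => Avx2.GatherVector256(table, indices, 4);
}
```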

Floating-Point/SIMD Operands

  • The MOVSD REG, REG instruction writes only the bottom 64 bits of a register, not all 128 bits. This introduces a dependence on the preceding instruction that produces the upper 64 bits (even if those bits are no longer wanted). The dependence inhibits register renaming, and thereby reduces parallelism

  • Approximate divisions by using one of the fast reciprocal instructions on the divisor and then multiplying by the dividend. The precision can be increased from 12 to 23 bits by using the principle of Newton-Raphson iteration: x0 = rcpss(d); x1 = x0 * (2 - d * x0) = 2 * x0 - d * x0 * x0; where x0 is the first approximation to the reciprocal of the divisor d, and x1 is a better approximation (see the sketch after this list).

  • A fast way of calculating an approximate square root is to multiply the reciprocal square root of x by x: sqrt(x) = x * rsqrt(x). The reciprocal square root instructions give the reciprocal square root with a precision of at least 12 bits. You can improve the precision to 23 bits by using the Newton-Raphson iteration formula: x0 = rsqrtss(a); x1 = 0.5 * x0 * (3 - (a * x0) * x0) where x0 is the first approximation to the reciprocal square root of a, and x1 is a better approximation (see the sketch after this list).

  • Warm-up period for YMM and ZMM vector instructions: The processor turns off the upper parts of the vector execution units when they are not used, in order to save power. Instructions with 256-bit vectors have a throughput that is approximately 4.5 times slower than normal during an initial warm-up period of approximately 56,000 clock cycles or 14 μs. A sequence of code containing 256-bit vector operations will run at full speed after this warm-up period. The processor returns to the mode of slow 256-bit execution 2.7 million clock cycles, or 675 μs, after the last 256-bit instruction... It is possible to prepare the processor for executing YMM or ZMM instructions by giving it a dummy YMM or ZMM instruction at least 56,000 clock cycles before a critical sequence of vector instructions.
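Following up on the reciprocal and reciprocal-square-root bullets above, a minimal SSE sketch with one Newton-Raphson refinement step (hypothetical helpers; note that x = 0 needs special handling in FastSqrt, since rsqrt(0) is infinity):

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class FastMath
{
    // a / d approximated via RCPPS plus one refinement: x1 = x0 * (2 - d * x0).
    public static Vector128<float> FastDivide(Vector128<float> a, Vector128<float> d)
    {
        Vector128<float> x0 = Sse.Reciprocal(d);                        // ~12-bit approximation
        Vector128<float> x1 = Sse.Multiply(x0,
            Sse.Subtract(Vector128.Create(2f), Sse.Multiply(d, x0)));   // ~23 bits
        return Sse.Multiply(a, x1);
    }

    // sqrt(x) approximated as x * rsqrt(x) with one refinement:
    // x1 = 0.5 * x0 * (3 - x * x0 * x0).
    public static Vector128<float> FastSqrt(Vector128<float> x)
    {
        Vector128<float> x0 = Sse.ReciprocalSqrt(x);
        Vector128<float> x1 = Sse.Multiply(
            Sse.Multiply(Vector128.Create(0.5f), x0),
            Sse.Subtract(Vector128.Create(3f), Sse.Multiply(Sse.Multiply(x, x0), x0)));
        return Sse.Multiply(x, x1);
    }
}
```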

Memory Access

  • If there is a blend of reads and writes on the bus, changing the code to separate these bus transactions into read phases and write phases can help performance.

  • A cache works most efficiently when the data are accessed sequentially. It works somewhat less efficiently when data are accessed backwards and much less efficiently when data are accessed in a random manner. This applies to reading as well as writing data

  • The number of arrays accessed in a loop should not exceed the number of read/write buffers in the microprocessor.

Alignment

  • AVX: Align data to 32-byte boundary when possible. Prefer store alignment over load alignment.

Load

  • Loads can issue on both port 2 and port 3, so two loads per cycle are possible
  • To exploit this bandwidth, the data has to stay in the L1 data cache or it should be accessed sequentially, enabling the hardware prefetchers to bring the data to the L1 data cache in time
  • Load latency from the L1D cache...best case is 4 cycles, which applies to load operations to general purpose registers using one of the following:
    • one register
    • a base register plus an offset that is smaller than 2048
  • A bank conflict happens when two simultaneous load operations have the same bits 2-5 of their linear address but are not from the same set in the cache (bits 6-12)
  • If it is necessary to extract a non-aligned portion of stored data, read out the smallest aligned portion that completely contains the data and shift/mask the data as necessary
  • Avoid several small loads after large stores to the same area of memory by using a single large read and register copies as needed.

Prefetching

  • If the data is arranged into a set of streams, the automatic hardware prefetcher can prefetch data

  • If the data is accessed in a non-sequential manner, the automatic hardware prefetcher cannot prefetch the data

  • The prefetcher can recognize up to 8 concurrent streams.

  • Streamer: loads data or instructions from memory into the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned to 128 bytes. The first access to one of the two cache lines in this block, while it is still in memory, triggers the streamer to prefetch the pair line

Hardware prefetch

  • Automatic hardware prefetch can bring cache lines into the unified last-level cache based on prior data misses. It will attempt to prefetch two cache lines ahead of the prefetch stream.
  • The automatic hardware prefetcher is most effective if the strides of two successive cache misses remain less than the trigger threshold distance and close to 64 bytes
  • The hardware prefetcher requires a couple of misses before it starts operating
  • It will not prefetch across a 4-KByte page boundary

Software prefetch

  • Interleave prefetches with computation instructions. For best performance, software prefetch instructions must be interspersed with computational instructions in the instruction sequence (rather than clustered together).

  • PREFETCH is mainly designed to improve application performance by hiding memory latency in the background.

  • Do strip-mining: partition loops so that the dataset fits into second-level cache

  • Use PREFETCHNTA if the data is only used once or the dataset fits into 32 KBytes (one way of the second-level cache). Use PREFETCHT0 if the dataset exceeds 32 KBytes.

  • Prefetching is useful for inner loops that have heavy computations, or are close to the boundary between being compute-bound and memory-bandwidth-bound. It is probably not very useful for loops which are predominately memory bandwidth-bound.

  • When data is already located in the first level cache, prefetching can be useless and could even slow down the performance because the extra μops either back up waiting for outstanding memory accesses or may be dropped altogether.

  • Using software prefetches delays the hardware prefetcher from starting to fetch data needed by the processor core

Store

  • The memory system only sends stores to memory (including cache) after store retirement
  • Store data can be forwarded from a store to a subsequent load from the same address, giving a much shorter store-to-load latency
  • The data of a load which is forwarded from a store must be completely contained within the store data (see the sketch after this list)
  • Resolve cache management issues. Minimize the disturbance of temporal data held within the processor's caches by using streaming store instructions.
  • Write combining -- successive writes to the same cache line are combined.
  • Port 7 can only be used for store-address μops, and only if the store addressing is simple (i.e. [base_reg + offset] where offset is in [0, 2048))
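A minimal sketch of the store-forwarding containment rule from above (illustrative only; whether a stall is actually visible depends on the surrounding code and on the JIT keeping the values in memory):

```csharp
using System;
using System.Runtime.CompilerServices;

Span<byte> buf = stackalloc byte[16];

// 8-byte store followed by a 4-byte load fully contained in it: forwarding works.
Unsafe.WriteUnaligned(ref buf[0], 0x1122334455667788UL);
uint fast = Unsafe.ReadUnaligned<uint>(ref buf[0]);

// Two 4-byte stores followed by an 8-byte load spanning both: the load is not
// contained in a single store, so forwarding fails and the load stalls.
Unsafe.WriteUnaligned(ref buf[0], 0x11223344u);
Unsafe.WriteUnaligned(ref buf[4], 0x55667788u);
ulong slow = Unsafe.ReadUnaligned<ulong>(ref buf[0]);

Console.WriteLine($"{fast:X8} {slow:X16}");
```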

Streaming stores / non-temporal

  • The destination must be aligned to a 16-byte boundary or a general protection exception will occur
  • Streaming stores store data from a register to memory. They are implicitly weakly-ordered, do not write-allocate, and so minimize cache pollution
  • Because streaming stores are weakly ordered, a fencing operation is required to ensure that the stored data is flushed from the processor to memory. Failure to use an appropriate fence may result in data being “trapped” within the processor and will prevent visibility of this data by other processors or system agents.
  • As a rule of thumb, it can be good to use non-temporal writes when moving data blocks bigger than half the size of the last-level cache.

Cache organization

  • Organized in sets of cachelines. This means that a particular memory address cannot be loaded into an arbitrary cache line. Only one of the sets can be used, but any of the cachelines in the set.
  • (set) = (memory address) / (line size) % (number of sets)
  • (critical stride) = (number of sets) x (line size) = (total cache size) / (number of ways)
  • Variables whose distance in memory is a multiple of the critical stride will contend for the same cache lines.
|     | Ways | Cache size kiB | Cache Size bytes | Line size bytes | # of lines | Sets | Critical stride |
|-----|-----:|---------------:|-----------------:|----------------:|-----------:|-----:|----------------:|
| L1  |    8 |             32 |            32768 |              64 |        512 |   64 |            4096 |
| L2  |    4 |            256 |           262144 |              64 |       4096 | 1024 |           65536 |
| L3  |   12 |           6144 |          6291456 |              64 |      98304 | 8192 |          524288 |

So strides that are high powers of 2 should be avoided. See the demo (below) and the calculation sheet.

Multi-threading

False sharing

  • On processors based on the Intel NetBurst microarchitecture, bus reads typically fetch 128 bytes into the cache, so the optimal spacing to minimize eviction of cached data is 128 bytes. To prevent false sharing, synchronization variables and system objects (such as a critical section) should be allocated to reside alone in a 128-byte region and aligned to a 128-byte boundary
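A minimal, hypothetical sketch of keeping per-thread counters out of each other's cache lines (here padded to 128 bytes, matching the guideline above):

```csharp
using System.Runtime.InteropServices;

// Unpadded: CounterA and CounterB share a cache line, so two threads updating
// them independently will ping-pong the line between cores (false sharing).
public class SharedCounters
{
    public long CounterA;
    public long CounterB;
}

// Padded: each counter sits alone in its own 128-byte region.
[StructLayout(LayoutKind.Explicit, Size = 256)]
public struct PaddedCounters
{
    [FieldOffset(0)]   public long CounterA;
    [FieldOffset(128)] public long CounterB;
}
```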
gfoidl commented Nov 9, 2020

Prefetching (hardware / software)

Demonstrated with the same copy loop as before.

code
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
#if !DEBUG
using BenchmarkDotNet.Running;
#endif

Bench bench = new();

bench.GlobalSetup();
bench.NonTemporalStoreCopy();
Console.WriteLine($"{nameof(bench.NonTemporalStoreCopy)} -> {(bench.Src.SequenceEqual(bench.Dest) ? "OK" : "FAIL")}");

bench.GlobalSetup();
bench.HardwarePrefetch();
Console.WriteLine($"{nameof(bench.HardwarePrefetch)} -> {(bench.Src.SequenceEqual(bench.Dest) ? "OK" : "FAIL")}");

bench.GlobalSetup();
bench.SoftwarePrefetch();
Console.WriteLine($"{nameof(bench.SoftwarePrefetch)} -> {(bench.Src.SequenceEqual(bench.Dest) ? "OK" : "FAIL")}");

#if !DEBUG
BenchmarkRunner.Run<Bench>();
#endif

[DisassemblyDiagnoser]
[HardwareCounters(HardwareCounter.CacheMisses, HardwareCounter.LlcMisses, HardwareCounter.LlcReference)]
public unsafe class Bench
{
    private const int CacheLineSizeInBytes = 64;

    private const int L1CacheSizeInKiB = 32;
    private const int L1CacheSizeInBytes = L1CacheSizeInKiB << 10;

    private const int L2CacheSizeInKiB = 256;
    private const int L2CacheSizeInBytes = L2CacheSizeInKiB << 10;

    private const int L3CacheSizeInMiB = 6;
    private const int L3CacheSizeInKiB = L3CacheSizeInMiB << 10;
    private const int L3CacheSizeInBytes = L3CacheSizeInKiB << 10;

    private const int Size = 2 * L3CacheSizeInBytes + 1;      // no power of two

    private readonly byte[] _srcArray;
    private readonly byte[] _destArray;

    private readonly byte* _src;
    private readonly byte* _dest;

    private readonly Random _rnd = new Random(0);

    public ReadOnlySpan<byte> Src => new ReadOnlySpan<byte>(_src, Size);
    public ReadOnlySpan<byte> Dest => new ReadOnlySpan<byte>(_dest, Size);

    public Bench()
    {
        Console.WriteLine($"{nameof(CacheLineSizeInBytes)}: {CacheLineSizeInBytes,7}");
        Console.WriteLine($"{nameof(L1CacheSizeInKiB)}    : {L1CacheSizeInKiB,7}");
        Console.WriteLine($"{nameof(L2CacheSizeInKiB)}    : {L2CacheSizeInKiB,7}");
        Console.WriteLine($"{nameof(L3CacheSizeInKiB)}    : {L3CacheSizeInKiB,7}");
        Console.WriteLine();
        Console.WriteLine($"size: {Size}\n");

        int size = Size + CacheLineSizeInBytes;

        _srcArray = GC.AllocateUninitializedArray<byte>(size, pinned: true);
        _destArray = GC.AllocateUninitializedArray<byte>(size, pinned: true);

        _src = (byte*)Marshal.UnsafeAddrOfPinnedArrayElement(_srcArray, 0);
        _dest = (byte*)Marshal.UnsafeAddrOfPinnedArrayElement(_destArray, 0);

        while ((long)_src % CacheLineSizeInBytes != 0) _src++;
        while ((long)_dest % CacheLineSizeInBytes != 0) _dest++;

        Debug.Assert((long)_src % CacheLineSizeInBytes == 0);
        Debug.Assert((long)_dest % CacheLineSizeInBytes == 0);
    }

    [GlobalSetup]
    public void GlobalSetup() => _rnd.NextBytes(_srcArray);

    [Benchmark(Baseline = true)]
    public void NonTemporalStoreCopy()
    {
        byte* src = _src;
        byte* dest = _dest;
        nuint i = 0;
        nuint vecCount = (nuint)Vector256<byte>.Count;

        if (Size >= 4 * vecCount)
        {
            do
            {
                Vector256<byte> srcVec0 = Avx.LoadAlignedVector256(src + i + 0 * vecCount);
                Vector256<byte> srcVec1 = Avx.LoadAlignedVector256(src + i + 1 * vecCount);
                Vector256<byte> srcVec2 = Avx.LoadAlignedVector256(src + i + 2 * vecCount);
                Vector256<byte> srcVec3 = Avx.LoadAlignedVector256(src + i + 3 * vecCount);

                Avx.StoreAlignedNonTemporal(dest + i + 0 * vecCount, srcVec0);
                Avx.StoreAlignedNonTemporal(dest + i + 1 * vecCount, srcVec1);
                Avx.StoreAlignedNonTemporal(dest + i + 2 * vecCount, srcVec2);
                Avx.StoreAlignedNonTemporal(dest + i + 3 * vecCount, srcVec3);

                i += 4 * vecCount;
            } while (i <= Size - 4 * vecCount);

            Sse.StoreFence();
        }

        for (; i < Size; ++i)
        {
            dest[i] = src[i];
        }
    }

    [Benchmark]
    public void HardwarePrefetch()
    {
        const int blockSize = L1CacheSizeInBytes;
        byte* src = _src;
        byte* dest = _dest;
        nuint i = 0;
        nuint vecCount = (nuint)Vector256<byte>.Count;

        if (Size >= 4 * vecCount)
        {
            do
            {
                nuint k = i;
                nuint n = i + blockSize;

                // Prefetch blockSize into L1 by hardware prefetcher
                do
                {
                    Vector256<byte> tmp0 = Avx.LoadAlignedVector256(src + k);
                    Vector256<byte> tmp1 = Avx.LoadAlignedVector256(src + k + CacheLineSizeInBytes);

                    k += 2 * CacheLineSizeInBytes;
                } while (k <= n - 2 * CacheLineSizeInBytes);

                k = i;

                do
                {
                    Vector256<byte> srcVec0 = Avx.LoadAlignedVector256(src + k + 0 * vecCount);
                    Vector256<byte> srcVec1 = Avx.LoadAlignedVector256(src + k + 1 * vecCount);
                    Vector256<byte> srcVec2 = Avx.LoadAlignedVector256(src + k + 2 * vecCount);
                    Vector256<byte> srcVec3 = Avx.LoadAlignedVector256(src + k + 3 * vecCount);

                    Avx.StoreAlignedNonTemporal(dest + k + 0 * vecCount, srcVec0);
                    Avx.StoreAlignedNonTemporal(dest + k + 1 * vecCount, srcVec1);
                    Avx.StoreAlignedNonTemporal(dest + k + 2 * vecCount, srcVec2);
                    Avx.StoreAlignedNonTemporal(dest + k + 3 * vecCount, srcVec3);

                    k += 4 * vecCount;
                } while (k <= n - 4 * vecCount);

                i = n;
            } while (i <= Size - blockSize);

            Sse.StoreFence();
        }

        for (; i < Size; ++i)
        {
            dest[i] = src[i];
        }
    }

    [Benchmark]
    public void SoftwarePrefetch()
    {
        byte* src = _src;
        byte* dest = _dest;
        nuint i = 0;
        nuint vecCount = (nuint)Vector256<byte>.Count;

        if (Size >= 4 * vecCount)
        {
            do
            {
                // Load some cache lines in advance, so they're there when needed
                Sse.Prefetch0(src + i + 64 * CacheLineSizeInBytes);
                Sse.Prefetch0(src + i + 64 * CacheLineSizeInBytes + CacheLineSizeInBytes);

                Vector256<byte> srcVec0 = Avx.LoadAlignedVector256(src + i + 0 * vecCount);
                Vector256<byte> srcVec1 = Avx.LoadAlignedVector256(src + i + 1 * vecCount);
                Vector256<byte> srcVec2 = Avx.LoadAlignedVector256(src + i + 2 * vecCount);
                Vector256<byte> srcVec3 = Avx.LoadAlignedVector256(src + i + 3 * vecCount);

                Avx.StoreAlignedNonTemporal(dest + i + 0 * vecCount, srcVec0);
                Avx.StoreAlignedNonTemporal(dest + i + 1 * vecCount, srcVec1);
                Avx.StoreAlignedNonTemporal(dest + i + 2 * vecCount, srcVec2);
                Avx.StoreAlignedNonTemporal(dest + i + 3 * vecCount, srcVec3);

                i += 4 * vecCount;
            } while (i <= Size - 4 * vecCount);

            Sse.StoreFence();
        }

        for (; i < Size; ++i)
        {
            dest[i] = src[i];
        }
    }
}
machine info
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.7.20366.6
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
|               Method |     Mean |     Error |    StdDev | Ratio | RatioSD | Code Size | CacheMisses/Op | LLCReference/Op | LLCMisses/Op |
|--------------------- |---------:|----------:|----------:|------:|--------:|----------:|---------------:|----------------:|-------------:|
| NonTemporalStoreCopy | 1.441 ms | 0.0266 ms | 0.0493 ms |  1.00 |    0.00 |     120 B |         90,408 |         141,759 |       90,419 |
|     HardwarePrefetch | 1.418 ms | 0.0271 ms | 0.0266 ms |  0.97 |    0.04 |     185 B |         88,033 |         136,823 |       88,040 |
|     SoftwarePrefetch | 1.336 ms | 0.0266 ms | 0.0567 ms |  0.93 |    0.05 |     142 B |         82,827 |         136,547 |       82,837 |

(Note: these benchmarks are a bit flaky, due to other processes' use of the caches)

Different prefetch scheduling distances

Same code as before, but with different prefetch distances, measured in number of cache lines.

|    Method |     Mean |     Error |    StdDev | Ratio | RatioSD | CacheMisses/Op | LLCReference/Op | LLCMisses/Op |
|---------- |---------:|----------:|----------:|------:|--------:|---------------:|----------------:|-------------:|
|   Ahead_2 | 1.143 ms | 0.0223 ms | 0.0451 ms |  1.00 |    0.00 |         64,614 |         130,425 |       64,623 |
|   Ahead_4 | 1.113 ms | 0.0107 ms | 0.0100 ms |  0.93 |    0.03 |         65,370 |         130,396 |       65,378 |
|   Ahead_8 | 1.124 ms | 0.0073 ms | 0.0069 ms |  0.94 |    0.03 |         71,501 |         140,042 |       71,510 |
|  Ahead_16 | 1.108 ms | 0.0111 ms | 0.0104 ms |  0.93 |    0.03 |         67,516 |         133,487 |       67,524 |
|  Ahead_32 | 1.105 ms | 0.0116 ms | 0.0108 ms |  0.92 |    0.04 |         65,419 |         126,203 |       65,427 |
|  Ahead_64 | 1.094 ms | 0.0071 ms | 0.0067 ms |  0.91 |    0.03 |         63,432 |         122,940 |       63,440 |
| Ahead_128 | 1.188 ms | 0.0233 ms | 0.0444 ms |  1.04 |    0.07 |         71,371 |         132,153 |       71,380 |

(chart of the results above)

gfoidl commented Nov 10, 2020

Cache organization / critical stride / cache sets

Background in Agner Fog -- Optimizing C++ 9.10 (page 103).

The code steps through an array with different strides, demonstrating the eviction of cache lines from the cache sets.
The higher the cache level, the more costly the eviction is -- e.g. an L3 miss needs to read from memory, whereas an L1 miss only needs to read from L2.

code
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
#if !DEBUG
using BenchmarkDotNet.Running;
#endif

Demo.Run();

#if !DEBUG
BenchmarkRunner.Run<Bench>();
#endif

public static class CacheInfo
{
    public const int CacheLineSizeInBytes = 64;

    public const int L1CacheSizeInKiB = 32;
    public const int L1CacheSizeInBytes = L1CacheSizeInKiB << 10;
    public const int L1CacheWays = 8;
    public const int L1CriticalStride = L1CacheSizeInBytes / L1CacheWays;
    public const int L1NoOfLines = L1CacheSizeInBytes / CacheLineSizeInBytes;
    public const int L1NoOfSets = L1NoOfLines / L1CacheWays;

    public const int L2CacheSizeInKiB = 256;
    public const int L2CacheSizeInBytes = L2CacheSizeInKiB << 10;
    public const int L2CacheWays = 4;
    public const int L2CriticalStride = L2CacheSizeInBytes / L2CacheWays;
    public const int L2NoOfLines = L2CacheSizeInBytes / CacheLineSizeInBytes;
    public const int L2NoOfSets = L2NoOfLines / L2CacheWays;

    public const int L3CacheSizeInMiB = 6;
    public const int L3CacheSizeInKiB = L3CacheSizeInMiB << 10;
    public const int L3CacheSizeInBytes = L3CacheSizeInKiB << 10;
    public const int L3CacheWays = 12;
    public const int L3CriticalStride = L3CacheSizeInBytes / L3CacheWays;
    public const int L3NoOfLines = L3CacheSizeInBytes / CacheLineSizeInBytes;
    public const int L3NoOfSets = L3NoOfLines / L3CacheWays;
}

[HardwareCounters(HardwareCounter.CacheMisses, HardwareCounter.LlcMisses, HardwareCounter.LlcReference)]
public unsafe class Bench
{
    private const int Iterations = 1_000;

    private readonly byte[] _buffer;
    private readonly byte* _ptr;

    public Bench()
    {
        _buffer = GC.AllocateUninitializedArray<byte>(2 * CacheInfo.L3CacheSizeInBytes, pinned: true);
        _ptr = (byte*)Marshal.UnsafeAddrOfPinnedArrayElement(_buffer, 0);

        while ((long)_ptr % CacheInfo.CacheLineSizeInBytes != 0) _ptr++;
        Debug.Assert((long)_ptr % CacheInfo.CacheLineSizeInBytes == 0);

        Random rnd = new(0);
        rnd.NextBytes(_buffer);
    }

    [ParamsSource(nameof(GetArguments))]
    public BenchArguments BenchArgs { get; set; }

    [Benchmark(Baseline = true)]
    public int CriticalStride() => this.Run(this.BenchArgs.CriticalStride, this.BenchArgs.CacheWays, this.BenchArgs.NoOfSets);

    [Benchmark]
    public int CS_min1() => this.Run(this.BenchArgs.CriticalStride - 1, this.BenchArgs.CacheWays, this.BenchArgs.NoOfSets);

    [Benchmark]
    public int CS_plus64() => this.Run(this.BenchArgs.CriticalStride + 64, this.BenchArgs.CacheWays, this.BenchArgs.NoOfSets);

    public static IEnumerable<BenchArguments> GetArguments()
    {
        yield return new BenchArguments { Name = "L1", CriticalStride = CacheInfo.L1CriticalStride, CacheWays = CacheInfo.L1CacheWays, NoOfSets = CacheInfo.L1NoOfSets };
        yield return new BenchArguments { Name = "L2", CriticalStride = CacheInfo.L2CriticalStride, CacheWays = CacheInfo.L2CacheWays, NoOfSets = CacheInfo.L2NoOfSets };
        yield return new BenchArguments { Name = "L3", CriticalStride = CacheInfo.L3CriticalStride, CacheWays = CacheInfo.L3CacheWays, NoOfSets = CacheInfo.L3NoOfSets };
    }

    public class BenchArguments
    {
        public string Name { get; set; }
        public uint CriticalStride { get; set; }
        public int CacheWays { get; set; }
        public int NoOfSets { get; set; }

        public override string ToString() => this.Name;
    }

    internal int Run(uint stride, int cacheWays, int noOfSets)
    {
        int sum = 0;
        byte* ptr = _ptr;

#if !DEBUG
        for (int k = 0; k < Iterations; ++k)
#endif
        {
            // With the critical stride cacheWays steps can be done, to fill the cache set.
            // Here we do twice the steps, to evict the first cachelines from the set.
            // With the outer loop we will bring back the first cachelines.
            //
            // If the stride != critical stride, different cache sets will be used, so 
            // no / less eviction occurs.
            for (uint i = 0; i < 2 * cacheWays; ++i)
            {
                uint index = i * stride;
                sum += Read(ptr, index, cacheWays, noOfSets);
            }
        }

        return sum;
    }

    private readonly Dictionary<int, CacheSet> _cacheSetStats = new Dictionary<int, CacheSet>();

    public void Reset() => _cacheSetStats.Clear();

    private int Read(byte* ptr, uint index, int cacheWays, int noOfSets)
    {
        ptr += index;

#if DEBUG
        long address = (long)ptr;
        int set = (int)(address / CacheInfo.CacheLineSizeInBytes % noOfSets);

        if (_cacheSetStats.TryGetValue(set, out CacheSet cacheSet))
        {
            int currentHits = cacheSet.Hits;
            currentHits++;

            if (currentHits > cacheWays)
            {
                cacheSet.Evictions++;
            }

            cacheSet.Hits = currentHits;
            _cacheSetStats[set] = cacheSet;
        }
        else
        {
            _cacheSetStats.Add(set, CacheSet.Initial);
        }
#endif

        return *ptr;
    }

    [Conditional("DEBUG")]
    public void PrintStats(string stride)
    {
        Console.WriteLine();
        Console.Write($"cache set stats, # of entries: {_cacheSetStats.Count}");
        Console.WriteLine($"\tstride: {stride}");

        foreach (var kvp in _cacheSetStats.OrderBy(k => k.Key))
        {
            Console.WriteLine($"\tset: {kvp.Key,2}\t\tevictions: {kvp.Value.Evictions}");
        }
    }

    private struct CacheSet
    {
        public int Hits { get; set; }
        public int Evictions { get; set; }

        public static readonly CacheSet Initial = new CacheSet { Hits = 1 };
    }
}

public static class Demo
{
    public static void Run()
    {
        Bench bench = new();

        bench.Reset();
        bench.Run(CacheInfo.L2CriticalStride - 1, CacheInfo.L2CacheWays, CacheInfo.L2NoOfSets);
        bench.PrintStats("L2CriticalStride - 1");

        bench.Reset();
        bench.Run(CacheInfo.L2CriticalStride, CacheInfo.L2CacheWays, CacheInfo.L2NoOfSets);
        bench.PrintStats("L2CriticalStride");

        bench.Reset();
        bench.Run(CacheInfo.L2CriticalStride + 1, CacheInfo.L2CacheWays, CacheInfo.L2NoOfSets);
        bench.PrintStats("L2CriticalStride + 1");

        bench.Reset();
        bench.Run(CacheInfo.L2CriticalStride + 9, CacheInfo.L2CacheWays, CacheInfo.L2NoOfSets);
        bench.PrintStats("L2CriticalStride + 9");

        bench.Reset();
        bench.Run(CacheInfo.L2CriticalStride + 10, CacheInfo.L2CacheWays, CacheInfo.L2NoOfSets);
        bench.PrintStats("L2CriticalStride + 10");

        bench.Reset();
        bench.Run(CacheInfo.L2CriticalStride + 63, CacheInfo.L2CacheWays, CacheInfo.L2NoOfSets);
        bench.PrintStats("L2CriticalStride + 63");

        bench.Reset();
        bench.Run(CacheInfo.L2CriticalStride + 64, CacheInfo.L2CacheWays, CacheInfo.L2NoOfSets);
        bench.PrintStats("L2CriticalStride + 64");
    }
}
machine info
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.7.20366.6
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
|         Method | BenchArgs |      Mean |     Error |    StdDev | Ratio | RatioSD | CacheMisses/Op | LLCReference/Op | LLCMisses/Op |
|--------------- |---------- |----------:|----------:|----------:|------:|--------:|---------------:|----------------:|-------------:|
| CriticalStride |        L1 | 11.301 μs | 0.2205 μs | 0.3091 μs |  1.00 |    0.00 |              4 |               9 |            4 |
|        CS_min1 |        L1 | 10.955 μs | 0.1081 μs | 0.0959 μs |  0.95 |    0.03 |              5 |              16 |            5 |
|      CS_plus64 |        L1 | 10.999 μs | 0.0699 μs | 0.0583 μs |  0.95 |    0.02 |              2 |               5 |            2 |
|                |           |           |           |           |       |         |                |                 |              |
| CriticalStride |        L2 |  5.992 μs | 0.1128 μs | 0.1055 μs |  1.00 |    0.00 |              1 |               9 |            1 |
|        CS_min1 |        L2 |  5.521 μs | 0.0349 μs | 0.0327 μs |  0.92 |    0.02 |              1 |              10 |            1 |
|      CS_plus64 |        L2 |  5.530 μs | 0.0427 μs | 0.0399 μs |  0.92 |    0.02 |              1 |               3 |            1 |
|                |           |           |           |           |       |         |                |                 |              |
| CriticalStride |        L3 | 31.845 μs | 0.6182 μs | 0.5783 μs |  1.00 |    0.00 |             13 |          12,376 |           13 |
|        CS_min1 |        L3 | 30.744 μs | 0.5754 μs | 0.5101 μs |  0.96 |    0.02 |             13 |           8,080 |           13 |
|      CS_plus64 |        L3 | 16.179 μs | 0.1386 μs | 0.1229 μs |  0.51 |    0.01 |              3 |              10 |            3 |

Output of demo program / stats of cache sets

L1

cache set stats, # of entries: 2    stride: L1CriticalStride - 1
    set:  0     evictions: 7
    set:  1     evictions: 0

cache set stats, # of entries: 1    stride: L1CriticalStride
    set:  1     evictions: 8

cache set stats, # of entries: 1    stride: L1CriticalStride + 1
    set:  1     evictions: 8

cache set stats, # of entries: 2    stride: L1CriticalStride + 5
    set:  1     evictions: 5
    set:  2     evictions: 0

cache set stats, # of entries: 15   stride: L1CriticalStride + 63
    set:  1     evictions: 0
    set:  2     evictions: 0
    set:  3     evictions: 0
    set:  4     evictions: 0
    set:  5     evictions: 0
    set:  6     evictions: 0
    set:  7     evictions: 0
    set:  8     evictions: 0
    set:  9     evictions: 0
    set: 10     evictions: 0
    set: 11     evictions: 0
    set: 12     evictions: 0
    set: 13     evictions: 0
    set: 14     evictions: 0
    set: 15     evictions: 0

cache set stats, # of entries: 16   stride: L1CriticalStride + 64
    set:  1     evictions: 0
    set:  2     evictions: 0
    set:  3     evictions: 0
    set:  4     evictions: 0
    set:  5     evictions: 0
    set:  6     evictions: 0
    set:  7     evictions: 0
    set:  8     evictions: 0
    set:  9     evictions: 0
    set: 10     evictions: 0
    set: 11     evictions: 0
    set: 12     evictions: 0
    set: 13     evictions: 0
    set: 14     evictions: 0
    set: 15     evictions: 0
    set: 16     evictions: 0
L2

cache set stats, # of entries: 2    stride: L2CriticalStride - 1
    set: 64     evictions: 3
    set: 65     evictions: 0

cache set stats, # of entries: 1    stride: L2CriticalStride
    set: 65     evictions: 4

cache set stats, # of entries: 1    stride: L2CriticalStride + 1
    set: 65     evictions: 4

cache set stats, # of entries: 1    stride: L2CriticalStride + 9
    set: 65     evictions: 4

cache set stats, # of entries: 2    stride: L2CriticalStride + 10
    set: 65     evictions: 3
    set: 66     evictions: 0

cache set stats, # of entries: 7    stride: L2CriticalStride + 63
    set: 65     evictions: 0
    set: 66     evictions: 0
    set: 67     evictions: 0
    set: 68     evictions: 0
    set: 69     evictions: 0
    set: 70     evictions: 0
    set: 71     evictions: 0

cache set stats, # of entries: 8    stride: L2CriticalStride + 64
    set: 65     evictions: 0
    set: 66     evictions: 0
    set: 67     evictions: 0
    set: 68     evictions: 0
    set: 69     evictions: 0
    set: 70     evictions: 0
    set: 71     evictions: 0
    set: 72     evictions: 0
L3

cache set stats, # of entries: 2    stride: L3CriticalStride - 1
    set: 64     evictions: 11
    set: 65     evictions: 0

cache set stats, # of entries: 1    stride: L3CriticalStride
    set: 65     evictions: 12

cache set stats, # of entries: 1    stride: L3CriticalStride + 1
    set: 65     evictions: 12

cache set stats, # of entries: 2    stride: L3CriticalStride + 3
    set: 65     evictions: 10
    set: 66     evictions: 0

cache set stats, # of entries: 23   stride: L3CriticalStride + 63
    set: 65     evictions: 0
    set: 66     evictions: 0
    set: 67     evictions: 0
    set: 68     evictions: 0
    set: 69     evictions: 0
    set: 70     evictions: 0
    set: 71     evictions: 0
    set: 72     evictions: 0
    set: 73     evictions: 0
    set: 74     evictions: 0
    set: 75     evictions: 0
    set: 76     evictions: 0
    set: 77     evictions: 0
    set: 78     evictions: 0
    set: 79     evictions: 0
    set: 80     evictions: 0
    set: 81     evictions: 0
    set: 82     evictions: 0
    set: 83     evictions: 0
    set: 84     evictions: 0
    set: 85     evictions: 0
    set: 86     evictions: 0
    set: 87     evictions: 0

cache set stats, # of entries: 24   stride: L3CriticalStride + 64
    set: 65     evictions: 0
    set: 66     evictions: 0
    set: 67     evictions: 0
    set: 68     evictions: 0
    set: 69     evictions: 0
    set: 70     evictions: 0
    set: 71     evictions: 0
    set: 72     evictions: 0
    set: 73     evictions: 0
    set: 74     evictions: 0
    set: 75     evictions: 0
    set: 76     evictions: 0
    set: 77     evictions: 0
    set: 78     evictions: 0
    set: 79     evictions: 0
    set: 80     evictions: 0
    set: 81     evictions: 0
    set: 82     evictions: 0
    set: 83     evictions: 0
    set: 84     evictions: 0
    set: 85     evictions: 0
    set: 86     evictions: 0
    set: 87     evictions: 0
    set: 88     evictions: 0

gfoidl commented Nov 13, 2020

Loop over array: forward, downward, and perfect

Downward loops (counting down to >= 0) are better, as ideally the flags set by add/sub or inc/dec are re-used for the conditional jump.
The data cache, however, is optimized for accessing data forwards, not backwards, therefore it's best to count up from a negative value -n to zero.
We put the pointer to the end of the array and use a negative offset from the end, so the memory access pattern is still forward.

code
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
#if !DEBUG
using BenchmarkDotNet.Running;
#endif

Bench bench = new();
Console.WriteLine(bench.ForwardLoop());
Console.WriteLine(bench.DownwardLoop());
Console.WriteLine(bench.DownwardLoop_CacheFriendly());

#if !DEBUG
BenchmarkRunner.Run<Bench>();
#endif

[DisassemblyDiagnoser]
//[HardwareCounters(HardwareCounter.CacheMisses, HardwareCounter.LlcMisses, HardwareCounter.LlcReference)]
public unsafe class Bench
{
    private const int N = 10_000_000;
    private readonly int[] _array = GC.AllocateArray<int>(N, pinned: true);
    private readonly int* _ptr;

    public Bench()
    {
        for (int i = 0; i < N; ++i) _array[i] = -i;

        _ptr = (int*)Marshal.UnsafeAddrOfPinnedArrayElement(_array, 0);
    }

    [Benchmark(Baseline = true)]
    public int ForwardLoop()
    {
        int sum = 0;
        int* ptr = _ptr;

        for (nuint i = 0; i < N; ++i)
        {
            sum += ptr[i];
        }

        return sum;
    }

    [Benchmark]
    public int DownwardLoop()
    {
        int sum = 0;
        int* ptr = _ptr;

        // Ideally no cmp or test is emitted, as add/sub set the zero-flag
        // which can be used for the jump-instruction.
        for (nint i = N - 1; i >= 0; --i)
        {
            sum += ptr[i];
        }

        return sum;
    }

    [Benchmark]
    public int DownwardLoop_CacheFriendly()
    {
        int sum = 0;
        int* ptr = _ptr;

#if DEBUG
        int* start = _ptr;
        int* end = _ptr + N;
#endif

        // Cache is optimized for accessing data forwards, so we start at 
        // negative and count up to 0.
        ptr += N;   // end of array
        for (nint i = -N; i < 0; ++i)
        {
#if DEBUG
            int* p = ptr + i;
            Debug.Assert(start <= p && p < end);
#endif

            sum += ptr[i];
        }

        return sum;
    }
}
asm
; Bench.ForwardLoop()
       xor       eax,eax
       mov       rdx,[rcx+10]
       xor       ecx,ecx
M00_L00:
       add       eax,[rdx+rcx*4]
       inc       rcx
       cmp       rcx,989680
       jb        short M00_L00
       ret
; Total bytes of code 24

; Bench.DownwardLoop()
       xor       eax,eax
       mov       rdx,[rcx+10]
       mov       ecx,98967F
M00_L00:
       add       eax,[rdx+rcx*4]
       dec       rcx
       test      rcx,rcx
       jge       short M00_L00
       ret
; Total bytes of code 23

; Bench.DownwardLoop_CacheFriendly()
       xor       eax,eax
       mov       rdx,[rcx+10]
       add       rdx,2625A00
       mov       rcx,0FFFFFF676980
M00_L00:
       add       eax,[rdx+rcx*4]
       inc       rcx
       test      rcx,rcx
       jl        short M00_L00
       ret
; Total bytes of code 35
machine info
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.630 (2004/?/20H1)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
|                     Method |     Mean |     Error |    StdDev | Ratio | RatioSD | Code Size |
|--------------------------- |---------:|----------:|----------:|------:|--------:|----------:|
|                ForwardLoop | 6.692 ms | 0.0403 ms | 0.0314 ms |  1.00 |    0.00 |      24 B |
|               DownwardLoop | 7.302 ms | 0.1363 ms | 0.1674 ms |  1.09 |    0.02 |      23 B |
| DownwardLoop_CacheFriendly | 4.512 ms | 0.0849 ms | 0.0872 ms |  0.68 |    0.01 |      35 B |
M00_L00:
       add       eax,[rdx+rcx*4]
       inc       rcx
-      cmp       rcx,989680
-      jb        short M00_L00
+      test      rcx,rcx
+      jl        short M00_L00

(less size and better for macro fusion).

But: the more work is done inside the loop, the smaller the effect will be, and this optimization isn't very intuitive.
So this optimization is only worthwhile if the loop overhead is the limiting bottleneck.

gfoidl commented Nov 13, 2020

(Loop) Induction variables

If the floating-point value of the loop counter is needed for some other purpose, then it is better to have both an integer counter and a floating-point counter.
It's much faster to add to a floating-point counter than to convert the integer counter to a floating-point value in every iteration.

code
using System;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
#if !DEBUG
using BenchmarkDotNet.Running;
#endif

// per default loops are JITed as Tier-1, so there will be the check for static initialization
// this env-var tells the JIT to use TC for loops too.
Environment.SetEnvironmentVariable("COMPlus_TC_QuickJitForLoops", "1");

Bench bench = new();
bench.A();
double[] res0 = bench.SinTable.ToArray();
bench.B();
double[] res1 = bench.SinTable.ToArray();

for (int i = 0; i < res0.Length; ++i)
{
    double r0 = res0[i];
    double r1 = res1[i];

    if (Math.Abs(r0 - r1) > 1e-10)
    {
        Console.WriteLine($"error at {i}");
    }
}

#if !DEBUG
BenchmarkRunner.Run<Bench>();
#endif

[DisassemblyDiagnoser]
//[HardwareCounters(HardwareCounter.CacheMisses, HardwareCounter.LlcMisses, HardwareCounter.LlcReference)]
public unsafe class Bench
{
    private const int N = 1_000;
    private readonly double[] _array = GC.AllocateArray<double>(N, pinned: true);
    private readonly double* _ptr;

    public Bench()
    {
        _ptr = (double*)Marshal.UnsafeAddrOfPinnedArrayElement(_array, 0);
    }

    public Span<double> SinTable => _array;

    [Benchmark(Baseline = true)]
    public void A()
    {
        double* ptr = _ptr;

        for (nuint i = 0; i < N; ++i)
        {
            ptr[i] = Math.Sin(0.01 * i);
        }
    }

    [Benchmark]
    public void A1()
    {
        double* ptr = _ptr;
        double scale = 0.01;

        for (nuint i = 0; i < N; ++i)
        {
            ptr[i] = Math.Sin(scale * i);
        }
    }

    [Benchmark]
    public void B()
    {
        double* ptr = _ptr;
        double x = 0d;

        for (nuint i = 0; i < N; ++i, x += 0.01)
        {
            ptr[i] = Math.Sin(x);
        }
    }

    [Benchmark]
    public void B1()
    {
        double* ptr = _ptr;
        double x = 0d;
        double inc = 0.01;

        for (nuint i = 0; i < N; ++i, x += inc)
        {
            ptr[i] = Math.Sin(x);
        }
    }

    private static readonly long s_inc = BitConverter.DoubleToInt64Bits(0.01);
    private static double Inc => BitConverter.Int64BitsToDouble(s_inc);

    [Benchmark]
    public void B2()
    {
        double* ptr = _ptr;
        double x = 0d;

        for (nuint i = 0; i < N; ++i, x += Inc)
        {
            ptr[i] = Math.Sin(x);
        }
    }
}
asm
; Bench.A()
       push      rdi
       push      rsi
       sub       rsp,28
       vzeroupper
       mov       rsi,[rcx+10]
       xor       edi,edi
M00_L00:
       vxorps    xmm0,xmm0,xmm0
       vcvtsi2sd xmm0,xmm0,rdi
       vmulsd    xmm0,xmm0,qword ptr [7FFEBB6D47A8]
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       cmp       rdi,3E8
       jb        short M00_L00
       add       rsp,28
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 61

; Bench.A1()
       push      rdi
       push      rsi
       sub       rsp,28
       vzeroupper
       mov       rsi,[rcx+10]
       xor       edi,edi
M00_L00:
       vxorps    xmm0,xmm0,xmm0
       vcvtsi2sd xmm0,xmm0,rdi
       vmulsd    xmm0,xmm0,qword ptr [7FFEBB6E48E8]
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       cmp       rdi,3E8
       jb        short M00_L00
       add       rsp,28
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 61

; Bench.B()
       push      rdi
       push      rsi
       sub       rsp,38
       vzeroupper
       vmovaps   [rsp+20],xmm6
       mov       rsi,[rcx+10]
       vxorps    xmm6,xmm6,xmm6
       xor       edi,edi
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vaddsd    xmm6,xmm6,qword ptr [7FFEBB6C48F8]
       cmp       rdi,3E8
       jb        short M00_L00
       vmovaps   xmm6,[rsp+20]
       add       rsp,38
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 72

; Bench.B1()
       push      rdi
       push      rsi
       sub       rsp,38
       vzeroupper
       vmovaps   [rsp+20],xmm6
       mov       rsi,[rcx+10]
       vxorps    xmm6,xmm6,xmm6
       xor       edi,edi
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vaddsd    xmm6,xmm6,qword ptr [7FFEBB6C48F8]
       cmp       rdi,3E8
       jb        short M00_L00
       vmovaps   xmm6,[rsp+20]
       add       rsp,38
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 72

; Bench.B2()
       push      rdi
       push      rsi
       sub       rsp,48
       vzeroupper
       vmovaps   [rsp+30],xmm6
       vmovaps   [rsp+20],xmm7
       mov       rsi,[rcx+10]
       vxorps    xmm6,xmm6,xmm6
       xor       edi,edi
       mov       rax,7AE147AE147B
       vmovq     xmm7,rax
M00_L00:
       vmovaps   xmm0,xmm6
       call      System.Math.Sin(Double)
       vmovsd    qword ptr [rsi+rdi*8],xmm0
       inc       rdi
       vmovaps   xmm0,xmm7
       vaddsd    xmm6,xmm6,xmm0
       cmp       rdi,3E8
       jb        short M00_L00
       vmovaps   xmm6,[rsp+30]
       vmovaps   xmm7,[rsp+20]
       add       rsp,48
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 99
machine info
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.630 (2004/?/20H1)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.51904, CoreFX 5.0.20.51904), X64 RyuJIT
| Method |     Mean |     Error |    StdDev | Ratio | Code Size |
|------- |---------:|----------:|----------:|------:|----------:|
|      A | 9.173 μs | 0.0917 μs | 0.0858 μs |  1.00 |      61 B |
|     A1 | 9.223 μs | 0.1036 μs | 0.0969 μs |  1.01 |      61 B |
|      B | 8.928 μs | 0.0758 μs | 0.0633 μs |  0.97 |      72 B |
|     B1 | 8.887 μs | 0.0723 μs | 0.0641 μs |  0.97 |      72 B |
|     B2 | 8.838 μs | 0.0746 μs | 0.0661 μs |  0.96 |      99 B |

Note: The calculation of a function of the loop counter by the induction variable method makes a loop-carried dependency chain. If this chain is too long then it may be advantageous to calculate each value from a value that is two or more iterations back.

Note 2: Once the dotnet/runtime issue "Double constants usage in a loop can be CSEed" is resolved, codegen for some methods will be better, and the trick used for B2 (storing the double constant as a long) isn't necessary.
