Skip to content

Instantly share code, notes, and snippets.

@BastianBlokland
Last active September 10, 2021 12:31
Show Gist options
  • Save BastianBlokland/a00bf43aa174fa5f99f5d4ee69747026 to your computer and use it in GitHub Desktop.
Save BastianBlokland/a00bf43aa174fa5f99f5d4ee69747026 to your computer and use it in GitHub Desktop.
Playing around with the new hardware intrinsics of .net core 3.0

.Net Core 3.0 (currently in preview) adds support for hardware intrinsics.

So now we can write simd instructions in c# 🎉

Intel has a nice website for browsing through the available instructions: Intel intrinsics guide

Everything from Avx2 down seems supported. (Given that your cpu supports it ofcourse)

For double checking what c# call maps to what instruction you can check the summary of the methods: Avx2.cs

As as test i created a utility that searches a collection of ints 8 entries at a time. Its a bit silly implementation because it only works if the target exists in the collection only once, but it shows you can get pretty big speed ups.

public static class GetIndex
{
    /// <summary> Implemented with a simple loop </summary>
    public static int GetIndexSimple(ReadOnlySpan<int> span, int item)
    {
        for (int i = 0; i < span.Length; i++)
        {
            if (span[i] == item)
                return i;
        }

        return -1;
    }

    /// <summary> Implemented using build-in library 'IndexOf'</summary>
    public static int GetIndexLibrary(ReadOnlySpan<int> span, int item)
    {
        return span.IndexOf(item);
    }

    /// <summary> Implemented using 'Avx2' intrinsics</summary>
    /// <remarks> Without 'Avx2' support will behave as a simple loop</remarks>
    public static unsafe int GetIndexIntrinsics(ReadOnlySpan<int> span, int item)
    {
        // Get a fixed pointer so the garbage-collector doesn't move the collection
        fixed (int* startPointer = span)
        {
            int* endPointer = startPointer + span.Length;
            int* pointer = startPointer;

            // Query if the cpu actually supports 'Avx2' instructions 
            if (Avx2.IsSupported)
            {
                // Load '1' item into a 128 bit vector
                Vector128<int> itemScaler = Avx2.LoadScalarVector128(&item);

                // Copy that first item into the other 7 slots of a 256 bit vector, this means
                // we now have a vector that is holding 8 times the value 'item'
                Vector256<int> itemVector = Avx2.BroadcastScalarToVector256(itemScaler);

                // Loop through the span 8 elements at a time (256 bit / 32 bit = 8)
                for (; (pointer + 8) < endPointer; pointer += 8)
                {
                    // Load 8 elements from the span
                    Vector256<int> elements = Avx2.LoadVector256(pointer);

                    // Compare those 8 elements with our item. This will give us 8 values of
                    // 'FFFF' or '0000' (32 bits of either 1 or 0) in a 256 bit vector
                    Vector256<int> elementEquals = Avx2.CompareEqual(elements, itemVector);

                    /*
                    Because 256 bit is a too big type to work with we combine it into a single
                    integer by taking 4 bits from each 32 bit value (bit 7, 15, 23 and 31).

                    eq 32:       0        0        1        0        0        0        0        0        0
                    MoveMask: 0 0 0 0  0 0 0 0  1 1 1 1  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
                    Hex:         0        0        F        0        0        0        0        0        0
                    */

                    int mask = Avx2.MoveMask(elementEquals.AsByte());

                    // If we make the assumption that the item only exists in the span once then
                    // we can construct a jump table for it.
                    switch (mask)
                    {
                        case (int)0x0000000F: // At element 0
                            return (int)(pointer - startPointer);

                        case (int)0x000000F0: // At element 1
                            return (int)(pointer + 1 - startPointer);

                        case (int)0x00000F00: // At element 2
                            return (int)(pointer + 2 - startPointer);

                        case (int)0x0000F000: // At element 3
                            return (int)(pointer + 3 - startPointer);

                        case (int)0x000F0000: // At element 4
                            return (int)(pointer + 4 - startPointer);

                        case (int)0x00F00000: // At element 5
                            return (int)(pointer + 5 - startPointer);

                        case (int)0x0F000000: // At element 6
                            return (int)(pointer + 6 - startPointer);

                        case unchecked((int)0xF0000000): // At element 7
                            return (int)(pointer + 7 - startPointer);

                        case (int)0x00000000: // Not found
                            continue;

                        default:
                            throw new Exception("Item found in span multiple times");
                    }
                }
            }

            // Handle the remaiming items with a simple loop
            for (; pointer < endPointer; pointer++)
            {
                if (*pointer == item)
                    return (int)(pointer - startPointer);
            }
        }

        return -1;
    }
}

Test proving that this actually works: (Using XUnit)

[Fact]
public static void GetIndex_BehavesConsistentlyAtEveryIndex()
{
    const int TestKey = 1337;

    int[] testElements = new int[100];
    for (int i = 0; i < testElements.Length; i++)
    {
        // Set test key at this element
        testElements[i] = TestKey;

        // Verify that it can be found
        Assert.Equal(i, GetIndex.GetIndexSimple(testElements, TestKey));
        Assert.Equal(i, GetIndex.GetIndexLibrary(testElements, TestKey));
        Assert.Equal(i, GetIndex.GetIndexIntrinsics(testElements, TestKey));

        // Unset the test key
        testElements[i] = 0;
    }
}

Benchmark to test the performance: (Using BenchmarkDotNet)

public class GetIndexBenchmark
{
    private const int ElementCount = 100_000;
    private readonly int[] testElements = new int[ElementCount];

    public GetIndexBenchmark()
    {
        // Set the element we want to find
        testElements[50_000] = 1337;
    }

    [Benchmark(Baseline = true)]
    public int GetIndexSimple() => GetIndex.GetIndexSimple(testElements, 1337);

    [Benchmark]
    public int GetIndexLibrary() => GetIndex.GetIndexLibrary(testElements, 1337);

    [Benchmark]
    public int GetIndexIntrinsics() => GetIndex.GetIndexIntrinsics(testElements, 1337);
}

Benchmark results:

BenchmarkDotNet=v0.11.3, OS=macOS Mojave 10.14 (18A391) [Darwin 18.0.0]
Intel Core i9-8950HK CPU 2.90GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview-009812
  [Host]     : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0-preview-27122-01 (CoreCLR 4.6.27121.03, CoreFX 4.7.18.57103), 64bit RyuJIT
Method Mean Error StdDev Ratio
GetIndexSimple 23.118 us 0.3721 us 0.3299 us 1.00
GetIndexLibrary 9.985 us 0.1479 us 0.1384 us 0.43
GetIndexIntrinsics 5.419 us 0.0847 us 0.0792 us 0.23
// * Legends *
  Mean   : Arithmetic mean of all measurements
  Error  : Half of 99.9% confidence interval
  StdDev : Standard deviation of all measurements
  Ratio  : Mean of the ratio distribution ([Current]/[Baseline])
  1 us   : 1 Microsecond (0.000001 sec)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment