ladeak/inline-array.md

## inline-array.md

      
    Raw
  

              inline-array.md
            
          
    Exploring Inline Arrays

Inline arrays have been introduced as new feature in C# 12. As the feature is preview at the time of writing, it may change before the .NET team releases the final version of C#.
A type attributed with inline array can be used to represent a fixed-size, contiguous sequence of primitives. The runtime and language provide type and overrun safe indexing operations to the underlying primitives.
At the time of writing one can define an inline array by applying the [InlineArray] attribute on a struct, that has a single field.
[System.Runtime.CompilerServices.InlineArray(8)]
struct Octet
{
    private int _element0;
}
The type above represents an inline array of 8 integers.
Where to use it?

Inline arrays are represented by a struct, which brings copy-by-value semantics. This can be beneficial and can be a disadvantage as well. It can be beneficial when we need a temporary, known-sized array allocation free. For example, using the regular new operator (and not stackalloc), the developer can get a continuous array on the stack. In the following example the variable two is not heap allocated and provides array like semantics:
var two = new Two();
two[0] = 10;
two[1] = 20;
foreach (var x in two)
    Console.WriteLine(x);

[System.Runtime.CompilerServices.InlineArray(2)]
struct Two
{
    private int _element0;
}
Elements of two may be addressed by an indexer as-well-as iterated with a foreach loop. However, an inline array does not implement IEnumerable<T> and does not have a GetEnumerator() method. Hence, Two may not be passed as a method argument for a parameter typed IEnumerable<int>, as the compiler issues an error:

Error	CS1503	Argument 1: cannot convert from 'Two' to 'System.Collections.Generic.IEnumerable'

So, how is it able to iterate it with the foreach loop? As the compiler knows the exact size of the array, it may re-write the foreach loop as a regular for loop. This is clearer when the IL code is decompiled to C# with ILSpy:
for (int i = 0; i < 2; i++)
While an inline array cannot be passed to IEnumerable<T> parameters, it can be passed as an argument to Span<T> and ReadOnlySpan<T> parameters:
ReadOnlySpan<int> twoSpan = two;
Iterate(twoSpan);
void Iterate(ReadOnlySpan<int> source)
{
    foreach (var item in source)
        Console.WriteLine(item);
}
In the example above, the variable two can be directly passed to Iterate() method. Similar way, one can use the range operator to select a range of the underlying memory as a span.
Allocated on Stack

Two when allocated with the new operator is allocated on the stack. This has the advantage of decreasing the load on the Garbage Collector. However, the stack size is limited, the default stack size on Windows is 1.5 MB for .NET processes. This means using inline arrays carelessly with recursive methods may exhaust the stack rather soon.
Allocated on Heap

Developers may also use inline array types as a member of a reference type.
class Coordinate(Two dimensions)
{
    public int X { get => dimensions[0]; }
    public int Y { get => dimensions[1]; }
}
In the example above Two is a field of type Coordinate. The X and Y properties allow access to the underlying data using the indexers of the inline array. The fixed size array is part of the Coordinate objects memory layout. This means that a separate array allocation is not required. If a regular array was used in place of the inline array, two objects would needed to be allocated: one for the Coordinate object and one for the array referenced by the Coordinate object.
With inline arrays, data is packed more compactly. In the following example CoordinateInline uses the inline array of type Two, CoordinateNonInline references a regular int[] array.
class CoordinateInline(Two dimensions)
{
    public int X { get => dimensions[0]; }
    public int Y { get => dimensions[1]; }
}

class CoordinateNonInline(int[] dimensions)
{
    public int X { get => dimensions[0]; }
    public int Y { get => dimensions[1]; }
}
Using CoordinateNonInline the int[] array allocation requires an additional object with object header (8 bytes), MT (8 bytes), a local field for the size, which results in an additional 24 bytes in memory allocations. CoordinateNonInline also contains a reference to this object which is 8 bytes in size.
In the case of a million coordinate objects, CoordinateInline allocates 32 MB less on the memory heap.
Boxing

structs types are prone to boxing. For example, taking the variable two from the above examples and assigning it to an object results in a boxing instruction.
object a = two;
The corresponding boxing instruction in MSIL:
IL_001d: ldloc.0
IL_001e: box Two
IL_0023: stloc.1

Expanding the Usage

Generic Types

Inline array types can be generic:
[System.Runtime.CompilerServices.InlineArray(8)]
struct Octet<T>
{
    private T _element0;
}
The Octet type could be interesting choice for a node of an Octree.
Non-primitive Elements

Elements of inline arrays could be other, complex types. However, if these types are reference types, the inline array will only contain a reference to these objects.
In the case below, the Node<T> type could be the pillar of an octree structure. Its children are represented as an Octet<Node<T>. In this case the inline array holds 8 references to the children Node<T> nodes (or to null if they are leaf).
var children = new Octet<Node<int>>();
children[0] = new Node<int>(new Point3D<int>(1, 1, 1));
children[1] = new Node<int>(new Point3D<int>(1, 1, 8));
var root = new Node<int>(new Point3D<int>(5, 5, 5), children);

[System.Runtime.CompilerServices.InlineArray(8)]
struct Octet<T>
{
    private T _element0;
}

record struct Point3D<T>(T X, T Y, T Z);

class Node<T>
{
    public Node(Point3D<T> value)
    {
        Value = value;
    }

    public Node(Point3D<T> value, Octet<Node<T>> children)
    {
        Value = value;
        Children = children;
    }

    public Point3D<T> Value { get; }

    public Octet<Node<T>> Children { get; }
}
Notice, that Node<T> has a member that is value type of a generic struct Point3D<T>. The other member typed Octet<T> is an inline array, this array only holds references to other heap allocated Node<T> objects.
Inline Arrays implementing Interfaces

Inline arrays may implement interfaces:
interface ICountable { }

[System.Runtime.CompilerServices.InlineArray(2)]
struct Two : ICountable
{
    private int _element0;
}

[System.Runtime.CompilerServices.InlineArray(3)]
struct Three : ICountable
{
    private int _element0;
}
In the above case Two and Three implement the ICountable interface. While this enables polymorphism, it involves boxing. The same way when object a = two; resulted boxing, referring to inline arrays through interfaces results in boxing them, regardless of the array is used as a local variable or as an object member. As inline arrays become heap allocated it diminishes the benefits of using them in the first place.
Inline arrays with inline array fields

The previous sections have shown that we can use primitives or complex types for the type of the elements of an inline array. An inline array can have elements that are typed as inline arrays:
var matrix = new Matrix3x3();
matrix[0][1] = 30;

[System.Runtime.CompilerServices.InlineArray(3)]
struct Row3
{
    private int _element0;
}

[System.Runtime.CompilerServices.InlineArray(3)]
struct Matrix3x3
{
    private Row3 _element0;
}
The above example gives an example of creating a 3x3 matrix by using 3 elements of Row3 type, which Row3 is also an inline array. A developer can reference a single element of the matrix by using the regular indexers, such as matrix[0][1].
Exploring Use-case

In this section 2 examples are show how inline arrays may improve performance or cause performance degradation when not used with care.
Performance Penalty

Large inline arrays require careful handling. Using it irresponsibly might cause performance degradation. The general guidance is to measure the code and prove that the implementation yields in real world performance improvement. In this section an inline array with 256 integers is used:
[System.Runtime.CompilerServices.InlineArray(256)]
public struct Large
{
    private int _element0;
}
The array is a field on a type that is the subject of the benchmark. The inline array is passed to a private method in three different ways:

as the inline array type
as a ref parameter
as a Span<T>

Find(_data);
FindRef(ref _data);
FindSpan(_data)
When executing the benchmarks, the results show a surprising performance difference:
|          Method |      Mean |     Error |    StdDev |    Median |
|---------------- |----------:|----------:|----------:|----------:|
|     InlineArray | 6.7510 ns | 0.0994 ns | 0.0929 ns | 6.7383 ns |
|  InlineArrayRef | 0.1486 ns | 0.0032 ns | 0.0028 ns | 0.1481 ns |
| InlineArraySpan | 0.0036 ns | 0.0062 ns | 0.0058 ns | 0.0000 ns |

While the difference is measurable in nanoseconds, that is usually insignificant compared to the actual operations. However, passing the inline array by value is a magnitude slower compared to passing the is by ref or by Span<T>. When the inline array is passed as a regular parameter it is copied by value. The corresponding assembly code performs the copy operation. 400 is the hexadecimal value of 1024 bytes or 256 * 4 bytes (as integers are 4 bytes in size).
lea      rcx, bword ptr [rbp-400H]
mov      r8d, 0x400
call     CORINFO_HELP_MEMCPY

The larger the inline array is the more data needs to be copied, which results in the additional performance penalty. When the argument is passed by reference, only the reference needs to be copied which is 8 bytes compared to the 256 * 4 bytes. The Span<T> is a reference to a continuous array of memory, in this case a reference to the inline array. The performance difference between the by ref and span based solution comes from the way the FindX method indexes into the array. The by ref solution internally also uses a Span, hence to overhead.
When the size of the inline array is reduced to four items, the same performance benchmark comes head-to-head as the copy operation only needs to copy 16 bytes of data. In this case the CLR can choose to use vector registers backing the data of the inline array.
Benchmarking the Large types with 4 items length:
|          Method |      Mean |     Error |    StdDev |    Median |
|---------------- |----------:|----------:|----------:|----------:|
|     InlineArray | 0.1433 ns | 0.0159 ns | 0.0149 ns | 0.1448 ns |
|  InlineArrayRef | 0.0078 ns | 0.0067 ns | 0.0062 ns | 0.0056 ns |
| InlineArraySpan | 0.1231 ns | 0.0060 ns | 0.0056 ns | 0.1248 ns |

Gaining Performance

In a second example real-time market data updates are processed. Market data updates are sent by exchanges. For the sake of this example, the market data update contains 5 properties. The symbol is the name of a product, this would be typically a few letters abbreviation of a traded company or a derivative. Ask Price and Bid Price published by sellers and buyers; Mid Price is the average of ask and bid prices; Last Price is the price at which the last trade occurred. Such data could be encapsulated by MarketDataUpdateVanilla class.
public class MarketDataUpdateVanilla
{
    public required string Symbol { get; init; }
    public double AskPrice { get; set; }
    public double BidPrice { get; set; }
    public double MidPrice { get; set; }
    public double LastPrice { get; set; }
}
Let's assume the prices are published in cents by a server onto a pub-sub. End-user client applications may subscribe to these market data updates, so they can show the most recent prices of stocks to the user. Some users may prefer that the client applications display prices in cents, some may prefer to have the cents separated by a decimal point. Each user may set a user preference indicating their preferred quantifier. For example, a bid price of the value 1000 can be displayed as 10.00 for one user and as 1000 to another user.
Price updates are published by an exchange to pub-sub. Client applications process the available updates, ever so often so that a human end-user may read the current price value, typically every 200 milliseconds. The client applications read a collection of market data updates from the pub-sub with this given frequency. Before displaying the prices to the user, the application needs to process each market data update and adjust it with the user's preferred quantifier. A typical implementation could be:
foreach (var item in VanillaUpdates)
{
    item.AskPrice = item.AskPrice / Quantifier;
    item.BidPrice = item.BidPrice / Quantifier;
    item.MidPrice = item.MidPrice / Quantifier;
    item.LastPrice = item.LastPrice / Quantifier;
}
In this blog post I will explore 2 further representations of the same market data update using arrays and inline arrays. Then I will compare them from memory allocation and mean execution time standpoints.
A developer may decide to display the same market data update in a class as MarketDataUpdateArray using arrays:
public class MarketDataUpdateArray
{
    public required string Symbol { get; init; }
    public double[] Prices = new double[4];
    public double AskPrice { get => Prices[0]; set => Prices[0] = value; }
    public double BidPrice { get => Prices[1]; set => Prices[1] = value; }
    public double MidPrice { get => Prices[2]; set => Prices[2] = value; }
    public double LastPrice { get => Prices[3]; set => Prices[3] = value; }
}
The key advantage of this type is that a developer may process the updates of the array with loop:
foreach (var item in ArrayUpdates)
{
    for (int i = 0; i < item.Prices.Length; i++)
        item.Prices[i] = item.Prices[i] / Quantifier;
}
There are pros and cons for this implementation. While it is simpler to implement and maintain the processing of messages, a single update allocates 3 objects on the heap. One object is allocated for an instance of MarketDataUpdateArray, one for the array of prices and one for string symbol. Excluding the string Symbol property, this update allocates 88 bytes compared to the previous implementation of 56 bytes on x64 architecture.
Finally, a developer may choose to represent the market data update with the help of an inline array:
[System.Runtime.CompilerServices.InlineArray(4)]
public struct Quartet<T>
{
    private T _element0;
}

public class MarketDataUpdateQuartet
{
    public required string Symbol { get; init; }
    public Quartet<double> Prices;
    public double AskPrice { get => Prices[0]; set => Prices[0] = value; }
    public double BidPrice { get => Prices[1]; set => Prices[1] = value; }
    public double MidPrice { get => Prices[2]; set => Prices[2] = value; }
    public double LastPrice { get => Prices[3]; set => Prices[3] = value; }
}
The key difference is that this solution uses Quartet<double> instead of a double[] for representing the prices. This quartet is not allocated as a separate object, but it is in-lined with the memory allocation of an MarketDataUpdateQuartet object. Hence, this object also consumes 56 bytes on the heap.
One may process the updates the same way as shown above for the arrays. However, I present here an alternative approach using Single instruction, multiple data SIMD:
foreach (var item in QuartetUpdates)
{
    var vPrices = Vector.LoadUnsafe(ref item.Prices[0]);
    vPrices = Vector.Divide(vPrices, vQuantifier);
    Vector.StoreUnsafe(vPrices, ref item.Prices[0]);
}
First, all double values are loaded into a vector register, then the divide operation is executed where vQuantifier is a vector representation of the Quantifier property's value. Finally, the results are written back to the inline array.
I used BenchmarkDotNet to compare these solutions. Each approach processed a collection of 100 market data updates.
|      Method |     Mean |   Error |  StdDev | Allocated |
|------------ |---------:|--------:|--------:|----------:|
|     Vanilla | 382.4 ns | 4.80 ns | 4.01 ns |         - |
|       Array | 369.2 ns | 3.45 ns | 3.06 ns |         - |
| QuartetSIMD | 184.6 ns | 2.78 ns | 2.32 ns |         - |

While the results show the SIMD implementation is a clear winner: it does not allocate more to the vanilla approach but executes twice as fast. However, as usual benchmarks for SIMD operations are more nuanced. When the size of the collection is reduced to a single (or a few) entries only, the performance difference vanishes.
Which approach should one choose? As usual it depends. With tight loops and frequent market data updates for multiple symbols, it might be worth taking the SIMD optimization. For less frequent updates and less tight loops can still efficiently update data using the vanilla solutions. The suggestion is to measure the given application's use-case and choose an appropriate implementation given the performance and maintainability requirements of the application.