Last active September 10, 2021 12:31
Does caching of delegates in high-frequency C# make sense?

Benchmark to test whether caching delegates in high-frequency code has any benefit.



void ConsolePrint(string text) => Console.WriteLine(text);

void Print42(Action<string> print) => print("42");


Action<string> consolePrint = text => Console.WriteLine(text);


void Print42(Action<string> print) => print("42");

Let's test with a benchmark (using BenchmarkDotNet):

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser] // produces the Gen 0 and Allocated columns shown in the results
public class DelegateBenchmark
{
    // Created once in the constructor and reused on every call.
    private readonly Action<string> noopPrintDelegate;

    public DelegateBenchmark()
    {
        noopPrintDelegate = NoopPrint;
    }

    // Converts the method group to a new Action<string> on every call.
    [Benchmark(Baseline = true)]
    public void RawMethod() => Print42(NoopPrint);

    // Passes the delegate that was created in the constructor.
    [Benchmark]
    public void CachedDelegate() => Print42(noopPrintDelegate);

    [MethodImpl(MethodImplOptions.NoOptimization | MethodImplOptions.NoInlining)]
    private void NoopPrint(string text)
    {
    }

    [MethodImpl(MethodImplOptions.NoOptimization | MethodImplOptions.NoInlining)]
    private static void Print42(Action<string> print) => print("42");
}

public class Program
{
    public static void Main(string[] args) => BenchmarkRunner.Run<DelegateBenchmark>();
}

Results on my laptop (using the .NET Core 3.0 preview SDK, but that should not matter):

BenchmarkDotNet=v0.11.5, OS=macOS Mojave 10.14.5 (18F203) [Darwin 18.6.0]
Intel Core i9-8950HK CPU 2.90GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
  [Host]     : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT

|         Method |      Mean |     Error |    StdDev | Ratio |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------- |----------:|----------:|----------:|------:|-------:|------:|------:|----------:|
|      RawMethod | 14.206 ns | 0.1664 ns | 0.1475 ns |  1.00 | 0.3917 |     - |     - |      64 B |
| CachedDelegate |  2.401 ns | 0.0718 ns | 0.0769 ns |  0.17 |      - |     - |     - |         - |

// * Legends *
  Mean      : Arithmetic mean of all measurements
  Error     : Half of 99.9% confidence interval
  StdDev    : Standard deviation of all measurements
  Ratio     : Mean of the ratio distribution ([Current]/[Baseline])
  Gen 0     : GC Generation 0 collects per 1000 operations
  Gen 1     : GC Generation 1 collects per 1000 operations
  Gen 2     : GC Generation 2 collects per 1000 operations
  Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  1 ns      : 1 Nanosecond (0.000000001 sec)

So quite a big difference (14 ns vs 2.4 ns). Why is the cached version faster? This is the IL that gets generated for both methods: (easy to get using

.method public hidebysig 
    instance void RawMethod () cil managed 
{
    // Method begins at RVA 0x206c
    // Code size 19 (0x13)
    .maxstack 8

    IL_0000: ldarg.0
    IL_0001: ldftn instance void DelegateBenchmark::NoopPrint(string)
    IL_0007: newobj instance void class [mscorlib]System.Action`1<string>::.ctor(object, native int)
    IL_000c: call void DelegateBenchmark::Print42(class [mscorlib]System.Action`1<string>)
    IL_0011: nop
    IL_0012: ret
} // end of method DelegateBenchmark::RawMethod

.method public hidebysig 
    instance void CachedDelegate () cil managed 
{
    // Method begins at RVA 0x2080
    // Code size 13 (0xd)
    .maxstack 8

    IL_0000: ldarg.0
    IL_0001: ldfld class [mscorlib]System.Action`1<string> DelegateBenchmark::noopPrintDelegate
    IL_0006: call void DelegateBenchmark::Print42(class [mscorlib]System.Action`1<string>)
    IL_000b: nop
    IL_000c: ret
} // end of method DelegateBenchmark::CachedDelegate

In RawMethod, a new delegate has to be built on every call, which is where the 64 B allocation in the results comes from:

IL_0007: newobj instance void class [mscorlib]System.Action`1<string>::.ctor(object, native int)

CachedDelegate, on the other hand, only loads the existing delegate from the field (ldfld) and passes it to Print42.

So that explains the extra cost. To be clear, this is a tiny cost (roughly 12 ns and 64 B per call in this benchmark), but if your code gets called often enough it can end up mattering.
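The same effect can be seen without a benchmark. The following is a minimal sketch, with made-up names (DelegateIdentityDemo, Fresh, Cached) that are not part of the benchmark above. It converts the same instance method group to a delegate twice and compares the results with a cached field. It uses an instance method on purpose, since newer C# compilers may cache delegates created from static method groups.

```csharp
using System;

public class DelegateIdentityDemo
{
    // Hypothetical names, for illustration only.
    private readonly Action<string> cached;

    public DelegateIdentityDemo()
    {
        cached = Noop; // one delegate allocation, done once
    }

    private void Noop(string text) { }

    // A method group conversion compiles to ldftn + newobj,
    // so every call returns a newly built delegate object.
    public Action<string> Fresh() => Noop;

    // Just a field load (ldfld), nothing is allocated here.
    public Action<string> Cached() => cached;

    public static void Main()
    {
        var demo = new DelegateIdentityDemo();

        // Two conversions give two distinct objects...
        Console.WriteLine(object.ReferenceEquals(demo.Fresh(), demo.Fresh()));   // False
        // ...that still compare equal, because target and method match.
        Console.WriteLine(demo.Fresh().Equals(demo.Fresh()));                    // True
        // The cached field always hands back the same instance.
        Console.WriteLine(object.ReferenceEquals(demo.Cached(), demo.Cached())); // True
    }
}
```

Because the fresh delegates are equal but not identical, swapping in a cached delegate does not change behavior for code that only invokes or compares delegates with Equals.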

Just for fun I dug a bit into what happens when you create a delegate:

Action<T> is defined as delegate void Action<T>(T arg) in the BCL.

For every delegate type the compiler generates a class:

.class nested private auto ansi sealed Action`1<T>
    extends [mscorlib]System.MulticastDelegate
{
    // Methods
    .method public hidebysig specialname rtspecialname 
        instance void .ctor (
            object 'object',
            native int 'method'
        ) runtime managed 
    {
    } // end of method Action`1::.ctor

    .method public hidebysig newslot virtual 
        instance void Invoke (
            !T arg
        ) runtime managed 
    {
    } // end of method Action`1::Invoke

    .method public hidebysig newslot virtual 
        instance class [mscorlib]System.IAsyncResult BeginInvoke (
            !T arg,
            class [mscorlib]System.AsyncCallback callback,
            object 'object'
        ) runtime managed 
    {
    } // end of method Action`1::BeginInvoke

    .method public hidebysig newslot virtual 
        instance void EndInvoke (
            class [mscorlib]System.IAsyncResult result
        ) runtime managed 
    {
    } // end of method Action`1::EndInvoke

} // end of class Action`1

That class inherits from MulticastDelegate. MulticastDelegate itself doesn't do much here (it mostly comes into play when you 'add' delegates together).
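As a small illustration of that 'adding', here is a sketch with made-up names (MulticastDemo, Run) that is not part of the benchmark. Combining two delegates with + produces a new delegate whose invocation list contains both, and invoking it calls them in order.

```csharp
using System;
using System.Collections.Generic;

public static class MulticastDemo
{
    // Hypothetical helper: records the order in which the handlers run.
    public static List<string> Run()
    {
        var log = new List<string>();
        Action<string> first = text => log.Add("first:" + text);
        Action<string> second = text => log.Add("second:" + text);

        // '+' on delegates compiles to Delegate.Combine and returns a new delegate.
        Action<string> both = first + second;

        both("42"); // invokes first, then second
        log.Add("count:" + both.GetInvocationList().Length);
        return log;
    }

    public static void Main()
    {
        foreach (var entry in Run())
        {
            Console.WriteLine(entry);
        }
        // Prints:
        // first:42
        // second:42
        // count:2
    }
}
```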

One level deeper, MulticastDelegate inherits from System.Delegate.

System.Delegate is a wrapper around a target object (object target, or null for a static method) and a method pointer (IntPtr methodPtr). This also explains why invoking a delegate is nearly as cheap as calling the method directly: in the end it is just a method pointer.
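Those two pieces are visible from managed code through the public Target and Method properties. The sketch below uses made-up names (DelegateAnatomyDemo, ForInstance, ForStatic) and only inspects the delegates; it assumes they are created from method groups, since lambdas can have a compiler-generated target.

```csharp
using System;

public class DelegateAnatomyDemo
{
    private void InstancePrint(string text) { }
    private static void StaticPrint(string text) { }

    // Hypothetical helpers that expose the two delegates for inspection.
    public Action<string> ForInstance() => InstancePrint;
    public static Action<string> ForStatic() => StaticPrint;

    public static void Main()
    {
        var demo = new DelegateAnatomyDemo();
        Action<string> instanceDelegate = demo.ForInstance();
        Action<string> staticDelegate = ForStatic();

        // Instance method: Target is the object the method will be called on.
        Console.WriteLine(object.ReferenceEquals(instanceDelegate.Target, demo)); // True
        Console.WriteLine(instanceDelegate.Method.Name);                          // InstancePrint

        // Static method: there is no receiver, so Target is null.
        Console.WriteLine(staticDelegate.Target == null);                         // True
        Console.WriteLine(staticDelegate.Method.Name);                            // StaticPrint
    }
}
```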

The method that ends up constructing the delegate is (the internal constructors there are for other scenarios):

private extern void DelegateConstruct(object target, IntPtr slot);

So this is where the managed trail ends...

If you want to continue the trail then you need to go to: comdelegate.cpp

FCIMPL3(void, COMDelegate::DelegateConstruct, Object* refThisUNSAFE, Object* targetUNSAFE, PCODE method)

Here it goes way beyond my skills, but AFAIK it looks up the memory address of the code that the JIT generated for that method.

