@BastianBlokland
Last active September 10, 2021 12:31
Does caching of delegates in high-frequency C# code make sense?

Benchmark to test whether caching delegates in high-frequency code has any benefit.

so:

Print42(ConsolePrint);

void ConsolePrint(string text) => Console.WriteLine(text);

void Print42(Action<string> print) => print("42");

vs

Action<string> consolePrint = text => Console.WriteLine(text);

Print42(consolePrint);

void Print42(Action<string> print) => print("42");

Let's test with a benchmark (using BenchmarkDotNet):

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class DelegateBenchmark
{
    private readonly Action<string> noopPrintDelegate;

    public DelegateBenchmark()
    {
        noopPrintDelegate = NoopPrint;
    }

    [Benchmark(Baseline = true)]
    public void RawMethod() => Print42(NoopPrint);

    [Benchmark]
    public void CachedDelegate() => Print42(noopPrintDelegate);

    [MethodImpl(MethodImplOptions.NoOptimization | MethodImplOptions.NoInlining)]
    private void NoopPrint(string text)
    {
    }

    [MethodImpl(MethodImplOptions.NoOptimization | MethodImplOptions.NoInlining)]
    private static void Print42(Action<string> print) => print("42");
}

public class Program
{
    public static void Main(string[] args) => BenchmarkRunner.Run<DelegateBenchmark>();
}

Results on my laptop (using the 3.0 preview SDK, but that should not matter):

BenchmarkDotNet=v0.11.5, OS=macOS Mojave 10.14.5 (18F203) [Darwin 18.6.0]
Intel Core i9-8950HK CPU 2.90GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
  [Host]     : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT

|         Method |      Mean |     Error |    StdDev | Ratio |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------- |----------:|----------:|----------:|------:|-------:|------:|------:|----------:|
|      RawMethod | 14.206 ns | 0.1664 ns | 0.1475 ns |  1.00 | 0.3917 |     - |     - |      64 B |
| CachedDelegate |  2.401 ns | 0.0718 ns | 0.0769 ns |  0.17 |      - |     - |     - |         - |

// * Legends *
  Mean      : Arithmetic mean of all measurements
  Error     : Half of 99.9% confidence interval
  StdDev    : Standard deviation of all measurements
  Ratio     : Mean of the ratio distribution ([Current]/[Baseline])
  Gen 0     : GC Generation 0 collects per 1000 operations
  Gen 1     : GC Generation 1 collects per 1000 operations
  Gen 2     : GC Generation 2 collects per 1000 operations
  Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  1 ns      : 1 Nanosecond (0.000000001 sec)

So quite a big difference (14 ns vs 2.4 ns). Why is the cached version faster? This is the IL that gets generated for both methods (easy to get using sharplab.io):

.method public hidebysig 
    instance void RawMethod () cil managed 
{
    // Method begins at RVA 0x206c
    // Code size 19 (0x13)
    .maxstack 8

    IL_0000: ldarg.0
    IL_0001: ldftn instance void DelegateBenchmark::NoopPrint(string)
    IL_0007: newobj instance void class [mscorlib]System.Action`1<string>::.ctor(object, native int)
    IL_000c: call void DelegateBenchmark::Print42(class [mscorlib]System.Action`1<string>)
    IL_0011: nop
    IL_0012: ret
} // end of method DelegateBenchmark::RawMethod

.method public hidebysig 
    instance void CachedDelegate () cil managed 
{
    // Method begins at RVA 0x2080
    // Code size 13 (0xd)
    .maxstack 8

    IL_0000: ldarg.0
    IL_0001: ldfld class [mscorlib]System.Action`1<string> DelegateBenchmark::noopPrintDelegate
    IL_0006: call void DelegateBenchmark::Print42(class [mscorlib]System.Action`1<string>)
    IL_000b: nop
    IL_000c: ret
} // end of method DelegateBenchmark::CachedDelegate

In the RawMethod method a new delegate has to be constructed on every call:

IL_0007: newobj instance void class [mscorlib]System.Action`1<string>::.ctor(object, native int)

While the CachedDelegate method just passes our existing delegate along.

So that explains the extra cost. To be clear, this is a tiny cost, but if your code gets called enough times it can end up mattering.
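A common way to avoid the per-call allocation is to hoist the method-group conversion into a static readonly field. This is a minimal sketch (the `Printer`, `CountingPrint`, and `Calls` names are illustrative, not from the benchmark above); the `Calls` counter is only there to make the example observable:

```csharp
using System;

public static class Printer
{
    public static int Calls;

    // The conversion CountingPrint -> Action<string> happens once, at type
    // initialization, instead of on every call inside the hot loop.
    private static readonly Action<string> CachedPrint = CountingPrint;

    public static void PrintMany(int count)
    {
        for (int i = 0; i < count; i++)
            Print42(CachedPrint); // no delegate allocation in the loop

        // Print42(CountingPrint) here would allocate a delegate per iteration.
    }

    private static void CountingPrint(string text) => Calls++;

    private static void Print42(Action<string> print) => print("42");
}
```

Worth noting: the compiler already caches delegates created from non-capturing lambdas, and newer compiler versions (C# 11) also cache method-group conversions to static methods, so depending on your compiler this pattern may already be applied for you.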

Just for fun, I dug a bit into what happens when you create a delegate.

Action<T> is defined as delegate void Action<T>(T arg) in the BCL.
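You can declare an equivalent delegate type yourself; a one-line declaration like this is all the compiler needs to generate a full class (the `MyAction` name is just for illustration):

```csharp
using System;

// Equivalent shape to the BCL's Action<T>; the compiler turns this single
// line into a class with .ctor, Invoke, BeginInvoke and EndInvoke methods.
delegate void MyAction<T>(T arg);

class Demo
{
    static void Main()
    {
        MyAction<string> print = Console.WriteLine; // method-group conversion
        print("hello");
    }
}
```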

For every delegate the compiler generates a class:

.class nested private auto ansi sealed Action`1<T>
    extends [mscorlib]System.MulticastDelegate
{
    // Methods
    .method public hidebysig specialname rtspecialname 
        instance void .ctor (
            object 'object',
            native int 'method'
        ) runtime managed 
    {
    } // end of method Action`1::.ctor

    .method public hidebysig newslot virtual 
        instance void Invoke (
            !T arg
        ) runtime managed 
    {
    } // end of method Action`1::Invoke

    .method public hidebysig newslot virtual 
        instance class [mscorlib]System.IAsyncResult BeginInvoke (
            !T arg,
            class [mscorlib]System.AsyncCallback callback,
            object 'object'
        ) runtime managed 
    {
    } // end of method Action`1::BeginInvoke

    .method public hidebysig newslot virtual 
        instance void EndInvoke (
            class [mscorlib]System.IAsyncResult result
        ) runtime managed 
    {
    } // end of method Action`1::EndInvoke

} // end of class Action`1

That class inherits from MulticastDelegate. MulticastDelegate itself doesn't do much (it mostly comes into play when you 'add' delegates together).
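That 'adding together' looks like this (a small sketch; the `+` operator on delegates compiles down to Delegate.Combine):

```csharp
using System;

class Demo
{
    static void Main()
    {
        Action a = () => Console.Write("one ");
        Action b = () => Console.Write("two");

        // '+' compiles to Delegate.Combine; the result is a MulticastDelegate
        // whose invocation list holds both entries, invoked in order.
        Action combined = a + b;
        combined(); // prints "one two"
        Console.WriteLine();
        Console.WriteLine(combined.GetInvocationList().Length); // prints 2
    }
}
```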

One level deeper it inherits from System.Delegate.

System.Delegate is a wrapper around a target object (target, or null for a static method) and a method pointer (IntPtr methodPtr). That also explains why invoking a delegate is just as fast as calling a method directly: in the end it is just a method pointer.
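You can observe both halves through the public Delegate.Target and Delegate.Method properties (a small sketch):

```csharp
using System;

class Demo
{
    static void StaticPrint(string text) { }
    void InstancePrint(string text) { }

    static void Main()
    {
        Action<string> fromStatic = StaticPrint;
        var demo = new Demo();
        Action<string> fromInstance = demo.InstancePrint;

        // A static method needs no target object; an instance method stores
        // the receiver in Target so Invoke can pass it as 'this'.
        Console.WriteLine(fromStatic.Target == null);   // True
        Console.WriteLine(fromInstance.Target == demo); // True
        Console.WriteLine(fromStatic.Method.Name);      // StaticPrint
    }
}
```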

The method that ends up constructing the delegate is the following (the internal constructors alongside it are for other scenarios):

[MethodImplAttribute(MethodImplOptions.InternalCall)]
private extern void DelegateConstruct(object target, IntPtr slot);

So this is where the managed trail ends...

If you want to continue the trail, you need to go to comdelegate.cpp in the runtime:

FCIMPL3(void, COMDelegate::DelegateConstruct, Object* refThisUNSAFE, Object* targetUNSAFE, PCODE method)
{
...

Here it goes way beyond my skills, but as far as I know it looks up the memory address of the code that the JIT generated for that method.
