@BastianBlokland
Last active September 10, 2021 12:31
Does caching of delegates in high-frequency C# code make sense?

Benchmark to test whether caching delegates in high-frequency code has any benefit.

so:

Print42(ConsolePrint);

void ConsolePrint(string text) => Console.WriteLine(text);

void Print42(Action<string> print) => print("42");

vs

Action<string> consolePrint = text => Console.WriteLine(text);

Print42(consolePrint);

void Print42(Action<string> print) => print("42");

Let's test with a benchmark (using BenchmarkDotNet):

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class DelegateBenchmark
{
    private readonly Action<string> noopPrintDelegate;

    public DelegateBenchmark()
    {
        noopPrintDelegate = NoopPrint;
    }

    [Benchmark(Baseline = true)]
    public void RawMethod() => Print42(NoopPrint);

    [Benchmark]
    public void CachedDelegate() => Print42(noopPrintDelegate);

    [MethodImpl(MethodImplOptions.NoOptimization | MethodImplOptions.NoInlining)]
    private void NoopPrint(string text)
    {
    }

    [MethodImpl(MethodImplOptions.NoOptimization | MethodImplOptions.NoInlining)]
    private static void Print42(Action<string> print) => print("42");
}

public class Program
{
    public static void Main(string[] args) => BenchmarkRunner.Run<DelegateBenchmark>();
}

Results on my laptop (using the 3.0 preview SDK, but that should not matter):

BenchmarkDotNet=v0.11.5, OS=macOS Mojave 10.14.5 (18F203) [Darwin 18.6.0]
Intel Core i9-8950HK CPU 2.90GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.0.100-preview5-011568
  [Host]     : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT
  DefaultJob : .NET Core 3.0.0-preview5-27626-15 (CoreCLR 4.6.27622.75, CoreFX 4.700.19.22408), 64bit RyuJIT

|         Method |      Mean |     Error |    StdDev | Ratio |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------- |----------:|----------:|----------:|------:|-------:|------:|------:|----------:|
|      RawMethod | 14.206 ns | 0.1664 ns | 0.1475 ns |  1.00 | 0.3917 |     - |     - |      64 B |
| CachedDelegate |  2.401 ns | 0.0718 ns | 0.0769 ns |  0.17 |      - |     - |     - |         - |

// * Legends *
  Mean      : Arithmetic mean of all measurements
  Error     : Half of 99.9% confidence interval
  StdDev    : Standard deviation of all measurements
  Ratio     : Mean of the ratio distribution ([Current]/[Baseline])
  Gen 0     : GC Generation 0 collects per 1000 operations
  Gen 1     : GC Generation 1 collects per 1000 operations
  Gen 2     : GC Generation 2 collects per 1000 operations
  Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  1 ns      : 1 Nanosecond (0.000000001 sec)

So quite a big difference (14 ns vs 2.4 ns). Why is the cached version faster? This is the IL that gets generated for both methods (easy to get using sharplab.io):

.method public hidebysig 
    instance void RawMethod () cil managed 
{
    // Method begins at RVA 0x206c
    // Code size 19 (0x13)
    .maxstack 8

    IL_0000: ldarg.0
    IL_0001: ldftn instance void DelegateBenchmark::NoopPrint(string)
    IL_0007: newobj instance void class [mscorlib]System.Action`1<string>::.ctor(object, native int)
    IL_000c: call void DelegateBenchmark::Print42(class [mscorlib]System.Action`1<string>)
    IL_0011: nop
    IL_0012: ret
} // end of method DelegateBenchmark::RawMethod

.method public hidebysig 
    instance void CachedDelegate () cil managed 
{
    // Method begins at RVA 0x2080
    // Code size 13 (0xd)
    .maxstack 8

    IL_0000: ldarg.0
    IL_0001: ldfld class [mscorlib]System.Action`1<string> DelegateBenchmark::noopPrintDelegate
    IL_0006: call void DelegateBenchmark::Print42(class [mscorlib]System.Action`1<string>)
    IL_000b: nop
    IL_000c: ret
} // end of method DelegateBenchmark::CachedDelegate

In the RawMethod method a new delegate has to be constructed on every call:

IL_0007: newobj instance void class [mscorlib]System.Action`1<string>::.ctor(object, native int)

While the CachedDelegate method just passes our existing delegate along.

So that explains the extra cost. To be clear, this is a tiny cost, but if your code gets called enough times it can end up mattering.
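A common way to avoid the per-call allocation is to hoist the method-group conversion into a static readonly field. This is a minimal sketch (the `Printer`, `CountingPrint`, and `Calls` names are illustrative, not from the benchmark above); the `Calls` counter is only there to make the example observable:

```csharp
using System;

public static class Printer
{
    public static int Calls;

    // The conversion CountingPrint -> Action<string> happens once, at type
    // initialization, instead of on every call inside the hot loop.
    private static readonly Action<string> CachedPrint = CountingPrint;

    public static void PrintMany(int count)
    {
        for (int i = 0; i < count; i++)
            Print42(CachedPrint); // no delegate allocation in the loop

        // Print42(CountingPrint) here would allocate a delegate per iteration.
    }

    private static void CountingPrint(string text) => Calls++;

    private static void Print42(Action<string> print) => print("42");
}
```

Worth noting: the compiler already caches delegates created from non-capturing lambdas, and newer compiler versions (C# 11) also cache method-group conversions to static methods, so depending on your compiler this pattern may already be applied for you.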

Just for fun, I dug a bit into what happens when you create a delegate.

Action<T> is defined as delegate void Action<T>(T arg) in the BCL.
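You can declare an equivalent delegate type yourself; a one-line declaration like this is all the compiler needs to generate a full class (the `MyAction` name is just for illustration):

```csharp
using System;

// Equivalent shape to the BCL's Action<T>; the compiler turns this single
// line into a class with .ctor, Invoke, BeginInvoke and EndInvoke methods.
delegate void MyAction<T>(T arg);

class Demo
{
    static void Main()
    {
        MyAction<string> print = Console.WriteLine; // method-group conversion
        print("hello");
    }
}
```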

For every delegate the compiler generates a class:

.class nested private auto ansi sealed Action`1<T>
    extends [mscorlib]System.MulticastDelegate
{
    // Methods
    .method public hidebysig specialname rtspecialname 
        instance void .ctor (
            object 'object',
            native int 'method'
        ) runtime managed 
    {
    } // end of method Action`1::.ctor

    .method public hidebysig newslot virtual 
        instance void Invoke (
            !T arg
        ) runtime managed 
    {
    } // end of method Action`1::Invoke

    .method public hidebysig newslot virtual 
        instance class [mscorlib]System.IAsyncResult BeginInvoke (
            !T arg,
            class [mscorlib]System.AsyncCallback callback,
            object 'object'
        ) runtime managed 
    {
    } // end of method Action`1::BeginInvoke

    .method public hidebysig newslot virtual 
        instance void EndInvoke (
            class [mscorlib]System.IAsyncResult result
        ) runtime managed 
    {
    } // end of method Action`1::EndInvoke

} // end of class Action`1

That class inherits from MulticastDelegate. MulticastDelegate itself doesn't do much (it mostly comes into play when you 'add' delegates together).
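That 'adding together' looks like this (a small sketch; the `+` operator on delegates compiles down to Delegate.Combine):

```csharp
using System;

class Demo
{
    static void Main()
    {
        Action a = () => Console.Write("one ");
        Action b = () => Console.Write("two");

        // '+' compiles to Delegate.Combine; the result is a MulticastDelegate
        // whose invocation list holds both entries, invoked in order.
        Action combined = a + b;
        combined(); // prints "one two"
        Console.WriteLine();
        Console.WriteLine(combined.GetInvocationList().Length); // prints 2
    }
}
```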

One level deeper it inherits from System.Delegate.

System.Delegate is a wrapper around a target object (target, or null for a static method) and a method pointer (IntPtr methodPtr). That also explains why invoking a delegate is just as fast as calling a method directly: in the end it is just a method pointer.
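You can observe both halves through the public Delegate.Target and Delegate.Method properties (a small sketch):

```csharp
using System;

class Demo
{
    static void StaticPrint(string text) { }
    void InstancePrint(string text) { }

    static void Main()
    {
        Action<string> fromStatic = StaticPrint;
        var demo = new Demo();
        Action<string> fromInstance = demo.InstancePrint;

        // A static method needs no target object; an instance method stores
        // the receiver in Target so Invoke can pass it as 'this'.
        Console.WriteLine(fromStatic.Target == null);   // True
        Console.WriteLine(fromInstance.Target == demo); // True
        Console.WriteLine(fromStatic.Method.Name);      // StaticPrint
    }
}
```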

The method that ends up constructing the delegate is the following (the internal constructors alongside it are for other scenarios):

[MethodImplAttribute(MethodImplOptions.InternalCall)]
private extern void DelegateConstruct(object target, IntPtr slot);

So this is where the managed trail ends...

If you want to continue the trail, you need to go to comdelegate.cpp in the runtime:

FCIMPL3(void, COMDelegate::DelegateConstruct, Object* refThisUNSAFE, Object* targetUNSAFE, PCODE method)
{
...

Here it goes way beyond my skills, but as far as I know it looks up the memory address of the code that the JIT generated for that method.
