@EgorBo
Last active January 25, 2024 15:15

Dynamic PGO in .NET 6.0

Dynamic PGO (profile-guided optimization) is a JIT compiler optimization technique: the JIT collects additional information about a method's surroundings (a profile) while the method runs in tier0 codegen, and then relies on that profile when hot methods are promoted from tier0 to tier1, making the tier1 code even more efficient.

What exactly can PGO optimize for us?

  1. Profile-driven inlining - the inliner relies on PGO data, so it can be very aggressive on hot paths and care less about cold ones; see dotnet/runtime#52708 and dotnet/runtime#55478. A good example where this has a visible effect is the StringBuilder benchmark below (Benchmark 2).

  2. Guarded devirtualization - most monomorphic virtual/interface calls can be devirtualized using PGO data, e.g.:

void DisposeMe(IDisposable d)
{
    d.Dispose();
}

It looks like nothing can be optimized here, right? It's just an ordinary virtual (interface) call on an object of unknown type: it goes through several indirections to reach the actual Dispose() implementation, and that implementation can never be inlined here. Now let's see what PGO can do.
With Dynamic PGO on, in tier0 this method is compiled to something like this (pseudo-code for the actual machine code):

void DisposeMe(IDisposable d)
{
+   call CORINFO_HELP_CLASSPROFILE32(d, offset);
    d.Dispose();
}

We now poll d for its underlying type on every call of that method. Yes, it makes tier0 slightly slower, but eventually the method is recompiled to tier1 into something like this:

void DisposeMe(IDisposable d)
{
+   if (d is MyType)           // E.g. Profile states that Dispose here is 'mostly' called on MyType.
+       ((MyType)d).Dispose(); // Direct call - can be inlined now!
+   else
        d.Dispose();           // a cold fallback, just in case
}

(image: codegen diff for the case where MyType::Dispose is empty)
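
To see the whole tier0 -> tier1 flow end to end, here is a minimal self-contained sketch; the MyType class, the DevirtDemo wrapper and the iteration count are illustrative assumptions rather than code from the original example. Run it with the environment variables from the end of this gist and the hot call site should eventually be recompiled with the guarded devirtualization shown above:

using System;
using System.Diagnostics;

// Illustrative concrete type standing in for "the type Dispose is mostly called on".
sealed class MyType : IDisposable
{
    public void Dispose() { } // empty, as in the codegen diff above
}

static class DevirtDemo
{
    static void DisposeMe(IDisposable d) => d.Dispose();

    static void Main()
    {
        IDisposable d = new MyType();

        // Keep the call site hot: tier0 records the observed class via the profiling
        // helper, and after enough calls DisposeMe is recompiled to tier1.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 100_000_000; i++)
            DisposeMe(d);

        // Rough signal only; the BenchmarkDotNet setup below gives reliable numbers.
        Console.WriteLine(sw.Elapsed);
    }
}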

  3. Hot-cold block reordering - the JIT reorders blocks to keep hot ones close to each other and pushes cold ones to the end of the method. The following code:
void DoWork(int a)
{
    if (a > 0)
        DoWork1();
    else
        DoWork2();
}

is compiled like this in tier0:

void DoWork(int a)
{
    if (a > 0)
+   {
+       __block_0_counter++;
        DoWork1();
+   }
    else
+   {
+       __block_1_counter++;
        DoWork2();
+   }
}

And again: once it's recompiled to tier1 it is optimized into:

void DoWork(int a)
{
    // E.g. __block_0_counter is smaller or even zero => DoWork1 is rarely (never) taken
    // and JIT re-orders DoWork2 with DoWork1:
-   if (a > 0)
+   if (a <= 0)
-       DoWork2();
+       DoWork1();
    else
-       DoWork1();
+       DoWork2();
}
  4. Register allocation - realistic block weights allow the JIT to pick a better strategy for what to keep in registers and what to spill (see the sketch after this list).
  5. Misc - some optimizations, such as Loop Cloning and Inlined Casts, aren't applied in cold blocks.
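
To make the last two points slightly more concrete, here is a small hand-written sketch (the BlockWeightSketch/Sum/ThrowDataIsNull names and the exact shape are assumptions of mine, not code from this gist) of the kind of method where realistic block weights help: once the profile says the error path is (almost) never taken, the hot loop no longer competes with it for registers and the cold block is laid out at the end of the method.

using System;

static class BlockWeightSketch
{
    // Illustration only: with PGO the JIT learns that the null check below virtually
    // never fails, so the loop can keep its locals in registers while the cold
    // throw path is treated as a rarely-reached block.
    public static int Sum(int[] data)
    {
        if (data == null)
            ThrowDataIsNull(); // cold block according to the collected profile

        int sum = 0;
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        return sum;
    }

    private static void ThrowDataIsNull() => throw new ArgumentNullException("data");
}

This "throw helper" shape is the same pattern the base class libraries use to keep exception paths out of hot methods.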

Benchmarks (Default mode vs Dynamic PGO)

using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

// Run the benchmarks
BenchmarkRunner.Run<PgoBenchmarks>();


[Config(typeof(MyEnvVars))]
public class PgoBenchmarks
{
    // Custom config to define "Default vs PGO"
    class MyEnvVars : ManualConfig
    {
        public MyEnvVars()
        {
            // Use .NET 6.0 default mode:
            AddJob(Job.Default.WithId("Default mode"));

            // Use Dynamic PGO mode:
            AddJob(Job.Default.WithId("Dynamic PGO")
                .WithEnvironmentVariables(
                    new EnvironmentVariable("DOTNET_TieredPGO", "1"),
                    new EnvironmentVariable("DOTNET_TC_QuickJitForLoops", "1"),
                    new EnvironmentVariable("DOTNET_ReadyToRun", "0")));
        }
    }


    //
    // Benchmark 1: Devirtualize unknown virtual calls:
    //

    public IEnumerable<object> TestData()
    {
        // Test data for 'GuardedDevirtualization(ICollection<int>)'
        yield return new List<int>();
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public void GuardedDevirtualization(ICollection<int> collection)
    {
        // a chain of unknown virtual calls...
        collection.Clear();
        collection.Add(1);
        collection.Add(2);
        collection.Add(3);
    }


    //
    // Benchmark 2: Allow inliner to be way more aggressive than usual
    //              for profiled call-sites:
    //

    [Benchmark]
    public StringBuilder ProfileDrivingInlining()
    {
        StringBuilder sb = new();
        for (int i = 0; i < 1000; i++)
            sb.Append("hi"); // see https://twitter.com/EgorBo/status/1451149444183990273
        return sb;
    }


    //
    // Benchmark 3: Reorder hot-cold blocks for better performance
    //

    [Benchmark]
    [Arguments(42)]
    public string HotColdBlockReordering(int a)
    {
        if (a == 1)
            return "a is 1";
        if (a == 2)
            return "a is 2";
        if (a == 3)
            return "a is 3";
        if (a == 4)
            return "a is 4";
        if (a == 5)
            return "a is 5";
        return "a is too big"; // this branch is always taken in this benchmark (a is 42)
    }
}
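
To reproduce these numbers, the usual BenchmarkDotNet workflow applies; this assumes the snippet above is the Program.cs of a console project:

# add the BenchmarkDotNet package once, then run in Release
dotnet add package BenchmarkDotNet
dotnet run -c Release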

Results:

| Method                  | Job          | Mean          | Error      | StdDev     |
|-------------------------|--------------|--------------:|-----------:|-----------:|
| GuardedDevirtualization | Default mode | 5.7448 ns     | 0.0020 ns  | 0.0017 ns  |
| GuardedDevirtualization | Dynamic PGO  | 3.2651 ns     | 0.0233 ns  | 0.0182 ns  |
| ProfileDrivingInlining  | Default mode | 3,538.2980 ns | 26.7256 ns | 23.6915 ns |
| ProfileDrivingInlining  | Dynamic PGO  | 2,167.8397 ns | 5.0619 ns  | 4.2269 ns  |
| HotColdBlockReordering  | Default mode | 1.5244 ns     | 0.0029 ns  | 0.0025 ns  |
| HotColdBlockReordering  | Dynamic PGO  | 0.0181 ns     | 0.0051 ns  | 0.0040 ns  |

How can I try it in production?

You only need to make sure the following environment variables are defined for the process that runs your program:

# Enable Dynamic PGO
export DOTNET_TieredPGO=1

# AOT (ReadyToRun) images aren't instrumented, so we need to disable them and collect
# PGO data for literally everything. This hurts startup time,
# but leads to higher performance after warm-up.
export DOTNET_ReadyToRun=0

# Full-fledged OSR will hopefully be enabled in .NET 7.0; for now, methods with loops
# bypass tier0 by default, but we need them in tier0 so they can be instrumented for PGO.
export DOTNET_TC_QuickJitForLoops=1

^ Linux/macOS; for Windows (PowerShell):

$env:DOTNET_TieredPGO=1
$env:DOTNET_ReadyToRun=0
$env:DOTNET_TC_QuickJitForLoops=1
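
For a quick experiment you can also set the variables inline for a single run (bash syntax; MyApp.dll is just a placeholder for your application):

DOTNET_TieredPGO=1 DOTNET_ReadyToRun=0 DOTNET_TC_QuickJitForLoops=1 dotnet MyApp.dll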

Community feedback on PGO in .NET 6.0

Please tag me (@EgorBo) on Twitter and I'll forward it to the team.

(community benchmark screenshots, including a YARP Proxy result)
mrange commented Nov 21, 2021

Hi. Great post and very interesting. Dynamic PGO is a feature I have been looking forward to.

Are there cases where we can expect performance regressions? I have a small loop that does seem to do worse with Dynamic PGO.

EgorBo commented Dec 2, 2021

@mrange Unfortunately, yes, there are.
Due to the lack of OSR in .NET 6.0, it's possible that with DOTNET_TC_QuickJitForLoops=1 some methods with hot loops will get stuck in tier0 forever.
We hope it won't be an issue in .NET 7.0 (update: the work is already happening: dotnet/runtime#61934)

@AndyAyersMS

> I have a small loop that does seem to do worse with Dynamic PGO.

If you can share this example, we'd be happy to take a look.

f2bo commented Feb 23, 2022

Hi @EgorBo,

Very interesting. Thank you. However, isn't the diff after recompilation for the Hot-cold block reordering section (copied below) backward?

void DoWork(int a)
{
    // E.g. __block_0_counter is smaller or even zero => DoWork1 is rarely (never) taken
    // and JIT re-orders DoWork2 with DoWork1:
-   if (a > 0)
+   if (a <= 0)
-       DoWork2();
+       DoWork1();
    else
-       DoWork1();
+       DoWork2();
}

And should really be:

void DoWork(int a)
{
    // E.g. __block_0_counter is smaller or even zero => DoWork1 is rarely (never) taken
    // and JIT re-orders DoWork2 with DoWork1:
-   if (a > 0)
+   if (a <= 0)
-      DoWork1();
+      DoWork2();
    else
-      DoWork2();
+      DoWork1();
}

@jerviscui

> We now poll that d for its underlying type every call of that method. Yes, it makes it slightly slower, but eventually it will be re-compiled to tier1 to something like this:

What does the end result look like in tier1? Is it like this?

void DisposeMe(IDisposable d)
{
+   if (d is MyType)           // E.g. Profile states that Dispose here is 'mostly' called on MyType.
+       ((MyType)d).Dispose(); // Direct call - can be inlined now!
+   else
        d.Dispose();           // a cold fallback, just in case
}

But the next image shows that it has a much larger code size.

EgorBo commented Aug 18, 2023

> But the next image shows that it has a much larger code size.

Why would the code size be smaller considering we added a type check and the devirtualized path?

@jerviscui

Ok, I got it.

In this picture, the left side shows tier1 when MyType::Dispose is empty and the right side shows tier1 when it is not empty.
https://user-images.githubusercontent.com/523221/126960839-6bc3b110-014a-4680-abd8-44c9e7e01765.png

Right?

EgorBo commented Aug 21, 2023

> Ok, I got it.
>
> In this picture, the left side shows tier1 when MyType::Dispose is empty and the right side shows tier1 when it is not empty. https://user-images.githubusercontent.com/523221/126960839-6bc3b110-014a-4680-abd8-44c9e7e01765.png
>
> Right?

No, it's empty in both cases, it's just that it's not devirtualized/inlined on the left.

@jerviscui

So, with PGO turned on it will be this:
(image)

This is with no PGO:
(image)
