@EgorBo
Last active January 25, 2024 15:15

Dynamic PGO in .NET 6.0

Dynamic PGO (profile-guided optimization) is a JIT compiler optimization technique: the JIT collects additional information about a method's surroundings (a profile) while the method runs in tier0 codegen, and then relies on that profile when hot methods are promoted from tier0 to tier1, making the tier1 code even more efficient.

What exactly can PGO optimize for us?

  1. Profile-driven inlining - the inliner relies on PGO data, so it can be very aggressive on hot paths and care less about cold ones; see dotnet/runtime#52708 and dotnet/runtime#55478. A good example where this has a visible effect is the StringBuilder benchmark below (Benchmark 2).

  2. Guarded devirtualization - most monomorphic virtual/interface calls can be devirtualized using PGO data, e.g.:

void DisposeMe(IDisposable d)
{
    d.Dispose();
}

It looks like nothing can be optimized here, right? It's just an ordinary virtual (interface) call on an object of unknown type: it goes through several indirections to reach the actual Dispose() implementation, and that implementation can never be inlined here. Now let's see what PGO can do.
With Dynamic PGO on, in tier0 this method is compiled to something like this (pseudo-code for the actual machine code):

void DisposeMe(IDisposable d)
{
+   call CORINFO_HELP_CLASSPROFILE32(d, offset);
    d.Dispose();
}

We now poll d for its underlying type on every call of that method. Yes, it makes tier0 slightly slower, but eventually the method is recompiled to tier1 into something like this:

void DisposeMe(IDisposable d)
{
+   if (d is MyType)           // E.g. Profile states that Dispose here is 'mostly' called on MyType.
+       ((MyType)d).Dispose(); // Direct call - can be inlined now!
+   else
        d.Dispose();           // a cold fallback, just in case
}

(image: codegen diff for the case where MyType::Dispose is empty)
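
To see the whole tier0 -> tier1 flow end to end, here is a minimal self-contained sketch; the MyType class, the DevirtDemo wrapper and the iteration count are illustrative assumptions rather than code from the original example. Run it with the environment variables from the end of this gist and the hot call site should eventually be recompiled with the guarded devirtualization shown above:

using System;
using System.Diagnostics;

// Illustrative concrete type standing in for "the type Dispose is mostly called on".
sealed class MyType : IDisposable
{
    public void Dispose() { } // empty, as in the codegen diff above
}

static class DevirtDemo
{
    static void DisposeMe(IDisposable d) => d.Dispose();

    static void Main()
    {
        IDisposable d = new MyType();

        // Keep the call site hot: tier0 records the observed class via the profiling
        // helper, and after enough calls DisposeMe is recompiled to tier1.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 100_000_000; i++)
            DisposeMe(d);

        // Rough signal only; the BenchmarkDotNet setup below gives reliable numbers.
        Console.WriteLine(sw.Elapsed);
    }
}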

  3. Hot-cold block reordering - the JIT reorders blocks to keep hot ones close to each other and pushes cold ones to the end of the method. The following code:
void DoWork(int a)
{
    if (a > 0)
        DoWork1();
    else
        DoWork2();
}

is compiled like this in tier0:

void DoWork(int a)
{
    if (a > 0)
+   {
+       __block_0_counter++;
        DoWork1();
+   }
    else
+   {
+       __block_1_counter++;
        DoWork2();
+   }
}

And again: once it's recompiled to tier1 it is optimized into:

void DoWork(int a)
{
    // E.g. __block_0_counter is smaller or even zero => DoWork1 is rarely (never) taken
    // and JIT re-orders DoWork2 with DoWork1:
-   if (a > 0)
+   if (a <= 0)
-       DoWork2();
+       DoWork1();
    else
-       DoWork1();
+       DoWork2();
}
  4. Register allocation - realistic block weights allow the JIT to pick a better strategy for what to keep in registers and what to spill (see the sketch after this list).
  5. Misc - some optimizations, such as Loop Cloning and Inlined Casts, aren't applied in cold blocks.
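
To make the last two points slightly more concrete, here is a small hand-written sketch (the BlockWeightSketch/Sum/ThrowDataIsNull names and the exact shape are assumptions of mine, not code from this gist) of the kind of method where realistic block weights help: once the profile says the error path is (almost) never taken, the hot loop no longer competes with it for registers and the cold block is laid out at the end of the method.

using System;

static class BlockWeightSketch
{
    // Illustration only: with PGO the JIT learns that the null check below virtually
    // never fails, so the loop can keep its locals in registers while the cold
    // throw path is treated as a rarely-reached block.
    public static int Sum(int[] data)
    {
        if (data == null)
            ThrowDataIsNull(); // cold block according to the collected profile

        int sum = 0;
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        return sum;
    }

    private static void ThrowDataIsNull() => throw new ArgumentNullException("data");
}

This "throw helper" shape is the same pattern the base class libraries use to keep exception paths out of hot methods.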

Benchmarks (Default mode vs Dynamic PGO)

using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

// Run the benchmarks
BenchmarkRunner.Run<PgoBenchmarks>();


[Config(typeof(MyEnvVars))]
public class PgoBenchmarks
{
    // Custom config to define "Default vs PGO"
    class MyEnvVars : ManualConfig
    {
        public MyEnvVars()
        {
            // Use .NET 6.0 default mode:
            AddJob(Job.Default.WithId("Default mode"));

            // Use Dynamic PGO mode:
            AddJob(Job.Default.WithId("Dynamic PGO")
                .WithEnvironmentVariables(
                    new EnvironmentVariable("DOTNET_TieredPGO", "1"),
                    new EnvironmentVariable("DOTNET_TC_QuickJitForLoops", "1"),
                    new EnvironmentVariable("DOTNET_ReadyToRun", "0")));
        }
    }


    //
    // Benchmark 1: Devirtualize unknown virtual calls:
    //

    public IEnumerable<object> TestData()
    {
        // Test data for 'GuardedDevirtualization(ICollection<int>)'
        yield return new List<int>();
    }

    [Benchmark]
    [ArgumentsSource(nameof(TestData))]
    public void GuardedDevirtualization(ICollection<int> collection)
    {
        // a chain of unknown virtual calls...
        collection.Clear();
        collection.Add(1);
        collection.Add(2);
        collection.Add(3);
    }


    //
    // Benchmark 2: Allow inliner to be way more aggressive than usual
    //              for profiled call-sites:
    //

    [Benchmark]
    public StringBuilder ProfileDrivingInlining()
    {
        StringBuilder sb = new();
        for (int i = 0; i < 1000; i++)
            sb.Append("hi"); // see https://twitter.com/EgorBo/status/1451149444183990273
        return sb;
    }


    //
    // Benchmark 3: Reorder hot-cold blocks for better performance
    //

    [Benchmark]
    [Arguments(42)]
    public string HotColdBlockReordering(int a)
    {
        if (a == 1)
            return "a is 1";
        if (a == 2)
            return "a is 2";
        if (a == 3)
            return "a is 3";
        if (a == 4)
            return "a is 4";
        if (a == 5)
            return "a is 5";
        return "a is too big"; // this branch is always taken in this benchmark (a is 42)
    }
}
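
To reproduce these numbers, the usual BenchmarkDotNet workflow applies; this assumes the snippet above is the Program.cs of a console project:

# add the BenchmarkDotNet package once, then run in Release
dotnet add package BenchmarkDotNet
dotnet run -c Release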

Results:

| Method                  | Job          | Mean          | Error      | StdDev     |
|-------------------------|--------------|--------------:|-----------:|-----------:|
| GuardedDevirtualization | Default mode | 5.7448 ns     | 0.0020 ns  | 0.0017 ns  |
| GuardedDevirtualization | Dynamic PGO  | 3.2651 ns     | 0.0233 ns  | 0.0182 ns  |
| ProfileDrivingInlining  | Default mode | 3,538.2980 ns | 26.7256 ns | 23.6915 ns |
| ProfileDrivingInlining  | Dynamic PGO  | 2,167.8397 ns | 5.0619 ns  | 4.2269 ns  |
| HotColdBlockReordering  | Default mode | 1.5244 ns     | 0.0029 ns  | 0.0025 ns  |
| HotColdBlockReordering  | Dynamic PGO  | 0.0181 ns     | 0.0051 ns  | 0.0040 ns  |

How can I try it in production?

You only need to make sure the following environment variables are defined for the process that runs your program:

# Enable Dynamic PGO
export DOTNET_TieredPGO=1

# AOT (ReadyToRun) images aren't instrumented, so we need to disable them and collect
# PGO data for literally everything. This hurts startup time,
# but leads to higher performance after warm-up.
export DOTNET_ReadyToRun=0

# Full-fledged OSR will hopefully be enabled in .NET 7.0; for now, methods with loops
# bypass tier0 by default, but we need them in tier0 so they can be instrumented for PGO.
export DOTNET_TC_QuickJitForLoops=1

^ Linux/macOS; for Windows (PowerShell):

$env:DOTNET_TieredPGO=1
$env:DOTNET_ReadyToRun=0
$env:DOTNET_TC_QuickJitForLoops=1
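
For a quick experiment you can also set the variables inline for a single run (bash syntax; MyApp.dll is just a placeholder for your application):

DOTNET_TieredPGO=1 DOTNET_ReadyToRun=0 DOTNET_TC_QuickJitForLoops=1 dotnet MyApp.dll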

Community feedback on PGO in .NET 6.0

Please tag me (@EgorBo) on Twitter and I'll forward it to the team.

(community benchmark screenshots, including a YARP Proxy result)
mrange commented Nov 21, 2021

Hi. Great post and very interesting. Dynamic PGO is a feature I have been looking forward to.

Are there cases where we can expect performance regressions? I have a small loop that does seem to do worse with Dynamic PGO.

EgorBo commented Dec 2, 2021

@mrange Unfortunately, yes, there are.
Due to the lack of OSR in .NET 6.0, it's possible that with DOTNET_TC_QuickJitForLoops=1 some methods with hot loops will get stuck in tier0 forever.
We hope it won't be an issue in .NET 7.0 (update: the work is already happening: dotnet/runtime#61934)

@AndyAyersMS

> I have a small loop that does seem to do worse with Dynamic PGO.

If you can share this example, we'd be happy to take a look.

f2bo commented Feb 23, 2022

Hi @EgorBo,

Very interesting. Thank you. However, isn't the diff after recompilation for the Hot-cold block reordering section (copied below) backward?

void DoWork(int a)
{
    // E.g. __block_0_counter is smaller or even zero => DoWork1 is rarely (never) taken
    // and JIT re-orders DoWork2 with DoWork1:
-   if (a > 0)
+   if (a <= 0)
-       DoWork2();
+       DoWork1();
    else
-       DoWork1();
+       DoWork2();
}

And should really be:

void DoWork(int a)
{
    // E.g. __block_0_counter is smaller or even zero => DoWork1 is rarely (never) taken
    // and JIT re-orders DoWork2 with DoWork1:
-   if (a > 0)
+   if (a <= 0)
-      DoWork1();
+      DoWork2();
    else
-      DoWork2();
+      DoWork1();
}

@jerviscui

> We now poll that d for its underlying type every call of that method. Yes, it makes it slightly slower, but eventually it will be re-compiled to tier1 to something like this:

What does the end result look like in tier1? Is it like this?

void DisposeMe(IDisposable d)
{
+   if (d is MyType)           // E.g. Profile states that Dispose here is 'mostly' called on MyType.
+       ((MyType)d).Dispose(); // Direct call - can be inlined now!
+   else
        d.Dispose();           // a cold fallback, just in case
}

But the next image shows that it has a much larger code size.

EgorBo commented Aug 18, 2023

> But the next image shows that it has a much larger code size.

Why would the code size be smaller considering we added a type check and the devirtualized path?

@jerviscui

Ok, I got it.

In this picture, the left side shows tier1 when MyType::Dispose is empty and the right side shows tier1 when it is not empty.
https://user-images.githubusercontent.com/523221/126960839-6bc3b110-014a-4680-abd8-44c9e7e01765.png

Right?

EgorBo commented Aug 21, 2023

> Ok, I got it.
>
> In this picture, the left side shows tier1 when MyType::Dispose is empty and the right side shows tier1 when it is not empty. https://user-images.githubusercontent.com/523221/126960839-6bc3b110-014a-4680-abd8-44c9e7e01765.png
>
> Right?

No, it's empty in both cases, it's just that it's not devirtualized/inlined on the left.

@jerviscui

So, with PGO turned on it will be this:
(image)

This is with no PGO:
(image)
