@kunalspathak
Last active June 3, 2024 02:12

Problem Statement

In https://github.com/dotnet/performance, we have seen many MicroBenchmarks regress, and the cause is often Intel's JCC erratum. As such, we wanted to see if we can mitigate it, or add a codegen switch to mitigate it, for experimentation and performance comparison.

Observation

  • As per an experiment conducted by Brian Robbins in JCC Erratum Impact on .NET Core · Issue #35730 · dotnet/runtime (github.com), the JCC erratum has little impact on real-world applications.
  • This is an Intel-only issue and has no impact on other processors. For example, see the benchmark validated by Aman in dotnet/runtime#102763 (comment), which shows no performance impact on AMD. So, enabling this globally might not be an option unless we can determine during JITting that we are generating code for an Intel processor. This might not be viable for AOT unless we expose a flag stating that we are targeting Intel.
  • As such, the only thing we can do is make a flag available in release builds for developers to try out while investigating performance issues, which is also captured by Bruce in RyuJIT: Implement mitigation for Intel JCC erratum · Issue #93243.

Prerequisite

  • Add Pseudo NOP instructions
  • Branch tightening
  • Loop alignment adjustment

Possible design

The following discussion is under the assumption that we only mitigate the instructions that are present in hot blocks.

Mitigation w/o loop code

Let's first discuss the easy scenario: mitigating the following kinds of methods. Note that mitigating the instructions in such methods might also be less impactful, because they execute once or maybe a couple of times, so the perf penalty is not observable.

  • All methods that have no loops, or that have loops we decide not to align, will be considered for mitigation.
  • We can easily mitigate the region(s) of a method that end at a loop back edge, or that fall after the last loop ends.

Here are the steps to accomplish this:

  1. Add pseudo NOP instructions of XXX bytes before every instruction sequence that is a JCC erratum candidate. The size XXX is determined by the encoding length of the instruction(s) involved. For example, if we have a pair of instructions, add (3 bytes) followed by jcc (5 bytes), that qualifies as impacted by the JCC erratum (by the definition in the manual), we will add 3+5 = 8 bytes of NOP instructions before that pair, taking into account all possibilities. We call these the "estimated NOP bytes". See below 3 such possibilities:
; case 1 - alignment of 8 bytes

XX XX XX        add
XX XX XX XX XX  jcc
================ 32B boundary ================ 
; case 2 - alignment of 7 bytes

XX XX XX     add
XX XX XX XX 
================ 32B boundary ================ 
XX           jcc
; case 3 - alignment of 0 bytes

================ 32B boundary ================ 
XX XX XX        add
XX XX XX XX XX  jcc
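The three cases above can be checked programmatically. Below is a minimal sketch (Python; the helper names are hypothetical, not RyuJIT's) of deciding whether a fused add+jcc pair at a given code offset crosses or ends on a 32-byte boundary, and the smallest NOP padding that moves it clear:

```python
FUSED_PAIR_SIZE = 3 + 5  # add (3 bytes) + jcc (5 bytes), per the example above
BOUNDARY = 32

def crosses_or_ends_on_boundary(offset, size, boundary=BOUNDARY):
    """True if the bytes [offset, offset+size) cross a 32B boundary or end
    exactly on one. Per Intel's erratum document, such jump sequences are
    the mitigation candidates."""
    start_block = offset // boundary
    end_block = (offset + size - 1) // boundary
    # Affected if the last byte lands in a different 32B block than the
    # first (crosses), or if the sequence ends exactly on a boundary.
    return start_block != end_block or (offset + size) % boundary == 0

def padding_needed(offset, size, boundary=BOUNDARY):
    """Smallest NOP padding that shifts the pair clear of the boundary;
    never more than `size` bytes, matching the "estimated NOP bytes"
    upper bound in the text."""
    for pad in range(size + 1):
        if not crosses_or_ends_on_boundary(offset + pad, size, boundary):
            return pad
    return size  # unreachable for size <= 32; kept as a safe default
```

For case 1 (the pair ends exactly on the boundary, offset 24) the full 8 bytes are needed; case 2 (offset 25, straddling the boundary) needs 7; case 3 (offset 32, freshly past the boundary) needs 0.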
  2. Perform branch tightening based on the estimated NOP bytes.
  3. Similar to loop alignment adjustments, adjust the JCC mitigation "estimated NOP bytes" based on the new offsets calculated after step 2.
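Branch tightening can be illustrated as a small fixpoint computation. The sketch below (Python; the instruction-list layout is invented for illustration, and sizes follow common x86 encodings: 2 bytes for a rel8 jcc, 6 bytes for a rel32 jcc) starts pessimistically with long encodings and shrinks any branch whose displacement fits in a signed byte:

```python
def jcc_size(disp):
    """Encoded size of a conditional jump: 2 bytes (opcode + rel8) if the
    signed displacement fits in a byte, else 6 bytes (0F 8x + rel32)."""
    return 2 if -128 <= disp <= 127 else 6

def tighten(insns):
    """insns: list of ('insn', byte_size) or ('jcc', target_index).
    Start every jcc at its long form and repeatedly shrink; shrinking one
    branch can pull another target into rel8 range, so iterate to a
    fixpoint (sizes only decrease, so this terminates)."""
    sizes = [6 if kind == 'jcc' else arg for kind, arg in insns]
    changed = True
    while changed:
        changed = False
        # prefix sums give the current offset of each instruction
        offsets = [0]
        for s in sizes:
            offsets.append(offsets[-1] + s)
        for i, (kind, arg) in enumerate(insns):
            if kind == 'jcc':
                # displacement is relative to the end of the jump
                disp = offsets[arg] - (offsets[i] + sizes[i])
                new_size = jcc_size(disp)
                if new_size < sizes[i]:
                    sizes[i] = new_size
                    changed = True
    return sizes
```

A branch jumping over 100 bytes of code shrinks from 6 to 2 bytes; one jumping over 200 bytes keeps the long form.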

Mitigation w/ loop code

The methods most impacted by the mitigation are the ones that contain loop code. The steps for them would roughly look like this:

  1. Add NOP instructions:
    1. Add pseudo NOP instructions of XXX bytes before every instruction sequence that is a JCC erratum candidate.
    2. Add pseudo NOP instructions of XXX bytes for loops, the way we do currently.
  2. Perform branch tightening
  3. Perform modified loop alignment adjustment, taking into account jcc erratum adjustment
for (loop_to_align in list_of_loops) {
  total_jcc_erratums = get_jcc_erratum_for_loop(loop_to_align);

  retry = 0;
  prev_alignment_bytes = -1;
  // try 5 times max; otherwise just align the loop and move on
  while (retry < 5) {
    for (curr_jcc_erratum = 0; curr_jcc_erratum < total_jcc_erratums; curr_jcc_erratum++) {
        // adjust the jcc erratum padding based on current alignment
        jcc_erratum_adjustment(curr_jcc_erratum);
    }
    alignment_bytes = calculate_alignment_needed(loop_to_align);
    // stop once the required alignment stabilizes
    if (alignment_bytes == prev_alignment_bytes) {
        break;
    }
    prev_alignment_bytes = alignment_bytes;
    retry++;
  }
}
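In runnable form, the retry loop above could look like this (Python; the adjustment and alignment callbacks are stand-ins for the real JIT logic, and the names are hypothetical):

```python
MAX_RETRIES = 5

def align_loop(jcc_candidates, adjust_candidate, alignment_needed):
    """Alternate between adjusting JCC-erratum padding and recomputing the
    loop's alignment padding until the result stabilizes, or give up after
    MAX_RETRIES attempts and use the last computed value."""
    prev_alignment = None
    for _ in range(MAX_RETRIES):
        for candidate in jcc_candidates:
            adjust_candidate(candidate)   # shift padding for current offsets
        alignment = alignment_needed()    # recompute loop-head padding
        if alignment == prev_alignment:   # fixpoint: offsets are stable
            break
        prev_alignment = alignment
    return prev_alignment
```

Capping the iterations bounds throughput cost: each pass can move offsets, which can change the erratum padding, which can change the alignment again, so an unbounded loop is not guaranteed to converge quickly.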

Cost

Just like we did for loop alignment, when we allocate memory for JIT code from the runtime, we request the "estimated size" bytes, i.e. the maximum possible bytes based on our estimation. During branch tightening and loop alignment, we might end up needing less memory than we allocated, thus wasting some of the allocated memory. See the memory section of the loop alignment blog for details.

With Intel's JCC erratum mitigation, if there are N occurrences of instruction(s) in a method that are JCC erratum candidates and each needs at most X bytes of padding, then we will allocate N * X extra bytes per method, potentially wasting some portion of it.
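As a back-of-the-envelope check (Python; the numbers below are made up for illustration), the pessimistic allocation and the eventual waste look like:

```python
def extra_allocation(num_candidates, max_nop_bytes):
    """Pessimistic upper bound requested from the runtime: N * X bytes."""
    return num_candidates * max_nop_bytes

def wasted_bytes(num_candidates, max_nop_bytes, actual_paddings):
    """Allocation minus the padding actually emitted once branch tightening
    and alignment adjustments settle."""
    return extra_allocation(num_candidates, max_nop_bytes) - sum(actual_paddings)

# e.g. 4 add+jcc pairs at 8 bytes each => 32 bytes allocated; if only
# 8 + 7 + 0 + 3 = 18 padding bytes are actually emitted, 14 are wasted.
```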

References

https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
