@kunalspathak
Last active June 3, 2024 02:12

Problem Statement

In https://github.com/dotnet/performance, we have seen many MicroBenchmarks regress, and the cause is often Intel's JCC erratum. As such, we wanted to see if we can mitigate it, or add a codegen switch to mitigate it, for experimentation and performance comparison.

Observation

  • As per an experiment conducted by Brian Robbins in JCC Erratum Impact on .NET Core · Issue #35730 · dotnet/runtime (github.com), the JCC erratum has little impact on real-world applications.
  • This is an Intel-only issue and has no impact on other processors. For example, see the benchmark validated by Aman in dotnet/runtime#102763 (comment), which shows no performance impact on AMD. So, enabling this globally might not be an option unless we can determine during JITting that we are generating code for an Intel processor. This might not be viable for AOT unless we expose a flag stating that we are targeting Intel.
  • As such, the only thing we can do is make a flag available in release builds for developers to try out while investigating performance issues, which is also captured by Bruce in RyuJIT: Implement mitigation for Intel JCC erratum · Issue #93243.

Prerequisite

  • Add Pseudo NOP instructions
  • Branch tightening
  • Loop alignment adjustment

Possible design

The following discussion is under the assumption that we only mitigate the instructions that are present in hot blocks.

Mitigation w/o loop code

Let's first discuss the easy scenario: mitigating the following kinds of methods. Note that mitigating the instructions in such methods might also be less impactful, because they execute once or maybe a couple of times, so the perf penalty is not observable.

  • All methods that have no loops, or that have loops we decide not to align, will be considered for mitigation.
  • We can easily mitigate the region(s) of a method that end at a loop back edge, or that fall after the last loop ends.

Here are the steps to accomplish this:

  1. Add pseudo NOP instructions of XXX bytes before every instruction sequence that is a JCC erratum candidate. The size XXX is determined by the encoding length of the instruction(s) involved. For example, if we have a pair of instructions, add (3 bytes) followed by jcc (5 bytes), that qualifies as impacted by the JCC erratum (by the definition in the manual), we will add 3+5 = 8 bytes of NOP instructions before that pair, taking into account all possibilities. We call these the "estimated NOP bytes". See below 3 such possibilities:
; case 1 - alignment of 8 bytes

XX XX XX        add
XX XX XX XX XX  jcc
================ 32B boundary ================ 
; case 2 - alignment of 7 bytes

XX XX XX     add
XX XX XX XX 
================ 32B boundary ================ 
XX           jcc
; case 3 - alignment of 0 bytes

================ 32B boundary ================ 
XX XX XX        add
XX XX XX XX XX  jcc
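The three cases above can be checked programmatically. Below is a minimal sketch (Python; the helper names are hypothetical, not RyuJIT's) of deciding whether a fused add+jcc pair at a given code offset crosses or ends on a 32-byte boundary, and the smallest NOP padding that moves it clear:

```python
FUSED_PAIR_SIZE = 3 + 5  # add (3 bytes) + jcc (5 bytes), per the example above
BOUNDARY = 32

def crosses_or_ends_on_boundary(offset, size, boundary=BOUNDARY):
    """True if the bytes [offset, offset+size) cross a 32B boundary or end
    exactly on one. Per Intel's erratum document, such jump sequences are
    the mitigation candidates."""
    start_block = offset // boundary
    end_block = (offset + size - 1) // boundary
    # Affected if the last byte lands in a different 32B block than the
    # first (crosses), or if the sequence ends exactly on a boundary.
    return start_block != end_block or (offset + size) % boundary == 0

def padding_needed(offset, size, boundary=BOUNDARY):
    """Smallest NOP padding that shifts the pair clear of the boundary;
    never more than `size` bytes, matching the "estimated NOP bytes"
    upper bound in the text."""
    for pad in range(size + 1):
        if not crosses_or_ends_on_boundary(offset + pad, size, boundary):
            return pad
    return size  # unreachable for size <= 32; kept as a safe default
```

For case 1 (the pair ends exactly on the boundary, offset 24) the full 8 bytes are needed; case 2 (offset 25, straddling the boundary) needs 7; case 3 (offset 32, freshly past the boundary) needs 0.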
  2. Perform branch tightening based on the estimated NOP bytes.
  3. Similar to loop alignment adjustments, adjust the JCC mitigation "estimated NOP bytes" based on the new offsets calculated after step 2.
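Branch tightening can be illustrated as a small fixpoint computation. The sketch below (Python; the instruction-list layout is invented for illustration, and sizes follow common x86 encodings: 2 bytes for a rel8 jcc, 6 bytes for a rel32 jcc) starts pessimistically with long encodings and shrinks any branch whose displacement fits in a signed byte:

```python
def jcc_size(disp):
    """Encoded size of a conditional jump: 2 bytes (opcode + rel8) if the
    signed displacement fits in a byte, else 6 bytes (0F 8x + rel32)."""
    return 2 if -128 <= disp <= 127 else 6

def tighten(insns):
    """insns: list of ('insn', byte_size) or ('jcc', target_index).
    Start every jcc at its long form and repeatedly shrink; shrinking one
    branch can pull another target into rel8 range, so iterate to a
    fixpoint (sizes only decrease, so this terminates)."""
    sizes = [6 if kind == 'jcc' else arg for kind, arg in insns]
    changed = True
    while changed:
        changed = False
        # prefix sums give the current offset of each instruction
        offsets = [0]
        for s in sizes:
            offsets.append(offsets[-1] + s)
        for i, (kind, arg) in enumerate(insns):
            if kind == 'jcc':
                # displacement is relative to the end of the jump
                disp = offsets[arg] - (offsets[i] + sizes[i])
                new_size = jcc_size(disp)
                if new_size < sizes[i]:
                    sizes[i] = new_size
                    changed = True
    return sizes
```

A branch jumping over 100 bytes of code shrinks from 6 to 2 bytes; one jumping over 200 bytes keeps the long form.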

Mitigation w/ loop code

The methods most impacted by the mitigation are the ones that contain loop code. The steps for them would roughly look like this:

  1. Add NOP instructions:
    1. Add pseudo NOP instructions of XXX bytes before every instruction sequence that is a JCC erratum candidate.
    2. Add pseudo NOP instructions of XXX bytes for loops, the way we do currently.
  2. Perform branch tightening
  3. Perform modified loop alignment adjustment, taking into account jcc erratum adjustment
for (loop_to_align in list_of_loops) {
  total_jcc_erratums = get_jcc_erratum_for_loop(loop_to_align);

  retry = 0;
  prev_alignment_bytes = -1;
  // try 5 times max; otherwise just align the loop and move on
  while (retry < 5) {
    for (curr_jcc_erratum = 0; curr_jcc_erratum < total_jcc_erratums; curr_jcc_erratum++) {
        // adjust the jcc erratum padding based on current alignment
        jcc_erratum_adjustment(curr_jcc_erratum);
    }
    alignment_bytes = calculate_alignment_needed(loop_to_align);
    // stop once the required alignment stabilizes
    if (alignment_bytes == prev_alignment_bytes) {
        break;
    }
    prev_alignment_bytes = alignment_bytes;
    retry++;
  }
}
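In runnable form, the retry loop above could look like this (Python; the adjustment and alignment callbacks are stand-ins for the real JIT logic, and the names are hypothetical):

```python
MAX_RETRIES = 5

def align_loop(jcc_candidates, adjust_candidate, alignment_needed):
    """Alternate between adjusting JCC-erratum padding and recomputing the
    loop's alignment padding until the result stabilizes, or give up after
    MAX_RETRIES attempts and use the last computed value."""
    prev_alignment = None
    for _ in range(MAX_RETRIES):
        for candidate in jcc_candidates:
            adjust_candidate(candidate)   # shift padding for current offsets
        alignment = alignment_needed()    # recompute loop-head padding
        if alignment == prev_alignment:   # fixpoint: offsets are stable
            break
        prev_alignment = alignment
    return prev_alignment
```

Capping the iterations bounds throughput cost: each pass can move offsets, which can change the erratum padding, which can change the alignment again, so an unbounded loop is not guaranteed to converge quickly.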

Cost

Just like we did for loop alignment, when we allocate memory for JIT code from the runtime, we request the "estimated size" bytes, i.e. the maximum possible bytes based on our estimation. During branch tightening and loop alignment, we might end up needing less memory than we allocated, thus wasting some of the allocated memory. See the memory section of the loop alignment blog for details.

With Intel's JCC erratum mitigation, if there are N occurrences of instruction(s) in a method that are JCC erratum candidates and each needs at most X bytes of padding, then we will allocate N * X extra bytes per method, potentially wasting some portion of it.
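As a back-of-the-envelope check (Python; the numbers below are made up for illustration), the pessimistic allocation and the eventual waste look like:

```python
def extra_allocation(num_candidates, max_nop_bytes):
    """Pessimistic upper bound requested from the runtime: N * X bytes."""
    return num_candidates * max_nop_bytes

def wasted_bytes(num_candidates, max_nop_bytes, actual_paddings):
    """Allocation minus the padding actually emitted once branch tightening
    and alignment adjustments settle."""
    return extra_allocation(num_candidates, max_nop_bytes) - sum(actual_paddings)

# e.g. 4 add+jcc pairs at 8 bytes each => 32 bytes allocated; if only
# 8 + 7 + 0 + 3 = 18 padding bytes are actually emitted, 14 are wasted.
```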

References

https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
