In https://github.com/dotnet/performance, we have seen many microbenchmarks regress, and the cause is often Intel's JCC erratum. We therefore wanted to see whether we can mitigate it, or at least add a codegen switch that enables the mitigation for experiments and performance comparisons.
- As per an experiment conducted by Brian Robbins in JCC Erratum Impact on .NET Core · Issue #35730 · dotnet/runtime (github.com), the JCC erratum does not show a noticeable impact on real-world applications.
- This is an “Intel only” issue and has no impact on other processors. For example, see a benchmark validated by Aman in dotnet/runtime#102763 (comment), which shows no performance impact on AMD. So enabling the mitigation globally might not be an option unless we can detect during JITting that we are generating code for an Intel processor. It might not be viable for AOT unless we expose some flag stating that we are generating code for Intel.
- As such, the only thing we can do is provide a flag, available in release builds, that developers can try out while investigating performance issues. This is also captured by Bruce in RyuJIT: Implement mitigation for Intel JCC erratum · Issue #93243.
- Add Pseudo NOP instructions
- Branch tightening
- Loop alignment adjustment
The following discussion is under the assumption that we only mitigate the instructions that are present in hot blocks.
Let's first discuss the easy scenario, where we mitigate the following kinds of methods. Note that mitigating the instructions in such methods might also be less impactful, because they execute once or maybe a couple of times, so the perf penalty is not observable.
- All methods that have no loops, or that have loops we decide not to align, will be considered for mitigation.
- We can easily mitigate the region(s) of a method that end at a loop back edge, or that fall after the last loop ends.
Here are the steps to accomplish this:
- Add pseudo `NOP` instructions of `XXX` bytes before every instruction that is a JCC erratum candidate. The `XXX` size is determined by the encoding length of the instruction(s) involved. For example, if we have a pair of instructions, `add` (3 bytes) followed by `jcc` (5 bytes), and they qualify as being impacted by the JCC erratum (per the definition in the manual), we will add `3 + 5 = 8` bytes of `NOP` instructions before that pair, which covers all possibilities. We call these the "estimated NOP bytes". Below are 3 such possibilities:
; case 1 - alignment of 8 bytes
XX XX XX add
XX XX XX XX XX jcc
================ 32B boundary ================
; case 2 - alignment of 7 bytes
XX XX XX add
XX XX XX XX
================ 32B boundary ================
XX jcc
; case 3 - alignment of 0 bytes
================ 32B boundary ================
XX XX XX add
XX XX XX XX XX jcc
- Perform branch tightening based on the estimated NOP bytes.
- Similar to loop align adjustments, adjust the JCC mitigation "estimated NOP bytes" based on the new offset calculated after step 2.
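To make the three cases above concrete, here is a minimal Python sketch of how the NOP bytes for one candidate could be computed. It assumes the erratum condition is "the instruction (or fused pair) crosses or ends on a 32-byte boundary", and that the candidate is shorter than 32 bytes. The names `is_affected` and `nop_bytes_needed` are hypothetical and are not RyuJIT APIs.

```python
BOUNDARY = 32  # the JCC erratum is defined in terms of 32-byte boundaries


def is_affected(start, length):
    """True if the bytes [start, start + length) cross or end on a 32B boundary."""
    return start // BOUNDARY != (start + length) // BOUNDARY


def nop_bytes_needed(start, length):
    """NOP bytes to insert before the candidate so that it starts on the next boundary.

    Assumes length < BOUNDARY, so a candidate that starts on a boundary is never affected.
    """
    if is_affected(start, length):
        return BOUNDARY - start % BOUNDARY
    return 0


# The add (3 bytes) + jcc (5 bytes) pair from the example above.
ADD_LEN, JCC_LEN = 3, 5
PAIR_LEN = ADD_LEN + JCC_LEN

for name, start in [("case 1", 24), ("case 2", 25), ("case 3", 32)]:
    print(name, nop_bytes_needed(start, PAIR_LEN))
# case 1 8
# case 2 7
# case 3 0
```

Under these assumptions the padding never exceeds the length of the candidate, which is why the encoding length works as the upper-bound "estimated NOP bytes".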
The methods most affected by the mitigation are the ones that contain loop code. The steps for them would look roughly like this:
- Add NOP instructions:
  - Add pseudo `NOP` instructions of `XXX` bytes before every instruction that is a JCC erratum candidate.
  - Add pseudo `NOP` instructions of `XXX` bytes for loops, the way we do currently.
- Perform branch tightening
- Perform a modified loop alignment adjustment that takes the JCC erratum adjustment into account:
for (loop_to_align in list_of_loops) {
    total_jcc_erratums = get_jcc_erratum_for_loop(loop_to_align);
    retry = 0;
    // try for 5 times max, otherwise just align the loop and move on
    while (retry < 5) {
        for (curr_jcc_erratum = 0; curr_jcc_erratum < total_jcc_erratums; curr_jcc_erratum++) {
            // adjust the jcc erratum based on current alignment
            jcc_erratum_adjustment(curr_jcc_erratum);
        }
        alignment_bytes = calculate_alignment_needed(loop_to_align);
        retry++;
    }
}
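The retry loop can be illustrated with a small, self-contained Python sketch. This is only a model under stated assumptions: `alignment_needed` is a hypothetical policy (pad to the next 32-byte boundary only when the loop then spans fewer 32-byte chunks), the loop body is a list of `(size, is_jcc_candidate)` pairs, and the loop stops early once the alignment no longer changes. None of these names or policies come from RyuJIT.

```python
BOUNDARY = 32    # JCC erratum boundary in bytes
MAX_RETRIES = 5  # mirrors the "try for 5 times max" rule


def jcc_padding(start, length):
    """NOP bytes so that [start, start + length) neither crosses nor ends on a boundary."""
    if start // BOUNDARY != (start + length) // BOUNDARY:
        return BOUNDARY - start % BOUNDARY
    return 0


def alignment_needed(loop_start, loop_size):
    """Hypothetical policy: align only if the loop then spans fewer 32B chunks."""
    pad = (-loop_start) % BOUNDARY
    chunks_unaligned = (loop_start % BOUNDARY + loop_size + BOUNDARY - 1) // BOUNDARY
    chunks_aligned = (loop_size + BOUNDARY - 1) // BOUNDARY
    return pad if chunks_aligned < chunks_unaligned else 0


def place_candidates(start, body):
    """Lay out the loop body from `start`; return (per-item JCC padding, loop size)."""
    pads, offset = [], start
    for size, is_candidate in body:
        pad = jcc_padding(offset, size) if is_candidate else 0
        pads.append(pad)
        offset += pad + size
    return pads, offset - start


def layout_loop(loop_start, body):
    """Iterate JCC padding and loop alignment until stable or retries run out."""
    align_pad = 0
    for _ in range(MAX_RETRIES):
        jcc_pads, loop_size = place_candidates(loop_start + align_pad, body)
        new_align = alignment_needed(loop_start, loop_size)
        if new_align == align_pad:
            return align_pad, jcc_pads  # alignment is stable, so the layout is final
        align_pad = new_align
    # Out of retries: keep the last alignment and pad the candidates for it.
    jcc_pads, _ = place_candidates(loop_start + align_pad, body)
    return align_pad, jcc_pads


# A loop at offset 17: 7 bytes of code, an 8-byte JCC candidate, 2 more bytes.
print(layout_loop(17, [(7, False), (8, True), (2, False)]))  # (15, [0, 0, 0])
```

In this example the first pass pads the candidate by 8 bytes, which grows the loop enough that aligning it becomes worthwhile. Once the loop is aligned, the candidate no longer touches a boundary and its padding drops to 0, which is the kind of interaction the retry loop is meant to settle.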
Just as with loop alignment, when we allocate memory for JIT code from the runtime, we request the "estimated size" in bytes, i.e. the maximum possible size based on our estimation. After branch tightening and loop alignment, we might need less memory than we allocated, so we end up wasting some of the allocated memory. See the memory section of the loop alignment blog for details.
With Intel's JCC erratum mitigation, if there are `N` occurrences of instruction(s) in a method that are JCC erratum candidates and each needs at most `X` bytes, then we will allocate `N * X` extra bytes per method, potentially wasting some portion of it.
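As a rough worked example of that estimate, the Python sketch below compares the bytes reserved up front with the bytes actually used for padding. It is a simplification: it ignores branch tightening and loop alignment, assumes every candidate is shorter than 32 bytes, and the helper names are hypothetical.

```python
BOUNDARY = 32  # JCC erratum boundary in bytes


def nop_bytes_needed(start, length):
    """NOP bytes so that [start, start + length) neither crosses nor ends on a boundary."""
    if start // BOUNDARY != (start + length) // BOUNDARY:
        return BOUNDARY - start % BOUNDARY
    return 0


def estimated_vs_actual(candidates):
    """candidates: (offset before any padding, length) pairs, sorted by offset.

    Returns (bytes reserved up front, bytes actually used for padding).
    """
    estimated = sum(length for _, length in candidates)  # N * X when all lengths equal X
    shift = 0   # padding inserted so far moves every later candidate
    actual = 0
    for offset, length in candidates:
        pad = nop_bytes_needed(offset + shift, length)
        shift += pad
        actual += pad
    return estimated, actual


# N = 3 candidates, each X = 8 bytes long.
estimated, actual = estimated_vs_actual([(24, 8), (60, 8), (100, 8)])
print(estimated, actual, estimated - actual)  # 24 8 16
```

Here only the first candidate needs padding, so 16 of the 24 reserved bytes go unused.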