travisdowns/split-forwarding.md Secret

## split-forwarding.md

      
A quick check of split forwarding scenarios, as a follow-up to this Twitter thread and the associated blog post by Paul Khuong.

These tests are run from uarch-bench using:

    ./uarch-bench.sh --test-name=studies/forwarding/*

First, the results, then a bit of discussion.

| Benchmark (cycles) | ICL i5-1035 | SKL i7-6700HQ | Zen2 EPYC 7262 | HSW i7-4770 | SNB E5-1620 |
|:----------------------------|------:|------:|------:|------:|------:|
| fw_write_read               |  0.58 |  4.43 |  1.00 |  5.25 |  5.55 |
| fw_write_readx              |  5.00 |  5.00 |  8.01 |  6.00 |  7.37 |
| fw_split_write_read         | 17.01 | 14.00 | 21.04 | 14.90 | 17.00 |
| fw_split_write_read_chained | 22.02 | 19.17 | 24.53 | 18.18 | 20.17 |
| fw_write_split_read         |  7.28 |  6.31 |  9.01 |  6.27 |  7.08 |
| fw_write_split_read_both    |  7.32 |  7.46 | 10.01 |  8.00 |  8.04 |


Each test is some kind of dense (back-to-back) store forwarding scenario, i.e., a write to memory followed by a read of the same location, mostly of 16-byte values. In the tests with split in the name, the corresponding access is performed with 8-byte general purpose register reads or writes rather than a single 16-byte vector access. Details of each test can be found at the end.

## Findings

Here is a summary of the key findings.

### Split Writes

All tested CPUs perform poorly with split writes. In particular, all CPUs have some kind of stall when small writes are followed by a wider read that spans more than one prior write, and this affects performance in the throughput domain as well as the latency domain. This is unlike any other scenario. That is, split writes followed by a wide read cannot execute faster than ~14 cycles on any CPU, even when the iterations are independent. This probably indicates that such wide reads have to wait until the earlier stores drain from the store buffer before they can perform their read, and that this somehow blocks the subsequent reads and/or writes in a way that affects throughput.

Intel is generally faster here, progressing from 17 cycles in Sandy Bridge to 14 cycles in Skylake, although Ice Lake is back at 17 cycles. Zen2 is at 21 cycles.

If you do actually chain the results, e.g., by moving the xmm register (the read destination) into the scalar rax register (the write source), performance degrades only slightly (3-5 cycles, and the movq itself takes up to 3 cycles), indicating that the performance is largely bottlenecked on a throughput factor.
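
The size of that degradation can be read straight off the results table. As a quick sanity check, the Python sketch below subtracts the unchained result from the chained one for each CPU. The dictionary and function names are mine, chosen for illustration, and the numbers are copied from the fw_split_write_read and fw_split_write_read_chained rows.

```python
# Cycle counts copied from the results table
# (fw_split_write_read and fw_split_write_read_chained rows).
UNCHAINED = {"ICL": 17.01, "SKL": 14.00, "Zen2": 21.04, "HSW": 14.90, "SNB": 17.00}
CHAINED = {"ICL": 22.02, "SKL": 19.17, "Zen2": 24.53, "HSW": 18.18, "SNB": 20.17}


def chaining_cost(unchained, chained):
    """Return the extra cycles added by the dependency chain, per CPU."""
    return {cpu: round(chained[cpu] - unchained[cpu], 2) for cpu in unchained}


deltas = chaining_cost(UNCHAINED, CHAINED)
for cpu, extra in deltas.items():
    print(f"{cpu:>4}: +{extra:.2f} cycles")
```

Every delta lands between roughly 3 and 5 cycles, against baselines of 14 to 21 cycles.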
### Simple Store Forwarding

These are the simplest store forwarding cases, fw_write_read and fw_write_readx, where a GP or vector register is written and then immediately read (with the same size).
The most interesting case is the Zen2 result of 1 cycle. This indicates memory renaming, at least for this restricted scenario. That is, it executes a dependent store and load pair in 1 cycle, indicating that at least one of the pair has zero effective latency. This feature was hinted at in the AMD optimization manuals for Zen 1, and is described in detail on WikiChip for Zen 2, but other than that I am not aware of any other information:

> Dependencies on the RSP arise from the side effect of decrementing and incrementing the stack pointer. A stack operation can not proceed until the previous one updated the register. The SSO lifts these adjustments into the front end, calculating an offset which falls and rises with every PUSH and POP, and turns these instructions into stores and loads with RSP + offset addressing. The stack tracker records PUSH and POP instructions and their offset in a table. The memfile records stores and their destination, given by base, index, and displacement since linear or physical addresses are still unknown. They remain on file until the instruction retires. A temporary register is assigned to each store. When the store is later executed, the data is copied into this register (possibly by mapping it to the physical register backing the source register?) as well as being sent to the store queue. Loads are compared to recorded stores. A load predicted to match a previous store is modified to speculatively read the data from the store's temporary register. This is resolved in the integer or FP rename unit, potentially as a zero latency register-register move. The load is also sent to the LS unit to verify the prediction, and if incorrect, replay the instructions depending on the load with the correct data. It should be noted that several loads and stores can be dispatched in one cycle and this optimization is applied to all of them.

I separately tested a variant of the fw_write_read test, called fw_write_read_rcx, which uses rcx rather than rsp as the base register. It also ran in one cycle on Zen2, so involving the rsp register isn't necessary for renaming.
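
To make the quoted memfile description a bit more concrete, here is a deliberately tiny Python model of the matching step: stores are recorded by their symbolic address (base, index, displacement), and a later load with the same symbolic address is predicted to read the store's temporary register. The class, method, and register names (`Memfile`, `record_store`, `predict_load`, `tmp0`, `tmp1`) are hypothetical and exist only for this sketch. It is not a description of the actual Zen 2 hardware: capacity limits, retirement, changes to the base register, and the verification step in the load/store unit are all left out.

```python
class Memfile:
    """Toy model of a memfile: tracks stores by symbolic address, not by value."""

    def __init__(self):
        # (base, index, displacement) -> temporary register holding the store data
        self._stores = {}

    def record_store(self, base, index, disp, temp_reg):
        """Remember that a store to [base + index + disp] left its data in temp_reg."""
        self._stores[(base, index, disp)] = temp_reg

    def predict_load(self, base, index, disp):
        """Return the temp register to read from, or None if no store matches."""
        return self._stores.get((base, index, disp))


memfile = Memfile()

# mov [rsp], rax  followed by  mov rax, [rsp]
memfile.record_store("rsp", None, 0, "tmp0")
hit_rsp = memfile.predict_load("rsp", None, 0)

# The same pattern with rcx as the base register (the fw_write_read_rcx variant)
memfile.record_store("rcx", None, 0, "tmp1")
hit_rcx = memfile.predict_load("rcx", None, 0)

# A load from a different displacement has no matching entry
miss = memfile.predict_load("rsp", None, 8)

print(hit_rsp, hit_rcx, miss)  # prints: tmp0 tmp1 None
```

Because the lookup key in this model is just the addressing expression, nothing in it is specific to rsp, which is consistent with the rcx variant also running in one cycle.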
This renaming apparently doesn't work for vector loads/stores on AMD, and Intel up to Skylake doesn't show anything like it for either vector or scalar accesses (the Ice Lake result of 0.58 cycles for fw_write_read suggests that it does rename scalar accesses).

For vector store forwarding, recent Intel (5, 6 or 7.x cycles on SKL, HSW and SNB respectively) is faster than AMD (8 cycles), but this may depend on details of the test.

### Remaining

The remaining tests don't show anything remarkable.
All CPUs can forward quickly in the split_read scenario where a wide write is followed by a narrow load, although both Zen2 and Ice Lake no longer perform memory renaming in this case.
## Test Details

Details of each test follow.

### fw_write_read

    mov [rsp], rax          ; 8-byte GP store
    mov rax, [rsp]          ; 8-byte GP load of the value just stored
Straightforward write and read of an 8-byte value, no splits.

### fw_write_readx

    vmovdqa [rsp], xmm0     ; 16-byte vector store
    vmovdqa xmm0, [rsp]     ; 16-byte vector load of the value just stored
Straightforward write and read of a 16-byte value using an xmm register, no splits.

### fw_split_write_read

    mov [rsp], rax          ; store the low 8 bytes
    mov [rsp + 8], rax      ; store the high 8 bytes
    vmovdqa xmm0, [rsp]     ; 16-byte load spanning both stores
Split write, unsplit read. The 16-byte value is written by two 8-byte GP writes, and read by a single xmm read. In this test, there is no dependency between the written value and the read value, so there is no carried dependency chain and this is a throughput test.

### fw_split_write_read_chained

    mov [rsp], rax          ; store the low 8 bytes
    mov [rsp + 8], rax      ; store the high 8 bytes
    vmovdqa xmm0, [rsp]     ; 16-byte load spanning both stores
    vmovq rax, xmm0         ; feed the loaded value into the next iteration's stores
Split write, unsplit read. The 16-byte value is written by two 8-byte GP writes, and read by a single xmm read. Then, the xmm0 value is moved to rax, so that the next stored value depends on the read value. Unlike the prior test, this means there is a carried dependency chain and so this is a latency test.

### fw_write_split_read

    vmovdqa [rsp], xmm0     ; 16-byte vector store
    mov     rax, [rsp]      ; 8-byte load of the low half
    movq    xmm0, rax       ; feed the loaded value into the next iteration's store
Unsplit write, split read. The 16-byte value is written by a single xmm write, then its low half is read by a single 8-byte GP read. The read is chained to the next write with a movq instruction, so this is a latency test.

### fw_write_split_read_both

    vmovdqa [rsp], xmm0     ; 16-byte vector store
    mov     rax, [rsp]      ; 8-byte load of the low half
    add     rax, [rsp + 8]  ; 8-byte load of the high half, added to the low half
    movq    xmm0, rax       ; feed the sum into the next iteration's store
Same as the prior test, fw_write_split_read, except that both halves of the 16-byte value are read (the second half is added to the first). Generally I expect this to perform the same, plus one cycle (because the add lengthens the latency chain by one cycle); if that's not the case, something interesting happened.
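
The "plus one cycle" expectation can be checked against the results table with a few lines of Python. This is only a sketch: the names are mine, the numbers are copied from the fw_write_split_read and fw_write_split_read_both rows, and the one-cycle figure is the expectation stated above rather than a measured add latency.

```python
# Cycle counts copied from the results table
# (fw_write_split_read and fw_write_split_read_both rows).
ONE_HALF = {"ICL": 7.28, "SKL": 6.31, "Zen2": 9.01, "HSW": 6.27, "SNB": 7.08}
BOTH_HALVES = {"ICL": 7.32, "SKL": 7.46, "Zen2": 10.01, "HSW": 8.00, "SNB": 8.04}

EXPECTED_EXTRA = 1.0  # assumed: one cycle for the add on the dependency chain


def deviation_from_expected(one_half, both_halves, expected=EXPECTED_EXTRA):
    """Return (measured extra cycles, difference from the expected extra) per CPU."""
    result = {}
    for cpu in one_half:
        extra = round(both_halves[cpu] - one_half[cpu], 2)
        result[cpu] = (extra, round(extra - expected, 2))
    return result


results = deviation_from_expected(ONE_HALF, BOTH_HALVES)
for cpu, (extra, off) in results.items():
    print(f"{cpu:>4}: +{extra:.2f} cycles ({off:+.2f} vs. expected)")
```

With the table's numbers, Zen2, SNB and SKL land within about 0.15 cycles of the expectation, while ICL adds almost nothing (0.04) and HSW adds more (1.73).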
## Thanks

Thanks to Daniel Lemire who provided 3 out of 4 of the machines used for testing.