A quick check of split forwarding scenarios, as a follow-up to this Twitter thread and associated blog post by Paul Khuong.
These tests are run from uarch-bench using:

```shell
./uarch-bench.sh --test-name=studies/forwarding/*
```
First, the results, then a bit of discussion.
| Benchmark | ICL i5-1035 | SKL i7-6700HQ | Zen2 EPYC 7262 | HSW i7-4770 | SNB E5-1620 |
|---|---|---|---|---|---|
| fw_write_read | 0.58 | 4.43 | 1.00 | 5.25 | 5.55 |
| fw_write_readx | 5.00 | 5.00 | 8.01 | 6.00 | 7.37 |
| fw_split_write_read | 17.01 | 14.00 | 21.04 | 14.90 | 17.00 |
| fw_split_write_read_chained | 22.02 | 19.17 | 24.53 | 18.18 | 20.17 |
| fw_write_split_read | 7.28 | 6.31 | 9.01 | 6.27 | 7.08 |
| fw_write_split_read_both | 7.32 | 7.46 | 10.01 | 8.00 | 8.04 |
Each test is some kind of dense (back-to-back) store-forwarding scenario, i.e., a write to memory followed by a read, mostly of 16-byte values. Tests with split in the name split the corresponding access into two accesses: two 8-byte general-purpose register reads or writes. Details of each test can be found at the end.
Here is a summary of the key findings.
All tested CPUs perform poorly with split writes. In particular, all CPUs have some kind of stall when small writes are followed by a wider read (one that reads more than one prior write), and this affects performance in the throughput domain as well as the latency domain, unlike any other scenario. That is, split writes followed by a wide read cannot execute faster than ~14 cycles on any CPU, even when the operations are independent. This probably indicates that such wide reads have to wait until the store buffer drains before they can execute, and that this somehow blocks subsequent reads and/or writes in a way that limits throughput.
Intel is generally faster here, progressing from 17 cycles in Sandy Bridge to 14 cycles in Skylake. Zen2 is at 21 cycles.
If you do actually chain the results, e.g., by moving the `xmm` register (write source) into the scalar `rax` (read source) register, performance degrades only slightly (3-5 cycles, and the `movq` itself takes up to 3 cycles), indicating that performance is largely bottlenecked by a throughput factor.
These are the simplest store-forwarding cases, `fw_write_read` and `fw_write_readx`, where a GP or vector register is written and then immediately read (with the same size).
The most interesting case is the Zen2 result of 1 cycle. This indicates that memory renaming is occurring, at least for this restricted scenario. That is, it executes a load and store pair in 1 cycle, indicating that at least one of the pair has zero effective latency. This feature was hinted at in the AMD optimization manuals for Zen 1, and is described in detail on wikichip for Zen 2 – but other than that I am not aware of any other information:
> Dependencies on the RSP arise from the side effect of decrementing and incrementing the stack pointer. A stack operation can not proceed until the previous one updated the register. The SSO lifts these adjustments into the front end, calculating an offset which falls and rises with every PUSH and POP, and turns these instructions into stores and loads with RSP + offset addressing. The stack tracker records PUSH and POP instructions and their offset in a table. The memfile records stores and their destination, given by base, index, and displacement since linear or physical addresses are still unknown. They remain on file until the instruction retires. A temporary register is assigned to each store. When the store is later executed, the data is copied into this register (possibly by mapping it to the physical register backing the source register?) as well as being sent to the store queue. Loads are compared to recorded stores. A load predicted to match a previous store is modified to speculatively read the data from the store's temporary register. This is resolved in the integer or FP rename unit, potentially as a zero latency register-register move. The load is also sent to the LS unit to verify the prediction, and if incorrect, replay the instructions depending on the load with the correct data. It should be noted that several loads and stores can be dispatched in one cycle and this optimization is applied to all of them.
I tested separately a variant of the `fw_write_read` test, called `fw_write_read_rcx`, which uses the `rcx` register as the base for renaming on Zen2, and it also worked in one cycle, so involving the `rsp` register isn't necessary.
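I haven't reproduced the exact source here, but presumably the `fw_write_read_rcx` loop body is just the `fw_write_read` body with `rcx` (pointing at some scratch memory) as the base register, something like:

```nasm
; hypothetical reconstruction -- see uarch-bench for the actual test source
mov [rcx], rax    ; rcx holds the address of a scratch location (not rsp)
mov rax, [rcx]
```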
This renaming apparently doesn't work for vector loads/stores on AMD, and Intel up to Skylake doesn't show anything like it for either vector or scalar accesses (though the Ice Lake result of 0.58 cycles for fw_write_read suggests ICL does perform scalar memory renaming).
For vector store forwarding, recent Intel (5, 6 or about 7.4 cycles on SKL, HSW and SNB respectively) is faster than AMD (8 cycles), but this may depend on details of the test.
The remaining tests don't show anything remarkable.
All CPUs can forward quickly in the `split_read` scenario where a wide write is followed by a narrow load, although both Zen2 and Ice Lake no longer perform memory renaming in this case.
Details of each test follow.
```nasm
mov [rsp], rax
mov rax, [rsp]
```

`fw_write_read`: Straightforward write and read of an 8-byte value, no splits.
```nasm
vmovdqa [rsp], xmm0
vmovdqa xmm0, [rsp]
```

`fw_write_readx`: Straightforward write and read of a 16-byte value using an `xmm` register, no splits.
```nasm
mov [rsp], rax
mov [rsp + 8], rax
vmovdqa xmm0, [rsp]
```

`fw_split_write_read`: Split write, unsplit read. The 16-byte value is written by two 8-byte GP writes and read by a single `xmm` read. There is no dependency between the written value and the read value, hence no carried dependency chain; this is a throughput test.
```nasm
mov [rsp], rax
mov [rsp + 8], rax
vmovdqa xmm0, [rsp]
vmovq rax, xmm0
```

`fw_split_write_read_chained`: Split write, unsplit read, chained. The 16-byte value is written by two 8-byte GP writes and read by a single `xmm` read. Then the `xmm0` value is moved to `rax`, so that the stored value depends on the read value. Unlike the prior test, there is a carried dependency chain, so this is a latency test.
```nasm
vmovdqa [rsp], xmm0
mov rax, [rsp]
movq xmm0, rax
```

`fw_write_split_read`: Unsplit write, split read. The 16-byte value is written by a single `xmm` write, then its low 8 bytes are read by a single GP read. The read is chained to the next write with a `movq` instruction, so this is a latency test.
```nasm
vmovdqa [rsp], xmm0
mov rax, [rsp]
add rax, [rsp + 8]
movq xmm0, rax
```

`fw_write_split_read_both`: Same as the prior test, `fw_write_split_read`, except that both halves of the 16-byte value are read (the second half is added to the first). I generally expect this to perform the same, plus one cycle (because the latency chain is lengthened by one cycle by the `add`), but if that's not the case something interesting happened.
Thanks to Daniel Lemire who provided 3 out of 4 of the machines used for testing.
Results from Piledriver:
I found this interesting because there were hints that this functionality may have existed in some form pre-Zen1 -- per the 15h model 30h BKDG (https://www.amd.com/system/files/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf), one of the changes from PD to SR was
There was no evidence of even a 3-store memfile here (it would have shown up in `fw_write_read`).