
A quick check of split forwarding scenarios, as a follow-up to this Twitter thread and the associated blog post by Paul Khuong.

These tests are run from uarch-bench using:

./ --test-name=studies/forwarding/*

First, the results, then a bit of discussion.

| Benchmark                   | ICL i5-1035 | SKL i7-6700HQ | Zen2 EPYC 7262 | HSW i7-4770 | SNB E5-1620 |
|-----------------------------|------------:|--------------:|---------------:|------------:|------------:|
| fw_write_read               |        0.58 |          4.43 |           1.00 |        5.25 |        5.55 |
| fw_write_readx              |        5.00 |          5.00 |           8.01 |        6.00 |        7.37 |
| fw_split_write_read         |       17.01 |         14.00 |          21.04 |       14.90 |       17.00 |
| fw_split_write_read_chained |       22.02 |         19.17 |          24.53 |       18.18 |       20.17 |
| fw_write_split_read         |        7.28 |          6.31 |           9.01 |        6.27 |        7.08 |
| fw_write_split_read_both    |        7.32 |          7.46 |          10.01 |        8.00 |        8.04 |

All values are in cycles.

Each test is some kind of dense (back-to-back) store-forwarding scenario, i.e., a write to memory followed by a read, mostly of 16-byte values. Tests whose names include split mean that the corresponding access is split into two accesses: two 8-byte general-purpose register reads or writes. Details of each test can be found at the end.


Here is a summary of the key findings.

Split Writes

All tested CPUs perform poorly with split writes. In particular, every CPU has some kind of stall when small writes are followed by a wider read (one that reads more than one prior write), and this affects performance in the throughput domain as well as the latency domain, unlike any other scenario. That is, split writes followed by a wide read cannot execute faster than ~14 cycles on any CPU, even when they are independent. This probably indicates that such wide reads have to wait until the store buffer drains before they can perform their read, and that this somehow blocks subsequent reads and/or writes in a way that affects throughput.

Intel is generally faster here, progressing from 17 cycles in Sandy Bridge to 14 cycles in Skylake. Zen2 is at 21 cycles.

If you do actually chain the results, e.g., by moving the xmm register (the write source) into the scalar rax (read source) register, performance degrades only slightly (3-5 cycles, and the movq itself takes up to 3 cycles), indicating that performance is largely bottlenecked on a throughput factor.

Simple Store Forwarding

These are the simplest store forwarding cases, fw_write_read and fw_write_readx, where a GP or vector register is written and then immediately read (with the same size).

The most interesting case is the Zen2 result of 1 cycle. This indicates that memory renaming is occurring, at least in this restricted scenario. That is, Zen2 executes a load and store pair in 1 cycle, indicating that at least one of the pair has zero effective latency. This feature was hinted at in the AMD optimization manuals for Zen 1, and is described in detail on wikichip for Zen 2, but beyond that I am not aware of any other public information:

Dependencies on the RSP arise from the side effect of decrementing and incrementing the stack pointer. A stack operation can not proceed until the previous one updated the register. The SSO lifts these adjustments into the front end, calculating an offset which falls and rises with every PUSH and POP, and turns these instructions into stores and loads with RSP + offset addressing. The stack tracker records PUSH and POP instructions and their offset in a table. The memfile records stores and their destination, given by base, index, and displacement since linear or physical addresses are still unknown. They remain on file until the instruction retires. A temporary register is assigned to each store. When the store is later executed, the data is copied into this register (possibly by mapping it to the physical register backing the source register?) as well as being sent to the store queue. Loads are compared to recorded stores. A load predicted to match a previous store is modified to speculatively read the data from the store's temporary register. This is resolved in the integer or FP rename unit, potentially as a zero latency register-register move. The load is also sent to the LS unit to verify the prediction, and if incorrect, replay the instructions depending on the load with the correct data. It should be noted that several loads and stores can be dispatched in one cycle and this optimization is applied to all of them.

I tested separately a variant of the fw_write_read test, called fw_write_read_rcx which uses the rcx register as the base for renaming on Zen2, and it also worked in one cycle, so involving the rsp register isn't necessary.

This renaming apparently doesn't work for vector loads/stores on AMD, and Intel up to Skylake doesn't show anything like it (stay tuned for ICL results) for either vector or scalar accesses.

For vector store forwarding, recent Intel (5, 6 or 7.x cycles on SKL, HSW and SNB respectively) is faster than AMD (8 cycles), but this may depend on details of the test.


The remaining tests don't show anything remarkable.

All CPUs can forward quickly in the split_read scenario where a wide write is followed by a narrow load, although both Zen2 and Ice Lake no longer perform memory renaming in this case.

Test Details

Details of each test follow.


    mov [rsp], rax
    mov rax, [rsp]

fw_write_read: a straightforward write and read of an 8-byte value, no splits.


    vmovdqa [rsp], xmm0
    vmovdqa xmm0, [rsp]

fw_write_readx: a straightforward write and read of a 16-byte value using an xmm register, no splits.


    mov [rsp], rax
    mov [rsp + 8], rax
    vmovdqa xmm0, [rsp]

fw_split_write_read: split write, unsplit read. The 16-byte value is written by two 8-byte GP writes and read by a single xmm read. In this test there is no dependency between the written value and the read value, so there is no carried dependency chain, making this a throughput test.


    mov [rsp], rax
    mov [rsp + 8], rax
    vmovdqa xmm0, [rsp]
    vmovq rax, xmm0

fw_split_write_read_chained: split write, unsplit read. The 16-byte value is written by two 8-byte GP writes and read by a single xmm read. Then the xmm0 value is moved to rax, so that the stored value depends on the read value. Unlike the prior test, this means there is a carried dependency chain, making this a latency test.


    vmovdqa [rsp], xmm0
    mov     rax, [rsp]
    movq    xmm0, rax

fw_write_split_read: unsplit write, split read. The 16-byte value is written by a single xmm write, then its low half is read by a single 8-byte GP read. The read is chained to the next write with a movq instruction, so this is a latency test.


    vmovdqa [rsp], xmm0
    mov     rax, [rsp]
    add     rax, [rsp + 8]
    movq    xmm0, rax

fw_write_split_read_both: same as the prior test, fw_write_split_read, except that both halves of the 16-byte value are read (the second half is added to the first). I generally expect this to perform the same plus one cycle (because the latency chain is lengthened by one cycle by the add); if that's not the case, something interesting happened.


Thanks to Daniel Lemire who provided 3 out of 4 of the machines used for testing.


vsrinivas commented Feb 5, 2020

Results from Piledriver:

I found this interesting because there were hints this functionality may have existed in some form pre-Zen1 -- per the 15h model 30 BKDG, one of the changes from PD (Piledriver) to SR (Steamroller) was

  • "Improved memfile, from last 3 stores to last 8 stores, and allow tracking of dependent stack operations."

There was no evidence of even a 3-store memfile here (it would have shown up in fw_write_read).

** Running group studies/forwarding : Forwarding scenarios **
                               Benchmark    Cycles     Nanos
                           fw_write_read      7.40      2.31
                          fw_write_readx     10.56      3.30
                     fw_split_write_read     40.15     12.54
             fw_split_write_read_chained     58.43     18.24
                     fw_write_split_read     18.23      5.69
                fw_write_split_read_both     20.42      6.38
Finished in 1840 ms (studies/forwarding)
Restored no_turbo state:
Reverting cpufreq governor to ondemand: SUCCESS


travisdowns commented Feb 5, 2020

Results from Piledriver: ...


That's interesting. It seems PD had a few interesting things but it was never very popular and so a lot of this stuff wasn't really investigated.

Interesting that this test doesn't show it; you'd think that if any test would, this one would. Perhaps there is a problem with all the loads and stores going on: this stresses the memory disambiguation predictors, etc., and it is hard to predict exactly the state of the pipeline, as there will be a lot of stores and loads in flight at once (although they can only successfully execute one at a time). Maybe the memfile entries get used somehow by earlier or later stores, and not the ones that execute. Or maybe there is some other limitation here.

Even in Zen/Zen2, while the memfile was hinted at by the optimization manual, I never really saw a description of it working, like "hey, look at this benchmark that sped up 2x", which I would have expected somewhere. Some tests that I would expect to show the effect still don't (although this one does on Zen2). So I guess there are some conditions for it to work that are often not met.
