%define COUNT 5                 ; 1=original test, 5=still in LSD, 50=not in LSD
%define DEC_IN_REP
;%define USE_ZERO_IDIOM

global main:function
main:
        mov rbp, rsp            ; remember the stack pointer; restored each outer pass
        mov eax, 3700000        ; outer iteration count
.oloop:
        mov ecx, 1000           ; inner iteration count
%ifdef USE_ZERO_IDIOM
        xor edi, edi            ; zero via the zeroing idiom
%else
        mov edi, 0              ; zero via a plain mov, no idiom
%endif
align 16
.iloop:
%rep COUNT
        mov dword [rbp], edi
        mov dword [rsp], edi
        sub rsp, 4
%ifdef DEC_IN_REP
        dec ecx                 ; sub ecx, 1 behaves identically here
%endif
%endrep
%ifndef DEC_IN_REP
        sub ecx, COUNT
%endif
        jg .iloop
        mov rsp, rbp            ; rewind the stack for the next outer pass
        dec eax
        jnz .oloop
        mov eax, 60             ; exit via the exit(2) syscall (status in edi, already 0)
        syscall
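For reference, a plausible way to build and measure this (my guess at the setup, assuming NASM, linking through gcc for the main entry point, and the file name bimodal1h.asm; the exact perf invocation isn't shown in the thread):

    nasm -f elf64 -o bimodal1h.o bimodal1h.asm
    gcc -o bimodal1h bimodal1h.o
    perf stat -e task-clock,instructions,cycles,resource_stalls.sb,lsd.uops ./bimodal1h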
With the above code, all tests run on a Core i7-3770S (Ivy Bridge).

COUNT=1, no USE_ZERO_IDIOM (original test) gives:

 Performance counter stats for './bimodal1h':

       2598.794560      task-clock:u (msec)       #    1.000 CPUs utilized
    18,570,413,195      instructions:u            #    1.70  insn per cycle
    10,940,768,758      cycles:u                  #    4.210 GHz
     7,039,562,895      resource_stalls.sb:u      # 2708.780 M/sec
    14,238,283,080      lsd.uops:u                # 5478.803 M/sec

       2.599108301 seconds time elapsed

COUNT=1, USE_ZERO_IDIOM (original test with the zeroing idiom) gives:

 Performance counter stats for './bimodal1h':

       1769.848558      task-clock:u (msec)       #    1.000 CPUs utilized
    18,522,312,392      instructions:u            #    2.49  insn per cycle
     7,442,949,963      cycles:u                  #    4.205 GHz
     3,530,810,123      resource_stalls.sb:u      # 1994.979 M/sec
    14,222,734,359      lsd.uops:u                # 8036.131 M/sec

       1.770161474 seconds time elapsed
COUNT=50, no USE_ZERO_IDIOM gives:

 Performance counter stats for './bimodal1h':

       2549.693958      task-clock:u (msec)       #    1.000 CPUs utilized
    14,944,413,164      instructions:u            #    1.39  insn per cycle
    10,733,404,877      cycles:u                  #    4.210 GHz
     6,850,766,643      resource_stalls.sb:u      # 2686.898 M/sec
            12,511      lsd.uops:u                #    0.005 M/sec

       2.550020402 seconds time elapsed

NOTE: no significant cycle difference versus the COUNT=1 no-zero-idiom test
(the ~1.9% delta is far smaller than the effect we're chasing here), even
though we're clearly no longer running out of the LSD.
COUNT=50, USE_ZERO_IDIOM gives:

 Performance counter stats for './bimodal1h':

       1760.131634      task-clock:u (msec)       #    1.000 CPUs utilized
    14,896,312,383      instructions:u            #    2.01  insn per cycle
     7,402,124,894      cycles:u                  #    4.205 GHz
     3,695,252,388      resource_stalls.sb:u      # 2099.418 M/sec
            12,503      lsd.uops:u                #    0.007 M/sec

       1.760444633 seconds time elapsed

NOTE: this confirms that we don't need to be running out of the LSD to get
the quicker execution times.
----
What definitely matters is whether we decrement ecx after every write or not.
For example, here's COUNT=50, no USE_ZERO_IDIOM, and also no DEC_IN_REP:

 Performance counter stats for './bimodal1h':

       1768.658769      task-clock:u (msec)       #    1.000 CPUs utilized
    11,318,412,364      instructions:u            #    1.52  insn per cycle
     7,436,418,026      cycles:u                  #    4.205 GHz
     4,625,030,005      resource_stalls.sb:u      # 2614.993 M/sec
            12,699      lsd.uops:u                #    0.007 M/sec

       1.769107479 seconds time elapsed

NOTE: we're still subtracting 4 from rsp on every write, but doing far fewer
decrements of ecx. This also fixes the problem.
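To make the DEC_IN_REP difference concrete, here is what the %rep block expands to for COUNT=2 in each configuration (mechanical expansion of the macro above):

    ; with DEC_IN_REP: the flag-writing decrement is interleaved between stores
        mov dword [rbp], edi
        mov dword [rsp], edi
        sub rsp, 4
        dec ecx
        mov dword [rbp], edi
        mov dword [rsp], edi
        sub rsp, 4
        dec ecx

    ; without DEC_IN_REP: one batched subtraction after all the stores
        mov dword [rbp], edi
        mov dword [rsp], edi
        sub rsp, 4
        mov dword [rbp], edi
        mov dword [rsp], edi
        sub rsp, 4
        sub ecx, 2

The slow behavior shows up only in the first form, where the decrement sits between the stores.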
Does the issue still happen if the source reg edi is not the same between the two stores?
E.g., with

        mov dword [rbp], edi
        mov dword [rbp], esi

and edi and esi initialized in all four combinations of xor vs. mov?
Yes: with both xored it's 2.0 cycles/iter, with both moved 2.97, with only edi xored 2.1, and with only esi xored 2.2. Similar fractional cycle counts show up in the original test if the stores' source operands are changed to an immediate 0.
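For concreteness, a sketch of what one of the four variants might look like (my reconstruction, not necessarily the exact code used; this is the both-xored case):

        xor edi, edi            ; or: mov edi, 0
        xor esi, esi            ; or: mov esi, 0
    align 16
    .iloop:
        mov dword [rbp], edi
        mov dword [rbp], esi
        sub rsp, 4
        dec ecx
        jg .iloop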
On Twitter, Fabian pointed out that in the slow variant of the original test, changing mov [rbp], edi to mov [rbp+rdi], edi speeds it up from 3 to 2 cycles/iter (to be clear: this is with mov edi, 0 prior to the loop).
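In other words (a sketch; note that since edi was zeroed with mov edi, 0, rdi is also zero, so [rbp+rdi] addresses the same byte as [rbp]):

        mov edi, 0               ; plain zero, not the idiom
        ...
    .iloop:
        mov dword [rbp+rdi], edi ; indexed form: ~2 cycles/iter
        mov dword [rsp], edi     ; (the plain [rbp] form ran at ~3)
        sub rsp, 4
        dec ecx
        jg .iloop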
How about if the writes are to the same location, or two fixed locations?
There is something "special" about simple addressing modes for loads, since only they are eligible for the 4-cycle load latency, but I'm not sure whether any of that applies to stores (though maybe it does, since they use the same AGUs after all).
Changing the test to use rsp or rbp in both stores doesn't seem to affect the timing. I also tried adding a +4 displacement to one of the stores; still no difference.
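The variants tried look roughly like this (sketch):

        ; both stores through the same base register:
        mov dword [rbp], edi
        mov dword [rbp], edi

        ; with a +4 displacement on one of them:
        mov dword [rbp], edi
        mov dword [rbp+4], edi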
A perhaps more interesting data point: with the zeroing inside the loop, it is fast with either xor or mov. I think that with zeroing on every iteration, the store data doesn't need a register file read at all (it arrives via the operand forwarding network)?
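That is, something like the following (sketch); with the zeroing re-executed every iteration, the store data can come straight off the forwarding network instead of the register file:

    .iloop:
        mov edi, 0              ; or xor edi, edi: both fast here
        mov dword [rbp], edi
        mov dword [rsp], edi
        sub rsp, 4
        dec ecx
        jg .iloop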
Also, it's nice to see that taking the "both writes to fixed locations" variant and increasing the number of inner loop iterations shows a predictable effect: as the time spent in one pass of the inner loop approaches the rescheduling interval, the fast xor variant gradually gets slower, ultimately matching the slow variant. When the process is interrupted for rescheduling and resumes, edi is no longer the blessed fast idiomatic zero (it was just restored from memory), so the inner loop is no longer fast.
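A sketch of that experiment (the inner count here is made up; the idea is to raise it until one pass of the inner loop approaches the scheduling quantum):

        mov ecx, 100000000      ; hypothetical large inner count
        xor edi, edi            ; fast idiomatic zero...
    align 16
    .iloop:
        mov dword [rbp], edi    ; ...until a context switch restores edi
        mov dword [rbp+4], edi  ; from memory and the idiom status is lost
        dec ecx
        jg .iloop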
Travis, does your original issue with a 64KB buffer on Skylake still reproduce without using the zero idiom?
Never mind — I've found a Skylake and I see that using mov eax, 0 in the asm makes no difference and the bimodal behavior is still observed.
Register read pressure was my thought too, but the pressure is not at all high here: only 6 reads total (about one per inner loop instruction), over 2 cycles, so a low 3 per cycle. Even SnB can do around 5 or 6 per cycle (Haswell is > 6), IIRC.
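For the record, my tally of the register reads in the COUNT=1 inner loop:

        mov dword [rbp], edi    ; reads rbp, edi   (2)
        mov dword [rsp], edi    ; reads rsp, edi   (2)
        sub rsp, 4              ; reads rsp        (1)
        dec ecx                 ; reads ecx        (1)
        jg .iloop               ; reads flags only

6 GPR reads over ~2 cycles is 3 per cycle.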
I'd play with it too, but I don't have a machine that reproduces it.