@rygorous
Last active December 11, 2018 13:01
%define COUNT 5                 ; 1=original test, 5=still in LSD, 50=not in LSD
%define DEC_IN_REP
;%define USE_ZERO_IDIOM

global main:function
main:
        mov     rbp, rsp
        mov     eax, 3700000            ; outer loop counter
.oloop:
        mov     ecx, 1000               ; inner loop counter
%ifdef USE_ZERO_IDIOM
        xor     edi, edi
%else
        mov     edi, 0
%endif
        align 16
.iloop:
%rep COUNT
        mov     dword [rbp], edi
        mov     dword [rsp], edi
        sub     rsp, 4
%ifdef DEC_IN_REP
        dec     ecx                     ; sub ecx, 1 behaves identically here
%endif
%endrep
%ifndef DEC_IN_REP
        sub     ecx, COUNT
%endif
        jg      .iloop
        mov     rsp, rbp                ; restore the stack pointer
        dec     eax
        jnz     .oloop
        mov     eax, 60                 ; exit(edi) via the Linux exit syscall
        syscall
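For reference (a sketch of the macro expansion, not additional code from the gist), with COUNT=1 and DEC_IN_REP defined the inner loop of the original test expands to:

.iloop:
        mov     dword [rbp], edi        ; store to a fixed slot
        mov     dword [rsp], edi        ; store to the current top of stack
        sub     rsp, 4                  ; bump the stack pointer down
        dec     ecx
        jg      .iloop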
All tests below use the above code and were run on a Core i7-3770S (Ivy Bridge).
COUNT=1, no USE_ZERO_IDIOM (original test) gives:
Performance counter stats for './bimodal1h':
2598.794560 task-clock:u (msec) # 1.000 CPUs utilized
18,570,413,195 instructions:u # 1.70 insn per cycle
10,940,768,758 cycles:u # 4.210 GHz
7,039,562,895 resource_stalls.sb:u # 2708.780 M/sec
14,238,283,080 lsd.uops:u # 5478.803 M/sec
2.599108301 seconds time elapsed
COUNT=1, USE_ZERO_IDIOM (original test w/ zeroing) gives:
Performance counter stats for './bimodal1h':
1769.848558 task-clock:u (msec) # 1.000 CPUs utilized
18,522,312,392 instructions:u # 2.49 insn per cycle
7,442,949,963 cycles:u # 4.205 GHz
3,530,810,123 resource_stalls.sb:u # 1994.979 M/sec
14,222,734,359 lsd.uops:u # 8036.131 M/sec
1.770161474 seconds time elapsed
COUNT=50, no USE_ZERO_IDIOM gives:
Performance counter stats for './bimodal1h':
2549.693958 task-clock:u (msec) # 1.000 CPUs utilized
14,944,413,164 instructions:u # 1.39 insn per cycle
10,733,404,877 cycles:u # 4.210 GHz
6,850,766,643 resource_stalls.sb:u # 2686.898 M/sec
12,511 lsd.uops:u # 0.005 M/sec
2.550020402 seconds time elapsed
NOTE: no significant cycle difference from the COUNT=1, no-zero-idiom test
(that 1.9% delta is far smaller than the effect we're looking at here),
but we are clearly no longer running out of the LSD.
COUNT=50, USE_ZERO_IDIOM on:
Performance counter stats for './bimodal1h':
1760.131634 task-clock:u (msec) # 1.000 CPUs utilized
14,896,312,383 instructions:u # 2.01 insn per cycle
7,402,124,894 cycles:u # 4.205 GHz
3,695,252,388 resource_stalls.sb:u # 2099.418 M/sec
12,503 lsd.uops:u # 0.007 M/sec
1.760444633 seconds time elapsed
NOTE: just to confirm that we don't need to be running out of the LSD
to get the quicker execution times.
----
What definitely matters is whether we decrement ecx after every write or not.
For example:
COUNT=50, no USE_ZERO_IDIOM, no DEC_IN_REP:
Performance counter stats for './bimodal1h':
1768.658769 task-clock:u (msec) # 1.000 CPUs utilized
11,318,412,364 instructions:u # 1.52 insn per cycle
7,436,418,026 cycles:u # 4.205 GHz
4,625,030,005 resource_stalls.sb:u # 2614.993 M/sec
12,699 lsd.uops:u # 0.007 M/sec
1.769107479 seconds time elapsed
NOTE: we still subtract 4 from rsp on every repetition; we just do fewer
decrements of ecx. This also fixes the problem.
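For clarity, here is roughly what the inner loop looks like without DEC_IN_REP (a sketch; the 50 repetitions are abbreviated): the per-store dec ecx disappears and a single sub ecx, COUNT at the end of each iteration sets the flags for jg.

.iloop:
        mov     dword [rbp], edi
        mov     dword [rsp], edi
        sub     rsp, 4
        ; ... 49 more copies of the three instructions above ...
        sub     ecx, 50                 ; one decrement per iteration instead of one per store
        jg      .iloop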
@travisdowns

Register read pressure was my thought too, but the pressure is not at all high here: only 6 reads total (one per inner loop instruction), over 2 cycles, so a low 3 per cycle. Even SnB can do around 5 or 6 per cycle (Haswell is > 6), IIRC.
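For reference, here is how that count breaks down over the COUNT=1 inner-loop body (a rough tally, assuming the fast case of about 2 cycles per iteration):

        mov     dword [rbp], edi        ; reads rbp, edi  -> 2
        mov     dword [rsp], edi        ; reads rsp, edi  -> 2
        sub     rsp, 4                  ; reads rsp       -> 1
        dec     ecx                     ; reads ecx       -> 1
        jg      .iloop                  ; reads flags only, no GPR read
        ; total: 6 GPR reads over ~2 cycles, i.e. about 3 reads per cycle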

I'd play with it too, but I don't have a machine that reproduces it.

@travisdowns

Does the issue still happen if the source reg edi is not the same between the two stores?

E.g., with

mov     dword [rbp], edi
mov     dword [rbp], esi

and edi and esi initialized in all 4 combinations of ways?
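A sketch of one of those combinations (hypothetical, just to pin down the variant being asked about; it keeps the original [rbp]/[rsp] destinations, and the other three combinations swap xor/mov between the two registers):

        xor     edi, edi                ; zero idiom for the first source register
        mov     esi, 0                  ; plain mov for the second
        align 16
.iloop:
        mov     dword [rbp], edi
        mov     dword [rsp], esi        ; second store now uses a different source register
        sub     rsp, 4
        dec     ecx
        jg      .iloop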

@amonakov

Yes: with both xor-zeroed, 2.0 cycles/iter; both mov-zeroed, 2.97; only edi xor-zeroed, 2.1; only esi xor-zeroed, 2.2. Similar fractional cycle counts are observed in the original test when changing the mov operands to an immediate 0.

On Twitter, Fabian pointed out that in the slow variant of the original test, changing mov [rbp], edi to mov [rbp+rdi], edi speeds it up from 3 to 2 cycles/iter (to be clear: this is with mov edi, 0 prior to the loop).
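A sketch of that variant (an assumption of its exact shape; only the addressing mode of the first store changes):

        mov     edi, 0                  ; the slow, non-idiom zeroing
        align 16
.iloop:
        mov     dword [rbp+rdi], edi    ; indexed addressing; rdi is 0, so the address is still [rbp]
        mov     dword [rsp], edi
        sub     rsp, 4
        dec     ecx
        jg      .iloop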

@travisdowns

How about if the writes are to the same location, or two fixed locations?

There is something "special" about simple addressing modes for loads, since only they are eligible for the 4-cycle load latency, but I'm not sure whether any of that applies to stores (though maybe it does, since they use the same AGUs after all).

@amonakov

Changing the test to use rsp or rbp in both stores doesn't seem to affect timing. Also tried adding a +4 displacement to one of the stores, still no difference.
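A sketch of those variants (an assumption of the exact form, keeping the rest of the loop unchanged): both stores go through rbp, so neither address moves, with an optional +4 displacement on the second store:

.iloop:
        mov     dword [rbp], edi
        mov     dword [rbp+4], edi      ; second store to a fixed location (or [rbp] for the same-location case)
        sub     rsp, 4                  ; assumption: the rest of the loop stays as in the original
        dec     ecx
        jg      .iloop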

@amonakov

A perhaps more interesting data point: with zeroing inside the loop, it is fast with either xor or mov. I think that with zeroing on every iteration, the store data does not need a register-file read (it arrives via the operand forwarding network)?
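A sketch of that zeroing-inside-the-loop variant (assuming the zeroing simply moves to the top of the inner-loop body):

.iloop:
        xor     edi, edi                ; or: mov edi, 0 -- both are fast in this variant
        mov     dword [rbp], edi
        mov     dword [rsp], edi
        sub     rsp, 4
        dec     ecx
        jg      .iloop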

@amonakov

Also, it's nice to see that taking the "both writes to fixed locations" variant and increasing the number of inner-loop iterations shows a predictable effect: as the time spent in the inner loop approaches the rescheduling interval, the fast variant with xor gradually gets slower, ultimately matching the slow variant. So when the process is interrupted for rescheduling and then resumes, edi is no longer the blessed fast idiomatic zero (because it was restored from memory) and the inner loop is no longer fast.

Travis, does your original issue with a 64KB buffer on Skylake still reproduce without using the zero idiom?

@amonakov

> Travis, does your original issue with a 64KB buffer on Skylake still reproduce without using the zero idiom?

Never mind — I've found a Skylake and I see that using mov eax, 0 in the asm makes no difference and the bimodal behavior is still observed.
