@brendangregg
Created April 16, 2022 11:47
sysbench cpu x86 arm analysis
Rough notes from an analysis of the sysbench cpu benchmark on x86 vs ARM. It ran 2.6x faster on ARM, but only
because of a faster div instruction, and that did not translate to a production win. The benchmark was misleading.
I also talked about this topic in my IntelON 2021 talk.
m6g.4xl
=======
test bgregg-focal-arm-v000 us-east-1 i-0d6f5a0062ee66c4e
(root) ~ # sysbench --max-requests=10000000 --max-time=60 --test=cpu --cpu-max-prime=100000 run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
WARNING: --max-requests is deprecated, use --events instead
WARNING: --max-time is deprecated, use --time instead
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Prime numbers limit: 100000
Initializing worker threads...
Threads started!
CPU speed:
events per second: 113.73
General statistics:
total time: 60.0019s
total number of events: 6824
Latency (ms):
min: 8.78
avg: 8.79
max: 8.84
95th percentile: 8.74
sum: 60000.42
Threads fairness:
events (avg/stddev): 6824.0000/0.00
execution time (avg/stddev): 60.0004/0.00
test bgregg-focal-arm-v000 us-east-1 i-0d6f5a0062ee66c4e
(root) ~ # time perl -e 'for ($i=0;$i<50_000_000;$i++) { $j--; }'
real 0m3.632s
user 0m3.632s
sys 0m0.000s
test bgregg-focal-arm-v000 us-east-1 i-0d6f5a0062ee66c4e
(root) ~ # /usr/lib/lmbench/bin/mhz
2499 MHz, 0.4002 nanosec clock
test bgregg-focal-arm-v000 us-east-1 i-0d6f5a0062ee66c4e
(root) ~ # /usr/lib/lmbench/bin/mhz
2499 MHz, 0.4002 nanosec clock
test bgregg-focal-arm-v000 us-east-1 i-0d6f5a0062ee66c4e
(root) ~ # /usr/lib/lmbench/bin/mhz
2499 MHz, 0.4002 nanosec clock
test bgregg-focal-arm-v000 us-east-1 i-0d6f5a0062ee66c4e
(root) ~ # /usr/lib/lmbench/bin/mhz
2499 MHz, 0.4002 nanosec clock
test bgregg-focal-arm-v000 us-east-1 i-0d6f5a0062ee66c4e
(root) ~ # /usr/lib/lmbench/bin/mhz
2499 MHz, 0.4002 nanosec clock
test bgregg-focal-arm-v000 us-east-1 i-0d6f5a0062ee66c4e
(root) ~ # /usr/lib/lmbench/bin/lat_mem_rd 256m 128
"stride=128
0.00049 1.601
0.00098 1.601
0.00195 1.601
0.00293 1.601
0.00391 1.601
0.00586 1.601
0.00781 1.601
0.01172 1.601
0.01562 1.601
0.02344 1.601
0.03125 1.601
0.04688 1.601
0.06250 1.602
0.09375 2.632
0.12500 2.631
0.18750 2.627
0.25000 2.628
0.37500 2.634
0.50000 2.699
0.75000 2.969
1.00000 3.160
1.50000 3.292
2.00000 3.486
3.00000 3.536
4.00000 3.645
6.00000 3.640
8.00000 3.718
12.00000 3.683
16.00000 3.772
24.00000 4.129
32.00000 5.363
48.00000 7.153
64.00000 7.290
96.00000 7.386
128.00000 7.448
192.00000 7.211
256.00000 7.534
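For a rough sense of scale, the lat_mem_rd latencies above can be converted to clock cycles using the 0.4002 ns clock that mhz reported on this instance. This is only an illustrative sketch: the helper name `ns_to_cycles` and the choice of which rows to convert are assumptions of mine, not part of lmbench.

```python
# Convert lat_mem_rd latencies (ns) to approximate CPU cycles on the m6g,
# using the clock period that lmbench's mhz reported (0.4002 ns at 2499 MHz).
CLOCK_NS = 0.4002

# (array size in MB, load latency in ns), copied from the output above.
samples = [
    (0.00049, 1.601),   # smallest array size
    (0.09375, 2.632),   # first step up in latency
    (256.0, 7.534),     # largest array size
]

def ns_to_cycles(latency_ns, clock_ns=CLOCK_NS):
    """Hypothetical helper: latency in ns divided by the clock period in ns."""
    return latency_ns / clock_ns

for size_mb, latency_ns in samples:
    cycles = ns_to_cycles(latency_ns)
    print(f"{size_mb:>10.5f} MB  {latency_ns:6.3f} ns  ~{cycles:5.1f} cycles")
```

The smallest array sizes work out to about 4 cycles per load. The same conversion is less trustworthy on the x86 instances below, where mhz alternates between 1348 and 2697 MHz from run to run.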
m4.4xl E5-2686 v4
=======
test bgregg-focal-m4-v000 us-east-1 i-011b5e48994d149b9
(root) ~ # sysbench --max-requests=10000000 --max-time=60 --test=cpu --cpu-max-prime=100000 run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
WARNING: --max-requests is deprecated, use --events instead
WARNING: --max-time is deprecated, use --time instead
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Prime numbers limit: 100000
Initializing worker threads...
Threads started!
CPU speed:
events per second: 36.70
General statistics:
total time: 60.0240s
total number of events: 2203
Latency (ms):
min: 26.95
avg: 27.25
max: 27.71
95th percentile: 27.17
sum: 60022.39
Threads fairness:
events (avg/stddev): 2203.0000/0.00
execution time (avg/stddev): 60.0224/0.00
test bgregg-focal-m4-v000 us-east-1 i-011b5e48994d149b9
(root) ~ # time perl -e 'for ($i=0;$i<50_000_000;$i++) { $j--; }'
real 0m3.103s
user 0m3.099s
sys 0m0.004s
test bgregg-focal-m4-v000 us-east-1 i-011b5e48994d149b9
(root) ~ # /usr/lib/lmbench/bin/x86_64-linux-gnu/mhz
1348 MHz, 0.7418 nanosec clock
test bgregg-focal-m4-v000 us-east-1 i-011b5e48994d149b9
(root) ~ # /usr/lib/lmbench/bin/x86_64-linux-gnu/mhz
2697 MHz, 0.3708 nanosec clock
test bgregg-focal-m4-v000 us-east-1 i-011b5e48994d149b9
(root) ~ # /usr/lib/lmbench/bin/x86_64-linux-gnu/mhz
1348 MHz, 0.7418 nanosec clock
test bgregg-focal-m4-v000 us-east-1 i-011b5e48994d149b9
(root) ~ # /usr/lib/lmbench/bin/x86_64-linux-gnu/mhz
1348 MHz, 0.7418 nanosec clock
test bgregg-focal-m4-v000 us-east-1 i-011b5e48994d149b9
(root) ~ # /usr/lib/lmbench/bin/x86_64-linux-gnu/mhz
2697 MHz, 0.3708 nanosec clock
test bgregg-focal-m4-v000 us-east-1 i-011b5e48994d149b9
(root) ~ # /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 256m 128
"stride=128
0.00049 1.484
0.00098 1.484
0.00195 1.485
0.00293 1.484
0.00391 1.485
0.00586 1.484
0.00781 1.484
0.01172 1.484
0.01562 1.485
0.02344 1.485
0.03125 1.485
0.04688 4.501
0.06250 9.547
0.09375 8.165
0.12500 10.305
0.18750 8.582
0.25000 6.316
0.37500 7.587
0.50000 7.673
0.75000 7.924
1.00000 7.811
1.50000 7.834
2.00000 7.921
3.00000 7.841
4.00000 7.920
6.00000 7.889
8.00000 8.073
12.00000 8.049
16.00000 8.193
24.00000 8.201
32.00000 9.819
48.00000 21.285
64.00000 27.707
96.00000 28.545
128.00000 28.905
192.00000 28.582
256.00000 28.877
m5.4xl 8175M
=======
test bgregg-focal-m5-v000 us-east-1 i-0dc1c6707f913288b
(root) ~ # sysbench --max-requests=10000000 --max-time=60 --test=cpu --cpu-max-prime=100000 run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
WARNING: --max-requests is deprecated, use --events instead
WARNING: --max-time is deprecated, use --time instead
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Prime numbers limit: 100000
Initializing worker threads...
Threads started!
CPU speed:
events per second: 43.15
General statistics:
total time: 60.0171s
total number of events: 2590
Latency (ms):
min: 23.11
avg: 23.17
max: 24.22
95th percentile: 23.10
sum: 60015.57
Threads fairness:
events (avg/stddev): 2590.0000/0.00
execution time (avg/stddev): 60.0156/0.00
test bgregg-focal-m5-v000 us-east-1 i-0dc1c6707f913288b
(root) ~ # time perl -e 'for ($i=0;$i<50_000_000;$i++) { $j--; }'
real 0m2.586s
user 0m2.586s
sys 0m0.000s
test bgregg-focal-m5-v000 us-east-1 i-0dc1c6707f913288b
(root) ~ # /usr/lib/lmbench/bin/x86_64-linux-gnu/mhz
1548 MHz, 0.6460 nanosec clock
test bgregg-focal-m5-v000 us-east-1 i-0dc1c6707f913288b
(root) ~ # /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 256m 128
"stride=128
0.00049 1.293
0.00098 1.293
0.00195 1.293
0.00293 1.293
0.00391 1.293
0.00586 1.293
0.00781 1.293
0.01172 1.293
0.01562 1.293
0.02344 1.293
0.03125 1.294
0.04688 4.328
0.06250 4.526
0.09375 4.526
0.12500 4.526
0.18750 4.526
0.25000 4.526
0.37500 4.617
0.50000 4.620
0.75000 5.369
1.00000 6.687
1.50000 8.304
2.00000 7.982
3.00000 7.842
4.00000 7.913
6.00000 7.893
8.00000 8.176
12.00000 8.164
16.00000 8.353
24.00000 10.162
32.00000 13.407
48.00000 19.358
64.00000 21.346
96.00000 25.256
128.00000 24.159
192.00000 25.033
256.00000 26.436
m5n.4xl
=======
test bgregg-focal-m5n-v000 us-east-1 i-0088e54f523c544da
(root) ~ # sysbench --max-requests=10000000 --max-time=60 --test=cpu --cpu-max-prime=100000 run
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
WARNING: --max-requests is deprecated, use --events instead
WARNING: --max-time is deprecated, use --time instead
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Prime numbers limit: 100000
Initializing worker threads...
Threads started!
CPU speed:
events per second: 43.35
General statistics:
total time: 60.0019s
total number of events: 2601
Latency (ms):
min: 23.03
avg: 23.07
max: 23.33
95th percentile: 23.10
sum: 60000.34
Threads fairness:
events (avg/stddev): 2601.0000/0.00
execution time (avg/stddev): 60.0003/0.00
^^^ arm is 2.62x faster than that
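The ratio can be checked against all three x86 instances using the "events per second" lines from the runs above. A minimal sketch; the dictionary keys are just labels I chose for the instance types.

```python
# Events per second reported by sysbench on each instance (from the runs above).
events_per_sec = {
    "m6g.4xl (arm)": 113.73,
    "m4.4xl": 36.70,
    "m5.4xl": 43.15,
    "m5n.4xl": 43.35,
}

arm = events_per_sec["m6g.4xl (arm)"]

# How many times faster the ARM instance is than each x86 instance.
speedups = {
    name: arm / eps
    for name, eps in events_per_sec.items()
    if "arm" not in name
}

for name, ratio in speedups.items():
    print(f"arm vs {name}: {ratio:.2f}x")
```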
test bgregg-focal-m5n-v000 us-east-1 i-0088e54f523c544da
(root) ~ # time perl -e 'for ($i=0;$i<50_000_000;$i++) { $j--; }'
real 0m2.680s
user 0m2.680s
sys 0m0.000s
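The perl loop points the other way from the 2.62x sysbench result: the ARM instance takes longer than every x86 instance. The sketch below does that arithmetic with the "real" times from the runs above; the labels and variable names are mine.

```python
# Wall-clock ("real") seconds for the perl decrement loop on each instance.
perl_real_sec = {
    "m6g.4xl (arm)": 3.632,
    "m4.4xl": 3.103,
    "m5.4xl": 2.586,
    "m5n.4xl": 2.680,
}

arm = perl_real_sec["m6g.4xl (arm)"]

# How many times longer the ARM instance takes than each x86 instance.
slowdowns = {
    name: arm / seconds
    for name, seconds in perl_real_sec.items()
    if "arm" not in name
}

for name, ratio in slowdowns.items():
    print(f"arm takes {ratio:.2f}x as long as {name}")
```

A simple interpreter loop is not a production workload either, but it shows that the sysbench ratio does not carry over to code that is not dominated by integer division.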
test bgregg-focal-m5n-v000 us-east-1 i-0088e54f523c544da
(root) ~ # /usr/lib/lmbench/bin/x86_64-linux-gnu/mhz
1549 MHz, 0.6456 nanosec clock
test bgregg-focal-m5n-v000 us-east-1 i-0088e54f523c544da
(root) ~ # /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_mem_rd 256m 128
"stride=128
0.00049 1.293
0.00098 1.293
0.00195 1.293
0.00293 1.292
0.00391 1.293
0.00586 1.293
0.00781 1.293
0.01172 1.293
0.01562 1.293
0.02344 1.293
0.03125 1.294
0.04688 4.327
0.06250 4.524
0.09375 4.524
0.12500 4.524
0.18750 4.524
0.25000 4.524
0.37500 4.615
0.50000 4.617
0.75000 5.118
1.00000 6.635
1.50000 8.424
2.00000 7.861
3.00000 7.988
4.00000 7.834
6.00000 8.041
8.00000 7.879
12.00000 7.908
16.00000 8.193
24.00000 9.147
32.00000 11.733
48.00000 17.865
64.00000 19.387
96.00000 25.043
128.00000 23.058
192.00000 25.487
256.00000 25.616
Analysis
========
x86 loop (verified using gdb and si):
(gdb) x/20i 0x000055555556ffd8
0x55555556ffd8: pxor %xmm0,%xmm0
0x55555556ffdc: cvtsi2sd %rcx,%xmm0
0x55555556ffe1: comisd %xmm0,%xmm1
0x55555556ffe5: jb 0x55555556ffa0
=> 0x55555556ffe7: mov %rbx,%rax
0x55555556ffea: xor %edx,%edx
0x55555556ffec: div %rcx
0x55555556ffef: test %rdx,%rdx
0x55555556fff2: je 0x55555556ffa0
0x55555556fff4: add $0x1,%rcx
0x55555556fff8: jns 0x55555556ffd8
arm loop:
(gdb) x/15i 0x0000aaaaaaac3a80
=> 0xaaaaaaac3a80: add x2, x2, #0x1
0xaaaaaaac3a84: ucvtf d0, x2
0xaaaaaaac3a88: fcmpe d0, d8
0xaaaaaaac3a8c: b.hi 0xaaaaaaac3a50 // b.pmore
0xaaaaaaac3a90: udiv x1, x19, x2
0xaaaaaaac3a94: msub x3, x1, x2, x19
0xaaaaaaac3a98: cbnz x3, 0xaaaaaaac3a80
^^^ 11 x86 instructions vs 7 arm
test bgregg-focal-m5n-v000 us-east-1 i-0088e54f523c544da
(root) ~ # readelf -s `which sysbench`
Symbol table '.dynsym' contains 391 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FUNC GLOBAL DEFAULT UND lua_pushlstring
2: 0000000000000000 0 FUNC GLOBAL DEFAULT UND strdup@GLIBC_2.2.5 (2)
3: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_stmt_free_result@libmysqlclient_21.0 (3)
4: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_stmt_bind_param@libmysqlclient_21.0 (3)
5: 0000000000000000 0 FUNC GLOBAL DEFAULT UND luaL_checkudata
6: 0000000000000000 0 FUNC GLOBAL DEFAULT UND lua_createtable
7: 0000000000000000 0 FUNC GLOBAL DEFAULT UND memcpy@GLIBC_2.14 (4)
8: 0000000000000000 0 FUNC GLOBAL DEFAULT UND pthread_rwlock_wrlock@GLIBC_2.2.5 (5)
9: 0000000000000000 0 FUNC GLOBAL DEFAULT UND lua_pushnumber
10: 0000000000000000 0 FUNC GLOBAL DEFAULT UND realloc@GLIBC_2.2.5 (2)
11: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_get_ssl_cipher@libmysqlclient_21.0 (3)
12: 0000000000000000 0 FUNC GLOBAL DEFAULT UND io_queue_release@LIBAIO_0.1 (6)
13: 0000000000000000 0 FUNC GLOBAL DEFAULT UND pthread_mutex_destroy@GLIBC_2.2.5 (5)
14: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_deregisterTMCloneTab
15: 0000000000000000 0 FUNC GLOBAL DEFAULT UND floor@GLIBC_2.2.5 (7)
16: 0000000000000000 0 FUNC GLOBAL DEFAULT UND __ctype_b_loc@GLIBC_2.3 (8)
17: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_num_fields@libmysqlclient_21.0 (3)
18: 0000000000000000 0 FUNC GLOBAL DEFAULT UND __pthread_unregister_canc@GLIBC_2.3.3 (9)
19: 0000000000000000 0 FUNC GLOBAL DEFAULT UND exp@GLIBC_2.29 (10)
20: 0000000000000000 0 FUNC GLOBAL DEFAULT UND lua_newuserdata
21: 0000000000000000 0 FUNC GLOBAL DEFAULT UND lua_rawseti
22: 0000000000000000 0 FUNC GLOBAL DEFAULT UND strncpy@GLIBC_2.2.5 (2)
23: 0000000000000000 0 FUNC GLOBAL DEFAULT UND pthread_attr_init@GLIBC_2.2.5 (5)
24: 0000000000000000 0 FUNC GLOBAL DEFAULT UND log@GLIBC_2.29 (10)
25: 0000000000000000 0 FUNC GLOBAL DEFAULT UND free@GLIBC_2.2.5 (2)
26: 0000000000000000 0 FUNC GLOBAL DEFAULT UND pwrite@GLIBC_2.2.5 (5)
27: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_close@libmysqlclient_21.0 (3)
28: 0000000000000000 0 FUNC GLOBAL DEFAULT UND luaL_loadbuffer
29: 0000000000000000 0 FUNC GLOBAL DEFAULT UND fdatasync@GLIBC_2.2.5 (2)
30: 0000000000000000 0 FUNC GLOBAL DEFAULT UND __vsnprintf_chk@GLIBC_2.3.4 (11)
31: 0000000000000000 0 FUNC GLOBAL DEFAULT UND PQstatus
32: 0000000000000000 0 FUNC GLOBAL DEFAULT UND PQsetNoticeProcessor
33: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_options@libmysqlclient_21.0 (3)
34: 0000000000000000 0 FUNC GLOBAL DEFAULT UND PQprepare
35: 0000000000000000 0 FUNC GLOBAL DEFAULT UND close@GLIBC_2.2.5 (5)
test bgregg-focal-arm-v000 us-east-1 i-0d6f5a0062ee66c4e
(root) ~ # readelf -s `which sysbench`
Symbol table '.dynsym' contains 398 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000007d68 0 SECTION LOCAL DEFAULT 11
2: 000000000003b000 0 SECTION LOCAL DEFAULT 24
3: 0000000000000000 0 FUNC GLOBAL DEFAULT UND lua_pushlstring
4: 0000000000000000 0 FUNC GLOBAL DEFAULT UND pthread_cond_broadcast@GLIBC_2.17 (2)
5: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_stmt_free_result@libmysqlclient_21.0 (3)
6: 0000000000000000 0 FUNC GLOBAL DEFAULT UND fopen@GLIBC_2.17 (4)
7: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_stmt_bind_param@libmysqlclient_21.0 (3)
8: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mmap@GLIBC_2.17 (4)
9: 0000000000000000 0 FUNC GLOBAL DEFAULT UND luaL_checkudata
10: 0000000000000000 0 FUNC GLOBAL DEFAULT UND lua_createtable
11: 0000000000000000 0 FUNC GLOBAL DEFAULT UND pthread_once@GLIBC_2.17 (2)
12: 0000000000000000 0 FUNC GLOBAL DEFAULT UND __strcpy_chk@GLIBC_2.17 (4)
13: 0000000000000000 0 FUNC GLOBAL DEFAULT UND lua_pushnumber
14: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_get_ssl_cipher@libmysqlclient_21.0 (3)
15: 0000000000000000 0 FUNC GLOBAL DEFAULT UND io_queue_release@LIBAIO_0.1 (5)
16: 0000000000000000 0 FUNC GLOBAL DEFAULT UND munmap@GLIBC_2.17 (4)
17: 0000000000000000 0 NOTYPE WEAK DEFAULT UND _ITM_deregisterTMCloneTab
18: 0000000000000000 0 FUNC GLOBAL DEFAULT UND pthread_rwlock_unlock@GLIBC_2.17 (2)
19: 0000000000000000 0 FUNC WEAK DEFAULT UND __pthread_unwind_next@GLIBC_2.17 (2)
20: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_num_fields@libmysqlclient_21.0 (3)
21: 0000000000000000 0 FUNC GLOBAL DEFAULT UND exp@GLIBC_2.29 (6)
22: 0000000000000000 0 FUNC GLOBAL DEFAULT UND lua_newuserdata
23: 0000000000000000 0 OBJECT GLOBAL DEFAULT UND stderr@GLIBC_2.17 (4)
24: 0000000000000000 0 FUNC GLOBAL DEFAULT UND lua_rawseti
25: 0000000000000000 0 FUNC GLOBAL DEFAULT UND log@GLIBC_2.29 (6)
26: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_close@libmysqlclient_21.0 (3)
27: 0000000000000000 0 FUNC GLOBAL DEFAULT UND pthread_attr_init@GLIBC_2.17 (2)
28: 0000000000000000 0 FUNC GLOBAL DEFAULT UND luaL_loadbuffer
29: 0000000000000000 0 FUNC GLOBAL DEFAULT UND msync@GLIBC_2.17 (2)
30: 0000000000000000 0 FUNC GLOBAL DEFAULT UND pthread_rwlock_destroy@GLIBC_2.17 (2)
31: 0000000000000000 0 FUNC GLOBAL DEFAULT UND snprintf@GLIBC_2.17 (4)
32: 0000000000000000 0 FUNC GLOBAL DEFAULT UND lseek@GLIBC_2.17 (2)
33: 0000000000000000 0 FUNC GLOBAL DEFAULT UND PQstatus
34: 0000000000000000 0 FUNC GLOBAL DEFAULT UND PQsetNoticeProcessor
35: 0000000000000000 0 FUNC GLOBAL DEFAULT UND mysql_options@libmysqlclient_21.0 (3)
[...]
m5n build:
===============================================================================
sysbench version : 1.1.0-ead2689
CC : gcc
CFLAGS : -O3 -funroll-loops -Wall -Wextra -Wpointer-arith -Wbad-function-cast -Wstrict-prototypes -Wnested-externs -Wno-format-zero-length -Wundef -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -Wcast-align -Wvla -pthread
CPPFLAGS : -D_GNU_SOURCE -I$(top_srcdir)/src -I$(abs_top_builddir)/third_party/luajit/inc -I$(abs_top_builddir)/third_party/concurrency_kit/include
LDFLAGS :
LIBS : -laio -lm
arm build:
===============================================================================
sysbench version : 1.1.0-ead2689
CC : gcc
CFLAGS : -O3 -funroll-loops -Wall -Wextra -Wpointer-arith -Wbad-function-cast -Wstrict-prototypes -Wnested-externs -Wno-format-zero-length -Wundef -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations -Wredundant-decls -Wcast-align -Wvla -pthread
CPPFLAGS : -D_GNU_SOURCE -DCK_MD_CACHELINE=128 -I$(top_srcdir)/src -I$(abs_top_builddir)/third_party/luajit/inc -I$(abs_top_builddir)/third_party/concurrency_kit/include
LDFLAGS :
LIBS : -laio -lm
similar results when compiled from source :(
moving to -O0 on arm dropped performance by 20% :(
x86 laptop
----------
build sysbench with CFLAGS=-g (so source view works in perf top)
perf top -e cycles:ppp -p `pgrep sysbench`
entered the hot function, then pressed o to show addresses and s to show source:
Samples: 274K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 61489970617
cpu_execute_event /home/bgregg/Build/sysbench/src/sysbench [Percent: local period]
0.00 │20259: movsd 0x5fe37(%rip),%xmm2 # 80098 <sizemods+0xd77>
│ for(c=3; c < max_prime; c++)
│20261: cmp $0x3,%rbp
│20265: ↓ jbe 202f0 <cpu_execute_event+0xb0>
│2026b: nopl 0x0(%rax,%rax,1)
1.98 │20270: add $0x1,%rbx
0.02 │20274: cmp %rbp,%rbx
│20277: ↓ je 202f0 <cpu_execute_event+0xb0>
│ t = sqrt((double)c);
0.05 │20279: pxor %xmm0,%xmm0
1.23 │2027d: cvtsi2sd %rbx,%xmm0
0.01 │20282: ucomisd %xmm0,%xmm3
0.25 │20286: movapd %xmm0,%xmm1
2.71 │2028a: sqrtsd %xmm1,%xmm1
0.00 │2028e: ↓ ja 202f9 <cpu_execute_event+0xb9>
│ for(l = 2; l <= t; l++)
0.02 │20290: comisd %xmm2,%xmm1
0.05 │20294: ↑ jb 20270 <cpu_execute_event+0x30>
│ if (c % l == 0)
0.15 │20296: test $0x1,%bl
0.15 │20299: ↑ je 20270 <cpu_execute_event+0x30>
│ for(l = 2; l <= t; l++)
│2029b: mov $0x2,%ecx
│202a0: ↓ jmp 202c4 <cpu_execute_event+0x84>
│202a2: nopw 0x0(%rax,%rax,1)
3.57 │202a8: pxor %xmm0,%xmm0
0.21 │202ac: cvtsi2sd %rcx,%xmm0
0.26 │202b1: comisd %xmm0,%xmm1
3.51 │202b5: ↑ jb 20270 <cpu_execute_event+0x30>
│ if (c % l == 0)
0.09 │202b7: mov %rbx,%rax
0.02 │202ba: xor %edx,%edx
85.00 │202bc: div %rcx
0.12 │202bf: test %rdx,%rdx
0.02 │202c2: ↑ je 20270 <cpu_execute_event+0x30>
│ for(l = 2; l <= t; l++)
0.28 │202c4: add $0x1,%rcx
0.30 │202c8: ↑ jns 202a8 <cpu_execute_event+0x68>
│202ca: mov %rcx,%rax
│202cd: mov %rcx,%rdx
│202d0: pxor %xmm0,%xmm0
│202d4: shr %rax
│202d7: and $0x1,%edx
│202da: or %rdx,%rax
│202dd: cvtsi2sd %rax,%xmm0
│202e2: addsd %xmm0,%xmm0
Note: 85% of samples are on the div instruction
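The source lines interleaved in the perf output describe a trial-division loop: for each candidate c, try divisors l from 2 up to sqrt(c) and do a modulo for each one. Below is a Python sketch of that same logic with a counter added, to show how many modulo operations one event performs. It is my reconstruction, not the sysbench source: the `break` is inferred from the `je` at 202c2 that jumps back to the outer loop, and the names `cpu_event` and `mods` are made up for illustration.

```python
import math

def cpu_event(max_prime):
    """Trial-division prime count, mirroring the loop seen in perf.

    Returns (primes_found, modulo_ops). modulo_ops counts how often the
    `c % l` test runs, which is where the div/udiv instruction is used.
    """
    n = 0
    mods = 0
    for c in range(3, max_prime):
        t = math.sqrt(c)
        l = 2
        while l <= t:
            mods += 1
            if c % l == 0:
                break          # inferred: a divisor was found, c is not prime
            l += 1
        if l > t:
            n += 1             # no divisor up to sqrt(c): c is prime
    return n, mods

primes, mods = cpu_event(10)
print(primes, mods)  # 3 primes (3, 5, 7) found using 7 modulo operations
```

Calling `cpu_event(100000)` uses the same prime limit as the runs above. Pure Python will be far slower than the compiled loop, so only the counts are meaningful, not the timing.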
ARM
---
Samples: 133K of event 'cycles:ppp', 4000 Hz, Event count (approx.): 35833053709
cpu_execute_event /root/sysbench/src/sysbench [Percent: local period]
Percent│ t = sqrt((double)c);
│ for(l = 2; l <= t; l++)
│1e88c: fmov d9, #2.000000000000000000e+00
│ for(c=3; c < max_prime; c++)
│1e890: cmp x20, x19
│1e894: ↓ b.ls 1e8f0 <cpu_execute_event+0x80> // b.plast
0.11 │1e898: add x19, x19, #0x1
0.00 │1e89c: cmp x19, x20
│1e8a0: ↓ b.eq 1e8f0 <cpu_execute_event+0x80> // b.none
│ t = sqrt((double)c);
1.47 │1e8a4: scvtf d0, x19
0.17 │1e8a8: fsqrt d8, d0
│1e8ac: fcmp d0, #0.0
0.41 │1e8b0: ↓ b.mi 1e904 <cpu_execute_event+0x94> // b.first
│ for(l = 2; l <= t; l++)
2.33 │1e8b4: fcmpe d8, d9
│1e8b8: ↑ b.lt 1e898 <cpu_execute_event+0x28> // b.tstop
│ if (c % l == 0)
6.20 │1e8bc: ↑ tbz w19, #0, 1e898 <cpu_execute_event+0x28>
│ for(l = 2; l <= t; l++)
│1e8c0: mov x2, #0x2 // #2
│1e8c4: nop
84.35 │1e8c8: add x2, x2, #0x1
0.08 │1e8cc: ucvtf d0, x2
0.10 │1e8d0: fcmpe d0, d8
│1e8d4: ↑ b.hi 1e898 <cpu_execute_event+0x28> // b.pmore
│ if (c % l == 0)
1.37 │1e8d8: udiv x1, x19, x2
0.00 │1e8dc: msub x3, x1, x2, x19
│1e8e0: ↑ cbnz x3, 1e8c8 <cpu_execute_event+0x58>
│ for(c=3; c < max_prime; c++)
2.46 │1e8e4: add x19, x19, #0x1
0.94 │1e8e8: cmp x19, x20
│1e8ec: ↑ b.ne 1e8a4 <cpu_execute_event+0x34> // b.any
│ if (l > t )
│ n++;
│ }
│ return 0;
│ }
│1e8f0: mov w0, #0x0 // #0
│1e8f4: ldp x19, x20, [sp, #16]
│1e8f8: ldp d8, d9, [sp, #32]
│1e8fc: ldp x29, x30, [sp], #48
│1e900: ← ret
│ t = sqrt((double)c);
│1e904: → bl bfd0 <sqrt@plt>
│1e908: ↑ b 1e8b4 <cpu_execute_event+0x44>
test bgregg-focal-arm-v000 us-east-1 i-0d6f5a0062ee66c4e
(root) ~/pmc-cloud-tools # ./pmcarch-arm -p `pgrep sysbench` 1
K_CYCLES K_INSTR IPC BR_RETIRED BR_MISPRED BMR% LLCREF LLCMISS LLC%
2466749 2547477 1.03 796421651 7862138 0.99 930645 26546 97.15
2468178 2548302 1.03 796065561 7843408 0.99 930636 27156 97.08
2467727 2548142 1.03 796203146 7844933 0.99 933348 28198 96.98
2467949 2548597 1.03 796354230 7841235 0.98 932690 27761 97.02
2468077 2549395 1.03 796518061 7832774 0.98 925969 26813 97.10
2467745 2549433 1.03 796490782 7836392 0.98 915061 27066 97.04
2467138 2548741 1.03 796431984 7827463 0.98 935793 27997 97.01
2467637 2548942 1.03 796377545 7835719 0.98 927907 27169 97.07
2468255 2549979 1.03 796142152 7810915 0.98 935802 26982 97.12
2468141 2549222 1.03 795748285 7835197 0.98 925002 27254 97.05
2467586 2548973 1.03 795951417 7829429 0.98 932927 27509 97.05
2467948 2549608 1.03 796040137 7838052 0.98 928710 26509 97.15
^C
laptop:
^Croot@bgregg-ThinkPad-T14-Gen-1:/home/bgregg/Build/sysbench> ../../Git/pmc-cloud-tools/pmcarch -p `pgrep sysbench` 1
K_CYCLES K_INSTR IPC BR_RETIRED BR_MISPRED BMR% LLCREF LLCMISS LLC%
3758158 1743858 0.46 484794716 5464826 1.13 354910 113007 68.16
3638459 1658361 0.46 461119278 5203799 1.13 510232 187695 63.21
3725500 1728088 0.46 481719365 5425807 1.13 446062 149518 66.48
3708671 1721892 0.46 479231376 5388728 1.12 424046 127710 69.88
3769505 1753538 0.47 487235743 5491588 1.13 398418 112970 71.65
3646662 1666795 0.46 462998633 5232494 1.13 467052 170167 63.57
3750743 1747703 0.47 486235583 5471037 1.13 409565 126892 69.02
3704129 1725497 0.47 479530317 5398128 1.13 403939 129576 67.92
3763474 1758103 0.47 488885363 5493292 1.12 342379 118463 65.40
3610052 1607590 0.45 446467732 5074751 1.14 480854 193250 59.81
3749222 1744571 0.47 485291550 5464549 1.13 394307 111610 71.69
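Combining these counters with the inner-loop instruction counts from the gdb listings (7 on arm, 11 on x86) gives a rough cycles-per-iteration estimate. This is a back-of-the-envelope sketch that assumes nearly all retired instructions belong to the inner loop, which the perf profiles suggest but do not prove. It uses only the first 1-second sample from each run, and the laptop is an i5-10210U rather than one of the cloud instances.

```python
# First 1-second sample from each pmcarch run above (K_ columns are in thousands).
arm = {"k_cycles": 2466749, "k_instr": 2547477, "loop_instr": 7}
x86 = {"k_cycles": 3758158, "k_instr": 1743858, "loop_instr": 11}

def ipc(sample):
    """Instructions per cycle, as pmcarch reports it."""
    return sample["k_instr"] / sample["k_cycles"]

def cycles_per_iteration(sample):
    """Estimate cycles per inner-loop iteration.

    Assumption: every retired instruction is part of the inner loop, so
    iterations = instructions / instructions-per-iteration.
    """
    iterations_k = sample["k_instr"] / sample["loop_instr"]
    return sample["k_cycles"] / iterations_k

for name, sample in (("arm", arm), ("x86 laptop", x86)):
    print(f"{name}: IPC {ipc(sample):.2f}, "
          f"~{cycles_per_iteration(sample):.1f} cycles per iteration")
```

Under that assumption the arm loop takes roughly 7 cycles per iteration and the laptop roughly 24, which fits a loop that is dominated by the latency of a single divide.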
with 4 processes running:
root@bgregg-ThinkPad-T14-Gen-1:/home/bgregg/Build/sysbench> perf stat --topdown
nmi_watchdog enabled with topdown. May give wrong results.
Disable with echo 0 > /proc/sys/kernel/nmi_watchdog
^C
Performance counter stats for 'system wide':
retiring bad speculation frontend bound backend bound
S0-D0-C0 2 46.9% 2.9% 48.8% 1.3%
S0-D0-C1 2 48.0% 2.9% 48.1% 1.0%
S0-D0-C2 2 47.4% 3.0% 48.6% 1.1%
S0-D0-C3 2 48.0% 2.9% 48.0% 1.1%
2.699977797 seconds time elapsed
root@bgregg-ThinkPad-T14-Gen-1:/home/bgregg/Utils/pmu-tools> ./toplev.py -l2 /home/bgregg/Build/sysbench/src/sysbench --cpu-max-prime=10000 --time=10 --threads=1 cpu run
Consider disabling nmi watchdog to minimize multiplexing
(echo 0 > /proc/sys/kernel/nmi_watchdog as root)
Will measure complete system.
Using level 2.
sysbench 1.1.0-ead2689 (using bundled LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Prime numbers limit: 10000
Initializing worker threads...
Threads started!
CPU speed:
events per second: 1304.07
Throughput:
events/s (eps): 1304.0707
time elapsed: 10.0002s
total number of events: 13041
Latency (ms):
min: 0.69
avg: 0.77
max: 1.80
95th percentile: 0.86
sum: 9995.00
Threads fairness:
events (avg/stddev): 13041.0000/0.00
execution time (avg/stddev): 9.9950/0.00
# 4.2-full-perf on Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz [kblr/skylake]
C0 FE Frontend_Bound % Slots 46.7
C0 BE Backend_Bound % Slots 23.7
C0 FE Frontend_Bound.Fetch_Latency % Slots 46.2 <==
This metric represents fraction of slots the CPU was stalled
due to Frontend latency issues...
Sampling events: frontend_retired.latency_ge_16:pp frontend_retired.latency_ge_8:pp
C0 BE/Core Backend_Bound.Core_Bound % Slots 10.7
This metric represents fraction of slots where Core non-
memory issues were of a bottleneck...
C0-T0 MUX % 7.7
PerfMon Event Multiplexing accuracy indicator
C1 FE Frontend_Bound % Slots 51.4
C1 RET Retiring % Slots 45.0
C1 FE Frontend_Bound.Fetch_Latency % Slots 41.3 <==
C1 FE Frontend_Bound.Fetch_Bandwidth % Slots 10.1
This metric represents fraction of slots the CPU was stalled
due to Frontend bandwidth issues...
Sampling events: frontend_retired.latency_ge_2_bubbles_ge_1:pp frontend_retired.latency_ge_1:pp frontend_retired.latency_ge_2:pp
C1 RET Retiring.Heavy_Operations % Slots 33.6
This metric represents fraction of slots where the CPU was
retiring heavy-weight operations -- instructions that
require two or more uops ([ICL/TGL] this metric accounts
only for the subset of heavy operations that are delivered
by the microcode sequencer unit)...
C1-T0 MUX % 7.7
C2 FE Frontend_Bound % Slots 46.7
C2 RET Retiring % Slots 31.8
C2 FE Frontend_Bound.Fetch_Latency % Slots 40.2 <==
C2 RET Retiring.Heavy_Operations % Slots 11.8
C2-T0 MUX % 7.7
C3 FE Frontend_Bound % Slots 42.0
C3 BE Backend_Bound % Slots 27.3
C3 FE Frontend_Bound.Fetch_Latency % Slots 31.8 <==
C3 BE/Core Backend_Bound.Core_Bound % Slots 11.7
C3-T0 MUX % 7.7
C0-T1 MUX % 7.7
C1-T1 MUX % 7.7
C2-T1 MUX % 7.7
C3-T1 MUX % 7.7
root@bgregg-ThinkPad-T14-Gen-1:/home/bgregg/Utils/pmu-tools> tail /proc/cpuinfo
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple pml ept_mode_based_exec
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs itlb_multihit srbds
bogomips : 4199.88
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management: