@cuviper
Last active October 23, 2023 17:25
BabelStream, OpenMP vs. Rayon
$ perf stat -d ./omp-stream
BabelStream
Version: 3.4
Implementation: OpenMP
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        108865.244  0.00493     0.05374     0.00834
Mul         97387.786   0.00551     0.03843     0.00959
Add         100390.735  0.00802     0.03490     0.01230
Triad       104748.201  0.00769     0.04499     0.01212
Dot         116643.998  0.00460     0.03035     0.00887
Performance counter stats for './omp-stream':
665,460.03 msec task-clock:u # 115.871 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
60,325 page-faults:u # 0.091 K/sec
1,973,539,186,499 cycles:u # 2.966 GHz (49.83%)
20,578,592,028 stalled-cycles-frontend:u # 1.04% frontend cycles idle (49.75%)
1,892,921,586,383 stalled-cycles-backend:u # 95.92% backend cycles idle (10.05%)
412,763,127,962 instructions:u # 0.21 insn per cycle
# 4.59 stalled cycles per insn (20.14%)
78,155,286,693 branches:u # 117.446 M/sec (30.21%)
84,235,682 branch-misses:u # 0.11% of all branches (10.04%)
92,640,194,539 L1-dcache-loads:u # 139.212 M/sec (20.04%)
33,602,081,514 L1-dcache-load-misses:u # 36.27% of all L1-dcache hits (30.02%)
637,221,226 LLC-loads:u # 0.958 M/sec (29.95%)
134,179,050 LLC-load-misses:u # 21.06% of all LL-cache hits (39.89%)
5.743135905 seconds time elapsed
637.155108000 seconds user
28.291715000 seconds sys
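
The MBytes/sec column can be reproduced from the array size and the Min (sec) column. The sketch below assumes the usual STREAM-style accounting (Copy, Mul and Dot touch two arrays per pass; Add and Triad touch three) and an element count of 2^25 doubles, which is inferred from "Array size: 268.4 MB" rather than printed in the output. The function names are hypothetical and not taken from either benchmark's source.

```rust
/// Number of arrays each kernel reads or writes per pass.
/// Assumption: STREAM-style accounting, not confirmed by the logs.
fn arrays_touched(kernel: &str) -> Option<u32> {
    match kernel {
        "Copy" | "Mul" | "Dot" => Some(2),
        "Add" | "Triad" => Some(3),
        _ => None,
    }
}

/// Bandwidth in MB/s (1 MB = 10^6 bytes) for `arrays` arrays of
/// `elements` f64 values moved in `min_secs` seconds.
fn mbytes_per_sec(arrays: u32, elements: u64, min_secs: f64) -> f64 {
    let bytes = arrays as f64 * elements as f64 * std::mem::size_of::<f64>() as f64;
    bytes / min_secs / 1.0e6
}

fn main() {
    // Assumption: 2^25 doubles = 268,435,456 bytes per array.
    let elements: u64 = 1 << 25;
    // Min (sec) values copied from the first OpenMP run.
    let rows = [
        ("Copy", 0.00493),
        ("Mul", 0.00551),
        ("Add", 0.00802),
        ("Triad", 0.00769),
        ("Dot", 0.00460),
    ];
    for &(name, min_secs) in rows.iter() {
        let arrays = arrays_touched(name).expect("known kernel");
        println!("{:<6} {:>12.3} MB/s", name, mbytes_per_sec(arrays, elements, min_secs));
    }
}
```

Because Min (sec) is printed to only five decimal places, the recomputed figures agree with the reported column to within roughly 0.1%, not exactly.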
$ perf stat -d ./target/release/babel_stream
BabelStream
Version: 0.5
Implmentation: Rust
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    Mbytes/sec  Min (sec)   Max         Average
Copy        54328.164   0.00988     0.02706     0.01375
Mul         54421.785   0.00987     0.02571     0.01333
Add         51490.177   0.01564     0.03606     0.02052
Triad       52049.274   0.01547     0.03241     0.02031
Dot         50989.734   0.01053     0.02623     0.01478
Performance counter stats for './target/release/babel_stream':
1,062,303.36 msec task-clock:u # 122.712 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
42,124 page-faults:u # 0.040 K/sec
3,260,121,432,385 cycles:u # 3.069 GHz (49.86%)
541,201,721,259 stalled-cycles-frontend:u # 16.60% frontend cycles idle (49.90%)
2,227,238,034,453 stalled-cycles-backend:u # 68.32% backend cycles idle (10.04%)
331,330,149,597 instructions:u # 0.10 insn per cycle
# 6.72 stalled cycles per insn (20.07%)
50,746,277,795 branches:u # 47.770 M/sec (30.08%)
3,184,008,620 branch-misses:u # 6.27% of all branches (10.01%)
89,919,348,934 L1-dcache-loads:u # 84.646 M/sec (20.01%)
9,660,898,319 L1-dcache-load-misses:u # 10.74% of all L1-dcache hits (29.99%)
878,764,317 LLC-loads:u # 0.827 M/sec (29.95%)
849,262,989 LLC-load-misses:u # 96.64% of all LL-cache hits (39.89%)
8.656898836 seconds time elapsed
1038.920388000 seconds user
23.939748000 seconds sys
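
The derived figures in the right-hand column of the `perf stat` output follow from the raw counters. As a rough check, the sketch below recomputes "CPUs utilized" and "insn per cycle" for the Rust run. The function names are hypothetical. The "stalled cycles per insn" helper assumes the figure is backend stalls divided by instructions, which matches the printed value here but is not stated by the output itself. The percentages in parentheses show that the counters were multiplexed, so the raw counts are themselves scaled estimates.

```rust
/// "CPUs utilized": CPU time accumulated across all threads divided by wall-clock time.
fn cpus_utilized(task_clock_msec: f64, elapsed_secs: f64) -> f64 {
    task_clock_msec / 1000.0 / elapsed_secs
}

/// "insn per cycle": retired instructions divided by cycles.
fn insn_per_cycle(instructions: u64, cycles: u64) -> f64 {
    instructions as f64 / cycles as f64
}

/// Assumption: "stalled cycles per insn" uses the backend stall count here.
fn stalled_per_insn(stalled_cycles: u64, instructions: u64) -> f64 {
    stalled_cycles as f64 / instructions as f64
}

fn main() {
    // Counter values copied from the unpinned Rust run.
    let cpus = cpus_utilized(1_062_303.36, 8.656898836);
    let ipc = insn_per_cycle(331_330_149_597, 3_260_121_432_385);
    let stalls = stalled_per_insn(2_227_238_034_453, 331_330_149_597);
    println!("{:.3} CPUs utilized", cpus);
    println!("{:.2} insn per cycle", ipc);
    println!("{:.2} stalled cycles per insn", stalls);
}
```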
$ perf stat -d taskset -c 0-63 ./omp-stream
BabelStream
Version: 3.4
Implementation: OpenMP
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        52348.508   0.01026     0.01185     0.01043
Mul         49850.428   0.01077     0.01266     0.01098
Add         52537.521   0.01533     0.01801     0.01565
Triad       54129.328   0.01488     0.01763     0.01513
Dot         65692.195   0.00817     0.00940     0.00843
Performance counter stats for 'taskset -c 0-63 ./omp-stream':
407,153.55 msec task-clock:u # 61.544 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
48,376 page-faults:u # 0.119 K/sec
1,258,117,724,107 cycles:u # 3.090 GHz (50.07%)
10,514,200,105 stalled-cycles-frontend:u # 0.84% frontend cycles idle (49.94%)
1,201,580,975,236 stalled-cycles-backend:u # 95.51% backend cycles idle (9.95%)
179,689,025,851 instructions:u # 0.14 insn per cycle
# 6.69 stalled cycles per insn (19.96%)
24,049,840,548 branches:u # 59.068 M/sec (29.99%)
11,896,941 branch-misses:u # 0.05% of all branches (10.03%)
35,726,797,726 L1-dcache-loads:u # 87.748 M/sec (20.07%)
6,936,045,729 L1-dcache-load-misses:u # 19.41% of all L1-dcache hits (30.10%)
657,412,066 LLC-loads:u # 1.615 M/sec (30.09%)
103,466,460 LLC-load-misses:u # 15.74% of all LL-cache hits (40.10%)
6.615633299 seconds time elapsed
397.221694000 seconds user
10.255615000 seconds sys
$ perf stat -d taskset -c 0-63 ./target/release/babel_stream
BabelStream
Version: 0.5
Implmentation: Rust
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    Mbytes/sec  Min (sec)   Max         Average
Copy        58058.929   0.00925     0.01154     0.00974
Mul         51776.537   0.01037     0.01149     0.01057
Add         53694.250   0.01500     0.01545     0.01518
Triad       54317.170   0.01483     0.01600     0.01503
Dot         61588.954   0.00872     0.00974     0.00894
Performance counter stats for 'taskset -c 0-63 ./target/release/babel_stream':
383,067.96 msec task-clock:u # 62.107 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
29,813 page-faults:u # 0.078 K/sec
1,211,090,824,687 cycles:u # 3.162 GHz (49.85%)
145,032,175,754 stalled-cycles-frontend:u # 11.98% frontend cycles idle (49.92%)
928,771,095,319 stalled-cycles-backend:u # 76.69% backend cycles idle (10.06%)
95,197,325,012 instructions:u # 0.08 insn per cycle
# 9.76 stalled cycles per insn (20.10%)
9,535,026,301 branches:u # 24.891 M/sec (30.11%)
543,578,990 branch-misses:u # 5.70% of all branches (10.00%)
30,401,146,966 L1-dcache-loads:u # 79.362 M/sec (19.98%)
3,961,642,174 L1-dcache-load-misses:u # 13.03% of all L1-dcache hits (29.95%)
786,657,019 LLC-loads:u # 2.054 M/sec (29.90%)
272,522,007 LLC-load-misses:u # 34.64% of all LL-cache hits (39.86%)
6.167891515 seconds time elapsed
375.015537000 seconds user
8.316741000 seconds sys
Operating System: Fedora 29 (Twenty Nine)
Kernel: Linux 5.0.5-200.fc29.ppc64le
$ lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 4
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Model: 2.2 (pvr 004e 1202)
Model name: POWER9, altivec supported
CPU max MHz: 3800.0000
CPU min MHz: 2166.0000
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 10240K
NUMA node0 CPU(s): 0-63
NUMA node8 CPU(s): 64-127
$ rpm -q rust clang llvm-libs
rust-1.35.0-1.fc29.ppc64le
clang-7.0.1-6.fc29.ppc64le
llvm-libs-7.0.1-4.fc29.ppc64le