As discussed with Carsten Bauer on the Julia Slack, I am trying to follow the tutorial https://github.com/RRZE-HPC/likwid/wiki/Tutorial%3A-Empirical-Roofline-Model in Julia. However, the measured operational intensity varies significantly between runs.
I followed the Julia setup at https://juliaperf.github.io/LIKWID.jl/stable/marker/#Example. A slightly modified version below serves as an MWE.
$ export JULIA_EXCLUSIVE=1
$ echo $JULIA_EXCLUSIVE
1
Results vary between something like
$ likwid-perfctr -c 0 -g MEM_DP -m julia --check-bounds=no --threads=1 perfctr.jl
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
CPU type: Intel Coffeelake processor
CPU clock: 3.70 GHz
Warning: The Marker API requires the application to run on the selected CPUs.
Warning: likwid-perfctr pins the application only when using the -C command line option.
Warning: LIKWID assumes that the application does it before the first instrumented code region is started.
Warning: You can use the string in the environment variable LIKWID_THREADS to pin you application to
Warning: to the CPUs specified after the -c command line option.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region matmul, Group 1: MEM_DP
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 0.011865 |
| call count | 1 |
+-------------------+------------+
+------------------------------------------+---------+------------+
| Event | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 95714710 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 54637730 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 42188300 |
| PWR_PKG_ENERGY | PWR0 | 0.3635 |
| PWR_DRAM_ENERGY | PWR3 | 0.0084 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | PMC0 | 0 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE | PMC1 | 786 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 52428800 |
| DRAM_READS | MBOX0C1 | 230678 |
| DRAM_WRITES | MBOX0C2 | 52570 |
+------------------------------------------+---------+------------+
+-----------------------------------+------------+
| Metric | HWThread 0 |
+-----------------------------------+------------+
| Runtime (RDTSC) [s] | 0.0119 |
| Runtime unhalted [s] | 0.0148 |
| Clock [MHz] | 4786.7012 |
| CPI | 0.5708 |
| Energy [J] | 0.3635 |
| Power [W] | 30.6379 |
| Energy DRAM [J] | 0.0084 |
| Power DRAM [W] | 0.7099 |
| DP [MFLOP/s] | 17674.8502 |
| AVX DP [MFLOP/s] | 17674.7839 |
| Packed [MUOPS/s] | 4418.6960 |
| Scalar [MUOPS/s] | 0.0662 |
| Memory load bandwidth [MBytes/s] | 1244.2578 |
| Memory load data volume [GBytes] | 0.0148 |
| Memory evict bandwidth [MBytes/s] | 283.5582 |
| Memory evict data volume [GBytes] | 0.0034 |
| Memory bandwidth [MBytes/s] | 1527.8159 |
| Memory data volume [GBytes] | 0.0181 |
| Operational intensity | 11.5687 |
+-----------------------------------+------------+
and
$ likwid-perfctr -c 0 -g MEM_DP -m julia --check-bounds=no --threads=1 perfctr.jl
--------------------------------------------------------------------------------
CPU name: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
CPU type: Intel Coffeelake processor
CPU clock: 3.70 GHz
Warning: The Marker API requires the application to run on the selected CPUs.
Warning: likwid-perfctr pins the application only when using the -C command line option.
Warning: LIKWID assumes that the application does it before the first instrumented code region is started.
Warning: You can use the string in the environment variable LIKWID_THREADS to pin you application to
Warning: to the CPUs specified after the -c command line option.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region matmul, Group 1: MEM_DP
+-------------------+------------+
| Region Info | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] | 0.013603 |
| call count | 1 |
+-------------------+------------+
+------------------------------------------+---------+------------+
| Event | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
| INSTR_RETIRED_ANY | FIXC0 | 95690520 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 59653700 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 47162350 |
| PWR_PKG_ENERGY | PWR0 | 0.4736 |
| PWR_DRAM_ENERGY | PWR3 | 0.0139 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | PMC0 | 0 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE | PMC1 | 786 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 52428800 |
| DRAM_READS | MBOX0C1 | 689521 |
| DRAM_WRITES | MBOX0C2 | 115639 |
+------------------------------------------+---------+------------+
+-----------------------------------+------------+
| Metric | HWThread 0 |
+-----------------------------------+------------+
| Runtime (RDTSC) [s] | 0.0136 |
| Runtime unhalted [s] | 0.0161 |
| Clock [MHz] | 4674.9392 |
| CPI | 0.6234 |
| Energy [J] | 0.4736 |
| Power [W] | 34.8176 |
| Energy DRAM [J] | 0.0139 |
| Power DRAM [W] | 1.0230 |
| DP [MFLOP/s] | 15416.5863 |
| AVX DP [MFLOP/s] | 15416.5285 |
| Packed [MUOPS/s] | 3854.1321 |
| Scalar [MUOPS/s] | 0.0578 |
| Memory load bandwidth [MBytes/s] | 3244.0247 |
| Memory load data volume [GBytes] | 0.0441 |
| Memory evict bandwidth [MBytes/s] | 544.0527 |
| Memory evict data volume [GBytes] | 0.0074 |
| Memory bandwidth [MBytes/s] | 3788.0774 |
| Memory data volume [GBytes] | 0.0515 |
| Operational intensity | 4.0698 |
+-----------------------------------+------------+
The same phenomenon happens when I use -C 0 instead of -c 0. The variations are at least 50%.
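To see where the variation comes from, the operational intensity can be recomputed by hand from the raw counters of the two runs above. This is only a sketch: it assumes that the MEM_DP group counts 1, 2, and 4 FLOPs per scalar, 128-bit, and 256-bit packed double instruction, and that each DRAM_READS or DRAM_WRITES event transfers one 64-byte cache line. The function name `operational_intensity` is made up for illustration and is not part of LIKWID.jl.

```julia
# Counter values are copied from the two runs above (region matmul, HWThread 0).
# Assumption: one DRAM_READS/DRAM_WRITES event corresponds to one 64-byte cache line.
const CACHE_LINE_BYTES = 64

# Hypothetical helper: FLOPs per byte of DRAM traffic.
function operational_intensity(scalar, packed128, packed256, dram_reads, dram_writes)
    # Assumed DP FLOPs per retired instruction: scalar = 1, 128-bit = 2, 256-bit = 4.
    flops = scalar + 2 * packed128 + 4 * packed256
    bytes = (dram_reads + dram_writes) * CACHE_LINE_BYTES
    return flops / bytes
end

oi_run1 = operational_intensity(786, 0, 52428800, 230678, 52570)
oi_run2 = operational_intensity(786, 0, 52428800, 689521, 115639)

println("run 1: ", round(oi_run1; digits = 4))
println("run 2: ", round(oi_run2; digits = 4))
println("ratio: ", round(oi_run1 / oi_run2; digits = 2))
```

Under these assumptions the results match the reported 11.5687 and 4.0698. The FLOP counters are identical in both runs, so the whole difference comes from the DRAM read and write counts, which are roughly 2.8 times higher in the second run.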
Hm, I can't reproduce on our cluster. In three independent runs I get 75.5374, 74.1830, and 74.6613 for the operational intensity. Full output: