@ranocha
Created November 25, 2021 08:22
Empirical roofline model with LIKWID.jl - operational intensity varies

Operational intensities reported by LIKWID vary significantly

As discussed with Carsten Bauer on the Julia Slack, I am trying to follow the tutorial https://github.com/RRZE-HPC/likwid/wiki/Tutorial%3A-Empirical-Roofline-Model in Julia. However, the operational intensities reported by LIKWID vary significantly from run to run.

I followed the Julia setup at https://juliaperf.github.io/LIKWID.jl/stable/marker/#Example. A slightly modified version (perfctr.jl, listed at the end of this post) serves as an MWE.

$ export JULIA_EXCLUSIVE=1
$ echo $JULIA_EXCLUSIVE
1

Results vary between something like

$ likwid-perfctr -c 0 -g MEM_DP -m julia --check-bounds=no --threads=1 perfctr.jl 
--------------------------------------------------------------------------------
CPU name:       Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
CPU type:       Intel Coffeelake processor
CPU clock:      3.70 GHz
Warning: The Marker API requires the application to run on the selected CPUs.
Warning: likwid-perfctr pins the application only when using the -C command line option.
Warning: LIKWID assumes that the application does it before the first instrumented code region is started.
Warning: You can use the string in the environment variable LIKWID_THREADS to pin you application to
Warning: to the CPUs specified after the -c command line option.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region matmul, Group 1: MEM_DP
+-------------------+------------+
|    Region Info    | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] |   0.011865 |
|     call count    |          1 |
+-------------------+------------+

+------------------------------------------+---------+------------+
|                   Event                  | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
|             INSTR_RETIRED_ANY            |  FIXC0  |   95714710 |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  |   54637730 |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  |   42188300 |
|              PWR_PKG_ENERGY              |   PWR0  |     0.3635 |
|              PWR_DRAM_ENERGY             |   PWR3  |     0.0084 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |          0 |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |        786 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |   52428800 |
|                DRAM_READS                | MBOX0C1 |     230678 |
|                DRAM_WRITES               | MBOX0C2 |      52570 |
+------------------------------------------+---------+------------+

+-----------------------------------+------------+
|               Metric              | HWThread 0 |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     0.0119 |
|        Runtime unhalted [s]       |     0.0148 |
|            Clock [MHz]            |  4786.7012 |
|                CPI                |     0.5708 |
|             Energy [J]            |     0.3635 |
|             Power [W]             |    30.6379 |
|          Energy DRAM [J]          |     0.0084 |
|           Power DRAM [W]          |     0.7099 |
|            DP [MFLOP/s]           | 17674.8502 |
|          AVX DP [MFLOP/s]         | 17674.7839 |
|          Packed [MUOPS/s]         |  4418.6960 |
|          Scalar [MUOPS/s]         |     0.0662 |
|  Memory load bandwidth [MBytes/s] |  1244.2578 |
|  Memory load data volume [GBytes] |     0.0148 |
| Memory evict bandwidth [MBytes/s] |   283.5582 |
| Memory evict data volume [GBytes] |     0.0034 |
|    Memory bandwidth [MBytes/s]    |  1527.8159 |
|    Memory data volume [GBytes]    |     0.0181 |
|       Operational intensity       |    11.5687 |
+-----------------------------------+------------+

and

$ likwid-perfctr -c 0 -g MEM_DP -m julia --check-bounds=no --threads=1 perfctr.jl 
--------------------------------------------------------------------------------
CPU name:       Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
CPU type:       Intel Coffeelake processor
CPU clock:      3.70 GHz
Warning: The Marker API requires the application to run on the selected CPUs.
Warning: likwid-perfctr pins the application only when using the -C command line option.
Warning: LIKWID assumes that the application does it before the first instrumented code region is started.
Warning: You can use the string in the environment variable LIKWID_THREADS to pin you application to
Warning: to the CPUs specified after the -c command line option.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region matmul, Group 1: MEM_DP
+-------------------+------------+
|    Region Info    | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] |   0.013603 |
|     call count    |          1 |
+-------------------+------------+

+------------------------------------------+---------+------------+
|                   Event                  | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
|             INSTR_RETIRED_ANY            |  FIXC0  |   95690520 |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  |   59653700 |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  |   47162350 |
|              PWR_PKG_ENERGY              |   PWR0  |     0.4736 |
|              PWR_DRAM_ENERGY             |   PWR3  |     0.0139 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |          0 |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |        786 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |   52428800 |
|                DRAM_READS                | MBOX0C1 |     689521 |
|                DRAM_WRITES               | MBOX0C2 |     115639 |
+------------------------------------------+---------+------------+

+-----------------------------------+------------+
|               Metric              | HWThread 0 |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     0.0136 |
|        Runtime unhalted [s]       |     0.0161 |
|            Clock [MHz]            |  4674.9392 |
|                CPI                |     0.6234 |
|             Energy [J]            |     0.4736 |
|             Power [W]             |    34.8176 |
|          Energy DRAM [J]          |     0.0139 |
|           Power DRAM [W]          |     1.0230 |
|            DP [MFLOP/s]           | 15416.5863 |
|          AVX DP [MFLOP/s]         | 15416.5285 |
|          Packed [MUOPS/s]         |  3854.1321 |
|          Scalar [MUOPS/s]         |     0.0578 |
|  Memory load bandwidth [MBytes/s] |  3244.0247 |
|  Memory load data volume [GBytes] |     0.0441 |
| Memory evict bandwidth [MBytes/s] |   544.0527 |
| Memory evict data volume [GBytes] |     0.0074 |
|    Memory bandwidth [MBytes/s]    |  3788.0774 |
|    Memory data volume [GBytes]    |     0.0515 |
|       Operational intensity       |     4.0698 |
+-----------------------------------+------------+

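The same phenomenon occurs when I use -C 0 instead of -c 0. The operational intensity varies by at least 50% between runs; in the two runs above it is 11.5687 vs. 4.0698, while the floating-point counters are identical.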
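To see where the variation comes from, the reported operational intensity can be cross-checked against the raw event counts in the tables above. The following is a minimal sketch, assuming that each DRAM_READS / DRAM_WRITES event corresponds to one 64-byte cache line and that the FLOP count weights scalar, 128-bit, 256-bit, and 512-bit packed double-precision instructions by 1, 2, 4, and 8. The helper names `count_flops` and `operational_intensity` are made up for this illustration and are not part of LIKWID.jl.

```julia
# Cross-check of the "Operational intensity" metric from raw event counts.
# Assumption: one DRAM read/write event equals one 64-byte cache line.
const CACHE_LINE_BYTES = 64

# Double-precision FLOPs from the FP_ARITH_INST_RETIRED_* counts:
# scalar = 1, 128-bit packed = 2, 256-bit packed = 4, 512-bit packed = 8.
count_flops(; scalar, packed128 = 0, packed256 = 0, packed512 = 0) =
    scalar + 2 * packed128 + 4 * packed256 + 8 * packed512

# Operational intensity in FLOP per byte of DRAM traffic.
operational_intensity(flops, reads, writes) =
    flops / ((reads + writes) * CACHE_LINE_BYTES)

# The FP counters are the same in both Coffee Lake runs shown above.
flops = count_flops(scalar = 786, packed256 = 52_428_800)

# Only the DRAM read/write counts differ between the two runs.
oi_run1 = operational_intensity(flops, 230_678, 52_570)
oi_run2 = operational_intensity(flops, 689_521, 115_639)

println(round(oi_run1, digits = 4))  # 11.5687
println(round(oi_run2, digits = 4))  # 4.0698
```

Under these assumptions the sketch reproduces both reported values, so the difference between the runs comes entirely from the DRAM read and write counts; the floating-point work is the same.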

# perfctr.jl
using LIKWID
using LinearAlgebra
using Octavian

Marker.init()

A = rand(128, 64)
B = rand(64, 128)
C = zeros(128, 128)

# warm-up call so that compilation is not part of the measured region
matmul!(C, A, B)

Marker.startregion("matmul")
for _ in 1:100
    matmul!(C, A, B)
end
Marker.stopregion("matmul")

Marker.close()
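
As a sanity check on the counters, the expected floating-point work and the size of the working set follow directly from the matrix dimensions in perfctr.jl. This is a small sketch under the usual assumption that a dense matrix product of an m×k and a k×n matrix costs 2·m·k·n FLOPs; the variable names are chosen for this illustration only.

```julia
# Sizes from perfctr.jl: C (m×n) = A (m×k) * B (k×n), all Float64.
m, k, n = 128, 64, 128
repetitions = 100

# A dense matrix product performs m*k*n multiplications and as many additions.
expected_flops = repetitions * 2 * m * k * n

# Bytes occupied by A, B, and C together.
working_set_bytes = sizeof(Float64) * (m * k + k * n + m * n)

println(expected_flops)            # 209715200
println(working_set_bytes ÷ 1024)  # 256 (KiB)
```

The expected FLOP count equals 4 × 52428800 (the 256-bit packed counter on the Coffee Lake machine) and 8 × 26214400 (the 512-bit packed counter in the Skylake SP outputs in this thread), so the floating-point side of the measurement looks consistent. The three matrices together occupy only 256 KiB, far less than the reported memory data volumes of 0.0181 and 0.0515 GBytes.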
@carstenbauer

Hm, I can't reproduce this on our cluster. In three independent runs I get 74.6613, 74.1830, and 75.5374 for the operational intensity.

Full output:

➜  bauerc@cn-0252 trixi-likwid  sh perfctr.sh
--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
CPU type:	Intel Skylake SP processor
CPU clock:	2.39 GHz
Warning: The Marker API requires the application to run on the selected CPUs.
Warning: likwid-perfctr pins the application only when using the -C command line option.
Warning: LIKWID assumes that the application does it before the first instrumented code region is started.
Warning: You can use the string in the environment variable LIKWID_THREADS to pin you application to
Warning: to the CPUs specified after the -c command line option.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region matmul, Group 1: MEM_DP
+-------------------+------------+
|    Region Info    | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] |   0.014636 |
|     call count    |          1 |
+-------------------+------------+

+------------------------------------------+---------+------------+
|                   Event                  | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
|             INSTR_RETIRED_ANY            |  FIXC0  |   59303200 |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  |   39972380 |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  |   33686690 |
|              PWR_PKG_ENERGY              |   PWR0  |     1.5754 |
|              PWR_DRAM_ENERGY             |   PWR3  |     0.1785 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |          0 |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |        764 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |          0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   26214400 |
|               CAS_COUNT_RD               | MBOX0C0 |       4103 |
|               CAS_COUNT_WR               | MBOX0C1 |       3373 |
|               CAS_COUNT_RD               | MBOX1C0 |       4245 |
|               CAS_COUNT_WR               | MBOX1C1 |       3357 |
|               CAS_COUNT_RD               | MBOX2C0 |       4199 |
|               CAS_COUNT_WR               | MBOX2C1 |       3391 |
|               CAS_COUNT_RD               | MBOX3C0 |       3885 |
|               CAS_COUNT_WR               | MBOX3C1 |       3207 |
|               CAS_COUNT_RD               | MBOX4C0 |       3797 |
|               CAS_COUNT_WR               | MBOX4C1 |       3192 |
|               CAS_COUNT_RD               | MBOX5C0 |       3855 |
|               CAS_COUNT_WR               | MBOX5C1 |       3285 |
+------------------------------------------+---------+------------+

+-----------------------------------+------------+
|               Metric              | HWThread 0 |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     0.0146 |
|        Runtime unhalted [s]       |     0.0167 |
|            Clock [MHz]            |  2841.0867 |
|                CPI                |     0.6740 |
|             Energy [J]            |     1.5754 |
|             Power [W]             |   107.6344 |
|          Energy DRAM [J]          |     0.1785 |
|           Power DRAM [W]          |    12.1975 |
|            DP [MFLOP/s]           | 14328.4039 |
|          AVX DP [MFLOP/s]         | 14328.3517 |
|          Packed [MUOPS/s]         |  1791.0440 |
|          Scalar [MUOPS/s]         |     0.0522 |
|  Memory read bandwidth [MBytes/s] |   105.3113 |
|  Memory read data volume [GBytes] |     0.0015 |
| Memory write bandwidth [MBytes/s] |    86.6006 |
| Memory write data volume [GBytes] |     0.0013 |
|    Memory bandwidth [MBytes/s]    |   191.9119 |
|    Memory data volume [GBytes]    |     0.0028 |
|       Operational intensity       |    74.6613 |
+-----------------------------------+------------+


➜  bauerc@cn-0252 trixi-likwid  sh perfctr.sh
--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
CPU type:	Intel Skylake SP processor
CPU clock:	2.39 GHz
Warning: The Marker API requires the application to run on the selected CPUs.
Warning: likwid-perfctr pins the application only when using the -C command line option.
Warning: LIKWID assumes that the application does it before the first instrumented code region is started.
Warning: You can use the string in the environment variable LIKWID_THREADS to pin you application to
Warning: to the CPUs specified after the -c command line option.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region matmul, Group 1: MEM_DP
+-------------------+------------+
|    Region Info    | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] |   0.014877 |
|     call count    |          1 |
+-------------------+------------+

+------------------------------------------+---------+------------+
|                   Event                  | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
|             INSTR_RETIRED_ANY            |  FIXC0  |   59303210 |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  |   40674460 |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  |   34246180 |
|              PWR_PKG_ENERGY              |   PWR0  |     1.5839 |
|              PWR_DRAM_ENERGY             |   PWR3  |     0.1783 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |          0 |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |        764 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |          0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   26214400 |
|               CAS_COUNT_RD               | MBOX0C0 |       4241 |
|               CAS_COUNT_WR               | MBOX0C1 |       3404 |
|               CAS_COUNT_RD               | MBOX1C0 |       4214 |
|               CAS_COUNT_WR               | MBOX1C1 |       3423 |
|               CAS_COUNT_RD               | MBOX2C0 |       4134 |
|               CAS_COUNT_WR               | MBOX2C1 |       3346 |
|               CAS_COUNT_RD               | MBOX3C0 |       3907 |
|               CAS_COUNT_WR               | MBOX3C1 |       3247 |
|               CAS_COUNT_RD               | MBOX4C0 |       3812 |
|               CAS_COUNT_WR               | MBOX4C1 |       3214 |
|               CAS_COUNT_RD               | MBOX5C0 |       3915 |
|               CAS_COUNT_WR               | MBOX5C1 |       3315 |
+------------------------------------------+---------+------------+

+-----------------------------------+------------+
|               Metric              | HWThread 0 |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     0.0149 |
|        Runtime unhalted [s]       |     0.0170 |
|            Clock [MHz]            |  2843.7605 |
|                CPI                |     0.6859 |
|             Energy [J]            |     1.5839 |
|             Power [W]             |   106.4682 |
|          Energy DRAM [J]          |     0.1783 |
|           Power DRAM [W]          |    11.9849 |
|            DP [MFLOP/s]           | 14096.6948 |
|          AVX DP [MFLOP/s]         | 14096.6434 |
|          Packed [MUOPS/s]         |  1762.0804 |
|          Scalar [MUOPS/s]         |     0.0514 |
|  Memory read bandwidth [MBytes/s] |   104.2062 |
|  Memory read data volume [GBytes] |     0.0016 |
| Memory write bandwidth [MBytes/s] |    85.8197 |
| Memory write data volume [GBytes] |     0.0013 |
|    Memory bandwidth [MBytes/s]    |   190.0259 |
|    Memory data volume [GBytes]    |     0.0028 |
|       Operational intensity       |    74.1830 |
+-----------------------------------+------------+


➜  bauerc@cn-0252 trixi-likwid  sh perfctr.sh
--------------------------------------------------------------------------------
CPU name:	Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
CPU type:	Intel Skylake SP processor
CPU clock:	2.39 GHz
Warning: The Marker API requires the application to run on the selected CPUs.
Warning: likwid-perfctr pins the application only when using the -C command line option.
Warning: LIKWID assumes that the application does it before the first instrumented code region is started.
Warning: You can use the string in the environment variable LIKWID_THREADS to pin you application to
Warning: to the CPUs specified after the -c command line option.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region matmul, Group 1: MEM_DP
+-------------------+------------+
|    Region Info    | HWThread 0 |
+-------------------+------------+
| RDTSC Runtime [s] |   0.015037 |
|     call count    |          1 |
+-------------------+------------+

+------------------------------------------+---------+------------+
|                   Event                  | Counter | HWThread 0 |
+------------------------------------------+---------+------------+
|             INSTR_RETIRED_ANY            |  FIXC0  |   59303330 |
|           CPU_CLK_UNHALTED_CORE          |  FIXC1  |   40914770 |
|           CPU_CLK_UNHALTED_REF           |  FIXC2  |   34428480 |
|              PWR_PKG_ENERGY              |   PWR0  |     1.5893 |
|              PWR_DRAM_ENERGY             |   PWR3  |     0.1764 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE |   PMC0  |          0 |
|    FP_ARITH_INST_RETIRED_SCALAR_DOUBLE   |   PMC1  |        764 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE |   PMC2  |          0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE |   PMC3  |   26214400 |
|               CAS_COUNT_RD               | MBOX0C0 |       4214 |
|               CAS_COUNT_WR               | MBOX0C1 |       3091 |
|               CAS_COUNT_RD               | MBOX1C0 |       4322 |
|               CAS_COUNT_WR               | MBOX1C1 |       3262 |
|               CAS_COUNT_RD               | MBOX2C0 |       4214 |
|               CAS_COUNT_WR               | MBOX2C1 |       3161 |
|               CAS_COUNT_RD               | MBOX3C0 |       4003 |
|               CAS_COUNT_WR               | MBOX3C1 |       3101 |
|               CAS_COUNT_RD               | MBOX4C0 |       3946 |
|               CAS_COUNT_WR               | MBOX4C1 |       3090 |
|               CAS_COUNT_RD               | MBOX5C0 |       3905 |
|               CAS_COUNT_WR               | MBOX5C1 |       3071 |
+------------------------------------------+---------+------------+

+-----------------------------------+------------+
|               Metric              | HWThread 0 |
+-----------------------------------+------------+
|        Runtime (RDTSC) [s]        |     0.0150 |
|        Runtime unhalted [s]       |     0.0171 |
|            Clock [MHz]            |  2845.4127 |
|                CPI                |     0.6899 |
|             Energy [J]            |     1.5893 |
|             Power [W]             |   105.6947 |
|          Energy DRAM [J]          |     0.1764 |
|           Power DRAM [W]          |    11.7328 |
|            DP [MFLOP/s]           | 13946.9871 |
|          AVX DP [MFLOP/s]         | 13946.9363 |
|          Packed [MUOPS/s]         |  1743.3670 |
|          Scalar [MUOPS/s]         |     0.0508 |
|  Memory read bandwidth [MBytes/s] |   104.7212 |
|  Memory read data volume [GBytes] |     0.0016 |
| Memory write bandwidth [MBytes/s] |    79.9157 |
| Memory write data volume [GBytes] |     0.0012 |
|    Memory bandwidth [MBytes/s]    |   184.6369 |
|    Memory data volume [GBytes]    |     0.0028 |
|       Operational intensity       |    75.5374 |
+-----------------------------------+------------+
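
To put the two sets of measurements side by side, one can compute the relative spread of the reported operational intensities. This is only a rough sketch with very few samples; `relative_spread` is a hypothetical helper defined here as (max − min) / mean.

```julia
# Operational intensities as reported in the outputs in this thread.
oi_coffeelake = [11.5687, 4.0698]            # i7-8700K, two runs
oi_skylake_sp = [74.6613, 74.1830, 75.5374]  # Xeon Gold 6148, three runs

# Relative spread of a sample: (max - min) / mean.
relative_spread(x) = (maximum(x) - minimum(x)) / (sum(x) / length(x))

println(round(100 * relative_spread(oi_coffeelake), digits = 1), " %")
println(round(100 * relative_spread(oi_skylake_sp), digits = 1), " %")
```

With these numbers the spread is roughly 96% on the Coffee Lake desktop and about 2% on the Skylake SP cluster node.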
