This document summarizes a bug in SLURM's acct_gather_energy/rapl plugin, which reports the energy consumption of jobs using the MSRs of the RAPL framework.
RAPL supports multiple power domains. A RAPL power domain is a physically meaningful domain (e.g., processor package, DRAM) for power management. Each power domain reports its own energy consumption.
RAPL provides the following power domains for both measuring and limiting energy consumption:
- Package: Package (PKG) domain measures the energy consumption of the entire socket. It includes the consumption of all the cores, integrated graphics and also the uncore components (last level caches, memory controller).
- Power Plane 0 (PP0) : measures the energy consumption of all processor cores on the socket.
- Power Plane 1 (PP1) : measures the energy consumption of processor graphics (GPU) on the socket (desktop models only).
- DRAM: measures the energy consumption of random access memory (RAM) attached to the integrated memory controller.
Support for the different power domains varies with the processor model. This is a key point that comes up again in the bug description.
RAPL energy counters can be fetched in two modes:
- Model Specific Registers (MSR)
- Powercap framework (/sys/class/powercap)
There are other ways to fetch these counters, such as eBPF events, but they are outside the scope of this document.
The RAPL energy counters can be accessed through model-specific registers (MSRs). The counters are 32-bit registers that indicate the energy consumed since the processor was booted up. The counters are updated approximately once every millisecond. The energy is counted in multiples of model-specific energy units. This detail is important: it is the source of the bug discussed later in the document. The MSRs can be accessed directly on Linux using the msr driver in the kernel. Reading RAPL domain values directly from MSRs requires detecting the CPU model and reading the RAPL energy units before reading the RAPL domain counters.
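To make the unit handling concrete, here is a minimal sketch of how a raw MSR_RAPL_POWER_UNIT value could be decoded and applied to two samples of a 32-bit energy counter. The function names and the sample raw value are illustrative assumptions, not taken from any existing tool; the bit layout (power units in bits 3:0, energy units in bits 12:8, time units in bits 19:16) follows the Intel SDM.

```python
COUNTER_BITS = 32  # RAPL energy status counters are 32 bits wide


def decode_rapl_units(raw: int) -> dict:
    """Split a raw MSR_RAPL_POWER_UNIT value into its three units.

    Each field holds an exponent n and the unit is 1 / 2**n:
    power units in bits 3:0, energy units in bits 12:8,
    time units in bits 19:16.
    """
    return {
        "power_w": 1.0 / (1 << (raw & 0xF)),
        "energy_j": 1.0 / (1 << ((raw >> 8) & 0x1F)),
        "time_s": 1.0 / (1 << ((raw >> 16) & 0xF)),
    }


def energy_delta_joules(before: int, after: int, energy_unit_j: float) -> float:
    """Energy between two raw counter samples, tolerating one wraparound."""
    ticks = (after - before) % (1 << COUNTER_BITS)
    return ticks * energy_unit_j


# Hypothetical raw register value chosen for illustration. On a real machine
# it would be read from /dev/cpu/<n>/msr at offset 0x606, which requires
# root privileges and the msr kernel module.
units = decode_rapl_units(0x000A0E03)
print(units)

# Two hypothetical samples where the counter wrapped between reads.
print(energy_delta_joules(0xFFFFFF00, 0x00000100, units["energy_j"]))
```

The modulo in energy_delta_joules only compensates for a single wraparound, so the counter still has to be sampled more often than it overflows.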
From Linux kernel version 3.13 onwards, RAPL values can be read using the Power Capping Framework. The Linux Power Capping framework exposes power capping devices to user space via sysfs in the form of a tree of objects mounted at /sys/class/powercap/intel-rapl.
Behind the scenes, the Powercap framework reads the MSRs and exposes their values under /sys/class/powercap/intel-rapl, albeit at a lower update frequency than the MSRs themselves.
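As an illustration of how simple the sysfs route is, the sketch below walks the powercap tree and reads each zone's name and energy_uj files. The function name is hypothetical, and the sketch assumes the usual layout in which zones appear as intel-rapl:0, intel-rapl:0:0 and so on; on recent kernels energy_uj may be readable by root only, in which case the zone is skipped.

```python
from pathlib import Path


def read_powercap_energy(root: str = "/sys/class/powercap") -> dict:
    """Return {zone directory: (zone name, cumulative energy in joules)}."""
    readings = {}
    base = Path(root)
    if not base.is_dir():
        return readings  # powercap is not available on this system
    for zone in sorted(base.glob("intel-rapl:*")):
        try:
            name = (zone / "name").read_text().strip()
            microjoules = int((zone / "energy_uj").read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable zone (e.g. permission denied), skip it
        readings[zone.name] = (name, microjoules / 1e6)
    return readings


for zone, (name, joules) in read_powercap_energy().items():
    print(f"{zone:16s} {name:12s} {joules:.6f} J")
```

The values are cumulative counters, so energy over an interval is the difference between two calls, and power is that difference divided by the elapsed time.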
When energy counters are needed at a very high sampling frequency, MSRs should be preferred over the Powercap framework. Typically, this need arises when profiling code. In other cases, the Powercap framework can be used to fetch energy counters, as it is easier and more straightforward to read.
As noted in the MSRs section, all the counters are exposed as 32-bit registers, including the energy units. The names and addresses of the MSRs are described in Intel's Software Developer's Manual (SDM) Volume 3 (Section 15.10.1). The name of the power unit MSR is MSR_RAPL_POWER_UNIT.
As noted in the RAPL framework section, RAPL exposes different domains, and we are mostly interested in the Package and DRAM domains. Not all processors expose DRAM domain counters, however. The problem comes from the fact that Package and DRAM domain counters use different power units, which is not well documented. To complicate matters further, the difference in power units between the Package and DRAM domains is processor specific. Intel sheds some light on this in SDM Volume 4. According to this document, the DRAM domain counters use a power unit of 15.3 microjoules on the following micro-architectures:
- Haswell (Xeon E5 v3, Table 2-32 in SDM Volume 4)
- Broadwell (Xeon E5 v4, Table 2-36 in SDM Volume 4)
- Skylake, Cascade Lake and Copper Lake (Models with 06_55H CPUID, Table 2-50 in SDM Volume 4)
- IceLake (Table 2.17.7 in SDM Volume 4)
The typical power unit for the Package domain is around 60 microjoules (although it depends on the processor and can be fetched from MSR_RAPL_POWER_UNIT). Thus, failing to account for the different power unit of the DRAM domain, and applying the Package power unit to the DRAM counters instead, results in a considerable over-estimation of DRAM energy: roughly a factor of four with these two values. This is what triggers the bug in SLURM's acct_gather_energy/rapl plugin.
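The effect of the unit mix-up can be worked through with a few lines of arithmetic. The sketch below scales the same raw DRAM counter increment once with the fixed DRAM unit and once with the Package unit. The unit values are assumed to be 2^-16 J (about 15.3 microjoules) and 2^-14 J (about 61 microjoules), and the tick count is a made-up number for illustration.

```python
PKG_ENERGY_UNIT_J = 2 ** -14   # ~61 microjoules, as decoded from MSR_RAPL_POWER_UNIT
DRAM_ENERGY_UNIT_J = 2 ** -16  # ~15.3 microjoules, fixed DRAM unit on affected CPUs


def ticks_to_joules(raw_ticks: int, unit_j: float) -> float:
    """Convert a raw counter increment to joules using the given unit."""
    return raw_ticks * unit_j


raw_ticks = 2_000_000  # hypothetical DRAM counter increment over one interval

correct_j = ticks_to_joules(raw_ticks, DRAM_ENERGY_UNIT_J)
wrong_j = ticks_to_joules(raw_ticks, PKG_ENERGY_UNIT_J)

print(f"with DRAM unit:    {correct_j:.3f} J")
print(f"with Package unit: {wrong_j:.3f} J")
print(f"over-estimation factor: {wrong_j / correct_j:.1f}")
```

Whatever the tick count, the ratio between the two results depends only on the two units, which is why the error shows up as a constant multiplier on DRAM energy.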
Looking at the source code of SLURM's acct_gather_energy_rapl plugin, it is clear that it fetches MSR_RAPL_POWER_UNIT and uses the same unit for both the Package and DRAM domains.
To strengthen the case for a bug in the SLURM plugin, the powercap driver in the Linux kernel takes this difference in power units into account when estimating Package and DRAM energy consumption. The specific struct for the DRAM power unit has been defined here, and it is used for the HASWELL_X, BROADWELL_X, SKYLAKE_X, ICELAKE_X, ICELAKE_D, XEON_PHI_KNL and XEON_PHI_KNM micro-architectures.
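A user-space reader could apply the same idea with a small lookup on the CPU family and model. The following is only a sketch of that approach, not the kernel's implementation: the helper names are hypothetical, and the model numbers are my assumption of the values behind the kernel names listed above, so they should be checked against the kernel's intel-family.h before being relied upon.

```python
# Family 6 model numbers assumed to correspond to the kernel names above.
FIXED_DRAM_UNIT_MODELS = {
    0x3F: "HASWELL_X",
    0x4F: "BROADWELL_X",
    0x55: "SKYLAKE_X",
    0x6A: "ICELAKE_X",
    0x6C: "ICELAKE_D",
    0x57: "XEON_PHI_KNL",
    0x85: "XEON_PHI_KNM",
}
FIXED_DRAM_UNIT_J = 2 ** -16  # ~15.3 microjoules


def parse_family_model(cpuinfo_text: str):
    """Return (cpu family, model) of the first processor in /proc/cpuinfo text."""
    family = model = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "cpu family" and family is None:
            family = int(value)
        elif key == "model" and model is None:  # "model name" does not match
            model = int(value)
        if family is not None and model is not None:
            break
    return family, model


def dram_energy_unit(family, model, msr_energy_unit_j: float) -> float:
    """Pick the DRAM unit: fixed on listed models, else the MSR-derived unit."""
    if family == 6 and model in FIXED_DRAM_UNIT_MODELS:
        return FIXED_DRAM_UNIT_J
    return msr_energy_unit_j


sample = "cpu family : 6\nmodel : 85\nmodel name : Intel(R) Xeon(R) Gold 6248"
family, model = parse_family_model(sample)
print(family, model, dram_energy_unit(family, model, 2 ** -14))
```

On a real node the text would come from reading /proc/cpuinfo, and the third argument from the decoded MSR_RAPL_POWER_UNIT value.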
We can run some tests to verify this bug in SLURM's plugin. For the test, we borrow a very nice script from here that reports RAPL power counters using the MSR and Powercap frameworks and takes the different power units for Package and DRAM into account. We use a machine with the SKYLAKE_X architecture; here is the info from /proc/cpuinfo:
processor : 79
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
stepping : 7
microcode : 0x5003604
cpu MHz : 2499.999
cache size : 28160 KB
physical id : 1
siblings : 40
core id : 28
cpu cores : 20
apicid : 121
initial apicid : 121
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs taa itlb_multihit mmio_stale_data retbleed eibrs_pbrsb gds
bogomips : 5008.16
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Let us first verify the magnitude of the values from the powercap framework and the MSRs using the rapl-read.c script on this node.
Using the powercap framework, the output is:
$ gcc -o rapl-read rapl-read.c -lm
$ ./rapl-read -s
RAPL read -- use -s for sysfs, -p for perf_event, -m for msr
Found Skylake-X Processor type
0 (0), 1 (0), 2 (0), 3 (0), 4 (0), 5 (0), 6 (0), 7 (0)
8 (0), 9 (0), 10 (0), 11 (0), 12 (0), 13 (0), 14 (0), 15 (0)
16 (0), 17 (0), 18 (0), 19 (0), 20 (1), 21 (1), 22 (1), 23 (1)
24 (1), 25 (1), 26 (1), 27 (1), 28 (1), 29 (1), 30 (1), 31 (1)
32 (1), 33 (1), 34 (1), 35 (1), 36 (1), 37 (1), 38 (1), 39 (1)
40 (0), 41 (0), 42 (0), 43 (0), 44 (0), 45 (0), 46 (0), 47 (0)
48 (0), 49 (0), 50 (0), 51 (0), 52 (0), 53 (0), 54 (0), 55 (0)
56 (0), 57 (0), 58 (0), 59 (0), 60 (1), 61 (1), 62 (1), 63 (1)
64 (1), 65 (1), 66 (1), 67 (1), 68 (1), 69 (1), 70 (1), 71 (1)
72 (1), 73 (1), 74 (1), 75 (1), 76 (1), 77 (1), 78 (1), 79 (1)
Detected 80 cores in 2 packages
Trying sysfs powercap interface to gather results
Sleeping 1 second
Package 0
package-0 : 130.229830J
dram : 30.331990J
Package 1
package-1 : 116.775031J
dram : 29.468916J
The power consumption of the packages is ~247 W and of DRAM is ~60 W (readings are taken over 1 s).
Using MSRs, the output is:
# ./rapl-read -m
RAPL read -- use -s for sysfs, -p for perf_event, -m for msr
Found Skylake-X Processor type
0 (0), 1 (0), 2 (0), 3 (0), 4 (0), 5 (0), 6 (0), 7 (0)
8 (0), 9 (0), 10 (0), 11 (0), 12 (0), 13 (0), 14 (0), 15 (0)
16 (0), 17 (0), 18 (0), 19 (0), 20 (1), 21 (1), 22 (1), 23 (1)
24 (1), 25 (1), 26 (1), 27 (1), 28 (1), 29 (1), 30 (1), 31 (1)
32 (1), 33 (1), 34 (1), 35 (1), 36 (1), 37 (1), 38 (1), 39 (1)
40 (0), 41 (0), 42 (0), 43 (0), 44 (0), 45 (0), 46 (0), 47 (0)
48 (0), 49 (0), 50 (0), 51 (0), 52 (0), 53 (0), 54 (0), 55 (0)
56 (0), 57 (0), 58 (0), 59 (0), 60 (1), 61 (1), 62 (1), 63 (1)
64 (1), 65 (1), 66 (1), 67 (1), 68 (1), 69 (1), 70 (1), 71 (1)
72 (1), 73 (1), 74 (1), 75 (1), 76 (1), 77 (1), 78 (1), 79 (1)
Detected 80 cores in 2 packages
Trying /dev/msr interface to gather results
Listing paramaters for package #0
DRAM: Using 0.000015 instead of 0.000061
Power units = 0.125W
CPU Energy units = 0.00006104J
DRAM Energy units = 0.00001526J
Time units = 0.00097656s
Package thermal spec: 150.000W
Package minimum power: 69.000W
Package maximum power: 376.000W
Package maximum time window: 0.014648s
Package power limits are unlocked
Package power limit #1: 150.000W for 0.108398s (enabled, not_clamped)
Package power limit #2: 180.000W for 0.054688s (enabled, clamped)
Listing paramaters for package #1
DRAM: Using 0.000015 instead of 0.000061
Power units = 0.125W
CPU Energy units = 0.00006104J
DRAM Energy units = 0.00001526J
Time units = 0.00097656s
Package thermal spec: 150.000W
Package minimum power: 69.000W
Package maximum power: 376.000W
Package maximum time window: 0.014648s
Package power limits are unlocked
Package power limit #1: 150.000W for 0.108398s (enabled, not_clamped)
Package power limit #2: 180.000W for 0.054688s (enabled, clamped)
Sleeping 1 second
Package 0:
Package energy: 132.407776J
PowerPlane0 (cores): 0.000000J
DRAM: 30.376007J
Package 1:
Package energy: 118.202454J
PowerPlane0 (cores): 0.000000J
DRAM: 29.264847J
Note: the energy measurements can overflow in 60s or so
so try to sample the counters more often than that.
The power consumption of the packages is ~250 W and of DRAM is ~60 W. These are in the same range as the values reported by the powercap framework.
Let's see what SLURM reports as the power consumption of this node using scontrol. Here is the output:
NodeName=[redacted] Arch=x86_64 CoresPerSocket=20
CPUAlloc=80 CPUEfctv=80 CPUTot=80 CPULoad=40.09
AvailableFeatures=prof
ActiveFeatures=prof
Gres=(null)
NodeAddr=[redacted] NodeHostName=[redacted] Version=23.02.6
OS=Linux 5.14.0-284.55.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Feb 19 16:57:59 EST 2024
RealMemory=191000 AllocMem=160000 FreeMem=154645 Sockets=2 Boards=1
MemSpecLimit=30000
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=[redacted]
BootTime=2024-06-27T08:51:18 SlurmdStartTime=2024-06-27T10:01:47
LastBusyTime=2024-07-14T23:09:03 ResumeAfterTime=None
CfgTRES=cpu=80,mem=191000M,billing=40
AllocTRES=cpu=80,mem=160000M
CapWatts=n/a
CurrentWatts=479 AveWatts=53
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
It is clear that the CurrentWatts value reported by SLURM (which comes from the RAPL plugin) is considerably higher than the values reported by the Powercap framework (which takes the difference in power units between Package and DRAM into account). This can be further verified by looking at the IPMI DCMI output, which gives the power consumption of the entire node. Here is the output:
$ ipmitool dcmi power reading
Instantaneous power reading: 363 Watts
Minimum during sampling period: 64 Watts
Maximum during sampling period: 466 Watts
Average power reading over sample period: 284 Watts
IPMI timestamp: Mon Jul 15 09:48:08 2024
Sampling period: 01566504 Seconds.
Power reading state is: activated
Thus the power reported by SLURM is higher than the power reported by IPMI, which is not possible (even the maximum power during the sampling period is lower than SLURM's value). Moreover, the power reported by the SLURM RAPL plugin covers only Package and DRAM, whereas IPMI gives the power consumption of the whole node. So, theoretically, the power consumption reported by the SLURM plugin must stay below the one reported by IPMI. One can argue that the values were not taken at the same time, so the comparison might not be reliable. However, the historical IPMI power readings confirm that the values stayed around 360 W while the tests were made.
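As a rough consistency check, the MSR readings shown earlier can be recombined the way a reader that shares one unit across domains would see them. This is a back-of-the-envelope sketch under two assumptions: that the unit ratio is exactly 2^-14 / 2^-16 = 4, and that the 1 s rapl-read sample is representative of the load at the time SLURM sampled the node, which is not guaranteed since the readings were taken at different moments.

```python
# Per-package energy over 1 s, copied from the rapl-read -m output above.
PACKAGE_J = [132.407776, 118.202454]
DRAM_J = [30.376007, 29.264847]

# Assumed mix-up: Package unit (2**-14 J) applied to counters that really
# tick in the fixed DRAM unit (2**-16 J).
UNIT_RATIO = (2 ** -14) / (2 ** -16)


def node_power(package_j, dram_j, dram_scale=1.0, seconds=1.0):
    """Average Package + DRAM power in watts over the sampling interval."""
    return (sum(package_j) + dram_scale * sum(dram_j)) / seconds


correct_w = node_power(PACKAGE_J, DRAM_J)
inflated_w = node_power(PACKAGE_J, DRAM_J, dram_scale=UNIT_RATIO)

print(f"separate units: {correct_w:.0f} W")
print(f"shared unit:    {inflated_w:.0f} W")
```

With these inputs the shared-unit estimate comes out at roughly 489 W, close to the 479 W that SLURM reports, while the estimate with separate units is roughly 310 W and stays below the IPMI reading, as expected for Package plus DRAM only.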
Take the readings reported by SLURM's RAPL plugin with a grain of salt. These values are not reliable if the processor belongs to one of the architectures listed in the Deepdive into MSRs section and exposes the DRAM domain. A patch to fix this was proposed to SLURM a while ago, but it has gone stale and never received any attention from SchedMD.