Appendix: testing quotas and cpusets

This is the appendix to the article “Avoiding CPU throttling in a containerized environment”, published on Uber's Engineering Blog by Joakim Recht and Yury Vostrikov.

Our first analysis included low-level benchmarking of the effect of quotas vs. cpusets on multi-threaded applications. In this section, we’ll go over the details of these benchmarks.

First, we create a new cpu,cpuacct cgroup and run a Go-based test load in it. Go supports controlling the number of threads that run Go code concurrently via the GOMAXPROCS environment variable. We can clearly observe that without setting GOMAXPROCS to the allocated number of cores, our test is heavily throttled.
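
As a quick illustration (ours, not part of the original benchmark), the following sketch prints the values the Go runtime picks up. By default GOMAXPROCS equals runtime.NumCPU(), which reflects the CPUs the process may run on but knows nothing about CFS quotas, which is why it has to be set explicitly:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current setting without changing it.
	// Unless the GOMAXPROCS environment variable is set, it defaults to
	// the number of CPUs the process may run on, not the cgroup quota.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	fmt.Println("NumCPU:    ", runtime.NumCPU())
}

With the 2-core quota used below, the benchmark would therefore normally be started as GOMAXPROCS=2 ./cgroup-lat to match the allocation; in this appendix we deliberately leave GOMAXPROCS unset to show the throttling effect.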

Next, we create a cpuset cgroup and add it to the existing cgroups of a process. Running our test load no longer exhibits excessive throttling.

Testing with quotas

First, we create a cpu cgroup called lat and allocate 200 ms of CPU time per 100 ms of wall-clock time (100 ms being the default cpu.cfs_period_us), that is, two full cores. For convenience we use the commands from the cgroup-tools package to do this, but cgroups can also be managed directly by writing to the files under /sys/fs/cgroup:

# cgcreate -gcpu,cpuacct:lat
# echo 200000 > /sys/fs/cgroup/cpu/lat/cpu.cfs_quota_us 
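
As a small sanity check (ours, not from the article), the effective core count can be computed by reading the quota and period files of the cgroup:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readInt reads a single integer value from a cgroup control file.
func readInt(path string) int64 {
	b, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	v, err := strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
	if err != nil {
		panic(err)
	}
	return v
}

func main() {
	quota := readInt("/sys/fs/cgroup/cpu/lat/cpu.cfs_quota_us")   // 200000 after the echo above
	period := readInt("/sys/fs/cgroup/cpu/lat/cpu.cfs_period_us") // 100000 by default
	fmt.Printf("effective cores: %.1f\n", float64(quota)/float64(period))
}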

We then create a new shell within this cgroup:

# cgexec -gcpu,cpuacct:lat bash

Inside this shell, we run a latency benchmark. The benchmark spawns 32 CPU-intensive threads and measures the latency of a 10 ms sleep. Because of how CPU quota works, the busy threads consume the entire 200 ms allocation early in each 100 ms scheduling period, so a thread waking from its sleep has to wait for the next period to start. Throttling therefore shows up in the latency distribution histogram as a large spike around 100 ms instead of the expected ~10 ms:

# ./cgroup-lat 
14ms-34ms    0.667%  ▏          1
34ms-55ms    0.667%  ▏          1
55ms-76ms    0%      ▏            
76ms-96ms    0.667%  ▏          1  
96ms-117ms   96.7%   █████▏   145
117ms-138ms  0%      ▏       
138ms-158ms  0%      ▏       
158ms-179ms  0%      ▏       
179ms-200ms  1.33%   ▏          2

We can also verify this by inspecting cpu.stat of the cpu:lat cgroup. nr_throttled is the number of scheduling periods in which the cgroup was throttled; in this case it was throttled in 150 out of 161 periods:

# cat /sys/fs/cgroup/cpu/lat/cpu.stat | head -2
nr_periods 161
nr_throttled 150
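
These counters are easy to track programmatically as well. The sketch below (ours, not from the article) parses cpu.stat and reports the fraction of scheduling periods in which the cgroup was throttled:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("/sys/fs/cgroup/cpu/lat/cpu.stat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// cpu.stat contains lines of the form "nr_periods 161", "nr_throttled 150", ...
	stats := map[string]float64{}
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		stats[fields[0]] = v
	}
	if err := s.Err(); err != nil {
		panic(err)
	}

	if stats["nr_periods"] > 0 {
		fmt.Printf("throttled in %.0f of %.0f periods (%.1f%%)\n",
			stats["nr_throttled"], stats["nr_periods"],
			100*stats["nr_throttled"]/stats["nr_periods"])
	}
}

With the numbers above this prints roughly “throttled in 150 of 161 periods (93.2%)”.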

Testing with cpuset

To test the same scenario with cpusets, we create a new cpuset:lat cgroup and limit it to the first two CPU cores. We also restrict memory allocations to the first NUMA node via cpuset.mems (a cpuset cgroup needs both cpuset.cpus and cpuset.mems set before tasks can be attached), although memory placement is not relevant for this CPU-bound test:

# cgcreate -gcpuset:lat
# echo 0-1 > /sys/fs/cgroup/cpuset/lat/cpuset.cpus
# echo 0 > /sys/fs/cgroup/cpuset/lat/cpuset.mems

We then put the current shell (which is already in the cpu:lat quota cgroup) into the newly created cpuset cgroup. The result is that the current bash process, and anything it spawns, is limited by both the cpu:lat quota and the cpuset:lat core restriction.

# cgclassify -g cpuset:lat $$
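
To confirm that the shell really is a member of both cgroups, one can look at /proc/<pid>/cgroup (for example cat /proc/$$/cgroup). The sketch below, which is ours rather than part of the article, prints the cpu- and cpuset-related lines for the current process:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Each line of /proc/self/cgroup has the form
	// "hierarchy-id:controller-list:path", e.g. "3:cpu,cpuacct:/lat".
	f, err := os.Open("/proc/self/cgroup")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		parts := strings.SplitN(s.Text(), ":", 3)
		if len(parts) != 3 {
			continue
		}
		for _, c := range strings.Split(parts[1], ",") {
			if c == "cpu" || c == "cpuacct" || c == "cpuset" {
				fmt.Printf("%-12s %s\n", parts[1], parts[2])
				break
			}
		}
	}
}

Run from the shell above, this should show the lat path for both the cpu,cpuacct and the cpuset hierarchies.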

We re-run the benchmark. Because the cpuset:lat cgroup limits the available cores to two, our benchmark program can no longer consume the entire CPU quota before the end of a scheduling period and therefore no longer gets heavily throttled. We can clearly observe this in the latency distribution:

# ./cgroup-lat 
19ms-22ms  97.7%   █████▏  716
22ms-24ms  0.409%  ▏         3
24ms-26ms  0%      ▏       
26ms-28ms  0.136%  ▏         1
28ms-31ms  0%      ▏       
31ms-33ms  0%      ▏       
33ms-35ms  0.136%  ▏         1
35ms-37ms  0%      ▏       
37ms-40ms  1.64%   ▏        12

Re-checking the throttling statistics confirms this. The counters are cumulative and include the previous test, so compared to the earlier reading the cgroup was throttled only 13 more times (163 - 150) across the 154 additional scheduling periods (315 - 161):

# cat /sys/fs/cgroup/cpu/lat/cpu.stat | head -2
nr_periods 315
nr_throttled 163

Benchmark source

The benchmark is a simple Go program that starts 32 goroutines, each of which spins at full CPU. Alongside these, the main goroutine repeatedly sleeps for 10 ms, records how long each sleep actually took, and prints the recorded durations as a histogram.

package main

import (
	"fmt"
	"os"
	"time"

	"github.com/aybabtme/uniplot/histogram"
)

// burner spins on a local counter forever, keeping one OS thread
// (and therefore CPU quota) fully busy.
func burner() {
	var i int
	for {
		i++
	}
}

// worker measures how long each 10 ms sleep actually takes over a
// 15-second window and prints the distribution as a histogram.
func worker() {
	delay := time.Millisecond * 10
	var delays []float64
	deadline := time.Now().Add(time.Second * 15)

	for {
		t := time.Now()
		if t.After(deadline) {
			break
		}
		time.Sleep(delay)
		// Record the observed latency in nanoseconds; under heavy
		// throttling this is dominated by the wait for the next
		// scheduling period rather than by the 10 ms sleep itself.
		dur := time.Since(t)
		delays = append(delays, float64(dur))
	}
	hist := histogram.Hist(9, delays)
	histogram.Fprintf(os.Stdout, hist, histogram.Linear(5), func(v float64) string {
		return fmt.Sprintf("%dms", time.Duration(v).Milliseconds())
	})
}

func main() {
	// Start 32 CPU-burning goroutines, then measure sleep latency
	// from the main goroutine.
	for i := 0; i < 32; i++ {
		go burner()
	}
	worker()
}
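
A couple of notes on the source: each burner goroutine busy-loops on a local counter, so it keeps an OS thread, and hence CPU quota, fully occupied without voluntarily yielding, while the measurement runs in the main goroutine. To reproduce the runs above, build the program as cgroup-lat (its only external dependency is github.com/aybabtme/uniplot) and start it from the shells created in the previous sections.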