What happens when resource limit and request are set...

This research focuses on what happens inside Kubernetes, at the source code level, when you apply a Pod with a Limit and Request.

Basically, when you run kubectl apply -f pod.yaml, the manifest is converted to JSON and sent to kube-apiserver by your kubectl CLI (unless you use server-side apply, but that's a different story for now).

When kube-apiserver gets the request, it stores the Pod spec in etcd as the desired state. This is where our story begins.

The parameters flow in the following order:

Pod spec -> kubelet -> CRI Runtime -> OCI runtime -> cgroups
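
Before diving in, here is a minimal sketch of where Request and Limit actually live, written with the Go types from k8s.io/api/core/v1 instead of YAML (the container name, image, and values are made up for illustration):

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

func main() {
    // Hypothetical container spec: the same resources block you would
    // write in pod.yaml, as the v1.Container the kubelet receives.
    container := v1.Container{
        Name:  "app",
        Image: "nginx",
        Resources: v1.ResourceRequirements{
            Requests: v1.ResourceList{
                v1.ResourceCPU:    resource.MustParse("250m"),
                v1.ResourceMemory: resource.MustParse("64Mi"),
            },
            Limits: v1.ResourceList{
                v1.ResourceCPU:    resource.MustParse("500m"),
                v1.ResourceMemory: resource.MustParse("128Mi"),
            },
        },
    }

    // These are the same accessors we will see in generateContainerConfig later.
    fmt.Println(container.Resources.Requests.Cpu().MilliValue()) // 250
    fmt.Println(container.Resources.Limits.Cpu().MilliValue())   // 500
    fmt.Println(container.Resources.Limits.Memory().Value())     // 134217728
}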

kubelet

Kubelet has a SyncPod process which literally syncs each Pod spec to the existing containers. The kubelet process runs on each node, and it fetches all Pods whose nodeName matches its own.
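
As an aside, that "same nodename" filtering is just a field selector on spec.nodeName. Here is a rough client-go sketch of the idea (a hypothetical helper, not the kubelet's actual config source code):

package kubeletnotes

import (
    "context"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/fields"
    "k8s.io/client-go/kubernetes"
)

// podsOnNode lists only the Pods scheduled to nodeName by using a
// spec.nodeName field selector, the same idea the kubelet's apiserver
// pod source relies on.
func podsOnNode(ctx context.Context, client kubernetes.Interface, nodeName string) ([]v1.Pod, error) {
    list, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
        FieldSelector: fields.OneTermEqualSelector("spec.nodeName", nodeName).String(),
    })
    if err != nil {
        return nil, err
    }
    return list.Items, nil
}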

https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L661

func (m *kubeGenericRuntimeManager) SyncPod()

Container creation happens at Step 7 in this function, so let's jump there.

	// Step 7: start containers in podContainerChanges.ContainersToStart.
	for _, idx := range podContainerChanges.ContainersToStart {
		start("container", containerStartSpec(&pod.Spec.Containers[idx]))
	}

The start() being called here is defined right above it as a helper.

start := func(typeName string, spec *startSpec) error {
...
    // NOTE (aramase) podIPs are populated for single stack and dual stack clusters. Send only podIPs.
    if msg, err := m.startContainer(podSandboxID, podSandboxConfig, spec, pod, podStatus, pullSecrets, podIP, podIPs); err != nil {
        startContainerResult.Fail(err, msg)
...
}

This helper calls startContainer, which lives in kuberuntime_container.go: https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/kuberuntime/kuberuntime_container.go#L134

func (m *kubeGenericRuntimeManager) startContainer(podSandboxID string, podSandboxConfig *runtimeapi.PodSandboxConfig, spec *startSpec, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, podIP string, podIPs []string) (string, error) {
...
	// Step 2: create the container.
	// For a new container, the RestartCount should be 0
	restartCount := 0
	containerStatus := podStatus.FindContainerStatusByName(container.Name)
	if containerStatus != nil {
		restartCount = containerStatus.RestartCount + 1
	}

...

	containerConfig, cleanupAction, err := m.generateContainerConfig(container, pod, restartCount, podIP, imageRef, podIPs, target)
	if cleanupAction != nil {
		defer cleanupAction()
	}

OK, generateContainerConfig seems to build the config that will be passed to the CRI. I found cpuRequest and cpuLimit here.

https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/kuberuntime/kuberuntime_container_linux.go#L55-L58

var cpuShares int64
cpuRequest := container.Resources.Requests.Cpu()
cpuLimit := container.Resources.Limits.Cpu()
memoryLimit := container.Resources.Limits.Memory().Value()
oomScoreAdj := int64(qos.GetContainerOOMScoreAdjust(pod, container,
    int64(m.machineInfo.MemoryCapacity)))
...
if m.cpuCFSQuota {
  // if cpuLimit.Amount is nil, then the appropriate default value is returned
  // to allow full usage of cpu resource.
  cpuPeriod := int64(quotaPeriod)
  if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.CPUCFSQuotaPeriod) {
    cpuPeriod = int64(m.cpuCFSQuotaPeriod.Duration / time.Microsecond)
  }
  cpuQuota := milliCPUToQuota(cpuLimit.MilliValue(), cpuPeriod)
  lc.Resources.CpuQuota = cpuQuota
  lc.Resources.CpuPeriod = cpuPeriod
}

OK, so these CpuQuota and CpuPeriod get set in the config. milliCPUToQuota simply converts the milliCPU value into the quota/period numbers that cgroups understands.

Ref. https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/kuberuntime/helpers_linux.go#L46

func milliCPUToQuota(milliCPU int64, period int64) (quota int64) {
	// CFS quota is measured in two values:
	//  - cfs_period_us=100ms (the amount of time to measure usage across)
	//  - cfs_quota=20ms (the amount of cpu time allowed to be used across a period)
	// so in the above example, you are limited to 20% of a single CPU
	// for multi-cpu environments, you just scale equivalent amounts
	// see https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt for details
	if milliCPU == 0 {
		return
	}

	// we then convert your milliCPU to a value normalized over a period
	quota = (milliCPU * period) / milliCPUToCPU

	// quota needs to be a minimum of 1ms.
	if quota < minQuotaPeriod {
		quota = minQuotaPeriod
	}

	return
}

FYI: CFS (the Completely Fair Scheduler) is the default process scheduler in the Linux kernel, and the quota comes from its bandwidth control feature. https://en.wikipedia.org/wiki/Completely_Fair_Scheduler

It basically converts the human-readable CPU limit into a cgroups-friendly quota value in microseconds.
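
To make that concrete, here is a standalone sketch of the same arithmetic with the usual constants (1000 milliCPU per CPU, 100ms default period, 1ms minimum quota):

package main

import "fmt"

const (
    milliCPUToCPU  = 1000   // milliCPU per CPU
    quotaPeriod    = 100000 // default CFS period: 100ms in microseconds
    minQuotaPeriod = 1000   // minimum quota: 1ms in microseconds
)

// same conversion as kubelet's milliCPUToQuota
func milliCPUToQuota(milliCPU, period int64) int64 {
    if milliCPU == 0 {
        return 0
    }
    quota := (milliCPU * period) / milliCPUToCPU
    if quota < minQuotaPeriod {
        quota = minQuotaPeriod
    }
    return quota
}

func main() {
    // limits.cpu: 500m -> 50ms of CPU time every 100ms period (50% of one CPU)
    fmt.Println(milliCPUToQuota(500, quotaPeriod)) // 50000
    // limits.cpu: 2 -> 200ms every 100ms period (two full CPUs)
    fmt.Println(milliCPUToQuota(2000, quotaPeriod)) // 200000
    // limits.cpu: 5m -> would be 500us, clamped to the 1ms minimum
    fmt.Println(milliCPUToQuota(5, quotaPeriod)) // 1000
}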

Appendix: This implementation came from https://github.com/google/lmctfy, and it is now part of libcontainer, which contains the core logic of runc.

Spoiler: these CpuQuota and CpuPeriod values are used in runc too, so they won't be converted again; they are passed along as-is. https://github.com/opencontainers/runc/blob/master/update.go#L271-L272

In short, kubelet converts Request and Limit into cgroups values, which are then just passed down to the CRI/OCI runtime.
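
One thing the excerpt above didn't show is the Request side: it becomes CpuShares via a sibling helper, milliCPUToShares. Here is a hedged sketch of that conversion (assuming the usual 1024 shares per full CPU and the kernel minimum of 2):

package main

import "fmt"

const (
    milliCPUToCPU = 1000
    sharesPerCPU  = 1024 // cgroup cpu.shares for one full CPU
    minShares     = 2    // kernel-enforced minimum
)

// sketch of kubelet's milliCPUToShares: requests.cpu -> cpu.shares
func milliCPUToShares(milliCPU int64) int64 {
    if milliCPU == 0 {
        // a zero request still gets the minimum so the container can run
        return minShares
    }
    shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
    if shares < minShares {
        return minShares
    }
    return shares
}

func main() {
    fmt.Println(milliCPUToShares(250))  // requests.cpu: 250m -> 256 shares
    fmt.Println(milliCPUToShares(1000)) // requests.cpu: 1    -> 1024 shares
}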

How kubelet calls CRI

Anyway, kubelet finally calls CreateContainer() to invoke the CRI runtime over gRPC.

https://github.com/kubernetes/kubernetes/blob/release-1.19/pkg/kubelet/kuberuntime/kuberuntime_container.go#L170

https://github.com/kubernetes/kubernetes/blob/release-1.19/staging/src/k8s.io/cri-api/pkg/apis/services.go#L35
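
Paraphrasing those two links: the config built by generateContainerConfig is handed to the CRI's CreateContainer RPC. Here is a hedged, standalone sketch of that RPC made directly against a containerd socket (socket path, image, IDs, and resource numbers are made up; the kubelet does this through its runtimeService client, and depending on your grpc-go version you may need a custom dialer for the unix socket):

package main

import (
    "context"
    "fmt"
    "time"

    "google.golang.org/grpc"
    runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Dial the CRI endpoint exposed by containerd's CRI plugin.
    conn, err := grpc.DialContext(ctx, "unix:///run/containerd/containerd.sock", grpc.WithInsecure(), grpc.WithBlock())
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    client := runtimeapi.NewRuntimeServiceClient(conn)
    resp, err := client.CreateContainer(ctx, &runtimeapi.CreateContainerRequest{
        PodSandboxId: "SANDBOX-ID-FROM-RunPodSandbox",
        Config: &runtimeapi.ContainerConfig{
            Metadata: &runtimeapi.ContainerMetadata{Name: "app"},
            Image:    &runtimeapi.ImageSpec{Image: "docker.io/library/nginx:latest"},
            Linux: &runtimeapi.LinuxContainerConfig{
                // The values the kubelet computed above end up here.
                Resources: &runtimeapi.LinuxContainerResources{
                    CpuShares:          256,    // from requests.cpu: 250m
                    CpuQuota:           50000,  // from limits.cpu: 500m
                    CpuPeriod:          100000, // default 100ms period
                    MemoryLimitInBytes: 128 * 1024 * 1024,
                },
            },
        },
        SandboxConfig: &runtimeapi.PodSandboxConfig{},
    })
    if err != nil {
        panic(err)
    }
    fmt.Println("created container:", resp.ContainerId)
}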

containerd

Let's see what containerd does as a CRI example.

Here we start.

https://github.com/containerd/containerd/blob/master/pkg/cri/server/container_create.go#L50

OK, so this is where the container is created? :thinking_face:

https://github.com/containerd/containerd/blob/master/pkg/cri/server/container_create.go#L245

Actually, I was stuck here, so I asked a question in the containerd channel on CNCF Slack.

So we need to find NewTask...

Looks like c.client.TaskService().Create(ctx, request) in the following code is where containerd calls the OCI runtime.

https://github.com/containerd/containerd/blob/master/container.go#L298

OK, so we need to find Create() in runc now.

runc

Here we come.

https://github.com/opencontainers/runc/blob/master/libcontainer/factory_linux.go#L239

Huh, so you just populate a struct and that's it? I don't fully understand what this does here, but it doesn't really matter for tracing Request and Limit right now.

https://github.com/opencontainers/runc/blob/master/libcontainer/factory_linux.go#L264

c := &linuxContainer{
  id:            id,
  root:          containerRoot,
  config:        config,
  initPath:      l.InitPath,
  initArgs:      l.InitArgs,
  criuPath:      l.CriuPath,
  newuidmapPath: l.NewuidmapPath,
  newgidmapPath: l.NewgidmapPath,
  cgroupManager: l.NewCgroupsManager(config.Cgroups, nil),
}
if intelrdt.IsCATEnabled() || intelrdt.IsMBAEnabled() {
  c.intelRdtManager = l.NewIntelRdtManager(config, id, "")
}
c.state = &stoppedState{c: c}
return c, nil
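
To close the loop, the cgroupManager created above is what eventually materializes those numbers on disk. Conceptually (cgroup v1 layout, heavily simplified, not runc's actual code), it boils down to writes like this:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strconv"
)

// writeCgroupValue is a hypothetical helper: runc's cgroup manager does the
// equivalent of writing the computed values into the cgroup filesystem.
func writeCgroupValue(dir, file string, value int64) error {
    return os.WriteFile(filepath.Join(dir, file), []byte(strconv.FormatInt(value, 10)), 0o644)
}

func main() {
    // Made-up container cgroup paths for illustration (cgroup v1 hierarchy).
    cpuDir := "/sys/fs/cgroup/cpu/kubepods/podXXXX/containerYYYY"
    memDir := "/sys/fs/cgroup/memory/kubepods/podXXXX/containerYYYY"

    // Errors ignored for brevity in this sketch.
    _ = writeCgroupValue(cpuDir, "cpu.cfs_quota_us", 50000)   // limits.cpu: 500m
    _ = writeCgroupValue(cpuDir, "cpu.cfs_period_us", 100000) // default 100ms period
    _ = writeCgroupValue(cpuDir, "cpu.shares", 256)           // requests.cpu: 250m
    _ = writeCgroupValue(memDir, "memory.limit_in_bytes", 128*1024*1024) // limits.memory: 128Mi

    fmt.Println("wrote CFS quota/period, shares, and memory limit")
}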