Running multiple kubelets that each schedule to a different NUMA-pinned domain

Kubernetes on a NUMA machine

Example Topology

A 2-socket (2-node) machine with 4 cores per socket: CPUs 0-3 on node0 and CPUs 4-7 on node1.
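
A quick way to confirm the CPU-to-node mapping on the host (exact output layout varies by distribution and tool version) is with lscpu or numactl:

# list each logical CPU and the NUMA node it belongs to
lscpu --extended=CPU,NODE

# summarize nodes, their CPUs, and their memory
numactl --hardware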

kubelet options

kubelet-node0

--cgroup-root=node0.slice --max-pods=1 --node-labels=numa-type=red

kubelet-node1

Additional port flags avoid collisions with the default ports already in use by kubelet-node0:

--cgroup-root=node1.slice --cadvisor-port=4294 --healthz-port=10348 --port=10350 --read-only-port=10355 --max-pods=1 --node-labels=numa-type=blue
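
Each kubelet runs as its own systemd service (kubelet-node0.service and kubelet-node1.service in the Results below), and since both run on the same host, each also gets its own --hostname-override. A minimal sketch of the node1 unit; the ExecStart flags are taken from the Results, but the unit layout, dependencies, and install path are assumptions:

[Unit]
Description=Kubelet pinned to NUMA node1
Requires=docker.service
After=docker.service

[Service]
ExecStart=/usr/local/bin/kubelet \
    --logtostderr=true \
    --v=3 \
    --api-servers=http://127.0.0.1:8080 \
    --address=127.0.0.1 \
    --hostname-override=numa-node1.example.com \
    --allow-privileged=false \
    --cgroup-root=node1.slice \
    --cadvisor-port=4294 \
    --healthz-port=10348 \
    --port=10350 \
    --read-only-port=10355 \
    --max-pods=1 \
    --node-labels=numa-type=blue
Restart=on-failure

[Install]
WantedBy=multi-user.target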

node slices

nodeX.slice

[Unit]
Description=NUMA NodeX Slice
Documentation=man:systemd.special(7)
DefaultDependencies=no
Before=slices.target
Requires=-.slice
After=-.slice
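
The same unit body is used for both slices; only the file name differs, and it must match the --cgroup-root flag passed to the corresponding kubelet. A sketch of installing and starting them (the /etc/systemd/system location is an assumption):

cp nodeX.slice /etc/systemd/system/node0.slice
cp nodeX.slice /etc/systemd/system/node1.slice
systemctl daemon-reload
systemctl start node0.slice node1.slice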

cpuset controller

# pin system services to the first CPU of each node
system.slice/cpuset.cpus = 0,4
system.slice/cpuset.mems = 0,1

# pin node0.slice to the non-system CPUs and the memory of node 0
node0.slice/cpuset.cpus = 1-3
node0.slice/cpuset.mems = 0

# pin node1.slice to the non-system CPUs and the memory of node 1
node1.slice/cpuset.cpus = 5-7
node1.slice/cpuset.mems = 1
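
systemd does not manage the cpuset controller itself, so these values have to be written to the cgroup filesystem directly. A sketch, assuming a cgroup v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset and that the slice directories are created before the kubelets start placing pods:

cd /sys/fs/cgroup/cpuset
mkdir -p system.slice node0.slice node1.slice

echo 0,4 > system.slice/cpuset.cpus
echo 0,1 > system.slice/cpuset.mems

echo 1-3 > node0.slice/cpuset.cpus
echo 0   > node0.slice/cpuset.mems

echo 5-7 > node1.slice/cpuset.cpus
echo 1   > node1.slice/cpuset.mems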

Pod Specs

apiVersion: v1
kind: Pod
metadata:
  name: demo0
spec:
  containers:
  - image: fedora:24
    name: fedora
    imagePullPolicy: Always
    command:
    - /usr/bin/sleep
    - "3600"
  nodeSelector:
    numa-type: red
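
The Results below also show a pod running under node1.slice, so a second spec presumably targets the blue label. A sketch of that spec; the pod name demo1 is an assumption:

apiVersion: v1
kind: Pod
metadata:
  name: demo1
spec:
  containers:
  - image: fedora:24
    name: fedora
    imagePullPolicy: Always
    command:
    - /usr/bin/sleep
    - "3600"
  nodeSelector:
    numa-type: blue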

Results
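
The cgroup hierarchy below (likely captured with systemd-cgls; unrelated services elided with "...") shows each pod's containers landing under its intended node slice, with the two kubelets running side by side under system.slice.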

Control group /:
-.slice
├─node0.slice
│ ├─docker-9eed23bcc3a27c7a33328e2146656c09a54a22760567072a4dc75f7c33678c51.scope
│ │ └─6511 /pause
│ └─docker-7e06186716adf7549ba0d6768034cd69a35cebd9fba4ad03ba2ef44dbaab737a.scope
│   └─6589 /usr/bin/sleep 3600
├─node1.slice
│ ├─docker-af91dcfe3dc3be446354e98748610738a76d5b4884ac5810e374171bf126ae08.scope
│ │ └─6921 /pause
│ └─docker-ee85ac4ba933e9411cb836286294a460df90824c1384956cad064c477cbb471e.scope
│   └─6999 /usr/bin/sleep 3600
├─init.scope
│ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 24
├─system.slice
...
│ ├─kubelet-node1.service
│ │ ├─3639 /usr/local/bin/kubelet --logtostderr=true --v=3 --api-servers=http://127.0.0.1:8080 --address=127.0.0.1 --hostname-override=numa-node1.example.com --allow-privileged=false --cgroup-root=node1.slice --cadvisor-port=4294 --healthz-port=10348 --port=10350 --read-only-port=10355 --max-pods=1 --node-labels=numa-type=blue
│ │ └─3678 journalctl -k -f
...
│ ├─kubelet-node0.service
│ │ ├─2388 /usr/local/bin/kubelet --logtostderr=true --v=3 --api-servers=http://127.0.0.1:8080 --address=127.0.0.1 --hostname-override=numa-node0.example.com --allow-privileged=false --cgroup-root=node0.slice --max-pods=1 --node-labels=numa-type=red
│ │ └─2428 journalctl -k -f
...
│ └─docker.service
│   └─888 /usr/bin/docker daemon --exec-opt native.cgroupdriver=systemd --selinux-enabled --log-driver=journald