Title: Setting up K3s on Raspberry Pis and Jetson Nanos
Author: Mitch Murphy
Date: 2021-10-09
In this article we will be setting up a 5 node K3s cluster: one control plane, three workers (Raspberry Pis) and one GPU worker (Nvidia Jetson Nano) to enable GPU workloads such as Tensorflow. Let's get started.
This is a pretty cost-effective cluster (for the computational power, at least). Here is what I will be using:
- 2 x Raspberry Pi 4 Model B - 8GB - $87.25
- 2 x Raspberry Pi 4 Model B - 4GB - $59.75
- 1 x Nvidia Jetson Nano 4GB - $169.99
- 4 x Crucial MX500 500GB SSD - $54.99
- 4 x SATA/SSD to USB Adapter - $9.99
- 1 x 1ft USB C Cables, 5 pack - $8.99
- 1 x USB Charging Station - 60W, 12A - $27.99
- 1 x 1ft CAT 6 Cables, 5 pack - $10.99
- 1 x NETGEAR Ethernet Switch - $19.99
- 1 x Raspberry Pi Cluster Case - $84.99
Note that the Nvidia Jetson Nano was only $99.99 when I bought it; the same model with 4GB of RAM is now $169.99, and there is a 2GB version on Amazon for $59.00.
This brings the total cost to build this exact cluster to $848.87.
Log in with ssh ubuntu@<IPADDR> and use the default password of ubuntu. You will be required to change this password on first login. We will be disabling this account next.
Create a user: sudo adduser master
Add groups: sudo usermod -a -G adm,dialout,cdrom,floppy,sudo,audio,dip,video,plugdev,netdev,lxd master
Now log out, log back in with ssh master@<IPADDR>, and then delete the default user: sudo deluser --remove-home ubuntu
It is time to rename our nodes. I will name the master node k3s-master and the worker nodes k3s-worker1 through k3s-worker3. Change the hostname with: sudo hostnamectl set-hostname k3s-master
We are going to update our installation so we have the latest and greatest packages: sudo apt update && sudo apt upgrade -y. Now reboot.
As cloud-init is present on this image, we also need to edit its configuration: sudo nano /etc/cloud/cloud.cfg. Change preserve_hostname to true, then reboot again.
It is good practice to disable username/password SSH login. This is done by editing the SSH daemon configuration (sudo nano /etc/ssh/sshd_config) as follows:
From:
#PermitRootLogin prohibit-password
#PasswordAuthentication yes
#PubkeyAuthentication yes
To:
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
After making the changes, validate that the configuration has no errors and restart the SSH daemon:
sudo /usr/sbin/sshd -t
sudo systemctl restart sshd.service
Before doing this, generate a local key pair with ssh-keygen, then copy it to the server with ssh-copy-id -i <IDENTITY_FILE> master@k3s-master. Next, edit your ~/.ssh/config file to reflect:
Host k3s-master
Hostname k3s-master
User master
IdentityFile ~/.ssh/id_k3s-master
Host k3s-worker1
Hostname k3s-worker1
User worker
IdentityFile ~/.ssh/id_k3s-worker1
Host k3s-worker2
Hostname k3s-worker2
User worker
IdentityFile ~/.ssh/id_k3s-worker2
Host k3s-worker-gpu
Hostname k3s-worker-gpu
User worker
IdentityFile ~/.ssh/id_k3s-worker-gpu
You should also update your /etc/hosts file:
192.168.0.100 k3s-master
192.168.0.101 k3s-worker1
192.168.0.102 k3s-worker2
192.168.0.104 k3s-worker-gpu
Make sure you enable cgroups by editing /boot/firmware/cmdline.txt and adding the following to the existing line:
cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1
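Since cmdline.txt must stay a single line, the parameters have to be appended to the existing line rather than added below it. A non-interactive sketch, assuming the stock Ubuntu image layout:
# append the cgroup parameters to the end of the single kernel command line
sudo sed -i '$ s/$/ cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1/' /boot/firmware/cmdline.txt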
Disable wireless/Bluetooth by adding the following lines to /boot/firmware/config.txt:
dtoverlay=disable-wifi
dtoverlay=disable-bluetooth
You also need to disable IPv6. Add the following lines to /etc/sysctl.conf:
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
Reload with sudo sysctl -p. You may also need to create the following script at /etc/rc.local:
#!/bin/bash
# /etc/rc.local
# reapply kernel settings from /etc/sysctl.conf and /etc/sysctl.d at boot
/etc/init.d/procps restart
exit 0
Change permissions on the above file with sudo chmod 755 /etc/rc.local. Finally, reboot for the changes to take effect: sudo reboot
Rinse and repeat for all worker nodes. It is also advisable to set up the same key-based SSH access between all the nodes (control plane and workers).
For reference, the control-plane node:
- IPv4: 192.168.0.100
- Domain: cluster.smigula.io
- User: master
- Password: <PASSWD>
While we are booting off an SD card (class 10), they are notorious for slow read/write speeds, so we are going to attach 500GB SSD drives to all of our nodes.
- Make sure you format each drive with the ext4 type.
- Next, create a folder on each node which will serve as the mount point: /mnt/storage
- Get the UUID of the device you want to automount: blkid
- Add the entry to /etc/fstab:
UUID=<MY_UUID> /mnt/storage ext4 defaults,auto,users,rw,nofail 0 0
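Putting those steps together, the sequence on each node looks roughly like this. This is only a sketch: it assumes the SSD shows up as /dev/sda1, so verify the device name with lsblk before formatting.
sudo mkfs.ext4 /dev/sda1     # format the SSD (destroys any existing data!)
sudo mkdir -p /mnt/storage   # create the mount point
blkid /dev/sda1              # note the UUID for the fstab entry
sudo mount -a                # mount everything in fstab; an error means a typo
df -h /mnt/storage           # confirm the drive is mounted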
With the prep work done, install K3s on the control-plane node:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--write-kubeconfig-mode 664 \
--bind-address 192.168.0.100 --advertise-address 192.168.0.100 \
--default-local-storage-path /mnt/storage --cluster-init --node-label memory=high" sh -
Note: here I add the memory label to each node, as this cluster is made up of 8GB, 4GB, and 2GB nodes.
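Once the installer finishes, a quick sanity check on the master confirms the node is up and carries the label:
sudo systemctl status k3s          # the service the installer created
kubectl get nodes --show-labels    # node should be Ready with memory=high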
Next, install Helm (the ARM64 build):
# define what Helm version and where to install:
export HELM_VERSION=v3.7.0
export HELM_INSTALL_DIR=/usr/local/bin
# download the binary and get into place:
cd /tmp
wget https://get.helm.sh/helm-$HELM_VERSION-linux-arm64.tar.gz
tar xvzf helm-$HELM_VERSION-linux-arm64.tar.gz
sudo mv linux-arm64/helm $HELM_INSTALL_DIR/helm
# clean up:
rm -rf linux-arm64 && rm helm-$HELM_VERSION-linux-arm64.tar.gz
helm repo add stable https://charts.helm.sh/stable
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm repo add jetstack https://charts.jetstack.io
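# refresh the local chart index so the repos above are searchable (optional sanity check):
helm repo update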
# this is necessary to address https://github.com/rancher/k3s/issues/1126 for now:
echo "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml" >> ~/.bashrc
source ~/.bashrc
# make sure that we install the dashboard in the kube-system namespace:
sudo kubectl config set-context --current --namespace=kube-system
# install the dashboard, note how we explicitly ask for the Arm version:
helm install kdash stable/kubernetes-dashboard \
--set=image.repository=k8s.gcr.io/kubernetes-dashboard-arm64
# wait until you see the pod in 'Running' state:
watch kubectl get pods -l "app=kubernetes-dashboard,release=kdash"
The dashboard is then available at https://localhost:10443/. In order to log in you must either grab the kube config file or use a token. Since I would like to be able to issue remote kubectl and helm commands, I will copy over the config file, like so:
- Copy the content of /etc/rancher/k3s/k3s.yaml and paste it into a file on your host machine, for example k3s-rpi.yaml
- Change the line server: https://127.0.0.1:6443 to server: https://k3s-master:6443 (or server: https://192.168.0.100:6443 if you haven't updated your /etc/hosts file ;)
- Now you can access the cluster like so:
kubectl --kubeconfig=./k3s-rpi.yaml get nodes
kubectl --insecure-skip-tls-verify --kubeconfig=./k3s-rpi.yaml port-forward \
--namespace kube-system \
svc/kdash-kubernetes-dashboard 10443:443
For reference, the first worker node:
- IPv4: 192.168.0.101
- Domain:
- User: worker
- Password: <PASSWD>
The token can be found at /var/lib/rancher/k3s/server/token on the control plane.
Install the K3s agent on each worker, pointing it at the control plane:
curl -sfL https://get.k3s.io | K3S_URL=https://192.168.0.100:6443 K3S_TOKEN=<TOKEN> \
INSTALL_K3S_EXEC="--node-label memory=high" sh -
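Back on the control plane you can confirm the worker registered and picked up its label:
kubectl get nodes -l memory=high -o wide   # lists only nodes carrying the label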
Create the file /etc/rancher/k3s/registries.yaml and add the following to it:
mirrors:
  "docker.io":
    endpoint:
      - "https://docker.io"
configs:
  "docker.io":
    auth:
      username: "smigula"
      password: <TOKEN>
    tls:
      insecure_skip_verify: true
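K3s only reads registries.yaml at startup, so restart the service afterwards for it to take effect:
sudo systemctl restart k3s-agent   # on the server node, restart k3s instead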
Note: you will need to do this for all worker nodes. Could this be automated across the nodes with an Ansible playbook?
This section will cover what is needed to configure a node (e.g. the Nvidia Jetson Nano) to give containers access to a GPU.
- Create user: sudo useradd worker
- Set password: sudo passwd worker
- Add groups to user: sudo usermod -aG adm,cdrom,sudo,audio,dip,video,plugdev,i2c,lpadmin,gdm,sambashare,weston-launch,gpio worker
You need to set the swap size to 8GB; use the following script:
git clone https://github.com/JetsonHacksNano/resizeSwapMemory.git
cd resizeSwapMemory
chmod +x setSwapMemorySize.sh
./setSwapMemorySize.sh -g 8
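After rebooting, you can confirm the new swap size:
free -h         # total swap should now read about 8G
swapon --show   # lists the active swap devices/files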
Disable IPv6 on the Jetson as well:
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1
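Those sysctl -w writes only last until the next boot; to persist them on the Jetson as well, append the same keys to /etc/sysctl.conf, for example:
echo "net.ipv6.conf.all.disable_ipv6=1" | sudo tee -a /etc/sysctl.conf
echo "net.ipv6.conf.default.disable_ipv6=1" | sudo tee -a /etc/sysctl.conf
echo "net.ipv6.conf.lo.disable_ipv6=1" | sudo tee -a /etc/sysctl.conf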
- Edit /etc/default/networking (sudo vi /etc/default/networking) and set the parameter CONFIGURE_INTERFACES to no
- Assign a static address by editing /etc/network/interfaces (sudo vi /etc/network/interfaces):
auto eth0
iface eth0 inet static
address 192.168.0.104
netmask 255.255.255.0
gateway 192.168.0.1
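Apply the new interface configuration by restarting networking (or simply rebooting), then confirm the address, for example:
sudo systemctl restart networking   # ifupdown-based L4T/Ubuntu
ip addr show eth0                   # should now list 192.168.0.104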
Download K3s and kubectl. Start by downloading the K3s and kubectl ARM64 binaries and copying them to /usr/local/bin with execute permissions:
sudo wget -c "https://github.com/k3s-io/k3s/releases/download/v1.19.7%2Bk3s1/k3s-arm64" -O /usr/local/bin/k3s ; sudo chmod 755 /usr/local/bin/k3s
sudo wget -c "https://dl.k8s.io/release/v1.20.0/bin/linux/arm64/kubectl" -O /usr/local/bin/kubectl ; sudo chmod 755 /usr/local/bin/kubectl
We need to provide some configuration to K3s. First create the directory with sudo mkdir -p /etc/rancher/k3s/, then create the file /etc/rancher/k3s/config.yaml with these contents:
node-ip: 192.168.0.104
server: https://192.168.0.100:6443
token: <TOKEN>
export TOKEN=K10513ec520ffb7ce3d94da39d5a26be5da9324769f035498595c9941d21bcfeb62::server:ed7aefd846db06468a6c78fb91d461d2
curl -sfL https://get.k3s.io | K3S_URL=https://192.168.0.100:6443 K3S_TOKEN=$TOKEN \
INSTALL_K3S_EXEC="--node-label memory=medium --node-label=gpu=nvidia" sh -
Consult the K3s Advanced Options and Configuration Guide; for this type of node we are specifically concerned with setting the container runtime to nvidia-container-runtime. First stop the k3s-agent service with sudo systemctl stop k3s-agent. Then create the file /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl and add the following content:
[plugins.opt]
path = "{{ .NodeConfig.Containerd.Opt }}"
[plugins.cri]
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
{{- if .IsRunningInUserNS }}
disable_cgroup = true
disable_apparmor = true
restrict_oom_score_adj = true
{{end}}
{{- if .NodeConfig.AgentConfig.PauseImage }}
sandbox_image = "{{ .NodeConfig.AgentConfig.PauseImage }}"
{{end}}
{{- if not .NodeConfig.NoFlannel }}
[plugins.cri.cni]
bin_dir = "{{ .NodeConfig.AgentConfig.CNIBinDir }}"
conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"
{{end}}
[plugins.cri.containerd.runtimes.runc]
# ---- changed from 'io.containerd.runc.v2' for GPU support
runtime_type = "io.containerd.runtime.v1.linux"
# ---- added for GPU support
[plugins.linux]
runtime = "nvidia-container-runtime"
{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
[plugins.cri.registry.mirrors]{{end}}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
[plugins.cri.registry.mirrors."{{$k}}"]
endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
{{end}}
{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
[plugins.cri.registry.configs."{{$k}}".auth]
{{ if $v.Auth.Username }}username = "{{ $v.Auth.Username }}"{{end}}
{{ if $v.Auth.Password }}password = "{{ $v.Auth.Password }}"{{end}}
{{ if $v.Auth.Auth }}auth = "{{ $v.Auth.Auth }}"{{end}}
{{ if $v.Auth.IdentityToken }}identitytoken = "{{ $v.Auth.IdentityToken }}"{{end}}
{{end}}
{{ if $v.TLS }}
[plugins.cri.registry.configs."{{$k}}".tls]
{{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
{{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
{{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
{{end}}
{{end}}
{{end}}
Now restart K3s with sudo systemctl restart k3s-agent.
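To confirm the template rendered as expected, inspect the config that K3s generates next to it:
# the rendered file should now reference nvidia-container-runtime
sudo grep -B1 -A1 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml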
Nvidia provides a base Docker image we can use to test that all devices are configured properly. Change into your home directory and copy over the CUDA samples: cp -R /usr/local/cuda/samples . Next, create a Dockerfile (here named Dockerfile.deviceQuery, as referenced in the build step below) to perform the deviceQuery test:
FROM nvcr.io/nvidia/l4t-base:r32.5.0
RUN apt-get update && apt-get install -y --no-install-recommends make g++
COPY ./samples /tmp/samples
WORKDIR /tmp/samples/1_Utilities/deviceQuery
RUN make clean && make
CMD ["./deviceQuery"]
- Build: docker build -t xift/jetson_devicequery:r32.5.0 . -f Dockerfile.deviceQuery
- Run: docker run --rm --runtime nvidia xift/jetson_devicequery:r32.5.0
- If everything is configured correctly you should see something like:
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
By default, K3s uses containerd to run containers, so let's ensure that works properly as well (with CUDA support). For this, we will create a simple bash script that uses ctr instead of docker:
#!/bin/bash
IMAGE=xift/jetson_devicequery:r32.5.0
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
ctr i pull docker.io/${IMAGE}
ctr run --rm --gpus 0 --tty docker.io/${IMAGE} deviceQuery
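If the bare ctr on your PATH talks to the wrong containerd (Docker ships its own), note that K3s bundles a matching client; an equivalent invocation would be:
sudo k3s ctr i pull docker.io/xift/jetson_devicequery:r32.5.0
sudo k3s ctr run --rm --gpus 0 --tty docker.io/xift/jetson_devicequery:r32.5.0 deviceQuery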
You should get the same result as above. The final (and real) test is to deploy a pod to the cluster, selecting only those nodes with the gpu: nvidia label. Create the following file, pod_deviceQuery.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: devicequery
spec:
  containers:
    - name: nvidia
      image: xift/jetson_devicequery:r32.5.0
      command: [ "./deviceQuery" ]
  nodeSelector:
    gpu: nvidia
Create this pod with kubectl apply -f pod_deviceQuery.yaml. Once the image is pulled and the container is created, it will run the deviceQuery command and then exit, so it may look as if the pod failed. Simply take a look at the logs with kubectl logs devicequery and look for the PASS result above. If all checks out, we are now ready to deploy GPU workloads to our K3s cluster!
Note: you may also want to taint this node so that non-GPU workloads will not be scheduled on it, as sketched below.
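A minimal sketch of such a taint (the key/value below mirror the node label, but the names are arbitrary), along with the toleration a GPU pod would need:
# taint the GPU node; pods without a matching toleration won't schedule here
kubectl taint nodes k3s-worker-gpu gpu=nvidia:NoSchedule
# a GPU pod would then add, under its spec:
#   tolerations:
#     - key: "gpu"
#       operator: "Equal"
#       value: "nvidia"
#       effect: "NoSchedule"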
Stay tuned!