@mkm29
Last active April 11, 2022 17:28
Details on how to set up and configure K3s on a Raspberry Pi cluster

Howto Guide: Home Kubernetes Cluster

Title: Setting up K3s on Raspberry Pis and Jetson Nanos
Author: Mitch Murphy
Date: 2021-10-09

In this article we will be setting up a five-node K3s cluster: one control plane, three workers (Raspberry Pis), and one GPU worker (an Nvidia Jetson Nano) to enable GPU workloads such as TensorFlow. Let's get started.

Materials

This is a pretty cost-effective cluster (for the computational power, at least); here is what I will be using:

Note that the Nvidia Jetson Nano was only $99.99 when I bought it; the same model with 4 GB of RAM is now $169.99, and there is a 2 GB version on Amazon for $59.00.

This brings the total cost to build this exact cluster at $848.87.

Prerequisites

Log in with ssh ubuntu@<IPADDR>, using the default password ubuntu; you will be required to change it. We will be disabling this account next.

Create user: sudo adduser master
Add groups: sudo usermod -a -G adm,dialout,cdrom,floppy,sudo,audio,dip,video,plugdev,netdev,lxd master

Now logout and log back in: ssh master@<IPADDR> and then delete the default user: sudo deluser --remove-home ubuntu.

It's time to rename our nodes. I will name the master node k3s-master and the worker nodes k3s-worker1 through k3s-worker3. Change the hostname with: sudo hostnamectl set-hostname k3s-master.

Update the installation so we have the latest and greatest packages by running: sudo apt update && sudo apt upgrade -y. Now reboot.

As cloud-init is present on this image, we also need to edit its config: sudo nano /etc/cloud/cloud.cfg. Change preserve_hostname to true. Reboot again.
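The cloud-init edit can also be made non-interactively; a minimal sketch, assuming the stock cloud.cfg shipped with this image:

```shell
# Back up the original, then flip preserve_hostname from false to true in place
sudo cp /etc/cloud/cloud.cfg /etc/cloud/cloud.cfg.bak
sudo sed -i 's/^preserve_hostname: false$/preserve_hostname: true/' /etc/cloud/cloud.cfg
# Confirm the change took
grep '^preserve_hostname' /etc/cloud/cloud.cfg
```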

SSH

It is good practice to disable username/password SSH login. Do this by editing /etc/ssh/sshd_config (sudo nano /etc/ssh/sshd_config), as so:

From:
#PermitRootLogin prohibit-password
#PasswordAuthentication yes
#PubkeyAuthentication yes
To:
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
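The same three edits can be scripted with sed; this sketch assumes the stock commented-out defaults shown in the "From" block:

```shell
# Uncomment-and-set each option, whatever its current (possibly commented) value;
# sed -i.bak keeps a backup of the original config
sudo sed -i.bak \
  -e 's/^#\?PermitRootLogin.*/PermitRootLogin no/' \
  -e 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' \
  -e 's/^#\?PubkeyAuthentication.*/PubkeyAuthentication yes/' \
  /etc/ssh/sshd_config
```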

After making the changes, validate that the config has no errors and restart the SSH daemon:

sudo /usr/sbin/sshd -t
sudo systemctl restart sshd.service

Before disabling password authentication, generate a local key pair with ssh-keygen and copy it to the server with ssh-copy-id -i <IDENTITY_FILE> master@k3s-master. Next, edit your ~/.ssh/config file to reflect:

Host k3s-master
Hostname k3s-master
User master
IdentityFile ~/.ssh/id_k3s-master

Host k3s-worker1
Hostname k3s-worker1
User worker
IdentityFile ~/.ssh/id_k3s-worker1

Host k3s-worker2
Hostname k3s-worker2
User worker
IdentityFile ~/.ssh/id_k3s-worker2

Host k3s-worker-gpu
Hostname k3s-worker-gpu
User worker
IdentityFile ~/.ssh/id_k3s-worker-gpu

You should also update your /etc/hosts file:

192.168.0.100   k3s-master
192.168.0.101   k3s-worker1
192.168.0.102   k3s-worker2
192.168.0.104   k3s-worker-gpu
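All four entries can be appended in one shot (a sketch; adjust the IPs to match your LAN):

```shell
# Append the cluster hostnames to /etc/hosts
sudo tee -a /etc/hosts >/dev/null <<'EOF'
192.168.0.100   k3s-master
192.168.0.101   k3s-worker1
192.168.0.102   k3s-worker2
192.168.0.104   k3s-worker-gpu
EOF
```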

Make sure you enable cgroups by editing /boot/firmware/cmdline.txt and appending the following to the existing line:

cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1
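cmdline.txt must remain a single line, so append the flags to the end of that line rather than adding new ones; a sketch:

```shell
# Back up, then append the cgroup flags to the one-line kernel command line
sudo cp /boot/firmware/cmdline.txt /boot/firmware/cmdline.txt.bak
sudo sed -i '1 s/$/ cgroup_enable=cpuset cgroup_enable=memory cgroup_memory=1/' /boot/firmware/cmdline.txt
```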

Disable wireless/bluetooth by adding the following lines to /boot/firmware/config.txt:

dtoverlay=disable-wifi
dtoverlay=disable-bluetooth

You also need to disable IPv6. Add the following lines to /etc/sysctl.conf:

net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1

Reload with sudo sysctl -p. You may also need to create the following script at /etc/rc.local:

#!/bin/bash
# /etc/rc.local
# Re-apply kernel parameters from /etc/sysctl.conf and /etc/sysctl.d/ at boot
sysctl --system
/etc/init.d/procps restart

exit 0

Change the permissions on the above file: sudo chmod 755 /etc/rc.local. Finally, reboot for the changes to take effect: sudo reboot.

Rinse and repeat for all worker nodes. It is also advisable to set up the same key-based SSH access among all the nodes (control plane and workers).

K3s Master

  • IPv4: 192.168.0.100
  • Domain: cluster.smigula.io
  • User: master
  • Password: <PASSWD>

Mount Storage Volume

While we are booting off an SD card (class 10), SD cards are notorious for slow read/write speeds, so we are going to attach a 500 GB SSD drive to each of our nodes.

  1. Make sure you format each drive with the ext4 type.
  2. Next create a folder on each node which will serve as the mount point at /mnt/storage
  3. Get the UUID of the device you want to automount: blkid
  4. Add the entry to /etc/fstab:

UUID=<MY_UUID> /mnt/storage ext4 defaults,auto,users,rw,nofail 0 0
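Steps 3 and 4 can be combined; a sketch that assumes the SSD shows up as /dev/sda1 (check with lsblk first):

```shell
DEV=/dev/sda1                                # assumption: adjust to your device
UUID=$(sudo blkid -s UUID -o value "$DEV")   # print just the UUID value
LINE="UUID=${UUID} /mnt/storage ext4 defaults,auto,users,rw,nofail 0 0"
echo "$LINE" | sudo tee -a /etc/fstab
sudo mount -a && df -h /mnt/storage          # mount it now and confirm
```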

Install K3s

curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--write-kubeconfig-mode 664 \
--bind-address 192.168.0.100 --advertise-address 192.168.0.100 \
--default-local-storage-path /mnt/storage --cluster-init --node-label memory=high" sh -

Note: here I add the memory label to each node, as this cluster will be comprised of 8 GB, 4 GB, and 2 GB nodes.

Install Helm

# define what Helm version and where to install:
export HELM_VERSION=v3.7.0
export HELM_INSTALL_DIR=/usr/local/bin

# download the binary and get into place:
cd /tmp
wget https://get.helm.sh/helm-$HELM_VERSION-linux-arm64.tar.gz
tar xvzf helm-$HELM_VERSION-linux-arm64.tar.gz
sudo mv linux-arm64/helm $HELM_INSTALL_DIR/helm

# clean up:
rm -rf linux-arm64 && rm helm-$HELM_VERSION-linux-arm64.tar.gz

Add Helm Repos

helm repo add stable https://charts.helm.sh/stable
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm repo add jetstack https://charts.jetstack.io

Install Dashboard

# this is necessary to address https://github.com/rancher/k3s/issues/1126 for now:
echo "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml" >> ~/.bashrc
source ~/.bashrc

# make sure that we install the dashboard in the kube-system namespace:
sudo kubectl config set-context --current --namespace=kube-system

# install the dashboard, note how we explicitly ask for the Arm version:
helm install kdash stable/kubernetes-dashboard \
    --set=image.repository=k8s.gcr.io/kubernetes-dashboard-arm64

# wait until you see the pod in 'Running' state:
watch kubectl get pods -l "app=kubernetes-dashboard,release=kdash"

Access Dashboard

The dashboard is then available at https://localhost:10443/ (once the port-forward below is running). In order to log in you must either grab the kube config file or use a token. Since I would like to be able to issue remote kubectl and helm commands, I will copy over the config file, like so:

  1. Copy the content of /etc/rancher/k3s/k3s.yaml and paste it into a file on your host machine, for example, k3s-rpi.yaml
  2. Change the line server: https://127.0.0.1:6443 to server: https://k3s-master:6443 (or server: https://192.168.0.100:6443 if you haven't updated your /etc/hosts file).
  3. Now you can access the cluster like so: kubectl --kubeconfig=./k3s-rpi.yaml get nodes
kubectl --insecure-skip-tls-verify --kubeconfig=./k3s-rpi.yaml port-forward \
    --namespace kube-system \
    svc/kdash-kubernetes-dashboard 10443:443
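Step 2 above can also be done with a single substitution (assuming the default 127.0.0.1 entry in the copied file):

```shell
# Keep a .bak copy and point the kubeconfig at the master's hostname
sed -i.bak 's#server: https://127.0.0.1:6443#server: https://k3s-master:6443#' k3s-rpi.yaml
```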

Workers

Worker 1

  • IPv4: 192.168.0.101
  • Domain:
  • User: worker
  • Password: <PASSWD>

Token can be found at /var/lib/rancher/k3s/server/token on the control plane.

Install

curl -sfL https://get.k3s.io | K3S_URL=https://192.168.0.100:6443 K3S_TOKEN=<TOKEN> \
  INSTALL_K3S_EXEC="--node-label memory=high" sh -

Add Private Registry

Create the file /etc/rancher/k3s/registries.yaml, and add the following to it:

mirrors:
  "docker.io":
    endpoint:
      - "https://docker.io"
configs:
  "docker.io":
    auth:
      username: "smigula"
      password: <TOKEN>
    tls:
      insecure_skip_verify: true

Note: you will need to do this for all worker nodes and restart the k3s-agent service for it to take effect; this is a good candidate for automating with an Ansible playbook.

GPU Support

This section will cover what is needed to configure a node (e.g. an Nvidia Jetson Nano) to give containers access to its GPU.

  1. Create user: sudo useradd -m worker (the -m flag also creates a home directory)
  2. Set password: sudo passwd worker
  3. Add groups to user: sudo usermod -aG adm,cdrom,sudo,audio,dip,video,plugdev,i2c,lpadmin,gdm,sambashare,weston-launch,gpio worker

Swap

You need to set the swap size to 8 GB; use the following script:

git clone https://github.com/JetsonHacksNano/resizeSwapMemory.git
cd resizeSwapMemory
chmod +x setSwapMemorySize.sh
./setSwapMemorySize.sh -g 8

Disable IPv6

sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1
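These sysctl -w settings do not survive a reboot; to persist them, append the same keys to /etc/sysctl.conf (as on the Pi nodes) and reload:

```shell
# Persist the IPv6 settings across reboots
sudo tee -a /etc/sysctl.conf >/dev/null <<'EOF'
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
EOF
sudo sysctl -p
```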

Assign Static IP

  1. Edit sudo vi /etc/default/networking
  2. Set the parameter CONFIGURE_INTERFACES to no
  3. sudo vi /etc/network/interfaces
auto eth0
iface eth0 inet static
  address 192.168.0.104
  netmask 255.255.255.0
  gateway 192.168.0.1

Deploy K3s

Download K3s and kubectl. Start by downloading the K3s and kubectl ARM64 binaries and copying them into /usr/local/bin with execute permissions:

sudo wget -c "https://github.com/k3s-io/k3s/releases/download/v1.19.7%2Bk3s1/k3s-arm64" -O /usr/local/bin/k3s && sudo chmod 755 /usr/local/bin/k3s
wget -c "https://dl.k8s.io/v1.20.0/kubernetes-client-linux-arm64.tar.gz" -O /tmp/kubectl.tar.gz
tar xzf /tmp/kubectl.tar.gz -C /tmp
sudo install -m 755 /tmp/kubernetes/client/bin/kubectl /usr/local/bin/kubectl

We need to provide some configuration to K3s. First create the directory with sudo mkdir -p /etc/rancher/k3s/, and then create the file /etc/rancher/k3s/config.yaml with these contents:

node-ip: 192.168.0.104
server: https://192.168.0.100:6443
token: <TOKEN>

Install K3s

export TOKEN=<TOKEN>  # the value from /var/lib/rancher/k3s/server/token on the control plane
curl -sfL https://get.k3s.io | K3S_URL=https://192.168.0.100:6443 K3S_TOKEN=$TOKEN \
  INSTALL_K3S_EXEC="--node-label memory=medium --node-label=gpu=nvidia" sh -

Configuration

Consult the K3s Advanced Options and Configuration Guide; for this type of node we are specifically concerned with setting the container runtime to nvidia-container-runtime. First stop the k3s-agent service with sudo systemctl stop k3s-agent. Then create the file /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl, and add the following content:

[plugins.opt]
  path = "{{ .NodeConfig.Containerd.Opt }}"

[plugins.cri]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"

{{- if .IsRunningInUserNS }}
  disable_cgroup = true
  disable_apparmor = true
  restrict_oom_score_adj = true
{{end}}

{{- if .NodeConfig.AgentConfig.PauseImage }}
  sandbox_image = "{{ .NodeConfig.AgentConfig.PauseImage }}"
{{end}}

{{- if not .NodeConfig.NoFlannel }}
[plugins.cri.cni]
  bin_dir = "{{ .NodeConfig.AgentConfig.CNIBinDir }}"
  conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"
{{end}}

[plugins.cri.containerd.runtimes.runc]
  # ---- changed from 'io.containerd.runc.v2' for GPU support
  runtime_type = "io.containerd.runtime.v1.linux"

# ---- added for GPU support
[plugins.linux]
  runtime = "nvidia-container-runtime"

{{ if .PrivateRegistryConfig }}
{{ if .PrivateRegistryConfig.Mirrors }}
[plugins.cri.registry.mirrors]{{end}}
{{range $k, $v := .PrivateRegistryConfig.Mirrors }}
[plugins.cri.registry.mirrors."{{$k}}"]
  endpoint = [{{range $i, $j := $v.Endpoints}}{{if $i}}, {{end}}{{printf "%q" .}}{{end}}]
{{end}}

{{range $k, $v := .PrivateRegistryConfig.Configs }}
{{ if $v.Auth }}
[plugins.cri.registry.configs."{{$k}}".auth]
  {{ if $v.Auth.Username }}username = "{{ $v.Auth.Username }}"{{end}}
  {{ if $v.Auth.Password }}password = "{{ $v.Auth.Password }}"{{end}}
  {{ if $v.Auth.Auth }}auth = "{{ $v.Auth.Auth }}"{{end}}
  {{ if $v.Auth.IdentityToken }}identitytoken = "{{ $v.Auth.IdentityToken }}"{{end}}
{{end}}
{{ if $v.TLS }}
[plugins.cri.registry.configs."{{$k}}".tls]
  {{ if $v.TLS.CAFile }}ca_file = "{{ $v.TLS.CAFile }}"{{end}}
  {{ if $v.TLS.CertFile }}cert_file = "{{ $v.TLS.CertFile }}"{{end}}
  {{ if $v.TLS.KeyFile }}key_file = "{{ $v.TLS.KeyFile }}"{{end}}
{{end}}
{{end}}
{{end}}

Now restart K3s with sudo systemctl restart k3s-agent.

Test

Nvidia provides a base Docker image we can use to test that all devices are configured properly. Change into your home directory and copy over the demos: cp -R /usr/local/cuda/samples .. Next, create a Dockerfile (named Dockerfile.deviceQuery) to perform the deviceQuery test:

FROM nvcr.io/nvidia/l4t-base:r32.5.0
RUN apt-get update && apt-get install -y --no-install-recommends make g++
COPY ./samples /tmp/samples
WORKDIR /tmp/samples/1_Utilities/deviceQuery
RUN make clean && make
CMD ["./deviceQuery"]

  1. Build: docker build -t xift/jetson_devicequery:r32.5.0 . -f Dockerfile.deviceQuery
  2. Run: docker run --rm --runtime nvidia xift/jetson_devicequery:r32.5.0
  3. If everything is configured correctly you should see something like:
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

By default, K3s uses containerd to run containers, so let's ensure that containerd also works properly with CUDA support. For this, we will create a simple bash script that uses ctr instead of docker:

#!/bin/bash
IMAGE=xift/jetson_devicequery:r32.5.0
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
ctr i pull docker.io/${IMAGE}
ctr run --rm --gpus 0 --tty docker.io/${IMAGE} deviceQuery

You should get the same result as above. The final, and real, test is to deploy a pod to the cluster (selecting only those nodes with the gpu: nvidia label). Create the following file, pod_deviceQuery.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: devicequery
spec:
  containers:
  - name: nvidia
    image: xift/jetson_devicequery:r32.5.0
    command: [ "./deviceQuery" ]
  nodeSelector:
    gpu: nvidia

Create this pod with kubectl apply -f pod_deviceQuery.yaml. Once the image is pulled and the container is created, it will run the deviceQuery command and then exit, so it may look as if the pod failed. Simply take a look at the logs and check for the PASS result above with kubectl logs devicequery. If all checks out, we are now ready to deploy GPU workloads to our K3s cluster!

Note: you may also want to taint this node so that non-GPU workloads will not be scheduled on it.
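For example, you could run kubectl taint nodes k3s-worker-gpu gpu=nvidia:NoSchedule (a hypothetical taint mirroring the gpu=nvidia label above); GPU pods would then need a matching toleration in their spec:

```yaml
# Sketch: tolerations to add under spec: in pod_deviceQuery.yaml,
# matching the hypothetical gpu=nvidia:NoSchedule taint
tolerations:
- key: gpu
  operator: Equal
  value: nvidia
  effect: NoSchedule
```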

Tensorflow

Stay tuned!
