@mproffitt
Last active August 24, 2025 08:35
kind + CAPI vclusters + GPU

This guide walks through the steps required to build a kind cluster with GPU support, then share the GPU with a vCluster running inside the kind cluster.

It expands on the tutorial at https://www.substratus.ai/blog/kind-with-gpus/ and adds the steps required to make the GPU work when docker uses the containerd runtime.

Install nvidia container toolkit

This guide requires an NVIDIA graphics card.

Follow the instructions for installing the nvidia container toolkit https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
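Before going further, it can help to confirm that the tools used in this guide are on your PATH. The following is a minimal sketch; the missing_tools helper is a hypothetical name, and the tool list is an assumption based on the commands that appear later in this guide.

```shell
# Print every command from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool list: adjust it to match what you actually use.
missing="$(missing_tools nvidia-smi nvidia-ctk docker kind kubectl helm)"
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on a single line.
    echo "Not found on PATH:" $missing
else
    echo "All required tools found"
fi
```

This only checks that the binaries exist. It does not check that the driver or the toolkit is configured correctly.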

Configure the default runtime for docker

Note

If you are using containerd as your container runtime for docker, you need to set the default runtime for both docker and containerd. Otherwise the nodes will not advertise the GPU as allocatable, and pods that request it will be stuck in the Pending state.

If you are only using docker then:

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker

If using both containerd and docker then:

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd docker

The nvidia container runtime needs to be instructed to accept visible devices as volume mounts. To do this, uncomment the accept-nvidia-visible-devices-as-volume-mounts option in the container-runtime config and set it to true:

sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /etc/nvidia-container-runtime/config.toml
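To confirm the sed command took effect, you can check that the option is present, uncommented, and set to true. This is a sketch only; volume_mounts_enabled is a hypothetical helper name, and the path is the one used in the sed command above.

```shell
# Succeeds only if the option is uncommented and set to true in the given file.
volume_mounts_enabled() {
    grep -Eq '^[[:space:]]*accept-nvidia-visible-devices-as-volume-mounts[[:space:]]*=[[:space:]]*true' "$1"
}

config=/etc/nvidia-container-runtime/config.toml
if [ -r "$config" ] && volume_mounts_enabled "$config"; then
    echo "volume-mount device selection is enabled"
else
    echo "setting not enabled (or $config is not readable)"
fi
```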

Create kind cluster

Once this has been completed, we can spin up a kind cluster. This cluster uses the following kind.yaml:

kind.yaml configuration
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
name: gputest
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
  - role: worker
    extraMounts:
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
  - role: worker
    extraMounts:
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
  - role: worker
    extraMounts:
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all

kind create cluster --config kind.yaml

This will create a cluster with one control-plane node and three workers, each with the nvidia device mount in place.
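The node stanzas in kind.yaml are identical apart from the role, so if you want a different number of workers the file can be generated instead of edited by hand. The following is a sketch under that assumption; gen_kind_config is a hypothetical helper, not part of kind.

```shell
# Print a kind config with one control-plane node and N workers,
# each carrying the same /dev/null mount used in kind.yaml above.
gen_kind_config() {
    name="$1"
    workers="$2"
    printf 'apiVersion: kind.x-k8s.io/v1alpha4\nkind: Cluster\nname: %s\nnodes:\n' "$name"
    i=0
    while [ "$i" -le "$workers" ]; do
        # Node 0 is the control plane; every later node is a worker.
        if [ "$i" -eq 0 ]; then role=control-plane; else role=worker; fi
        printf '  - role: %s\n' "$role"
        printf '    extraMounts:\n'
        printf '      - hostPath: /dev/null\n'
        printf '        containerPath: /var/run/nvidia-container-devices/all\n'
        i=$((i + 1))
    done
}

# Prints the same layout as the kind.yaml shown above.
gen_kind_config gputest 3
```

Redirect the output to kind.yaml and pass it to kind create cluster --config kind.yaml as before.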

Optional step: symlink ldconfig

In Sam's article, K8s Kind with GPUs (linked above), he lists a step that creates /sbin/ldconfig.real as a symlink to /sbin/ldconfig inside the cluster nodes.

I found I did not need this step with Kind nodes 1.29. However, if you are using an older container runtime, it may still be relevant for your environment.

If this is the case, apply the following to add a symlink to the kind nodes.

# Create the ldconfig.real symlink on every kind node (each node is a docker container).
for name in $(kubectl get nodes -o jsonpath="{.items[*].metadata.name}"); do
    docker exec "${name}" ln -s /sbin/ldconfig /sbin/ldconfig.real
done

You should only need this step if the GPU operator fails to start.

Install GPU operator

For your cluster to become GPU-ready, you need to install the GPU operator from nvidia. It can be installed via helm from nvidia/gpu-operator:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator --set driver.enabled=false

Depending on your system, it may then take a while for the operator to become fully available.

Once ready, the worker nodes should list nvidia.com/gpu among their allocatable resources.

Caution

This may take a while to become ready. In testing, I saw times up to 5 minutes for a 3 node cluster.

$ kubectl get node -o yaml | yq '.items[] | [{"name": .metadata.name, "status": .status.allocatable."nvidia.com/gpu"}]'
- name: gputest-control-plane
  status: null
- name: gputest-worker
  status: "1"
- name: gputest-worker2
  status: "1"
- name: gputest-worker3
  status: "1"

If you are only running a single node cluster, this may be on the control-plane instead.
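Since the allocation can take several minutes to appear, polling for it may be more convenient than re-running the command by hand. The following is a sketch; retry_until and gpu_allocatable are hypothetical helper names, and the timeout and interval in the usage comment are arbitrary.

```shell
# Run a command every INTERVAL seconds until it succeeds or TIMEOUT seconds elapse.
retry_until() {
    timeout="$1"; interval="$2"; shift 2
    elapsed=0
    until "$@"; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Succeeds once at least one node reports an allocatable nvidia.com/gpu.
# The backslash escapes the dot inside the resource name for kubectl's jsonpath.
gpu_allocatable() {
    gpus="$(kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}')" || return 1
    [ -n "$gpus" ]
}

# Usage (600 s timeout, 10 s interval):
#   retry_until 600 10 gpu_allocatable && echo "GPU is allocatable"
```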

Install vClusters

Once the kind cluster is ready, we want to schedule GPU workloads inside a vcluster running in it.

To make this possible, we need to instruct the vcluster to read node state from the real kind cluster nodes.

First, let's install vcluster. We'll do this using ClusterAPI.

Note

This requires clusterctl >= 1.9.0
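If clusterctl is already installed, a version comparison along these lines can confirm it meets the requirement. This is a sketch; version_ge is a hypothetical helper, it relies on sort -V being available (GNU coreutils and recent BSD/macOS sort), and the clusterctl version -o short invocation in the usage comment should be checked against your installed release.

```shell
# Succeeds if version A is greater than or equal to version B.
# A leading "v" (as in release tags) is ignored.
version_ge() {
    a="${1#v}"; b="${2#v}"
    # With version sort, B comes first exactly when B <= A.
    [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$b" ]
}

# Usage:
#   version_ge "$(clusterctl version -o short)" 1.9.0 || echo "clusterctl is too old"
```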

If you do not have clusterctl installed on your machine, you can install it on Linux amd64 by running the following. Remember to change the architecture if you are on a different platform.

VERSION=$(curl --silent "https://api.github.com/repos/kubernetes-sigs/cluster-api/releases/latest" | jq -r .tag_name)
curl -L https://github.com/kubernetes-sigs/cluster-api/releases/download/${VERSION}/clusterctl-linux-amd64 -o clusterctl \
    && sudo install -o root -g root -m 0755 clusterctl /usr/local/bin/clusterctl
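If you are not on Linux amd64, the asset name in the download URL has to change. The following sketch derives it from uname; release_arch is a hypothetical helper, and the clusterctl-OS-ARCH naming pattern is an assumption based on the URL above, so check the release page for the assets that actually exist.

```shell
# Map `uname -m` output to the architecture suffix used in release asset names.
release_arch() {
    case "$1" in
        x86_64|amd64)   echo amd64 ;;
        aarch64|arm64)  echo arm64 ;;
        *)              echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

OS="$(uname -s | tr '[:upper:]' '[:lower:]')"   # e.g. linux or darwin
ARCH="$(release_arch "$(uname -m)")" && echo "asset: clusterctl-${OS}-${ARCH}"
```

The same mapping applies to the vcluster binary, whose download URL follows the vcluster-linux-amd64 pattern.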

Once you have the latest clusterctl binary installed, initialise the vcluster provider in the kind cluster with:

clusterctl init --infrastructure vcluster

In order to use the GPU with vCluster, we need to expose the real kind nodes to vcluster rather than using virtual nodes. This is achieved by syncing the real nodes into the virtual cluster, which means setting the following values on the vcluster chart:

sync:
  fromHost:
    nodes:
      enabled: true

Ref: https://www.vcluster.com/docs/vcluster/configure/vcluster-yaml/sync/from-host/nodes

Save the following yaml as cluster.yaml:

Cluster API for vClusters CR
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: kind
  namespace: vcluster
spec:
  controlPlaneRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: VCluster
    name: kind
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: VCluster
    name: kind
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
kind: VCluster
metadata:
  name: kind
  namespace: vcluster
spec:
  controlPlaneEndpoint:
    host: ""
    port: 0
  helmRelease:
    chart:
      name: vcluster
      repo: https://charts.loft.sh
      version: 0.22.1
    values: |-
      sync:
        fromHost:
          nodes:
            enabled: true

Next, create a new vcluster namespace and apply the cluster.yaml file:

kubectl create ns vcluster
kubectl apply -f cluster.yaml

Wait for the cluster to come up and then connect to it with:

vcluster connect kind -n vcluster

Note

As with clusterctl above, if you do not have vcluster installed, you can install it for amd64 with:

VERSION=$(curl --silent "https://api.github.com/repos/loft-sh/vcluster/releases/latest" | jq -r .tag_name)
curl -L -o vcluster "https://github.com/loft-sh/vcluster/releases/download/${VERSION}/vcluster-linux-amd64" \
    && sudo install -c -m 0755 vcluster /usr/local/bin && rm -f vcluster

We can verify that the node has the allocation by running the same command again, this time against the vCluster:

$ kubectl get node -o yaml | yq '.items[] | [{"name": .metadata.name, "status": .status.allocatable."nvidia.com/gpu"}]'
- name: gputest-worker
  status: "1"

Now the GPU operator needs to be installed again, this time inside the vCluster. Because the allocations come from the kind cluster nodes, we can skip the toolkit and install only the operator:

helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set driver.enabled=false,toolkit.enabled=false

Once the operator has started inside the vcluster, create a test pod to verify that the GPU is accessible:

kubectl apply -f - << EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

Wait a few seconds for the pod to start and then check its logs:

$ kubectl logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
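If you want to script this check rather than read the logs by eye, something like the following would do. It is a sketch; vectoradd_passed is a hypothetical helper that looks for the "Test PASSED" line shown above, and the 120 s timeout in the usage comment is arbitrary.

```shell
# Read pod logs on stdin; succeed only if the sample reported success.
vectoradd_passed() {
    grep -q 'Test PASSED'
}

# Usage (the jsonpath form of `kubectl wait` needs kubectl 1.23 or newer):
#   kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-vectoradd --timeout=120s
#   kubectl logs cuda-vectoradd | vectoradd_passed && echo "GPU works inside the vCluster"
```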

That's it. You should now have a GPU-enabled vCluster running inside a kind cluster.