romilbhardwaj/SkyPilotLocalGPUs.md

## SkyPilotLocalGPUs.md

      
Using local GPUs with SkyPilot + Kubernetes

This guide shows how to use the GPUs on your local machine with SkyPilot. It sets up a local Kubernetes cluster with kind and exposes the host's GPUs to it, so that SkyPilot's Kubernetes support can run tasks on them.
Inspired by Klueska's comment and Sam Stoelinga's blog post.
Prerequisites


Docker
SkyPilot
NVIDIA Container Toolkit. If it is not installed, follow the guide below.
kind, Helm, and kubectl, which the commands below rely on.

Install the NVIDIA container toolkit

Follow the official install docs:

Configure the repository:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list


Install the NVIDIA Container Toolkit packages:

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit


Configure the NVIDIA runtime as the default runtime for Docker:

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker
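
To confirm that the change took effect, one option is to ask Docker which runtime it now uses by default. The sketch below wraps that check in a hypothetical helper, check_default_runtime (a name made up for this guide, not part of Docker or SkyPilot). It assumes that docker info exposes a DefaultRuntime field, which recent Docker releases do.

```shell
# Hypothetical helper: succeeds only if Docker's default runtime is "nvidia".
check_default_runtime() {
  local runtime
  # Ask the Docker daemon for its default runtime; fail if the daemon is unreachable.
  runtime="$(docker info --format '{{.DefaultRuntime}}')" || return 1
  if [ "$runtime" = "nvidia" ]; then
    echo "OK: default runtime is nvidia"
  else
    echo "Default runtime is '${runtime}', expected 'nvidia'" >&2
    return 1
  fi
}
```

Calling check_default_runtime after the restart should print the OK line. A non-zero exit suggests that the nvidia-ctk step was not applied or that Docker was not restarted.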

Creating your local Kubernetes cluster with GPUs


Set accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml:

sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /etc/nvidia-container-runtime/config.toml
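
Note that the sed command above only rewrites a line that already mentions the option (commented out or not); if the file contains no such line, nothing changes. A minimal sketch of a check, using a hypothetical helper named check_volume_mounts_enabled that takes the config path as an optional argument:

```shell
# Hypothetical helper: succeeds if the option is present, uncommented, and set to true.
# The path argument defaults to the file edited above.
check_volume_mounts_enabled() {
  local config="${1:-/etc/nvidia-container-runtime/config.toml}"
  grep -Eq '^[[:space:]]*accept-nvidia-visible-devices-as-volume-mounts[[:space:]]*=[[:space:]]*true' "$config"
}
```

If check_volume_mounts_enabled returns non-zero, the option is probably missing from the file and has to be added by hand.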


Create a kind cluster:

kind create cluster --name skypilot --config - <<EOF
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  image: kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72
  # required for GPU workaround
  extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all
EOF
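
Before continuing, it can help to confirm that the cluster exists and that kubectl can reach it. The sketch below is one way to do that; check_kind_cluster is a hypothetical helper name, and it relies on kind's convention of naming the kubeconfig context kind-<cluster name>.

```shell
# Hypothetical helper: checks that the kind cluster exists and is reachable.
# The cluster name defaults to "skypilot", the name used in this guide.
check_kind_cluster() {
  local name="${1:-skypilot}"
  # `kind get clusters` prints one cluster name per line.
  kind get clusters | grep -qx "$name" || {
    echo "kind cluster '${name}' not found" >&2
    return 1
  }
  # kind registers the cluster under the context "kind-<name>".
  kubectl cluster-info --context "kind-${name}"
}
```

Running check_kind_cluster with no arguments checks the skypilot cluster created above.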


Patch the missing ldconfig.real inside the kind node:

# https://github.com/NVIDIA/nvidia-docker/issues/614#issuecomment-423991632
docker exec -ti skypilot-control-plane ln -s /sbin/ldconfig /sbin/ldconfig.real


Install the NVIDIA GPU operator:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator --set driver.enabled=false


Wait for the GPU operator to finish installing; this may take a couple of minutes. Check the status with kubectl get pods -A and make sure that every pod in the gpu-operator namespace is Running or Completed.
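
Rather than scanning the pod list by eye, the status check can be scripted. The sketch below defines a hypothetical filter, pending_pods, that reads the output of kubectl get pods -n gpu-operator --no-headers on stdin. It assumes the default column layout (NAME, READY, STATUS, ...) and treats Completed as healthy, since the operator's validation pods are expected to run once and exit.

```shell
# Hypothetical helper: reads `kubectl get pods -n <ns> --no-headers` output on stdin
# and prints the name of every pod whose STATUS is neither Running nor Completed.
# Exits non-zero if any such pod exists, or if there are no pods at all yet.
pending_pods() {
  awk '
    $3 != "Running" && $3 != "Completed" { print $1; bad = 1 }
    END { exit (bad || NR == 0) }
  '
}
```

For example, until kubectl get pods -n gpu-operator --no-headers | pending_pods; do sleep 10; done polls every ten seconds and prints the pods that are still starting up.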


Verify that the GPU operator is installed correctly by running kubectl describe nodes | grep nvidia.com/gpu. The output should include lines similar to the following:
nvidia.com/gpu:  1
nvidia.com/gpu:  1
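
For a per-node view of the same information, kubectl's custom-columns output can list each node next to its allocatable GPU count. This is a sketch; list_node_gpus is a hypothetical helper name, and the backslash escapes the dot inside the nvidia.com/gpu resource name.

```shell
# Hypothetical helper: prints each node with its allocatable nvidia.com/gpu count.
# Nodes without GPUs show <none> in the GPUS column.
list_node_gpus() {
  kubectl get nodes -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

A node that shows <none> here has not been picked up by the GPU operator's device plugin yet.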


Run the SkyPilot GPU labeling script to label the nodes that have GPUs:


python -m sky.utils.kubernetes.gpu_labeler


Wait for the labeling jobs to complete. To check their status, run kubectl get jobs -n kube-system -l job=sky-gpu-labeler.
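
Instead of polling by hand, kubectl wait can block until the jobs finish. The sketch below wraps it in a hypothetical helper, wait_for_gpu_labeler, and reuses the job=sky-gpu-labeler selector and kube-system namespace from the command above. The default timeout of 300 seconds is an arbitrary choice.

```shell
# Hypothetical helper: blocks until every GPU labeling job has completed,
# or fails once the timeout (default: 300s) expires.
wait_for_gpu_labeler() {
  local timeout="${1:-300s}"
  kubectl wait --for=condition=complete job \
    -l job=sky-gpu-labeler -n kube-system --timeout="$timeout"
}
```

Once wait_for_gpu_labeler returns, the resulting labels can be inspected with kubectl get nodes -L skypilot.co/accelerator, assuming that is the label key your SkyPilot version applies.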


Run sky check. This should show Kubernetes: enabled without any warnings.


You're ready to go! Run sky show-gpus --cloud kubernetes to see the GPUs available on your local machine:


$ sky show-gpus --cloud kubernetes
COMMON_GPU  AVAILABLE_QUANTITIES
T4          1, 2

You should then be able to run SkyPilot commands as usual, e.g.:
sky launch -c test --cloud kubernetes --gpus T4:1 -- nvidia-smi