- Use my fork of ACS-Engine on the `k8s-gpu-flag` branch.
- Build and launch ACS-Engine, and customize the Kubernetes example template to use NC6 VMs (example).
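A minimal sketch of that flow, assuming the fork lives at github.com/wbuchwalter/acs-engine and the old acs-engine CLI that took the cluster definition as its only argument; verify both against the fork's README:

```bash
# Clone the fork and switch to the GPU branch
git clone https://github.com/wbuchwalter/acs-engine.git
cd acs-engine
git checkout k8s-gpu-flag

# Build the acs-engine binary
make build

# In the example cluster definition, set the agent pool's VM size to a GPU SKU:
#   "agentPoolProfiles": [ { ..., "vmSize": "Standard_NC6" } ]
# then generate the ARM templates (written under _output/<dnsprefix>/)
./acs-engine examples/kubernetes.json
```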
- Create a resource group (in southcentralus, or another region with GPU availability):

```bash
az group create --name k8s --location southcentralus
```

- Deploy the generated template:

```bash
az group deployment create --resource-group k8s --template-file azuredeploy.json --parameters @azuredeploy.parameters.json
```
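Once the deployment finishes, you can point kubectl at the cluster and check that the agents registered; the kubeconfig path below is an assumption based on acs-engine's default `_output` layout:

```bash
# Path is an assumption; <dnsprefix> is whatever you set in the cluster definition
export KUBECONFIG=_output/<dnsprefix>/kubeconfig/kubeconfig.southcentralus.json
kubectl get nodes
```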
- Either SSH into each node and run this script: install-nvidia-driver.sh
- Or enable SSH agent forwarding on your master and run this one, which will take care of everything: https://github.com/wbuchwalter/acs-k8s-gpu/blob/master/setup-nodes.sh
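For reference, here is roughly what a driver install looks like on an Ubuntu 16.04 NC6 node; this is a sketch, not the exact contents of the scripts above, and the CUDA repo package version (8.0.61) is an era-appropriate assumption:

```bash
# Register NVIDIA's CUDA apt repository (package version is an assumption)
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update

# Install the drivers, then verify the K80 is visible
sudo apt-get install -y cuda-drivers
nvidia-smi
```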
TODO: It would be cool to improve this script to install the NVIDIA libraries and binaries in a dedicated folder, which would make it much easier to expose the drivers to the container in the next step.
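For instance, the NVIDIA runfile installer has prefix options that could be used for this (check `--advanced-options` for the exact set on your driver version); the driver version and target directory below are hypothetical:

```bash
# Hypothetical: install the driver's utilities and libraries under
# /usr/local/nvidia so a single hostPath volume can expose them to containers
sudo ./NVIDIA-Linux-x86_64-375.39.run --silent \
  --utility-prefix=/usr/local/nvidia \
  --opengl-prefix=/usr/local/nvidia
```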
- You need to specify `alpha.kubernetes.io/nvidia-gpu: 1` as both a limit and a request.
- You need to expose the drivers to the container as volumes. If you are using the official TensorFlow Docker image, it is based on Ubuntu 16.04, just like your cluster's VMs, so you can simply mount `/usr/bin` and `/usr/lib/x86_64-linux-gnu`. It's a bit dirty, but it works. Ideally, improve the previous script to install the drivers into a dedicated directory and expose only that one.

For example, this pod requests one GPU and runs `nvidia-smi`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  containers:
    - image: nvidia/cuda
      name: nvidia-smi
      args:
        - nvidia-smi
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 1
        requests:
          alpha.kubernetes.io/nvidia-gpu: 1
      volumeMounts:
        - mountPath: /usr/bin/
          name: binaries
        - mountPath: /usr/lib/x86_64-linux-gnu
          name: libraries
  volumes:
    - name: binaries
      hostPath:
        path: /usr/bin/
    - name: libraries
      hostPath:
        path: /usr/lib/x86_64-linux-gnu
```
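To try it out (assuming the manifest above is saved as `nvidia-smi.yaml`):

```bash
kubectl create -f nvidia-smi.yaml
# Once the container has run, the logs should show the K80
kubectl logs nvidia-smi
```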
- My acs-engine fork uses the current released version of Kubernetes, which only supports 1 GPU per node. I plan on upgrading to the 1.6 beta, which adds multi-GPU support, but haven't had time yet.