- Use my fork of ACS-Engine on the `k8s-gpu-flag` branch.
- Build and launch ACS-Engine, and customize the Kubernetes example template to use NC6 VMs (example).
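A minimal sketch of that flow, assuming the fork lives at github.com/wbuchwalter/acs-engine and the old acs-engine CLI that took the cluster definition as its only argument; verify both against the fork's README:

```bash
# Clone the fork and switch to the GPU branch
git clone https://github.com/wbuchwalter/acs-engine.git
cd acs-engine
git checkout k8s-gpu-flag

# Build the acs-engine binary
make build

# In the example cluster definition, set the agent pool's VM size to a GPU SKU:
#   "agentPoolProfiles": [ { ..., "vmSize": "Standard_NC6" } ]
# then generate the ARM templates (written under _output/<dnsprefix>/)
./acs-engine examples/kubernetes.json
```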
- Create a resource group (in southcentralus, or another region with GPU availability):

```bash
az group create --name k8s --location southcentralus
```

- Deploy the generated template:

```bash
az group deployment create --resource-group k8s --template-file azuredeploy.json --parameters @azuredeploy.parameters.json
```
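Once the deployment finishes, you can point kubectl at the cluster and check that the agents registered; the kubeconfig path below is an assumption based on acs-engine's default `_output` layout:

```bash
# Path is an assumption; <dnsprefix> is whatever you set in the cluster definition
export KUBECONFIG=_output/<dnsprefix>/kubeconfig/kubeconfig.southcentralus.json
kubectl get nodes
```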
- Either SSH into each node and run this script: install-nvidia-driver.sh
- Or enable SSH agent forwarding on your master and run this one, which will take care of everything: https://github.com/wbuchwalter/acs-k8s-gpu/blob/master/setup-nodes.sh
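For reference, here is roughly what a driver install looks like on an Ubuntu 16.04 NC6 node; this is a sketch, not the exact contents of the scripts above, and the CUDA repo package version (8.0.61) is an era-appropriate assumption:

```bash
# Register NVIDIA's CUDA apt repository (package version is an assumption)
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update

# Install the drivers, then verify the K80 is visible
sudo apt-get install -y cuda-drivers
nvidia-smi
```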
TODO: It would be cool to improve this script to install the NVIDIA libraries and binaries in a dedicated folder, which would make it much easier to expose the drivers to the container in the next step.
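For instance, the NVIDIA runfile installer has prefix options that could be used for this (check `--advanced-options` for the exact set on your driver version); the driver version and target directory below are hypothetical:

```bash
# Hypothetical: install the driver's utilities and libraries under
# /usr/local/nvidia so a single hostPath volume can expose them to containers
sudo ./NVIDIA-Linux-x86_64-375.39.run --silent \
  --utility-prefix=/usr/local/nvidia \
  --opengl-prefix=/usr/local/nvidia
```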
- You need to specify `alpha.kubernetes.io/nvidia-gpu: 1` as both a limit and a request.
- You need to expose the drivers to the container as volumes. If you are using the official TensorFlow Docker image, it is based on Ubuntu 16.04, just like your cluster's VMs, so you can simply mount `/usr/bin` and `/usr/lib/x86_64-linux-gnu`. It's a bit dirty, but it works. Ideally, improve the previous script to install the drivers into a dedicated directory and expose only that one.

For example, this pod requests one GPU and runs `nvidia-smi`:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  containers:
    - image: nvidia/cuda
      name: nvidia-smi
      args:
        - nvidia-smi
      resources:
        limits:
          alpha.kubernetes.io/nvidia-gpu: 1
        requests:
          alpha.kubernetes.io/nvidia-gpu: 1
      volumeMounts:
        - mountPath: /usr/bin/
          name: binaries
        - mountPath: /usr/lib/x86_64-linux-gnu
          name: libraries
  volumes:
    - name: binaries
      hostPath:
        path: /usr/bin/
    - name: libraries
      hostPath:
        path: /usr/lib/x86_64-linux-gnu
```
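To try it out (assuming the manifest above is saved as `nvidia-smi.yaml`):

```bash
kubectl create -f nvidia-smi.yaml
# Once the container has run, the logs should show the K80
kubectl logs nvidia-smi
```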
- My acs-engine fork uses the current released version of Kubernetes, which only supports 1 GPU per node. I plan on upgrading to the 1.6 beta, which adds multi-GPU support, but haven't had time yet.