GPU OpenShift demo install and setup

Prerequisites: OCP cluster 4.9+

Setup GPU Nodes

πŸ‘‰ Add a MachineSet for GPU Workers by exporting an existing worker's MachineSet and switching the instance type, name, and selflink. You have some choices here depending on what you want to do.

  • Amazon EC2 P3 Instances have up to 8 NVIDIA Tesla V100 GPUs.
  • Amazon EC2 G3 Instances have up to 4 NVIDIA Tesla M60 GPUs.
  • Amazon EC2 G4 Instances have up to 4 NVIDIA T4 GPUs.
  • Amazon EC2 P4 Instances have up to 8 NVIDIA A100 GPUs.

They're not cheap, so check costs before picking. I demo with the g4dn.4xlarge (currently $1.204/hr).

Instance      GPUs  vCPU  Memory (GiB)  GPU Memory (GiB)  Instance Storage (GB)  Network Performance (Gbps)  EBS Bandwidth (Gbps)
g4dn.4xlarge  1     16    64            16                1 x 225 NVMe SSD       Up to 25                    4.75
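
As referenced above, a minimal sketch of the export-and-edit flow (<worker-machineset> is a placeholder for one of your cluster-specific MachineSet names):

oc get machinesets -n openshift-machine-api

oc get machineset <worker-machineset> -n openshift-machine-api -o yaml > gpu-machineset.yaml

Edit gpu-machineset.yaml: change metadata.name (plus the matching selector and template labels), drop the status/selfLink/uid/resourceVersion/creationTimestamp fields, and set spec.template.spec.providerSpec.value.instanceType to g4dn.4xlarge. Then:

oc create -f gpu-machineset.yaml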

Check for driver toolkit

This should exist in 4.8 and later; however, some 4.8.z streams and 4.9.8 might have a buggy driver toolkit, so avoid those versions.

oc get -n openshift is/driver-toolkit

Check whether NFD pods already exist

oc get pods -n openshift-nfd

Setup NFD Operator

cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
EOF

πŸ‘‰ Install NodeFeatureDiscovery from OperatorHub into the new namespace

πŸ‘‰ Goto Compute -> Nodes -> look at labels

πŸ‘‰ Create an instance of NFD from operator hub installed operator

πŸ‘‰ Goto Compute -> Nodes -> look at labels (now there should be pci labels)

(note that PCI vendor IDs are used to identify hardware; NVIDIA's is 10de)
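
For example, once NFD has labeled the nodes, this should list your GPU nodes (feature.node.kubernetes.io/pci-10de.present is the label NFD sets for NVIDIA PCI devices):

oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true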

Setup GPU Operator

cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
EOF

πŸ‘‰ Install GPU operator into that namespace

πŸ‘‰ Create an instance of ClusterPolicy from the installed operator

Wait 10-20 min for the driver build and the operator's daemonsets to come up

Verify installation

oc get pods,ds -n nvidia-gpu-operator

Sanity check that we have monitoring set up

oc describe ns/nvidia-gpu-operator | grep cluster-monitoring

Look at GPU info

oc project nvidia-gpu-operator

POD=$(oc get pods -o name | grep nvidia-driver-daemonset | head -1)

oc exec -it $POD -- nvidia-smi

Demo GPU app

cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

oc logs pod/cuda-vectoradd

If the pod landed on a GPU node, the log should end with "Test PASSED".

More Advanced Demo (jupyter notebook demos)

Warning: this is a pretty big container - 6.5GB

cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cheminformatics
  labels:
    app: cheminformatics
spec:
  restartPolicy: OnFailure
  containers:
  - name: cheminformatics
    image: "nvcr.io/nvidia/clara/cheminformatics_demo:0.1.2"
    resources:
      limits:
        nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: cheminformatics
spec:
  selector:
    app: cheminformatics
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8888
EOF

oc expose svc cheminformatics
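
Then grab the route's hostname (oc expose names the route after the service) and open it in a browser to reach the Jupyter UI:

oc get route cheminformatics -o jsonpath='{.spec.host}'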


dudash commented Mar 31, 2022

Maybe try this too:
kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml


dudash commented Mar 31, 2022

And if you have knative installed, plus have a serverless app needing GPU(s),
try this: kn service create hello --image <service-image> --limit nvidia.com/gpu=1


dudash commented Jun 21, 2022

The above assumes you've already got an OpenShift cluster. If you need to get one, go here: https://console.redhat.com/openshift/create

E.g., in an AWS self-hosted install (or if you are a Red Hatter using the RHPDS Open Environment) you would do these things:
Log in to your AWS account via the CLI:

  • aws configure --profile sandbox-uniqueid
  • export AWS_PROFILE=sandbox-uniqueid
  • aws route53 list-hosted-zones-by-name

If you don’t have ssh keys to use: ssh-keygen -t ed25519 -N '' -f ~/.ssh/opentlc-sandbox

Make the config: ./openshift-install create install-config --dir=.
Tweak the config as desired (note: we will add GPU workers later)
Install the cluster: ./openshift-install create cluster --dir=. --log-level=info
