GPU OpenShift demo install and setup

Prerequisites: OCP cluster 4.9+

Setup GPU Nodes

πŸ‘‰ Add a MachineSet for GPU Workers by exporting an existing worker's MachineSet and switching the instance type, name, and selflink. You have some choices here depending on what you want to do.

  • Amazon EC2 P3 Instances have up to 8 NVIDIA Tesla V100 GPUs.
  • Amazon EC2 G3 Instances have up to 4 NVIDIA Tesla M60 GPUs.
  • Amazon EC2 G4 Instances have up to 4 NVIDIA T4 GPUs.
  • Amazon EC2 P4 Instances have up to 8 NVIDIA A100 GPUs.

They're not cheap, so check costs before picking. I demo with the g4dn.4xlarge (currently $1.204/hr).

Instance      GPUs  vCPU  Memory (GiB)  GPU Memory (GiB)  Instance Storage (GB)  Network Performance (Gbps)  EBS Bandwidth (Gbps)
g4dn.4xlarge  1     16    64            16                1 x 225 NVMe SSD       Up to 25                    4.75
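
As referenced above, a minimal sketch of the export-and-edit flow (<worker-machineset> is a placeholder for one of your cluster-specific MachineSet names):

oc get machinesets -n openshift-machine-api

oc get machineset <worker-machineset> -n openshift-machine-api -o yaml > gpu-machineset.yaml

Edit gpu-machineset.yaml: change metadata.name (plus the matching selector and template labels), drop the status/selfLink/uid/resourceVersion/creationTimestamp fields, and set spec.template.spec.providerSpec.value.instanceType to g4dn.4xlarge. Then:

oc create -f gpu-machineset.yaml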

Check for driver toolkit

This should exist in 4.8 and later; however, some 4.8.z streams and 4.9.8 might have a buggy driver toolkit, so avoid those versions.

oc get -n openshift is/driver-toolkit

Check whether NFD pods already exist

oc get pods -n openshift-nfd

Setup NFD Operator

cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
EOF

πŸ‘‰ Install NodeFeatureDiscovery from OperatorHub into the new namespace

πŸ‘‰ Goto Compute -> Nodes -> look at labels

πŸ‘‰ Create an instance of NFD from operator hub installed operator

πŸ‘‰ Goto Compute -> Nodes -> look at labels (now there should be pci labels)

(note that PCI vendor IDs are used to identify hardware; NVIDIA's is 10de)
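
For example, once NFD has labeled the nodes, this should list your GPU nodes (feature.node.kubernetes.io/pci-10de.present is the label NFD sets for NVIDIA PCI devices):

oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true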

Setup GPU Operator

cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
EOF

πŸ‘‰ Install GPU operator into that namespace

πŸ‘‰ Create an instance of ClusterPolicy from the installed operator

Wait 10-20 min for the driver build and the operator's daemonsets to come up

Verify installation

oc get pods,ds -n nvidia-gpu-operator

Sanity check that we have monitoring set up

oc describe ns/nvidia-gpu-operator | grep cluster-monitoring

Look at GPU info

oc project nvidia-gpu-operator

POD=$(oc get pods -o name | grep nvidia-driver-daemonset | head -1)

oc exec -it $POD -- nvidia-smi

Demo GPU app

cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

oc logs pod/cuda-vectoradd

If the pod landed on a GPU node, the log should end with "Test PASSED".

More Advanced Demo (jupyter notebook demos)

Warning: this is a pretty big container - 6.5GB

cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cheminformatics
  labels:
    app: cheminformatics
spec:
  restartPolicy: OnFailure
  containers:
  - name: cheminformatics
    image: "nvcr.io/nvidia/clara/cheminformatics_demo:0.1.2"
    resources:
      limits:
        nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: cheminformatics
spec:
  selector:
    app: cheminformatics
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8888
EOF

oc expose svc cheminformatics
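
Then grab the route's hostname (oc expose names the route after the service) and open it in a browser to reach the Jupyter UI:

oc get route cheminformatics -o jsonpath='{.spec.host}'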


dudash commented Mar 31, 2022

Maybe try this too:
kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml


dudash commented Mar 31, 2022

And if you have knative installed, plus have a serverless app needing GPU(s),
try this: kn service create hello --image <service-image> --limit nvidia.com/gpu=1


dudash commented Jun 21, 2022

The above assumes you've already got an OpenShift cluster. If you need to get one, go here: https://console.redhat.com/openshift/create

E.g., in an AWS self-hosted install (or if you are a Red Hatter using the RHPDS Open Environment) you would do these things:
Log in to your AWS account via the CLI:

  • aws configure --profile sandbox-uniqueid
  • export AWS_PROFILE=sandbox-uniqueid
  • aws route53 list-hosted-zones-by-name

If you don’t have ssh keys to use: ssh-keygen -t ed25519 -N '' -f ~/.ssh/opentlc-sandbox

Make the config: ./openshift-install create install-config --dir=.
Tweak the config as desired (note: we will add GPU workers later)
Install the cluster: ./openshift-install create cluster --dir=. --log-level=info
