* Add a MachineSet for GPU workers by exporting an existing worker's MachineSet and changing the instance type, name, and selfLink. You have some choices here depending on what you want to do:
- Amazon EC2 P3 Instances have up to 8 NVIDIA Tesla V100 GPUs.
- Amazon EC2 G3 Instances have up to 4 NVIDIA Tesla M60 GPUs.
- Amazon EC2 G4 Instances have up to 4 NVIDIA T4 GPUs.
- Amazon EC2 P4 Instances have up to 8 NVIDIA A100 GPUs.
These instances aren't cheap, so check current pricing before picking one. This demo uses the g4dn.4xlarge (about $1.204/hr at the time of writing).
| Instance | GPUs | vCPU | Memory (GiB) | GPU Memory (GiB) | Instance Storage (GB) | Network Performance (Gbps) | EBS Bandwidth (Gbps) |
|---|---|---|---|---|---|---|---|
| g4dn.4xlarge | 1 | 16 | 64 | 16 | 1 x 225 NVMe SSD | Up to 25 | 4.75 |
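One way to do the export-and-edit is sketched below; the MachineSet name and availability zone here are placeholders, so copy the real values from your own exported MachineSet:

```shell
oc get machinesets -n openshift-machine-api
oc get machineset <some-worker-machineset> -n openshift-machine-api -o yaml > gpu-machineset.yaml
# In gpu-machineset.yaml: give metadata.name and the machineset selector labels a
# new name (e.g. <clusterid>-gpu-worker-<az>), set providerSpec.value.instanceType
# to g4dn.4xlarge, and delete the status, selfLink, uid, and resourceVersion fields.
oc create -f gpu-machineset.yaml
```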
The driver-toolkit imagestream should exist in OpenShift 4.8 and later; however, some 4.8.z streams and 4.9.8 shipped buggy builds, so avoid those releases. Verify it exists:
oc get -n openshift is/driver-toolkit
oc get pods -n openshift-nfd
cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-nfd
EOF
* Install the Node Feature Discovery (NFD) operator from OperatorHub into the new namespace
* Go to Compute -> Nodes and look at the labels
* Create an instance of NodeFeatureDiscovery from the installed operator
* Go to Compute -> Nodes and look at the labels again (now there should be PCI labels)
(note that the PCI IDs are used to identify hardware)
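NVIDIA's PCI vendor ID is 10de, so once NFD has labeled the nodes you can find GPU nodes by label. A quick check (the exact label set depends on your NFD version):

```shell
oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true
```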
cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-gpu-operator
EOF
* Install the NVIDIA GPU operator from OperatorHub into that namespace
* Create an instance of ClusterPolicy from the installed operator
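If you prefer the CLI to the console, one common pattern is to pull the example ClusterPolicy out of the operator's CSV annotations (a sketch; the CSV name varies by operator version, and `jq` is assumed to be installed):

```shell
CSV=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified)
oc get $CSV -n nvidia-gpu-operator \
  -o jsonpath='{.metadata.annotations.alm-examples}' | jq '.[0]' > clusterpolicy.json
oc apply -n nvidia-gpu-operator -f clusterpolicy.json
```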
oc get pods,ds -n nvidia-gpu-operator
oc describe ns/nvidia-gpu-operator | grep cluster-monitoring
oc project nvidia-gpu-operator
oc get pods | grep nvidia-driver-daemonset
POD=$(oc get pods -o name | grep nvidia-driver-daemonset | head -n1)
oc exec -it $POD -- nvidia-smi
cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
oc logs pod/cuda-vectoradd
Warning: the next container image is pretty big (about 6.5 GB)
cat << EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cheminformatics
  labels:
    app: cheminformatics
spec:
  restartPolicy: OnFailure
  containers:
  - name: cheminformatics
    image: "nvcr.io/nvidia/clara/cheminformatics_demo:0.1.2"
    resources:
      limits:
        nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: cheminformatics
spec:
  selector:
    app: cheminformatics
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8888
EOF
oc expose svc cheminformatics
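Once the route exists, you can grab its URL to open the demo in a browser (by default the route name matches the service name):

```shell
echo "http://$(oc get route cheminformatics -o jsonpath='{.spec.host}')"
```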
The above assumes you already have an OpenShift cluster. If you need one, start here: https://console.redhat.com/openshift/create
For example, in an AWS self-hosted install (or if you are a Red Hatter using the RHPDS Open Environment), you would do the following:
Log into your AWS account via the CLI:
aws configure --profile sandbox-uniqueid
export AWS_PROFILE=sandbox-uniqueid
Verify your Route 53 hosted zone is visible:
aws route53 list-hosted-zones-by-name
If you don't have ssh keys to use:
ssh-keygen -t ed25519 -N '' -f ~/.ssh/opentlc-sandbox
Make the config:
./openshift-install create install-config --dir=.
Tweak the config as desired (note: we will add GPU workers later)
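For reference, the worker pool section of install-config.yaml looks roughly like this (a sketch; the replica count and instance type are examples, and the GPU MachineSet is added after install rather than here):

```yaml
compute:
- name: worker
  replicas: 3
  platform:
    aws:
      type: m5.xlarge
```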
Install the cluster:
./openshift-install create cluster --dir=. --log-level=info