@wabouhamad
Last active October 14, 2019 20:41
This was tested on OCP 4.2, build 4.2.0-0.nightly-2019-10-03-224032.
Deploy NFD:
cd $GOPATH/src/github.com/openshift
git clone https://github.com/openshift/cluster-nfd-operator.git
cd cluster-nfd-operator
git checkout release-4.2
make deploy
Verify that the NFD operator pod is running in the openshift-nfd-operator namespace.
Verify that the nfd-master pods (one per master node) and nfd-worker pods (one per worker node) are deployed in the openshift-nfd namespace:
# oc get pods -n openshift-nfd-operator
NAME                           READY   STATUS    RESTARTS   AGE
nfd-operator-b7f4fbff8-sspvz   1/1     Running   0          3d23h
# oc get pods -n openshift-nfd
NAME               READY   STATUS    RESTARTS   AGE
nfd-master-jlfbs   1/1     Running   0          3d23h
nfd-master-lzw2r   1/1     Running   0          3d23h
nfd-master-qlhsj   1/1     Running   0          3d23h
nfd-worker-2xbqn   1/1     Running   2          3d23h
nfd-worker-9ng5z   1/1     Running   2          3d23h
nfd-worker-rz4jl   1/1     Running   3          3d23h
nfd-worker-xqr9h   1/1     Running   2          3d23h
Next, create a new worker MachineSet to add an NVIDIA GPU-enabled node (e.g. a g3.4xlarge or g3.8xlarge instance).
You can save an existing worker MachineSet YAML from the openshift-machine-api namespace and edit it: change the name and set the instance type to g3.4xlarge. The availability zone you choose must also offer GPU-enabled instances. Then run oc create -f <gpu_worker_machineset>.yaml.
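A minimal sketch of that edit, as a MachineSet excerpt; the metadata name, labels, and AMI are cluster-specific, so the values below are placeholders, not taken from a real cluster:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-abc123-gpu-worker-us-east-1a   # placeholder: changed from the copied worker MachineSet's name
  namespace: openshift-machine-api
spec:
  replicas: 1
  template:
    spec:
      providerSpec:
        value:
          instanceType: g3.4xlarge               # changed from the original worker instance type
```

The rest of the copied YAML (selectors, template labels, AMI, subnet) stays as it was in the original worker MachineSet, except that any name-derived labels must match the new name.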
Wait a few minutes for the new GPU worker node to be deployed; verify with oc get nodes and oc describe node.
Once the new GPU worker node joins the cluster, NFD adds its labels to that node, and you should see one label specific to the NVIDIA GPU: "feature.node.kubernetes.io/pci-10de.present=true" (10de is NVIDIA's PCI vendor ID).
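One quick way to confirm the label landed is a label selector query (assumes cluster access; this command is not from the original notes):

```shell
# List only nodes that NFD has labeled as having an NVIDIA PCI device
oc get nodes -l "feature.node.kubernetes.io/pci-10de.present=true"
```

The new GPU worker node should be the only node returned.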
Now deploy SRO:
cd $GOPATH/src/github.com/openshift-psap
git clone https://github.com/openshift-psap/special-resource-operator.git
cd special-resource-operator
git checkout release-4.2
make deploy
Verify that the NVIDIA driver and device-plugin container stack is deployed:
# oc get pods -n openshift-sro
NAME                                         READY   STATUS      RESTARTS   AGE
nvidia-dcgm-exporter-49bgx                   2/2     Running     0          3d21h
nvidia-device-plugin-daemonset-khq4n         1/1     Running     0          3d21h
nvidia-device-plugin-validation              0/1     Completed   0          3d21h
nvidia-driver-daemonset-9tmb9                1/1     Running     0          3d21h
nvidia-driver-validation                     0/1     Completed   0          3d21h
nvidia-feature-discovery-4f5q4               1/1     Running     0          3d21h
nvidia-grafana-67bdb6d6-s62dl                1/1     Running     0          3d21h
special-resource-operator-77cd96658f-b2mk5   1/1     Running     0          3d21h
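As a final smoke test (not part of the original steps), you can schedule a pod that requests the nvidia.com/gpu extended resource advertised by the device plugin; the pod name and image tag below are assumptions, any CUDA-capable image with nvidia-smi works:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test          # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:10.1-base # assumed image; pick one matching your driver
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1        # resource exposed by the NVIDIA device plugin
```

After oc create -f on this file, oc logs nvidia-smi-test should print the nvidia-smi device table for the GPU on the new worker.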