This Gist explains how to deploy the Habana Labs operator in an OpenShift cluster. The first step is to deploy an OpenShift 4.11 cluster. Simply follow the documentation.
The habanalabs module needs to load firmware, and the driver container copies it to /var/lib/firmware on the node. That folder is not on the kernel's default firmware search path, so we need to tell the node kernel to look there. This is done by applying a MachineConfig to all worker nodes. The following command does it.
$ oc apply -f https://gist.github.com/fabiendupont/8b092ea8d79f28b698e23ae82b644438/raw/machineconfig-firmware-path.yaml
All the worker nodes will reboot, so you may lose access to the OpenShift console for some time.
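To confirm the reboot cycle has finished and the new search path is active, something like the sketch below may help. It assumes the MachineConfig sets the kernel's firmware_class.path parameter (the YAML is not reproduced here, so this is a guess at the mechanism) and that the pool is named worker. The function names and the node name passed to them are placeholders.

```shell
# Sketch only: helper functions around standard oc commands.

wait_for_workers() {
  # Block until the "worker" MachineConfigPool reports the Updated condition.
  # $1: optional timeout (default 30m).
  oc wait machineconfigpool/worker --for=condition=Updated --timeout="${1:-30m}"
}

show_firmware_path() {
  # Print the custom firmware search path seen by the node's kernel.
  # $1: name of a worker node (placeholder, take it from `oc get nodes`).
  oc debug "node/$1" -- chroot /host cat /sys/module/firmware_class/parameters/path
}
```

Calling wait_for_workers and then show_firmware_path with a worker node name should print /var/lib/firmware if the assumption about the kernel parameter holds. The pool can still report Updated for a short moment right after the apply, so it is worth waiting until the rollout has visibly started.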
The Habana Labs operator relies on the Kernel Module Management (KMM) operator to deploy the kernel module and the device plugin, so we need to install it first. It's available via an OLM catalog that we can add with the following command.
$ oc apply -f https://gist.github.com/fabiendupont/8b092ea8d79f28b698e23ae82b644438/raw/catalogsource-kmmo.yaml
Then, we can head to the OperatorHub page in the OpenShift console and install the OOT Operator (the legacy name of the KMM Operator).
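If the operator does not show up in OperatorHub, it can be useful to check from the command line whether the catalog was picked up. The sketch below assumes the CatalogSource lives in the openshift-marketplace namespace, which is common but not confirmed by this Gist. The function names and the search term are illustrative.

```shell
# Sketch only: inspect OLM catalogs from the CLI.

catalog_status() {
  # List CatalogSources with their last observed connection state
  # (READY once the catalog pod is serving).
  # $1: optional namespace (assumed default: openshift-marketplace).
  oc get catalogsource -n "${1:-openshift-marketplace}" \
    -o custom-columns=NAME:.metadata.name,STATE:.status.connectionState.lastObservedState
}

find_package() {
  # Case-insensitive search of the packages exposed by the catalogs.
  # $1: search term, $2: optional namespace.
  oc get packagemanifests -n "${2:-openshift-marketplace}" | grep -i -- "$1"
}
```

For example, find_package habana should list the Habana package once its catalog has been added, and the same function works for the KMM catalog with a different search term.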
Similarly, we need to add a catalog for the Habana Labs operator. Below is the command to add it.
$ oc apply -f https://gist.github.com/fabiendupont/8b092ea8d79f28b698e23ae82b644438/raw/catalogsource-habana-ai.yaml
Then, go to Operators > OperatorHub and install the operator. The default values are fine.
Once the operator is installed, go to Operators > Installed Operators and click on Habana AI Operator. If it's not in the list, check that you are looking at the habana-ai-operator project.
You can then go to the Device Config tab and click on the Create DeviceConfig button. It will open a dialog and you can use the default values. This configures the operator to apply the 1.6.0-439 version of the driver on all nodes with a Habana device (PCI vendor ID 1da3).
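To see whether a given node actually exposes a device with that vendor ID, one option is to read the vendor files under sysfs. The function below is a minimal sketch, not something the operator itself runs. Its name and its second argument (an alternative sysfs root, handy for dry runs) are made up for illustration.

```shell
# Sketch only: count PCI devices with a given vendor ID by reading sysfs.
# $1: vendor ID in lowercase hex without the 0x prefix (default: 1da3).
# $2: directory holding one sub-directory per PCI device
#     (default: /sys/bus/pci/devices).
count_pci_vendor() (
  vendor="${1:-1da3}"
  root="${2:-/sys/bus/pci/devices}"
  n=0
  for f in "$root"/*/vendor; do
    [ -r "$f" ] || continue
    # Each vendor file holds a value such as 0x1da3.
    if [ "$(cat "$f")" = "0x${vendor}" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
)
```

The function has to run where the devices are visible, for instance in a shell opened with oc debug node/NODE followed by chroot /host, where NODE is a placeholder for a worker node name. A result of 0 means the DeviceConfig would not match that node.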
The following job will run a pod that simply sleeps forever, allowing us to run the hl-smi command from the pod terminal.
$ oc apply -f https://github.com/fabiendupont/habana-ai-smi/raw/main/job.yaml
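Instead of opening the pod terminal in the console, hl-smi can also be run from the command line. The sketch below relies on the job-name label that Kubernetes sets on pods created by a Job. The Job name and namespace are not stated in this Gist, so both arguments are placeholders to fill in from oc get jobs.

```shell
# Sketch only: run hl-smi in the first pod created by a given Job.
# $1: Job name (placeholder), $2: namespace (placeholder).
run_hl_smi() (
  # Pods created by a Job carry the label job-name=<job name>.
  pod="$(oc get pods -n "$2" -l "job-name=$1" \
    -o jsonpath='{.items[0].metadata.name}')"
  oc exec -n "$2" "$pod" -- hl-smi
)
```

This assumes the pod is already Running. If the lookup returns no pod, the oc exec call fails with an error rather than retrying.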
Note that pods will require the CAP_SYS_RAWIO capability. For that, they'll have to run privileged.