@fabiendupont
Last active August 30, 2022 20:54
Habana.ai

Install Habana Labs operator

This Gist explains how to deploy the Habana Labs operator in an OpenShift cluster. The first step is to deploy an OpenShift 4.11 cluster; to do so, simply follow the OpenShift installation documentation.

Prepare nodes

The habanalabs kernel module needs to load firmware, which the driver container copies to /var/lib/firmware on the node. That directory is not on the kernel's default firmware search path, so we need to tell the node kernel to look there. This is done by applying a MachineConfig that sets the firmware_class.path kernel argument on all worker nodes. The following command does it.

$ oc apply -f https://gist.github.com/fabiendupont/8b092ea8d79f28b698e23ae82b644438/raw/machineconfig-firmware-path.yaml

All the worker nodes will reboot, so you may lose access to the OpenShift console for some time.
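
Once the nodes are back, it can be useful to confirm that the kernel argument was actually applied. The sketch below is one way to do that, assuming the default `worker` MachineConfigPool and a logged-in `oc` session. The function names are illustrative helpers, not part of any tool.

```shell
# Sketch: wait for the worker pool rollout, then show the firmware path
# argument each worker kernel booted with.
# Assumption: the pool is the default "worker" MachineConfigPool.

wait_for_worker_pool() {
  # Blocks until the pool reports the Updated condition (or times out).
  # If run right after `oc apply`, the pool may still report Updated from
  # before the change, so give the rollout a moment to start first.
  oc wait machineconfigpool/worker --for=condition=Updated --timeout=30m
}

# Reads a kernel command line on stdin and prints only the
# firmware_class.path=... token, if present.
firmware_path_arg() {
  tr ' ' '\n' | grep '^firmware_class\.path='
}

check_firmware_path() {
  local node
  for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    # `oc debug` starts a temporary pod on the node; /host is the node's root.
    echo "${node}: $(oc debug "${node}" -- chroot /host cat /proc/cmdline 2>/dev/null | firmware_path_arg)"
  done
}
```

Running `wait_for_worker_pool` followed by `check_firmware_path` should list every worker with `firmware_class.path=/var/lib/firmware`. An empty value for a node suggests the MachineConfig has not reached it yet.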

Kernel Module Management operator

The Habana Labs operator relies on the Kernel Module Management (KMM) operator to deploy the kernel module and the device plugin, so we need to install it first. It's available via an OLM catalog that we can add with the following command.

$ oc apply -f https://gist.github.com/fabiendupont/8b092ea8d79f28b698e23ae82b644438/raw/catalogsource-kmmo.yaml

Then, we can head to the OperatorHub page in the OpenShift console and install the OOT Operator (the legacy name of the KMM Operator).
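
If the operator does not show up in OperatorHub, the catalog may not be serving yet. The following sketch checks the state OLM reports for a CatalogSource in `openshift-marketplace`. The helper names are hypothetical, and the `catalog` label used to filter package manifests is an assumption about how OLM labels them on your cluster.

```shell
# Sketch: inspect a CatalogSource added to openshift-marketplace.

# Prints the gRPC connection state OLM last observed for the catalog.
catalog_state() {
  oc get catalogsource "$1" -n openshift-marketplace \
    -o jsonpath='{.status.connectionState.lastObservedState}'
}

# Lists the packages served by the catalog.
# Assumption: package manifests carry a "catalog=<name>" label.
catalog_packages() {
  oc get packagemanifests -n openshift-marketplace -l "catalog=$1"
}
```

For the catalog above, `catalog_state kmmo` should eventually print `READY`. The same helpers work for any other CatalogSource name in that namespace.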

Habana AI operator

Similarly, we need to add a catalog for the Habana Labs operator. Below is the command to add it.

$ oc apply -f https://gist.github.com/fabiendupont/8b092ea8d79f28b698e23ae82b644438/raw/catalogsource-habana-ai.yaml

Then, go to Operators > OperatorHub and install the operator. The default values are fine.

Once the operator is installed, go to Operators > Installed Operators and click on Habana AI Operator. If it's not in the list, check that you are looking at the habana-ai-operator project.

You can then go to the Device Config tab and click on the Create DeviceConfig button. It will open a dialog and you can use the default values. This configures the operator to apply the 1.6.0-439 version of the driver on all nodes with a Habana device (PCI vendor ID 1da3).
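
Since the DeviceConfig targets nodes by PCI vendor ID, it can help to check which nodes actually expose such a device. The sketch below counts PCI devices with a given vendor ID by reading the standard sysfs `vendor` files. The function name is hypothetical and this is not how the operator itself detects devices; it is only a manual check.

```shell
# Sketch: count PCI devices with a given vendor ID under a sysfs tree.
# Arguments: sysfs root (default /sys), vendor ID without 0x (default 1da3).
count_pci_vendor() {
  local sysfs_root="${1:-/sys}" vendor="${2:-1da3}" count=0 f
  for f in "${sysfs_root}"/bus/pci/devices/*/vendor; do
    # Skip the unexpanded glob when no PCI devices exist.
    [ -r "$f" ] || continue
    # Each vendor file holds the ID in the form 0x1da3.
    if [ "$(cat "$f")" = "0x${vendor}" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

The function has to run where the node's sysfs is visible, for example in a shell opened with `oc debug node/<node-name>`. A shorter equivalent from a workstation is `oc debug node/<node-name> -- chroot /host sh -c 'grep -l 0x1da3 /sys/bus/pci/devices/*/vendor'`, which prints one path per matching device.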

Run workload

The following job runs a pod that simply sleeps forever, allowing us to run the hl-smi command from the pod terminal.

$ oc apply -f https://github.com/fabiendupont/habana-ai-smi/raw/main/job.yaml

Note that pods will require the CAP_SYS_RAWIO capability. For that, they'll have to run privileged.
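
Instead of opening the pod terminal in the console, hl-smi can also be run from the command line. The sketch below looks up the pod through the `job-name` label that Kubernetes sets on pods created by a Job. The function name is hypothetical, and the Job name and namespace are parameters because this Gist does not state them; check job.yaml for the actual values.

```shell
# Sketch: run hl-smi in the first pod created by a given Job.
# Arguments: Job name (required), namespace (default "default").
run_hl_smi() {
  local job="$1" namespace="${2:-default}" pod
  # Pods created by a Job carry the label job-name=<job>.
  pod=$(oc get pods -n "$namespace" -l "job-name=${job}" -o name | head -n 1)
  if [ -z "$pod" ]; then
    echo "no pod found for job ${job} in namespace ${namespace}" >&2
    return 1
  fi
  oc exec -n "$namespace" "$pod" -- hl-smi
}
```

Usage would look like `run_hl_smi <job-name> <namespace>`. If the pod is not running yet, `oc exec` fails, so it is worth checking `oc get pods` first.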

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: habana-ai
  namespace: openshift-marketplace
spec:
  displayName: Habana AI
  grpcPodConfig:
    nodeSelector:
      kubernetes.io/os: linux
      node-role.kubernetes.io/master: ''
    priorityClassName: system-cluster-critical
    tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 120
  image: 'ghcr.io/fabiendupont/habana-ai-operator-catalog:v99.0.0'
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
---
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: kmmo
  namespace: openshift-marketplace
spec:
  displayName: KMMO
  grpcPodConfig:
    nodeSelector:
      kubernetes.io/os: linux
      node-role.kubernetes.io/master: ''
    priorityClassName: system-cluster-critical
    tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 120
  image: 'ghcr.io/fabiendupont/oot-operator-catalog:v0.0.1'
  priority: -100
  publisher: Red Hat
  sourceType: grpc
  updateStrategy:
    registryPoll:
      interval: 10m0s
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 05-worker-kernelarg-firmware-path
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - 'firmware_class.path=/var/lib/firmware'