In this gist we are going to deploy a containerized BentoML service to Kubernetes as a serverless function using Knative.
- A BentoML service that you have already tested locally. Refer to this gist for an example of how to create one.
- A containerized BentoML service. Refer to this gist for more information on how to containerize existing BentoML services.
- A virtual machine or bare-metal server running an Ubuntu/Debian-based OS with an NVIDIA CUDA-enabled GPU, where you can deploy Kubernetes and test this.
I'm doing this on a small desktop I have at home. It has an old GTX 1660 with 6 GB of VRAM. Since the model we are loading is only 600 MB, this system is enough to run our Prompt Engineering service (detailed in the step 2 gist).
We are going to:
- Create a Kubernetes cluster
- Prepare the cluster by enabling various add-ons: Metrics Server, MetalLB (for bare-metal load balancing), the NVIDIA GPU Operator, and Knative
- Deploy the containerized BentoML application as a serverless function using Knative Serving
Let's get started.
You can choose any installer/distribution you like. I am going to use MicroK8s, a zero-ops, CNCF-certified Kubernetes distribution from Canonical.
# Install Microk8s from the snap store
sudo snap install microk8s --classic --channel=1.30
# Add your user to the microk8s group
sudo usermod -a -G microk8s $USER
mkdir -p ~/.kube
chmod 0700 ~/.kube
# Re-enter the session so the new group membership takes effect
newgrp microk8s
# Wait for the K8s cluster to be ready. Takes less than 1 min
microk8s status --wait-ready
# Alias kubectl
alias kubectl="microk8s kubectl"
# Check for pods and nodes
kubectl get po,nodes -A
And voilà, you have a K8s cluster up and running.
# Install K8s metrics server
sudo microk8s enable metrics-server
# This is an optional step. Since I am deploying on a bare-metal server, there is no load balancer available. If you are deploying in the cloud, this step is not needed.
# Provide a private IP range that doesn't collide with your router's existing settings.
sudo microk8s enable metallb
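MicroK8s will prompt for the address pool interactively, but you can also pass it inline. A sketch, assuming the range below is unused on your LAN (adjust it to your own subnet):

```shell
# Enable MetalLB non-interactively with an address pool.
# The range below is an assumption; pick IPs that are free on your network.
sudo microk8s enable metallb:192.168.1.240-192.168.1.250
```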
# Install the NVIDIA GPU Operator - this will take a few minutes. Wait to make sure
# all NVIDIA GPU Operator resources are Running/Completed.
sudo microk8s enable nvidia
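Before moving on, it's worth confirming the operator pods have settled and the node actually advertises the GPU. A quick check (the namespace name is an assumption and may differ between operator versions):

```shell
# Watch the GPU Operator pods until they are all Running or Completed
kubectl get pods -n gpu-operator-resources

# Confirm the node now advertises the nvidia.com/gpu resource
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
```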
# Install Knative
# Refer documentation for more info:
# https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/#verifying-image-signatures
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.14.1/serving-crds.yaml
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.14.1/serving-core.yaml
# Install networking Layer
# We are going to use Kourier, since it's the most lightweight option. You can choose to install Istio or Contour if you'd like.
# Follow the documentation for more information
kubectl apply -f https://github.com/knative/net-kourier/releases/download/knative-v1.14.0/kourier.yaml
kubectl patch configmap/config-network \
--namespace knative-serving \
--type merge \
--patch '{"data":{"ingress-class":"kourier.ingress.networking.knative.dev"}}'
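You can confirm the network configuration took effect by reading the value back:

```shell
# Should print the Kourier ingress class if the patch above applied
kubectl get configmap config-network -n knative-serving \
  -o jsonpath='{.data.ingress-class}'
```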
# Fetch the external Load Balancer IP (in this case, an IP that MetalLB has provisioned for the Kourier ingress service)
kubectl --namespace kourier-system get service kourier
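If you want just the IP for scripting (for example, to plug into a DNS record later), jsonpath helps. A sketch:

```shell
# Grab only the external IP MetalLB assigned to the Kourier service
EXTERNAL_IP=$(kubectl --namespace kourier-system get service kourier \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "$EXTERNAL_IP"
```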
# Configure DNS: https://knative.dev/docs/install/yaml-install/serving/install-serving-with-yaml/#configure-dns
# Since in this gist we are only concerned with local testing of the serverless function, we are not going to worry
# about DNS or exposing services outside the cluster, and will instead set up Magic DNS.
# If you do want external access, use the "Real DNS" option from the documentation instead.
# You can also use something like Cloudflare Tunnels to securely expose your services without any external
# load balancer.
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.14.1/serving-default-domain.yaml
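For reference, Magic DNS simply embeds the load-balancer IP into an sslip.io hostname, so the URL Knative assigns is predictable. A minimal sketch with placeholder values (your Kourier IP and service name will differ):

```shell
# Compose the URL Magic DNS (sslip.io) would assign to a Knative Service.
# All three values below are placeholders/assumptions for illustration.
EXTERNAL_IP="192.168.1.240"   # whatever MetalLB gave the Kourier service
SERVICE="prompt-enhancer"
NAMESPACE="default"
echo "http://${SERVICE}.${NAMESPACE}.${EXTERNAL_IP}.sslip.io"
```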
With all of the above completed, verify the installation:
kubectl get pods -n knative-serving
With this we have everything we need to deploy our BentoML service as a serverless function. Let's get on with the next step.
In this step we are going to define a Knative Service to utilize its serverless Serving capabilities. To do so, first create a knative-serving.yaml file in the root of your containerized BentoML application code, copy over the following contents, and make any necessary changes to the image name:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: prompt-enhancer
  namespace: default
spec:
  template:
    spec:
      containers:
        - name: prompt-enhancer-bentoml
          image: <private-repository>/<image-name>:<tag>
          imagePullPolicy: Always
          ports:
            - containerPort: 3000 # Port to route to
          resources:
            limits:
              nvidia.com/gpu: 1
          livenessProbe:
            httpGet:
              path: /healthz
            initialDelaySeconds: 5
            periodSeconds: 5
          readinessProbe:
            httpGet:
              path: /healthz
            initialDelaySeconds: 15
            periodSeconds: 5
            failureThreshold: 3
            timeoutSeconds: 60
The above uses the container we created in the BentoML build steps to define a serverless function. But before we deploy it, let's make a small update so that Knative knows about our private Docker registry credentials and can use them to pull private images; without this, the deployment would fail.
# Create the Docker Credentials as a Kubernetes Secret
kubectl create secret docker-registry regcred \
--docker-server=<private-registry-url> \
--docker-email=<private-registry-email> \
--docker-username=<private-registry-user> \
--docker-password=<private-registry-password>
# Patch the default service account so that it uses this new credential
kubectl patch serviceaccount default -p "{\"imagePullSecrets\": [{\"name\": \"regcred\"}]}"
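To confirm the patch took effect, you can read the secret name back from the service account (regcred matches the secret we created above):

```shell
# Should print "regcred" if the service account patch applied
kubectl get serviceaccount default \
  -o jsonpath='{.imagePullSecrets[0].name}'
```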
At this point we are ready to deploy our serverless function. Let's go ahead and do that.
kubectl apply -f knative-serving.yaml
Give it a few minutes. If your models are large, the first deployment will take a while as Kubernetes pulls the image. This is mostly a one-time cost: as long as only your application logic changes and the rest of the container stays the same, only the new image layers are downloaded from your container registry on subsequent deploys.
Check its status frequently:
kubectl get ksvc,po
Once the status of the ksvc turns to Ready=True, you should be able to start calling your function. Just open a browser and navigate to the URL shown in the kubectl get ksvc output.
Note: Don't freak out when you don't see the pods. They auto-scale down to zero, one of the benefits of a serverless function. Start-up times are very fast, so don't worry about calling the function: hit the endpoint and you will see the pods starting up right away.
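If you'd rather exercise scale-from-zero from the terminal, a sketch using a hypothetical URL (substitute the one kubectl get ksvc printed for you):

```shell
# Placeholder URL; replace with your own service's URL.
URL="http://prompt-enhancer.default.192.168.1.240.sslip.io"

# The first request triggers scale-from-zero; time it to see cold-start latency
time curl -s "$URL/healthz"

# In another terminal, watch the pod come up and later scale back down
kubectl get po -w
```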
This deployment is a basic example. I have not covered topics such as Knative autoscaling, metrics collection with Prometheus, or shipping logs to persistent storage so you can see logs across all your pods' lifecycles in one place.