Here's a step-by-step guide to create a cost-conscious Kubernetes cluster in Google Kubernetes Engine (GKE), configure nodes with GPUs, and set up NVIDIA NIM (NVIDIA Inference Microservices) along with deploying an LLM that uses the OpenAI API. First, the prerequisites:
- Google Cloud Account: Ensure you have an active Google Cloud account.
- gcloud CLI: Install the Google Cloud SDK.
- kubectl: Install `kubectl` if it's not already installed.
- NVIDIA GPU Driver Support: Ensure you have access to NVIDIA resources and APIs.
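Before creating anything, it's worth a quick sanity check that the tools are installed and that you're authenticated (all standard gcloud/kubectl commands):

```bash
# Confirm the CLIs are on your PATH
gcloud version
kubectl version --client

# Authenticate with Google Cloud (opens a browser window)
gcloud auth login
```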
- Set your project:

```bash
gcloud config set project [PROJECT_ID]
```
- Enable the GKE and Compute Engine APIs:

```bash
gcloud services enable container.googleapis.com compute.googleapis.com
```
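To confirm both APIs are actually enabled before moving on:

```bash
# List enabled services and filter for the two we need
gcloud services list --enabled | grep -E 'container|compute'
```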
- Create a service account key (optional, for programmatic access):

```bash
gcloud iam service-accounts keys create key.json \
  --iam-account [SERVICE_ACCOUNT]@[PROJECT_ID].iam.gserviceaccount.com
```
- Create the Kubernetes cluster:

```bash
gcloud container clusters create nim-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type e2-standard-2 \
  --cluster-version latest
```
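Once the cluster is up, point `kubectl` at it; without this step, the `kubectl` commands below won't reach the new cluster:

```bash
# Fetch credentials and merge them into your local kubeconfig
gcloud container clusters get-credentials nim-cluster --zone us-central1-a
```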
- Add a GPU node pool:

```bash
gcloud container node-pools create gpu-pool \
  --cluster nim-cluster \
  --zone us-central1-a \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --machine-type n1-standard-4 \
  --num-nodes 1 \
  --node-labels purpose=gpu \
  --node-taints gpu=true:NoSchedule
```
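To double-check that the pool exists and carries the T4 accelerator:

```bash
gcloud container node-pools list --cluster nim-cluster --zone us-central1-a
```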
- Install the NVIDIA device plugin so that GPUs are advertised to Kubernetes (note that on GKE, the GPU drivers themselves are typically installed separately, e.g. via Google's driver-installer DaemonSet; check the GKE GPU documentation for the variant that matches your node image):

```bash
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/stable/deployments/k8s-device-plugin.yaml
```
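Once the plugin pods are running, the GPU should be advertised as an allocatable resource on the GPU node:

```bash
# nvidia.com/gpu should appear under Capacity and Allocatable
kubectl describe nodes -l purpose=gpu | grep nvidia.com/gpu
```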
- Add the Helm repository for NVIDIA:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
```
- Install the NIM operator (chart names can change between releases; `helm search repo nvidia` lists what's currently published):

```bash
helm install nim-operator nvidia/nim-operator \
  --namespace nvidia \
  --create-namespace
```
- Verify the installation:

```bash
kubectl get pods -n nvidia
```
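If you'd rather block until everything is ready than poll manually, `kubectl wait` works here:

```bash
# Wait up to 5 minutes for all operator pods to become Ready
kubectl wait --for=condition=Ready pod --all -n nvidia --timeout=300s
```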
- Deploy a sample application. Create a deployment file (`llm-deployment.yaml`). The `nodeSelector` and `tolerations` below are required because the GPU pool was created with the `gpu=true:NoSchedule` taint; without them the pod would never schedule onto a GPU node. The `python:3.9-slim` image is only a stand-in HTTP server, so replace it with your actual model-serving image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openai-llm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openai-llm
  template:
    metadata:
      labels:
        app: openai-llm
    spec:
      # Schedule onto the tainted GPU pool created earlier
      nodeSelector:
        purpose: gpu
      tolerations:
        - key: gpu
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: openai-llm
          # Stand-in image; replace with your actual model-serving image
          image: python:3.9-slim
          command: ["python"]
          args: ["-m", "http.server", "8000"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
          env:
            - name: OPENAI_API_KEY
              value: "<your-openai-api-key>"  # prefer a Kubernetes Secret in production
```
Apply the deployment:

```bash
kubectl apply -f llm-deployment.yaml
```
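To watch the rollout and catch scheduling problems (such as a missing toleration or an unsatisfiable GPU request) early:

```bash
kubectl rollout status deployment/openai-llm
kubectl get pods -l app=openai-llm -o wide
```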
- Expose the deployment with a LoadBalancer service:

```bash
kubectl expose deployment openai-llm \
  --type=LoadBalancer \
  --name=openai-service \
  --port=8000 \
  --target-port=8000
```
- Verify the service is accessible:

```bash
kubectl get svc openai-service
```

Note the external IP of the service (provisioning the load balancer can take a minute or two).
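To capture the IP programmatically once it has been assigned:

```bash
EXTERNAL_IP=$(kubectl get svc openai-service \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "$EXTERNAL_IP"
```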
- Install `curl` on your local machine (if not already installed).
- Test the deployed LLM service. The `/inference` path and request body here are illustrative; the real endpoint depends on the serving image you deploy (the stand-in `http.server` above only serves GET requests and will not handle this POST):

```bash
curl -X POST http://<EXTERNAL_IP>:8000/inference \
  -H "Authorization: Bearer <your-openai-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is Kubernetes?", "max_tokens": 100}'
```
- Use the Google Cloud Console to monitor the cost of your GPU nodes.
- Shut down or scale the GPU node pool to zero when not in use:

```bash
gcloud container clusters resize nim-cluster \
  --node-pool gpu-pool \
  --num-nodes 0 \
  --zone us-central1-a
```
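Alternatively, let GKE scale the pool down on its own when it's idle; with a minimum of zero nodes, the pool can drain completely between sessions:

```bash
gcloud container node-pools update gpu-pool \
  --cluster nim-cluster \
  --zone us-central1-a \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 1
```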
- You can adapt this workflow to install specific NIM models by updating the Helm values file (see the sketch after this list).
- Replace placeholders like `<your-openai-api-key>` and `<EXTERNAL_IP>` with your actual values.
- For more advanced configurations, refer to the NVIDIA documentation.
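As a rough illustration of the values-file approach, a model override might look like the sketch below. The keys shown here are hypothetical, since the actual schema depends on the chart version you install, so check the chart's own `values.yaml` before using this:

```yaml
# values-nim.yaml -- hypothetical override; verify keys against the chart's values.yaml
image:
  repository: nvcr.io/nim/meta/llama3-8b-instruct  # hypothetical model image path
  tag: latest
resources:
  limits:
    nvidia.com/gpu: 1
```

You would then pass it at install time, e.g. `helm upgrade --install nim-operator nvidia/nim-operator -n nvidia -f values-nim.yaml`.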
Let me know if you need any further assistance!