OPENSHIFT GPU CLUSTER — DEEP DIVE
OpenShift Container Platform 4.21 · NVIDIA H200 · AMD MI350X · Multi-Node GenAI
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ GenAI Workload: Meta Llama 3.1 405B ]
~900 GB VRAM in BF16 · all-reduce of 810 GB every training step
A single H100 node (8 × 80 GB = 640 GB) can't fit it → multi-node + RDMA required
        │
        ▼
┌──────────────────────────────────────────┐
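The sizing claims above can be sketched as back-of-the-envelope arithmetic (the 405B parameter count and BF16 width come from the model name and the text; the 8 × 80 GB per-node figure is the standard H100 node layout):

```python
# Rough sizing for why Llama 3.1 405B in BF16 forces multi-node training.
PARAMS = 405e9        # parameter count implied by the model name
BYTES_BF16 = 2        # BF16 = 2 bytes per parameter

weights_gb = PARAMS * BYTES_BF16 / 1e9   # ~810 GB of raw weights
# The gradient all-reduce each step moves one BF16-sized copy of the
# weights, which is where the "810 GB every training step" figure comes from.
allreduce_gb = weights_gb

# A single 8x H100-80GB node offers 640 GB of VRAM in total.
node_vram_gb = 8 * 80

nodes_needed = -(-weights_gb // node_vram_gb)   # ceiling division
print(f"weights: {weights_gb:.0f} GB, per-step all-reduce: {allreduce_gb:.0f} GB")
print(f"8x H100 node: {node_vram_gb} GB -> at least {int(nodes_needed)} nodes for weights alone")
```

Note this counts only the weights; activations and optimizer state push the real footprint higher, hence the ~900 GB figure and the need for RDMA between nodes.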
AI INFERENCE INGRESS FLOW — AGENTIC MODE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ON-PREMISE AGENTIC AI — MCP TOOL CALL FLOW
Red Hat OpenShift AI 3.4 · MaaS · vLLM · llm-d · MCP
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  AI Agent   │   │  Dev / IDE  │   │  Pipeline   │
│ LangChain / │   │  VS Code /  │   │ CI/CD / RAG │
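At the wire level, the MCP tool calls flowing through this diagram are JSON-RPC 2.0 messages. A minimal sketch of what a client sends (the tool name and arguments below are hypothetical examples, not part of any real server):

```python
import json

# Minimal sketch of the JSON-RPC 2.0 message an MCP client sends to
# invoke a tool. Tool name and arguments are made-up illustrations.
tool_call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_docs",                 # hypothetical tool
        "arguments": {"query": "GPU quota"},   # hypothetical arguments
    },
}
wire = json.dumps(tool_call)
print(wire)
```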
AI INFERENCE INGRESS FLOW (Red Hat OpenShift AI Self-Managed 3.4 + Models-as-a-Service)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ON-PREMISE AI INFERENCE INGRESS
Red Hat OpenShift AI 3.4 · MaaS · vLLM · llm-d
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ App / RAG Pipeline / Chat UI ]
POST /llm/gpt-oss-120b/v1/chat/completions
        │
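The POST shown above is a standard OpenAI-compatible chat completion request. A sketch of its shape (the base URL and API key are placeholders; only the route and model name come from the diagram):

```python
import json

# Request shape behind the POST in the diagram above.
BASE_URL = "https://maas.example.com"             # placeholder MaaS host
path = "/llm/gpt-oss-120b/v1/chat/completions"    # route from the diagram
payload = {
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Summarize our GPU quota policy."}],
    "max_tokens": 256,
    "stream": True,   # streaming keeps time-to-first-token low for chat UIs
}
headers = {
    "Authorization": "Bearer <MAAS_API_KEY>",     # placeholder credential
    "Content-Type": "application/json",
}
body = json.dumps(payload)
print("POST", BASE_URL + path)
```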
*********************** OpenShift Heterogeneous GPU Cluster ***********************
Example:
NVIDIA H100/H200/B200 + AMD MI350X + CPU-Only Workers + InfiniBand
One cluster. Multiple GPU vendors. Zero drama.
Here’s exactly what we’re running in production:
┌────────────────────────────────────┐
│      OpenShift Control Plane       │
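What makes mixed vendors drama-free is that each vendor's device plugin exposes its own extended resource name, so a pod pins itself to a vendor simply by what it requests. A sketch of the two resource stanzas (counts and the node label are illustrative):

```python
# Each GPU vendor's device plugin advertises a distinct extended resource,
# so the scheduler places pods on matching hardware automatically.
nvidia_pod_resources = {"limits": {"nvidia.com/gpu": 1}}   # NVIDIA GPU Operator
amd_pod_resources = {"limits": {"amd.com/gpu": 1}}         # AMD GPU Operator

# CPU-only workloads simply request no GPU resource and can additionally
# use a node selector to stay off accelerator nodes (label is illustrative).
cpu_only_selector = {"node-role.kubernetes.io/worker": ""}

for name, res in [("nvidia-job", nvidia_pod_resources), ("amd-job", amd_pod_resources)]:
    vendor_resource = next(iter(res["limits"]))
    print(f"{name}: requests {vendor_resource} -> lands only on matching nodes")
```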
Understanding GPU Sharing Strategies for AI Workloads on OpenShift
When running AI inference (especially with vLLM) on NVIDIA or AMD GPUs, which covers most real-world deployments,
choosing the right sharing method, or choosing not to share at all, is critical. Here's a practical breakdown I sketched out:
NVIDIA vs. AMD partitioning side by side.
******************************NVIDIA GPU Sharing Decision Tree******************************
START ─────────────────────────────────────► Need full physical GPU?
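The decision tree's branching can be sketched as a small function (the branch order reflects the text above; the inputs are the questions you would answer for your workload, and the thresholds behind them are up to you):

```python
# Sketch of the NVIDIA sharing decision tree as a function.
def nvidia_sharing_strategy(need_full_gpu: bool,
                            needs_hard_isolation: bool,
                            bursty_dev_workload: bool) -> str:
    if need_full_gpu:
        return "dedicated GPU (no sharing)"  # e.g. large vLLM inference
    if needs_hard_isolation:
        return "MIG"           # hardware-partitioned slices, strict isolation
    if bursty_dev_workload:
        return "time-slicing"  # easy oversubscription, no isolation
    return "MPS"               # concurrent kernels for small steady jobs

print(nvidia_sharing_strategy(need_full_gpu=True,
                              needs_hard_isolation=False,
                              bursty_dev_workload=False))
```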
GPU-to-GPU Communication Between Two Nodes
(NVIDIA GPUs in Kubernetes/OpenShift)
Without GPUDirect RDMA → data path goes through CPU memory twice → high overhead
With GPUDirect RDMA → direct path from GPU memory → NIC → remote GPU memory → minimal overhead
────────────────────────────────────────────────────────────────────────────────────
WITHOUT GPUDirect RDMA
────────────────────────────────────────────────────────────────────────────────────
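The two paths can be written out as hop lists to make the copy count explicit (a toy model of the diagram; latencies are not modeled):

```python
# Count the memory copies per transfer on each path from the diagram.
without_rdma = [
    "GPU memory -> host (CPU) memory",          # first staging copy
    "host memory -> NIC",                       # second copy through the CPU
    "NIC -> remote host memory",
    "remote host memory -> remote GPU memory",
]
with_rdma = [
    "GPU memory -> NIC (DMA, bypasses host memory)",
    "NIC -> remote GPU memory",
]
print(f"without GPUDirect RDMA: {len(without_rdma)} copies")
print(f"with GPUDirect RDMA:    {len(with_rdma)} copies")
```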
+---------------------------------+
|           Internet /            |
|        External Clients         |
| 1. User types URL  (HTTPS/TLS)  |  <-- 1. Browser resolves DNS → F5 VIP
|    → HTTPS request              |      2. TLS handshake with F5 cert
+-----------------^---------------+
                  |
                  | 2. External TLS termination at F5
+---------------------------------+
|            F5 BIG-IP            |  <-- 3. Decrypt (WAF / inspection / iRules)
An AI Gateway (or External Ingress / Edge Proxy) fits into a vLLM + llm-d deployment as a production-grade front-end layer that handles external exposure, security, governance, observability, and advanced traffic control, while llm-d's built-in Inference Gateway (IGW) focuses on intra-cluster intelligent routing (e.g., prefix/KV-cache-aware scheduling, disaggregation decisions, load balancing across vLLM replicas).
This separation of concerns is common in production setups: the AI Gateway/edge proxy sits at the cluster edge (or externally) for global concerns, proxying to llm-d's IGW (or directly to vLLM if not using full distributed features).
Core Architecture Fit
llm-d + vLLM → provides the inference engine and smart internal routing (via IGW, built on the Kubernetes Gateway API Inference Extension + kgateway/Envoy).
Handles: cache-aware load balancing, disaggregated prefill/decode, multi-tier KV cache.
Exposes OpenAI-compatible endpoints (e.g., /v1/chat/completions).
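To make "cache-aware load balancing" concrete, here is a toy hash-affinity router: requests sharing a prompt prefix land on the same replica so its KV cache can be reused. The real IGW scheduler is far richer; the replica names and prefix length here are made up:

```python
import hashlib

# Toy prefix-affinity router: same prompt prefix -> same vLLM replica,
# so that replica's KV cache for the shared prefix gets reused.
REPLICAS = ["vllm-0", "vllm-1", "vllm-2"]   # hypothetical replica names

def route(prompt: str, prefix_chars: int = 32) -> str:
    prefix = prompt[:prefix_chars]          # crude stand-in for token prefix
    digest = hashlib.sha256(prefix.encode()).digest()
    return REPLICAS[digest[0] % len(REPLICAS)]

a = route("System: you are a helpful assistant. User: hi")
b = route("System: you are a helpful assistant. User: what's new?")
print(a, b)   # identical system-prompt prefix -> same replica
```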
AI Gateway / External Ingress / Edge Proxy
We're moving to vLLM for fast LLM inference, but once we hit production-scale walls (high concurrency, multi-turn chats, long contexts, multi-node, tight SLOs), what comes next? Should we switch engines to something like SGLang, or layer on an orchestration framework like llm-d (or alternatives) to keep vLLM as the core engine?
In other words, the typical customer journey you're observing is:
- Start with vLLM → great single-node throughput, easy to deploy, mature.
- Hit scaling pain points → TTFT spikes, unstable TPOT/ITL, KV cache pressure, poor multi-node efficiency, inefficient prefix reuse in chat/agent workloads.
Then evaluate the "whole stack":
- Stick with vLLM and add orchestration (llm-d, AIBrix, Dynamo, KServe, etc.) for disaggregation, smart routing, tiered KV cache, independent scaling.
- Or switch the inference engine entirely to SGLang (for better multi-turn prefix caching, structured output, RadixAttention gains in conversational/agent use cases) and pair it with OME or similar orchestration.
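Whatever stack you pick, the SLO metrics named above (TTFT and TPOT) can be measured from any streaming token iterator. A sketch with a fake generator standing in for a vLLM/SGLang stream:

```python
import time

# Measure time-to-first-token (TTFT) and time-per-output-token (TPOT)
# from any iterator that yields tokens as they are generated.
def measure(stream):
    start = time.perf_counter()
    arrivals = []
    for _ in stream:
        arrivals.append(time.perf_counter())
    ttft = arrivals[0] - start
    tpot = ((arrivals[-1] - arrivals[0]) / (len(arrivals) - 1)
            if len(arrivals) > 1 else 0.0)
    return ttft, tpot

def fake_stream(n=5, delay=0.01):
    # Stand-in for a real streaming response.
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tpot = measure(fake_stream())
print(f"TTFT={ttft*1000:.1f} ms, TPOT={tpot*1000:.1f} ms")
```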
Running GPU-accelerated workloads in Kubernetes (or OpenShift) clusters, especially on H100/H200/B200 hardware, requires smart decisions about sharing to maximize utilization, isolation, performance, and cost efficiency.
From my hands-on tests on 1- and 2-node H100 setups with vLLM, large/medium LLM inference often saturates VRAM (~70–80 GB used), making concurrency methods like MIG, MPS, or time-slicing less effective or even harmful due to interference and context-switching overhead. Dedicated full GPUs win for production stability and throughput in those cases.
However, non-LLM and small ML workloads (e.g., classical ML, computer vision inference/training, embedding models, ResNets/CNNs, scikit-learn acceleration, batch processing) behave differently:
- Lower peak VRAM usage (rarely full saturation).
- Burstier or shorter kernels.
- More tolerance for sharing.
These benefit significantly from time-slicing (easy oversubscription for bursty/dev jobs) and MPS (better concurrent kernel overlap, often 2x+ density vs.
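The capacity arithmetic behind such density claims can be sketched quickly (all figures below are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope density estimate for sharing one GPU among small jobs.
GPU_VRAM_GB = 80   # e.g. one H100 (assumed)
RESERVE_GB = 8     # headroom for runtime overhead/fragmentation (assumed)
JOB_VRAM_GB = 6    # one small CV/embedding job (assumed)

max_jobs = (GPU_VRAM_GB - RESERVE_GB) // JOB_VRAM_GB
print(f"~{max_jobs} concurrent {JOB_VRAM_GB} GB jobs fit on one {GPU_VRAM_GB} GB GPU")
```

Contrast this with the vLLM case above: one job alone consumes 70–80 GB, so there is simply nothing left to share.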