OPENSHIFT GPU CLUSTER — DEEP DIVE
OpenShift Container Platform 4.21 · NVIDIA H200 · AMD MI350X · Multi-Node GenAI
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ GenAI Workload: Meta Llama 3.1 405B ]
~900 GB VRAM in BF16 · all-reduce of 810 GB every training step
A single H100 node (8 × 80 GB = 640 GB) can't fit it → multi-node + RDMA required
        │
        ▼
┌──────────────────────────────────────────┐
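The sizing claims above can be sketched as back-of-the-envelope arithmetic (the 405B parameter count and BF16 width come from the model name and the text; the 8 × 80 GB per-node figure is the standard H100 node layout):

```python
# Rough sizing for why Llama 3.1 405B in BF16 forces multi-node training.
PARAMS = 405e9        # parameter count implied by the model name
BYTES_BF16 = 2        # BF16 = 2 bytes per parameter

weights_gb = PARAMS * BYTES_BF16 / 1e9   # ~810 GB of raw weights
# The gradient all-reduce each step moves one BF16-sized copy of the
# weights, which is where the "810 GB every training step" figure comes from.
allreduce_gb = weights_gb

# A single 8x H100-80GB node offers 640 GB of VRAM in total.
node_vram_gb = 8 * 80

nodes_needed = -(-weights_gb // node_vram_gb)   # ceiling division
print(f"weights: {weights_gb:.0f} GB, per-step all-reduce: {allreduce_gb:.0f} GB")
print(f"8x H100 node: {node_vram_gb} GB -> at least {int(nodes_needed)} nodes for weights alone")
```

Note this counts only the weights; activations and optimizer state push the real footprint higher, hence the ~900 GB figure and the need for RDMA between nodes.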
AI INFERENCE INGRESS FLOW — AGENTIC MODE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ON-PREMISE AGENTIC AI — MCP TOOL CALL FLOW
Red Hat OpenShift AI 3.4 · MaaS · vLLM · llm-d · MCP
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  AI Agent   │   │  Dev / IDE  │   │  Pipeline   │
│ LangChain / │   │  VS Code /  │   │ CI/CD / RAG │
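At the wire level, the MCP tool calls flowing through this diagram are JSON-RPC 2.0 messages. A minimal sketch of what a client sends (the tool name and arguments below are hypothetical examples, not part of any real server):

```python
import json

# Minimal sketch of the JSON-RPC 2.0 message an MCP client sends to
# invoke a tool. Tool name and arguments are made-up illustrations.
tool_call = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_docs",                 # hypothetical tool
        "arguments": {"query": "GPU quota"},   # hypothetical arguments
    },
}
wire = json.dumps(tool_call)
print(wire)
```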
AI INFERENCE INGRESS FLOW (Red Hat OpenShift AI Self-Managed 3.4 + Models-as-a-Service)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ON-PREMISE AI INFERENCE INGRESS
Red Hat OpenShift AI 3.4 · MaaS · vLLM · llm-d
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ App / RAG Pipeline / Chat UI ]
POST /llm/gpt-oss-120b/v1/chat/completions
        │
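The POST shown above is a standard OpenAI-compatible chat completion request. A sketch of its shape (the base URL and API key are placeholders; only the route and model name come from the diagram):

```python
import json

# Request shape behind the POST in the diagram above.
BASE_URL = "https://maas.example.com"             # placeholder MaaS host
path = "/llm/gpt-oss-120b/v1/chat/completions"    # route from the diagram
payload = {
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Summarize our GPU quota policy."}],
    "max_tokens": 256,
    "stream": True,   # streaming keeps time-to-first-token low for chat UIs
}
headers = {
    "Authorization": "Bearer <MAAS_API_KEY>",     # placeholder credential
    "Content-Type": "application/json",
}
body = json.dumps(payload)
print("POST", BASE_URL + path)
```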
*********************** OpenShift Heterogeneous GPU Cluster ***********************
Example:
NVIDIA H100/H200/B200 + AMD MI350X + CPU-Only Workers + InfiniBand
One cluster. Multiple GPU vendors. Zero drama.
Here’s exactly what we’re running in production:
┌────────────────────────────────────┐
│      OpenShift Control Plane       │
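What makes mixed vendors drama-free is that each vendor's device plugin exposes its own extended resource name, so a pod pins itself to a vendor simply by what it requests. A sketch of the two resource stanzas (counts and the node label are illustrative):

```python
# Each GPU vendor's device plugin advertises a distinct extended resource,
# so the scheduler places pods on matching hardware automatically.
nvidia_pod_resources = {"limits": {"nvidia.com/gpu": 1}}   # NVIDIA GPU Operator
amd_pod_resources = {"limits": {"amd.com/gpu": 1}}         # AMD GPU Operator

# CPU-only workloads simply request no GPU resource and can additionally
# use a node selector to stay off accelerator nodes (label is illustrative).
cpu_only_selector = {"node-role.kubernetes.io/worker": ""}

for name, res in [("nvidia-job", nvidia_pod_resources), ("amd-job", amd_pod_resources)]:
    vendor_resource = next(iter(res["limits"]))
    print(f"{name}: requests {vendor_resource} -> lands only on matching nodes")
```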
Understanding GPU Sharing Strategies for AI Workloads on OpenShift
When running AI inference (especially with vLLM) on NVIDIA or AMD GPUs, which covers most real-world deployments,
choosing the right sharing method, or choosing not to share at all, is critical. Here's a practical breakdown I sketched out:
NVIDIA vs. AMD partitioning side by side.
******************************NVIDIA GPU Sharing Decision Tree******************************
START ─────────────────────────────────────► Need full physical GPU?
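The decision tree's branching can be sketched as a small function (the branch order reflects the text above; the inputs are the questions you would answer for your workload, and the thresholds behind them are up to you):

```python
# Sketch of the NVIDIA sharing decision tree as a function.
def nvidia_sharing_strategy(need_full_gpu: bool,
                            needs_hard_isolation: bool,
                            bursty_dev_workload: bool) -> str:
    if need_full_gpu:
        return "dedicated GPU (no sharing)"  # e.g. large vLLM inference
    if needs_hard_isolation:
        return "MIG"           # hardware-partitioned slices, strict isolation
    if bursty_dev_workload:
        return "time-slicing"  # easy oversubscription, no isolation
    return "MPS"               # concurrent kernels for small steady jobs

print(nvidia_sharing_strategy(need_full_gpu=True,
                              needs_hard_isolation=False,
                              bursty_dev_workload=False))
```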
GPU-to-GPU Communication Between Two Nodes
(NVIDIA GPUs in Kubernetes/OpenShift)
Without GPUDirect RDMA → data path goes through CPU memory twice → high overhead
With GPUDirect RDMA → direct path from GPU memory → NIC → remote GPU memory → minimal overhead
────────────────────────────────────────────────────────────────────────────────────
WITHOUT GPUDirect RDMA
────────────────────────────────────────────────────────────────────────────────────
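The two paths can be written out as hop lists to make the copy count explicit (a toy model of the diagram; latencies are not modeled):

```python
# Count the memory copies per transfer on each path from the diagram.
without_rdma = [
    "GPU memory -> host (CPU) memory",          # first staging copy
    "host memory -> NIC",                       # second copy through the CPU
    "NIC -> remote host memory",
    "remote host memory -> remote GPU memory",
]
with_rdma = [
    "GPU memory -> NIC (DMA, bypasses host memory)",
    "NIC -> remote GPU memory",
]
print(f"without GPUDirect RDMA: {len(without_rdma)} copies")
print(f"with GPUDirect RDMA:    {len(with_rdma)} copies")
```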
+---------------------------------+
|           Internet /            |
|        External Clients         |
| 1. User types URL  (HTTPS/TLS)  |  <-- 1. Browser resolves DNS → F5 VIP
|    → HTTPS request              |      2. TLS handshake with F5 cert
+-----------------^---------------+
                  |
                  | 2. External TLS termination at F5
+---------------------------------+
|            F5 BIG-IP            |  <-- 3. Decrypt (WAF / inspection / iRules)
An AI Gateway (or External Ingress / Edge Proxy) fits into a vLLM + llm-d deployment as a production-grade front-end layer that handles external exposure, security, governance, observability, and advanced traffic control, while llm-d's built-in Inference Gateway (IGW) focuses on intra-cluster intelligent routing (e.g., prefix/KV-cache-aware scheduling, disaggregation decisions, load balancing across vLLM replicas).
This separation of concerns is common in production setups: the AI Gateway/edge proxy sits at the cluster edge (or externally) for global concerns, proxying to llm-d's IGW (or directly to vLLM if not using full distributed features).
Core Architecture Fit
llm-d + vLLM → provides the inference engine and smart internal routing (via IGW, built on the Kubernetes Gateway API Inference Extension + kgateway/Envoy).
Handles: cache-aware load balancing, disaggregated prefill/decode, multi-tier KV cache.
Exposes OpenAI-compatible endpoints (e.g., /v1/chat/completions).
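To make "cache-aware load balancing" concrete, here is a toy hash-affinity router: requests sharing a prompt prefix land on the same replica so its KV cache can be reused. The real IGW scheduler is far richer; the replica names and prefix length here are made up:

```python
import hashlib

# Toy prefix-affinity router: same prompt prefix -> same vLLM replica,
# so that replica's KV cache for the shared prefix gets reused.
REPLICAS = ["vllm-0", "vllm-1", "vllm-2"]   # hypothetical replica names

def route(prompt: str, prefix_chars: int = 32) -> str:
    prefix = prompt[:prefix_chars]          # crude stand-in for token prefix
    digest = hashlib.sha256(prefix.encode()).digest()
    return REPLICAS[digest[0] % len(REPLICAS)]

a = route("System: you are a helpful assistant. User: hi")
b = route("System: you are a helpful assistant. User: what's new?")
print(a, b)   # identical system-prompt prefix -> same replica
```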
AI Gateway / External Ingress / Edge Proxy
We're moving to vLLM for fast LLM inference, but once we hit production-scale walls (high concurrency, multi-turn chats, long contexts, multi-node, tight SLOs), what comes next? Should we switch engines to something like SGLang, or layer on an orchestration framework like llm-d (or alternatives) to keep vLLM as the core engine?
In other words, the typical customer journey you're observing is:
- Start with vLLM → great single-node throughput, easy to deploy, mature.
- Hit scaling pain points → TTFT spikes, unstable TPOT/ITL, KV cache pressure, poor multi-node efficiency, inefficient prefix reuse in chat/agent workloads.
Then evaluate the "whole stack":
- Stick with vLLM and add orchestration (llm-d, AIBrix, Dynamo, KServe, etc.) for disaggregation, smart routing, tiered KV cache, independent scaling.
- Or switch the inference engine entirely to SGLang (for better multi-turn prefix caching, structured output, RadixAttention gains in conversational/agent use cases) and pair it with OME or similar orchestration.
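Whatever stack you pick, the SLO metrics named above (TTFT and TPOT) can be measured from any streaming token iterator. A sketch with a fake generator standing in for a vLLM/SGLang stream:

```python
import time

# Measure time-to-first-token (TTFT) and time-per-output-token (TPOT)
# from any iterator that yields tokens as they are generated.
def measure(stream):
    start = time.perf_counter()
    arrivals = []
    for _ in stream:
        arrivals.append(time.perf_counter())
    ttft = arrivals[0] - start
    tpot = ((arrivals[-1] - arrivals[0]) / (len(arrivals) - 1)
            if len(arrivals) > 1 else 0.0)
    return ttft, tpot

def fake_stream(n=5, delay=0.01):
    # Stand-in for a real streaming response.
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tpot = measure(fake_stream())
print(f"TTFT={ttft*1000:.1f} ms, TPOT={tpot*1000:.1f} ms")
```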
Running GPU-accelerated workloads in Kubernetes (or OpenShift) clusters, especially on H100/H200/B200 hardware, requires smart decisions about sharing to maximize utilization, isolation, performance, and cost efficiency.
From my hands-on tests on 1- and 2-node H100 setups with vLLM, large/medium LLM inference often saturates VRAM (~70–80 GB used), making concurrency methods like MIG, MPS, or time-slicing less effective or even harmful due to interference and context-switching overhead. Dedicated full GPUs win for production stability and throughput in those cases.
However, non-LLM and small ML workloads (e.g., classical ML, computer vision inference/training, embedding models, ResNets/CNNs, scikit-learn acceleration, batch processing) behave differently:
- Lower peak VRAM usage (rarely full saturation).
- Burstier or shorter kernels.
- More tolerance for sharing.
These benefit significantly from time-slicing (easy oversubscription for bursty/dev jobs) and MPS (better concurrent kernel overlap, often 2x+ density vs.
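The capacity arithmetic behind such density claims can be sketched quickly (all figures below are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope density estimate for sharing one GPU among small jobs.
GPU_VRAM_GB = 80   # e.g. one H100 (assumed)
RESERVE_GB = 8     # headroom for runtime overhead/fragmentation (assumed)
JOB_VRAM_GB = 6    # one small CV/embedding job (assumed)

max_jobs = (GPU_VRAM_GB - RESERVE_GB) // JOB_VRAM_GB
print(f"~{max_jobs} concurrent {JOB_VRAM_GB} GB jobs fit on one {GPU_VRAM_GB} GB GPU")
```

Contrast this with the vLLM case above: one job alone consumes 70–80 GB, so there is simply nothing left to share.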