@dmarcus-wire
dmarcus-wire / OPENSHIFT GPU CLUSTER — DEEP DIVE
Created April 3, 2026 15:59
OPENSHIFT GPU CLUSTER — DEEP DIVE
OpenShift Container Platform 4.21 · NVIDIA H200 · AMD MI350X · Multi-Node GenAI
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ GenAI Workload: Meta Llama 3.1 405B ]
~900 GB VRAM in BF16 · all-reduce 810 GB every training step
a single 8× H100 node (640 GB HBM) can't fit it → multi-node + RDMA required
┌──────────────────────────────────────────┐
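To make those headline numbers concrete, here's the back-of-envelope arithmetic as a quick Python sketch (weights-only math; the gist's ~900 GB figure adds activation and KV-cache overhead on top):

# Llama 3.1 405B in BF16: weights-only VRAM estimate.
PARAMS = 405e9                 # parameter count
BYTES_PER_PARAM = 2            # BF16 = 2 bytes per parameter

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"weights alone: {weights_gb:.0f} GB")    # 810 GB -- also the BF16 gradient
                                                # volume all-reduced every step

node_vram_gb = 8 * 80                           # one 8x H100 node: 640 GB HBM total
print(f"fits on one node: {weights_gb <= node_vram_gb}")   # False -> multi-node + RDMA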
@dmarcus-wire
dmarcus-wire / ai-inference-ingress-flow-agentic-mode
Last active April 3, 2026 14:36
AI INFERENCE INGRESS FLOW — AGENTIC MODE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ON-PREMISE AGENTIC AI — MCP TOOL CALL FLOW
Red Hat OpenShift AI 3.4 · MaaS · vLLM · llm-d · MCP
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ AI Agent │ │ Dev / IDE │ │ Pipeline │
│ LangChain / │ │ VS Code / │ │ CI/CD / RAG │
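Where the preview cuts off, each of those clients ends up POSTing an OpenAI-compatible request whose tools are surfaced from an MCP server. A minimal sketch of one round trip (endpoint, model, and tool names are illustrative, not from the gist):

import requests

URL = "https://maas.example.com/llm/gpt-oss-120b/v1/chat/completions"  # hypothetical route

payload = {
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "What's GPU utilization on node gpu-01?"}],
    # An MCP server's tool, presented to the model as an OpenAI-style function tool
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_node_metrics",          # hypothetical MCP tool
            "description": "Fetch Prometheus metrics for a cluster node",
            "parameters": {
                "type": "object",
                "properties": {"node": {"type": "string"}},
                "required": ["node"],
            },
        },
    }],
}

resp = requests.post(URL, json=payload,
                     headers={"Authorization": "Bearer <MAAS_KEY>"}, timeout=60)
msg = resp.json()["choices"][0]["message"]
# If the model chose the tool, the agent executes it via MCP and sends the
# result back in a follow-up request; otherwise print the direct answer.
print(msg.get("tool_calls") or msg.get("content"))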
@dmarcus-wire
dmarcus-wire / ai-inference-ingress-flow
Last active April 3, 2026 14:33
AI INFERENCE INGRESS FLOW (Red Hat OpenShift AI Self-Managed 3.4 + Models-as-a-Service)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
ON-PREMISE AI INFERENCE INGRESS
Red Hat OpenShift AI 3.4 · MaaS · vLLM · llm-d
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
[ App / RAG Pipeline / Chat UI ]
POST /llm/gpt-oss-120b/v1/chat/completions
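A minimal client for that route (only the path above comes from the gist; host and key are placeholders):

import requests

url = "https://<ingress-host>/llm/gpt-oss-120b/v1/chat/completions"
body = {
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Summarize our GPU sharing options."}],
    "max_tokens": 256,
}
r = requests.post(url, json=body,
                  headers={"Authorization": "Bearer <MAAS_KEY>"}, timeout=60)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])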
@dmarcus-wire
dmarcus-wire / ocp-hetero-gpu-cluster
Last active March 28, 2026 22:41
OpenShift Heterogeneous GPU Cluster
*********************** OpenShift Heterogeneous GPU Cluster ***********************
Example:
NVIDIA H100/H200/B200 + AMD MI350X + CPU-Only Workers + InfiniBand
One cluster. Multiple GPU vendors. Zero drama.
Here’s exactly what we’re running in production:
┌────────────────────────────────────┐
│ OpenShift Control Plane │
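The mechanism that keeps mixed vendors drama-free: each vendor's device plugin advertises its own extended resource name, so placement is just a resource request. A sketch of the two pod bodies as Python dicts (images are placeholders):

import json

nvidia_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "vllm-nvidia"},
    "spec": {"containers": [{
        "name": "vllm",
        "image": "vllm/vllm-openai:latest",                # placeholder image
        "resources": {"limits": {"nvidia.com/gpu": 1}},    # NVIDIA device plugin resource
    }]},
}

amd_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "vllm-amd"},
    "spec": {"containers": [{
        "name": "vllm",
        "image": "rocm/vllm:latest",                       # placeholder ROCm image
        "resources": {"limits": {"amd.com/gpu": 1}},       # AMD device plugin resource
    }]},
}

print(json.dumps(nvidia_pod, indent=2))   # kubectl apply -f accepts JSON manifests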
@dmarcus-wire
dmarcus-wire / nvidia-amd-gpu-sharing-methods
Last active March 28, 2026 23:02
NVIDIA and AMD GPU sharing methods
Understanding GPU Sharing Strategies for AI Workloads on OpenShift
When running AI inference (especially with vLLM) on NVIDIA or AMD GPUs, which covers most real-world deployments,
choosing the right sharing method (or choosing not to share at all) is critical. Here's a practical breakdown I sketched out:
NVIDIA vs AMD partitioning, side by side.
******************************NVIDIA GPU Sharing Decision Tree******************************
START ─────────────────────────────────────► Need full physical GPU?
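When the tree lands on "partition it," the usual NVIDIA endpoint is a MIG slice: once the GPU Operator advertises per-profile resources, a pod requests one like any other resource (profile and image below are examples, not from the gist):

mig_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "small-inference"},
    "spec": {"containers": [{
        "name": "app",
        "image": "nvcr.io/nvidia/pytorch:24.05-py3",            # placeholder image
        # 3g.40gb = roughly half of an 80 GB H100, hardware-isolated
        "resources": {"limits": {"nvidia.com/mig-3g.40gb": 1}},
    }]},
}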
@dmarcus-wire
dmarcus-wire / rdma
Last active March 27, 2026 11:53
With and without GPUDirect RDMA
GPU-to-GPU Communication Between Two Nodes
(NVIDIA GPUs in Kubernetes/OpenShift)
Without GPUDirect RDMA → Data path goes through CPU memory twice → high overhead
With GPUDirect RDMA → Direct path from GPU memory → NIC → remote GPU memory → minimal overhead
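At the application layer the difference is invisible: NCCL takes the GPUDirect RDMA path on its own when nvidia-peermem and an RDMA-capable NIC are present. A minimal two-node all-reduce sketch (env var values are illustrative; rank/address come from the launcher, e.g. torchrun):

import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_NET_GDR_LEVEL", "SYS")   # allow GPU<->NIC peer DMA broadly
os.environ.setdefault("NCCL_DEBUG", "INFO")          # logs whether GDRDMA was selected

dist.init_process_group(backend="nccl")              # rank/world size from the launcher
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

tensor = torch.ones(1024, device="cuda")
dist.all_reduce(tensor)            # GPU mem -> NIC -> remote GPU mem when GDRDMA is on
print(f"rank {dist.get_rank()}: sum = {tensor[0].item()}")
dist.destroy_process_group()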
────────────────────────────────────────────────────────────────────────────────────
WITHOUT GPUDirect RDMA
────────────────────────────────────────────────────────────────────────────────────
                    +---------------------------------+
1. User types URL   |           Internet /            |
   → HTTPS request  |        External Clients         |  <-- 1. Browser resolves DNS → F5 VIP
                    |           (HTTPS/TLS)           |      2. TLS handshake with F5 cert
                    +-----------------^---------------+
                                      |
                                      | 2. External TLS termination at F5
                    +---------------------------------+
                    |            F5 BIG-IP            |  <-- 3. Decrypt (WAF / inspection / iRules)
@dmarcus-wire
dmarcus-wire / ai-gateway-vllm-llmd-integration
Created March 11, 2026 17:13
Overview of how an external AI Gateway / Edge Proxy complements llm-d's Inference Gateway (IGW) and vLLM in production Kubernetes/OpenShift deployments — including architecture fit, patterns, benefits, popular options (Kong, Istio, Apigee, etc.), and a visual ASCII flow.
An AI Gateway (or External Ingress / Edge Proxy) fits into a vLLM + llm-d deployment as a production-grade front-end layer that handles external exposure, security, governance, observability, and advanced traffic control—while llm-d's built-in Inference Gateway (IGW) focuses on intra-cluster intelligent routing (e.g., prefix/KV-cache aware scheduling, disaggregation decisions, load balancing across vLLM replicas).
This separation of concerns is common in production setups: the AI Gateway/edge proxy sits at the cluster edge (or externally) for global concerns, proxying to llm-d's IGW (or directly to vLLM if not using full distributed features).
Core Architecture Fit
llm-d + vLLM → Provides the inference engine and smart internal routing (via IGW, built on Kubernetes Gateway API Inference Extension + kgateway/Envoy).
Handles: Cache-aware load balancing, disaggregated prefill/decode, multi-tier KV cache.
Exposes OpenAI-compatible endpoints (e.g., /v1/chat/completions).
AI Gateway / External Ingress / Edge Pro
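The preview truncates, but the edge layer it describes can be sketched as a thin proxy: authentication and rate limiting at the edge, everything model-aware delegated inward to the IGW. A hypothetical FastAPI sketch (service DNS name and keys are invented for illustration):

import httpx
from fastapi import FastAPI, HTTPException, Request

IGW_URL = "http://llm-d-inference-gateway.llm.svc.cluster.local"   # assumed Service DNS
VALID_KEYS = {"team-a-key", "team-b-key"}                          # stand-in for real authn

app = FastAPI()

@app.post("/v1/chat/completions")
async def proxy(request: Request):
    key = request.headers.get("authorization", "").removeprefix("Bearer ")
    if key not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")
    async with httpx.AsyncClient(timeout=120) as client:
        upstream = await client.post(f"{IGW_URL}/v1/chat/completions",
                                     content=await request.body(),
                                     headers={"content-type": "application/json"})
    # By this point the IGW (kgateway/Envoy) has done the cache-aware routing
    return upstream.json()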
@dmarcus-wire
dmarcus-wire / gist:c5db1502250064a1807784618c1c431b
Last active March 9, 2026 14:20
vLLM vs SGLang: Competitive & Complementary LLM Inference Engines + llm-d Alternatives Feature Comparison
We're moving to vLLM for fast LLM inference, but once we hit production-scale walls (high concurrency, multi-turn chats, long contexts, multi-node, tight SLOs), what comes next — and should we switch engines to something like SGLang, or layer on an orchestration framework like llm-d (or alternatives) to keep vLLM as the core engine?
In other words, the typical customer journey you're observing is:
- Start with vLLM → great single-node throughput, easy to deploy, mature.
- Hit scaling pain points → TTFT spikes, unstable TPOT/ITL, KV cache pressure, poor multi-node efficiency, inefficient prefix reuse in chat/agent workloads.
Then evaluate the "whole stack":
- Stick with vLLM and add orchestration (llm-d, AIBrix, Dynamo, KServe, etc.) for disaggregation, smart routing, tiered KV, independent scaling.
- Or switch the inference engine entirely to SGLang (for better multi-turn prefix caching, structured output, RadixAttention gains in conversational/agent use cases) and pair it with OME or similar orchestration.
@dmarcus-wire
dmarcus-wire / k8s-dra-gpu-concurrency-guide
Created March 9, 2026 03:52
What DRA Brings to the Table: Optimizing NVIDIA GPU Sharing in Kubernetes/OpenShift: MIG, MPS, Time-Slicing, and DRA for LLM + Non-LLM Workloads (2026 Guide)
Running GPU-accelerated workloads in Kubernetes (or OpenShift) clusters—especially on H100/H200/B200 hardware—requires smart decisions about sharing to maximize utilization, isolation, performance, and cost efficiency.
From my hands-on tests on 1- and 2-node H100 setups with vLLM, large/medium LLM inference often saturates VRAM (~70–80 GB used), making concurrency methods like MIG, MPS, or time-slicing less effective or even harmful due to interference and context-switching overhead. Dedicated full GPUs win for production stability and throughput in those cases.
However, non-LLM and small ML workloads (e.g., classical ML, computer vision inference/training, embedding models, ResNets/CNNs, scikit-learn acceleration, batch processing) behave differently:
Lower peak VRAM usage (rarely full saturation).
Burstier or shorter kernels.
More tolerance for sharing.
These benefit significantly from time-slicing (easy oversubscription for bursty/dev jobs) and MPS (better concurrent kernel overlap, often 2x+ density vs.
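On the DRA side of the title: instead of counting opaque nvidia.com/gpu units, a pod references a ResourceClaim that the vendor's DRA driver satisfies, which is what makes richer sharing policies expressible. A rough sketch using the resource.k8s.io v1beta1 shapes and NVIDIA's example DeviceClass name -- verify both against your cluster before relying on them:

import json

claim_template = {
    "apiVersion": "resource.k8s.io/v1beta1",   # API group/version may differ per cluster
    "kind": "ResourceClaimTemplate",
    "metadata": {"name": "one-gpu"},
    "spec": {"spec": {"devices": {"requests": [
        {"name": "gpu", "deviceClassName": "gpu.nvidia.com"}   # assumed DeviceClass
    ]}}},
}
print(json.dumps(claim_template, indent=2))    # kubectl apply -f also accepts JSON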