
Your Lab

A real 4×H100 OpenShift cluster running live vLLM models — the course's measuring instrument. Every lesson can pull real numbers from it.

Why this matters

The calculator gives you the theory; this cluster gives you the truth. When Lesson 9 says "prefill ≫ decode throughput," you can watch it: 3,288 tok/s prefill vs ~108 tok/s decode on your own H100s. When Lesson 10 computes a KV cache, you can read the real usage % off vLLM. Refresh anytime with bash learning/tools/cluster-probe.sh.
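To make that asymmetry concrete, here is a minimal arithmetic sketch using the two throughput figures quoted above. The 4,000-token prompt and 200-token answer are hypothetical request sizes chosen only for illustration, and the estimate deliberately ignores batching, queueing and scheduling overhead.

```python
# Throughput figures quoted in the text (Qwen3.5-27B, single request).
PREFILL_TOK_PER_S = 3288.0
DECODE_TOK_PER_S = 108.0


def asymmetry(prefill_tps: float, decode_tps: float) -> float:
    """Return how many times faster prompt tokens are processed than output tokens."""
    return prefill_tps / decode_tps


def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float = PREFILL_TOK_PER_S,
                    decode_tps: float = DECODE_TOK_PER_S) -> tuple[float, float]:
    """Rough (prefill_s, decode_s) for one request; ignores batching and queueing."""
    return prompt_tokens / prefill_tps, output_tokens / decode_tps


ratio = asymmetry(PREFILL_TOK_PER_S, DECODE_TOK_PER_S)

# Hypothetical request shape, for illustration only.
prefill_s, decode_s = request_seconds(prompt_tokens=4000, output_tokens=200)

print(f"prefill is ~{ratio:.1f}x faster per token")
print(f"4000-token prompt: {prefill_s:.2f} s, 200-token answer: {decode_s:.2f} s")
```

Under these assumptions the 200 generated tokens take longer than the 4,000 prompt tokens, which is the practical meaning of "prefill ≫ decode throughput."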

⚠ Health: one NVLink pair is down (action needed)

Verified 2026-06-19: only GPU2↔GPU3 has working NVLink (12 links). GPU0↔GPU1 reports zero links. A 4×H100 NVL box should have two bridged pairs, but this one currently has one, and the tensor-parallel model is sitting on the broken pair. Full diagnosis, severity ranking, and a rebalance proposal → Cluster Findings & Tuning Proposal.

The hardware

GPUs: 4 × NVIDIA H100 NVL, 94 GiB HBM3 each (95,830 MiB)
Logical GPUs: 20; each physical GPU time-sliced ×5 (MIG disabled)
Node: 128 vCPU · ~2 TiB RAM · single worker (control-plane + worker)
Power/thermal: 400 W cap per GPU; idle ~60–110 W, 38–64 °C when not generating
[Diagram: GPU topology across NUMA 0 and NUMA 1. GPU 0, GPU 1, GPU 2 and GPU 3 are each 94 GiB · ×5; memory shown, in that order, as used 93, used 93, used 88 and idle 0. Legend: PCIe / NUMA hop (slow); NVLink ×12 (fast). Caption: Tensor-parallel models want the NVLinked pair (GPU2–GPU3); the all-reduce on every layer rides this link.]
Interconnect is not uniform: GPU2↔GPU3 share 12 NVLink connections; GPU0↔GPU1 have only PCIe. Note: GPU0↔GPU1 should also be an NVLink pair, but that bridge is currently down (see Findings). A tensor-parallel job on the wrong pair pays a latency tax on every layer.
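One way to check which pairs are actually bridged is to parse the matrix that `nvidia-smi topo -m` prints, where an `NV<n>` cell means n NVLinks between two GPUs. The sketch below is a minimal parser under stated assumptions: `SAMPLE_TOPO` is a hand-written illustration shaped to match the state described above, not captured output from this cluster, and the `PHB`, `SYS` and affinity values in it are placeholders.

```python
import re

GPU_NAME = re.compile(r"^GPU\d+$")
NVLINK = re.compile(r"^NV(\d+)$")

# Hand-written illustration only (NOT real output from this cluster).
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
GPU0     X      PHB     SYS     SYS     0-63            0
GPU1    PHB      X      SYS     SYS     0-63            0
GPU2    SYS     SYS      X      NV12    64-127          1
GPU3    SYS     SYS     NV12     X      64-127          1
"""

# The text says a 4xH100 NVL box should have these two bridged pairs.
EXPECTED_PAIRS = [("GPU0", "GPU1"), ("GPU2", "GPU3")]


def nvlink_pairs(topo_text: str) -> dict[tuple[str, str], int]:
    """Map each NVLinked GPU pair to its link count, from topo-matrix text."""
    rows = [line.split() for line in topo_text.splitlines() if line.strip()]
    # The header is the first line that starts with a GPU name.
    header_idx = next(i for i, tokens in enumerate(rows) if GPU_NAME.match(tokens[0]))
    gpus = []
    for token in rows[header_idx]:
        if not GPU_NAME.match(token):
            break  # reached the "CPU Affinity" columns
        gpus.append(token)

    pairs = {}
    for tokens in rows[header_idx + 1:]:
        if tokens[0] not in gpus:
            continue  # legend or other non-matrix line
        for other, cell in zip(gpus, tokens[1:1 + len(gpus)]):
            match = NVLINK.match(cell)
            if match and tokens[0] < other:  # record each pair once
                pairs[(tokens[0], other)] = int(match.group(1))
    return pairs


found = nvlink_pairs(SAMPLE_TOPO)
missing = [pair for pair in EXPECTED_PAIRS if pair not in found]
print("NVLinked pairs:", found)
print("Expected but missing:", missing)
```

To use it on the real node you would feed it the text of `nvidia-smi topo -m` instead of the sample; any pair listed as missing is a candidate for the bridge problem described in Findings.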
Time-slicing caveat: ×5 slicing means 5 pods can land on one physical GPU, but they share its 94 GiB with no isolation — the "20 GPUs" are scheduling slots, not 20 private memories. Overcommit and you OOM.
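The overcommit risk is simple addition, sketched below. The 94 GiB figure and the 0.75 utilization come from this page; the three 6 GiB "small pods" are hypothetical footprints chosen for illustration, not measurements of the embedding, reranker or NER services.

```python
GPU_MEMORY_GIB = 94.0  # per physical H100 NVL, from the hardware table


def vllm_claim_gib(gpu_memory_utilization: float,
                   gpu_gib: float = GPU_MEMORY_GIB) -> float:
    """Memory one vLLM instance claims: a fraction of the whole physical GPU."""
    return gpu_memory_utilization * gpu_gib


def fits(claims_gib: list[float], gpu_gib: float = GPU_MEMORY_GIB) -> bool:
    """True if all pods time-sliced onto one physical GPU fit in its memory."""
    return sum(claims_gib) <= gpu_gib


big = vllm_claim_gib(0.75)        # one vLLM pod at util 0.75
small = [6.0, 6.0, 6.0]           # hypothetical small pods (assumed sizes)

print(f"vLLM at 0.75 claims {big:.1f} GiB, leaving {GPU_MEMORY_GIB - big:.1f} GiB")
print("vLLM + three small pods fits:", fits([big, *small]))
print("two vLLM pods at 0.75 fit:   ", fits([big, big]))
```

The scheduler will happily place both vLLM pods on the same physical GPU because each only asks for one of the five slots; nothing in time-slicing checks the memory sum for you.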

The models running now

Namespace | Model | GPUs | Notable serving flags
llm-serving-qwen36 | Qwen3.6-27B-FP8 (vLLM) | TP 1 | util 0.75 · 128K ctx · max-seqs 8 · kv fp8 · chunked prefill · prefix cache
qwen35-27b-fixed | Qwen3.5-27B-FP8 (vLLM) | TP 2 | util 0.92 · 128K ctx · max-seqs 64 · kv fp8 · chunked prefill
llm-serving-embedding | BGE-M3 embedding | 1 | retrieval encoder
llm-serving-reranker | BGE-M3 reranker | 1 | retrieval reranker
bert-ner | BERT NER | 1 | entity extraction
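The shorthand in the flags column maps onto vLLM command-line arguments. As a rough sketch, the helper below (a hypothetical function, not part of vLLM or of this cluster's manifests) expands the two vLLM rows into `vllm serve` argument lists. `<model-path>` is a placeholder because the table does not give the actual model locations, and the real deployments may pass further flags that are not listed here.

```python
MODEL_PLACEHOLDER = "<model-path>"  # assumption: the real model path is not shown in the table


def vllm_serve_args(model: str, *, tensor_parallel: int,
                    gpu_memory_utilization: float, max_model_len: int,
                    max_num_seqs: int, kv_cache_dtype: str,
                    chunked_prefill: bool = False,
                    prefix_caching: bool = False) -> list[str]:
    """Expand the table's shorthand into a `vllm serve` argument list."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--kv-cache-dtype", kv_cache_dtype,
    ]
    if chunked_prefill:
        args.append("--enable-chunked-prefill")
    if prefix_caching:
        args.append("--enable-prefix-caching")
    return args


# llm-serving-qwen36: TP 1, util 0.75, 128K ctx, max-seqs 8, kv fp8,
# chunked prefill, prefix cache.
qwen36 = vllm_serve_args(MODEL_PLACEHOLDER, tensor_parallel=1,
                         gpu_memory_utilization=0.75, max_model_len=131072,
                         max_num_seqs=8, kv_cache_dtype="fp8",
                         chunked_prefill=True, prefix_caching=True)

# qwen35-27b-fixed: TP 2, util 0.92, 128K ctx, max-seqs 64, kv fp8,
# chunked prefill (no prefix-cache flag listed in the table).
qwen35 = vllm_serve_args(MODEL_PLACEHOLDER, tensor_parallel=2,
                         gpu_memory_utilization=0.92, max_model_len=131072,
                         max_num_seqs=64, kv_cache_dtype="fp8",
                         chunked_prefill=True)

print(" ".join(qwen36))
print(" ".join(qwen35))
```

A flag that is absent from a row only means it was not set explicitly; whatever default the installed vLLM version applies still holds.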

Also present but not on GPU right now: llama, gemma4, gpt-oss, qwen3-vl, classifier, sentiment, bm25 namespaces (scaled to 0).

The Rosetta Stone — every flag is a lesson

This is the bridge from your ops layer to the internals. Each serving flag you set is one of the levers these lessons explain:

vLLM flag (on your cluster) | What it controls, and the lesson
--kv-cache-dtype fp8 | 1 byte/KV element → halves the cache vs FP16 (Lesson 10)
GQA: 4 KV heads (arch) | fewer kv_heads → 6× smaller cache for Qwen3 (Lesson 10)
--gpu-memory-utilization | fraction of 94 GiB vLLM claims for weights + KV (Lesson 10)
--max-model-len 131072 | per-request context ceiling; KV grows linearly with it (Lesson 10)
--max-num-seqs | max concurrent sequences = the batch cap (Lesson 10 / 3)
--max-num-batched-tokens | token budget per engine step (prefill+decode) (Lesson 9 / 3)
--enable-chunked-prefill | split long prefills, interleave with decode (Lesson 12, SARATHI)
--enable-prefix-caching | reuse KV for shared prompt prefixes (Lesson 12)
--tensor-parallel-size 2 | split one model across 2 GPUs (wants NVLink) (future TP lesson)
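The first few rows of the table all feed one formula, sketched here as the standard dense-attention estimate: KV bytes per token = 2 (K and V) × layers × KV heads × head dimension × bytes per element. Only the 4 KV heads, the 1-byte FP8 element and the 131072-token ceiling come from this page. `NUM_LAYERS`, `HEAD_DIM` and `NUM_QUERY_HEADS` are hypothetical placeholder values (24 query heads is chosen so the ratio reproduces the 6× in the table), and the estimate assumes every layer keeps a K and V cache, which may not hold for the models running here.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_elem: int) -> int:
    """KV cache bytes for one token: K and V vectors in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical placeholder architecture (NOT the real Qwen configuration).
NUM_LAYERS = 64
HEAD_DIM = 128
NUM_QUERY_HEADS = 24

# From this page.
NUM_KV_HEADS = 4          # GQA: 4 KV heads
MAX_MODEL_LEN = 131072    # --max-model-len

mha_fp16 = kv_bytes_per_token(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, 2)
gqa_fp16 = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)
gqa_fp8 = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 1)

# KV grows linearly with context length.
full_context_gib = gqa_fp8 * MAX_MODEL_LEN / 2**30

print(f"GQA saves  {mha_fp16 / gqa_fp16:.0f}x")
print(f"FP8 saves  {gqa_fp16 / gqa_fp8:.0f}x")
print(f"per token: {gqa_fp8 / 1024:.0f} KiB; one full-length request: {full_context_gib:.0f} GiB")
```

Swap in the real layer count and head dimension from the model's config to get numbers you can compare against the usage % vLLM reports.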

Live snapshot — measured 2026-06-19

Prefill vs decode: Qwen3.5-27B prompt ~3,288 tok/s vs generation ~108 tok/s (single request); the Lesson 9 asymmetry, live
Workload shape: prompt:generation ratio 19:1 (qwen36) and 47:1 (qwen35); heavily prefill-bound RAG
KV cache usage: 0.2–0.5% under light load (short RAG contexts vs the 128K ceiling)
Refresh: bash learning/tools/cluster-probe.sh
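Numbers like these can be derived from the Prometheus-format text that a vLLM server exposes on its `/metrics` endpoint. The sketch below is one possible way to do that and is not a description of what cluster-probe.sh does. The metric names are assumptions that vary by vLLM version (the cache gauge in particular has been renamed across releases), and `SAMPLE_METRICS` is hand-written to mirror the qwen35 snapshot rather than captured from the cluster.

```python
# Hand-written sample (NOT captured output); metric names are assumptions
# and differ between vLLM versions.
SAMPLE_METRICS = """\
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="example"} 47000.0
vllm:generation_tokens_total{model_name="example"} 1000.0
vllm:gpu_cache_usage_perc{model_name="example"} 0.004
"""


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text lines into {metric_name: value}.

    Values are summed across label sets, which is only meaningful for
    counters; the sample has one series per metric. Assumes no trailing
    timestamp on sample lines.
    """
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = values.get(name, 0.0) + float(value)
    return values


metrics = parse_metrics(SAMPLE_METRICS)

# Workload shape: prompt tokens per generated token.
prompt_to_gen = (metrics["vllm:prompt_tokens_total"]
                 / metrics["vllm:generation_tokens_total"])

# The gauge is a 0-1 fraction, so scale it to a percentage.
kv_usage_pct = metrics["vllm:gpu_cache_usage_perc"] * 100

print(f"prompt:generation = {prompt_to_gen:.0f}:1")
print(f"KV cache usage    = {kv_usage_pct:.1f}%")
```

Because the token counters are cumulative, a ratio computed this way describes the whole lifetime of the pod; difference two scrapes if you want the shape of a specific time window.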