A real 4×H100 OpenShift cluster running live vLLM models — the course's measuring instrument. Every lesson can pull real numbers from it.
The calculator gives you the theory; this cluster gives
you the truth. When Lesson 9 says "prefill ≫ decode throughput," you can
watch it: 3,288 tok/s prefill vs ~108 tok/s decode on your
own H100s. When Lesson 10 computes a KV cache, you can read the real usage % off
vLLM. Refresh anytime with bash learning/tools/cluster-probe.sh.
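To make the asymmetry concrete, here is a minimal sketch of the arithmetic, using only the two single-request rates quoted above. The function names and the example request shape are illustrative assumptions, and the model deliberately ignores queueing, batching, and scheduling overheads.

```python
# Single-request rates quoted above for this cluster (tokens per second).
PREFILL_TOK_S = 3288.0
DECODE_TOK_S = 108.0


def throughput_ratio(prefill_tok_s: float = PREFILL_TOK_S,
                     decode_tok_s: float = DECODE_TOK_S) -> float:
    """How many times faster prompt tokens are processed than output tokens."""
    return prefill_tok_s / decode_tok_s


def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tok_s: float = PREFILL_TOK_S,
                       decode_tok_s: float = DECODE_TOK_S) -> tuple[float, float]:
    """Rough (prefill_seconds, decode_seconds) for one isolated request."""
    return prompt_tokens / prefill_tok_s, output_tokens / decode_tok_s


if __name__ == "__main__":
    print(f"prefill is ~{throughput_ratio():.1f}x faster per token than decode")
    # Hypothetical request shape, chosen only for illustration.
    prefill_s, decode_s = estimate_latency_s(prompt_tokens=2000, output_tokens=100)
    print(f"prefill ~{prefill_s:.2f} s, decode ~{decode_s:.2f} s")
```

With the quoted rates the per-token ratio comes out at roughly 30×. Treat the latency split as a back-of-envelope estimate and compare it with what the probe script reports.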
Verified 2026-06-19: only GPU2↔GPU3 has working NVLink (12 links); GPU0↔GPU1 reports zero links. A 4×H100 NVL box should have two bridged pairs, but you currently have one, and the tensor-parallel model is sitting on the broken pair. Full diagnosis, severity ranking, and a rebalance proposal → Cluster Findings & Tuning Proposal.
| Component | Details |
|---|---|
| GPUs | 4 × NVIDIA H100 NVL, 94 GiB HBM3 each (95,830 MiB) |
| Logical GPUs | 20 — each physical GPU time-sliced ×5 (MIG disabled) |
| Node | 128 vCPU · ~2 TiB RAM · single worker (control-plane + worker) |
| Power/thermal | 400 W cap/GPU; idle ~60–110 W, 38–64 °C when not generating |
| Namespace | Model | GPUs | Notable serving flags |
|---|---|---|---|
| llm-serving-qwen36 | Qwen3.6-27B-FP8 (vLLM) | TP 1 | util 0.75 · 128K ctx · max-seqs 8 · kv fp8 · chunked prefill · prefix cache |
| qwen35-27b-fixed | Qwen3.5-27B-FP8 (vLLM) | TP 2 | util 0.92 · 128K ctx · max-seqs 64 · kv fp8 · chunked prefill |
| llm-serving-embedding | BGE-M3 embedding | 1 | retrieval encoder |
| llm-serving-reranker | BGE-M3 reranker | 1 | retrieval reranker |
| bert-ner | BERT NER | 1 | entity extraction |
Also present but not on GPU right now:
llama, gemma4, gpt-oss, qwen3-vl,
classifier, sentiment, bm25 namespaces (scaled to 0).
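The "notable serving flags" column maps onto vLLM command-line options. As a hedged sketch, the helper below assembles a `vllm serve` argument list from the values shown for the qwen35-27b-fixed row. The helper name and the model path are hypothetical, and the actual deployment may pass its arguments differently (for example through a serving-runtime manifest), so read this as an illustration of the mapping only.

```python
import shlex


def build_vllm_serve_args(model: str, *, tensor_parallel_size: int,
                          gpu_memory_utilization: float, max_model_len: int,
                          max_num_seqs: int, kv_cache_dtype: str,
                          chunked_prefill: bool,
                          prefix_caching: bool) -> list[str]:
    """Translate table-style settings into a `vllm serve` argument list."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--kv-cache-dtype", kv_cache_dtype,
    ]
    # Boolean features are plain switches with no value.
    if chunked_prefill:
        args.append("--enable-chunked-prefill")
    if prefix_caching:
        args.append("--enable-prefix-caching")
    return args


if __name__ == "__main__":
    # Values from the qwen35-27b-fixed row; the model path is a placeholder.
    args = build_vllm_serve_args(
        "path/to/Qwen3.5-27B-FP8",
        tensor_parallel_size=2,
        gpu_memory_utilization=0.92,
        max_model_len=131072,  # "128K ctx"
        max_num_seqs=64,
        kv_cache_dtype="fp8",
        chunked_prefill=True,
        prefix_caching=False,
    )
    print(shlex.join(args))
```

Building the list in code rather than as one long string keeps each flag paired with its value, which makes it easier to compare two deployments side by side.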
This is the bridge from your ops layer to the internals. Each serving flag you set is one of the levers these lessons explain:
| vLLM flag (on your cluster) | What it controls — and the lesson |
|---|---|
| --kv-cache-dtype fp8 | 1 byte/KV element → halves the cache vs FP16 — Lesson 10 |
| GQA: 4 KV heads (arch) | fewer kv_heads → 6× smaller cache for Qwen3 — Lesson 10 |
| --gpu-memory-utilization | fraction of 94 GiB vLLM claims for weights + KV — Lesson 10 |
| --max-model-len 131072 | per-request context ceiling; KV grows linearly with it — Lesson 10 |
| --max-num-seqs | max concurrent sequences = the batch cap — Lesson 10 / 3 |
| --max-num-batched-tokens | token budget per engine step (prefill+decode) — Lesson 9 / 3 |
| --enable-chunked-prefill | split long prefills, interleave with decode — Lesson 12 (SARATHI) |
| --enable-prefix-caching | reuse KV for shared prompt prefixes — Lesson 12 |
| --tensor-parallel-size 2 | split one model across 2 GPUs (wants NVLink) — future TP lesson |
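Several of these levers combine into one piece of arithmetic. The sketch below assumes the textbook KV-cache formula (one key and one value vector per layer per KV head) and that every layer is a standard attention layer, which may not hold for hybrid architectures. The 4 KV heads and the 1-byte FP8 element come from the table; the layer count and head dimension are placeholder values, not the real Qwen configuration.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_gib_for_context(tokens: int, **shape: int) -> float:
    """KV cache in GiB for a single sequence of `tokens` tokens."""
    return tokens * kv_bytes_per_token(**shape) / GIB


def claimed_gib(gpu_memory_utilization: float, total_gib: float = 94.0) -> float:
    """Memory per GPU that vLLM claims for weights + KV."""
    return gpu_memory_utilization * total_gib


if __name__ == "__main__":
    # num_layers and head_dim are HYPOTHETICAL placeholders for illustration.
    shape = dict(num_layers=64, num_kv_heads=4, head_dim=128)
    for label, nbytes in (("fp8", 1), ("fp16", 2)):
        gib = kv_gib_for_context(131072, bytes_per_element=nbytes, **shape)
        print(f"{label}: {gib:.1f} GiB for one full 131072-token sequence")
    print(f"util 0.75 claims {claimed_gib(0.75):.1f} GiB per GPU")
```

With these placeholder values, one full-length sequence would need 8 GiB at FP8 and 16 GiB at FP16, which shows both the halving from `--kv-cache-dtype fp8` and the linear growth with `--max-model-len`. Substitute the real values from the model's config before drawing conclusions.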
| Measurement | Observed |
|---|---|
| Prefill vs decode | Qwen3.5-27B: prompt ~3,288 tok/s vs generation ~108 tok/s (single request); the Lesson 9 asymmetry, live |
| Workload shape | prompt:generation ratio 19:1 (qwen36) and 47:1 (qwen35) — heavily prefill-bound RAG |
| KV cache usage | 0.2–0.5% under light load (short RAG contexts vs the 128K ceiling) |
| Refresh | bash learning/tools/cluster-probe.sh |
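If you want to read the KV cache usage yourself rather than through the probe script, vLLM exposes Prometheus-format metrics on its `/metrics` endpoint. The sketch below parses that text for the cache-usage gauge. It makes several assumptions: the gauge name differs between vLLM versions (the two names listed are the ones to try first, so check your own `/metrics` output), the value is reported as a fraction between 0 and 1, and lines carry no trailing timestamp. The sample text is made up for illustration, and how the probe script actually collects its numbers is not shown here.

```python
# Gauge names to look for; which one exists depends on the vLLM version (assumption).
KV_USAGE_METRICS = ("vllm:kv_cache_usage_perc", "vllm:gpu_cache_usage_perc")

# Illustrative sample only, not captured from the cluster.
SAMPLE = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="example-model"} 0.004
vllm:num_requests_running{model_name="example-model"} 1.0
"""


def read_gauge(metrics_text: str, names: tuple[str, ...]) -> list[float]:
    """Return the values of every sample whose metric name is in `names`."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the label block "{" or at the first space.
        metric = line.split("{", 1)[0].split(" ", 1)[0]
        if metric in names:
            # The sample value is the last whitespace-separated field.
            values.append(float(line.rsplit(None, 1)[-1]))
    return values


if __name__ == "__main__":
    for fraction in read_gauge(SAMPLE, KV_USAGE_METRICS):
        print(f"KV cache usage: {fraction * 100:.1f}%")
```

In practice `metrics_text` would come from an HTTP GET against the serving pod's `/metrics` path; the host and port depend on how the service is exposed in your namespace.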