Your Lab · 4×H100 Inference Cluster

Your Lab

A real 4×H100 OpenShift cluster running live vLLM models — the course's measuring instrument. Every lesson can pull real numbers from it.

Why this matters

The calculator gives you the theory; this cluster gives you the truth. When Lesson 9 says "prefill ≫ decode throughput," you can watch it: ~3,288 tok/s prefill vs ~108 tok/s decode on your own H100s. When Lesson 10 computes a KV cache size, you can read the real usage % off vLLM. Refresh anytime with `bash learning/tools/cluster-probe.sh`.

⚠ Health: one NVLink pair is down (action needed)

Verified 2026-06-19: only GPU2↔GPU3 has working NVLink (12 links); GPU0↔GPU1 reports zero links. A 4×H100 NVL box should have two bridged pairs, but you currently have one, and the tensor-parallel model is sitting on the broken pair. For the full diagnosis, severity ranking, and a rebalance proposal, see Cluster Findings & Tuning Proposal.
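The placement problem can be sketched as a small check. This is a minimal illustration, not the probe script: the link counts are the ones reported above, the helper name `tp_group_is_bridged` is made up for this example, and GPU pairs with no reported bridge are assumed to have zero links.

```python
from itertools import combinations

# NVLink link counts per GPU pair, from the 2026-06-19 check.
# Pairs not listed are assumed to have no direct NVLink bridge.
NVLINK_LINKS = {
    frozenset({0, 1}): 0,   # this pair reports zero links
    frozenset({2, 3}): 12,  # the healthy pair
}

def tp_group_is_bridged(gpus, links=NVLINK_LINKS):
    """Return True if every GPU pair in the group has at least one NVLink."""
    return all(
        links.get(frozenset(pair), 0) > 0
        for pair in combinations(gpus, 2)
    )

print(tp_group_is_bridged([0, 1]))  # False: the TP 2 model's current pair
print(tp_group_is_bridged([2, 3]))  # True: the pair with working links
```

A check like this is only as good as its input; the link counts still have to come from the driver on the node.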

The hardware

GPUs	4 × NVIDIA H100 NVL, 94 GiB HBM3 each (95,830 MiB)
Logical GPUs	20 — each physical GPU time-sliced ×5 (MIG disabled)
Node	128 vCPU · ~2 TiB RAM · single worker (control-plane + worker)
Power/thermal	400 W cap/GPU; idle ~60–110 W, 38–64 °C when not generating

Time-slicing caveat: ×5 slicing means 5 pods can land on one physical GPU, but they share its 94 GiB with no isolation — the "20 GPUs" are scheduling slots, not 20 private memories. Overcommit and you OOM.
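To make the caveat concrete, here is a rough budget check for one physical GPU. It is a sketch under stated assumptions: the pod names and the 4 GiB figures for the small encoders are illustrative, not measured, and a vLLM pod is assumed to claim its full `--gpu-memory-utilization` fraction of the 94 GiB.

```python
GPU_MEMORY_GIB = 94.0  # per physical H100 NVL, from the hardware table

def check_gpu_budget(pods, capacity_gib=GPU_MEMORY_GIB):
    """Sum the peak GiB of co-located pods and compare against capacity.

    `pods` maps a pod name to the GiB it is expected to claim at peak.
    Returns (total_gib, fits).
    """
    total = sum(pods.values())
    return total, total <= capacity_gib

# Hypothetical placement: one vLLM pod at util 0.75 plus two small encoders.
pods = {
    "vllm-util-0.75": 0.75 * GPU_MEMORY_GIB,
    "embedding": 4.0,  # illustrative figure, not measured
    "reranker": 4.0,   # illustrative figure, not measured
}
total, fits = check_gpu_budget(pods)
print(f"{total:.1f} GiB requested, fits={fits}")

# A second vLLM pod at util 0.92 in another slot of the same GPU overcommits it.
pods["vllm-util-0.92"] = 0.92 * GPU_MEMORY_GIB
total, fits = check_gpu_budget(pods)
print(f"{total:.1f} GiB requested, fits={fits}")
```

The scheduler sees five free slots either way; only a memory sum like this tells you whether the pods actually fit.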

The models running now

Namespace	Model	GPUs	Notable serving flags
llm-serving-qwen36	Qwen3.6-27B-FP8 (vLLM)	TP 1	util 0.75 · 128K ctx · max-seqs 8 · kv fp8 · chunked prefill · prefix cache
qwen35-27b-fixed	Qwen3.5-27B-FP8 (vLLM)	TP 2	util 0.92 · 128K ctx · max-seqs 64 · kv fp8 · chunked prefill
llm-serving-embedding	BGE-M3 embedding	1	retrieval encoder
llm-serving-reranker	BGE-M3 reranker	1	retrieval reranker
bert-ner	BERT NER	1	entity extraction

Also present but not on GPU right now: llama, gemma4, gpt-oss, qwen3-vl, classifier, sentiment, bm25 namespaces (scaled to 0).
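As a rough illustration, the two vLLM rows in the table could map to `vllm serve` command lines like the ones built below. This is a sketch, not the cluster's actual deployment spec: the model paths are placeholders, the helper `vllm_serve_args` is hypothetical, and any flags not listed in the table are left at their defaults.

```python
import shlex

def vllm_serve_args(model, *, tp, util, max_len, max_seqs,
                    kv_dtype="fp8", chunked_prefill=True, prefix_cache=False):
    """Build a `vllm serve` argument list from one row of the models table."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp),
        "--gpu-memory-utilization", str(util),
        "--max-model-len", str(max_len),
        "--max-num-seqs", str(max_seqs),
        "--kv-cache-dtype", kv_dtype,
    ]
    if chunked_prefill:
        args.append("--enable-chunked-prefill")
    if prefix_cache:
        args.append("--enable-prefix-caching")
    return args

# Model paths are placeholders; the real locations are not given in this page.
qwen36 = vllm_serve_args("path/to/Qwen3.6-27B-FP8", tp=1, util=0.75,
                         max_len=131072, max_seqs=8, prefix_cache=True)
qwen35 = vllm_serve_args("path/to/Qwen3.5-27B-FP8", tp=2, util=0.92,
                         max_len=131072, max_seqs=64)

print(shlex.join(qwen36))
print(shlex.join(qwen35))
```

Comparing the two printed commands side by side makes the differences between the deployments easy to spot: tensor-parallel size, memory fraction, batch cap, and prefix caching.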

The Rosetta Stone — every flag is a lesson

This is the bridge from your ops layer to the internals. Each serving flag you set is one of the levers these lessons explain:

vLLM flag (on your cluster)	What it controls — and the lesson
--kv-cache-dtype fp8	1 byte/KV element → halves the cache vs FP16 — Lesson 10
GQA: 4 KV heads (arch)	fewer kv_heads → 6× smaller cache for Qwen3 — Lesson 10
--gpu-memory-utilization	fraction of 94 GiB vLLM claims for weights + KV — Lesson 10
--max-model-len 131072	per-request context ceiling; KV grows linearly with it — Lesson 10
--max-num-seqs	max concurrent sequences = the batch cap — Lesson 10 / 3
--max-num-batched-tokens	token budget per engine step (prefill+decode) — Lesson 9 / 3
--enable-chunked-prefill	split long prefills, interleave with decode — Lesson 12 (SARATHI)
--enable-prefix-caching	reuse KV for shared prompt prefixes — Lesson 12
--tensor-parallel-size 2	split one model across 2 GPUs (wants NVLink) — future TP lesson
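The KV-cache rows in the table can be tied together with one formula. The sketch below assumes a plain transformer in which every layer keeps full-attention K and V. The layer count, head dimension, and query-head count are hypothetical placeholders (the query-head count is chosen only to reproduce the 6× figure above); only the 4 KV heads and the 131072 context come from the table. Read the real values from the model's config before trusting any absolute number.

```python
GIB = 1024 ** 3

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """KV-cache bytes one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

LAYERS, HEAD_DIM = 64, 128  # hypothetical, not read from the model config
KV_HEADS = 4                # from the table (GQA)
Q_HEADS = 24                # hypothetical, picked to match the 6x figure
CTX = 131072                # --max-model-len

fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)
fp16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
mha_fp8 = kv_bytes_per_token(LAYERS, Q_HEADS, HEAD_DIM, 1)

print(f"fp16 / fp8 cache size: {fp16 / fp8:.0f}x")       # 2x
print(f"MHA / GQA cache size:  {mha_fp8 / fp8:.0f}x")    # 6x
print(f"one full-length request at fp8: {fp8 * CTX / GIB:.1f} GiB")
```

Because the per-token cost is constant, doubling the context length doubles the cache for that request, which is why `--max-model-len` and `--max-num-seqs` have to be budgeted together against the memory vLLM is allowed to claim.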

Live snapshot — measured 2026-06-19

Prefill vs decode	Qwen3.5-27B: prompt ~3,288 tok/s vs generation ~108 tok/s (single request) — the Lesson 9 asymmetry, live
Workload shape	prompt:generation ratio 19:1 (qwen36) and 47:1 (qwen35) — heavily prefill-bound RAG
KV cache usage	0.2–0.5% under light load (short RAG contexts vs the 128K ceiling)
Refresh	`bash learning/tools/cluster-probe.sh`
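The snapshot numbers can be turned into a back-of-the-envelope timing for one request. This is a simplified sketch: it treats the two measured single-request rates as constants and ignores batching, chunked prefill, and prefix-cache hits. The 4,700 and 100 token counts are an illustrative request with the 47:1 shape, not a measured one.

```python
PREFILL_TOK_S = 3288.0  # measured prompt throughput, single request
DECODE_TOK_S = 108.0    # measured generation throughput, single request

def request_time(prompt_tokens, gen_tokens):
    """Return (prefill_seconds, decode_seconds) at the measured rates."""
    prefill_s = prompt_tokens / PREFILL_TOK_S
    decode_s = gen_tokens / DECODE_TOK_S
    return prefill_s, decode_s

print(f"throughput ratio: {PREFILL_TOK_S / DECODE_TOK_S:.1f}x")  # 30.4x

# Illustrative request with a 47:1 prompt:generation shape.
prefill_s, decode_s = request_time(4700, 100)
total_s = prefill_s + decode_s
print(f"prefill {prefill_s:.2f} s, decode {decode_s:.2f} s, "
      f"prefill share {prefill_s / total_s:.0%}")
```

At these single-request rates, prefill and decode take equal time at roughly a 30:1 prompt:generation ratio, so the token ratio alone does not tell you where the wall-clock time goes.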
