A real fault on your 4×H100 lab — diagnosed, with a fix-it plan.
Tensor-parallel decode runs an all-reduce at every layer, for every token, to stitch the two GPUs' work together. Over NVLink that takes a blink; over PCIe it crawls, and the pair your TP model runs on (GPU0+GPU1) has no NVLink at all.
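To put a rough number on "every layer, every token", the sketch below counts the all-reduces one decoded token triggers. It rests on assumptions: the layer count and hidden size are hypothetical placeholders, not qwen35-27b's actual config, and two all-reduces per layer (one after attention, one after the MLP) is the usual Megatron-style split rather than anything the cluster reported.

```python
def allreduce_cost_per_token(num_layers, hidden_size, bytes_per_elem=2, per_layer=2):
    """Return (count, payload_bytes) of all-reduces for one decoded token at batch size 1.

    per_layer=2 is an assumption (attention output + MLP output);
    bytes_per_elem=2 assumes 16-bit activations.
    """
    count = num_layers * per_layer
    payload_bytes = count * hidden_size * bytes_per_elem
    return count, payload_bytes


# Hypothetical model shape, for illustration only.
count, payload = allreduce_cost_per_token(num_layers=64, hidden_size=5120)
print(f"{count} all-reduces per token, {payload / 1024:.0f} KiB total payload")
```

Each message is only a few kilobytes, so the per-call latency of the interconnect tends to matter more than its peak bandwidth, and the whole count repeats for every generated token.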
| Sev | Finding | Fix |
|---|---|---|
| P0 | NVLink bridge down on GPU0–GPU1. Only GPU2↔GPU3 has links; GPU0/1 report zero. | Reseat / install the bridge on the 0–1 pair (hardware), then re-verify. |
| P1 | TP model on the broken pair. qwen35-27b all-reduces over PCIe. | Move it to GPU2+GPU3, or run TP=1 (fits on one GPU). |
| P1 | Imbalance. GPU3 idle; GPU2 runs 4 servers at 88/94 GB. | Spread across all 4 GPUs (layouts below). |
| P2 | Prefix caching off on qwen35 — your most prefill-heavy model (47:1). | Add --enable-prefix-caching (one flag). |
| P2 | Driver upgrade failed (gpu-driver-upgrade-state). | Resolve the GPU-operator upgrade; re-check NVLink. |
| P3 | Time-slicing the big models — no memory isolation, non-deterministic placement. | Dedicated whole GPUs (or MIG) for big models; time-slice only the small ones. |
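The two P1/P2 software fixes (TP=1 and prefix caching) come down to launch arguments. Here is a minimal sketch that assembles such a command without running it. It assumes the servers are vLLM, which the `--enable-prefix-caching` flag suggests but the report does not state. The model argument reuses the report's label and may not be the real model path, the port is hypothetical, and pinning via `CUDA_VISIBLE_DEVICES` assumes a bare host; under Kubernetes the placement would come from the pod's GPU request instead.

```python
import os


def build_launch(model, gpu_ids, port, prefix_caching=True):
    """Build (argv, env) for one server pinned to the given GPUs; nothing is executed."""
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(len(gpu_ids)),  # TP=1 when a single GPU is given
        "--port", str(port),
    ]
    if prefix_caching:
        argv.append("--enable-prefix-caching")
    # Restrict the process to the chosen GPUs (bare-host assumption).
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(g) for g in gpu_ids))
    return argv, env


# "qwen35-27b" is the report's label; port 8001 is a placeholder.
argv, env = build_launch("qwen35-27b", gpu_ids=[2], port=8001)
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

Passing `gpu_ids=[2, 3]` instead would produce the TP=2 variant on the working NVLink pair.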
A 27B FP8 model (~27 GB) fits on one 94 GB GPU — qwen36 already proves it. Drop TP, one model per GPU: no all-reduce, best throughput, and it doesn't even need the broken bridge.
| GPU | Workload |
|---|---|
| GPU0 | qwen36-27b (TP=1) |
| GPU1 | embedding + reranker + bert-ner |
| GPU2 | qwen35-27b (TP=1) |
| GPU3 | replica of the busiest model (or a new model) |
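As a quick sanity check on this layout, the sketch below subtracts weights-only footprints from the 94 GB per GPU quoted above. The 27 GB and ~11 GB figures come from this report; treating the GPU3 replica as another 27B FP8 model is an assumption, and the remainder is only an upper bound on what KV cache and activations can use.

```python
GPU_CAPACITY_GB = 94  # per-GPU memory quoted in this report

# Weights-only footprints in GB. The GPU3 entry is an assumption (another 27B FP8 model).
layout = {
    "GPU0": {"qwen36-27b": 27},
    "GPU1": {"embedding + reranker + bert-ner": 11},
    "GPU2": {"qwen35-27b": 27},
    "GPU3": {"replica (assumed 27B FP8)": 27},
}


def headroom(layout, capacity_gb=GPU_CAPACITY_GB):
    """Return GB left on each GPU after weights; raise if any GPU is over capacity."""
    remaining = {}
    for gpu, models in layout.items():
        used = sum(models.values())
        if used > capacity_gb:
            raise ValueError(f"{gpu} over capacity: {used} GB > {capacity_gb} GB")
        remaining[gpu] = capacity_gb - used
    return remaining


for gpu, free in headroom(layout).items():
    print(f"{gpu}: {free} GB left for KV cache and activations")
```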
Option A: keep TP=2 for lowest single-stream latency. Put qwen35-27b on the working NVLink pair (GPU2+GPU3), move qwen36 to GPU0, and move the small models to GPU1. Running over NVLink instead of PCIe gains ~2× decode bandwidth, but it uses two GPUs for one model. Use this only if per-token latency is a hard SLO.
What the cluster reported (read-only), in case ops wants to see it:
nvidia-smi nvlink --status → GPU2, GPU3: 12 links @ 26.6 GB/s each
nvidia-smi nvlink --status -i 0 → (empty) # GPU0: no NVLink
nvidia-smi nvlink --status -i 1 → (empty) # GPU1: no NVLink
nvidia-smi topo -m → GPU0-GPU1: NODE (PCIe) ; GPU2-GPU3: NV12
compute-apps: GPU0 Worker_TP0 93GB | GPU1 Worker_TP1 93GB ← qwen35 TP on broken pair
GPU2 qwen36 76GB + embed/rerank/bert ~11GB ← crowded
GPU3 0 GB ← idle
nvidia-smi nvlink --status # both pairs should list 12 links
nvidia-smi topo -m # GPU0-GPU1 should read NV12, not NODE
bash learning/tools/cluster-probe.sh # prefill/decode + KV, live
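The first re-verification check can be automated by counting links per GPU in the captured output. This is a sketch, not a vetted tool: it assumes the text has one `GPU <n>: ...` header per GPU followed by `Link <i>: <rate> GB/s` lines, and that a GPU without NVLink is simply absent or lists no links, as seen above. The sample text is synthetic.

```python
import re


def nvlink_counts(status_text):
    """Map GPU index -> number of links that report a bandwidth."""
    counts = {}
    gpu = None
    for line in status_text.splitlines():
        header = re.match(r"\s*GPU (\d+):", line)
        if header:
            gpu = int(header.group(1))
            counts[gpu] = 0
        elif gpu is not None and re.match(r"\s*Link \d+: [\d.]+ GB/s", line):
            counts[gpu] += 1
    return counts


def gpus_missing_links(counts, expected_gpus, expected_links=12):
    """Return GPUs that show fewer links than expected (absent GPUs count as zero)."""
    return [g for g in expected_gpus if counts.get(g, 0) < expected_links]


def fake_gpu(idx, links=12):
    """Synthetic output for one GPU, mimicking the assumed format."""
    lines = [f"GPU {idx}: NVIDIA H100 (UUID: GPU-xxxx)"]
    lines += [f"\t Link {i}: 26.6 GB/s" for i in range(links)]
    return "\n".join(lines)


# Mirrors the reported state: only GPU2 and GPU3 list links.
sample = fake_gpu(2) + "\n" + fake_gpu(3)
print(gpus_missing_links(nvlink_counts(sample), expected_gpus=range(4)))  # → [0, 1]
```

On the live host, the input would be the stdout of `nvidia-smi nvlink --status`; an empty result list means all four GPUs report the expected 12 links.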