Tensor parallelism — and why the wire between the GPUs decides if it's worth it.
Each step below: commit a guess before you read the answer. Predicting first, even with a wrong guess, is what makes it stick.
Today's win: you'll predict what tensor parallelism splits, why it forces an all-reduce every layer, and when to use it versus just running replicas.
The setup
A 27B model in FP8 (~27 GB) fits comfortably on one 94 GB H100.
Step 1 — what does TP split?
Tensor parallelism (TP) runs one model on several GPUs. Each layer's weight matrices — including the per-head projections — are sharded across them.
Step 2 — the hidden cost
Each GPU computes a slice of the layer, so the slices must be stitched back together with an all-reduce before the next layer.
Recall (cover the screen): how often does a TP model all-reduce, and why does it hurt decode?
Answer: twice per layer, for every token, on the critical path (the next layer waits). Decode is already memory-bound, so adding cross-GPU communication every step directly raises token latency.
Step 3 — so the wire matters
Step 4 — TP or replicas?
Recall (say it): when is TP=N worth it?
Answer: when the model doesn't fit on one GPU, or you need the lowest single-stream latency AND have NVLink. If it fits and you want throughput, run replicas (TP=1 × N), which need no all-reduce.
On YOUR cluster (real fault)
qwen35-27b runs --tensor-parallel-size 2 on GPU0 + GPU1 — the pair whose NVLink is down. So its all-reduces cross PCIe, the worst case here. And 27B FP8 fits on one GPU, so TP=2 isn't even required. Fix: TP=1, or move it onto the working NVLink pair (GPU2+GPU3). Full diagnosis → Cluster Findings.
Explain to a colleague: "Splitting our model across two GPUs only helps if…"
Answer: …they're connected by NVLink. TP all-reduces every layer, every token, on the critical path, so over slow PCIe it adds latency instead of removing it. And if the model fits on one GPU, replicas usually beat TP for throughput anyway.
[1] Megatron-LM: Shoeybi et al., 2019 (arXiv:1909.08053).
[2] NVLink vs PCIe (spheron).
Tensor Parallelism & NVLink
Splitting one model across GPUs — and why the wire between them decides if it's worth it.
Today's win: you'll explain what tensor parallelism (TP) actually splits,
why it forces an all-reduce every layer, every token, why that makes NVLink-vs-PCIe
decisive, and when to reach for TP=N versus just running replicas — using your own cluster's
NVLink fault as the worked example.
Pantry: TP is two chefs splitting one dish. Each does half, but they must
combine their halves after every step. That needs a fast pass-through window between stations
(NVLink). Down a slow hallway (PCIe), every combine drags.
1 · What TP actually splits
A model too big (or too slow) for one GPU can be sharded: each layer's big
weight matrices are split across N GPUs, every GPU computes a slice of the layer, and
the slices are stitched back together with an all-reduce before the next step.
That's tensor parallelism (Megatron-LM) [1].
Pantry: one dish, two chefs — chef A preps the left half, chef B
the right; they merge into one plate before the next step. (Contrast: replicas = each
chef independently cooks whole dishes for different customers — no merging.)
Each GPU holds half the layer and computes a partial; the all-reduce sums them into
the full result. No combine → no correct output. The combine is mandatory.
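The split-then-combine idea above can be sketched in a few lines of plain Python. This is a toy illustration, not Megatron-LM's implementation: the matrix values, the ReLU activation, and every helper name (`split_columns`, `split_rows`, `all_reduce_sum`) are assumptions made for the example, and the "GPUs" are just list entries. The first weight matrix is split by columns, the second by rows, and the per-shard partial outputs only match the single-GPU result once they are summed.

```python
# Toy 2-way tensor-parallel MLP in pure Python (no GPUs, no real collectives).
# Shapes, values, and the ReLU activation are illustrative assumptions.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def split_columns(m, n):
    """Split a matrix into n equal column blocks, one per shard."""
    width = len(m[0]) // n
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(n)]

def split_rows(m, n):
    """Split a matrix into n equal row blocks, one per shard."""
    height = len(m) // n
    return [m[i * height:(i + 1) * height] for i in range(n)]

def all_reduce_sum(partials):
    """Stand-in for the all-reduce: element-wise sum of every shard's partial."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

x = [[1, -2, 3]]                                       # one token, hidden size 3
w1 = [[1, 0, -1, 2], [0, 1, 1, -1], [2, -1, 0, 1]]     # 3 x 4 up-projection
w2 = [[1, 2, 0], [0, -1, 1], [3, 0, -2], [1, 1, 1]]    # 4 x 3 down-projection

# Single-GPU reference result.
reference = matmul(relu(matmul(x, w1)), w2)

# Two shards: each holds half of w1's columns and the matching half of w2's rows.
tp = 2
w1_shards = split_columns(w1, tp)
w2_shards = split_rows(w2, tp)
partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]

combined = all_reduce_sum(partials)
print(partials)   # each shard's partial output, wrong on its own
print(combined)   # matches the reference only after the sum
```

Neither entry of `partials` equals `reference` by itself, which is the "no combine, no correct output" point in miniature.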
2 · The hidden cost: all-reduce every layer, every token
Megatron does two all-reduces per transformer layer (after attention, after
the MLP) [1]. In decode that whole stack runs per token: for a 64-layer model that's ~128
all-reduces to emit one token, all on the critical path, because the next layer can't start
until the combine lands. So the GPU-to-GPU wire goes straight into your token latency.
Pantry: the two chefs hand off on every step of
every dish. A fast window between stations = seamless. A slow hallway = they're walking
back and forth all night.
~128 all-reduces per token for a 64-layer model, each blocking the next layer. On
NVLink that's a blink; on PCIe each one costs several times more (about 5× by the ~640 vs
~128 GB/s bandwidth figures for this cluster), and decode is already memory-bound.
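The arithmetic behind that count is small enough to write down. The sketch below assumes the two-all-reduces-per-layer figure from the text. The per-all-reduce latencies passed in at the bottom are placeholders chosen only to show how the cost scales, not measurements of NVLink or PCIe, and the function names are hypothetical.

```python
# Back-of-envelope: how many all-reduces sit on the critical path per token.
ALL_REDUCES_PER_LAYER = 2  # after attention and after the MLP, per the text

def all_reduces_per_token(num_layers, per_layer=ALL_REDUCES_PER_LAYER):
    """Every layer runs once per decoded token, so the count scales with depth."""
    return num_layers * per_layer

def comm_overhead_us(num_layers, per_all_reduce_us):
    """Added latency per token if every all-reduce blocks for per_all_reduce_us."""
    return all_reduces_per_token(num_layers) * per_all_reduce_us

count = all_reduces_per_token(64)

# Placeholder latencies, NOT measurements: substitute numbers from your own hardware.
fast_us = comm_overhead_us(64, per_all_reduce_us=10)
slow_us = comm_overhead_us(64, per_all_reduce_us=50)

print(count)     # 128
print(fast_us)   # 1280 microseconds of blocking communication per token
print(slow_us)   # 6400 microseconds of blocking communication per token
```

Because the all-reduces are serialized on the critical path, any slowdown per all-reduce is multiplied by the full count for every token.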
3 · TP=N or replicas? The decision
TP buys you two things: it lets a model that doesn't fit on one GPU run at all, and
it adds memory bandwidth (N GPUs' HBM) for lower single-stream latency. But it costs the
all-reduce tax. If the model fits on one GPU, replicas (TP=1 × N) usually win on
throughput, since they need no communication at all [2].
Fits on one GPU + latency not critical → replicas. Doesn't fit, or you need minimum
token latency → TP=N, but only worth it on NVLink.
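The decision above can be written as a small function. This is a sketch of the rule of thumb only: `choose_layout` and its parameters are hypothetical names, and the fit check compares weight size to GPU memory without leaving headroom for the KV cache or activations, which a real sizing exercise would need.

```python
def choose_layout(model_gb, gpu_gb, latency_critical, has_nvlink):
    """Return the layout suggested by the lesson's rule of thumb."""
    # Simplification: weights only, no headroom for KV cache or activations.
    fits = model_gb < gpu_gb
    if not fits:
        # TP is the only way to run at all; without NVLink it will be slow.
        if has_nvlink:
            return "TP=N"
        return "TP=N (expect slow all-reduces over PCIe)"
    if latency_critical and has_nvlink:
        return "TP=N"
    # Fits on one GPU and either throughput matters more or the link is slow.
    return "replicas"

print(choose_layout(27, 94, latency_critical=False, has_nvlink=True))   # replicas
print(choose_layout(27, 94, latency_critical=True, has_nvlink=True))    # TP=N
print(choose_layout(27, 94, latency_critical=True, has_nvlink=False))   # replicas
```

The third call is the interesting one: even when latency matters, a model that fits on one GPU gains nothing from TP if the all-reduces would have to cross PCIe.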
In Kubernetes terms (infra bridge)
Tensor parallelism is like one workload sharded across pods: a StatefulSet whose shards must sync every step (an all-reduce barrier). So it lives or dies on the pod-to-pod network. In this analogy NVLink plays the fast dedicated fabric, while PCIe (inside a server) or Ethernet (between servers) plays the slower hop, which is exactly why your down-NVLink pair forces the all-reduce onto the slow path.
On YOUR cluster: this is exactly the trap (real fault)
qwen35-27b runs --tensor-parallel-size 2 on
GPU0 + GPU1 — the pair whose NVLink is down. So ~128 all-reduces per token cross
PCIe (~128 GB/s) instead of NVLink (~640 GB/s): the worst case in the decision above.
27B in FP8 (~27 GB) fits on one 94 GB GPU — qwen36 proves it (TP=1). So TP=2 isn't
even required here; it's a latency choice that's currently backfiring.
Fix per the decision tree: run it TP=1 (replica), or move TP=2 onto the
working NVLink pair (GPU2+GPU3) — and reseat the 0–1 bridge.
Full diagnosis + layouts → Cluster Findings.
Re-check the wire: nvidia-smi nvlink --status · nvidia-smi topo -m.
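The two numbers that drive this diagnosis can be checked with simple arithmetic. The sketch below uses only figures quoted in this lesson (27B parameters, FP8, a 94 GB GPU, ~640 GB/s NVLink, ~128 GB/s PCIe). The helper name is hypothetical, and the fit check counts weights only, ignoring KV cache and activation memory.

```python
def weight_footprint_gb(params_billion, bytes_per_param):
    """Weights only: 1 billion parameters at 1 byte each is 1 GB (decimal)."""
    return params_billion * bytes_per_param

weights_gb = weight_footprint_gb(27, 1)   # FP8 stores one byte per parameter
fits_on_one_gpu = weights_gb < 94         # 94 GB GPU, per the lesson

nvlink_gb_s = 640   # approximate figure quoted in the lesson
pcie_gb_s = 128     # approximate figure quoted in the lesson
bandwidth_ratio = nvlink_gb_s / pcie_gb_s

print(weights_gb)        # 27
print(fits_on_one_gpu)   # True
print(bandwidth_ratio)   # 5.0
```

Raw bandwidth is only a rough proxy: decode-time all-reduces move small messages, so per-call latency can matter as much as peak throughput.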
Picture the two chefs and the combine step, then answer from memory.
I'm your teacher — ask me anything. Want pipeline parallelism vs tensor
parallelism, the column/row-parallel math, or how TP interacts with the KV cache (Lesson 10)
across GPUs? Just ask.