Splitting a Model Across GPUs

Tensor parallelism — and why the wire between the GPUs decides if it's worth it.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.

Today's win: you'll predict what tensor parallelism splits, why it forces an all-reduce every layer, and when to use it versus just running replicas.

The setup

A 27B model in FP8 (~27 GB) fits comfortably on one 94 GB H100.

Step 1 — what does TP split?

Tensor parallelism (TP) runs one model on several GPUs. Each layer's weight matrices — including the per-head projections — are sharded across them.

Step 2 — the hidden cost

Each GPU computes a slice of the layer, so the slices must be stitched back together with an all-reduce before the next layer.

Recall — cover the screen: how often does a TP model all-reduce, and why does it hurt decode?
Twice per layer, for every token, on the critical path (the next layer waits). Decode is already memory-bound, so adding cross-GPU communication every step directly raises token latency.

Step 3 — so the wire matters

Step 4 — TP or replicas?

Recall — say it: when is TP=N worth it?
When the model doesn't fit on one GPU, or you need the lowest single-stream latency AND have NVLink. If it fits and you want throughput, run replicas (TP=1 × N): no all-reduce.

On YOUR cluster (real fault)

qwen35-27b runs --tensor-parallel-size 2 on GPU0 + GPU1 — the pair whose NVLink is down. So its all-reduces cross PCIe, the worst case here. And 27B FP8 fits on one GPU, so TP=2 isn't even required. Fix: TP=1, or move it onto the working NVLink pair (GPU2+GPU3). Full diagnosis → Cluster Findings.

Read this next (primary source): Megatron-LM, Shoeybi et al. (tensor parallelism) · NVLink vs PCIe.

Final check — teach it back

Explain to a colleague: "Splitting our model across two GPUs only helps if…"
…they're connected by NVLink: TP all-reduces every layer, every token, on the critical path, so over slow PCIe it adds latency instead of removing it. And if the model fits on one GPU, replicas usually beat TP for throughput anyway.


References

Megatron-LM — Shoeybi et al., 2019 (1909.08053); NVLink vs PCIe (spheron).

Tensor Parallelism & NVLink

Splitting one model across GPUs — and why the wire between them decides if it's worth it.

Today's win: you'll explain what tensor parallelism (TP) actually splits, why it forces an all-reduce every layer, every token, why that makes NVLink-vs-PCIe decisive, and when to reach for TP=N versus just running replicas — using your own cluster's NVLink fault as the worked example.

Pantry: TP is two chefs splitting one dish. Each does half, but they must combine their halves after every step. That needs a fast pass-through window between stations (NVLink). Down a slow hallway (PCIe), every combine drags.

1 · What TP actually splits

A model too big (or too slow) for one GPU can be sharded: each layer's big weight matrices are split across N GPUs, every GPU computes a slice of the layer, and the slices are stitched back together with an all-reduce before the next step. That's tensor parallelism (Megatron-LM) [1].

Pantry: one dish, two chefs — chef A preps the left half, chef B the right; they merge into one plate before the next step. (Contrast: replicas = each chef independently cooks whole dishes for different customers — no merging.)

Each GPU holds half the layer and computes a partial; the all-reduce sums them into the full result. No combine → no correct output. The combine is mandatory.
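The split-then-combine idea can be sketched on one machine with plain Python lists standing in for GPUs. This is a toy illustration, not Megatron's code: it assumes a two-matrix MLP, splits the first weight matrix by columns and the second by rows (the column/row-parallel layout the paper describes), and uses ReLU as a stand-in for the paper's GeLU. All function names are made up for this sketch.

```python
# Toy tensor parallelism: each "GPU" is just a shard of the weights.

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU (stand-in for the GeLU used in the paper)."""
    return [[max(0.0, v) for v in row] for row in m]

def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def all_reduce_sum(partials):
    """Stand-in for the all-reduce: element-wise sum of every shard's partial."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def mlp_reference(x, w1, w2):
    """The unsplit layer, computed on a single device."""
    return matmul(relu(matmul(x, w1)), w2)

def mlp_tensor_parallel(x, w1, w2, n_gpus=2):
    """Same layer, with w1 split by columns and w2 split by rows."""
    w1_shards = split_columns(w1, n_gpus)
    w2_shards = split_rows(w2, n_gpus)
    # Each shard computes a partial output of the full shape.
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    # Without this combine, no shard holds the correct result.
    return all_reduce_sum(partials)

x = [[1.0, 2.0]]
w1 = [[1.0, 2.0, -1.0, 0.5], [0.5, -1.0, 2.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [1.0, 1.0]]

full = mlp_reference(x, w1, w2)
sharded = mlp_tensor_parallel(x, w1, w2)
print(full, sharded)
```

Each shard's partial is wrong on its own; only the sum matches the single-device result, which is why the combine cannot be skipped.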

2 · The hidden cost: all-reduce every layer, every token

Megatron does two all-reduces per transformer layer in the forward pass (after attention, after the MLP) [1]. In decode that whole stack runs per token: for a 64-layer model that's ~128 all-reduces to emit one token, all on the critical path, since the next layer can't start until the combine lands. So the GPU-to-GPU wire goes straight into your token latency.

Pantry: the two chefs hand off on every step of every dish. A fast window between stations = seamless. A slow hallway = they're walking back and forth all night.

~128 all-reduces per token for a 64-layer model, each blocking the next layer. On NVLink that's a blink; PCIe has roughly 5 to 7× less bandwidth (~128 GB/s versus ~640 to 900 GB/s for NVLink, depending on the NVLink variant), and decode is already memory-bound.
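The arithmetic above can be written as a rough back-of-the-envelope model. This is a sketch under stated assumptions, not a benchmark. The bandwidth figures are the ones quoted in this lesson. The message size (hidden size 5120, 2-byte activations, batch of one) is hypothetical. The model is bandwidth-only by default, whereas real all-reduces this small are often dominated by fixed per-operation latency, which is exposed here only as an optional parameter.

```python
def allreduces_per_token(num_layers, per_layer=2):
    """Two all-reduces per transformer layer (after attention, after the MLP)."""
    return num_layers * per_layer

def comm_seconds_per_token(num_layers, message_bytes, link_bytes_per_s,
                           fixed_latency_s=0.0):
    """Simplified cost: every all-reduce pays a fixed latency plus bytes / bandwidth."""
    per_op = fixed_latency_s + message_bytes / link_bytes_per_s
    return allreduces_per_token(num_layers) * per_op

LAYERS = 64
MESSAGE_BYTES = 5120 * 2   # hypothetical: hidden size 5120, 2 bytes per activation
PCIE_BW = 128e9            # ~128 GB/s, as quoted in this lesson
NVLINK_BW = 640e9          # ~640 GB/s, as quoted in this lesson

count = allreduces_per_token(LAYERS)
pcie_s = comm_seconds_per_token(LAYERS, MESSAGE_BYTES, PCIE_BW)
nvlink_s = comm_seconds_per_token(LAYERS, MESSAGE_BYTES, NVLINK_BW)
ratio = pcie_s / nvlink_s

print(f"{count} all-reduces per token")
print(f"PCIe / NVLink communication time: {ratio:.1f}x")
```

With the fixed latency left at zero, the ratio is simply the bandwidth ratio. Passing a measured `fixed_latency_s` for each link would give a more realistic picture for decode-sized messages.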

3 · TP=N or replicas? The decision

TP buys you two things: it lets a model that doesn't fit on one GPU run at all, and it adds memory bandwidth (N GPUs' HBM) for lower single-stream latency. But it costs the all-reduce tax. If the model fits on one GPU, replicas (TP=1 × N) usually win on throughput, with no communication at all [2].

Fits on one GPU + latency not critical → replicas. Doesn't fit, or you need minimum token latency → TP=N, but only worth it on NVLink.
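The decision rule can be captured as a small helper. This is a minimal sketch: the function name and arguments are hypothetical, and "fits" compares weight size against GPU memory only, ignoring the headroom that the KV cache and activations also need.

```python
def choose_layout(model_gb, gpu_gb, n_gpus, need_min_latency, has_nvlink):
    """Return a layout suggestion following the lesson's rule of thumb."""
    fits_on_one_gpu = model_gb < gpu_gb  # simplification: weights only

    if not fits_on_one_gpu:
        # TP is forced; without NVLink every all-reduce takes the slow path.
        warning = "" if has_nvlink else " (expect slow all-reduces without NVLink)"
        return f"TP={n_gpus}" + warning

    if need_min_latency and has_nvlink:
        # TP is a latency choice, and only pays off on the fast link.
        return f"TP={n_gpus}"

    # Fits, and either throughput matters more or the fast link is missing.
    return f"replicas (TP=1 x {n_gpus})"

# A 27 GB model on 94 GB GPUs, wanting low latency but with NVLink down:
print(choose_layout(27, 94, 2, need_min_latency=True, has_nvlink=False))
```

The middle branch is the one that is easy to get wrong: wanting low latency is not enough on its own to justify TP when the link between the GPUs is slow.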

In Kubernetes terms (infra bridge)

Tensor parallelism is one workload sharded across pods: a StatefulSet whose shards must sync every step (an all-reduce barrier). So it lives or dies on the pod-to-pod network. NVLink plays the role of the same-node high-speed fabric and PCIe that of the slower cross-node hop (like Ethernet), which is exactly why the GPU pair with the downed NVLink forces the all-reduce onto the slow path.

On YOUR cluster: this is exactly the trap (real fault)

qwen35-27b runs --tensor-parallel-size 2 on GPU0 + GPU1 — the pair whose NVLink is down. So ~128 all-reduces per token cross PCIe (~128 GB/s) instead of NVLink (~640 GB/s): the worst case in the decision above.

27B in FP8 (~27 GB) fits on one 94 GB GPU — qwen36 proves it (TP=1). So TP=2 isn't even required here; it's a latency choice that's currently backfiring.
Fix per the decision tree: run it TP=1 (replica), or move TP=2 onto the working NVLink pair (GPU2+GPU3) — and reseat the 0–1 bridge.
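The "fits on one GPU" claim is simple arithmetic, sketched here for reference. It assumes FP8 stores one byte per parameter and counts weights only; the KV cache and activations also need a share of the remaining memory. The helper name is made up for this example.

```python
def weight_gb(params_billions, bytes_per_param):
    """Approximate weight memory in GB: billions of params x bytes per param."""
    return params_billions * bytes_per_param

GPU_GB = 94

fp8_gb = weight_gb(27, 1)    # FP8: 1 byte per parameter
fp16_gb = weight_gb(27, 2)   # FP16/BF16: 2 bytes per parameter, for comparison
headroom_gb = GPU_GB - fp8_gb

print(f"FP8 weights: ~{fp8_gb} GB, leaving ~{headroom_gb} GB on a {GPU_GB} GB GPU")
print(f"FP16 weights: ~{fp16_gb} GB")
```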

Full diagnosis + layouts → Cluster Findings. Re-check the wire: nvidia-smi nvlink --status · nvidia-smi topo -m.

Read this next (primary source): Megatron-LM: Training Multi-Billion Parameter Models Using Model Parallelism, Shoeybi et al., 2019. The origin of tensor parallelism (column/row-parallel layers + all-reduce). See §3 for the partitioning.

Check yourself (recall, don't peek)

Picture the two chefs and the combine step, then answer from memory.


References

[1] Shoeybi et al., 2019. Megatron-LM: Training Multi-Billion Parameter Models Using Model Parallelism. arXiv 1909.08053. arxiv.org
[2] NVLink vs PCIe bandwidth and why TP all-reduce needs the fast link (H100 NVLink ~900 GB/s vs PCIe ~128 GB/s). spheron.network · nvidia.com