Prefill vs decode — but this time, you work it out before I tell you.
Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: by the end you'll explain why decode is ~30× slower than prefill on
your own H100s — and you'll have predicted almost the whole answer yourself, one step at a time.
The mystery
Here are two real numbers from your cluster right now — Qwen3.5-27B, same GPU, same model,
one request:
prefill (reading the prompt): ~3,288 tokens/sec
decode (writing the answer): ~108 tokens/sec — about 30× slower
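As a quick sanity check on the gap, the two measured rates above can be divided directly. This is only a sketch: the variable names are illustrative, and the two throughput figures are the only inputs taken from the measurement.

```python
# Measured single-request throughputs quoted above (tokens per second).
prefill_tok_s = 3288
decode_tok_s = 108

# How many times faster prefill is than decode.
gap = prefill_tok_s / decode_tok_s

# Average wall-clock time per token in each phase, in milliseconds.
prefill_ms_per_token = 1000 / prefill_tok_s
decode_ms_per_token = 1000 / decode_tok_s

print(f"gap: {gap:.1f}x")
print(f"prefill: {prefill_ms_per_token:.2f} ms/token")
print(f"decode: {decode_ms_per_token:.2f} ms/token")
```

The per-token view is often easier to reason about than the ratio: each decoded token costs several milliseconds, while each prompt token costs a fraction of one.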
Step 1 — what does decode actually do?
Decode writes the answer one token at a time. Think about what the GPU needs for each new token.
Step 2 — so what's the bottleneck?
You've established that every token re-reads the whole model. Now: what limits how fast that can go?
Recall (cover the screen): in one line, why is decode memory-bound?
Answer: each token re-reads all the weights (huge bytes from HBM) but does only one token's worth of math, so bandwidth, not FLOPs, is the ceiling.
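The recall answer implies a simple upper bound: if every token has to stream all the weights once, decode speed cannot exceed memory bandwidth divided by weight size. The sketch below works that out under loudly stated assumptions that are not from the lesson: weights stored as 2-byte bf16, and 3.35e12 bytes/s of HBM bandwidth per GPU (the datasheet peak for an H100 SXM). The function name is hypothetical.

```python
def decode_ceiling_tok_s(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on single-request decode speed if every token must
    stream all weights from memory once (ignores KV cache and overheads)."""
    weight_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes

# Assumptions, not measurements: 27e9 parameters stored as 2-byte bf16,
# and 3.35e12 bytes/s of HBM bandwidth per GPU (H100 SXM datasheet peak).
N_PARAMS = 27e9
BYTES_PER_PARAM = 2
HBM_BYTES_PER_S = 3.35e12

one_gpu = decode_ceiling_tok_s(N_PARAMS, BYTES_PER_PARAM, HBM_BYTES_PER_S)
# Tensor parallelism across two GPUs roughly doubles the usable bandwidth.
two_gpus = decode_ceiling_tok_s(N_PARAMS, BYTES_PER_PARAM, 2 * HBM_BYTES_PER_S)
print(f"one GPU: {one_gpu:.0f} tok/s, two GPUs: {two_gpus:.0f} tok/s")
```

Under these assumptions the two-GPU bound is roughly 124 tok/s, the same range as the ~108 tok/s measured on a setup this lesson describes as tensor-parallel across 2 H100s. A different weight precision or GPU variant would shift the bound.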
Step 3 — the test that proves it
If decode is really memory-bound, here's a prediction that pins it down. Imagine swapping in a GPU
with double the compute (FLOPs) but the same memory bandwidth.
That's the whole asymmetry: prefill is compute-bound (big matrix multiplies over
the entire prompt, tensor cores saturated), decode is memory-bandwidth-bound
(re-read everything, do almost no math). Same GPU, opposite ceilings, and that's your 30×. [1]
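The thought experiment in this step can be played out with a simplified roofline model, in which a step takes as long as the slower of its compute time and its memory-traffic time. Everything numeric here is an assumption for illustration: about 2 FLOPs per parameter per token, bf16 weights, a hypothetical 2,048-token prompt, and round hardware figures that are not tied to a specific GPU.

```python
def step_time_s(flops, bytes_moved, peak_flops, bandwidth):
    """Simplified roofline: a step takes as long as the slower of
    its compute time and its memory-traffic time."""
    return max(flops / peak_flops, bytes_moved / bandwidth)

# Assumptions for illustration: 27e9 params in bf16 (2 bytes each),
# about 2 FLOPs per parameter per token, and round hardware numbers
# (1e15 FLOP/s, 3e12 bytes/s) that are not tied to a specific GPU.
N_PARAMS = 27e9
WEIGHT_BYTES = 2 * N_PARAMS
PEAK_FLOPS = 1e15
BANDWIDTH = 3e12
PROMPT_TOKENS = 2048  # hypothetical prompt length

def decode_step(peak_flops):
    # One token: all weights read once, one token's worth of math.
    return step_time_s(2 * N_PARAMS, WEIGHT_BYTES, peak_flops, BANDWIDTH)

def prefill_pass(peak_flops):
    # Whole prompt: weights read once, math for every prompt token.
    return step_time_s(2 * N_PARAMS * PROMPT_TOKENS, WEIGHT_BYTES,
                       peak_flops, BANDWIDTH)

decode_speedup = decode_step(PEAK_FLOPS) / decode_step(2 * PEAK_FLOPS)
prefill_speedup = prefill_pass(PEAK_FLOPS) / prefill_pass(2 * PEAK_FLOPS)
print(f"decode speedup from 2x FLOPs: {decode_speedup:.2f}")
print(f"prefill speedup from 2x FLOPs: {prefill_speedup:.2f}")
```

In this model, doubling compute leaves the decode step unchanged and halves the prefill pass, which is the signature of a memory-bound phase next to a compute-bound one.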
Step 4 — the payoff: batching
One more prediction — this is where it connects to your day job. Instead of one request, serve
8 at once. The expensive part of decode was re-reading the weights…
Recall (say it from memory): prefill is ___-bound; decode is ___-bound; batching mainly helps ___.
Answer: prefill = compute-bound · decode = memory-bound · batching mainly helps decode throughput (one weight-read serves the whole batch).
So continuous batching, routing, autoscaling — your ops notebooks — are all really about
keeping the memory-bound decode phase full, because a full batch multiplies
throughput for almost no extra cost. (Up to memory and compute limits — that's Lessons 10 and 11.)
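To see why a fuller batch is nearly free, here is a rough sketch that reuses the simplified roofline idea: the weight read is paid once per decode step, while the math grows with the number of sequences sharing it. The parameter values are assumptions for illustration, the function name is hypothetical, and KV-cache traffic is deliberately ignored.

```python
def decode_throughput_tok_s(batch, n_params, bytes_per_param,
                            peak_flops, bandwidth):
    """Tokens/s for one decode step shared by `batch` sequences, under a
    simplified roofline that ignores KV-cache traffic."""
    weight_read_s = n_params * bytes_per_param / bandwidth  # paid once per step
    math_s = 2 * n_params * batch / peak_flops              # grows with batch
    return batch / max(weight_read_s, math_s)

# Assumed values, not measurements: bf16 weights and round hardware numbers.
ASSUMED = dict(n_params=27e9, bytes_per_param=2,
               peak_flops=1e15, bandwidth=3e12)

for batch in (1, 2, 4, 8):
    rate = decode_throughput_tok_s(batch, **ASSUMED)
    print(f"batch {batch}: {rate:.0f} tok/s")
```

While the weight read dominates the step, throughput in this model grows in direct proportion to batch size. Real systems fall short of that because each sequence also brings its own KV cache to read.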
The real test of the predict-first format: can you reconstruct it cold?
Explain to a colleague (out loud): "Our 27B is 30× slower at writing than reading because…"
Answer: …reading the prompt (prefill) is one big parallel matmul that saturates the GPU's compute, while writing (decode) re-streams the entire model from memory for every single token, so it's capped by memory bandwidth, not compute. Batching many requests shares that one weight-read, which is why throughput scales with batch.
I'm your teacher — ask me anything. Did the predict-first format help, or get in the
way? Want the supporting diagrams from the main version added back as reveals? Tell me and
I'll tune it before redoing Lessons 2+.
Prefill is Compute-Bound, Decode is Memory-Bound. Towards Data Science, towardsdatascience.com.
Mastering LLM Techniques. NVIDIA.
Prefill vs. Decode
The one asymmetry that explains your whole serving stack.
Today's win: by the end you'll be able to say, from memory, why
prefill is compute-bound and decode is memory-bandwidth-bound — and use
that single fact to predict how a serving system behaves under load.
First, a picture for your head: the faraway pantry
You're a lightning-fast line cook. Your
ingredients sit in a pantry across town; one road and one van connect you.
Watch the van — it hauls the entire pantry for every order. Hold
this image; each technical term below maps onto it.
Every request runs in two phases. Prefill reads your whole prompt
in one pass and emits the first token. Decode then emits
the rest one token at a time, each step feeding back the token it just
made. [1]
Pantry: prefill = one big catering order cooked in a
single go. Decode = à la carte, one bite at a time.
Watch the tokens light up one after another on the right —
that's decode, strictly sequential. Prefill (left) is a single wide pass.
2 · Same GPU, opposite bottleneck
Here's the crux. Prefill does huge matrix–matrix multiplies and
saturates the tensor cores, so it is limited by compute. Decode, to
make a single token, must re-read all the model weights plus the whole
KV cache from memory while doing
almost no math, so it is limited by memory bandwidth. [2][3]
Pantry: the catering cook is limited by how fast he
can cook (compute). The à-la-carte cook is limited by the van
(bandwidth) — the cooking is trivial.
The shine sweeps the full bar — prefill maxes compute, decode
maxes memory. Speeding up the idle resource buys nothing.
3 · Why one decode token is so expensive
It feels backwards that generating one token is the slow part. The reason:
per token you move gigabytes of weights + cache out of memory, but do only one
token's worth of arithmetic. That ratio — FLOPs per byte — is the
arithmetic intensity,
and for decode it's tiny. [3]
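A back-of-envelope version of that ratio, counting weight traffic only: the sketch assumes about 2 FLOPs per parameter per token and 2-byte bf16 weights, and it ignores KV-cache and activation bytes. The function name and the 2,048-token prompt are hypothetical.

```python
def arithmetic_intensity(tokens_per_weight_read, bytes_per_param=2):
    """FLOPs per byte of weights moved, assuming ~2 FLOPs per parameter
    per token and ignoring KV-cache and activation traffic."""
    flops_per_param = 2 * tokens_per_weight_read
    return flops_per_param / bytes_per_param

# bytes_per_param=2 assumes bf16 weights; 2048 is a hypothetical prompt length.
decode_intensity = arithmetic_intensity(tokens_per_weight_read=1)
prefill_intensity = arithmetic_intensity(tokens_per_weight_read=2048)
print(f"decode:  {decode_intensity:.0f} FLOP/byte")
print(f"prefill: {prefill_intensity:.0f} FLOP/byte")
```

In this simplified count the model size cancels out: intensity depends only on how many tokens share one pass over the weights, which is one for single-request decode and the full prompt length for prefill.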
Pantry: the van hauls the whole pantry
across town just so you can make one knife-cut. Watch it haul below.
The fat pipe (bytes) is the bottleneck; the math is trivial. That's what
"memory-bandwidth-bound" looks like.
4 · The payoff — batching
This is where it ties back to your repo. Because that one expensive weight-read can
serve many sequences at once, batching multiplies decode throughput almost
for free. Continuous batching, routing, autoscaling — your ops notebooks are all
strategies to keep the memory-bound decode phase fed. [4]
Pantry: the van is crossing town anyway with
the whole pantry — so plate bites for 32 customers from that one trip.
One weight-read, many sequences served in turn. Bigger batch → more
tokens per byte moved → higher throughput. This is why continuous batching exists.
One nuance to file away: decode is memory-bound at the small/medium
batch sizes most serving runs at. Past roughly batch 32 the decode matrix
multiplies themselves turn compute-bound, while attention stays bandwidth-bound.
We'll derive that crossover with the roofline model in a later lesson. [3]
In Kubernetes terms (infra bridge)
Prefill is a request's heavy init / cold-start work — one big compute burst — while decode is its long-running steady-state loop. Same split as an initContainer (expensive, runs once) versus the main container (runs forever): two different resource profiles, which is exactly why some stacks even schedule them on separate node pools (Lesson 20).
On YOUR cluster (measured, live)
This isn't just theory — it's running on your 4×H100
lab right now. Qwen3.5-27B (tensor-parallel across 2 H100s), single request:
prefill (prompt) throughput up to ~3,288 tok/s — compute-bound, devouring the prompt
decode (generation) throughput ~108 tok/s — memory-bound, one token at a time
That ≈30× gap is the asymmetry of this lesson, on your own
GPUs. And your traffic is heavily prompt-weighted (prompt:generation ≈ 19:1 to
47:1 by token count, classic RAG), which is exactly why these servers run
--enable-chunked-prefill. Watch it live:
bash learning/tools/cluster-probe.sh
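To connect the token ratios to wall-clock time, the sketch below estimates how long a single request would spend in each phase if both ran at the rates quoted above. The 100-token output length is a hypothetical choice, the prefill figure is a quoted peak rather than a sustained rate, and batching and queueing are ignored.

```python
PREFILL_TOK_S = 3288   # measured peak quoted above
DECODE_TOK_S = 108     # measured rate quoted above

def phase_times_s(prompt_tokens, output_tokens):
    """Seconds spent in each phase if both run at the quoted rates."""
    return prompt_tokens / PREFILL_TOK_S, output_tokens / DECODE_TOK_S

# Hypothetical requests with 100 output tokens at the two quoted ratios.
for ratio in (19, 47):
    prefill_s, decode_s = phase_times_s(prompt_tokens=ratio * 100,
                                        output_tokens=100)
    print(f"{ratio}:1 -> prefill {prefill_s:.2f} s, decode {decode_s:.2f} s")
```

Under these assumptions, at 19:1 the short answer still takes longer to write than the long prompt takes to read, while at 47:1 prefill takes the larger share. The break-even sits near the ≈30× throughput gap.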
Picture the van and the diagrams above, then answer from memory.
I'm your teacher — ask me anything. Want a diagram redrawn, the van
to move differently, a worked example with a real model + GPU, or the KV-cache
memory math next? Say the word and I'll go deeper or build the next lesson around it.