Two fixes for wasted decode — predicted, then revealed.
Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict why naïve batching idles the GPU and why
contiguous KV wastes memory — and how continuous batching + PagedAttention fix both.
The setup
You batch 4 requests together. They finish at very different lengths — one wants 20 tokens, another 800.
Step 1 — why does static batching idle?
Step 2 — the fix
If the problem is waiting for the slowest, what should you change about when you schedule?
Recall (cover the screen): in one line, what is continuous batching?
Answer: schedule one decode step at a time. After every step, evict finished requests and admit waiting ones, so the running batch never drains. (Iteration-level scheduling, from Orca.)
Step 3 — but can you fit the batch?
Continuous batching only helps if the sequences fit in memory. Early servers reserved one
contiguous KV region per request, sized to its maximum possible length.
Step 4 — the memory fix
Recall (say it): what does PagedAttention borrow from operating systems, and what does it buy?
Answer: virtual memory / paging. Store KV in small fixed-size blocks allocated on demand and non-contiguously. Fragmentation drops to nearly zero, so far more sequences fit and continuous batching can actually keep the batch full.
On YOUR cluster (real config)
vLLM is this engine. --max-num-seqs (8 on qwen36, 64 on qwen35) sets the continuous-batch width;
--enable-prefix-caching shares KV blocks across requests with the same prompt prefix (big for your
RAG); --enable-chunked-prefill keeps long prefills from blocking decodes. Live gauge:
num_requests_running vs num_requests_waiting.
Explain to a colleague: "Naïve LLM serving wastes the GPU two ways…"
Answer: static batching idles it waiting for the slowest request in the batch (fixed by continuous batching: swap requests every step), and contiguous max-length KV reservation wastes 60–80% of memory (fixed by PagedAttention: small on-demand blocks).
I'm your teacher — ask me anything. Want a diagram of prefix-caching block sharing, or how chunked prefill works?
PagedAttention: Kwon et al., SOSP '23 (arXiv:2309.06180) · Orca: Yu et al., OSDI '22 (USENIX).
PagedAttention & Continuous Batching
Two fixes that keep the memory-bound decode phase fed.
Today's win: you'll explain why naïve static batching idles
the GPU and why contiguous KV allocation wastes most of your memory — and how
continuous batching + PagedAttention fix both, turning the batch cap you computed in
Lesson 10 into real throughput.
Pantry recap: decode is the van crossing town. From Lesson 9
you know the van is cheapest when it's full. This lesson is about two ways a
naïve kitchen keeps the van half-empty, and how to fix them.
1 · Problem A — static batching idles the GPU
Group N requests, run them together, wait for all to finish, then start the
next batch. The catch: requests finish at wildly different lengths, so the whole batch
stalls on the slowest one while finished slots sit idle. On real traffic that leaves
roughly 60% of the GPU idle [3].
Pantry: one shared van that refuses to leave until
every customer's order is done — the quick orders just wait on the slow one.
A, B, C finish early but their slots idle until D ends — only then can new
work start. The hatched area is wasted GPU.
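A back-of-the-envelope sketch of that hatched area, under a simplifying assumption: every slot costs one unit per decode step, and the batch runs until its longest request ends. The four output lengths are made-up illustrative values, not measurements from any real traffic.

```python
def static_batch_idle_fraction(lengths):
    """Fraction of slot-steps wasted when a batch runs until its longest request ends."""
    steps = max(lengths)                      # the batch is held until the slowest finishes
    total_slot_steps = steps * len(lengths)   # every slot is occupied for all those steps
    useful_slot_steps = sum(lengths)          # each request only needed its own length
    return 1 - useful_slot_steps / total_slot_steps


lengths = [20, 150, 300, 800]  # hypothetical output lengths, in tokens
idle = static_batch_idle_fraction(lengths)
print(f"idle slot-steps: {idle:.0%}")  # prints "idle slot-steps: 60%"
```

With these particular lengths the idle share happens to land near the figure quoted above; other length mixes give other numbers, and a batch of equal-length requests wastes nothing.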
2 · Fix A — continuous batching (iteration-level scheduling)
Instead of scheduling whole requests, schedule one decode step at a time.
After every step, evict finished requests and admit waiting ones — the batch never
drains. This is iteration-level scheduling, introduced by Orca, and
it's worth 4–8× throughput on variable-length workloads [1, 4].
Pantry: the van never waits — it drops finished orders
and picks up waiting ones on every trip, so it's always full.
req A finishes and leaves; req E immediately takes the freed slot. No
waiting for the whole batch — the running set stays full step after step.
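The scheduling difference can be sketched as a toy simulation. It assumes one token per request per step, ignores prefill entirely, and uses hypothetical request lengths; `static_steps` and `continuous_steps` are names invented here, not part of Orca or vLLM.

```python
from collections import deque


def static_steps(lengths, width):
    """Decode steps when each batch of `width` requests runs until its longest one ends."""
    return sum(max(lengths[i:i + width]) for i in range(0, len(lengths), width))


def continuous_steps(lengths, width):
    """Decode steps with iteration-level scheduling: freed slots are refilled every step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the step runs.
        while waiting and len(running) < width:
            running.append(waiting.popleft())
        # One decode step: each running request emits a token; finished ones are evicted.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


lengths = [20, 800, 50, 30, 400, 25, 60, 700]  # hypothetical output lengths
print(static_steps(lengths, width=4))      # prints 1500
print(continuous_steps(lengths, width=4))  # prints 800
```

In this toy case the short requests stop holding slots hostage, so the same eight requests finish in roughly half the steps. The size of the gain depends entirely on how variable the lengths are.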
3 · Problem B — contiguous KV wastes most of your memory
Continuous batching only helps if you can fit the sequences. Early servers
reserved one contiguous KV region per request, sized to its
maximum possible length. Most requests never reach it, so 60–80% of
the reserved memory sits empty, and you fit far fewer sequences than
Lesson 10's math promised [2].
Pantry: reserving a whole shelf per customer in case
they order the maximum — shelves fill up while mostly empty, and you turn customers away.
The blue is the KV you really use; the hatch is reserved-for-max waste.
Fragmentation, not the formula, is why so few requests fit.
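A small arithmetic sketch of the reserved-for-max waste, counting in token slots rather than bytes. The maximum length and the actual sequence lengths are assumptions chosen for illustration.

```python
def reserved_waste(actual_lens, max_len):
    """Share of reserved KV slots left empty when each request reserves `max_len` up front."""
    reserved = max_len * len(actual_lens)  # one contiguous max-length region per request
    used = sum(actual_lens)                # slots the sequences actually filled
    return 1 - used / reserved


actual_lens = [220, 900, 410, 1500]  # hypothetical final sequence lengths, in tokens
waste = reserved_waste(actual_lens, max_len=2048)
print(f"wasted: {waste:.0%}")  # prints "wasted: 63%"
```

The per-token KV formula is unchanged here; only the reservation policy decides how much of the budget is spent on tokens that never arrive.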
4 · Fix B — PagedAttention
Borrow the operating system's trick: virtual memory. Split the KV cache into small
fixed-size blocks and hand them out on demand, non-contiguous,
as each sequence grows. Waste drops to nearly zero, so you finally fit the batch your
Lesson 10 math allowed. This is PagedAttention; together with continuous
batching it gave vLLM up to 24× the throughput of HuggingFace Transformers [2].
Pantry: ditch per-customer shelves. Use small bins from
a shared rack, assigned only as each order actually grows — almost no empty space.
Like OS paging: many requests share one pool of small blocks. Almost no
empty space, so continuous batching can actually keep the batch full.
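A minimal sketch of the paging idea, not vLLM's actual allocator: a shared pool of fixed-size blocks plus a per-sequence block table. The class name `BlockPool`, its methods, and the block size of 16 tokens are assumptions made for this example.

```python
class BlockPool:
    """Toy KV block allocator: blocks are handed out on demand, non-contiguously."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        # Allocate a new block only when the sequence has filled all it owns.
        if n == len(table) * self.block_size:
            if not self.free:
                raise MemoryError("KV pool exhausted")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Empty token slots, which can only exist in each sequence's last block."""
        return sum(len(t) * self.block_size - self.lengths.get(s, 0)
                   for s, t in self.tables.items())


pool = BlockPool(num_blocks=8)
for _ in range(20):
    pool.append_token("A")   # A grows to 20 tokens and ends up owning 2 blocks
for _ in range(5):
    pool.append_token("B")   # B grows to 5 tokens and owns 1 block
print(len(pool.free), pool.wasted_slots())  # prints "5 23"
pool.release("A")
print(len(pool.free), pool.wasted_slots())  # prints "7 11"
```

Waste is bounded by less than one block per sequence, whatever the maximum length is, which is why the pool fills with real tokens instead of reservations.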
5 · Putting it together — two requests, A and B
Now the synthesis. When person A and person B hit the same
model at once, what actually produces each full reply? Everything above shows up here on one
timeline — prefill (L1), per-sequence KV
(L2), continuous batching (§2) and paging (§4).
Pantry: two customers, one van. The van still hauls the whole
pantry per trip — but on that single trip the cook plates a bite for A and a bite for B.
Each customer's own order ticket, though, is read separately.
A prefills and starts decoding; B is admitted the moment it's prefilled and
joins the same forward passes; when A hits EOS its slot frees and a queued request E
reuses it — the batch never drains.
Reading one column at a time — what the GPU's single forward pass holds each iteration:
| iter | what the one forward pass contains | tokens out |
|------|------------------------------------|------------|
| 1 | prefill A (whole prompt) | A: +tok 1 |
| 2 | prefill B + decode A, mixed (chunked prefill) | B: +tok 1 · A: +tok 2 |
| 3–38 | decode A + decode B, fused, one weight-read | A & B: +1 each |
| 39 | decode A + decode B (A hits EOS) | A: ✓ done · B: +tok |
| 40 | decode B (+ prefill E from the queue) | B: +tok · E: +tok 1 |
Inside one fused step: weights shared, KV separate
The reason B is almost free to add: the model's weight matrices are read once
from memory and drive both sequences in the same pass (the
Lesson 9 batching win). But the
attention step reads each sequence's own KV cache
(Lesson 10) — which is exactly why attention
gets no cross-batch reuse and stays memory-bound
(Lesson 11).
One weight-read, two sequences — but two separate KV caches. Shared weights are the
throughput win; the separate KV is why each request keeps its own context (and its own memory cost).
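A rough arithmetic sketch of why adding B is cheap. It counts only bytes read from memory per decode step, and every number in it (weight size, KV bytes per token, context lengths) is a hypothetical placeholder rather than a measurement of any particular model.

```python
def bytes_read_per_step(weight_bytes, kv_bytes_per_token, context_lens):
    """Memory traffic for one fused decode step: weights once, KV once per sequence."""
    kv_bytes = sum(kv_bytes_per_token * n for n in context_lens)
    return weight_bytes + kv_bytes


WEIGHTS = 16 * 10**9        # hypothetical: 16 GB of weights
KV_PER_TOKEN = 128 * 1024   # hypothetical: 128 KiB of KV per cached token

one = bytes_read_per_step(WEIGHTS, KV_PER_TOKEN, [2000])
two = bytes_read_per_step(WEIGHTS, KV_PER_TOKEN, [2000, 2000])

print(f"1 sequence : {one / 1e9:.2f} GB read per generated token")
print(f"2 sequences: {two / 2 / 1e9:.2f} GB read per generated token")
```

The weight term is divided across the batch while the KV term is not, so per-token traffic falls toward the KV cost as the batch widens.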
So B never waits for A. It joins the instant it's prefilled (continuous
batching, §2), rides every shared step after that, and when A finishes, A's KV blocks free
immediately (PagedAttention, §4) for the next request in the queue. Because your servers run
--enable-chunked-prefill
(Lesson 24), B's prompt also slips in
over a few steps without freezing A's token stream.
In Kubernetes terms (infra bridge)
Continuous batching is the scheduler bin-packing pods: admit new requests onto the GPU (node) the instant others finish, keeping utilization high instead of draining the whole batch first. And PagedAttention is memory overcommit with pages — hand out small KV blocks on demand instead of reserving each request's max up front, the same way you overcommit node memory rather than reserving every pod's limit.
On YOUR cluster: this engine is vLLM (real config)
Both fixes ship in the servers you're running. The
flags are the dials:
--max-num-seqs 8 (qwen36) / 64 (qwen35) — the
continuous-batch width: how many sequences decode together.
--enable-prefix-caching — PagedAttention blocks shared across
requests with the same prompt prefix (huge for your RAG system prompts).
--enable-chunked-prefill — splits long prefills so they don't block
decodes — vital since your traffic is 19–47:1 prefill-heavy.
The live gauge is num_requests_running vs
num_requests_waiting (the "Running: N reqs" line). Watch it:
bash learning/tools/cluster-probe.sh. Want the real throughput-vs-concurrency
curve measured on your H100s? That needs a small opt-in load test: say the
word and I'll run a bounded one.
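For reading the gauge programmatically, here is a small sketch that parses Prometheus-style text such as a vLLM server's `/metrics` endpoint returns. The sample payload below is invented for illustration, and the parser assumes lines carry no trailing timestamp; `read_gauges` is a hypothetical helper, not part of vLLM.

```python
def read_gauges(metrics_text, names=("num_requests_running", "num_requests_waiting")):
    """Sum the named gauges from Prometheus exposition text, ignoring labels."""
    gauges = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):      # skip blanks and HELP/TYPE comments
            continue
        metric, _, value = line.rpartition(" ")   # the value is the last field
        base = metric.split("{", 1)[0]            # drop the {label="..."} part
        for name in names:
            if base.endswith(name):
                gauges[name] = gauges.get(name, 0.0) + float(value)
    return gauges


# Invented sample payload; real output contains many more metrics.
sample = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 8.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 3.0
"""

g = read_gauges(sample)
print(g)  # prints {'num_requests_running': 8.0, 'num_requests_waiting': 3.0}
```

A running count pinned at the `--max-num-seqs` value with a non-zero waiting count suggests the batch is full and requests are queueing.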
You use this every day: prompt caching (bridge to the API)
Why is a Claude Code (or any LLM-API) session cheap on repeated
turns but pricey to resume tomorrow? Same mechanism: prompt caching is prefix caching made
cross-request, time-limited, and billed — it stores the prefill KV for a stable prefix so
the next request skips re-prefilling
it. It's --enable-prefix-caching with a clock and a price tag.
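One way to picture prefix reuse is a chain of block hashes: each full block's key covers its own tokens plus everything before it, so two prompts share keys exactly as far as their prefixes agree. This is a simplified sketch of that idea with an assumed block size of 16 and Python's built-in `hash`; it is not the hashing scheme of any specific engine or API.

```python
def block_keys(token_ids, block_size=16):
    """Chain-hash each full block so a key identifies the block and its whole prefix."""
    keys, parent = [], None
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial last block
    for start in range(0, full, block_size):
        parent = hash((parent, tuple(token_ids[start:start + block_size])))
        keys.append(parent)
    return keys


def shared_blocks(a, b, block_size=16):
    """Number of leading KV blocks two prompts could reuse from each other."""
    count = 0
    for ka, kb in zip(block_keys(a, block_size), block_keys(b, block_size)):
        if ka != kb:
            break
        count += 1
    return count


system_prompt = list(range(40))                         # hypothetical shared prefix, 40 tokens
req_a = system_prompt + [1000 + i for i in range(20)]   # hypothetical user turn A
req_b = system_prompt + [2000 + i for i in range(20)]   # hypothetical user turn B
print(shared_blocks(req_a, req_b))  # prints 2
```

Only whole blocks are shared: the 40-token prefix fills two blocks of 16, and the third block already mixes in request-specific tokens, so it gets a different key.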
Write the cache once (1.25×), then ride cheap reads (0.1×) while the session stays warm.
Go idle past the TTL and the cache evicts — the next turn re-prefills the whole history from scratch.
| item | value |
|------|-------|
| cache read (hit) | ≈ 0.1× input price (~90% off) |
| cache write (cold / after expiry) | 1.25× (5-min TTL) · 2× (1-hour TTL) |
| TTL | 5 min default (refreshed on each use) · 1 hour option |
Resume tomorrow? Idle past the TTL and the cache is gone, so your first turn
is a miss: the entire history is re-prefilled and billed at the 1.25× cache-write rate
(base input price plus a 25% write premium), then cheap reads resume. So continuing now is cheap;
coming back tomorrow re-pays once to rebuild the cache. Compared with an uncached request, the only
extra cost is that 25% premium on that one turn; otherwise the savings simply reset.
(Multipliers/TTL are Anthropic's; the mechanism — prefix-KV reuse
+ eviction — is universal.)
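A worked cost sketch using the multipliers quoted above. It is deliberately simplified: it prices input tokens only, in units of the base per-token input price, and assumes every new token on a turn is written to the 5-minute cache. The token counts are hypothetical.

```python
BASE = 1.0        # uncached input price per token (the unit of account)
WRITE_5M = 1.25   # cache-write multiplier, 5-minute TTL
READ = 0.1        # cache-read multiplier


def warm_turn(prefix_tokens, new_tokens):
    """Cache hit: the stored prefix is read cheaply, only the new tokens are written."""
    return prefix_tokens * READ + new_tokens * WRITE_5M


def cold_turn(prefix_tokens, new_tokens):
    """Cache miss (first turn or after expiry): everything is re-prefilled and written."""
    return (prefix_tokens + new_tokens) * WRITE_5M


def uncached_turn(prefix_tokens, new_tokens):
    """No caching at all: every token is billed at the base input price."""
    return (prefix_tokens + new_tokens) * BASE


prefix, new = 50_000, 500  # hypothetical session history and new-turn sizes
print(f"warm turn    : {warm_turn(prefix, new):.0f}")      # prints 5625
print(f"cold turn    : {cold_turn(prefix, new):.0f}")      # prints 63125
print(f"uncached turn: {uncached_turn(prefix, new):.0f}")  # prints 50500
```

Under these assumptions a cold turn costs 25% more than an uncached one, and a single warm turn afterwards more than pays that premium back.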
Pantry: the recipe binder stays open on
the counter while you keep cooking (warm reads); leave the kitchen idle and the staff put it away —
tomorrow they re-read the whole binder from scratch before the first dish.
Picture the idle bars and the packed blocks, then answer from memory.
I'm your teacher — ask me anything. Want the opt-in live batching
benchmark on your H100s, a diagram of prefix-caching block sharing, or to connect
max-num-seqs back to an SLO in your routing notebooks? Just ask.