Inference Engineering · Lesson 7 · The Autoregressive Loop & Cost
The Autoregressive Loop & the Cost Staircase
Why generation slows down — one guess at a time.
Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict the autoregressive loop, why each token costs
more than the last, and the fixes that flatten that staircase — the bridge into the inference runtime.
The setup
You have every piece now: tokenize → embed → attention → forward pass → decode. A 40-token answer needs
40 forward passes in a row.
Step 1 — the shape of generation
Step 2 — what grows each step?
Recall (cover the screen): why does each decode step get slower? Because every step re-reads the entire sequence so far to predict the next token. Step 1 processes a short sequence; step 40 processes 40+ tokens. Work per step grows with length: the "cost staircase".
Step 3 — the staircase (real GPT-2 numbers)
On plain GPT-2, the first token took ~64 ms.
Step 4 — the root cause
Step 5 — the fix
Recall (say it): three ways the runtime flattens the staircase. KV cache (store past Keys/Values so each step does ~constant work, Lesson 10); batching (one weight-read serves many requests, Lesson 12); speculative decoding (guess several tokens, verify in one pass, Lesson 15).
On YOUR cluster live
Your Qwen on vLLM already applies all three fixes, which is why it holds ~108 generation
tok/s instead of degrading token-by-token like raw GPT-2. Next, Lesson 9 splits this loop into
prefill vs decode, the asymmetry that drives the whole runtime. Welcome to Part II.
Read this next — primary source
Runnable companion: day01 notebook. It measures this exact staircase on GPT-2.
Final check — teach it back
Explain to a colleague: "Naive generation gets slower per token because…" Answer: it's a loop where every step re-reads the entire sequence so far, so work per token grows with length (a rising staircase). The KV cache fixes the recomputation, batching amortizes the weight-read, and speculative decoding skips passes; together they flatten it.
I'm your teacher — that's Part I done. Want to run the staircase measurement, or jump into prefill vs decode?
Why generation is a loop, why it gets slower, and why that's the whole rest of the course.
Today's win: you'll explain the autoregressive loop, measure why each token
costs more than the last (the "staircase"), and name the three fixes that flatten it — the bridge from
these foundations into the inference runtime.
The picture: re-read the whole ticket before every bite
You now have all the pieces: tokenize, embed, attention, forward pass,
decode. Chain them and generation is a loop — predict a token, append it, do it again. The catch:
to predict each new token, the cook re-reads the entire order from the top. Short orders are
quick; long ones drag — every bite a little slower than the last.
plate a bite, then re-read the whole ticket = one decode step (re-reads the whole sequence)
the ticket getting longer each bite = sequence length grows by 1 per token
each bite taking a little longer = the cost staircase
1 · Generation is a loop
Put Lessons 1–6 together. Prefill the
prompt once, then loop: forward pass → decode a
token → append it → feed the longer sequence back. This is autoregressive
generation. [1]
Generation = prefill once, then loop forward-pass→decode→append. Each turn of the loop emits
exactly one token (Lesson 12 revisits this
with two requests).
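The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the course's notebook code: `toy_next_token` is a hypothetical stand-in for "forward pass + decode", and the stop token id of 0 is an assumption made for the example.

```python
from typing import Callable, List


def toy_next_token(sequence: List[int]) -> int:
    """Hypothetical stand-in for a forward pass plus greedy decode.

    Like a naive model, it reads the WHOLE sequence and returns one token id.
    """
    return (sum(sequence) + len(sequence)) % 50


def generate(
    prompt: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
    stop_token: int = 0,  # assumed stop id, for illustration only
) -> List[int]:
    sequence = list(prompt)            # start from the prompt
    for _ in range(max_new_tokens):
        token = next_token(sequence)   # forward pass + decode: exactly one token
        sequence.append(token)         # append it
        if token == stop_token:        # stop once the stop token appears
            break
    return sequence                    # each turn fed the longer sequence back in


out = generate([3, 7, 11], toy_next_token, max_new_tokens=5)
print(out)  # the 3 prompt tokens followed by 5 generated tokens
```

Note that `next_token` receives the full `sequence` on every turn. That one detail is where the growing cost in the next section comes from.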
2 · The catch: every step re-reads the whole sequence
Here's the hidden cost. To produce token 41, the model must process all 40 tokens so far. Step 1
processes a short sequence; step 40 processes a much longer one. The work per step grows with the
sequence. [1]
The sequence the model re-reads gets longer by one token every step — so each step does a bit
more work than the last.
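To put rough numbers on that growth, you can count how many tokens a naive loop feeds through the model at each step. The sketch below assumes a 5-token prompt and 40 generated tokens; both values are made up for illustration, and it counts tokens read, not milliseconds.

```python
from typing import List


def tokens_read_per_step(prompt_len: int, new_tokens: int) -> List[int]:
    """Tokens a naive (no-cache) loop re-reads at each decode step."""
    # Step i (0-based) sees the prompt plus the i tokens generated so far.
    return [prompt_len + i for i in range(new_tokens)]


per_step = tokens_read_per_step(prompt_len=5, new_tokens=40)
total_naive = sum(per_step)

print(per_step[0], per_step[-1])  # 5 tokens at the first step, 44 at the last
print(total_naive)                # 980 tokens read in total
```

The sequence itself only ever reaches 45 tokens, yet the naive loop reads 980 tokens in total, because every earlier token is read again at every later step.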
3 · The cost staircase (measured)
Plot the time per token and it's a staircase, each bar taller than the last. On
plain GPT-2 (no optimizations), the first token took ~64 ms and the 40th took
~147 ms, about 2.3× slower, purely because the sequence grew. [1]
Pantry: bite 1 is instant; by bite 40 the cook re-reads a novel-length
ticket first. Same kitchen, slower service — entirely from the growing re-read.
The naive baseline, measured on GPT-2. This rising staircase is the single problem the entire
inference-runtime half of this course exists to flatten.
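If you want to see the shape of such a measurement without a GPU, here is a hedged sketch of a per-token timing harness. It is not the day01 notebook: `slow_toy_model` is a hypothetical stand-in whose work simply scales with sequence length, and the only real numbers are the ~64 ms and ~147 ms quoted above.

```python
import time
from typing import Callable, List


def time_per_token(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    n_tokens: int,
) -> List[float]:
    """Run a naive generation loop and record milliseconds spent on each step."""
    sequence = list(prompt)
    durations_ms: List[float] = []
    for _ in range(n_tokens):
        start = time.perf_counter()
        token = next_token(sequence)  # naive step: gets the whole sequence
        durations_ms.append((time.perf_counter() - start) * 1000.0)
        sequence.append(token)
    return durations_ms


def slow_toy_model(sequence: List[int]) -> int:
    """Hypothetical model: does a fixed amount of work per token it re-reads."""
    acc = 0
    for tok in sequence:
        for _ in range(200):
            acc = (acc + tok) % 97
    return acc


durations = time_per_token(slow_toy_model, prompt=[1, 2, 3], n_tokens=40)

# Slowdown implied by the lesson's GPT-2 figures (first token vs 40th token).
slowdown = 147 / 64
print(f"reported slowdown: {slowdown:.2f}x")
```

Plotting `durations` as bars gives the staircase for the toy model. Individual timings are noisy, so look at the trend across steps, not at any single bar.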
4 · The fixes — and the bridge to Part II
Almost everything ahead is an attack on this staircase [2]:
KV cache — the past tokens' Keys/Values don't change, so store them instead of
recomputing → each step does ~constant work (Lesson 10).
Batching — one weight-read serves many requests at once, amortizing the expensive
part (Lesson 12).
Speculative decoding — guess several tokens with a cheap model and verify in one
pass (Lesson 15).
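The first fix is the easiest to see in code. Below is a toy, pure-Python sketch of the KV cache idea, under loud assumptions: keys, values and queries are single floats produced by made-up formulas (`project_kv` and the `0.2 * token` query are hypothetical stand-ins for learned weight matrices), and there is one attention head and no layers. It only shows that caching gives the same outputs while computing far fewer key/value projections.

```python
import math
from typing import Dict, List, Tuple

calls: Dict[str, int] = {"kv_projections": 0}  # how many K/V projections ran


def project_kv(token: int) -> Tuple[float, float]:
    """Hypothetical scalar key and value for one token."""
    calls["kv_projections"] += 1
    return 0.1 * token, 0.5 * token + 1.0


def attend(query: float, keys: List[float], values: List[float]) -> float:
    """Softmax attention of one query over every key/value so far."""
    scores = [query * k for k in keys]
    top = max(scores)                              # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def step_naive(sequence: List[int]) -> float:
    """Recompute K and V for every token in the sequence, every step."""
    kv = [project_kv(t) for t in sequence]
    return attend(0.2 * sequence[-1], [k for k, _ in kv], [v for _, v in kv])


def step_cached(token: int, key_cache: List[float], value_cache: List[float]) -> float:
    """Project only the new token; reuse stored K and V for the past."""
    k, v = project_kv(token)
    key_cache.append(k)
    value_cache.append(v)
    return attend(0.2 * token, key_cache, value_cache)


tokens = [4, 9, 2, 7, 5, 1]

calls["kv_projections"] = 0
naive_out = [step_naive(tokens[: i + 1]) for i in range(len(tokens))]
naive_calls = calls["kv_projections"]

calls["kv_projections"] = 0
key_cache: List[float] = []
value_cache: List[float] = []
cached_out = [step_cached(t, key_cache, value_cache) for t in tokens]
cached_calls = calls["kv_projections"]

print(naive_calls, cached_calls)  # 21 projections without the cache, 6 with it
```

Even with the cache, `attend` still walks over every stored key, so attention itself keeps growing with length. That is why the text says "~constant" and not "constant": the recomputation is what disappears.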
Done. You've finished the foundations — the next lessons turn this staircase into a flat,
cheap, scalable line.
In Kubernetes terms (infra bridge)
The autoregressive loop is a reconcile loop: observe current state (the sequence so far) → compute the next action (predict a token) → apply it (append) → repeat until converged (a stop token). The same observe-decide-act cycle your controllers run — and the KV cache is just caching the observed state so each tick doesn't re-list everything from scratch.
On YOUR cluster — the staircase is already flattened live
Your Qwen on vLLM already uses every fix above. That's why it sustains ~108
generation tok/s with the KV cache instead of degrading token-by-token like raw GPT-2. The very next
lesson (Lesson 9) splits that loop into its two
phases — prefill vs decode — and the asymmetry between them drives the entire runtime. Welcome to
Part II.
Read this next — primary source
Runnable companion: day01 notebook — it
measures this exact staircase on GPT-2. Then continue to the inference-runtime half of the course.
Check yourself (recall, don't peek)