Inference Engineering · Lesson 7 · The Autoregressive Loop & Cost

The Autoregressive Loop & the Cost Staircase

Why generation slows down — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict the autoregressive loop, why each token costs more than the last, and the fixes that flatten that staircase — the bridge into the inference runtime.

The setup

You have every piece now: tokenize → embed → attention → forward pass → decode. A 40-token answer needs 40 forward passes in a row.

Step 1 — the shape of generation

Step 2 — what grows each step?

Recall — cover the screen: why does each decode step get slower?
Because every step re-reads the entire sequence so far to predict the next token. Step 1 processes a short sequence; step 40 processes 40+ tokens. Work per step grows with length: the "cost staircase".

Step 3 — the staircase (real GPT-2 numbers)

On plain GPT-2, the first token took ~64 ms.

Step 4 — the root cause

Step 5 — the fix

Recall — say it: three ways the runtime flattens the staircase.
KV cache (store past Keys/Values so each step skips recomputing them and does roughly constant new work; Lesson 10); batching (one weight-read serves many requests; Lesson 12); speculative decoding (guess several tokens, verify in one pass; Lesson 15).
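To make "guess several tokens, verify in one pass" concrete, here is a minimal greedy sketch. Everything in it is hypothetical: `target_next` and `draft_next` are made-up arithmetic rules standing in for a large model and a small draft model, and the verification is written token by token for readability, whereas a real runtime would score all proposed positions in one target forward pass.

```python
def target_next(seq):
    """Stand-in for the large model's greedy next token (made-up rule)."""
    return (7 * sum(seq) + len(seq)) % 50

def draft_next(seq):
    """Stand-in for a small draft model: usually agrees with the target."""
    guess = target_next(seq)
    return (guess + 1) % 50 if len(seq) % 4 == 0 else guess

def speculative_round(seq, k):
    """Draft proposes k tokens; keep the longest prefix the target agrees with."""
    proposal, ctx = [], list(seq)
    for _ in range(k):
        token = draft_next(ctx)
        proposal.append(token)
        ctx.append(token)

    accepted, ctx = [], list(seq)
    for token in proposal:
        wanted = target_next(ctx)
        if token != wanted:
            accepted.append(wanted)    # first mismatch: take the target's token and stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # all k accepted: the verify pass yields one more
    return accepted

def generate_speculative(prompt, n_new, k=4):
    """Run verify rounds until n_new tokens exist; count rounds (target passes)."""
    seq, rounds = list(prompt), 0
    while len(seq) - len(prompt) < n_new:
        seq.extend(speculative_round(seq, k))
        rounds += 1
    return seq[:len(prompt) + n_new], rounds

def generate_plain(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq
```

The property worth noticing is that the speculative output matches plain greedy decoding token for token; the only thing that changes is how many verify rounds the target model needs.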

On YOUR cluster (live)

Your Qwen on vLLM already applies all three fixes, which is why it holds ~108 gen tok/s instead of degrading token-by-token like raw GPT-2. After Lesson 8, Lesson 9 splits this loop into prefill vs decode, the asymmetry that drives the whole runtime. Welcome to Part II.

Read this next (primary source): the runnable companion, the day01 notebook, measures this exact staircase on GPT-2.

Final check — teach it back

Explain to a colleague: "Naive generation gets slower per token because…"
…it's a loop where every step re-reads the entire sequence so far, so work per token grows with length (a rising staircase). The KV cache fixes the recomputation, batching amortizes the weight-read, and speculative decoding skips passes; together they flatten it.
I'm your teacher — that's Part I done. Want to run the staircase measurement, or jump into prefill vs decode?
← Lesson 6 · Next: Lesson 8 →
References
  1. day01 — autoregressive loop & per-token timing (notebook); fixes in Lessons 10, 12, 15.

The Autoregressive Loop & the Cost Staircase

Why generation is a loop, why it gets slower, and why that's the whole rest of the course.

Today's win: you'll explain the autoregressive loop, measure why each token costs more than the last (the "staircase"), and name the three fixes that flatten it — the bridge from these foundations into the inference runtime.

The picture: re-read the whole ticket before every bite

You now have all the pieces: tokenize, embed, attention, forward pass, decode. Chain them and generation is a loop — predict a token, append it, do it again. The catch: to predict each new token, the cook re-reads the entire order from the top. Short orders are quick; long ones drag — every bite a little slower than the last.

plate a bite, then re-read the whole ticket = one decode step (re-reads the whole sequence)
the ticket getting longer each bite = sequence length grows by 1 per token
each bite taking a little longer = the cost staircase

1 · Generation is a loop

Put Lessons 1–6 together. Prefill the prompt once, then loop: forward pass → decode a token → append it → feed the longer sequence back. This is autoregressive generation. [1]

[Figure: prefill prompt → forward pass → decode + append (+1 token) → feed the longer sequence back; repeat until a stop token]
Generation = prefill once, then loop forward-pass→decode→append. Each turn of the loop emits exactly one token (Lesson 12 revisits this with two requests).
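The loop can be sketched in a few lines of Python. This is a toy, not the day01 notebook's code: `toy_forward`, `STOP`, and `VOCAB_SIZE` are made-up stand-ins for a real model's forward pass, stop token, and vocabulary.

```python
STOP = 0          # hypothetical stop-token id
VOCAB_SIZE = 50   # hypothetical vocabulary size

def toy_forward(tokens):
    """Stand-in for a forward pass: one score per vocabulary entry.

    Like the naive loop, it reads the whole sequence every time it is called.
    """
    seed = sum((i + 1) * t for i, t in enumerate(tokens))
    return [(seed * (v + 3)) % 101 for v in range(VOCAB_SIZE)]

def greedy_decode(scores):
    """Pick the id of the highest-scoring token (greedy decoding)."""
    return max(range(len(scores)), key=scores.__getitem__)

def generate(prompt, max_new_tokens):
    """Forward pass, decode, append; repeat until STOP or the token budget."""
    tokens = list(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        scores = toy_forward(tokens)   # first call plays the role of prefill
        passes += 1
        next_token = greedy_decode(scores)
        tokens.append(next_token)      # the sequence grows by exactly one
        if next_token == STOP:
            break
    return tokens, passes

tokens, passes = generate([5, 9, 2], max_new_tokens=40)
print(len(tokens) - 3, "new tokens from", passes, "forward passes")
```

However many tokens come out, the two counts printed at the end are equal: one forward pass per emitted token.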

2 · The catch: every step re-reads the whole sequence

Here's the hidden cost. To produce token 41, the model must process all 40 tokens so far. Step 1 processes a short sequence; step 40 processes a much longer one. The work per step grows with the sequence. [1]

[Figure: tokens read at step 1, step 3, and step N; more tokens are read each step]
The sequence the model re-reads gets longer by one token every step — so each step does a bit more work than the last.
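A quick count shows how this adds up. The sketch below assumes a 1-token prompt (an illustrative choice, so that step 40 reads 40 tokens as in the example above) and simply tallies tokens read; it says nothing about real timings.

```python
def tokens_read_per_step(prompt_len, new_tokens):
    """Tokens the naive loop re-reads at each step (nothing is cached)."""
    return [prompt_len + step for step in range(new_tokens)]

prompt_len, new_tokens = 1, 40     # assumed sizes, chosen for illustration
reads = tokens_read_per_step(prompt_len, new_tokens)

naive_total = sum(reads)
# Closed form of the same sum: n*p + n*(n-1)/2, which grows with n squared
closed_form = new_tokens * prompt_len + new_tokens * (new_tokens - 1) // 2

print(reads[0], reads[-1])   # 1 40
print(naive_total)           # 820
```

For comparison, a loop that only ever processed the newly added token would touch 40 tokens in total for the same 40 steps, against 820 here.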

3 · The cost staircase (measured)

Plot the time per token and it's a staircase, each bar taller than the last. On plain GPT-2 (no optimizations), the first token took ~64 ms and the 40th took ~147 ms: about 2.3× slower, purely because the sequence grew. [1]

Pantry: bite 1 is instant; by bite 40 the cook re-reads a novel-length ticket first. Same kitchen, slower service — entirely from the growing re-read.
[Figure: ms per token (y-axis) against generation step (x-axis), rising from ~64 ms to ~147 ms; each token slower than the last]
The naive baseline, measured on GPT-2. This rising staircase is the single problem the entire inference-runtime half of this course exists to flatten.
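Some back-of-the-envelope arithmetic on those two numbers. Only the ~64 ms and ~147 ms endpoints come from the measurement; the straight line drawn between them is an assumption made here for illustration, so the intermediate values and totals are rough estimates, not measured results.

```python
first_ms, last_ms = 64.0, 147.0    # approximate endpoints quoted in the lesson
first_step, last_step = 1, 40

slowdown = last_ms / first_ms

# ASSUMPTION: time per token grows in a straight line between the two endpoints
ms_per_extra_step = (last_ms - first_ms) / (last_step - first_step)

def estimated_ms(step):
    """Estimated time for one step under the straight-line assumption."""
    return first_ms + ms_per_extra_step * (step - first_step)

staircase_total = sum(estimated_ms(s) for s in range(first_step, last_step + 1))
flat_total = first_ms * last_step   # if every token cost what the first one did

print(round(slowdown, 2))           # 2.3
print(round(ms_per_extra_step, 2))  # 2.13
print(round(staircase_total), round(flat_total))  # 4220 2560
```

Under that assumption, each extra token in the sequence adds roughly 2 ms to the next step, and the 40-token answer costs about 4.2 s instead of about 2.6 s.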

4 · The fixes — and the bridge to Part II

Almost everything ahead is an attack on this staircase: the KV cache (Lesson 10), batching (Lesson 12), and speculative decoding (Lesson 15). [2]

[Figure: cost per token against generation step; the naive line (re-read everything) climbs, while the KV-cache line stays roughly flat. Part II opens each box and flattens the stair.]
Done. You've finished the foundations — the next lessons turn this staircase into a flat, cheap, scalable line.
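As a preview of the first fix, here is a toy single-head attention step written two ways: recomputing Keys and Values for the whole sequence, and caching them. It is a sketch under loud assumptions: `project` is a made-up stand-in for learned projections, `D` is a tiny hypothetical head size, and `KVCache` is an illustrative class, not any library's API.

```python
import math

D = 4  # hypothetical tiny head dimension

def project(token, salt):
    """Stand-in for a learned Q/K/V projection of one token (made-up numbers)."""
    return [math.sin(token * (i + 1) + salt) for i in range(D)]

def attend(query, keys, values):
    """Softmax attention of one query over all keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(D) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm for i in range(D)]

def naive_step(tokens, counter):
    """Recompute K and V for every token, then attend from the last one."""
    keys = [project(t, 1.0) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    counter["kv"] += len(tokens)
    return attend(project(tokens[-1], 0.0), keys, values)

class KVCache:
    """Keeps past Keys/Values so each step projects only the new token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token, counter):
        self.keys.append(project(token, 1.0))
        self.values.append(project(token, 2.0))
        counter["kv"] += 1
        return attend(project(token, 0.0), self.keys, self.values)

tokens = [3, 14, 15, 9, 2, 6]
naive_count, cached_count = {"kv": 0}, {"kv": 0}
cache = KVCache()
for n in range(1, len(tokens) + 1):
    full = naive_step(tokens[:n], naive_count)
    incremental = cache.step(tokens[n - 1], cached_count)
    assert all(abs(a - b) < 1e-12 for a, b in zip(full, incremental))

print(naive_count["kv"], cached_count["kv"])  # 21 6
```

Both versions produce the same attention output at every step; the cached one does 6 K/V projections where the naive one does 21. The `attend` call still scans every cached key, which is why the flattened line is "roughly" rather than exactly flat.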

In Kubernetes terms (infra bridge)

The autoregressive loop is a reconcile loop: observe current state (the sequence so far) → compute the next action (predict a token) → apply it (append) → repeat until converged (a stop token). The same observe-decide-act cycle your controllers run — and the KV cache is just caching the observed state so each tick doesn't re-list everything from scratch.

On YOUR cluster: the staircase is already flattened (live)

Your Qwen on vLLM already uses every fix above. That's why it sustains ~108 generation tok/s with the KV cache instead of degrading token-by-token like raw GPT-2. After Lesson 8, Lesson 9 splits that loop into its two phases, prefill vs decode, and the asymmetry between them drives the entire runtime. Welcome to Part II.

Read this next (primary source): the runnable companion, the day01 notebook, measures this exact staircase on GPT-2. Then continue to the inference-runtime half of the course.

Check yourself (recall, don't peek)

I'm your teacher — that's Part I done. Want to run the staircase measurement yourself, or jump straight into prefill vs decode? Just say the word.
← Lesson 6 — forward pass & sampling Next: Lesson 8 — what is an inference engine? →
References
  1. Autoregressive generation & the measured per-token staircase — day01 (llm-inference-mechanics.ipynb).
  2. KV caching, batching, speculative decoding (the fixes) — Lessons 10, 12, 15.