The ground floor — worked out one guess at a time.
Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll be able to say what your model actually does on every call
(predict one token, append, repeat) and why training and inference are different jobs. That is the base every
later lesson builds on.
The setup
You send a prompt to your LLM and get a paragraph back. But what does the model produce first,
in a single step? Let's build the answer.
Step 1 — how is the answer produced?
Step 2 — what's the raw output of one step?
Recall (cover the screen): what is an LLM, in one line? A next-token predictor: given the text so far, it outputs a probability distribution over its whole vocabulary for the next token, picks one, appends it, and repeats. The model only ever predicts one token at a time.
Step 3 — how confident is it? (real cluster data)
Your Qwen3.6, given "Kubernetes pods are scheduled by the", makes its top guess " kube" at 38.9%.
Step 4 — training vs inference
Step 5 — the hidden cost
Recall (say it): training vs inference, which is which? Training learns the weights from oceans of text: once, and hugely expensive (like compiling a binary). Inference runs those frozen weights on your prompt on every request (like running the binary). This course is about inference.
On YOUR cluster: live Qwen3.6
Qwen3.6-27B-FP8 on your 4×H100 is exactly this loop. Real next-token distribution for
"Kubernetes pods are scheduled by the": " kube" 38.9% · " scheduler" 18.4% · " Kubernetes" 12.6%
— it learned pods are placed by the kube-scheduler, as a probability not a fact. See it raw:
curl localhost:8000/v1/completions … "max_tokens":1,"logprobs":8.
Explain to a colleague: "When we call our LLM, under the hood it…" …runs our prompt through the frozen weights to get a probability distribution over the next token, picks one, appends it, and feeds the longer sequence back in, looping one token at a time until it emits a stop token. It never plans the whole answer; it just predicts the next token, repeatedly.
I'm your teacher — ask me anything. Want the live next-token call on your own prompts, or how training actually sets the weights?
Before any optimization: what your model actually does on every single call.
Today's win: you'll explain, from memory, what an LLM does when you call it —
predict one token, append it, repeat — and why "training" and "inference" are two
completely different jobs. This is the ground floor the whole course is built on.
The picture: a cook with a memorized cookbook
Extend the Faraway Pantry:
the model is a cook who has internalized one enormous cookbook — billions of numbers
(the weights) learned by tasting a huge slice of the internet. To answer you, the cook doesn't
plan the whole dish; it predicts the next step, one at a time, the way a recipe unfolds.
writing the cookbook (once, costly) → training: learn the weights
cooking orders from the finished cookbook → inference: what every API call does
predicting the next step of the recipe → next-token prediction
1 · An LLM predicts the next token — nothing more
This is the whole secret. An LLM does not compose a reply and write it out. Given the text
so far, it produces a probability for every token in its vocabulary, a guess at what
comes next, and one token is chosen. That's a single step.[1]
Pantry: the cook glances at the order ticket so far and predicts the
single most likely next ingredient — not the finished plate.
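To make the single step concrete, here is a minimal sketch of how raw scores become a distribution and how one token is picked. The four-token vocabulary and its scores (`toy_logits`) are made up for illustration and are not output from any real model; `softmax` and `greedy_pick` are hypothetical helper names.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy_pick(probs):
    """Greedy decoding: take the single most probable token."""
    return max(probs, key=probs.get)

# Hypothetical logits for a 4-token toy vocabulary (not real model output).
toy_logits = {" kube": 2.0, " scheduler": 1.25, " Kubernetes": 0.9, " node": 0.1}

probs = softmax(toy_logits)
next_token = greedy_pick(probs)

# Print the distribution from most to least likely, then the chosen token.
for tok, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{tok!r:14} {p:.1%}")
print("greedy choice:", repr(next_token))
```

A real model does the same thing over a vocabulary of many thousands of tokens; only the size changes, not the idea.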
Real output from your Qwen3.6 for the prompt "Kubernetes pods are scheduled by the". The model is not
"certain" the answer is " kube": that token gets about 39% of the distribution, heading toward "kube-scheduler".
Always picking the top token is "greedy" decoding (more in Lesson 6).
2 · Training vs inference — two different jobs
Training teaches the model to predict next tokens by adjusting the weights over
mountains of text — done once, on huge clusters, at enormous cost. Inference
takes those frozen weights and runs your prompt through them to generate — that's what happens on
every API call. This whole course is about inference.[1]
Pantry: training is writing the cookbook by tasting millions of dishes
(once). Inference is a line cook using that finished cookbook to plate orders all day.
Training builds the weights once. Inference runs them on your prompt, over and over —
that's the part you operate, pay for, and optimize.
For your SRE brain: the compile vs run bridge
If it helps: training is compiling the binary — slow, done once, produces an
artifact. The weights are that compiled artifact sitting on disk. Inference is running the
binary to serve traffic, and "loading the model" is just loading that artifact into GPU memory.
Everything in this course is about making the run step fast and cheap.
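The split can be sketched with a deliberately tiny stand-in for a model: a word-pair counter. This is not how an LLM is trained (real training adjusts billions of weights by gradient descent); it only illustrates the division of labour, where `train` builds an artifact once and `infer` merely reads it. The corpus and both function names are hypothetical.

```python
from collections import Counter, defaultdict

def train(corpus):
    """'Training': build the weights once by counting which word follows which."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # Freeze the result as plain dicts of next-word probabilities.
    weights = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        weights[prev] = {w: c / total for w, c in followers.items()}
    return weights

def infer(weights, last_word):
    """'Inference': look up the frozen weights; nothing is learned here."""
    return weights.get(last_word, {})

# Made-up three-sentence corpus, standing in for "mountains of text".
toy_corpus = [
    "pods are scheduled by the scheduler",
    "pods are placed by the scheduler",
    "jobs are run by the controller",
]

weights = train(toy_corpus)      # done once, the expensive part
print(infer(weights, "the"))     # done on every request, the part you operate
print(infer(weights, "are"))
```

Calling `infer` a thousand times never changes `weights`, which is the property the compile-vs-run comparison relies on.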
3 · Every request is a three-stage pipeline (your course map)
Inference always follows the same path. Keep this diagram in mind: each box is a future lesson.[1]
The map of the whole course. Today is the loop itself; later lessons open each box and
then make the loop fast.
4 · It's a loop — and that's where the cost hides
Chain those stages and you get autoregressive generation: predict a token, append it,
feed the longer sequence back in, predict the next. Our example continues
… the → kube → -scheduler → …. The catch — and the reason the rest of this course exists —
is that, in this plain form, every step re-reads the entire sequence so far, so each token costs a little more than
the last.[2] You'll measure that "staircase" in Lesson 7.
Pantry: after every bite the cook re-reads the entire order
from the top before deciding the next bite. Short orders are quick; long ones drag.
Generation is just this loop. The whole optimization story — KV cache, batching, the
engine — exists to flatten that rising staircase.
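The loop and its rising cost can be sketched in a few lines. The "model" here is a hypothetical lookup table (`toy_next_token`) that knows only the example continuation, and `<stop>` is an assumed stand-in for a stop token; the point is the shape of the loop and the per-step count of tokens read when nothing is cached.

```python
def toy_next_token(sequence):
    """Hypothetical stand-in for a model: it looks only at the last token."""
    table = {"the": "kube", "kube": "-scheduler", "-scheduler": "<stop>"}
    return table.get(sequence[-1], "<stop>")

def generate(prompt_tokens, max_new_tokens=8):
    """Autoregressive loop: predict, append, feed the longer sequence back in."""
    sequence = list(prompt_tokens)
    tokens_read_per_step = []
    for _ in range(max_new_tokens):
        # Without a cache, each step processes the whole sequence so far.
        tokens_read_per_step.append(len(sequence))
        nxt = toy_next_token(sequence)
        if nxt == "<stop>":
            break
        sequence.append(nxt)
    return sequence, tokens_read_per_step

prompt = ["Kubernetes", "pods", "are", "scheduled", "by", "the"]
sequence, cost = generate(prompt)
print(sequence)
print(cost)  # one entry per step; each is larger than the one before
```

The growing numbers in `cost` are the staircase: the longer the sequence, the more each new token has to read.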
On YOUR cluster — this is literally what's running live
Qwen3.6-27B-FP8 on your 4×H100 is an LLM doing exactly this loop. The
bar chart above is its real next-token distribution for "Kubernetes pods are scheduled by
the" — top guess " kube" (38.9%), then " scheduler" (18.4%): it has learned that pods are
placed by the kube-scheduler, and expresses it as a probability, not a fact.
See the raw distribution yourself — one token, top-8 alternatives:
curl localhost:8000/v1/completions -d '{"model":"…","prompt":"…","max_tokens":1,"logprobs":8}'.
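If you would rather make the call from Python, the sketch below builds the same JSON body and turns the returned log-probabilities into percentages. Two assumptions to check against your own server: that the response follows the OpenAI-style completions shape (`choices[0].logprobs.top_logprobs`), and that the endpoint accepts a JSON `Content-Type` header. The model name is elided in the command above, so it is left as a parameter here. `sample` is hand-written in the assumed shape using the two percentages quoted in this lesson; it is not captured server output.

```python
import json
import math
import urllib.request

def build_request(model, prompt, url="http://localhost:8000/v1/completions"):
    """Build a one-token, top-8 request with the same JSON body as the curl call."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 1, "logprobs": 8}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def top_probabilities(response):
    """Convert the first token's top log-probabilities into plain probabilities."""
    # Assumed shape: a list with one {token: logprob} dict per generated token.
    top_logprobs = response["choices"][0]["logprobs"]["top_logprobs"][0]
    probs = {tok: math.exp(lp) for tok, lp in top_logprobs.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

def fetch_distribution(model, prompt):
    """Send the request to the live server; needs the endpoint to be reachable."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return top_probabilities(json.load(resp))

# Hand-written sample in the assumed response shape (not captured server output).
sample = {"choices": [{"logprobs": {"top_logprobs": [
    {" kube": math.log(0.389), " scheduler": math.log(0.184)}
]}}]}

for tok, p in top_probabilities(sample):
    print(f"{tok!r:14} {p:.1%}")
```

On the cluster you would call `fetch_distribution` with your model's name and a prompt; the parsing step is kept separate so it can be tried without a running server.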
Read/watch this next (primary source): But what is a GPT?, by 3Blue1Brown (the best
visual intro to next-token prediction). Runnable companion:
the day01 notebook, which builds this loop on GPT-2 from scratch.
Check yourself (recall, don't peek)
Picture the cook, the distribution bars, and the loop, then answer from memory.