Inference Engineering · Lesson 1 · What Is an LLM?

What Is an LLM?

The ground floor — worked out one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict what your model actually does on every call (predict one token, append, repeat), and why training and inference are different jobs — the base every later lesson builds on.

The setup

You send a prompt to your LLM and get a paragraph back. But what does the model produce first, in a single step? Let's build the answer.

Step 1 — how is the answer produced?

Step 2 — what's the raw output of one step?

Recall — cover the screen: what is an LLM, in one line?
A next-token predictor: given the text so far, it outputs a probability distribution over its whole vocabulary for the next token; one token is picked, appended, and the process repeats. The model only ever predicts one token at a time.

Step 3 — how confident is it? (real cluster data)

Your Qwen3.6, given "Kubernetes pods are scheduled by the", makes its top guess " kube" at 38.9%.

Step 4 — training vs inference

Step 5 — the hidden cost

Recall — say it: training vs inference — which is which?
Training learns the weights from oceans of text — once, hugely expensive (like compiling a binary). Inference runs those frozen weights on your prompt — every request (like running the binary). This course is about inference.

On YOUR cluster (live: Qwen3.6)

Qwen3.6-27B-FP8 on your 4×H100 is exactly this loop. Real next-token distribution for "Kubernetes pods are scheduled by the": " kube" 38.9% · " scheduler" 18.4% · " Kubernetes" 12.6%. It learned that pods are placed by the kube-scheduler, and it expresses that as a probability, not a fact. See it raw: curl localhost:8000/v1/completions … "max_tokens":1,"logprobs":8.

Watch this next (primary source): But what is a GPT?, by 3Blue1Brown. Runnable companion: day01 notebook.

Final check — teach it back

Explain to a colleague: "When we call our LLM, under the hood it…"
…runs our prompt through the frozen weights to get a probability distribution over the next token, picks one, appends it, and feeds the longer sequence back in, looping one token at a time until it emits a stop token. It never plans the whole answer; it just predicts the next token, repeatedly.
References
  1. day01 — What Is LLM Inference? (Inference Engineering Ch 2.2, Philip Kiely); 3Blue1Brown — But what is a GPT?; runnable: day01 notebook.

What Is an LLM?

Before any optimization: what your model actually does on every single call.

Today's win: you'll explain, from memory, what an LLM does when you call it — predict one token, append it, repeat — and why "training" and "inference" are two completely different jobs. This is the ground floor the whole course is built on.

The picture: a cook with a memorized cookbook

Extend the Faraway Pantry: the model is a cook who has internalized one enormous cookbook — billions of numbers (the weights) learned by tasting a huge slice of the internet. To answer you, the cook doesn't plan the whole dish; it predicts the next step, one at a time, the way a recipe unfolds.

writing the cookbook (once, costly) → training: learn the weights
cooking orders from the finished cookbook → inference: what every API call does
predicting the next step of the recipe → next-token prediction

1 · An LLM predicts the next token — nothing more

This is the whole secret. An LLM does not compose a reply and write it out. Given the text so far, it produces a probability for every token in its vocabulary (a guess at what comes next) and one token is chosen. That's a single step. [1]

Pantry: the cook glances at the order ticket so far and predicts the single most likely next ingredient — not the finished plate.
Prompt: "Kubernetes pods are scheduled by the"
The model → a probability for every token in the vocabulary (top 6 shown):
  " kube"        38.9%  (picked, greedy)
  " scheduler"   18.4%
  " Kubernetes"  12.6%
  " control"      3.2%
  " cluster"      2.3%
  " controller"   1.9%
Real output from your Qwen3.6 for this prompt. It's not "certain" the answer is " kube" — it's a distribution (39%), heading toward "kube-scheduler". Picking the top one is "greedy" decoding (more in Lesson 6).
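A minimal sketch of that single step, assuming the model's raw outputs are scores (logits) that a softmax normalizes into probabilities. The logits below are invented toy values for a six-token vocabulary, not Qwen3.6's real numbers, and `softmax` and `greedy_pick` are illustrative helper names.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy_pick(probs):
    """Greedy decoding: choose the single most probable token."""
    return max(probs, key=probs.get)

# Toy logits (made up for illustration) over a tiny 6-token vocabulary.
toy_logits = {" kube": 2.0, " scheduler": 1.25, " Kubernetes": 0.9,
              " control": -0.5, " cluster": -0.8, " controller": -1.0}

probs = softmax(toy_logits)
next_token = greedy_pick(probs)
print(repr(next_token), round(probs[next_token], 3))
```

The point of the sketch is that the "answer" at each step is the whole `probs` dictionary; picking `next_token` is a separate, very small decision made on top of it.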

2 · Training vs inference — two different jobs

Training teaches the model to predict next tokens by adjusting the weights over mountains of text: done once, on huge clusters, at enormous cost. Inference takes those frozen weights and runs your prompt through them to generate text; that's what happens on every API call. This whole course is about inference. [1]

Pantry: training is writing the cookbook by tasting millions of dishes (once). Inference is a line cook using that finished cookbook to plate orders all day.
TRAINING (once, very expensive): oceans of text → learn → the weights
INFERENCE (every request): the weights + your prompt → run → next token → repeat
Training builds the weights once. Inference runs them on your prompt, over and over — that's the part you operate, pay for, and optimize.
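To make the split concrete, here is a toy sketch in which a bigram counter stands in for the model. This is a deliberate swap: real LLM training adjusts neural-network weights by gradient descent, not by counting, and the names `train` and `infer` are hypothetical. What the sketch does share with the real thing is the shape of the two jobs: one pass that builds an artifact, and many cheap calls that only read it.

```python
from collections import Counter, defaultdict

def train(corpus):
    """'Training': read the text once and store next-token counts as the weights."""
    counts = defaultdict(Counter)
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {prev: dict(c) for prev, c in counts.items()}  # the frozen artifact

def infer(weights, prev_token):
    """'Inference': read the frozen weights to predict; nothing is updated."""
    counts = weights.get(prev_token, {})
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()} if total else {}

# Done once (the expensive part for a real model).
weights = train("pods are scheduled by the scheduler and pods are placed by the scheduler")

# Done on every request, many times, against the same weights.
print(infer(weights, "the"))
print(infer(weights, "are"))
```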

For your SRE brain: compile vs run

If it helps: training is compiling the binary — slow, done once, produces an artifact. The weights are that compiled artifact sitting on disk. Inference is running the binary to serve traffic, and "loading the model" is just loading that artifact into GPU memory. Everything in this course is about making the run step fast and cheap.

3 · Every request is a three-stage pipeline (your course map)

Inference always follows the same path. Hold this diagram; each box is a future lesson. [1]

text in → tokenize (Lesson 3) → forward pass (Lessons 4–6) → decode / pick (Lesson 6) → +1 token
append & repeat: the loop (Lesson 7)
making this loop fast = the inference engine (Lesson 8) and everything after
The map of the whole course. Today is the loop itself; later lessons open each box and then make the loop fast.
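The three stages can be sketched as three small functions. Everything here is a toy stand-in chosen for illustration: the seven-word `VOCAB`, the whitespace tokenizer, and especially `forward_pass`, which replaces the neural network with a fixed rule so that the data flow (text → token ids → scores → one chosen id) is easy to follow.

```python
# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries.
VOCAB = ["<eos>", "pods", "are", "scheduled", "by", "the", "scheduler"]
TOKEN_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(text):
    """Stage 1: text -> token ids (whitespace split stands in for a real tokenizer)."""
    return [TOKEN_ID[word] for word in text.split()]

def forward_pass(token_ids):
    """Stage 2: token ids -> one score per vocabulary entry.
    Toy rule standing in for the network: favor the id after the last one."""
    favored = (token_ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == favored else 0.0 for i in range(len(VOCAB))]

def decode(scores):
    """Stage 3: scores -> one chosen token id (greedy: highest score)."""
    return max(range(len(scores)), key=scores.__getitem__)

ids = tokenize("pods are scheduled by the")
next_id = decode(forward_pass(ids))
print(ids, "->", VOCAB[next_id])  # [1, 2, 3, 4, 5] -> scheduler
```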

4 · It's a loop — and that's where the cost hides

Chain those stages and you get autoregressive generation: predict a token, append it, feed the longer sequence back in, predict the next. Our example continues … the → kube → -scheduler → …. The catch, and the reason the rest of this course exists, is that in the plain loop every step re-reads the entire sequence so far, so each token costs a little more than the last. [2] You'll measure that "staircase" in Lesson 7.

Pantry: after every bite the cook re-reads the entire order from the top before deciding the next bite. Short orders are quick; long ones drag.
each step appends one token, then re-reads everything:
  step 1: Kubernetes pods … the → kube
  step 2: … the kube → -scheduler
  step 3: … the kube-scheduler → ,
cost per step rises: longer sequence → slower step (Lesson 7)
Generation is just this loop. The whole optimization story — KV cache, batching, the engine — exists to flatten that rising staircase.
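A minimal sketch of the naive loop, with a counter that records how many tokens each step has to read. The model is replaced by `toy_predict_next`, a hypothetical stand-in that simply replays the continuation from the example; `generate`, `stop_token`, and `max_new_tokens` are illustrative names, not any engine's API.

```python
def generate(prompt_tokens, predict_next, stop_token="<eos>", max_new_tokens=16):
    """Naive autoregressive loop: predict, append, repeat.
    Returns the full sequence and how many tokens each step had to read."""
    sequence = list(prompt_tokens)
    tokens_read_per_step = []
    for _ in range(max_new_tokens):
        tokens_read_per_step.append(len(sequence))  # the whole sequence is re-read
        next_token = predict_next(sequence)
        if next_token == stop_token:
            break
        sequence.append(next_token)
    return sequence, tokens_read_per_step

# Hypothetical stand-in for the model: replays a fixed continuation.
CONTINUATION = ["kube", "-scheduler", ",", "<eos>"]

def toy_predict_next(sequence, prompt_len=6):
    return CONTINUATION[len(sequence) - prompt_len]

prompt = ["Kubernetes", "pods", "are", "scheduled", "by", "the"]
sequence, cost = generate(prompt, toy_predict_next)
print(sequence[len(prompt):])  # ['kube', '-scheduler', ',']
print(cost)                    # [6, 7, 8, 9]
```

The rising numbers in `cost` are the staircase: each step reads one more token than the step before it.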

On YOUR cluster — this is literally what's running live

Qwen3.6-27B-FP8 on your 4×H100 is an LLM doing exactly this loop. The bar chart above is its real next-token distribution for "Kubernetes pods are scheduled by the" — top guess " kube" (38.9%), then " scheduler" (18.4%): it has learned that pods are placed by the kube-scheduler, and expresses it as a probability, not a fact.

See the raw distribution yourself (one token, top-8 alternatives): curl localhost:8000/v1/completions -d '{"model":"…","prompt":"…","max_tokens":1,"logprobs":8}'.
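If you want to post-process what that call returns, the sketch below may help. It assumes the server replies in the OpenAI-style legacy completions shape, where `choices[0].logprobs.top_logprobs[0]` maps each candidate token to a log-probability; the lesson does not name the serving software, so check your server's actual response before relying on this. `build_payload` and `top_probs` are hypothetical helper names, and the sample reply is assembled by hand from the percentages quoted above, not captured from a server.

```python
import json
import math

def build_payload(model, prompt, top_k=8):
    """Request body matching the curl call: one new token, top_k alternatives."""
    return json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": 1, "logprobs": top_k})

def top_probs(response):
    """Convert the first generated position's log-probabilities to percentages."""
    top = response["choices"][0]["logprobs"]["top_logprobs"][0]
    ranked = sorted(top.items(), key=lambda kv: kv[1], reverse=True)
    return [(tok, round(100 * math.exp(lp), 1)) for tok, lp in ranked]

# Hand-built sample in the assumed response shape, trimmed to three entries.
sample = {"choices": [{"logprobs": {"top_logprobs": [{
    " kube": math.log(0.389),
    " scheduler": math.log(0.184),
    " Kubernetes": math.log(0.126),
}]}}]}

for token, pct in top_probs(sample):
    print(repr(token), pct)
```

Servers report log-probabilities, not percentages, which is why the helper applies `math.exp` before formatting.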

Read/watch this next (primary source): But what is a GPT?, by 3Blue1Brown (the best visual intro to next-token prediction). Runnable companion: day01 notebook, which builds this loop on GPT-2 from scratch.

Check yourself (recall, don't peek)

Picture the cook, the distribution bars, and the loop, then answer from memory.

References
  1. What Is LLM Inference? — day01 (Inference Engineering Ch 2.2, Philip Kiely); 3Blue1Brown — But what is a GPT?
  2. Autoregressive generation & the per-step cost — day01 notebook (llm-inference-mechanics.ipynb).