Inference Engineering · Lesson 2 · What's Inside a Model
What's Inside a Model
Open the box — one guess at a time.
Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict what's actually in a model file (named arrays
of numbers), how it's organized (one block, repeated), and where the parameters live.
The setup
A "27B model" is a file on disk. Before you peek: what do you think is in it?
Step 1 — what are the parameters?
Step 2 — why does SafeTensors load so fast?
Recall (cover the screen): what is a model file, concretely? A header (an index: each tensor's name, shape, byte offset) plus a big blob of raw floats. GPT-2 = 148 named tensors / 124M numbers. "Loading the model" = mapping that blob into GPU memory.
Step 3 — where do the parameters live?
Step 4 — scaling up to your 27B
Step 5 — how big in FP8?
Recall (say it): the structure of a transformer, top to bottom. Embedding table → the same transformer block stacked N times (each = attention Q/K/V/O + a big FFN + layer norms) → final norm + output head. Most parameters sit in the FFN and attention of those repeated blocks.
On YOUR cluster from config.json
Qwen3.6-27B real shape: 64 layers · hidden 5,120 · FFN 17,408 ·
vocab 248,320 · 24 query / 4 KV heads (head_dim 256). FP8 → ~27 GB (about 25 GiB), fits one H100.
(It's an advanced hybrid, multimodal variant: same building blocks, taught in their clean form here.)
Explain to a colleague: "A model file is basically…" …an index plus a blob of numbers: named tensors (arrays of floats) memory-mapped into the GPU, organized as an embedding table + one transformer block repeated N times + an output head. No code, no magic, just measurements.
Open the box: a "model" is just named arrays of numbers in a file.
Today's win: you'll see that a model is nothing magical — it's a file of named
tensors (arrays of floats), organized into one repeating transformer block — and you'll know
where the billions of parameters actually live.
The picture: the cookbook is just pages of numbers
From Lesson 1, the
weights are the cook's cookbook. Open it and there's no prose — just measurements: page
after page of numbers. A model file is that binder: a short index (which numbers are
where) plus the pages of raw floats.
a page of measurements = a tensor (one named array of floats)
the binder's index = the file header (name, shape, byte offset)
one recipe's section, repeated = a transformer block (repeated N times)
1 · A model is a file of named tensors
Load GPT-2 (124M params — small enough to run on a laptop) and inspect the file: it's
148 tensors [1]. Each is a named array with a
shape and raw float values — e.g. the token-embedding table wte.weight with shape
[50257, 768]. That's it. "Loading a model" means reading these arrays into GPU memory.
Pantry: the binder lists each page — "page wte: a 50257×768 grid
of numbers" — then the pages themselves follow. No magic, just measurements.
A model file = a header (names, shapes, offsets) + a blob of floats. "Loading the model" is
just mapping that blob into GPU memory.
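The header-plus-blob layout can be sketched with the standard library alone. This is a minimal illustration, assuming the publicly documented SafeTensors layout (an 8-byte little-endian header length, a JSON header, then the raw bytes); the helper names write_tiny_file, read_header and read_tensor are hypothetical, the two toy tensors are made up, and the optional __metadata__ header entry is not handled.

```python
import json
import mmap
import os
import struct
import tempfile


def write_tiny_file(path, tensors):
    """Write {name: (shape, values)} as header length + JSON header + float32 blob."""
    header, blob = {}, b""
    for name, (shape, values) in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
        header[name] = {
            "dtype": "F32",
            "shape": shape,
            # Offsets are relative to the start of the blob, not the file.
            "data_offsets": [len(blob), len(blob) + len(raw)],
        }
        blob += raw
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)
        f.write(blob)


def read_header(path):
    """Return the parsed JSON index and the file position where the blob starts."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header, 8 + header_len


def read_tensor(path, name):
    """Memory-map the file and decode only the bytes that belong to one tensor."""
    header, blob_start = read_header(path)
    begin, end = header[name]["data_offsets"]
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        raw = mm[blob_start + begin : blob_start + end]
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Toy stand-ins for real tensors (values chosen to be exact in float32).
toy = {
    "wte.weight": ([2, 3], [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]),
    "ln_f.bias": ([3], [0.0, 0.25, -0.25]),
}
path = os.path.join(tempfile.mkdtemp(), "toy.safetensors")
write_tiny_file(path, toy)

header, blob_start = read_header(path)
for name, info in header.items():
    print(name, info["shape"], info["data_offsets"])
print(read_tensor(path, "ln_f.bias"))
```

Reading one tensor touches only its byte range, which is the property that makes the format cheap to map. On a real checkpoint you would normally use the safetensors library rather than hand-rolled parsing.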
For your SRE brain: SafeTensors vs pickle bridge
The modern format is SafeTensors: a JSON header + a contiguous byte blob, so
the file can be memory-mapped and handed to the GPU with no deserialization step
(load a 7B in ~2–3s vs 10s+ for a Python .pt pickle). And it's safe: a .pt pickle
can execute arbitrary code on load (a real supply-chain risk), while SafeTensors is
data-only. Think "flat mmap-able artifact" vs "deserialize an untrusted object graph."
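Why "can execute arbitrary code on load" is literal can be shown in a few harmless lines. This is a sketch of the general pickle mechanism, not of any particular checkpoint: the Payload class is hypothetical, and the "attack" here only evaluates 6 * 7. The JSON line stands in for a data-only header, which can only ever produce plain values.

```python
import json
import pickle


class Payload:
    """Hypothetical stand-in for a hostile object hidden in a pickled checkpoint."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" and replays that call on load.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())
loaded = pickle.loads(blob)  # the call runs here, during deserialization
print(loaded)                # the result of the call, not a Payload instance

# A data-only format: parsing yields dicts, lists, strings and numbers, nothing callable.
header = json.loads('{"wte.weight": {"dtype": "F32", "shape": [2, 3]}}')
print(type(header).__name__, header["wte.weight"]["shape"])
```

Swap the harmless expression for a shell command and you have the supply-chain risk described above, which is why untrusted .pt files should not be unpickled.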
2 · The parameters live in one repeating block
The tensors aren't a random pile — they're organized. A thin embedding table at the
input, then the same transformer block stacked N times (GPT-2: 12; your Qwen: 64), then
a final norm + output head. Each block holds the attention
projections (Q/K/V/O) and a big feed-forward network (FFN), with layer norms [1].
Pantry: the cookbook isn't one long ramble — it's the same recipe
template repeated, each copy refining the dish a little more.
Embeddings in, N identical blocks, head out. Most parameters sit in the FFN and attention of
those repeated blocks — which is exactly what quantization and parallelism act on.
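The 148 tensors and 124M numbers can be reproduced by arithmetic alone. The sketch below assumes GPT-2 small's published dimensions (context length 1024 and FFN width 4 × 768, neither of which is stated in this lesson), tensor names in the style of wte.weight, and an output head that shares the embedding table and so adds no tensor of its own. Some exported files also store extra non-parameter buffers, which would raise the tensor count.

```python
VOCAB, CTX, HIDDEN, LAYERS = 50257, 1024, 768, 12
FFN = 4 * HIDDEN  # assumed GPT-2 small FFN width (3072)


def linear(n_in, n_out):
    """Parameter counts for a dense layer: a weight matrix plus a bias vector."""
    return {"weight": n_in * n_out, "bias": n_out}


def layer_norm(n):
    """Parameter counts for a layer norm: a scale and a shift per channel."""
    return {"weight": n, "bias": n}


# One transformer block: attention (fused Q/K/V, then output projection) and FFN.
block = {
    "ln_1": layer_norm(HIDDEN),
    "attn.c_attn": linear(HIDDEN, 3 * HIDDEN),
    "attn.c_proj": linear(HIDDEN, HIDDEN),
    "ln_2": layer_norm(HIDDEN),
    "mlp.c_fc": linear(HIDDEN, FFN),
    "mlp.c_proj": linear(FFN, HIDDEN),
}

# Embeddings in, the same block repeated LAYERS times, final norm out.
tensors = {"wte.weight": VOCAB * HIDDEN, "wpe.weight": CTX * HIDDEN}
for i in range(LAYERS):
    for module, parts in block.items():
        for part, count in parts.items():
            tensors[f"h.{i}.{module}.{part}"] = count
for part, count in layer_norm(HIDDEN).items():
    tensors[f"ln_f.{part}"] = count

total = sum(tensors.values())
in_blocks = sum(n for name, n in tensors.items() if name.startswith("h."))
ffn = sum(n for name, n in tensors.items() if ".mlp." in name)
attn = sum(n for name, n in tensors.items() if ".attn." in name)

print(len(tensors), "tensors,", f"{total:,}", "parameters")
print(f"repeated blocks: {in_blocks / total:.1%} of all parameters")
print(f"FFN: {ffn:,}   attention: {attn:,}")
```

Under these assumptions roughly two thirds of the parameters sit in the twelve repeated blocks, and the FFN holds about twice as many as attention.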
3 · Scale it up: GPT-2 → your 27B
Your production model uses the same building blocks, just wider and deeper. Nothing new to
learn: more layers, bigger tensors [2].
Same architecture, bigger numbers. Once you can read GPT-2's file, you can read any of them.
On YOUR cluster — Qwen3.6's real shape from its config
Straight from the model's config.json: 64 layers,
hidden size 5,120, FFN 17,408, vocab 248,320, 24 query / 4 KV attention heads
(head_dim 256). At FP8 (1 byte/param) the 27B weights are ~27 GB (about 25 GiB), which is why
they fit on one 94 GiB H100 (no tensor-parallel split; Lesson 19).
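The sizing arithmetic is easy to check by hand. This is a back-of-envelope sketch using the shape quoted above and a nominal 27 × 10^9 parameters; the dictionary keys are illustrative names, not the real config.json field names, and an actual checkpoint can differ if the true parameter count is not exactly 27B or if some tensors are stored at higher precision.

```python
# Shape quoted in the lesson; key names are hypothetical, not real config.json keys.
config = {
    "num_layers": 64,
    "hidden": 5120,
    "ffn": 17408,
    "vocab": 248320,
    "q_heads": 24,
    "kv_heads": 4,
    "head_dim": 256,
}

NOMINAL_PARAMS = 27_000_000_000  # "27B", taken at face value
BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP8": 1}

# The embedding table alone: one row of `hidden` numbers per vocabulary entry.
embedding_params = config["vocab"] * config["hidden"]

# Width of the query projection vs. the shared key/value projections.
q_width = config["q_heads"] * config["head_dim"]
kv_width = config["kv_heads"] * config["head_dim"]


def footprint(params, dtype):
    """Return (decimal GB, binary GiB) for `params` weights stored as `dtype`."""
    n_bytes = params * BYTES_PER_PARAM[dtype]
    return n_bytes / 1e9, n_bytes / 2**30


print(f"embedding table: {embedding_params:,} parameters")
print(f"query width {q_width}, key/value width {kv_width}")
for dtype in BYTES_PER_PARAM:
    gb, gib = footprint(NOMINAL_PARAMS, dtype)
    print(f"{dtype}: {gb:.0f} GB = {gib:.1f} GiB")
```

Note the GB versus GiB gap: at one byte per parameter the nominal figure is 27 GB, which is about 25 GiB. Either way it sits well under a single 94 GiB card, before counting the KV cache and activations.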
One honest nuance: Qwen3.6 is an advanced variant — a hybrid that mixes
linear-attention layers with full attention every 4th layer, and it's multimodal (vision tokens). The
building blocks here are still the foundation; we teach the clean version first.