Inference Engineering · Lesson 2 · What's Inside a Model

What's Inside a Model

Open the box — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict what's actually in a model file (named arrays of numbers), how it's organized (one block, repeated), and where the parameters live.

The setup

A "27B model" is a file on disk. Before you peek: what do you think is in it?

Step 1 — what are the parameters?

Step 2 — why does SafeTensors load so fast?

Recall — cover the screen: what is a model file, concretely?
A header (an index: each tensor's name, shape, byte offset) plus a big blob of raw floats. GPT-2 = 148 named tensors / 124M numbers. "Loading the model" = mapping that blob into GPU memory.

Step 3 — where do the parameters live?

Step 4 — scaling up to your 27B

Step 5 — how big in FP8?

Recall — say it: the structure of a transformer, top to bottom.
Embedding table → the same transformer block stacked N times (each = attention Q/K/V/O + a big FFN + layer norms) → final norm + output head. Most parameters sit in the FFN and attention of those repeated blocks.

On YOUR cluster from config.json

Qwen3.6-27B real shape: 64 layers · hidden 5,120 · FFN 17,408 · vocab 248,320 · 24 query / 4 KV heads (head_dim 256). FP8 → ~27 GB of weights (about 25 GiB), fits one H100. (It's an advanced hybrid + multimodal variant: same building blocks, taught clean here.)

Read this next — primary source SafeTensors docs · runnable companion: day02 notebook.

Final check — teach it back

Explain to a colleague: "A model file is basically…"
…an index plus a blob of numbers: named tensors (arrays of floats) memory-mapped into the GPU, organized as an embedding table + one transformer block repeated N times + an output head. No code, no magic, just measurements.
References
  1. day02 — What's Inside a Model (notebook); SafeTensors docs; Qwen3.6 config.json.

What's Inside a Model

Open the box: a "model" is just named arrays of numbers in a file.

Today's win: you'll see that a model is nothing magical — it's a file of named tensors (arrays of floats), organized into one repeating transformer block — and you'll know where the billions of parameters actually live.

The picture: the cookbook is just pages of numbers

From Lesson 1, the weights are the cook's cookbook. Open it and there's no prose — just measurements: page after page of numbers. A model file is that binder: a short index (which numbers are where) plus the pages of raw floats.

a page of measurements → a tensor: one named array of floats
the binder's index → the file header: name → shape → byte offset
one recipe's section, repeated → a transformer block, repeated N times

1 · A model is a file of named tensors

Load GPT-2 (124M params, small enough to run on a laptop) and inspect the file: it's 148 tensors [1]. Each is a named array with a shape and raw float values, e.g. the token-embedding table wte.weight with shape [50257, 768]. That's it. "Loading a model" means reading these arrays into GPU memory.

Pantry: the binder lists each page — "page wte: a 50257×768 grid of numbers" — then the pages themselves follow. No magic, just measurements.
header: the index (JSON)
  wte.weight        f16  [50257,768]  @0
  wpe.weight        f16  [1024,768]   @…
  h.0.attn.c_attn   f16  [768,2304]   @…
  h.0.mlp.c_fc      f16  [768,3072]   @…
  … (148 tensors total)
raw float blob: the pages
  0.0123 -0.045 0.881 … (124M numbers)
mmap → GPU memory: weights, ready to run
A model file = a header (names, shapes, offsets) + a blob of floats. "Loading the model" is just mapping that blob into GPU memory.
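To make the header-plus-blob layout concrete, here is a minimal sketch that builds a tiny file in the SafeTensors layout in memory and reads it back with only the standard library. The layout (an 8-byte little-endian header length, a JSON index, then raw tensor bytes) follows the published SafeTensors format; the two toy tensors and the helper names `build_file`, `read_header`, and `read_tensor` are made up for illustration, and a real loader would use the `safetensors` library instead.

```python
import json
import struct

def build_file(tensors):
    """Pack {name: (shape, [floats])} into SafeTensors-style bytes (F32 only)."""
    header, blob = {}, b""
    for name, (shape, values) in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)
        header[name] = {
            "dtype": "F32",
            "shape": shape,
            # Offsets are relative to the start of the blob, not the file.
            "data_offsets": [len(blob), len(blob) + len(raw)],
        }
        blob += raw
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte little-endian header length, then the JSON index, then the blob.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + blob

def read_header(data):
    """Return (index, byte position where the blob starts)."""
    (header_len,) = struct.unpack("<Q", data[:8])
    index = json.loads(data[8 : 8 + header_len])
    return index, 8 + header_len

def read_tensor(data, name):
    """Slice one tensor's bytes out of the blob and decode them as float32."""
    index, blob_start = read_header(data)
    begin, end = index[name]["data_offsets"]
    raw = data[blob_start + begin : blob_start + end]
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Hypothetical toy "model": two tiny tensors instead of GPT-2's 148.
toy = build_file({
    "wte.weight": ([2, 3], [0.5, -1.0, 2.0, 0.25, 0.0, 4.0]),
    "ln_f.bias": ([3], [1.0, 1.5, -2.5]),
})
index, blob_start = read_header(toy)
for name, entry in index.items():
    print(name, entry["dtype"], entry["shape"], entry["data_offsets"])
print(read_tensor(toy, "ln_f.bias"))
```

Reading a tensor is only a slice plus a reinterpretation of bytes, which is why the format lends itself to memory-mapping: nothing has to be parsed beyond the small JSON index.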

For your SRE brain: SafeTensors vs pickle

The modern format is SafeTensors: a JSON header + a contiguous byte blob, so the file can be memory-mapped and each tensor copied to the GPU without a deserialization step (load a 7B in ~2–3s vs 10s+ for a Python .pt pickle). And it's safe: a .pt pickle can execute arbitrary code on load (a real supply-chain risk); SafeTensors is data-only. Think "flat mmap-able artifact" vs "deserialize an untrusted object graph."
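The "arbitrary code on load" point can be shown with a deliberately harmless sketch. Python's `pickle` lets an object name a callable to run when it is rebuilt, via `__reduce__`; here that callable is `eval` on a trivial expression. The class name `NotReallyWeights` is hypothetical, and nothing below touches a real checkpoint.

```python
import pickle

class NotReallyWeights:
    """Hypothetical stand-in for an object hidden inside a .pt pickle."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        # A hostile file could name any importable callable here instead.
        return (eval, ("6 * 7",))

payload = pickle.dumps(NotReallyWeights())

# The "loader" only asked for data, but unpickling runs the callable above.
result = pickle.loads(payload)
print(result)  # prints 42
```

The loader never gets a `NotReallyWeights` object back; it gets whatever the embedded call returned. A SafeTensors header has no equivalent hook, since it only describes names, dtypes, shapes, and offsets.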

2 · The parameters live in one repeating block

The tensors aren't a random pile; they're organized. A thin embedding table at the input, then the same transformer block stacked N times (GPT-2: 12; your Qwen: 64), then a final norm + output head. Each block holds the attention projections (Q/K/V/O) and a big feed-forward network (FFN), with layer norms [1].

Pantry: the cookbook isn't one long ramble — it's the same recipe template repeated, each copy refining the dish a little more.
embedding table (input)
transformer block × N
  attention (Q/K/V/O)
  feed-forward (FFN: up → down)
  + layer norms
final norm + output head
(stacked, bottom → top)

Where GPT-2's 124M parameters live (learned weights only, tied output head counted once): FFN ≈ 45.5% · attn ≈ 22.8% · embed ≈ 31.6%
The FFN is the biggest chunk; it's where most "knowledge" is stored. (Embeddings shrink as a % in bigger models.) Quantization (L16) shrinks every one of these numbers; pruning deletes the near-zero ones.
Embeddings in, N identical blocks, head out. Most parameters sit in the FFN and attention of those repeated blocks — which is exactly what quantization and parallelism act on.
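The split between embeddings, attention, and FFN can be recomputed from the shapes quoted in this lesson. The sketch below is plain arithmetic, not a read of the actual file, and it rests on a few assumptions: the output head shares the token-embedding table and is counted once, every projection has a bias, and non-learned buffers (such as a stored causal mask) are not counted. Counting any of those differently shifts the percentages.

```python
# GPT-2 small shapes, as quoted in this lesson.
vocab, positions, hidden, layers = 50257, 1024, 768, 12
ffn = 4 * hidden  # 3072, the width of mlp.c_fc

def linear(n_in, n_out):
    """Weights plus bias of one dense projection."""
    return n_in * n_out + n_out

embed = vocab * hidden + positions * hidden            # wte + wpe
attn = layers * (linear(hidden, 3 * hidden)            # c_attn: fused Q/K/V
                 + linear(hidden, hidden))             # c_proj: output O
mlp = layers * (linear(hidden, ffn) + linear(ffn, hidden))  # c_fc + c_proj
norms = layers * 2 * (2 * hidden) + 2 * hidden         # ln_1, ln_2 per block + ln_f

total = embed + attn + mlp + norms
for name, count in [("embed", embed), ("attn", attn), ("ffn", mlp), ("norms", norms)]:
    print(f"{name:6s} {count:>12,d}  {100 * count / total:5.1f}%")
print(f"total  {total:>12,d}")
```

Per block, the FFN holds roughly twice the weights of attention (about 8 × hidden² versus 4 × hidden²), which is why it dominates once the embedding table stops being a large share.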

3 · Scale it up: GPT-2 → your 27B

Your production model is the same building blocks, just wider and deeper. Nothing new to learn: more layers, bigger tensors [2].

               GPT-2 (teaching model)    your Qwen3.6-27B
layers         12                        64
hidden size    768                       5,120
vocab          50,257                    248,320
params         124M                      27,000M (27B)

~220× the parameters, same shapes
Same architecture, bigger numbers. Once you can read GPT-2's file, you can read any of them.

On YOUR cluster — Qwen3.6's real shape from its config

Straight from the model's config.json: 64 layers, hidden size 5,120, FFN 17,408, vocab 248,320, 24 query / 4 KV attention heads (head_dim 256). At FP8 (1 byte/param) the 27B weights are ~27 GB (about 25 GiB), which is why they fit on one 94 GiB H100 (no tensor-parallel split; Lesson 19).
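The weight footprint is parameter count times bytes per parameter. A small sketch of that arithmetic follows, assuming a nominal 27 × 10⁹ parameters (the exact count for the served model is not given here) and counting weights only, with no KV cache or activations. It prints both decimal GB and binary GiB, which differ by about 7%.

```python
def weight_bytes(n_params, bytes_per_param):
    """Bytes needed to hold the weights alone."""
    return n_params * bytes_per_param

def to_gb(n_bytes):
    """Decimal gigabytes: 10**9 bytes."""
    return n_bytes / 10**9

def to_gib(n_bytes):
    """Binary gibibytes: 2**30 bytes."""
    return n_bytes / 2**30

n_params = 27 * 10**9  # nominal "27B"; the exact count is an assumption
for fmt, width in [("FP32", 4), ("FP16/BF16", 2), ("FP8", 1)]:
    size = weight_bytes(n_params, width)
    print(f"{fmt:10s} {to_gb(size):6.1f} GB  {to_gib(size):6.1f} GiB")
```

Halving the bytes per parameter halves the footprint, which is the whole appeal of FP8 for fitting a model on a single GPU.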

One honest nuance: Qwen3.6 is an advanced variant. It's a hybrid that mixes linear-attention layers with full attention every 4th layer, and it's multimodal (vision tokens). The building blocks here are still the foundation; we teach the clean version first.

Read this next — primary source SafeTensors — Hugging Face docs. Runnable companion: day02 notebook — dissects GPT-2's actual weight file, tensor by tensor.

Check yourself (recall, don't peek)

References
  1. What's Inside a Model — day02 (02-a-whats-inside-a-model.ipynb); SafeTensors docs.
  2. Qwen3.6-27B config.json (served on the cluster).