Quantization — the other lever — predicted step by step.
Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict why fewer bits makes decode faster and the KV cache smaller, what the precision tradeoff really is, and which format to reach for.
The setup
A model is billions of numbers. Quantization stores each in fewer bits — FP16 (2 bytes) → FP8 (1 byte) → INT4 (½ byte) — rounding each onto a coarser grid.
Step 1 — why is it faster?
Step 2 — and the KV cache?
Recall (cover the screen): the two wins from quantization, tied to earlier lessons. Weights in fewer bits → fewer bytes hauled per token → faster decode (Lesson 11's memory wall). KV in fewer bits → smaller cache → more concurrency (Lesson 10).
Step 3 — what's the catch?
Step 4 — does it hurt quality?
Recall (say it): when do you reach for weight-only INT4 vs W8A8 (FP8/INT8)? INT4 weight-only when you're memory/bandwidth-starved at modest batch (it shrinks weights most). W8A8 (FP8/INT8) when you also want faster compute under heavy batching on big GPUs.
On YOUR cluster (real config)
Qwen3.6-27B-FP8 quantizes twice: FP8 weights (~27 GiB vs ~54 in FP16, which is why it fits on one H100 with room left for the KV cache, no TP needed) and --kv-cache-dtype fp8 (128 KiB/token vs 256). One nuance: at your model's head_dim 256, fp8 KV speeds decode but prefill can run slightly behind BF16, so it is worth measuring.
Explain to a colleague: "Quantization is basically free for us because…" …FP8 is ~lossless yet halves both the weights hauled per token (faster, memory-bound decode) and the KV cache (more concurrency), and modern methods keep the few outlier values precise, which is where quality would otherwise break.
I'm your teacher — ask me anything. Want the E4M3 vs E5M2 bit layout, or how AWQ picks its 1% salient weights?
The other lever — fewer bits per number — and what it costs.
Today's win: you'll explain what quantization shrinks, why fewer bits makes decode faster (the bytes lever) and the KV cache smaller (more concurrency), what the precision tradeoff really is (outliers), and which format to reach for, all running on your FP8 cluster.
Pantry recap: the weights and the KV cache are measurements written in the recipe binders and on the tasters' index cards. Quantization is writing each measurement with fewer significant figures: thinner pages, lighter to haul, more fit on the shelf.
1 · What quantization actually is
A model is billions of numbers. Store each in fewer bits (FP16 at 2 bytes → FP8 at 1 byte → INT4 at ½ byte) and you round each value onto a coarser grid. The model gets smaller and cheaper to move, at the cost of a little rounding error. [1]
Pantry: FP16 says "237.4 g"; FP8 says "≈ 240 g". Fewer digits per
measurement — the binder gets thinner and the van hauls less per trip.
Halve the bits, halve the footprint — and snap each number to a coarser grid. The whole
game is keeping that rounding error harmless.
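The "coarser grid" idea can be sketched in a few lines. This is a minimal illustration, not how any particular library does it: it uses a plain symmetric integer grid (INT8 and INT4) because that is the easiest to read, whereas FP8 is a floating-point grid with uneven spacing. The weight values and the helper names `quantize_symmetric`, `dequantize` and `worst_error` are made up for the example.

```python
def quantize_symmetric(values, bits):
    """Round each value onto a signed integer grid (round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(v) for v in values) / qmax    # width of one grid step
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def worst_error(values, bits):
    """Largest absolute rounding error after a quantize/dequantize round trip."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


# Illustrative weights only; real tensors have millions of entries.
weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88]

for bits in (8, 4):
    _, scale = quantize_symmetric(weights, bits)
    print(f"INT{bits}: bytes/value={bits / 8}  grid step={scale:.4f}  "
          f"worst error={worst_error(weights, bits):.4f}")
```

With round-to-nearest, the error per value is at most half a grid step, so going from 8 to 4 bits makes the worst case noticeably larger while halving the storage again.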
2 · Why it's a win — the two levers you already know
Quantization pays off in exactly the two places earlier lessons flagged [2]:
Weights → faster decode. Decode is memory-bound: its cost is the bytes of weights hauled per token (Lesson 11). FP8 weights are half the bytes of FP16, so decode throughput can rise by up to roughly 2×, and the model fits on fewer GPUs.
KV cache → more concurrency. An fp8 KV cache halves the bytes-per-element factor in Lesson 10's formula → ~half the cache per token → more sequences (or longer context) on the same GPU.
Pantry: thinner recipe binders (weights) → the van hauls less per
trip. Thinner index cards (KV) → more fit on the shelf. Same dish, less to carry and store.
Both levers attack bytes moved/stored — which Lesson 11 said is the real decode
bottleneck (not FLOPs). That's why quantization is the highest-leverage knob for decode.
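Both levers reduce to simple arithmetic, sketched below under stated assumptions. The decode bound assumes every weight is read once per generated token and ignores everything else (KV reads, kernel overheads), so it is a ceiling, not a prediction. The KV formula assumes Lesson 10's formula has the usual shape (K and V, per layer, per KV head, per head dimension). The bandwidth figure and the layer and KV-head counts are placeholders chosen for illustration, not values taken from the model card; only head_dim 256 and the 27B size come from the lesson.

```python
def decode_tokens_per_s_bound(n_params, bytes_per_weight, bandwidth_bytes_per_s):
    """Ceiling on single-sequence decode speed if weights are read once per token."""
    return bandwidth_bytes_per_s / (n_params * bytes_per_weight)


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies: K and V (factor 2) for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


N_PARAMS = 27e9                            # 27B parameters (from the lesson)
BANDWIDTH = 3.0e12                         # assumed memory bandwidth, bytes/s
LAYERS, KV_HEADS, HEAD_DIM = 64, 4, 256    # layers and KV heads are assumed

for name, nbytes in (("FP16", 2), ("FP8", 1)):
    bound = decode_tokens_per_s_bound(N_PARAMS, nbytes, BANDWIDTH)
    kv = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, nbytes)
    print(f"{name}: decode ceiling ~{bound:.0f} tok/s per sequence, "
          f"KV = {kv / 1024:.0f} KiB/token")
```

Halving the bytes per weight doubles the ceiling, and halving the bytes per KV element halves the per-token cache, which is the whole argument of this section in two functions.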
3 · The catch — precision, and the outlier problem
Fewer bits means coarser rounding, so quality can slip. But the difficulty isn't the millions of typical values, which round fine. It's a handful of outliers: a few unusually large values. Clip them and their error blows up the result; stretch the grid to cover them and the typical values lose their resolution. Round those carelessly and the model breaks. [3]
Pantry: most measurements survive a coarse cup. But a pinch of
saffron needs to be exact — round it to the nearest tablespoon and the dish is ruined. Smart
quantization keeps the saffron precise.
This is the reason naïve low-bit quantization fails, and why every good method is really a strategy for the outliers.
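One common way the outlier problem shows up can be reproduced with toy numbers. In the sketch below a single large value shares a 4-bit scale with eight typical values; the grid stretches to cover it and the typical values all collapse to zero. Keeping the outlier aside in full precision is a deliberately simplified stand-in for what real methods do (AWQ and SmoothQuant use different mechanisms); the values and helper names are invented for the example.

```python
def quantize_roundtrip(values, bits):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


typical = [0.11, -0.08, 0.05, -0.13, 0.09, -0.04, 0.12, -0.10]
outlier = 6.0  # the "pinch of saffron": rare, large, and important

# Naive: one shared 4-bit scale, stretched to cover the outlier.
naive = quantize_roundtrip(typical + [outlier], bits=4)
naive_err = mean_abs_error(typical, naive[:len(typical)])

# Outlier-aware (simplified): keep the outlier in full precision and
# quantize the typical values on their own, much finer, 4-bit grid.
protected = quantize_roundtrip(typical, bits=4)
protected_err = mean_abs_error(typical, protected)

print("naive, typical values restored as:", naive[:len(typical)])
print(f"naive mean error:     {naive_err:.4f}")
print(f"protected mean error: {protected_err:.4f}")
```

The naive run wipes out every typical value, while the protected run keeps them within half a grid step of the original at the same 4 bits per value.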
4 · The format map — what to reach for
| Format / method | Bits | Quantizes | Quality | When |
| --- | --- | --- | --- | --- |
| FP8 (E4M3), W8A8 | 8 | weights + activations + KV | ≈ lossless | high-end GPUs (H100+), i.e. your cluster |
| INT8 + SmoothQuant | 8 | weights + activations | ~1–3% | broad hardware, integer tensor cores |
| INT4 (GPTQ / AWQ), W4A16 | 4 (weights) | weights only | small loss (protect salient) | memory-bound / smaller GPUs |
| FP4 / NVFP4 | 4 | weights (+) | frontier, more care | newest Blackwell-class HW |
Rule of thumb: weight-only (INT4) when you're memory- or bandwidth-starved and running modest batches; W8A8 (FP8/INT8) when you also want faster compute under heavy continuous batching on big GPUs. [1]
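The rule of thumb can be made a little more concrete with a roofline-style check. This is a rough sketch, assuming decode does about 2 FLOPs per weight per sequence in the batch and reads each weight once per step; it ignores attention, KV cache traffic and dequantization cost. The peak-FLOPs and bandwidth numbers are round placeholders, not measurements of any GPU.

```python
def decode_intensity(batch_size, bytes_per_weight):
    """FLOPs performed per byte of weights read in one decode step."""
    return 2 * batch_size / bytes_per_weight


def is_memory_bound(batch_size, bytes_per_weight, peak_flops, bandwidth):
    """True while the GPU waits on weight bytes rather than on math."""
    ridge = peak_flops / bandwidth   # FLOPs/byte where the two limits meet
    return decode_intensity(batch_size, bytes_per_weight) < ridge


PEAK_16BIT_FLOPS = 800e12   # assumed 16-bit compute peak, FLOPs/s
BANDWIDTH = 3.0e12          # assumed memory bandwidth, bytes/s

for batch in (1, 16, 64, 256):
    fp16 = is_memory_bound(batch, 2.0, PEAK_16BIT_FLOPS, BANDWIDTH)
    w4a16 = is_memory_bound(batch, 0.5, PEAK_16BIT_FLOPS, BANDWIDTH)
    print(f"batch {batch:>3}: FP16 weights memory-bound={fp16}, "
          f"INT4 weight-only memory-bound={w4a16}")
```

While the check says memory-bound, shrinking the weights is the thing that helps, which is the weight-only INT4 case. Once the batch is large enough that it flips to compute-bound, smaller weights stop paying off and a format that also speeds up the math (W8A8) becomes the better reach.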
In Kubernetes terms (infra bridge)
Quantization is resource-request compression: fewer bytes per weight (and per KV entry) means more requests (pods) fit on a GPU (node) and less bandwidth per token — like tightening pod memory requests so the scheduler packs more onto each node. FP8 is the sweet spot: half the footprint at ~no quality cost.
On YOUR cluster: quantization, twice over (real config)
Qwen3.6-27B-FP8 runs both kinds at once:
FP8 weights (E4M3): ~27 GiB instead of ~54 GiB in FP16, which is why a 27B model fits on one 94 GiB H100 with plenty of room left for the KV cache (no need for the tensor-parallel split from Lesson 19).
--kv-cache-dtype fp8: 1 byte/element → 128 KiB/token instead of 256
(the Lesson 10 number) → ~2× the concurrency.
And the H100's FP8 tensor cores hit ~2× the BF16 FLOPs (3,341 vs 1,671 peak TFLOPS, Lesson 11), so FP8 weights help compute-bound prefill too. One real subtlety: your model's head_dim is 256, where fp8 KV clearly speeds decode but prefill can run slightly behind BF16, so it is worth measuring with bash learning/tools/cluster-probe.sh.
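To see how the two FP8 choices compound on one GPU, here is a back-of-the-envelope budget using the figures quoted above (94 GiB, ~27 vs ~54 GiB of weights, 128 vs 256 KiB per token). The `usable_fraction` of 0.9 is an assumption standing in for whatever memory the serving stack actually hands to weights plus cache, and activations and other overheads are ignored, so treat the output as a rough ratio and measure the real thing on the cluster.

```python
GIB = 2 ** 30
KIB = 2 ** 10


def kv_token_capacity(gpu_gib, weights_gib, kv_kib_per_token, usable_fraction=0.9):
    """Tokens of KV cache that fit after the weights are loaded (rough estimate)."""
    free_bytes = (gpu_gib * usable_fraction - weights_gib) * GIB
    return int(free_bytes // (kv_kib_per_token * KIB))


configs = {
    "FP16 weights + 16-bit KV": (54, 256),
    "FP8 weights + 16-bit KV": (27, 256),
    "FP8 weights + fp8 KV": (27, 128),
}

for name, (weights_gib, kv_kib) in configs.items():
    tokens = kv_token_capacity(94, weights_gib, kv_kib)
    print(f"{name:<26} ~{tokens:,} cached tokens")
```

Under these assumptions the smaller weights free up cache space and the fp8 cache then doubles what fits in it, so the combined gain in cached tokens is well above the 2× from the KV setting alone.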
Picture the coarser grid and the saffron, then answer from memory.
I'm your teacher — ask me anything. Want the bit layout of E4M3 vs E5M2, how AWQ
picks its 1% salient weights, or whether to quantize your reranker/embedding models too? Just ask.