Inference Engineering · Lesson 16 · Quantization

Fewer Bits

Quantization — the other lever — predicted step by step.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict why fewer bits makes decode faster and the KV cache smaller, what the precision tradeoff really is, and which format to reach for.

The setup

A model is billions of numbers. Quantization stores each in fewer bits — FP16 (2 bytes) → FP8 (1 byte) → INT4 (½ byte) — rounding each onto a coarser grid.

Step 1 — why is it faster?

Step 2 — and the KV cache?

Recall — cover the screen: the two wins from quantization, tied to earlier lessons.
Weights in fewer bits → fewer bytes hauled per token → faster decode (Lesson 11's memory wall). KV in fewer bits → smaller cache → more concurrency (Lesson 10).

Step 3 — what's the catch?

Step 4 — does it hurt quality?

Recall — say it: when do you reach for weight-only INT4 vs W8A8 (FP8/INT8)?
INT4 weight-only when you're memory/bandwidth-starved at modest batch (it shrinks weights most). W8A8 (FP8/INT8) when you also want faster compute under heavy batching on big GPUs.

On YOUR cluster (real config)

Qwen3.6-27B-FP8 quantizes twice: FP8 weights (~27 GiB vs ~54 GiB in FP16, which is why it fits comfortably on one H100 with no tensor parallelism) and --kv-cache-dtype fp8 (128 KiB/token vs 256 KiB). One nuance: at your model's head_dim of 256, fp8 KV speeds decode but prefill can run slightly behind BF16, so it is worth measuring.

Read this next (primary sources): FP8 Formats (Micikevicius et al.) · vLLM FP8 KV-cache.

Final check — teach it back

Explain to a colleague: "Quantization is basically free for us because…"
…FP8 is ~lossless yet halves both the weights hauled per token (faster, memory-bound decode) and the KV cache (more concurrency), and modern methods keep the few outlier values precise, which is where quality would otherwise break.
← Lesson 15 · Speculative Decoding | Lesson 17 · Quantization Algorithms →
References
  1. FP8 Formats — Micikevicius et al. (2209.05433); AWQ (2306.00978); vLLM FP8 KV (blog).

Quantization

The other lever — fewer bits per number — and what it costs.

Today's win: you'll explain what quantization shrinks, why fewer bits makes decode faster (the bytes lever) and the KV cache smaller (more concurrency), what the precision tradeoff really is (outliers), and which format to reach for — all running on your FP8 cluster.
Pantry recap: the weights and the KV cache are measurements written in the recipe binders and on the tasters' index cards. Quantization is writing each measurement with fewer significant figures: thinner pages, lighter to haul, more fit on the shelf.

1 · What quantization actually is

A model is billions of numbers. Store each in fewer bits (FP16 at 2 bytes → FP8 at 1 byte → INT4 at ½ byte) and you round each value onto a coarser grid. The model gets smaller and cheaper to move, at the cost of a little rounding error. [1]

Pantry: FP16 says "237.4 g"; FP8 says "≈ 240 g". Fewer digits per measurement — the binder gets thinner and the van hauls less per trip.
[Figure: bits per number (footprint): FP16 · 2 bytes; FP8 · 1 byte (half); INT4 · ½ byte (quarter). The same value rounded to each grid: the true value snaps to the nearest grid point. Finer grid = more bits = less rounding.]
Halve the bits, halve the footprint — and snap each number to a coarser grid. The whole game is keeping that rounding error harmless.
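The footprint and rounding ideas above can be sketched in a few lines. This is a minimal illustration, not how any particular library does it: footprint_gb and snap_to_grid are hypothetical helper names, the 1-billion-parameter count is just a round number, and the grid is a uniform symmetric integer grid (real FP8 grids are floating point and non-uniform).

```python
def footprint_gb(n_params: float, bits: int) -> float:
    """Storage for n_params values at `bits` bits each, in decimal GB."""
    return n_params * bits / 8 / 1e9


def snap_to_grid(x: float, bits: int, max_abs: float = 1.0) -> float:
    """Round x to the nearest point of a symmetric uniform grid.

    ASSUMPTION: a uniform grid with 2**(bits-1) - 1 steps on each side of
    zero, spanning [-max_abs, +max_abs]. Used only to keep the sketch short.
    """
    levels = 2 ** (bits - 1) - 1
    step = max_abs / levels
    q = round(x / step)
    q = max(-levels, min(levels, q))  # clamp to the representable range
    return q * step


if __name__ == "__main__":
    true_value = 0.2374
    for bits in (16, 8, 4):
        approx = snap_to_grid(true_value, bits)
        print(
            f"{bits:2d} bits: {footprint_gb(1e9, bits):.1f} GB per 1B params, "
            f"{true_value} -> {approx:.4f} (error {abs(approx - true_value):.5f})"
        )
```

Each halving of the bit width halves the footprint, and the rounding error grows as the grid gets coarser.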

2 · Why it's a win — the two levers you already know

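Quantization pays off in exactly the two places earlier lessons flagged: the weight bytes hauled per decode token (Lesson 11's memory wall) and the size of the KV cache, which caps concurrency (Lesson 10). [2]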

Pantry: thinner recipe binders (weights) → the van hauls less per trip. Thinner index cards (KV) → more fit on the shelf. Same dish, less to carry and store.
[Figure: weights hauled per decode token, FP16 vs FP8: half the bytes → ~2× decode. One token's KV index cards, FP16 → fp8: ½ the cache → ~2× sequences fit.]
Both levers attack bytes moved/stored — which Lesson 11 said is the real decode bottleneck (not FLOPs). That's why quantization is the highest-leverage knob for decode.
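A back-of-the-envelope version of the bytes lever, as a sketch under stated assumptions: decode is treated as purely memory-bound (every generated token streams all weights once), KV-cache reads and activations are ignored, and the 3.35 TB/s bandwidth is an assumed figure to swap for your GPU's real number. decode_tokens_per_s is a hypothetical helper, not a library function.

```python
def decode_tokens_per_s(n_params: float, bytes_per_weight: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-sequence decode speed when memory-bound.

    Each decode step must read every weight once, so the ceiling is
    bandwidth divided by total weight bytes.
    """
    weight_bytes = n_params * bytes_per_weight
    return bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    N_PARAMS = 27e9   # from the 27B model this lesson runs
    HBM_BW = 3.35e12  # ASSUMPTION: ~3.35 TB/s of memory bandwidth
    for name, nbytes in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
        ceiling = decode_tokens_per_s(N_PARAMS, nbytes, HBM_BW)
        print(f"{name}: at most ~{ceiling:.0f} tokens/s per sequence")
```

The absolute numbers are ceilings, not predictions; the point is the ratio, which doubles each time the bytes per weight halve.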

3 · The catch — precision, and the outlier problem

Fewer bits means coarser rounding, so quality can slip. But the difficulty isn't the millions of typical values, which round fine. It's a handful of outliers: a few unusually large values whose rounding error blows up the result. Round those carelessly and the model breaks. [3]

Pantry: most measurements survive a coarse cup. But a pinch of saffron needs to be exact — round it to the nearest tablespoon and the dish is ruined. Smart quantization keeps the saffron precise.
[Figure: typical values → round cleanly. An outlier, far off the grid, rounds badly → keep it precise. Fixes: per-channel scales · SmoothQuant (migrate outliers) · AWQ (protect ~1% salient) · isolate in 16-bit.]
The reason naïve low-bit quantization fails — and why every good method is really a strategy for the outliers.
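One common way an outlier does damage can be shown with toy numbers. The sketch below uses plain absmax round-to-nearest quantization, where a single large value sets the scale for the whole group and stretches the grid so far that the typical values collapse. It then applies the "isolate in 16-bit" fix from the list above by keeping the outlier aside. The values and the helper names (quantize_absmax, demo) are made up for illustration; this is not SmoothQuant or AWQ.

```python
def quantize_absmax(values, bits):
    """Symmetric round-to-nearest with one absmax scale for the whole list."""
    levels = 2 ** (bits - 1) - 1  # 7 grid steps per side for 4-bit
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    scale = peak / levels
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, quantized):
    return sum(abs(a - b) for a, b in zip(original, quantized)) / len(original)


def demo(bits=4):
    typical = [0.02 * i for i in range(-10, 11)]  # small, well-behaved values
    outlier = 8.0                                 # one unusually large value

    # Naive: the outlier sets the scale, so the grid step is 8/7 and
    # every typical value rounds to zero.
    naive = quantize_absmax(typical + [outlier], bits)
    err_naive = mean_abs_error(typical, naive[:-1])

    # Protected: the outlier is kept aside at full precision and the
    # typical values get a scale of their own.
    protected = quantize_absmax(typical, bits)
    err_protected = mean_abs_error(typical, protected)
    return err_naive, err_protected


if __name__ == "__main__":
    err_naive, err_protected = demo()
    print(f"typical-value error, outlier in the group: {err_naive:.4f}")
    print(f"typical-value error, outlier isolated:     {err_protected:.4f}")
```

Per-channel scales work on the same principle: a smaller group per scale means one large value can spoil fewer neighbours.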

4 · The format map — what to reach for

Format / method | Bits | Quantizes | Quality | When
FP8 (E4M3), W8A8 | 8 | weights + activations + KV | ≈ lossless | high-end GPUs (H100+), your cluster
INT8 + SmoothQuant | 8 | weights + activations | ~1–3% loss | broad hardware, integer tensor cores
INT4 (GPTQ / AWQ), W4A16 | 4 (weights) | weights only | small loss (protect salient) | memory-bound / smaller GPUs
FP4 / NVFP4 | 4 | weights (+) | frontier, more care | newest Blackwell-class HW

Rule of thumb: weight-only (INT4) when you're memory- or bandwidth-starved and running modest batches; W8A8 (FP8/INT8) when you also want faster compute under heavy continuous batching on big GPUs. [1]
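The rule of thumb can be written down as a tiny decision helper. This is only a sketch: pick_format and its three flags are hypothetical, the choice between FP8 and INT8 follows the "When" column of the table, and the fall-through case (neither starved nor heavily batched) defaulting to W8A8 is an assumption rather than something the rule states.

```python
def pick_format(bandwidth_starved: bool, heavy_batching: bool,
                fp8_tensor_cores: bool) -> str:
    """Encode the lesson's rule of thumb as a first guess, not a verdict."""
    if bandwidth_starved and not heavy_batching:
        # Weight-only INT4 shrinks the weights the most.
        return "INT4 weight-only (GPTQ/AWQ)"
    # Heavy batching wants faster compute too, so quantize activations.
    # ASSUMPTION: the same default is used when neither flag is set.
    if fp8_tensor_cores:
        return "FP8 W8A8"
    return "INT8 W8A8 (SmoothQuant)"


if __name__ == "__main__":
    print(pick_format(bandwidth_starved=True, heavy_batching=False,
                      fp8_tensor_cores=False))
    print(pick_format(bandwidth_starved=False, heavy_batching=True,
                      fp8_tensor_cores=True))
```

Whatever the helper suggests, measure quality and throughput on your own workload before committing.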

In Kubernetes terms (infra bridge)

Quantization is resource-request compression: fewer bytes per weight (and per KV entry) means more requests (pods) fit on a GPU (node) and less bandwidth per token — like tightening pod memory requests so the scheduler packs more onto each node. FP8 is the sweet spot: half the footprint at ~no quality cost.

On YOUR cluster: quantization, twice over (real config)

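Qwen3.6-27B-FP8 runs both kinds at once: FP8 weights (~27 GiB vs ~54 GiB in FP16, which is why it fits comfortably on one H100 with no tensor parallelism) and an FP8 KV cache via --kv-cache-dtype fp8 (128 KiB/token vs 256 KiB).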

And the H100's FP8 tensor cores are rated at ~2× the BF16 FLOPs (3,341 vs 1,671 TFLOPS peak, per Lesson 11), so FP8 helps prefill too. One real subtlety: your model's head_dim is 256, where fp8 KV clearly speeds decode but prefill can run slightly behind BF16, so it is worth measuring with: bash learning/tools/cluster-probe.sh
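The per-token KV figure can be checked with the usual formula: two tensors (K and V) × attention layers × KV heads × head_dim × bytes per value. In the sketch below only head_dim = 256 comes from this lesson. The layer and KV-head counts are placeholders chosen so the arithmetic reproduces the 128 KiB vs 256 KiB figures, not values read from the model's config, and the 40 GiB cache budget is likewise assumed.

```python
def kv_bytes_per_token(kv_layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer and head."""
    return 2 * kv_layers * kv_heads * head_dim * bytes_per_value


def tokens_that_fit(cache_budget_bytes: int, per_token_bytes: int) -> int:
    """How many cached tokens fit in the memory left over for KV."""
    return cache_budget_bytes // per_token_bytes


if __name__ == "__main__":
    KV_LAYERS = 64   # PLACEHOLDER: not read from the real model config
    KV_HEADS = 4     # PLACEHOLDER: not read from the real model config
    HEAD_DIM = 256   # from the lesson
    BUDGET = 40 * 2**30  # ASSUMPTION: 40 GiB left for KV after weights

    for name, nbytes in [("bf16", 2), ("fp8", 1)]:
        per_token = kv_bytes_per_token(KV_LAYERS, KV_HEADS, HEAD_DIM, nbytes)
        fit = tokens_that_fit(BUDGET, per_token)
        print(f"{name}: {per_token // 1024} KiB/token, ~{fit:,} tokens fit")
```

Halving the bytes per value doubles the number of cached tokens for the same budget, which is where the extra concurrency comes from.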

Read this next (primary sources): The State of FP8 KV-Cache & Attention Quantization, vLLM blog (the exact flag you run), and the format origin: FP8 Formats for Deep Learning, Micikevicius et al.

Check yourself (recall, don't peek)

Picture the coarser grid and the saffron, then answer from memory.

I'm your teacher — ask me anything. Want the bit layout of E4M3 vs E5M2, how AWQ picks its 1% salient weights, or whether to quantize your reranker/embedding models too? Just ask.
References
  1. LLM quantization survey — weight-only vs W8A8, AWQ/GPTQ/SmoothQuant, format quality. aws.amazon.com · AWQ (Lin et al.)
  2. The State of FP8 KV-Cache & Attention Quantization — vLLM blog (2026). vllm-project.github.io
  3. SmoothQuant: migrating activation outliers to weights — Xiao et al. arxiv.org · FP8 Formats — 2209.05433