Dropping bits without dropping quality — one guess at a time.
Lesson 16 said quantization writes each weight with fewer bits. The naive way is just to round every weight to the nearest low-bit value (RTN).
RTN = a flat resource cap on every pod (throttles the critical ones). AWQ/SmoothQuant = profile-guided right-sizing: sample real activations (a calibration set ≈ representative traffic) to find the hot path and protect it, capping the rest. Measure before you squeeze.
Your Qwen is FP8 (W8A8) — wide dynamic range, so it tolerates outliers without these
tricks (FP8 "just works"). They matter when you go lower: AWQ/GPTQ-INT4 to fit a bigger model on a
smaller GPU, or SmoothQuant-INT8 on non-FP8 hardware. Now ...-AWQ in a checkpoint name means
something to you.
GPTQ, AWQ, SmoothQuant — how to drop bits without dropping quality.
Lesson 16 said quantization = writing each number with fewer significant figures. But you learned the catch: most measurements survive a coarse cup, while a pinch of saffron (an outlier) ruins the dish if rounded. These algorithms are the smart strategies for which values to protect and how to compensate for the rounding you do.
| Kitchen strategy | Algorithm |
| --- | --- |
| round everything to the nearest cup | RTN (round-to-nearest), the naive baseline |
| adjust later steps to cancel the error | GPTQ |
| keep the saffron exact | AWQ (protect salient weights) |
| move the hard-to-measure part elsewhere | SmoothQuant (migrate outliers) |
The naive method, RTN, just rounds every weight to the nearest value on the low-bit grid. It's instant and free, but accuracy slips: a few outlier weights stretch the rounding grid, the values that matter most get rounded as coarsely as everything else, and the error propagates through the layers that follow. At INT4 especially, naive RTN can wreck a model. [1]
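To make the failure mode concrete, here is a minimal sketch of RTN on a single group of weights. It assumes a symmetric signed grid with one shared scale per group (real kernels vary in group size and grid layout), and the weight values are made up for illustration.

```python
def rtn_quantize(weights, bits=4):
    """Round-to-nearest onto a symmetric signed grid with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax     # grid step is set by the largest weight
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [c * scale for c in codes], scale


# Hypothetical weight group: four ordinary values and one outlier (3.5).
weights = [0.02, -0.05, 0.11, -0.08, 3.5]
dequantized, scale = rtn_quantize(weights)
print(scale)        # the grid step
print(dequantized)  # the weights after the quantize/dequantize round trip
```

With these values the outlier forces a grid step of 0.5, so the four small weights all collapse to 0.0 while the outlier survives. That is the saffron problem seen from the other side: one extreme value decides the measuring cup for everyone.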
GPTQ quantizes one layer at a time and, after rounding each weight, nudges the remaining weights to cancel the error it just introduced, using second-order (Hessian) information about which adjustments matter. It needs a small calibration dataset to estimate that. Result: solid INT4 weights with little quality loss. [1]
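The "round, then nudge the rest" idea can be sketched for a single row of weights. This is a simplified version under stated assumptions, not the blocked Cholesky implementation used in practice: the Hessian is taken as X^T X over a tiny made-up calibration set, the grid step `scale` is fixed by hand, and `invert`, `to_grid`, `gptq_row`, and `output_error` are hypothetical helper names.

```python
def invert(M):
    """Gauss-Jordan inverse for a small, well-conditioned square matrix."""
    n = len(M)
    A = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))   # partial pivoting
        A[c], A[p] = A[p], A[c]
        piv = A[c][c]
        A[c] = [v / piv for v in A[c]]
        for r in range(n):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [row[n:] for row in A]


def to_grid(v, scale, qmax=7):
    """Round one value to the nearest point on a symmetric INT4-style grid."""
    return scale * max(-qmax - 1, min(qmax, round(v / scale)))


def gptq_row(w, X, scale, damp=0.01):
    """Quantize left to right, pushing each rounding error onto later weights."""
    d = len(w)
    # Hessian of the layer's squared output error (constant factors cancel).
    H = [[sum(x[i] * x[j] for x in X) for j in range(d)] for i in range(d)]
    mean_diag = sum(H[i][i] for i in range(d)) / d
    for i in range(d):
        H[i][i] += damp * mean_diag        # small damping for numerical stability
    Hinv = invert(H)
    w = list(w)
    q = []
    for i in range(d):
        q.append(to_grid(w[i], scale))
        err = (w[i] - q[i]) / Hinv[i][i]
        for j in range(i + 1, d):
            w[j] -= err * Hinv[j][i]       # compensate on the not-yet-quantized weights
        pivot = Hinv[i][i]
        for r in range(i + 1, d):          # drop weight i from the inverse Hessian
            for c in range(i + 1, d):
                Hinv[r][c] -= Hinv[r][i] * Hinv[i][c] / pivot
    return q


def output_error(w_hat, w, X):
    """Squared difference in the layer output over the calibration inputs."""
    return sum(sum(xi * (a - b) for xi, a, b in zip(x, w_hat, w)) ** 2 for x in X)


# Hypothetical calibration inputs (rows are tokens) and one weight row.
X = [[1.0, 1.0], [2.0, 2.0], [1.0, -1.0]]
w = [0.34, 0.23]
scale = 0.1
rtn_row = [to_grid(v, scale) for v in w]
gptq = gptq_row(w, X, scale)
print(output_error(rtn_row, w, X), output_error(gptq, w, X))
```

In this toy case the first weight rounds down either way, but GPTQ then rounds the second weight up instead of down, which offsets part of the first error on the correlated inputs and leaves a smaller output error than plain RTN.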
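AWQ (Activation-aware Weight Quantization) notices that the weights multiplying the largest activations matter most. It uses calibration activations to identify the roughly 1% of weight channels that are salient, then scales those channels up before quantization and applies the inverse scale to the matching activations, so they lose far less precision. Every weight still goes low-bit; the salient ones just land on the grid more accurately. [2]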
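Here is a minimal sketch of that scaling trick for one output column, assuming a symmetric INT4 grid with a single shared scale and a small grid search over an exponent `alpha` that turns average activation magnitude into a per-channel scale. The data, the `alphas` grid, and the helper names are hypothetical; real implementations search per layer and fold the inverse scale into the preceding operation.

```python
def rtn(values, bits=4):
    """Round-to-nearest with one shared scale, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [scale * max(-qmax - 1, min(qmax, round(v / scale))) for v in values]


def output_error(w_hat, w, X):
    """Squared difference in the column's output over the calibration inputs."""
    return sum(sum(xi * (a - b) for xi, a, b in zip(x, w_hat, w)) ** 2 for x in X)


def awq_column(w, X, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the activation-aware scaling that minimizes output error."""
    n = len(w)
    # Average activation magnitude per input channel (clamped to avoid zero scales).
    act = [max(sum(abs(x[j]) for x in X) / len(X), 1e-8) for j in range(n)]
    best = None
    for alpha in alphas:
        s = [a ** alpha for a in act]                     # alpha = 0 reduces to plain RTN
        scaled_q = rtn([wj * sj for wj, sj in zip(w, s)])
        w_hat = [qj / sj for qj, sj in zip(scaled_q, s)]  # effective weight after un-scaling
        err = output_error(w_hat, w, X)
        if best is None or err < best[0]:
            best = (err, alpha, w_hat)
    return best


# Hypothetical calibration data: channel 0 carries much larger activations.
X = [[10.0, 1.0, -1.0, 1.0], [-8.0, 1.0, 1.0, -1.0]]
w = [0.24, 0.7, -0.5, 0.1]
err_rtn = output_error(rtn(w), w, X)
err_awq, alpha, w_hat = awq_column(w, X)
print(err_rtn, err_awq, alpha)
```

Because `alpha = 0` is in the search grid, the result can never be worse than plain RTN on the calibration data. Here the weight on the hot channel sits between two grid points, and scaling it up lets it land much closer to its true value.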
Quantizing activations (not just weights) is hard because activations have wild outliers. SmoothQuant divides each activation channel by a smoothing factor and multiplies the matching weights by the same factor, which leaves the layer's output unchanged but migrates the difficulty from the activations into the weights. Both become easy to quantize, enabling 8-bit weights and activations (W8A8). [3]
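The rescaling can be sketched in a few lines. This assumes the commonly described per-channel smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5; the matrices are made-up values and `smooth` and `matmul` are hypothetical helper names.

```python
def matmul(A, B):
    """Plain matrix product for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def smooth(X, W, alpha=0.5):
    """Shift outlier magnitude from activations X (tokens x channels)
    into weights W (channels x outputs) without changing X @ W."""
    n_ch = len(W)
    act_max = [max(abs(row[j]) for row in X) for j in range(n_ch)]
    w_max = [max(abs(v) for v in W[j]) for j in range(n_ch)]
    s = [act_max[j] ** alpha / w_max[j] ** (1 - alpha) for j in range(n_ch)]
    Xs = [[row[j] / s[j] for j in range(n_ch)] for row in X]   # activations shrink
    Ws = [[v * s[j] for v in W[j]] for j in range(n_ch)]       # weights absorb the scale
    return Xs, Ws, s


# Hypothetical data: activation channel 1 has outliers (64, 16).
X = [[1.0, 64.0], [-4.0, 16.0]]
W = [[1.0, -0.5], [0.25, 0.125]]
Xs, Ws, s = smooth(X, W)
print(s)
print(matmul(X, W))
print(matmul(Xs, Ws))   # same product, computed from better-behaved operands
```

After smoothing, the largest activation drops from 64 to 4 while the largest weight grows from 1 to 4, so neither side has a range that is awkward for an 8-bit grid, and the product is the same as before.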
RTN is a flat resource cap on every pod: simple, but it throttles the latency-critical ones. AWQ/SmoothQuant are profile-guided right-sizing: they sample real activations (a calibration set ≈ a representative traffic sample) to learn which weights are the hot path and protect those, capping the rest hard. You measure before you squeeze, instead of capping blind.
Your Qwen runs FP8 (W8A8), whose wide dynamic range often tolerates outliers
without these tricks — so FP8 is largely "just works." These algorithms earn their keep when you go
lower: an AWQ- or GPTQ-INT4 checkpoint to fit a bigger model on a smaller GPU, or
SmoothQuant for INT8 on hardware without FP8. Knowing them lets you read a checkpoint's name
(e.g. ...-AWQ) and know exactly what you're getting.
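The "bigger model on a smaller GPU" claim is simple arithmetic, sketched below. The parameter count is a hypothetical example rather than your model's actual size, and the estimate covers weights only: it ignores quantization scales, activations, and the KV cache.

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 32e9   # hypothetical 32B-parameter model
for name, bits in [("FP16", 16), ("FP8 / INT8", 8), ("INT4", 4)]:
    print(f"{name:>10}: {weight_gb(n_params, bits):.0f} GB")
```

Halving the bits halves the weight footprint, which is why a 4-bit checkpoint can fit where the 8-bit version of the same model would not.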