Store vs compile — one guess at a time.
Lesson 2 showed a model as a file of weights (SafeTensors). But there's a spectrum from "just store the numbers" to "compile a tuned engine for one GPU".
SafeTensors and ONNX are like a portable container image: build once, run on any node. A TensorRT engine is like a binary compiled for one node type: maximum performance, but rebuilt per hardware target (per-architecture image builds versus one universal image). It is the familiar tradeoff between interpreted/portable and ahead-of-time-compiled.
You run vLLM with SafeTensors (FP8), the flexible path: swap Qwen versions, retune flags, or move models with no recompile. A stable, very-high-QPS model could instead be compiled with TensorRT-LLM (or served through NVIDIA NIM's prebuilt engines, Lesson 27) for more latency headroom, at the cost of that flexibility.
Store the weights, or compile a tuned engine — portability vs peak speed.
Lesson 2 showed a model as a file of weights. But there's a spectrum. A serialized format just writes the numbers down portably — any kitchen can read it. A compiled engine is a custom assembly line built for this exact kitchen's equipment: blazing fast, but it only runs here and must be rebuilt for a different kitchen.
| Kitchen analogy | Format |
|---|---|
| recipe written down, portable | SafeTensors / ONNX (serialization) |
| recipe + the kitchen's workflow | ONNX (weights + computation graph) |
| a custom assembly line for one kitchen | TensorRT-LLM (compiled engine) |
The simplest formats save weights to disk. SafeTensors (from Lesson 2) is the modern default: a header plus a raw blob of tensor bytes, memory-mappable, and safe to load because reading it executes no code. The older .pt format is a Python pickle, which can run arbitrary code on load (a supply-chain risk) and loads slower. Either way, you still need a framework or engine to run the math.[1]
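To make "a header + raw blob" concrete, here is a minimal sketch of that layout using only the Python standard library: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw bytes. The helper names (`build_safetensors`, `read_header`, `read_tensor_bytes`) and the tensor name `layer0.weight` are made up for illustration, and the sketch skips details a real writer handles (such as header padding), so treat it as a picture of the format rather than a replacement for the `safetensors` library.

```python
import json
import struct


def build_safetensors(tensors):
    """Serialize {name: (dtype, shape, raw_bytes)} into a SafeTensors-style layout."""
    header, blob, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the start of the data section, not the file.
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        blob += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte little-endian header length, then the JSON header, then raw data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + blob


def read_header(data):
    """Return (parsed header, byte position where the data section starts)."""
    (header_len,) = struct.unpack("<Q", data[:8])
    header = json.loads(data[8:8 + header_len])
    return header, 8 + header_len


def read_tensor_bytes(data, name):
    """Slice one tensor's raw bytes out of the blob; no code is executed."""
    header, data_start = read_header(data)
    begin, end = header[name]["data_offsets"]
    return data[data_start + begin:data_start + end]


# Hypothetical 2x2 float32 weight matrix.
weights = struct.pack("<4f", 0.5, -1.0, 2.0, 3.5)
file_bytes = build_safetensors({"layer0.weight": ("F32", [2, 2], weights)})

header, data_start = read_header(file_bytes)
values = struct.unpack("<4f", read_tensor_bytes(file_bytes, "layer0.weight"))
print(header["layer0.weight"]["shape"], values)
```

Because loading is just "parse JSON, slice bytes", there is nothing to execute, which is the safety property the text contrasts with pickle. The same offsets are what make memory-mapping straightforward.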
ONNX goes a step further: it stores the weights and the model's computation graph, expressed in a standard operator set. That makes the model portable across runtimes: ONNX Runtime (ORT) can execute it on many backends without the original framework.[1]
TensorRT and TensorRT-LLM don't just store the model; they compile it. Given the graph, they fuse layers (Lesson 13), pick the fastest kernels, select precision (FP16/FP8), and auto-tune for the exact GPU, producing a binary engine. That yields peak NVIDIA performance at a cost: the engine is specialized per model and per GPU, so changing either means a recompile.[2]
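One way to picture "specialized per model and per GPU" is to think of a compiled engine as a cached build artifact whose key covers everything it was tuned for. The sketch below is a toy model of that idea, not TensorRT's actual mechanism: `BuildConfig`, `engine_key`, `needs_rebuild`, and the example strings are all hypothetical names chosen for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BuildConfig:
    """Hypothetical description of what an engine was compiled for."""
    model_revision: str  # which weights/graph were compiled
    gpu_arch: str        # which GPU type the kernels were tuned on
    precision: str       # e.g. "fp16" or "fp8"


def engine_key(cfg):
    """Derive a stable cache key from every field the build depends on."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def needs_rebuild(cfg, cache):
    """An engine is reusable only if an identical config was built before."""
    return engine_key(cfg) not in cache


# Illustrative values only.
built = BuildConfig("model-rev-a", "gpu-type-1", "fp8")
cache = {engine_key(built)}

same_config = BuildConfig("model-rev-a", "gpu-type-1", "fp8")
new_gpu = BuildConfig("model-rev-a", "gpu-type-2", "fp8")
new_model = BuildConfig("model-rev-b", "gpu-type-1", "fp8")

for cfg in (same_config, new_gpu, new_model):
    print(cfg, "rebuild" if needs_rebuild(cfg, cache) else "reuse")
```

A weights-only file has no such key: the same file loads on any GPU type, and the runtime picks kernels at load time. That is the flexibility a compiled engine gives up.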
SafeTensors/ONNX are a portable container image — build once, run on any node. A TensorRT engine is a binary compiled for one CPU arch / one node type: maximum performance, but you must rebuild it per hardware (like maintaining per-architecture image builds instead of one universal image). Same tradeoff as interpreted/portable vs ahead-of-time-compiled.
So it's flexibility vs peak speed. vLLM (what you run) loads SafeTensors and stays flexible: swap models, change flags, no compile step. TensorRT-LLM delivers higher throughput and lower latency but demands a per-model, per-GPU compile and a more rigid pipeline. Reach for compilation when the model is stable and you're chasing the last 20–30% at high, steady QPS.[2]
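A quick back-of-the-envelope calculation shows why the "last 20–30%" only pays off at high, steady QPS. The numbers below (50 QPS per GPU, a 1.25x speedup, the two traffic levels) are invented for illustration and are not measurements; `gpus_needed` is a hypothetical helper.

```python
import math


def gpus_needed(target_qps, qps_per_gpu):
    """Whole GPUs required to serve target_qps at a given per-GPU throughput."""
    return math.ceil(target_qps / qps_per_gpu)


BASELINE_QPS_PER_GPU = 50.0  # assumed throughput on the flexible path
SPEEDUP = 1.25               # assumed gain from a compiled engine (within 20-30%)

for target_qps in (30, 1000):
    flexible = gpus_needed(target_qps, BASELINE_QPS_PER_GPU)
    compiled = gpus_needed(target_qps, BASELINE_QPS_PER_GPU * SPEEDUP)
    print(f"{target_qps} QPS: {flexible} GPUs flexible, "
          f"{compiled} compiled, saves {flexible - compiled}")
```

Under these assumptions the low-traffic case needs one GPU either way, so the compile step buys nothing in capacity. At 1000 QPS the same speedup frees several GPUs, which is when a rigid build pipeline starts to earn its keep.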
You run vLLM with SafeTensors (FP8), the flexible path: you can swap Qwen versions, retune --max-num-seqs, or move models between GPUs with no recompile. If one model became a stable, very-high-QPS workload, compiling it with TensorRT-LLM (or NVIDIA NIM, which ships prebuilt TRT-LLM engines; see Lesson 27) could win latency, at the cost of the flexibility you currently enjoy.