Inference Engineering · Lesson 18 · Model Formats & Compilation

Model Formats & Compilation

Store vs compile — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.
Today's win: you'll predict the difference between serializing a model and compiling one, and when each is worth it.

The setup

Lesson 2 showed a model as a file of weights (SafeTensors). But there's a spectrum from "just store the numbers" to "compile a tuned engine for one GPU".

Step 1 — what SafeTensors gives you

Step 2 — what ONNX adds

Recall — cover the screen: serialize vs compile, in one line each.
Serialize = store the weights (SafeTensors), optionally with the graph (ONNX), in a portable file that runs anywhere a framework or engine can load it. Compile = turn the graph into a hardware-specific binary engine (TensorRT-LLM): fuse layers, pick kernels, auto-tune for one GPU. Fastest, but rebuilt per model and per GPU.

Step 3 — what compilation does

Step 4 — the cost of compiling

Step 5 — what you run, and why

In Kubernetes terms (infra bridge)

SafeTensors/ONNX = a portable container image (run on any node). A TensorRT engine = a binary compiled for one node type — max perf, rebuild per hardware (per-arch image builds vs one universal image). Interpreted/portable vs ahead-of-time-compiled.

On YOUR cluster (context)

You run vLLM + SafeTensors (FP8), the flexible path: swap Qwen versions, retune flags, move models, all with no recompile. A stable, very-high-QPS model could be compiled with TensorRT-LLM (or served via NVIDIA NIM's prebuilt engines, Lesson 27) for more latency headroom, trading away that flexibility.

Read this next — primary source ONNX · TensorRT-LLM docs · runnable: day08 notebook.

Final check — teach it back

Explain to a colleague: "We use vLLM + SafeTensors instead of compiling because…"
…serialized formats are portable and flexible: we can swap models, retune, and move across GPUs with no recompile. Compiling with TensorRT-LLM would fuse and auto-tune for our exact GPU for peak speed, but the engine is welded to one model + one GPU and must be rebuilt on any change. Worth it only for a frozen, very-high-QPS workload.

Model Formats & Compilation

Store the weights, or compile a tuned engine — portability vs peak speed.

Today's win: you'll explain the difference between serializing a model (SafeTensors, ONNX) and compiling one (TensorRT-LLM) — portability vs hardware-tuned performance — and when each is the right call.

The picture: a written recipe vs a custom assembly line

Lesson 2 showed a model as a file of weights. But there's a spectrum. A serialized format just writes the numbers down portably — any kitchen can read it. A compiled engine is a custom assembly line built for this exact kitchen's equipment: blazing fast, but it only runs here and must be rebuilt for a different kitchen.

recipe written down, portable: SafeTensors / ONNX (serialization)
recipe + the kitchen's workflow: ONNX (weights + computation graph)
a custom assembly line for one kitchen: TensorRT-LLM (compiled engine)

1 · Serialization: just store the numbers

The simplest formats save weights to disk. SafeTensors (from Lesson 2) is the modern default: a header + raw blob, memory-mappable, safe. The older .pt pickle format can run arbitrary code on load (a supply-chain risk) and loads more slowly. Either way, you still need a framework/engine to run the math. [1]
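To make "a header + raw blob" concrete, here is a minimal sketch of that layout using only the Python standard library: an 8-byte little-endian header length, a JSON header with dtype, shape, and byte offsets, then the raw tensor bytes. The helper names `write_simple` and `read_tensor` are made up for illustration; in practice you would use the `safetensors` library, which also handles validation and framework tensors.

```python
import json
import mmap
import os
import struct
import tempfile


def write_simple(path, tensors):
    """Write {name: (dtype, shape, raw_bytes)} as: header length, JSON header, raw blob."""
    header, blob, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the start of the blob, not the start of the file.
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        blob += raw
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # little-endian u64 header size
        f.write(header_bytes)
        f.write(blob)


def read_tensor(path, name):
    """Memory-map the file and slice out one tensor's bytes; no code is executed on load."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (header_len,) = struct.unpack("<Q", mm[:8])
        header = json.loads(mm[8:8 + header_len])
        begin, end = header[name]["data_offsets"]
        data_start = 8 + header_len
        return header[name], bytes(mm[data_start + begin:data_start + end])


# Toy data (hypothetical tensor name): four float32 values forming a 2x2 weight.
weights = struct.pack("<4f", 0.5, -1.0, 2.0, 0.25)
path = os.path.join(tempfile.mkdtemp(), "toy.safetensors")
write_simple(path, {"layer0.weight": ("F32", [2, 2], weights)})
meta, raw = read_tensor(path, "layer0.weight")
values = struct.unpack("<4f", raw)
```

The reader only parses JSON and slices bytes, which is why this style of format is safe to load and cheap to memory-map. Unpickling a .pt file, by contrast, can run code embedded in the file.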

2 · ONNX: weights + the computation graph

ONNX goes a step further: it stores the weights and the model's computation graph in a standard operator set. That makes the model portable across runtimes: ONNX Runtime (ORT) can execute it on many backends without the original framework. [1]
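The idea of "weights + graph in a standard operator set" can be sketched with a toy model file and a tiny interpreter. The dict layout and the `run` function below are hypothetical and far simpler than the real ONNX protobuf format; only the operator names (MatMul, Add, Relu) are borrowed from ONNX.

```python
# Toy "model file": named weights plus an ordered list of graph nodes.
model = {
    "weights": {"W": [[1.0, 2.0], [3.0, 4.0]], "b": [0.5, -20.0]},
    "graph": [
        {"op": "MatMul", "inputs": ["x", "W"], "output": "h"},
        {"op": "Add", "inputs": ["h", "b"], "output": "z"},
        {"op": "Relu", "inputs": ["z"], "output": "y"},
    ],
}


def matmul(vec, mat):
    """Row vector times matrix, in plain Python lists."""
    return [sum(v * mat[i][j] for i, v in enumerate(vec)) for j in range(len(mat[0]))]


# The "runtime": one implementation per standard operator name.
OPS = {
    "MatMul": matmul,
    "Add": lambda a, b: [x + y for x, y in zip(a, b)],
    "Relu": lambda a: [max(0.0, x) for x in a],
}


def run(model, x):
    """Execute the graph node by node, knowing nothing about the framework that produced it."""
    values = dict(model["weights"], x=x)
    for node in model["graph"]:
        args = [values[name] for name in node["inputs"]]
        values[node["output"]] = OPS[node["op"]](*args)
    return values[model["graph"][-1]["output"]]


y = run(model, [1.0, 1.0])
```

Any runtime that implements the same operator names could execute this file, which is the portability ONNX is after. SafeTensors alone would carry only the `weights` part.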

Pantry: SafeTensors is the ingredient measurements; ONNX adds the steps too, written in a standard notation any kitchen can follow.

3 · Compilation: build a hardware-tuned engine

TensorRT and TensorRT-LLM don't just store; they compile. Given the graph, they fuse layers (Lesson 13), pick the fastest kernels, select precision (FP16/FP8), and auto-tune for the exact GPU, producing a binary engine. That gives peak NVIDIA performance at a cost: the engine is specialized per model and per GPU, so any change means a recompile, and the pipeline is less flexible. [2]
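A toy version of those two compile steps, fusion and auto-tuning, can be sketched in plain Python. Everything here is an assumption for illustration: the `FusedLinearRelu` op name, the two candidate "kernels", and the timing loop are not TensorRT APIs, and real compilers match many more patterns and tune GPU kernels rather than Python functions.

```python
import timeit


def fuse(graph):
    """Collapse each MatMul -> Add -> Relu chain into one fused node (toy pattern match)."""
    fused, i = [], 0
    while i < len(graph):
        if [n["op"] for n in graph[i:i + 3]] == ["MatMul", "Add", "Relu"]:
            fused.append({"op": "FusedLinearRelu"})  # hypothetical fused op name
            i += 3
        else:
            fused.append(graph[i])
            i += 1
    return fused


def linear_relu_loops(x, W, b):
    """Candidate kernel 1: explicit loops, one pass, no intermediate lists."""
    out = []
    for j in range(len(b)):
        acc = b[j]
        for i in range(len(x)):
            acc += x[i] * W[i][j]
        out.append(acc if acc > 0.0 else 0.0)
    return out


def linear_relu_zip(x, W, b):
    """Candidate kernel 2: same math, column-wise with zip and sum."""
    return [max(0.0, bj + sum(xi * wi for xi, wi in zip(x, col)))
            for col, bj in zip(zip(*W), b)]


def autotune(candidates, *args):
    """Time every candidate on this machine and keep the fastest one."""
    timings = {fn.__name__: timeit.timeit(lambda: fn(*args), number=200) for fn in candidates}
    best = min(candidates, key=lambda fn: timings[fn.__name__])
    return best, timings


graph = [{"op": "MatMul"}, {"op": "Add"}, {"op": "Relu"}]
compiled = fuse(graph)
x, W, b = [1.0, 1.0], [[1.0, 2.0], [3.0, 4.0]], [0.5, -20.0]
best, timings = autotune([linear_relu_loops, linear_relu_zip], x, W, b)
result = best(x, W, b)
```

Which candidate wins depends on the machine running the timing loop, and that is the point: the tuned choice is only valid for the hardware it was measured on, so a different GPU means tuning again.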

[Figure: format spectrum, left to right: .pt pickle, SafeTensors, ONNX, TensorRT engine (compiled). Left end: portable / flexible, store the weights (read anywhere). Right end: hardware-specific / fastest, compile to one GPU.]
Left to right: more portable → more optimized. Serialization runs anywhere; a compiled engine wins on speed but is welded to one model + one GPU.

In Kubernetes terms (infra bridge)

SafeTensors/ONNX are a portable container image — build once, run on any node. A TensorRT engine is a binary compiled for one CPU arch / one node type: maximum performance, but you must rebuild it per hardware (like maintaining per-architecture image builds instead of one universal image). Same tradeoff as interpreted/portable vs ahead-of-time-compiled.
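One way to picture "rebuild per hardware" is a build cache keyed by everything the engine is specialized to. The sketch below is hypothetical: `engine_key`, the field choice (weights hash, GPU type, precision), and the placeholder values are illustrative assumptions, not how TensorRT-LLM or NIM actually name or store engines.

```python
import hashlib


def engine_key(weights_sha256, gpu_name, precision):
    """Identify a compiled engine by the model, GPU type, and precision it was built for."""
    parts = "|".join([weights_sha256, gpu_name, precision])
    return hashlib.sha256(parts.encode("utf-8")).hexdigest()[:16]


def needs_rebuild(cache, key):
    """A cached engine is reusable only if every specialization input matches."""
    return key not in cache


cache = {}
key_a = engine_key("abc123", "gpu-type-a", "fp8")  # placeholder hash and GPU names
cache[key_a] = "engine-a.bin"

# Same model and precision, different node type: the cached engine does not apply.
key_b = engine_key("abc123", "gpu-type-b", "fp8")
rebuild_for_b = needs_rebuild(cache, key_b)
```

A serialized SafeTensors or ONNX file would need no such key, because the same file is valid on every node type.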

4 · The tradeoff — and what you run

So it's flexibility vs peak speed. vLLM (what you run) loads SafeTensors and stays flexible: swap models, change flags, no compile step. TensorRT-LLM squeezes out more throughput and lower latency but demands a per-model, per-GPU compile and a more rigid pipeline. Reach for compilation when the model is stable and you're chasing the last 20–30% at high, steady QPS. [2]
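A rough way to reason about "when is it worth it" is a break-even estimate: how long a model must serve steady traffic before the GPU time saved by the faster engine exceeds the GPU time spent compiling. This is a back-of-the-envelope sketch; the function name and every number in the example (2 GPU-hours to build, a 25% throughput gain, 8 serving GPUs) are made-up assumptions, not measurements.

```python
def break_even_hours(build_gpu_hours, throughput_gain, serving_gpus):
    """Hours of steady serving before GPU-hours saved exceed GPU-hours spent compiling.

    throughput_gain is fractional, e.g. 0.25 for 25% more throughput per GPU.
    """
    if throughput_gain <= 0 or serving_gpus <= 0:
        raise ValueError("gain and GPU count must be positive")
    # 25% more throughput means the same traffic needs 1 / 1.25 = 80% of the GPU time.
    saved_fraction = throughput_gain / (1.0 + throughput_gain)
    saved_gpu_hours_per_hour = serving_gpus * saved_fraction
    return build_gpu_hours / saved_gpu_hours_per_hour


# Hypothetical inputs: 2 GPU-hours per build, 25% gain, 8 GPUs serving this model.
hours = break_even_hours(build_gpu_hours=2.0, throughput_gain=0.25, serving_gpus=8)
```

With these made-up inputs the build pays for itself after roughly 1.25 hours of steady serving. The estimate ignores engineering time and the flexibility given up, and every model swap or GPU change restarts the clock, which is why compilation suits stable, high-QPS workloads.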

On YOUR cluster: flexible by choice (context)

You run vLLM with SafeTensors (FP8), the flexible path: you can swap Qwen versions, retune --max-num-seqs, or move models between GPUs with no recompile. If one model became a stable, very-high-QPS workload, compiling it with TensorRT-LLM (or using NVIDIA NIM, which ships prebuilt TRT-LLM engines; see Lesson 27) could win latency, at the cost of the flexibility you currently enjoy.

Read this next — primary source ONNX intro & TensorRT-LLM docs. Runnable companion: day08 notebook — .pt vs SafeTensors vs ONNX vs TensorRT.


← Lesson 17 — quant algorithms Next: Lesson 19 — model parallelism →
References
  1. Model formats (.pt / SafeTensors / ONNX) — day08 (pytorch-model-formats.ipynb); ONNX.
  2. TensorRT-LLM compilation — NVIDIA docs (day11).