Below the engine — one guess at a time.
The forward pass isn't one monolithic op — it's a chain of small GPU operations (matmul, add, softmax, activation, …). Each is a "kernel".
A kernel launch ≈ a syscall / a separate step; fusion ≈ collapsing a chain of tiny init/sidecar containers (each with its own hop) into one in-memory process. SRAM-vs-HBM ≈ node-local cache vs remote storage: keep hot data close, stop paying the round-trip.
No flag needed: vLLM ships fused kernels + FlashAttention, plus FP8 tensor-core kernels on your Hopper H100s. It's a big reason decode holds ~108 tok/s. The lever you control is upstream (fewer bytes: quantization, GQA).
One level below the engine: why fewer, fatter GPU operations win.
A kernel is one operation the GPU runs across thousands of threads — one trip to do one thing. The forward pass is a sequence of kernels. The catch: each kernel reads its inputs from the faraway pantry (HBM) and writes results back. Fusion means doing several steps in one trip — keep the intermediate on the cutting board (SRAM, on-chip) instead of hauling it across town each time.
| Kitchen analogy | GPU term |
| --- | --- |
| one trip to do one task | a CUDA kernel |
| the pantry across town | HBM (GPU memory, ~3.4 TB/s) |
| the cutting board at hand | SRAM / registers (on-chip, ~19 TB/s) |
| doing several steps in one trip | kernel fusion |
Every operation (a matrix multiply, an add, a softmax, an activation) is a kernel: a small program the GPU runs in parallel across its thousands of threads. A single transformer layer launches many of them, and each launch carries a fixed overhead (~10–50 µs) [1].
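To see how the per-launch cost adds up, here is a rough back-of-the-envelope sketch. Only the 10–50 µs range and the ~108 tok/s figure come from this lesson; the layer count and the kernels-per-layer numbers are illustrative assumptions, not measurements of any real model. It also assumes every launch is paid serially with nothing overlapping, which is a pessimistic simplification.

```python
def launch_overhead_ms(num_layers, kernels_per_layer, launch_us):
    """Total launch overhead for one forward pass, in milliseconds."""
    return num_layers * kernels_per_layer * launch_us / 1000.0

NUM_LAYERS = 32          # assumption: a hypothetical 32-layer model
LAUNCH_US = 10           # low end of the ~10-50 us range quoted above
TOKEN_BUDGET_MS = 1000.0 / 108   # time per token at ~108 tok/s

# Assumption: 30 small kernels per layer unfused, 10 after fusion.
unfused_ms = launch_overhead_ms(NUM_LAYERS, 30, LAUNCH_US)
fused_ms = launch_overhead_ms(NUM_LAYERS, 10, LAUNCH_US)

print(f"per-token budget : {TOKEN_BUDGET_MS:.2f} ms")
print(f"unfused launches : {unfused_ms:.2f} ms")
print(f"fused launches   : {fused_ms:.2f} ms")
```

Under these assumed numbers, launch overhead alone would exceed the per-token time budget in the unfused case, before any useful work is done. Cutting the kernel count is what brings it back under.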
Here's why it matters for decode. Each kernel reads its inputs from HBM and writes its output back to HBM. String many small kernels together and you pay that slow round-trip over and over. Decode is already memory-bound, so those trips are the bottleneck [1].
Kernel fusion merges several ops into one kernel, so intermediate results live in on-chip SRAM/registers (~19 TB/s, vs HBM's ~3.4 TB/s) and never hit HBM. A classic example is a fused SwiGLU FFN: gate, activation, and elementwise multiply in a single kernel instead of three [1]. Fewer launches, fewer round-trips.
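The traffic saving can be illustrated with a toy model in plain Python. This is not GPU code: the "kernels" are ordinary functions, and the traffic counter is a hypothetical stand-in that counts elements read from or written to HBM. It models only the activation and the elementwise multiply, treating the gate and up projections as already computed, and it assumes SiLU as the activation.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu_unfused(gate, up):
    """Two separate 'kernels'; the intermediate makes a round-trip through HBM."""
    traffic = 0
    # Kernel 1: read gate, write the activated intermediate.
    act = [silu(g) for g in gate]
    traffic += len(gate) + len(act)
    # Kernel 2: read the intermediate and up, write the output.
    out = [a * u for a, u in zip(act, up)]
    traffic += len(act) + len(up) + len(out)
    return out, traffic

def swiglu_fused(gate, up):
    """One 'kernel'; the activation lives only in a local (think: a register)."""
    out = [silu(g) * u for g, u in zip(gate, up)]
    traffic = len(gate) + len(up) + len(out)
    return out, traffic

gate = [0.5, -1.0, 2.0, 0.0]
up = [1.0, 2.0, -0.5, 3.0]

out_unfused, traffic_unfused = swiglu_unfused(gate, up)
out_fused, traffic_fused = swiglu_fused(gate, up)

print(out_unfused == out_fused)        # same values either way
print(traffic_unfused, traffic_fused)  # elements moved to/from "HBM"
```

For N elements, the unfused version moves 5N elements and the fused version 3N, with identical outputs. In this toy the saving is modest; it grows as more steps are folded into the same kernel.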
A kernel launch is like a syscall or spinning up a separate step — cheap each, ruinous in bulk. Fusion is collapsing a chain of tiny init/sidecar containers (each with its own network/IPC hop) into one process that passes data in-memory. And SRAM-vs-HBM is node-local cache vs reaching across to remote storage: keep the hot data close and you stop paying the round-trip.
Naive attention (Lesson 5) builds the full n×n score matrix in HBM, which is huge for long contexts. FlashAttention fuses the whole attention computation into one kernel that streams over tiles in SRAM, never writing that n×n matrix to HBM. Same result, a fraction of the memory traffic [2].
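The "same result without the n×n matrix" claim rests on computing softmax incrementally, one tile of keys at a time. The sketch below shows that idea in plain Python. It is not the actual FlashAttention kernel (which also tiles over queries and runs on-chip); the function names and the tile size are hypothetical, chosen only for illustration.

```python
import math

def naive_attention(Q, K, V):
    """Builds the full score matrix first, then applies softmax and weights V."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    scores = [[scale * sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    out = []
    for row in scores:
        m = max(row)
        w = [math.exp(s - m) for s in row]
        z = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) / z
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, tile=2):
    """Streams over K/V tiles, keeping only a running max, sum, and output."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        m = float("-inf")          # running max of scores seen so far
        z = 0.0                    # running softmax denominator
        acc = [0.0] * len(V[0])    # running unnormalized output
        for start in range(0, len(K), tile):
            k_tile, v_tile = K[start:start + tile], V[start:start + tile]
            s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
            m_new = max(m, max(s))
            fix = math.exp(m - m_new)   # rescale earlier partial results
            p = [math.exp(x - m_new) for x in s]
            z = z * fix + sum(p)
            acc = [a * fix + sum(p[j] * v_tile[j][c] for j in range(len(p)))
                   for c, a in enumerate(acc)]
            m = m_new
        out.append([a / z for a in acc])
    return out

Q = [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]
K = [[1.0, 1.0], [0.0, 1.0], [-1.0, 0.5], [2.0, 0.0], [0.3, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]

ref = naive_attention(Q, K, V)
streamed = tiled_attention(Q, K, V, tile=2)
max_diff = max(abs(a - b) for r, s in zip(ref, streamed) for a, b in zip(r, s))
print(max_diff)  # difference is floating-point rounding only
```

The tiled version never holds more than one tile of scores at a time, yet matches the naive output up to rounding. That is the property that lets the real kernel keep its working set in SRAM.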
You don't set a flag for this: vLLM ships fused kernels + FlashAttention and, on your Hopper H100s, FP8 tensor-core kernels. It's a big part of why your Qwen sustains ~108 decode tok/s rather than dying on launch overhead and HBM traffic. The lever you do control is upstream (fewer bytes: quantization in Lesson 16, GQA). Kernels are the engine's job, but knowing they exist explains where the speed comes from.