CUDA Kernels & Fusion

Below the engine — one guess at a time.

Each step below: commit a guess, then hit Reveal. Predicting first — even a wrong guess — is what makes it stick.

Today's win: you'll predict what a kernel is, why many small ones are slow for decode, and how fusion + FlashAttention fix it.

The setup

The forward pass isn't one monolithic op — it's a chain of small GPU operations (matmul, add, softmax, activation, …). Each is a "kernel".

Step 1 — the hidden cost of each kernel

Step 2 — why this hurts decode specifically

Recall — cover the screen: what is kernel fusion and why does it help?
Merging several ops into one kernel so intermediate results stay in on-chip SRAM/registers instead of round-tripping through HBM. Fewer HBM trips + fewer launch overheads → faster, especially for memory-bound decode.

Step 3 — the fix

Step 4 — SRAM vs HBM

Step 5 — the famous one

In Kubernetes terms (infra bridge)

A kernel launch ≈ a syscall / a separate step; fusion ≈ collapsing a chain of tiny init/sidecar containers (each with its own hop) into one in-memory process. SRAM-vs-HBM ≈ node-local cache vs remote storage: keep hot data close, stop paying the round-trip.

On YOUR cluster (under the hood)

No flag needed: vLLM ships fused kernels + FlashAttention, plus FP8 tensor-core kernels on your Hopper H100s. It's a big reason decode holds ~108 tok/s. The lever you control is upstream (fewer bytes: quantization, GQA).

Read this next (primary source): FlashAttention, Dao et al. Runnable companion: day07 notebook.

Final check — teach it back

Explain to a colleague: "Fusing kernels speeds up decode because…"
…it cuts HBM round-trips and per-kernel launch overhead: several ops run as one kernel with intermediates kept in fast on-chip SRAM. Since decode is memory-bound, fewer trips to slow HBM is a direct win. FlashAttention applies the same idea to attention: it tiles through SRAM instead of writing the n×n matrix to HBM.



References

day07 — CUDA kernels & fusion (notebook); FlashAttention.

CUDA Kernels & Fusion

One level below the engine: why fewer, fatter GPU operations win.

Today's win: you'll explain what a CUDA kernel is, why launching many small ones — each round-tripping through HBM — is slow for memory-bound decode, and how kernel fusion and FlashAttention cut those round-trips.

The picture: stop walking to the pantry for every step

A kernel is one operation the GPU runs across thousands of threads — one trip to do one thing. The forward pass is a sequence of kernels. The catch: each kernel reads its inputs from the faraway pantry (HBM) and writes results back. Fusion means doing several steps in one trip — keep the intermediate on the cutting board (SRAM, on-chip) instead of hauling it across town each time.

one trip to do one task	a CUDA kernel
the pantry across town	HBM (off-chip GPU memory, ~3.4 TB/s on an H100)
the cutting board at hand	SRAM / registers (on-chip, ~19 TB/s, the A100 estimate from the FlashAttention paper)
doing several steps in one trip	kernel fusion
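
To put the two bandwidth figures side by side, here is a minimal sketch that computes the idealised time to move one intermediate tensor at each speed. The tensor shape is hypothetical, and the calculation ignores latency and the fact that SRAM is far too small to hold a tensor this size at once (real kernels work tile by tile), so treat it as an illustration of the ratio only.

```python
# Approximate bandwidth figures quoted in the table above.
HBM_BYTES_PER_S = 3.4e12
SRAM_BYTES_PER_S = 19e12


def transfer_time_us(num_bytes: float, bytes_per_s: float) -> float:
    """Idealised time to move num_bytes at a given bandwidth, in microseconds."""
    return num_bytes / bytes_per_s * 1e6


# Hypothetical intermediate: 4096 x 4096 values at 2 bytes each.
tensor_bytes = 4096 * 4096 * 2

hbm_us = transfer_time_us(tensor_bytes, HBM_BYTES_PER_S)
sram_us = transfer_time_us(tensor_bytes, SRAM_BYTES_PER_S)
print(f"HBM: {hbm_us:.2f} us, SRAM: {sram_us:.2f} us, ratio: {hbm_us / sram_us:.1f}x")
```

The ratio depends only on the two bandwidths, so it stays the same whatever tensor size you plug in.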

1 · A forward pass is a sequence of kernels

Every operation (a matrix multiply, an add, a softmax, an activation) is a kernel: a small program the GPU runs in parallel across its thousands of threads. A single transformer layer fires many of them, and there's a fixed launch overhead (~10–50 µs) per kernel [1].
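
A rough tally shows how the per-kernel overhead adds up across a forward pass. This is a sketch under stated assumptions: the layer count and kernels-per-layer numbers are hypothetical, and it treats every launch as paid serially. In practice launches are asynchronous and can overlap with GPU work, so read the output as a worst-case bound rather than a measurement.

```python
def launch_overhead_ms(kernels_per_layer: int, num_layers: int, overhead_us: float) -> float:
    """Total launch overhead for one forward pass if every launch is paid serially."""
    return kernels_per_layer * num_layers * overhead_us / 1000.0


# Hypothetical model: 32 layers, 12 kernels per layer unfused vs 4 after fusion.
NUM_LAYERS = 32
for label, kernels in (("unfused", 12), ("fused", 4)):
    low = launch_overhead_ms(kernels, NUM_LAYERS, 10.0)   # low end of the quoted range
    high = launch_overhead_ms(kernels, NUM_LAYERS, 50.0)  # high end of the quoted range
    print(f"{label}: {low:.2f} to {high:.2f} ms of launch overhead per token")
```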

2 · The cost: HBM round-trips on every kernel

Here's why it matters for decode. Each kernel reads its inputs from HBM and writes its output back to HBM. String many small kernels together and you pay that slow round-trip over and over, and because decode is already memory-bound, those trips are the bottleneck [1].

Same math, far fewer trips. Fusion keeps the intermediates on-chip (SRAM) instead of writing each one back to slow HBM — a direct win for memory-bound decode.
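
The "far fewer trips" claim can be sketched as simple byte accounting. The model below assumes a chain of elementwise ops that each read one input and write one output of the same size. The op count and tensor size are hypothetical, and the timing uses the approximate HBM bandwidth quoted in this lesson.

```python
HBM_BYTES_PER_S = 3.4e12  # approximate HBM bandwidth quoted in this lesson


def chain_hbm_bytes(num_ops: int, num_elements: int, bytes_per_element: int, fused: bool) -> int:
    """HBM bytes moved by a chain of one-input, one-output elementwise ops."""
    per_pass = 2 * num_elements * bytes_per_element  # one read plus one write
    # Fused: a single read of the input and a single write of the final output.
    # Unfused: every op in the chain pays its own read and write.
    return per_pass if fused else num_ops * per_pass


# Hypothetical chain: 3 ops over one million 2-byte values.
num_ops, num_elements, bytes_per_element = 3, 1_000_000, 2
for fused in (False, True):
    moved = chain_hbm_bytes(num_ops, num_elements, bytes_per_element, fused)
    micros = moved / HBM_BYTES_PER_S * 1e6
    label = "fused" if fused else "unfused"
    print(f"{label}: {moved / 1e6:.0f} MB through HBM, about {micros:.2f} us at peak bandwidth")
```

Under this model a chain of k ops moves k times as many bytes unfused as fused, which is why longer chains benefit more.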

3 · Fusion: do more per trip

Kernel fusion merges several ops into one kernel, so intermediate results live in on-chip SRAM/registers (~19 TB/s, vs HBM's ~3.4 TB/s) and never hit HBM. A classic example is a fused SwiGLU FFN: gate, activation, and elementwise multiply in a single kernel instead of three [1]. Fewer launches, fewer round-trips.

Pantry: instead of three trips (fetch, season, return; fetch, cook, return; …), the cook does all three on the board in one trip. The board (SRAM) is tiny but instant.
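
Here is a CPU-side sketch of what "same math, one trip" means for the elementwise part of SwiGLU. It is plain Python, not a GPU kernel: the function names are hypothetical, the gate and up projections are assumed to be already computed, and a Python list stands in for an intermediate that an unfused pipeline would write to HBM.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_unfused(gate, up):
    """Two separate passes; `activated` models an intermediate written out in full."""
    activated = [silu(g) for g in gate]              # pass 1: activation
    return [a * u for a, u in zip(activated, up)]    # pass 2: elementwise multiply


def swiglu_fused(gate, up):
    """One pass; the activation lives only in a local variable, like a register."""
    out = []
    for g, u in zip(gate, up):
        a = silu(g)        # never stored as a full-size intermediate
        out.append(a * u)
    return out


gate = [-2.0, -0.5, 0.0, 1.5, 3.0]
up = [0.5, 1.0, -1.0, 2.0, 0.25]
print(swiglu_fused(gate, up) == swiglu_unfused(gate, up))
```

Both versions perform the same floating-point operations in the same order per element. Only the lifetime of the intermediate changes, which is the whole point of fusion.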

In Kubernetes terms (infra bridge)

A kernel launch is like a syscall or spinning up a separate step — cheap each, ruinous in bulk. Fusion is collapsing a chain of tiny init/sidecar containers (each with its own network/IPC hop) into one process that passes data in-memory. And SRAM-vs-HBM is node-local cache vs reaching across to remote storage: keep the hot data close and you stop paying the round-trip.

4 · FlashAttention: the famous fusion

Attention (Lesson 5) naively builds the full n×n score matrix in HBM, which is huge for long contexts. FlashAttention fuses the whole attention computation into one kernel that streams over tiles in SRAM, never writing that n×n matrix to HBM. Same result, a fraction of the memory traffic [2].

FlashAttention is fusion applied to attention: tile through fast SRAM, skip the giant HBM write. It's why long-context attention is even feasible.
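
How can softmax stay exact when the kernel only ever sees one tile of scores at a time? The trick is an online softmax: keep a running maximum, a running denominator, and a running weighted sum, and rescale the running totals whenever a new tile raises the maximum. The sketch below is a heavily simplified illustration, assuming a single query row, scalar values per key, and plain Python lists in place of SRAM tiles. The function names are hypothetical and this is not the paper's kernel.

```python
import math


def attention_row_naive(scores, values):
    """Reference: softmax over the full row of scores, then a weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_row_tiled(scores, values, tile_size):
    """Same result, but only one tile of scores is visible at a time."""
    running_max = -math.inf
    running_sum = 0.0   # softmax denominator so far
    running_out = 0.0   # unnormalised weighted sum so far
    for start in range(0, len(scores), tile_size):
        tile_scores = scores[start:start + tile_size]
        tile_values = values[start:start + tile_size]
        new_max = max(running_max, max(tile_scores))
        # Rescale earlier totals to the new maximum (exp(-inf) is 0.0 on the first tile).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(tile_scores, tile_values):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.3, 2.1, -1.0, 4.2, 0.0, 1.7, 3.3]
values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.0]
reference = attention_row_naive(scores, values)
for tile_size in (1, 2, 3, 7):
    tiled = attention_row_tiled(scores, values, tile_size)
    print(tile_size, math.isclose(tiled, reference, rel_tol=1e-9))
```

The tiled version never holds more than `tile_size` scores at once, yet matches the full-row softmax up to floating-point rounding.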

On YOUR cluster — you get this for free under the hood

You don't set a flag for this: vLLM ships fused kernels + FlashAttention and, on your Hopper H100s, FP8 tensor-core kernels. It's a big part of why your Qwen sustains ~108 decode tok/s rather than dying on launch overhead and HBM traffic. The lever you do control is upstream (fewer bytes: quantization (L16), GQA). Kernels are the engine's job, but knowing they exist explains where the speed comes from.
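
To see why "fewer bytes" is the lever you control, here is a back-of-envelope ceiling for memory-bound decode: if every weight must be read from HBM once per generated token, bandwidth divided by weight bytes bounds tokens per second. The parameter count below is hypothetical (it is not your model's), and the estimate ignores KV-cache reads, batching, and everything else the engine does, so it is a rough upper bound and not a prediction.

```python
HBM_BYTES_PER_S = 3.4e12  # approximate HBM bandwidth quoted in this lesson


def decode_ceiling_tok_s(num_params: float, bytes_per_param: float,
                         bandwidth: float = HBM_BYTES_PER_S) -> float:
    """Upper bound on single-stream decode speed if all weights are read once per token."""
    return bandwidth / (num_params * bytes_per_param)


# Hypothetical 8-billion-parameter model at two weight precisions.
NUM_PARAMS = 8e9
for label, bytes_per_param in (("16-bit weights", 2.0), ("FP8 weights", 1.0)):
    ceiling = decode_ceiling_tok_s(NUM_PARAMS, bytes_per_param)
    print(f"{label}: at most about {ceiling:.0f} tok/s")
```

Halving the bytes per weight doubles the ceiling, which is the sense in which quantization acts upstream of the kernels.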

Read this next (primary source): FlashAttention, Dao et al. Runnable companion: day07 notebook (kernels, fusion, and SRAM vs DRAM).

Check yourself (recall, don't peek)



References

[1] CUDA kernels & fusion, day07 notebook (cuda-kernels.ipynb).
[2] Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention" (arXiv:2205.14135).