The hardware floor — one guess at a time.
We've leaned on "the pantry" (HBM) and "the cook" (compute) all course. Time to open the box: what's actually inside the GPU?
A GPU is a node: SMs = cores, HBM = RAM, SRAM/L2 = CPU caches. Knowing it is like knowing your instance type's cores/bandwidth/cache — it's how you reason about which resource bounds a workload instead of treating the node as a black box.
Each GPU is an H100 NVL (Hopper): ~130 SMs with FP8 Tensor Cores + Transformer Engine, ~50 MB L2, ~94 GB HBM3 @ ~3.9 TB/s. Those exact numbers set your roofline ridge (~214 FLOP/byte) and your KV-cache capacity, and they explain why FP8 + FlashAttention win here.
The hardware floor — where every optimization in this course physically lives.
We've talked about the pantry and the van for the whole course; now look at the building. A GPU is a kitchen with ~130 cook stations (SMs), each holding general tools (CUDA cores) and a few specialized appliances (Tensor Cores, built for matrix multiplies). The pantry is HBM; each station's cutting board is tiny, instant SRAM.
| Kitchen analogy | GPU hardware |
| --- | --- |
| a cook station | SM (Streaming Multiprocessor) — runs your kernels |
| the matrix-multiply appliance | Tensor Core — the LLM workhorse |
| the pantry across town | HBM (~94 GB, ~3.9 TB/s on your H100 NVL) |
| the cutting board at the station | SRAM / registers (tiny, ~19 TB/s) |
A GPU is an array of Streaming Multiprocessors (an H100 has ~130). Each SM runs threads in lockstep groups of 32 called warps. Your kernels are scheduled across these SMs, and the GPU hides memory latency by swapping in another warp whenever one is waiting on data.[1]
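The warp and SM arithmetic above can be sketched in a few lines. This is a minimal illustration, not a model of the real scheduler: `warps_per_block` and `waves` are hypothetical helper names, and `blocks_per_sm` is an assumed occupancy figure, not a hardware constant from the lesson.

```python
import math

WARP_SIZE = 32   # threads per warp (from the text)
NUM_SMS = 130    # approximate SM count quoted in the text


def warps_per_block(threads_per_block: int) -> int:
    """Warps needed for one thread block; a partial warp still occupies a whole warp."""
    return math.ceil(threads_per_block / WARP_SIZE)


def waves(num_blocks: int, blocks_per_sm: int = 1) -> int:
    """Scheduling rounds needed when every SM holds `blocks_per_sm` blocks at once.

    `blocks_per_sm` is an assumption for illustration; real occupancy depends
    on each kernel's register and shared-memory usage.
    """
    return math.ceil(num_blocks / (NUM_SMS * blocks_per_sm))


print(warps_per_block(256))  # 8 full warps
print(warps_per_block(100))  # 4 warps, the last one only partly used
print(waves(1000))           # 8 rounds across ~130 SMs at one block per SM
```

A block of 100 threads wastes most of its fourth warp, which is one reason block sizes are usually chosen as multiples of 32.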
The Tensor Core is a dedicated unit that does small matrix multiplies in one shot. LLMs are almost entirely matrix multiplies, so Tensor Cores are where the FLOPs come from. On Hopper they also run FP8 (via the Transformer Engine), which is why FP8 is so fast on your hardware.[1]
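One way to picture "small matrix multiplies in one shot" is to split a large matmul into small tile multiply-accumulates. The sketch below is plain Python for illustration only: `TILE = 4` is an arbitrary assumed tile edge (real Tensor Core tile shapes depend on data type and architecture), and `tile_mma` is a hypothetical stand-in for the single hardware operation.

```python
TILE = 4  # illustrative tile edge, an assumption rather than a hardware figure


def tile_mma(A, B, C, i0, j0, k0):
    """Accumulate one TILE x TILE product into C: the step done 'in one shot'."""
    for i in range(i0, i0 + TILE):
        for j in range(j0, j0 + TILE):
            acc = C[i][j]
            for k in range(k0, k0 + TILE):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def tiled_matmul(A, B):
    """Compute A @ B for square matrices whose size is a multiple of TILE."""
    n = len(A)
    assert n % TILE == 0, "matrix size must be a multiple of TILE"
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, TILE):          # tile row of the output
        for j0 in range(0, n, TILE):      # tile column of the output
            for k0 in range(0, n, TILE):  # walk the shared dimension tile by tile
                tile_mma(A, B, C, i0, j0, k0)
    return C


def naive_matmul(A, B):
    """Reference triple-loop matmul used to check the tiled version."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


A = [[i + j for j in range(8)] for i in range(8)]
B = [[i * j % 5 for j in range(8)] for i in range(8)]
print(tiled_matmul(A, B) == naive_matmul(A, B))  # True
```

The tiled and naive versions do the same arithmetic; the tiled form groups it into blocks that a matrix unit can process as a unit.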
Memory comes in tiers: tiny, near-instant registers and SRAM (~19 TB/s) on each SM, a shared L2 (~50 MB), and big but slower HBM (~94 GB, ~3.9 TB/s). The trade is always capacity vs. bandwidth.[2] This single picture explains the whole course.
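To put rough numbers on those tiers, the sketch below uses only the figures quoted above and assumes decimal units (1 GB = 1e9 bytes). The variable names are illustrative, and the results are back-of-envelope estimates, not measurements.

```python
HBM_BYTES = 94e9   # ~94 GB of HBM3 (from the text)
HBM_BW = 3.9e12    # ~3.9 TB/s HBM bandwidth (from the text)
SRAM_BW = 19e12    # ~19 TB/s on-chip SRAM bandwidth (from the text)
L2_BYTES = 50e6    # ~50 MB shared L2 (from the text)


def transfer_time_s(num_bytes: float, bandwidth: float) -> float:
    """Idealized time to move `num_bytes` at a sustained `bandwidth` in bytes/s."""
    return num_bytes / bandwidth


# Reading every byte of HBM once sets a floor for any pass that touches all of it.
full_hbm_sweep_ms = transfer_time_s(HBM_BYTES, HBM_BW) * 1e3

sram_vs_hbm = SRAM_BW / HBM_BW            # how much faster the cutting board is
hbm_vs_l2_capacity = HBM_BYTES / L2_BYTES  # how much bigger the pantry is

print(f"full HBM sweep: {full_hbm_sweep_ms:.1f} ms")     # 24.1 ms
print(f"SRAM / HBM bandwidth: {sram_vs_hbm:.2f}x")       # 4.87x
print(f"HBM / L2 capacity: {hbm_vs_l2_capacity:.0f}x")   # 1880x
```

The two ratios restate the trade in the text: HBM holds roughly 1,900 times more than L2, while on-chip SRAM moves data roughly five times faster than HBM.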
A GPU is a node: SMs are its cores, HBM is its RAM, and the SRAM/L2 tiers are the node's CPU caches. Knowing this is like knowing your instance type's core count, memory bandwidth, and cache — it's what lets you reason about why a workload is bound by one resource and not another, instead of treating the node as a black box.
Each of your 4 GPUs is an H100 NVL (Hopper): ~130 SMs with FP8 Tensor Cores + Transformer Engine, ~50 MB L2, and ~94 GB HBM3 at ~3.9 TB/s. Those exact numbers set your roofline ridge (~214 FLOP/byte) and your KV-cache capacity, and they explain why FP8 + FlashAttention are such wins here. The hardware is the constraint every lesson has been dancing around.
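The roofline ridge can be sketched directly from those figures. The peak compute rate below is not quoted in the lesson: it is back-derived from the stated ridge (~214 FLOP/byte) and bandwidth (~3.9 TB/s), so treat it as an implied value. The example intensities are hypothetical workloads chosen for illustration.

```python
HBM_BW = 3.9e12   # bytes/s, ~3.9 TB/s (from the text)
RIDGE = 214.0     # FLOP/byte, roofline ridge (from the text)

# Implied peak compute: ridge = peak / bandwidth, so peak = ridge * bandwidth.
IMPLIED_PEAK = RIDGE * HBM_BW  # about 8.35e14 FLOP/s (~835 TFLOP/s)


def attainable_flops(intensity: float) -> float:
    """Roofline: performance is capped by compute or by bandwidth * intensity."""
    return min(IMPLIED_PEAK, intensity * HBM_BW)


def bound(intensity: float) -> str:
    """Classify a workload by its arithmetic intensity in FLOP per byte moved."""
    return "memory-bound" if intensity < RIDGE else "compute-bound"


# Hypothetical intensities: a low-reuse kernel and a high-reuse kernel.
for intensity in (2.0, 500.0):
    tflops = attainable_flops(intensity) / 1e12
    print(f"{intensity:6.1f} FLOP/byte -> {tflops:6.1f} TFLOP/s, {bound(intensity)}")
```

A kernel doing 2 FLOP per byte tops out near 7.8 TFLOP/s no matter how many Tensor Cores sit idle, which is why raising arithmetic intensity (keeping data in SRAM, moving fewer bytes per value) matters so much on this hardware.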