Making Inference Fast — Quantization, Sparsity, and Dataflow from a Hardware Lens
- Authors: Youngju Kim (@fjvbn20031)
- Introduction
- The Cost Structure of Inference — Why Memory Is the Bottleneck
- Quantization — Making Data Smaller
- Sparsity — Making Data Fewer
- Dataflow Architecture — Moving Data Less
- Operator Fusion — Not Writing Intermediates to Memory
- Batching and KV Cache — The Tug-of-War of Throughput and Latency
- Compiler Optimization — Searching Instead of Hand-Tuning
- The Accuracy-Efficiency Trade-off — No Free Lunch
- Hardware-Software Co-design — The Big Picture
- Putting It Into Practice — Where to Start
- How Inference and Training Workloads Differ
- A Small Case Study — Following the Numbers
- Conclusion
- References
Introduction
A trained model is made once, but inference runs billions of times a day for as long as the model lives. That is why, in 2026, the industry reported inference capex overtaking training capex for the first time. Making a model larger and smarter, and serving that model cheaply and quickly, have become genuinely separate engineering problems.
This essay decomposes the goal of "making inference fast" from a hardware perspective. The central claim is simple: modern inference is rarely slow because it runs out of compute. It is slow because it spends its time moving data. From that single fact, three strategies interlock — quantization to make data smaller, sparsity to make data fewer, and dataflow design to move data less. Add operator fusion, batching, KV caching, and compiler optimization, and the same chip can deliver several times the throughput.
The tone here is calm but grounded — every idea anchored in numbers and code.
The Cost Structure of Inference — Why Memory Is the Bottleneck
Start with the cost structure. Two numbers govern a GPU's performance: compute throughput (FLOPS) and memory bandwidth (bytes/s). Their ratio, peak FLOPS divided by bandwidth, is the chip's ridge point. The matching property of a workload is its arithmetic intensity: how many operations it performs for every byte it reads from memory.
The roofline model captures this at a glance.
Performance
(FLOPS)
   ^
   |             ___________________  compute-bound ceiling
   |            /
   |           /
   |          /   <- this slope is memory bandwidth
   |         /
   |        /     memory-bound region
   |       /
   +-------------+--------------------------> arithmetic intensity (FLOP/byte)
            ridge point
When intensity is low (left), bandwidth determines performance; when high (right), compute does. The problem is that the decoding phase of LLM inference is extremely memory-bound.
LLM inference splits into two phases.
- Prefill: processes the entire input prompt at once. Heavy on matrix-matrix products, high arithmetic intensity, closer to compute-bound.
- Decode: generates tokens one at a time. Every step reads all of the model's weights to do a matrix-vector product. With a small batch, arithmetic intensity drops to about 1 FLOP/byte and the workload becomes thoroughly memory-bound.
Feel it in numbers. A 70-billion-parameter model stored in FP16 is roughly 140GB. Decoding one token requires reading those weights once. At about 3.35TB/s of memory bandwidth, just sweeping the weights once takes about 42 milliseconds. So with batch 1, roughly 24 tokens per second is the theoretical ceiling. The compute units sit almost idle.
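The arithmetic above can be checked in a few lines. This is a minimal sketch rather than a benchmark: the 3.35 TB/s bandwidth and the 140 GB weight size come from the text, while the peak compute figure `PEAK_FLOPS` and the helper names are illustrative assumptions.

```python
# Roofline sketch. BANDWIDTH and WEIGHT_BYTES come from the text above;
# PEAK_FLOPS is an assumed round number used only for illustration.
PEAK_FLOPS = 1.0e15      # assumed peak compute, FLOP/s (hypothetical)
BANDWIDTH = 3.35e12      # memory bandwidth, bytes/s
WEIGHT_BYTES = 140e9     # 70B parameters x 2 bytes (FP16)


def attainable_flops(intensity: float) -> float:
    """Roofline: performance is capped by compute or by bandwidth x intensity."""
    return min(PEAK_FLOPS, BANDWIDTH * intensity)


def decode_ceiling_tokens_per_s(weight_bytes: float, bandwidth: float) -> float:
    """Batch-1 decode reads every weight once per generated token."""
    return bandwidth / weight_bytes


ridge_point = PEAK_FLOPS / BANDWIDTH          # FLOP/byte where the roof flattens
sweep_ms = WEIGHT_BYTES / BANDWIDTH * 1e3     # time to read all weights once
tokens_per_s = decode_ceiling_tokens_per_s(WEIGHT_BYTES, BANDWIDTH)

print(f"ridge point: {ridge_point:.0f} FLOP/byte")
print(f"weight sweep: {sweep_ms:.1f} ms, ceiling: {tokens_per_s:.1f} tokens/s")
print(f"compute utilization at intensity 1: {attainable_flops(1.0) / PEAK_FLOPS:.2%}")
```

Under these assumptions a workload at 1 FLOP/byte sits far to the left of the ridge point, which is why the compute units stay nearly idle.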
This is where the term memory wall comes in. Over the past decade-plus, compute performance has grown far faster than memory bandwidth. As a result, in most inference workloads the chip's compute capacity is plentiful and its ability to feed data is scarce. Energy matters too: moving data costs more than an order of magnitude more energy than computing on it. Relative to an on-chip register access, fetching data from HBM costs roughly hundreds of times more energy.
Every other strategy in this essay follows from that one fact. Make data smaller (quantization), fewer (sparsity), and moved less (dataflow).
Quantization — Making Data Smaller
Quantization represents weights and activations with fewer bits. Going from FP16 (16 bits) to INT8 (8 bits) halves memory footprint and bandwidth demand. On a memory-bound workload, that translates almost directly into a speedup of up to 2x.
The Landscape of Precision Formats
The major formats you encounter in practice in 2026 are these.
| Format | Bits | Primary use | Hardware support |
|---|---|---|---|
| FP16 / BF16 | 16 | Training, high-precision inference | Broad |
| FP8 (E4M3 / E5M2) | 8 | Inference, some training | Hopper onward, TPU |
| INT8 | 8 | Inference (PTQ/QAT) | Very broad |
| FP4 (MXFP4, etc.) | 4 | Inference | Blackwell 2nd-gen TE |
| INT4 | 4 | Weight-only quantization | Broad (kernel dependent) |
NVIDIA Blackwell's second-generation Transformer Engine (announced at GTC 2024) supports FP4 natively. FP4 represents a float in 4 bits, which on its own has too narrow a range and collapses accuracy. So microscaling is used: each small block (say 16 or 32 values) carries its own shared scale factor, matching dynamic range block by block.
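Block-wise scaling can be illustrated with a small sketch. The value grid below is the set of magnitudes a 4-bit E2M1 float can represent, and the shared scale is restricted to a power of two. The rule used to pick the scale is a simplification for illustration, not the exact procedure of any hardware or specification, and all function names are hypothetical.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def block_scale(block: list[float]) -> float:
    """Smallest power-of-two scale that maps the block's max onto the grid."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / FP4_GRID[-1]))


def quantize_block(block: list[float]) -> tuple[float, list[float]]:
    """Return (shared scale, dequantized values) for one block."""
    scale = block_scale(block)
    out = []
    for v in block:
        # Snap the scaled magnitude to the nearest representable grid point.
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(mag * scale, v))
    return scale, out


def mx_quantize(values: list[float], block_size: int = 32) -> list[float]:
    """Quantize a flat list block by block, each block with its own scale."""
    result = []
    for start in range(0, len(values), block_size):
        _, deq = quantize_block(values[start:start + block_size])
        result.extend(deq)
    return result


def mean_abs_err(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
# Two blocks with very different magnitudes: the case a single scale handles badly.
data = [random.gauss(0, 0.01) for _ in range(32)] + [random.gauss(0, 10) for _ in range(32)]
per_block = mx_quantize(data, block_size=32)
one_scale = mx_quantize(data, block_size=64)   # one shared scale for all 64 values

print("small-magnitude block, per-block scale:", mean_abs_err(data[:32], per_block[:32]))
print("small-magnitude block, single scale:   ", mean_abs_err(data[:32], one_scale[:32]))
```

With a single scale, the large block dictates the range and the small values round to zero. With per-block scales each block uses the full 4-bit grid.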
Symmetric/Asymmetric, PTQ/QAT
The basic quantization formulas are these.
quantize: q = round(x / scale) + zero_point
dequantize: x_hat = (q - zero_point) * scale
Symmetric quantization fixes zero_point at 0. Asymmetric helps with skewed activations (for example, always-positive values after ReLU).
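The two formulas can be exercised directly. The sketch below uses plain Python lists so it runs without dependencies. The function names and the choice of an unsigned range for the asymmetric case are assumptions made for illustration.

```python
def quant_params(xs, n_bits=8, symmetric=True):
    """Return (scale, zero_point, qmin, qmax) for a list of floats."""
    if symmetric:
        qmax = 2 ** (n_bits - 1) - 1          # 127 for 8 bits
        qmin = -qmax
        scale = max(abs(x) for x in xs) / qmax
        zero_point = 0
    else:
        qmin, qmax = 0, 2 ** n_bits - 1       # 0..255 for 8 bits
        lo, hi = min(xs), max(xs)
        scale = (hi - lo) / (qmax - qmin)
        zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax


def quantize(xs, scale, zero_point, qmin, qmax):
    """q = round(x / scale) + zero_point, clamped to the integer range."""
    return [min(qmax, max(qmin, round(x / scale) + zero_point)) for x in xs]


def dequantize(qs, scale, zero_point):
    """x_hat = (q - zero_point) * scale"""
    return [(q - zero_point) * scale for q in qs]


def mean_abs_err(xs, symmetric):
    scale, zp, qmin, qmax = quant_params(xs, symmetric=symmetric)
    x_hat = dequantize(quantize(xs, scale, zp, qmin, qmax), scale, zp)
    return sum(abs(a - b) for a, b in zip(xs, x_hat)) / len(xs)


# Always-positive data, as after a ReLU: the negative half of the
# symmetric range goes unused, so its step size is twice as coarse.
post_relu = [i / 100 for i in range(0, 601)]   # 0.00 .. 6.00
print("symmetric :", mean_abs_err(post_relu, symmetric=True))
print("asymmetric:", mean_abs_err(post_relu, symmetric=False))
```

The sketch assumes the input is not constant. A production implementation would guard against a zero scale.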
By timing, there are two approaches.
- PTQ (Post-Training Quantization): quantize a trained model as is, fitting only the scales with a small calibration set. Fast and simple, but accuracy can suffer at 4 bits and below.
- QAT (Quantization-Aware Training): simulate quantization error during training so the model adapts. Costly, but accuracy holds well at low bit widths.
A Minimal PTQ Calibration Example
import torch


def collect_quant_scale(weight: torch.Tensor, n_bits: int = 8):
    """Compute a symmetric quantization scale per channel."""
    qmax = 2 ** (n_bits - 1) - 1  # 127 for int8
    # per-output-channel (row) max
    per_channel_max = weight.abs().amax(dim=1, keepdim=True)
    scale = per_channel_max / qmax
    scale = scale.clamp(min=1e-8)  # avoid a zero scale for all-zero rows
    return scale


def fake_quantize(weight: torch.Tensor, scale: torch.Tensor, n_bits: int = 8):
    """Quantize, then dequantize, so the rounding error can be inspected."""
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -qmax - 1
    q = torch.clamp(torch.round(weight / scale), qmin, qmax)
    return q * scale  # dequantize to inspect the error


w = torch.randn(4096, 4096)
scale = collect_quant_scale(w, n_bits=8)
w_hat = fake_quantize(w, scale, n_bits=8)
err = (w - w_hat).abs().mean().item()
print(f"mean absolute error: {err:.6f}")
Per-channel scales produce smaller error than a single scale for the whole tensor, because one channel's outlier no longer poisons the others. In practice, techniques like SmoothQuant (shifting activation outliers into the weights to balance quantization difficulty) and AWQ (using activation statistics to find the most important weight channels and rescaling them so they lose less precision) are widely used.
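The outlier effect is easy to reproduce. The following dependency-free sketch builds a small weight matrix in which one channel contains a single large value, then compares a tensor-wide scale with per-channel scales. The matrix size, the outlier value, and the function name are arbitrary choices for illustration.

```python
import random


def quant_error(rows, per_channel, n_bits=8):
    """Mean absolute error of symmetric quantization over a list of rows."""
    qmax = 2 ** (n_bits - 1) - 1
    tensor_max = max(abs(v) for row in rows for v in row)
    total, count = 0.0, 0
    for row in rows:
        # Per-channel: each row gets its own range. Per-tensor: all rows share one.
        amax = max(abs(v) for v in row) if per_channel else tensor_max
        scale = max(amax / qmax, 1e-8)
        for v in row:
            q = max(-qmax, min(qmax, round(v / scale)))
            total += abs(v - q * scale)
            count += 1
    return total / count


random.seed(0)
# 8 output channels of 256 weights; channel 0 carries a single large outlier.
weights = [[random.gauss(0, 0.02) for _ in range(256)] for _ in range(8)]
weights[0][0] = 4.0

print("per-tensor :", quant_error(weights, per_channel=False))
print("per-channel:", quant_error(weights, per_channel=True))
```

Only channel 0 pays for its outlier in the per-channel case. The other seven channels keep a fine step size.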
What to Quantize
- Weight-only (INT4/INT8 weights, FP16 compute): effective for memory-bound decoding, since it shrinks the bandwidth needed to read weights.
- Weight+activation (e.g., W8A8): quantize both so the math itself runs on integer/low-precision tensor cores. Adds gains in compute-bound regions such as prefill.
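For the weight-only case, the bandwidth saving comes from how the weights are stored. A minimal sketch of INT4 packing follows: two signed 4-bit values share one byte and are expanded back to floats just before the multiply. The nibble order and the function names are assumptions. Real kernels use layouts tuned to their hardware.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (-8 <= lo <= 7 and -8 <= hi <= 7):
            raise ValueError("value out of INT4 range")
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit integers."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values


def dequant_dot(packed_row, scale, x):
    """Weight-only path: unpack INT4, rescale to float, then do the dot product."""
    w = [q * scale for q in unpack_int4(packed_row)]
    return sum(wi * xi for wi, xi in zip(w, x))


row = [-8, 7, 3, -1, 0, 5]
packed = pack_int4(row)
print(len(row), "weights stored in", len(packed), "bytes")
print(dequant_dot(packed, scale=0.5, x=[1.0] * 6))
```

The arithmetic still happens in floating point. Only the bytes read from memory shrink, which is what matters when decode is bandwidth-bound.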
Sparsity — Making Data Fewer
If quantization makes each value smaller, sparsity skips values that are zero, reducing the count of values outright.
The catch is that unstructured sparsity is not hardware friendly. Even if 90% of weights are zero, randomly placed zeros make it hard to exploit the GPU's coalesced memory access and the dense structure of tensor cores. The overhead of chasing indices eats the savings.
That is why structured sparsity exists. The canonical example is 2:4 sparsity, supported by tensor cores since NVIDIA Ampere. The rule is simple: in every group of four consecutive values, at least two must be zero.
dense:   [ 0.8, 0.1, -0.3, 0.5 ]
2:4:     [ 0.8, 0.0,  0.0, 0.5 ]   <- at most 2 of every 4 are non-zero
storage: 2 non-zero values + a 2-bit position index for each
Hardware loves this pattern. The number of non-zeros is fixed (two of four), so memory access stays regular, and the tensor core has built-in circuitry to pick the pairs by their indices and perform half the multiplications. Ideally, matmul throughput roughly doubles.
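The storage format and the halved multiply count can be sketched in plain Python. This models only the bookkeeping (kept values plus 2-bit positions) and is not a description of the actual tensor-core data layout. The function names are hypothetical.

```python
def compress_2to4(row):
    """Compress a 2:4-sparse row into kept values and their 2-bit positions."""
    assert len(row) % 4 == 0
    values, positions = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        # Indices of the two largest magnitudes, kept in ascending position order.
        keep = sorted(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        for i in keep:
            values.append(group[i])
            positions.append(i)          # 0..3 fits in 2 bits
    return values, positions


def sparse_dot(values, positions, x):
    """Dot product that touches only the kept weights (half the multiplies)."""
    total = 0.0
    for k, (v, p) in enumerate(zip(values, positions)):
        group_start = (k // 2) * 4       # two kept values per group of four
        total += v * x[group_start + p]
    return total


row = [0.8, 0.0, 0.0, 0.5, 0.0, -0.3, 0.2, 0.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
values, positions = compress_2to4(row)
dense = sum(w * xi for w, xi in zip(row, x))

print("kept values:", values)
print("positions:  ", positions)
print("sparse dot:", sparse_dot(values, positions, x), "dense dot:", dense)
```

Eight weights become four stored values and four small indices, and the dot product performs four multiplications instead of eight.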
# conceptual code to build a 2:4 sparse mask
import torch


def make_2to4_mask(weight: torch.Tensor):
    """Within each group of 4 along the row, zero out the 2 smallest by magnitude."""
    out, in_dim = weight.shape
    assert in_dim % 4 == 0
    w = weight.view(out, in_dim // 4, 4)
    # keep only the top-2 positions by magnitude in each group
    idx = w.abs().argsort(dim=-1, descending=True)
    mask = torch.zeros_like(w)
    keep = idx[..., :2]
    mask.scatter_(-1, keep, 1.0)
    return (w * mask).view(out, in_dim)


w = torch.randn(1024, 1024)
w_sparse = make_2to4_mask(w)
print("zero ratio:", (w_sparse == 0).float().mean().item())  # about 0.5
In practice, simply zeroing small values degrades accuracy, so it is followed by fine-tuning to recover. Combined with quantization (e.g., 2:4 + INT8), you can shrink both memory and compute, but the two error sources stack, so accuracy validation must be more careful.
Activation sparsity exists too. The many zeros after ReLU-family functions, and the fact that MoE (Mixture of Experts) activates only some experts per token, are both variants of sparsity. MoE is effectively dynamic structured sparsity: the total parameter count is large, but only a fraction of the compute happens per token.
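The MoE case can be sketched as top-k routing. The version below renormalizes the gate probabilities of the selected experts, which is one common variant. The toy experts and gate logits are made up for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route_top_k(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts; the others cost no compute for this token."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_logits, k))


# Eight toy "experts" (hypothetical): each one just scales its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

chosen = route_top_k(gate_logits, k=2)
print("experts used:", [i for i, _ in chosen], "of", len(experts))
print("output:", moe_forward(3.0, experts, gate_logits))
```

All eight experts hold parameters, but each token pays for only two of them.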
Dataflow Architecture — Moving Data Less
The third axis is how data flows inside the chip. The key insight is to reuse data once fetched as much as possible. Because reading data once from HBM costs far more than the on-chip computation, designs that pin data on chip and use it many times are decisive.
Systolic Arrays and Data Reuse
The systolic array, famously adopted by Google's TPU, is the crystallization of this philosophy. Multiply-accumulators (MACs) are laid out in a grid; data pushed in from one side flows cell to cell, and computation proceeds like a pipeline. Weights loaded once stay in the grid and are reused across many inputs, drastically cutting memory access.
systolic array (weight stationary)
input -> [w][w][w][w]
input -> [w][w][w][w]    each [w] is a multiply-accumulate cell
input -> [w][w][w][w]    weights stay in the cells
input -> [w][w][w][w]    only inputs/partial sums flow
              v
         output (partial-sum accumulation)
Dataflow designs are classified by what they pin on chip.
- Weight stationary: keep weights in cells, stream inputs. High weight reuse.
- Output stationary: accumulate partial sums in cells. High output reuse.
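- Row stationary: keep a row of the computation in each processing element so that weights, inputs, and partial sums are all reused locally. An energy-efficient compromise (research such as Eyeriss).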
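The difference between the first two dataflows is essentially a loop order. The sketch below is a counting model, not a cycle-accurate systolic array: it computes the same matrix product twice and tallies how often a weight is fetched and how often a partial sum leaves its cell. The counters and function names are illustrative assumptions.

```python
def weight_stationary(W, X):
    """Pin each weight once, then stream every input past it."""
    rows, cols = len(W), len(W[0])
    Y = [[0.0] * rows for _ in X]
    weight_loads = 0
    psum_moves = 0
    for i in range(rows):
        for j in range(cols):
            w = W[i][j]                 # loaded once, stays in the cell
            weight_loads += 1
            for b, x in enumerate(X):
                Y[b][i] += w * x[j]     # partial sum leaves the cell each time
                psum_moves += 1
    return Y, weight_loads, psum_moves


def output_stationary(W, X):
    """Pin each output accumulator, then stream weights and inputs past it."""
    rows, cols = len(W), len(W[0])
    Y = [[0.0] * rows for _ in X]
    weight_loads = 0
    psum_moves = 0
    for b, x in enumerate(X):
        for i in range(rows):
            acc = 0.0                   # accumulator stays in the cell
            for j in range(cols):
                acc += W[i][j] * x[j]   # weight fetched again for every output
                weight_loads += 1
            Y[b][i] = acc               # written out once
            psum_moves += 1
    return Y, weight_loads, psum_moves


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                       # 2 outputs x 3 inputs
X = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.5], [2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]  # batch of 4
for name, fn in (("weight stationary", weight_stationary),
                 ("output stationary", output_stationary)):
    _, loads, moves = fn(W, X)
    print(f"{name}: weight loads={loads}, partial-sum moves={moves}")
```

Both orders give the same result. They differ only in which operand is reused in place, and that is the choice each dataflow makes.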
Tiling — Being Mindful of the Memory Hierarchy
On GPUs the same principle appears as tiling. A large matmul is split into small blocks; one block is loaded into fast on-chip memory (shared memory/registers), and as much computation as possible is finished there. The goal is to reduce slow HBM round trips.
split C = A x B into tiles
    A           B           C
[A00 A01]   [B00 B01]   [C00 C01]
[A10 A11]   [B10 B11]   [C10 C11]
C00 = A00*B00 + A01*B10    <- accumulate per tile
keep a tile in shared memory and reuse it
Choosing tile size is a trade-off. Large tiles increase reuse but may not fit on-chip; small tiles fit but increase HBM traffic. Automatically searching this sweet spot is the job of the compiler discussed later.
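The reuse side of this trade-off can be made concrete with a toy model. The sketch below multiplies two square matrices tile by tile and counts how many elements are fetched from "slow" memory, assuming one tile of A and one tile of B are loaded per inner step. The counting model and names are illustrative and ignore on-chip capacity limits.

```python
def tiled_matmul(A, B, tile):
    """C = A x B computed tile by tile, counting elements fetched from slow memory."""
    n = len(A)
    assert n % tile == 0
    C = [[0.0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # One tile of A and one tile of B are brought on chip here.
                loads += 2 * tile * tile
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = C[i][j]
                        for k in range(k0, k0 + tile):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C, loads


n = 16
A = [[(i + j) % 5 for j in range(n)] for i in range(n)]
B = [[(i * j) % 7 for j in range(n)] for i in range(n)]
reference, _ = tiled_matmul(A, B, tile=n)
for tile in (1, 4, 16):
    C, loads = tiled_matmul(A, B, tile)
    print(f"tile={tile:2d}: slow-memory loads={loads}, matches reference={C == reference}")
```

In this model the traffic is 2 * n^3 / tile elements, so quadrupling the tile edge cuts slow-memory reads by four, as long as the tiles still fit on chip.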
Operator Fusion — Not Writing Intermediates to Memory
Deep learning graphs chain many small operations: matmul, then bias add, then activation, then normalization. Run each separately and every step writes an intermediate tensor to HBM and reads it back in the next step. From a memory-bound viewpoint, this is waste piled on waste.
Operator fusion merges these consecutive operations into a single kernel, keeping intermediates in on-chip registers/shared memory and carrying them forward. The HBM round trips vanish.
before: MatMul -> [HBM] -> Bias -> [HBM] -> GELU -> [HBM]
after: MatMul + Bias + GELU (one kernel, intermediates stay on chip)
The most famous case is FlashAttention. Instead of materializing attention's huge intermediate (the score matrix) in HBM, it streams it in blocks and accumulates the softmax online. It reduces memory usage while breaking the bandwidth bottleneck, making long-context inference practical. Fusion is sometimes found automatically by the compiler, and for hot paths people sometimes write the kernels by hand.
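The online softmax at the core of that idea fits in a few lines. The sketch below handles a single query with scalar values, and compares the streaming result with a reference that materializes every score. It illustrates the rescaling trick only and is not the FlashAttention kernel itself. The block size and function names are arbitrary.

```python
import math


def softmax_weighted_sum(scores, values):
    """Reference: materialize all scores, then softmax, then weight the values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total


def online_softmax_weighted_sum(scores, values, block=4):
    """Streaming version: keep only a running max, denominator, and accumulator."""
    run_max = float("-inf")
    denom = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(run_max, max(s_blk))
        # Rescale what was accumulated under the old max.
        correction = math.exp(run_max - new_max)
        denom *= correction
        acc *= correction
        for s, v in zip(s_blk, v_blk):
            p = math.exp(s - new_max)
            denom += p
            acc += p * v
        run_max = new_max
    return acc / denom


scores = [0.3, 2.1, -1.0, 4.2, 0.0, 3.3, -2.5, 1.7, 5.0, 0.9]
values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.0, 0.0, 1.5, -0.5]
print("reference:", softmax_weighted_sum(scores, values))
print("streaming:", online_softmax_weighted_sum(scores, values, block=4))
```

The streaming version never holds more than one block of scores at a time, which is what lets the intermediate stay in on-chip memory.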
Batching and KV Cache — The Tug-of-War of Throughput and Latency
The core reason decoding is memory-bound is that with batch 1 you read the weights once to make a single token. So if you batch many requests and make many tokens from one weight read, arithmetic intensity rises and the chip does real work.
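How quickly intensity rises with batch size can be estimated for a single linear layer. The sketch assumes FP16 weights and activations and an 8192 x 8192 layer. Both are illustrative choices, and the model ignores caching effects.

```python
def decode_intensity(batch, d_in=8192, d_out=8192, bytes_per_elem=2):
    """FLOPs per byte for one linear layer during decode, at a given batch size."""
    flops = 2 * batch * d_in * d_out                      # multiply + add per weight per request
    weight_bytes = d_in * d_out * bytes_per_elem          # read once, shared by the batch
    act_bytes = batch * (d_in + d_out) * bytes_per_elem   # inputs read, outputs written
    return flops / (weight_bytes + act_bytes)


for batch in (1, 4, 16, 64, 256):
    print(f"batch={batch:3d}: {decode_intensity(batch):6.1f} FLOP/byte")
```

While the weights dominate the traffic, intensity grows almost linearly with the batch. It flattens only once activation traffic becomes comparable.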
The problem is that in real serving, requests arrive at different times and have different lengths. The answer is continuous batching. Rather than waiting for a fixed batch, at each generation step it drops finished requests and slots in new ones so the GPU never idles.
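A toy simulation shows why this helps. Each request is a pair of arrival step and number of tokens to generate, and one decode step emits one token for every running request. The scheduler below is a simplified sketch with hypothetical names. It ignores prefill cost and memory limits.

```python
from collections import deque


def continuous_batching(requests, max_batch):
    """Simulate decode steps; `requests` is a list of (arrival_step, tokens_to_generate)."""
    waiting = deque(sorted(enumerate(requests), key=lambda r: r[1][0]))
    running = {}            # request id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit arrived requests into any free slots before this step.
        while waiting and len(running) < max_batch and waiting[0][1][0] <= step:
            rid, (_, length) = waiting.popleft()
            running[rid] = length
        # One decode step: every running request emits one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished_at[rid] = step + 1
        step += 1
    return finished_at, step


def static_batching(requests, max_batch):
    """Baseline: fixed batches; each batch runs until its longest request ends."""
    step = 0
    for start in range(0, len(requests), max_batch):
        batch = requests[start:start + max_batch]
        step = max(step, max(arrival for arrival, _ in batch))
        step += max(length for _, length in batch)
    return step


requests = [(0, 3), (0, 8), (1, 2), (2, 4)]
finished_at, total_steps = continuous_batching(requests, max_batch=2)
print("continuous batching:", total_steps, "steps, finish times", finished_at)
print("static batching:    ", static_batching(requests, max_batch=2), "steps")
```

In the static baseline, the slot freed by the short request sits empty until the long one finishes. The continuous scheduler refills it at the next step.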
KV Cache
To avoid recomputing the key/value of previous tokens, the transformer stores them in the KV cache. This cache grows linearly with sequence length and batch size, quickly consuming memory.
KV cache size (bytes)
    = 2 (K and V) x layers x kv_heads x head_dim
      x seq_len x batch x bytes_per_elem
(kv_heads equals the number of attention heads unless heads share key/value)
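As a worked instance of the formula, the sketch below plugs in dimensions typical of a 13B-class model (40 layers, 40 heads, head dimension 128, FP16). These dimensions are assumptions for illustration and do not come from the text. `kv_heads` equals the head count for plain multi-head attention.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Size of the KV cache in bytes; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Assumed dimensions for a 13B-class model (illustrative, not from the text).
LAYERS, HEADS, HEAD_DIM = 40, 40, 128

per_token = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, seq_len=1, batch=1)
print(f"per token: {per_token / 1e6:.2f} MB")
for seq_len, batch in ((2048, 1), (4096, 8), (32768, 4)):
    size = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, seq_len, batch)
    print(f"seq_len={seq_len:6d} batch={batch}: {size / 1e9:6.1f} GB")
```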
For a 13B-class model with long context, the KV cache can reach tens of GB. Hence these techniques.
- PagedAttention: manage the KV cache in small blocks like OS paging to reduce fragmentation (the heart of vLLM).
- KV cache quantization: store the cache in INT8/FP8 to halve capacity and bandwidth.
- GQA/MQA: have multiple attention heads share key/value to shrink the cache.
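The paging idea in the first bullet can be sketched as a block allocator. This is a toy model with hypothetical names, not vLLM's implementation: each sequence owns a list of fixed-size blocks, and a new block is taken from the free pool only when the current one is full.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.block_table = {}     # seq id -> list of physical block ids
        self.lengths = {}         # seq id -> tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; allocate a block only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full (or first token)
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.block_table.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.block_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Internal fragmentation: reserved but unused token slots."""
        return sum(len(blocks) * self.block_size - self.lengths[seq]
                   for seq, blocks in self.block_table.items())


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")
for _ in range(5):
    alloc.append_token("b")
print("blocks per sequence:", {s: len(b) for s, b in alloc.block_table.items()})
print("wasted slots:", alloc.wasted_slots(), "free blocks:", len(alloc.free))
alloc.release("a")
print("free blocks after release:", len(alloc.free))
```

Waste is bounded by less than one block per sequence, instead of by the gap between a preallocated maximum length and the length actually used.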
Batching and KV cache management are ultimately a tug-of-war between throughput and latency. Larger batches raise throughput but can slow individual responses. Workloads where time-to-first-token (TTFT) matters, such as chatbots, balance at a different point than bulk-batch workloads where total throughput matters.
Compiler Optimization — Searching Instead of Hand-Tuning
The tile-size choices, fusion decisions, and memory layouts seen above are endless to tune by hand. So deep learning compilers automate this search.
There are roughly these layers.
- Graph-level IR: takes the op graph and applies fusion, constant folding, layout transforms (e.g., the graph stages of TVM, XLA, TorchInductor).
- Tile/schedule search: tiles ops to the hardware and orders loops. Triton is a prime example — write a GPU kernel in near-Python syntax and the compiler handles memory layout and parallelization.
- Auto-tuning: actually run multiple tile-size/schedule candidates and pick the fastest.
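# Triton kernel: elementwise add with ReLU fused into the same pass
# (requires the triton package and a CUDA-capable GPU)
import triton
import triton.language as tl


@triton.jit
def fused_add_relu(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                     # block index
    offs = pid * BLOCK + tl.arange(0, BLOCK)   # elements handled by this block
    mask = offs < n                            # guard the ragged last block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    z = x + y
    z = tl.where(z > 0, z, 0.0)                # ReLU fused in the same kernel
    tl.store(out_ptr + offs, z, mask=mask)

# launch: fused_add_relu[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)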
The key is separating "what to compute" (the math) from "how to compute it" (the schedule). For the same matmul, the optimal schedule differs by chip and by shape. The compiler sweeps that enormous search space faster than a human.
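Auto-tuning itself is a short loop: run each candidate configuration, time it, keep the fastest. The sketch below tunes the block size of a toy pure-Python "kernel". The kernel, the candidate list, and the function names are placeholders for illustration, and real auto-tuners add warm-up runs and result caching.

```python
import time


def blocked_sum_of_squares(data, block):
    """Toy 'kernel' with a tunable block size; the result is the same for any block."""
    total = 0
    for start in range(0, len(data), block):
        total += sum(v * v for v in data[start:start + block])
    return total


def autotune(kernel, data, candidates, repeats=3):
    """Time every candidate configuration and return the fastest one."""
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, block)
            best = min(best, time.perf_counter() - t0)
        timings[block] = best
    return min(timings, key=timings.get), timings


data = list(range(50_000))
candidates = [16, 256, 4096]
best_block, timings = autotune(blocked_sum_of_squares, data, candidates)
for block in candidates:
    print(f"block={block:5d}: {timings[block] * 1e3:.2f} ms")
print("selected block size:", best_block)
```

The math is fixed by the kernel. Only the schedule parameter changes, and the winner depends on the machine it runs on.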
The Accuracy-Efficiency Trade-off — No Free Lunch
Every technique so far buys efficiency at the price of accuracy. What matters is measuring the exchange rate.
| Technique | What you gain | What you may lose | Mitigation |
|---|---|---|---|
| INT8 quantization | Half bandwidth/memory | Small accuracy drop | Per-channel scale, calibration |
| FP4/INT4 | Quarter bandwidth | Larger drop at lower bits | QAT, microscaling, mixed precision |
| 2:4 sparsity | About 2x compute | Loss of expressivity | Fine-tune after sparsification |
| KV cache quantization | Half the cache | Accumulated error in long context | Use FP8, preserve sensitive layers |
A few practical principles.
- Apply one technique at a time and measure accuracy. Turning several on at once hides where it broke.
- Look at tail cases, not just averages. Quantization often leaves average accuracy intact but collapses on specific inputs.
- Do not treat every layer at the same precision. Keep sensitive parts — attention outputs, first/last layers — at higher precision (mixed precision).
Hardware-Software Co-design — The Big Picture
By now one thing is clear: quantization formats, sparsity patterns, dataflow, and compilers are not independent — they are designed together.
- Because Blackwell's 2nd-gen Transformer Engine supports FP4 natively, it becomes worthwhile for software to adopt microscaling quantization.
- Because tensor cores understand the 2:4 pattern in circuitry, frameworks sparsify to match it.
- The limits of HBM bandwidth and the NVLink/UALink interconnect determine tile sizes and batching strategy.
The 2026 trajectory extends this co-design. NVIDIA set a goal for next-generation Vera Rubin of roughly 10x perf/watt alongside HBM4, while Google aimed squarely at inference efficiency with TPU v6 Trillium (about 4.7x peak versus the prior generation) and the inference-specialized 7th-generation Ironwood. Cloud providers' in-house inference ASICs are rising fast (from about 15% share in 2024 toward an expected 40% in 2026), accelerating the trend of tailoring chips to inference workloads. NVIDIA still holds roughly 75-80% of the accelerator market, but competition such as AMD MI350X and the rise of in-house ASICs are layering the inference landscape.
The takeaway is to view inference optimization not as a single trick but as a stack that moves together with hardware generations.
Putting It Into Practice — Where to Start
Suppose you must serve a new model. Here is a suggested order.
- Measure first. Profile whether the workload is prefill- or decode-heavy, memory- or compute-bound. Optimizing without knowing the bottleneck is wasted effort.
- Use a proven serving engine first. Engines like vLLM and TensorRT-LLM already have continuous batching, PagedAttention, and fused kernels. You rarely need to build from scratch.
- Start with weight quantization. If decoding is memory-bound, INT8/INT4 weight quantization gives the biggest immediate win. Measure accuracy to confirm it passes the bar.
- Manage the KV cache. For long context, free up memory with PagedAttention and KV cache quantization.
- Then sparsity / low-bit. If that is not enough, move to more aggressive techniques like 2:4 sparsity and FP4 — always with fine-tuning and validation.
- Exploit the chip generation. To use the latest FP4/FP8 tensor cores, you need matching compiler/runtime versions. Align hardware and software versions together.
Common pitfalls, too.
- Shipping quantization to production without accuracy validation, only to have quality collapse on certain inputs.
- Banking on unstructured sparsity and finding no real speedup (without structured sparsity, the hardware gain is near zero).
- Cranking the batch up until throughput improves but time-to-first-token ruins the user experience.
How Inference and Training Workloads Differ
There is a reason every optimization so far focused on "inference." Inference and training handle the same model, but the nature of the load the hardware faces is entirely different.
| Aspect | Training | Inference |
|---|---|---|
| Direction | Forward + backward | Forward only |
| Precision | BF16/FP8 centric, numerical stability matters | Aggressive down to INT8/FP4 |
| Memory | Explodes with activations/optimizer state | Weights + KV cache centric |
| Batch | Large to maximize throughput | Often split small under latency limits |
| Metrics | Convergence speed, total throughput | Latency (TTFT/per-token), throughput, cost |
Training runs once per model, so throughput is nearly everything. Inference, by contrast, repeats endlessly, and the latency each user experiences is part of the product's quality. So inference optimization is the harder problem of satisfying both average throughput and tail latency at once.
This difference is reflected in chip design. Training chips emphasize high precision, huge memory, and fast chip-to-chip interconnect, while inference-specialized chips (e.g., TPU Ironwood, various inference ASICs) prioritize low-precision compute density, perf/watt, and low latency. That is the backdrop for the flood of inference-only chips in 2026, the year inference capex overtook training.
A Small Case Study — Following the Numbers
Let us bundle the earlier concepts into one hypothetical scenario. Assume we serve a 70B model as a chatbot on a single GPU.
starting point (FP16, batch 1)
    weights about 140GB -> exceeds single-GPU memory, some offloading needed
    one weight read per token -> about 24 tokens/sec at best (HBM-bandwidth ceiling; offloaded weights would be slower)
    GPU compute-unit utilization: very low
step 1: INT8 weight quantization
    weights about 70GB -> memory headroom, half the bandwidth
    about 2x speed per token -> about 45-48 tokens/sec
    accuracy: near-lossless with per-channel scale
step 2: continuous batching (16 concurrent requests)
    16 tokens generated per single weight read
    arithmetic intensity rises -> GPU utilization greatly improved
    total throughput several times higher, individual latency slightly up
step 3: KV cache quantization + PagedAttention
    fragmentation suppressed even at long context
    more concurrent requests accepted -> cost/request falls
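The figures in this scenario follow from a simple bandwidth model: each decode step reads the weights once plus every running request's KV cache. The sketch below reproduces them as upper bounds. The 1 GB of KV cache per request is a hypothetical figure chosen only to show the effect, and compute time is ignored.

```python
BANDWIDTH = 3.35e12          # bytes/s, the figure used earlier in the text


def throughput(weight_bytes, batch, kv_bytes_per_request=0.0):
    """Bandwidth-bound decode: total tokens/s across the whole batch."""
    step_bytes = weight_bytes + batch * kv_bytes_per_request
    step_time = step_bytes / BANDWIDTH
    return batch / step_time


fp16_b1 = throughput(140e9, batch=1)
int8_b1 = throughput(70e9, batch=1)
int8_b16 = throughput(70e9, batch=16)
# Assumed: each request carries 1 GB of KV cache (hypothetical figure).
int8_b16_kv = throughput(70e9, batch=16, kv_bytes_per_request=1e9)
int8_b16_kv8 = throughput(70e9, batch=16, kv_bytes_per_request=0.5e9)  # KV cache halved

print(f"FP16, batch 1:              {fp16_b1:7.1f} tokens/s")
print(f"INT8, batch 1:              {int8_b1:7.1f} tokens/s")
print(f"INT8, batch 16, no KV cost: {int8_b16:7.1f} tokens/s")
print(f"INT8, batch 16, 1 GB KV:    {int8_b16_kv:7.1f} tokens/s")
print(f"INT8, batch 16, 0.5 GB KV:  {int8_b16_kv8:7.1f} tokens/s")
```

Once the batch grows, KV cache traffic starts to compete with weight traffic. That is why step 3 targets the cache rather than the weights.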
The lesson is clear. No single technique is magic, but stacking them in order along the bottleneck makes throughput jump several-fold on the same chip and drops cost per request sharply. And at every step you must measure accuracy and latency to filter out changes that "look free but actually have a cost."
The core is the loop of measure -> remove the biggest bottleneck -> measure again. Knowing where your workload's point sits on the roofline comes before memorizing a flashy list of techniques.
Conclusion
Making inference fast compresses into one sentence: make data smaller (quantization), fewer (sparsity), and moved less (dataflow and fusion), then connect it through batching and compilers so the chip actually uses it. Every one of these strategies grows from the same root — the memory wall.
The chips of 2026 are increasingly designed with inference as priority one. As FP4 moves into the circuitry, inference-only ASICs proliferate, and compilers grow smarter, the engineer's role shifts from "which magic trick do I enable" to "can I diagnose my workload's bottleneck precisely and choose the right trade-off." Measure, apply one thing at a time, and guard accuracy — that discipline is the real secret to fast inference.
References
- NVIDIA Blackwell architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
- NVIDIA Transformer Engine docs: https://docs.nvidia.com/deeplearning/transformer-engine/
- Google Cloud TPU: https://cloud.google.com/tpu
- vLLM (PagedAttention): https://docs.vllm.ai/
- NVIDIA TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM
- OpenAI Triton: https://triton-lang.org/
- FlashAttention paper (arXiv): https://arxiv.org/abs/2205.14135
- General quantization/sparsity search (arXiv): https://arxiv.org/list/cs.LG/recent
- SemiAnalysis (industry analysis): https://www.semianalysis.com/