Making Inference Fast — Quantization, Sparsity, and Dataflow from a Hardware Lens

Introduction

A trained model is made once, but inference runs billions of times a day for as long as the model lives. That is why, in 2026, the industry reported inference capex overtaking training capex for the first time. Making a model larger and smarter, and serving that model cheaply and quickly, have become genuinely separate engineering problems.

This essay decomposes the goal of "making inference fast" from a hardware perspective. The central claim is simple: modern inference is rarely slow because it runs out of compute. It is slow because it spends its time moving data. From that single fact, three strategies interlock — quantization to make data smaller, sparsity to make data fewer, and dataflow design to move data less. Add operator fusion, batching, KV caching, and compiler optimization, and the same chip can deliver several times the throughput.

The tone here is calm but grounded — every idea anchored in numbers and code.


The Cost Structure of Inference — Why Memory Is the Bottleneck

Start with the cost structure. Two numbers govern a GPU's performance: compute throughput (FLOPS) and memory bandwidth (bytes/s). A workload, in turn, has an arithmetic intensity: how many operations it performs for every byte it reads from memory. Dividing the chip's compute throughput by its bandwidth gives the intensity at which the two limits balance, called the ridge point.

The roofline model captures this at a glance.

Performance
(FLOPS)
  ^
  |                    ___________________  compute-bound ceiling
  |                   /
  |                  /
  |                 /  <- this slope is memory bandwidth
  |                /
  |               /     memory-bound region
  |              /
  +-------------+--------------------------> arithmetic intensity (FLOP/byte)
              ridge point

When intensity is low (left), bandwidth determines performance; when high (right), compute does. The problem is that the decoding phase of LLM inference is extremely memory-bound.
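The roofline can be written as a one-line formula: attainable performance is the smaller of the compute ceiling and bandwidth times arithmetic intensity. The sketch below assumes a hypothetical accelerator with 1000 TFLOPS of peak compute (an illustrative figure, not a quoted spec) and the 3.35TB/s bandwidth used elsewhere in this essay.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline: performance is capped by compute or by bandwidth * intensity."""
    return min(peak_flops, intensity * bandwidth)

def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two ceilings meet."""
    return peak_flops / bandwidth

# Hypothetical accelerator: the peak value is an assumption for illustration.
PEAK = 1000e12   # FLOP/s
BW = 3.35e12     # bytes/s

ridge = ridge_point(PEAK, BW)
for intensity in (1, 10, 100, 1000):
    perf = attainable_flops(intensity, PEAK, BW)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"intensity {intensity:>5} FLOP/byte -> {perf / 1e12:8.2f} TFLOPS ({bound})")
```

Under these assumed numbers a workload needs roughly 300 FLOP per byte before the compute ceiling is reached, which is why an intensity near 1 leaves most of the chip idle.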

LLM inference splits into two phases.

  • Prefill: processes the entire input prompt at once. Heavy on matrix-matrix products, high arithmetic intensity, closer to compute-bound.
  • Decode: generates tokens one at a time. Every step reads the entire model weights to do a matrix-vector product. With a small batch, arithmetic intensity drops near 1 and the workload becomes thoroughly memory-bound.

Feel it in numbers. A 70-billion-parameter model stored in FP16 is roughly 140GB. Decoding one token requires reading those weights once. At about 3.35TB/s of memory bandwidth, just sweeping the weights once takes about 42 milliseconds. So with batch 1, roughly 24 tokens per second is the theoretical ceiling. The compute units sit almost idle.
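The arithmetic above fits in a few lines. This is a back-of-the-envelope sketch that assumes each decoded token reads every weight exactly once and ignores KV cache traffic and compute time; the function name is illustrative.

```python
def decode_ceiling(n_params: float, bytes_per_param: float, bandwidth: float):
    """Return (seconds per token, tokens per second) if each token reads all weights once."""
    weight_bytes = n_params * bytes_per_param
    seconds_per_token = weight_bytes / bandwidth
    return seconds_per_token, 1.0 / seconds_per_token

# 70B parameters, FP16 (2 bytes each), 3.35 TB/s of memory bandwidth
sec, tps = decode_ceiling(n_params=70e9, bytes_per_param=2, bandwidth=3.35e12)
print(f"{sec * 1e3:.1f} ms per token, {tps:.1f} tokens/s upper bound")
```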

This is where the term memory wall comes in. Over the past decade-plus, compute performance has grown far faster than memory bandwidth. As a result, in most inference workloads the chip's compute capacity is plentiful and its ability to feed data is scarce. It also matters that the energy to move data is more than an order of magnitude larger than the energy to compute. Relative to an on-chip register access, fetching data from HBM costs roughly hundreds of times more.

Every other strategy in this essay follows from that one fact. Make data smaller (quantization), fewer (sparsity), and moved less (dataflow).


Quantization — Making Data Smaller

Quantization represents weights and activations with fewer bits. Going from FP16 (16 bits) to INT8 (8 bits) halves memory footprint and bandwidth demand. On a memory-bound workload, that almost directly translates into a 2x speedup.

The Landscape of Precision Formats

The major formats you encounter in practice in 2026 are these.

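Format             | Bits | Primary use                        | Hardware support
-------------------+------+------------------------------------+---------------------------------------
FP16 / BF16        | 16   | Training, high-precision inference | Broad
FP8 (E4M3 / E5M2)  | 8    | Inference, some training           | Hopper onward, TPU
INT8               | 8    | Inference (PTQ/QAT)                | Very broad
FP4 (MXFP4, etc.)  | 4    | Inference                          | Blackwell (2nd-gen Transformer Engine)
INT4               | 4    | Weight-only quantization           | Broad (kernel dependent)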

NVIDIA Blackwell's second-generation Transformer Engine (announced at GTC 2024) supports FP4 natively. FP4 represents a float in 4 bits, which on its own has too narrow a range and collapses accuracy. So microscaling is used: each small block (say 16 or 32 values) carries its own shared scale factor, so the dynamic range is matched block by block.
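To make the block-scale idea concrete, here is a minimal sketch in plain Python. It uses the eight magnitudes a 4-bit E2M1 float can represent and gives each block one power-of-two scale. The scale rule (smallest power of two that keeps the block maximum inside the grid) is a simplifying assumption, and the function names are illustrative; real microscaling formats define their own scale derivation and bit packing.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block with a shared power-of-two scale; return (scale, dequantized values)."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Assumed rule: smallest power of two that maps the block max onto the grid (max 6).
    scale = 2.0 ** math.ceil(math.log2(amax / E2M1_GRID[-1]))
    out = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))  # nearest grid point
        out.append(math.copysign(mag * scale, v))
    return scale, out

def quantize_microscaled(values, block_size=32):
    """Quantize a flat list block by block, each block with its own scale."""
    result = []
    for i in range(0, len(values), block_size):
        _, dequantized = quantize_block(values[i:i + block_size])
        result.extend(dequantized)
    return result

# 32 small values followed by 32 large ones
values = [0.01 * k for k in range(1, 33)] + [10.0 * k for k in range(1, 33)]
fine = quantize_microscaled(values, block_size=32)    # one scale per 32 values
coarse = quantize_microscaled(values, block_size=64)  # one scale for everything
print("small values surviving, per-block scale:", sum(v != 0.0 for v in fine[:32]), "of 32")
print("small values surviving, single scale   :", sum(v != 0.0 for v in coarse[:32]), "of 32")
```

With a single scale the large values set the range and the small block rounds to zero; with per-block scales both ranges are represented.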

Symmetric/Asymmetric, PTQ/QAT

The basic quantization formulas are these.

quantize:   q = round(x / scale) + zero_point
dequantize: x_hat = (q - zero_point) * scale

Symmetric quantization fixes zero_point at 0. Asymmetric helps with skewed activations (for example, always-positive values after ReLU).
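The two formulas translate directly into code. The sketch below assumes a signed n-bit integer grid and a non-empty value range (x_max greater than x_min); the helper names are illustrative. It compares symmetric and asymmetric quantization on always-positive values in [0, 6], the kind of range a ReLU-style activation produces.

```python
def affine_params(x_min: float, x_max: float, n_bits: int = 8, symmetric: bool = False):
    """Return (scale, zero_point) for a signed n-bit integer grid."""
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    if symmetric:
        return max(abs(x_min), abs(x_max)) / qmax, 0
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)  # integer code that represents x_min
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits=8):
    qmin, qmax = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    return max(qmin, min(qmax, round(x / scale) + zero_point))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

def max_error(samples, scale, zero_point):
    """Largest round-trip error over the samples."""
    return max(abs(x - dequantize(quantize(x, scale, zero_point), scale, zero_point))
               for x in samples)

samples = [i * 6 / 1000 for i in range(1001)]          # values in [0, 6]
sym_scale, sym_zp = affine_params(0.0, 6.0, symmetric=True)
asym_scale, asym_zp = affine_params(0.0, 6.0, symmetric=False)
sym_err = max_error(samples, sym_scale, sym_zp)
asym_err = max_error(samples, asym_scale, asym_zp)
print(f"symmetric : zero_point={sym_zp}, max error={sym_err:.5f}")
print(f"asymmetric: zero_point={asym_zp}, max error={asym_err:.5f}")
```

Symmetric quantization spends half of its integer codes on negative values that never occur here, so its step size is about twice as coarse.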

By timing, there are two approaches.

  • PTQ (Post-Training Quantization): quantize a trained model as is, fitting only the scales with a small calibration set. Fast and simple, but accuracy can suffer below 4 bits.
  • QAT (Quantization-Aware Training): simulate quantization error during training so the model adapts. Costly, but accuracy holds well at low bit widths.

A Minimal PTQ Calibration Example

import torch

def collect_quant_scale(weight: torch.Tensor, n_bits: int = 8):
    """Compute a symmetric quantization scale per channel."""
    qmax = 2 ** (n_bits - 1) - 1  # 127 for int8
    # per-output-channel (row) max
    per_channel_max = weight.abs().amax(dim=1, keepdim=True)
    scale = per_channel_max / qmax
    scale = scale.clamp(min=1e-8)
    return scale

def fake_quantize(weight: torch.Tensor, scale: torch.Tensor, n_bits: int = 8):
    qmax = 2 ** (n_bits - 1) - 1
    qmin = -qmax - 1
    q = torch.clamp(torch.round(weight / scale), qmin, qmax)
    return q * scale  # dequantize to inspect the error

w = torch.randn(4096, 4096)
scale = collect_quant_scale(w, n_bits=8)
w_hat = fake_quantize(w, scale, n_bits=8)
err = (w - w_hat).abs().mean().item()
print(f"mean absolute error: {err:.6f}")

Per-channel scales produce smaller error than a single scale for the whole tensor, because one channel's outlier no longer poisons the others. In practice, techniques like SmoothQuant (shifting activation outliers into the weights to balance quantization difficulty) and AWQ (using activation statistics to find the most important weight channels and rescaling them so they keep more precision) are widely used.
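The outlier effect is easy to reproduce without any framework. This is a small standard-library sketch on synthetic Gaussian weights with one planted outlier; the data, the seed, and the function name are illustrative assumptions.

```python
import random

def quant_error(rows, per_channel: bool, n_bits: int = 8) -> float:
    """Mean absolute error of symmetric round-to-nearest quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    tensor_max = max(abs(v) for row in rows for v in row)
    total, count = 0.0, 0
    for row in rows:
        # per-channel: each row gets its own range; per-tensor: all rows share one
        amax = max(abs(v) for v in row) if per_channel else tensor_max
        scale = max(amax / qmax, 1e-8)
        for v in row:
            q = max(-qmax, min(qmax, round(v / scale)))
            total += abs(v - q * scale)
            count += 1
    return total / count

random.seed(0)
weights = [[random.gauss(0.0, 1.0) for _ in range(256)] for _ in range(8)]
weights[0][0] = 100.0  # a single outlier in channel 0

per_tensor_err = quant_error(weights, per_channel=False)
per_channel_err = quant_error(weights, per_channel=True)
print(f"per-tensor  mean abs error: {per_tensor_err:.4f}")
print(f"per-channel mean abs error: {per_channel_err:.4f}")
```

With one shared scale, the outlier stretches the step size for all eight rows. With per-channel scales, only the row that contains it pays.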

What to Quantize

  • Weight-only (INT4/INT8 weights, FP16 compute): effective for memory-bound decoding, since it shrinks the bandwidth needed to read weights.
  • Weight+activation (e.g., W8A8): quantize both so the math itself runs on integer/low-precision tensor cores. Adds gains in compute-bound regions such as prefill.
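The difference between the two options shows up in where the multiplication happens. A minimal sketch on a single dot product, assuming symmetric per-vector scales; the helper names are illustrative and do not come from any library.

```python
def quantize_vec(values, n_bits=8):
    """Symmetric per-vector quantization: return (integer codes, scale)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(max(abs(v) for v in values) / qmax, 1e-8)
    return [max(-qmax, min(qmax, round(v / scale))) for v in values], scale

def dot_weight_only(w_codes, w_scale, x):
    """Weight-only: dequantize weights on the fly, multiply in floating point."""
    return sum((q * w_scale) * xi for q, xi in zip(w_codes, x))

def dot_w8a8(w_codes, w_scale, x_codes, x_scale):
    """W8A8: integer multiply-accumulate, one floating-point rescale at the end."""
    acc = sum(qw * qx for qw, qx in zip(w_codes, x_codes))  # integer accumulator
    return acc * w_scale * x_scale

w = [0.5, -1.0, 0.25, 0.75]
x = [2.0, 1.0, -4.0, 0.5]
exact = sum(wi * xi for wi, xi in zip(w, x))

w_codes, w_scale = quantize_vec(w)
x_codes, x_scale = quantize_vec(x)
print("exact      :", exact)
print("weight-only:", dot_weight_only(w_codes, w_scale, x))
print("W8A8       :", dot_w8a8(w_codes, w_scale, x_codes, x_scale))
```

Weight-only saves bytes read but still multiplies in floating point. W8A8 keeps the inner loop in integers, which is what lets low-precision tensor cores do the work.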

Sparsity — Making Data Fewer

If quantization makes each value smaller, sparsity skips values that are zero, reducing the count of values outright.

The catch is that unstructured sparsity is not hardware friendly. Even if 90% of weights are zero, randomly placed zeros make it hard to exploit the GPU's coalesced memory access and the dense structure of tensor cores. The overhead of chasing indices eats the savings.

That is why structured sparsity exists. The canonical example is 2:4 sparsity, supported by tensor cores since NVIDIA Ampere. The rule is simple: in every group of four consecutive values, at least two are zero, so at most two are non-zero.

dense: [ 0.8,  0.1, -0.3,  0.5 ]
2:4:   [ 0.8,  0.0,  0.0,  0.5 ]   <- only 2 of every 4 are non-zero

storage: 2 non-zero values + a 2-bit index per kept value (its position in the group)

Hardware loves this pattern. The number of non-zeros is fixed (two of four), so memory access stays regular, and the tensor core has built-in circuitry to pick the pairs by their indices and perform half the multiplications. Ideally, matmul throughput roughly doubles.
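The storage format and the halved multiplication count can be sketched in plain Python. This illustrates the idea only, not the actual tensor-core layout; the function names are illustrative, and the input row is assumed to already satisfy the 2:4 pattern.

```python
def compress_2to4(row):
    """Compress a 2:4-sparse row into (values, indices): 2 kept values and 2 positions per group of 4."""
    assert len(row) % 4 == 0
    values, indices = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        # positions of the two largest magnitudes, stored in ascending position order
        keep = sorted(sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2])
        for i in keep:
            values.append(group[i])
            indices.append(i)  # 0..3, so it fits in 2 bits
    return values, indices

def sparse_dot(values, indices, x):
    """Dot product that touches only the kept weights: half the multiplies of the dense version."""
    total = 0.0
    for k, (v, i) in enumerate(zip(values, indices)):
        group = k // 2  # two kept values per group of four
        total += v * x[group * 4 + i]
    return total

row = [0.8, 0.0, 0.0, 0.5, 0.0, -0.3, 0.7, 0.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
values, indices = compress_2to4(row)
dense = sum(w * xi for w, xi in zip(row, x))
print("values :", values)
print("indices:", indices)
print("dense dot:", dense, " sparse dot:", sparse_dot(values, indices, x))
```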

# conceptual code to build a 2:4 sparse mask
import torch

def make_2to4_mask(weight: torch.Tensor):
    """Return a copy of `weight` pruned to 2:4: in each group of 4 along a row, zero the 2 smallest magnitudes."""
    out, in_dim = weight.shape
    assert in_dim % 4 == 0, "row length must be a multiple of 4"
    w = weight.reshape(out, in_dim // 4, 4)
    # positions of the top-2 magnitudes in each group of 4
    keep = w.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(w)
    mask.scatter_(-1, keep, 1.0)
    # apply the mask and restore the original shape
    return (w * mask).reshape(out, in_dim)

w = torch.randn(1024, 1024)
w_sparse = make_2to4_mask(w)
print("zero ratio:", (w_sparse == 0).float().mean().item())  # about 0.5

In practice, simply zeroing small values degrades accuracy, so it is followed by fine-tuning to recover. Combined with quantization (e.g., 2:4 + INT8), you can shrink both memory and compute, but the two error sources stack, so accuracy validation must be more careful.

Activation sparsity exists too. The many zeros after ReLU-family functions, and the fact that MoE (Mixture of Experts) activates only some experts per token, are both variants of sparsity. MoE is effectively dynamic structured sparsity: the total parameter count is large, but only a fraction of the compute happens per token.
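A toy top-k router shows why MoE behaves like dynamic sparsity: every expert exists, but only the selected ones run for a given token. In this sketch the experts are stand-in scalar functions and the gate scores are hard-coded; a real model computes the scores with a learned gating layer.

```python
import math

def moe_forward(x, experts, gate_scores, k=2):
    """Run only the top-k experts and mix their outputs with softmax weights."""
    chosen = sorted(range(len(experts)), key=lambda e: gate_scores[e], reverse=True)[:k]
    m = max(gate_scores[e] for e in chosen)
    exp = {e: math.exp(gate_scores[e] - m) for e in chosen}  # softmax over the chosen experts only
    z = sum(exp.values())
    output = sum(exp[e] / z * experts[e](x) for e in chosen)
    return output, chosen

# Eight stand-in "experts": each one just scales its input (illustrative only).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)]
gate_scores = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]  # hypothetical router output for one token

out, chosen = moe_forward(10.0, experts, gate_scores, k=2)
print("experts run:", chosen, "of", len(experts))
print("output     :", out)
```

Only two of the eight experts are evaluated, so per-token compute is about a quarter of the dense equivalent even though all eight sets of parameters must stay resident in memory.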


Dataflow Architecture — Moving Data Less

The third axis is how data flows inside the chip. The key insight is to reuse data once fetched as much as possible. Because reading data once from HBM costs far more than the on-chip computation, designs that pin data on chip and use it many times are decisive.

Systolic Arrays and Data Reuse

The systolic array, famously adopted by Google's TPU, is the crystallization of this philosophy. Multiply-accumulators (MACs) are laid out in a grid; data pushed in from one side flows cell to cell, and computation proceeds like a pipeline. Weights loaded once stay in the grid and are reused across many inputs, drastically cutting memory access.

systolic array (weight stationary)

  input ->  [w][w][w][w]
  input ->  [w][w][w][w]   each [w] is a multiply-accumulate cell
  input ->  [w][w][w][w]   weights stay in the cells
  input ->  [w][w][w][w]   only inputs/partial sums flow
                v
              output (partial-sum accumulation)

Dataflow designs are classified by what they pin on chip.

  • Weight stationary: keep weights in cells, stream inputs. High weight reuse.
  • Output stationary: accumulate partial sums in cells. High output reuse.
  • Row stationary: map one row of the computation to each cell so that weights, inputs, and partial sums are all reused locally. An energy-efficient compromise (research such as Eyeriss).
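The classification comes down to loop order: whichever operand sits in the outer loops is the one that stays put. The toy model below counts operand traffic for two of the dataflows on a small matrix product. It assumes a cell holds a single value at a time, which is far simpler than any real array, and all names are illustrative.

```python
def weight_stationary(W, X):
    """Each weight is fetched once and reused for every input row before moving on."""
    n_out, n_in, n_batch = len(W), len(W[0]), len(X)
    Y = [[0] * n_out for _ in range(n_batch)]
    weight_fetches = psum_updates = 0
    for o in range(n_out):
        for i in range(n_in):
            w = W[o][i]                  # fetched once, then held "in the cell"
            weight_fetches += 1
            for b in range(n_batch):
                Y[b][o] += w * X[b][i]   # inputs and partial sums stream past
                psum_updates += 1
    return Y, weight_fetches, psum_updates

def output_stationary(W, X):
    """Each output accumulates in place; weights are fetched again for every input row."""
    n_out, n_in, n_batch = len(W), len(W[0]), len(X)
    Y = [[0] * n_out for _ in range(n_batch)]
    weight_fetches = output_writes = 0
    for b in range(n_batch):
        for o in range(n_out):
            acc = 0                      # the partial sum stays "in the cell"
            for i in range(n_in):
                acc += W[o][i] * X[b][i]
                weight_fetches += 1
            Y[b][o] = acc
            output_writes += 1
    return Y, weight_fetches, output_writes

W = [[o + i for i in range(4)] for o in range(3)]   # 3 outputs x 4 inputs
X = [[b - i for i in range(4)] for b in range(5)]   # batch of 5 input rows
y_ws, ws_fetches, ws_psums = weight_stationary(W, X)
y_os, os_fetches, os_writes = output_stationary(W, X)
print("weight stationary: weight fetches", ws_fetches, "partial-sum updates", ws_psums)
print("output stationary: weight fetches", os_fetches, "output writes", os_writes)
```

Both orders compute the same result. They differ only in which operand moves, which is exactly the trade-off the list above describes.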

Tiling — Being Mindful of the Memory Hierarchy

On GPUs the same principle appears as tiling. A large matmul is split into small blocks; one block is loaded into fast on-chip memory (shared memory/registers), and as much computation as possible is finished there. The goal is to reduce slow HBM round trips.

split C = A x B into tiles

A          B          C
[A00 A01]  [B00 B01]  [C00 C01]
[A10 A11]  [B10 B11]  [C10 C11]

C00 = A00*B00 + A01*B10   <- accumulate per tile
keep a tile in shared memory and reuse it

Choosing tile size is a trade-off. Large tiles increase reuse but may not fit on-chip; small tiles fit but increase HBM traffic. Automatically searching this sweet spot is the job of the compiler discussed later.
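The reuse side of the trade-off can be measured with a toy tiled matmul that counts how many elements it brings "on chip". This is a plain-Python sketch: the counter models traffic only, assumes one tile of A and one tile of B are loaded per inner block step, and ignores on-chip capacity limits.

```python
def matmul_tiled(A, B, tile: int):
    """C = A x B computed tile by tile; returns (C, elements loaded from slow memory)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    elements_loaded = 0
    for i0 in range(0, n, tile):
        i1 = min(i0 + tile, n)
        for j0 in range(0, m, tile):
            j1 = min(j0 + tile, m)
            for k0 in range(0, k, tile):
                k1 = min(k0 + tile, k)
                # one tile of A and one tile of B are brought on chip
                elements_loaded += (i1 - i0) * (k1 - k0) + (k1 - k0) * (j1 - j0)
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        acc = C[i][j]
                        for kk in range(k0, k1):
                            acc += A[i][kk] * B[kk][j]
                        C[i][j] = acc
    return C, elements_loaded

N = 8
A = [[i + j for j in range(N)] for i in range(N)]
B = [[i - j for j in range(N)] for i in range(N)]
reference = [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)] for i in range(N)]

traffic = {}
for tile in (1, 2, 4, 8):
    C, traffic[tile] = matmul_tiled(A, B, tile)
    assert C == reference
    print(f"tile {tile}: {traffic[tile]} elements loaded")
```

In this model traffic falls in proportion to the tile size, until the tile no longer fits on chip, which the sketch does not model.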


Operator Fusion — Not Writing Intermediates to Memory

Deep learning graphs chain many small operations: matmul, then bias add, then activation, then normalization. Run each separately and every step writes an intermediate tensor to HBM and reads it back in the next step. From a memory-bound viewpoint, this is waste piled on waste.

Operator fusion merges these consecutive operations into a single kernel, keeping intermediates in on-chip registers/shared memory and carrying them forward. The HBM round trips vanish.

before: MatMul -> [HBM] -> Bias -> [HBM] -> GELU -> [HBM]
after:  MatMul + Bias + GELU  (one kernel, intermediates stay on chip)

The most famous case is FlashAttention. Instead of materializing attention's huge intermediate (the score matrix) in HBM, it streams it in blocks and accumulates the softmax online. It reduces memory usage while breaking the bandwidth bottleneck, making long-context inference practical. Fusion is sometimes found automatically by the compiler, and for hot paths people sometimes write the kernels by hand.
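The online softmax at the heart of this approach can be shown for a single query in plain Python. This is a sketch of the accumulation trick only, not of the FlashAttention kernel: it omits the usual 1/sqrt(d) scaling, masking, and everything GPU-specific, and the function names are illustrative.

```python
import math

def attention_reference(q, keys, values):
    """Standard attention for one query: materialize all scores, then softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]

def attention_online(q, keys, values, block: int = 2):
    """Same result, processing keys/values block by block with a running max and sum."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.1, 0.2], [1.0, -1.0], [0.5, 0.5], [-0.3, 0.8], [2.0, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0], [-1.0, 3.0]]
ref = attention_reference(q, keys, values)
online = attention_online(q, keys, values, block=2)
print("reference:", ref)
print("online   :", online)
```

The online version never holds more than one block of scores at a time, which is what removes the need to write the full score matrix to HBM.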


Batching and KV Cache — The Tug-of-War of Throughput and Latency

The core reason decoding is memory-bound is that with batch 1 you read the weights once to make a single token. So if you batch many requests and make many tokens from one weight read, arithmetic intensity rises and the chip does real work.

The problem is that in real serving, requests arrive at different times and have different lengths. The answer is continuous batching. Rather than waiting for a fixed batch, at each generation step it drops finished requests and slots in new ones so the GPU never idles.
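A small simulation makes the gain visible. It assumes all requests are already queued, every request needs only decode steps (no prefill), and each step produces one token per active request; the request lengths and function names are made up for illustration.

```python
def static_batching(lengths, slots):
    """Fixed batches: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching(lengths, slots):
    """Refill a slot as soon as its request finishes; returns total decode steps."""
    queue = list(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop(0))              # slot in a waiting request
        steps += 1                                   # one decode step for every active request
        active = [r - 1 for r in active if r > 1]    # drop the ones that just finished
    return steps

lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # tokens to generate per request (hypothetical)
static_steps = static_batching(lengths, slots=4)
continuous_steps = continuous_batching(lengths, slots=4)
print("static batching    :", static_steps, "steps")
print("continuous batching:", continuous_steps, "steps")
```

In the static case the three short requests in each batch hold their slots idle while the long one finishes. Refilling those slots every step recovers the wasted work.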

KV Cache

To avoid recomputing the key/value of previous tokens, the transformer stores them in the KV cache. This cache grows linearly with sequence length and batch size, quickly consuming memory.

KV cache size (bytes)
 = 2 (K and V) x layers x kv_heads x head_dim
   x seq_len x batch x bytes_per_elem

(kv_heads equals the number of attention heads unless GQA/MQA shares them)

For a 13B-class model with long context, the KV cache can reach tens of GB. Hence these techniques.

  • PagedAttention: manage the KV cache in small blocks like OS paging to reduce fragmentation (the heart of vLLM).
  • KV cache quantization: store the cache in INT8/FP8 to halve capacity and bandwidth relative to FP16.
  • GQA/MQA: have multiple attention heads share key/value to shrink the cache.
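Plugging numbers into the size formula shows how much two of these techniques buy. The model shape below (40 layers, 40 heads of dimension 128) is a hypothetical 13B-class configuration chosen for illustration, as are the sequence length and batch size.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Bytes for keys and values across all layers, tokens, and sequences."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

GIB = 1024 ** 3

# Hypothetical 13B-class shape with an FP16 cache, 4096 tokens, 8 concurrent sequences.
full = kv_cache_bytes(layers=40, kv_heads=40, head_dim=128, seq_len=4096, batch=8)
# Same shape, grouped-query attention: every 5 query heads share one K/V head.
gqa = kv_cache_bytes(layers=40, kv_heads=8, head_dim=128, seq_len=4096, batch=8)
# Same shape, cache stored in 8 bits instead of 16.
int8 = kv_cache_bytes(layers=40, kv_heads=40, head_dim=128, seq_len=4096, batch=8,
                      bytes_per_elem=1)

print(f"FP16, full heads: {full / GIB:.1f} GiB")
print(f"FP16, GQA (8 KV): {gqa / GIB:.1f} GiB")
print(f"8-bit, full heads: {int8 / GIB:.1f} GiB")
```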

Batching and KV cache management are ultimately a tug-of-war between throughput and latency. Larger batches raise throughput but can slow individual responses. Workloads where time-to-first-token (TTFT) matters, such as chatbots, balance at a different point than bulk-batch workloads where total throughput matters.


Compiler Optimization — Searching Instead of Hand-Tuning

The tile-size choices, fusion decisions, and memory layouts seen above are endless to tune by hand. So deep learning compilers automate this search.

There are roughly these layers.

  • Graph-level IR: takes the op graph and applies fusion, constant folding, layout transforms (e.g., the graph stages of TVM, XLA, TorchInductor).
  • Tile/schedule search: tiles ops to the hardware and orders loops. Triton is a prime example — write a GPU kernel in near-Python syntax and the compiler handles memory layout and parallelization.
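  • Auto-tuning: actually run multiple tile-size/schedule candidates and pick the fastest.

# Triton kernel: element-wise add with the ReLU fused into the same kernel
# (needs the triton package and a supported GPU)
import triton
import triton.language as tl

@triton.jit
def fused_add_relu(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                    # which block this program instance handles
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # element offsets covered by this block
    mask = offs < n                           # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    z = x + y
    z = tl.where(z > 0, z, 0.0)               # ReLU applied before anything is written back
    tl.store(out_ptr + offs, z, mask=mask)

# launch over n elements of CUDA tensors x, y, out:
#   fused_add_relu[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)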

The key is separating "what to compute" (the math) from "how to compute it" (the schedule). For the same matmul, the optimal schedule differs by chip and by shape. The compiler sweeps that enormous search space faster than a human.
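Auto-tuning itself is a short loop: run each candidate, time it, keep the fastest. The sketch below tunes the chunk size of a toy CPU "kernel" using only the standard library. The kernel, the candidate list, and the function names are illustrative, and real tuners also cache results per input shape and per device.

```python
import time

def blocked_sum(data, block):
    """Toy 'kernel': sum a list in chunks of `block` elements."""
    total = 0
    for start in range(0, len(data), block):
        total += sum(data[start:start + block])
    return total

def autotune(kernel, data, candidates, repeats=3):
    """Time every candidate configuration and return (best candidate, timings)."""
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, block)
            best = min(best, time.perf_counter() - t0)  # keep the least noisy run
        timings[block] = best
    return min(timings, key=timings.get), timings

data = list(range(100_000))
candidates = [64, 1024, 16384]
best_block, timings = autotune(blocked_sum, data, candidates)
for block, seconds in timings.items():
    print(f"block {block:>6}: {seconds * 1e3:.3f} ms")
print("fastest block size on this machine:", best_block)
```

The math (a sum) is fixed; only the schedule (the chunk size) changes, and the winner depends on the machine the search runs on.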


The Accuracy-Efficiency Trade-off — No Free Lunch

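The lossy techniques so far (quantization, pruning to a sparsity pattern, KV cache compression) buy efficiency at the price of accuracy. Fusion, tiling, and batching only reschedule the same math, so they cost nothing beyond floating-point rounding order. For the lossy ones, what matters is measuring the exchange rate.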

Technique             | What you gain         | What you may lose                 | Mitigation
----------------------+-----------------------+-----------------------------------+-----------------------------------
INT8 quantization     | Half bandwidth/memory | Small accuracy drop               | Per-channel scale, calibration
FP4/INT4              | Quarter bandwidth     | Larger drop at lower bits         | QAT, microscaling, mixed precision
2:4 sparsity          | About 2x compute      | Loss of expressivity              | Fine-tune after sparsification
KV cache quantization | Half the cache        | Accumulated error in long context | Use FP8, preserve sensitive layers

A few practical principles.

  • Apply one technique at a time and measure accuracy. Turning several on at once hides where it broke.
  • Look at tail cases, not just averages. Quantization often leaves average accuracy intact but collapses on specific inputs.
  • Do not treat every layer at the same precision. Keep sensitive parts — attention outputs, first/last layers — at higher precision (mixed precision).

Hardware-Software Co-design — The Big Picture

By now one thing is clear: quantization formats, sparsity patterns, dataflow, and compilers are not independent — they are designed together.

  • Because Blackwell's 2nd-gen Transformer Engine supports FP4 natively, it becomes worthwhile for software to adopt microscaling quantization.
  • Because tensor cores understand the 2:4 pattern in circuitry, frameworks sparsify to match it.
  • The limits of HBM bandwidth and the NVLink/UALink interconnect determine tile sizes and batching strategy.

The 2026 trajectory extends this co-design. NVIDIA set a goal for next-generation Vera Rubin of roughly 10x perf/watt alongside HBM4, while Google aimed squarely at inference efficiency with TPU v6 Trillium (about 4.7x peak versus the prior generation) and the inference-specialized 7th-generation Ironwood. Cloud providers' in-house inference ASICs are rising fast (from about 15% share in 2024 toward an expected 40% in 2026), accelerating the trend of tailoring chips to inference workloads. NVIDIA still holds roughly 75-80% of the accelerator market, but competition such as AMD MI350X and the rise of in-house ASICs are layering the inference landscape.

The takeaway is to view inference optimization not as a single trick but as a stack that moves together with hardware generations.


Putting It Into Practice — Where to Start

Suppose you must serve a new model. Here is a suggested order.

  1. Measure first. Profile whether the workload is prefill- or decode-heavy, memory- or compute-bound. Optimizing without knowing the bottleneck is wasted effort.
  2. Use a proven serving engine first. Engines like vLLM and TensorRT-LLM already have continuous batching, PagedAttention, and fused kernels. You rarely need to build from scratch.
  3. Start with weight quantization. If decoding is memory-bound, INT8/INT4 weight quantization gives the biggest immediate win. Measure accuracy to confirm it passes the bar.
  4. Manage the KV cache. For long context, free up memory with PagedAttention and KV cache quantization.
  5. Then sparsity / low-bit. If that is not enough, move to more aggressive techniques like 2:4 sparsity and FP4 — always with fine-tuning and validation.
  6. Exploit the chip generation. To use the latest FP4/FP8 tensor cores, you need matching compiler/runtime versions. Align hardware and software versions together.

Common pitfalls, too.

  • Shipping quantization to production without accuracy validation, only to have quality collapse on certain inputs.
  • Banking on unstructured sparsity and finding no real speedup (on GPUs, zeros that do not follow a hardware-supported pattern such as 2:4 usually yield little or no gain).
  • Cranking the batch size up for throughput until time-to-first-token degrades enough to ruin the user experience.

How Inference and Training Workloads Differ

There is a reason every optimization so far focused on "inference." Inference and training handle the same model, but the nature of the load the hardware faces is entirely different.

Aspect    | Training                                        | Inference
----------+-------------------------------------------------+-------------------------------------------
Direction | Forward + backward                              | Forward only
Precision | BF16/FP8 centric, numerical stability matters   | Aggressive down to INT8/FP4
Memory    | Explodes with activations/optimizer state       | Weights + KV cache centric
Batch     | Large to maximize throughput                    | Often split small under latency limits
Metrics   | Convergence speed, total throughput             | Latency (TTFT/per-token), throughput, cost

Training is a job you run once and want done well, so throughput is nearly everything. Inference, by contrast, repeats endlessly, and the latency a user waits through translates directly into perceived quality. So inference optimization is the harder problem of satisfying both "average throughput" and "tail latency" at once.

This difference is reflected in chip design. Training chips emphasize high precision, huge memory, and fast chip-to-chip interconnect, while inference-specialized chips (e.g., TPU Ironwood, various inference ASICs) prioritize low-precision compute density, perf/watt, and low latency. That is the backdrop for the flood of inference-only chips in 2026, the year inference capex overtook training.


A Small Case Study — Following the Numbers

Let us bundle the earlier concepts into one hypothetical scenario. Assume we serve a 70B model as a chatbot on a single GPU.

starting point (FP16, batch 1)
  weights about 140GB -> exceeds single-GPU memory, some offloading needed
  one weight read per token -> at most about 24 tokens/sec, even if every read ran at full HBM speed
  (offloaded weights cross a slower link, so the real figure is lower)
  GPU compute-unit utilization: very low

step 1: INT8 weight quantization
  weights about 70GB -> memory headroom, half the bandwidth
  about 2x speed per token -> about 45-48 tokens/sec
  accuracy: near-lossless with per-channel scale

step 2: continuous batching (16 concurrent requests)
  16 tokens generated per single weight read
  arithmetic intensity rises -> GPU utilization greatly improved
  total throughput several times higher, individual latency slightly up

step 3: KV cache quantization + PagedAttention
  fragmentation suppressed even at long context
  more concurrent requests accepted -> cost/request falls

The lesson is clear. No single technique is magic, but stacking them in order along the bottleneck makes throughput jump several-fold on the same chip and drops cost per request sharply. And at every step you must measure accuracy and latency to filter out changes that "look free but actually have a cost."
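The headline numbers in the scenario can be reproduced with the same bandwidth model used earlier. This sketch assumes one full weight read per decode step at 3.35TB/s and ignores KV cache traffic, compute time, and scheduling overhead, so the batched figure is an optimistic upper bound rather than a prediction.

```python
BANDWIDTH = 3.35e12   # bytes/s, the figure used earlier in this essay
PARAMS = 70e9         # 70B-parameter model

def tokens_per_second(bytes_per_param: float, batch: int = 1) -> float:
    """Upper bound: one full weight read yields `batch` tokens; KV cache traffic ignored."""
    step_seconds = PARAMS * bytes_per_param / BANDWIDTH
    return batch / step_seconds

fp16_b1 = tokens_per_second(bytes_per_param=2, batch=1)
int8_b1 = tokens_per_second(bytes_per_param=1, batch=1)
int8_b16 = tokens_per_second(bytes_per_param=1, batch=16)

print(f"FP16, batch 1 : {fp16_b1:6.1f} tokens/s")
print(f"INT8, batch 1 : {int8_b1:6.1f} tokens/s")
print(f"INT8, batch 16: {int8_b16:6.1f} tokens/s total (ideal bound)")
```

The gap between this ideal bound and the "several times higher" seen in practice is the KV cache reads and per-request overhead that the model leaves out, which is one more reason to measure rather than extrapolate.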

The core is the loop of measure -> remove the biggest bottleneck -> measure again. Knowing where your workload's point sits on the roofline comes before memorizing a flashy list of techniques.


Conclusion

Making inference fast compresses into one sentence: make data smaller (quantization), fewer (sparsity), and moved less (dataflow and fusion), then connect it through batching and compilers so the chip actually uses it. Every one of these strategies grows from the same root — the memory wall.

The chips of 2026 are increasingly designed with inference as priority one. As FP4 moves into the circuitry, inference-only ASICs proliferate, and compilers grow smarter, the engineer's role shifts from "which magic trick do I enable" to "can I diagnose my workload's bottleneck precisely and choose the right trade-off." Measure, apply one thing at a time, and guard accuracy — that discipline is the real secret to fast inference.

