The Memory Wall and HBM — The Real Bottleneck That Divides AI Performance
Author: Youngju Kim (@fjvbn20031)
- Introduction
- 1. Compute Is Cheap, Data Movement Is Expensive
- 2. What Is the Memory Wall
- 3. HBM — The Weapon Against the Memory Wall
- 4. Arithmetic Intensity and the Roofline Model
- 5. The KV Cache — The Hidden Protagonist of Inference Memory
- 6. Quantization Saves Bandwidth
- 7. The Memory Hierarchy and On-Chip SRAM
- 8. Trends in Next-Generation Memory
- 9. Implications a Developer Should Grasp
- 10. Frequently Asked Questions
- Closing
- References
Introduction
When we talk about AI hardware, we usually look at TFLOPS — floating-point operations per second. "This chip does so many petaflops." But anyone who has actually run a model in the field has felt the same frustration. The compute capability on the spec sheet is enormous, yet running the model in practice uses only a fraction of it. It is common for GPU utilization to never climb above 30 percent. Why?
The answer is almost always the same: the compute units are waiting for data. There are plenty of multipliers ready to calculate, but the numbers to feed them (weights, activations) cannot be fetched from memory fast enough. This is the memory wall, and in 2026 it is the real bottleneck that effectively divides AI performance. This article lays out what the memory wall is and how a developer can understand and handle it.
The goal of this article is simple: to make you instinctively ask one question. "Is the real bottleneck of this workload compute, or memory?" That single question can cut your inference cost in half and stop you from buying the wrong chip. The eye that sees the true bottleneck hidden behind a flashy TFLOPS number — that is the core skill this article aims to convey.
Here is the overall flow. First we point out the fundamental asymmetry that "compute is cheap and data movement is expensive," then look at what the memory wall is and at HBM, the weapon we use against it. Next we cover the tool for judging whether you are compute-bound or memory-bound (the roofline model), along with the KV cache, quantization, and the memory hierarchy, and we close with next-generation memory trends and a practical wrap-up for developers.
1. Compute Is Cheap, Data Movement Is Expensive
One uncomfortable truth of chip design is that for decades, compute capability has advanced far faster than memory capability. Adding more transistors to grow the count of multipliers was relatively easy, but moving data quickly into the chip did not improve nearly as fast.
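From an energy standpoint the gap is even more dramatic. Fetching a number from distant memory costs far more energy than multiplying two numbers; commonly cited estimates put an off-chip DRAM access at roughly two orders of magnitude above a single multiply. The farther away the data sits, the larger the cost.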
Data-movement energy (relative, concept)
On-chip compute (multiply)   .                        cheapest
On-chip SRAM access          ...
HBM (in-package) access      ..........
Off-chip (another node)      ......................   most expensive
-> The farther you fetch data, the more energy and time it costs.
This asymmetry explains most of what follows. A large part of good AI hardware and good AI code is not about "reducing compute" but about "reducing data movement."
Why this asymmetry exists
This asymmetry is not an accident but a consequence of physics and process technology. Making transistors smaller and cramming more of them onto a chip advanced fairly steadily for decades, so growing the count of compute units was relatively easy. Moving data quickly into and out of the chip, by contrast, is bound by the number of pins, the physical limits of wiring, and constraints like power and heat — and it did not improve nearly as fast.
As a result, the gap between compute capability and memory-supply capability widened from generation to generation. Chips could compute more and more, but their ability to feed those compute units with data could not keep up. This accumulated gap is exactly the memory wall. And the larger a model becomes (the more weights it has), the more data must be read, so this problem only gets worse over time.
2. What Is the Memory Wall
The memory wall refers to the widening gap between compute speed and the speed at which memory can supply data. A chip's compute capability grew quickly, but the data-supply capability that must feed those compute units (memory bandwidth) did not grow as fast. As a result, powerful compute units sit idle for longer, waiting for data.
By analogy: a kitchen has 100 cooks (compute units), but only one corridor to carry ingredients (memory bandwidth), so most cooks stand waiting for materials. Growing to 200 cooks does not get food out faster if the corridor stays the same. What you must widen is the corridor.
This is why 2026 chip design focuses on memory bandwidth, interconnect, and packaging rather than simply adding more compute units.
The secret of 30% utilization
Let us re-explain the "30 percent GPU utilization" phenomenon mentioned at the start through the lens of the memory wall. Low utilization means the compute units are idle much of the time. But the reason they are idle is not that there is no work — it is that the data the work needs has not yet arrived.
Why utilization is low (concept)
time ->
compute: [calc][  waiting  ][calc][  waiting  ][calc]
memory:        [moving data]      [moving data]
The compute units often stall, waiting for data.
-> What you need is not "faster compute" but "faster data supply."
So the answer to "how do I raise GPU utilization?" is usually not "keep the compute units busier" but rather "supply data faster, read less data, or reuse data you already read." This shift in perspective is the first step to understanding the memory wall.
3. HBM — The Weapon Against the Memory Wall
The most important weapon for widening this corridor is HBM (High Bandwidth Memory). Instead of placing ordinary DRAM flat beside the chip, HBM stacks memory dies vertically (a 3D stack) and attaches the stack right next to the compute chip in one package (for example, with CoWoS packaging). The distance data travels shrinks and a very wide interface becomes possible, which raises bandwidth dramatically.
HBM generations
HBM generation flow (concept)
HBM2 -> HBM2e -> HBM3 -> HBM3e -> HBM4
As generations rise:
- more stacked layers -> more capacity
- higher per-pin speed -> more bandwidth
- more refined packaging -> more efficiency
The key transition in 2026 is from HBM3e to HBM4. HBM4 widens the interface and stacks higher, raising bandwidth and capacity at once. The reason the next-generation platforms covered in an earlier article (such as Vera Rubin) adopt HBM4 is precisely to ease the memory wall.
Why HBM is scarce and expensive
HBM is powerful, but it is not free. Stacking memory vertically and attaching it precisely beside the compute chip is far harder and more expensive than manufacturing ordinary memory. Yield management is also difficult: if even one die fails during stacking, the entire stack must be discarded. On top of that, the production capacity for the advanced packaging that binds HBM and the compute die into one (for example, CoWoS) is itself limited.
Why HBM is scarce (concept)
- 3D stacking: precise vertical stacking; one defect loses the whole stack
- advanced packaging: capacity to bind it with the compute die is rare
- surging demand: every AI accelerator wants HBM
-> HBM and packaging capacity are the real bottleneck
in the supply of AI accelerators.
So much of the talk in 2026 about AI accelerator shortages is, in truth, a shortage of "HBM and packaging" rather than of "the chip itself." Even if you can print compute dies, you cannot make finished products if you lack the HBM and the packaging capacity to attach to them. The memory wall is not only a performance problem but also a supply-chain problem.
Bandwidth vs. capacity
HBM has two distinct resources, and you must not confuse them.
- Capacity (GB): the space for model weights, activations, and the KV cache. If short, the model will not fit on the chip or must be split across several chips.
- Bandwidth (TB/s): how fast you can read and write data from that space. It often directly determines inference speed.
A common misconception in inference is "as long as capacity is large, you are fine." Even if the model fits, if bandwidth is insufficient, token generation is slow. You must view both resources together.
Capacity vs. bandwidth (analogy)
capacity = the size of the warehouse (how much you can store)
bandwidth = the width of the door (how fast you can take things out)
A big warehouse with a narrow door: you can store a lot,
but you cannot get it out quickly.
Inference speed is often decided not by "warehouse size"
but by "door width."
The lesson of this analogy is: when you look at a chip's spec, do not be dazzled by the "so many GB" capacity number alone. The "so many TB/s" of bandwidth next to it often has a more direct effect on inference speed. Whether the model fits is a question of capacity; whether it is fast enough is a question of bandwidth. They are different questions, and you must satisfy both to get a good inference experience.
4. Arithmetic Intensity and the Roofline Model
The tool for deciding whether you are compute-bound or memory-bound is the roofline model, and its key concept is arithmetic intensity. Arithmetic intensity is "how many operations you perform per byte fetched from memory."
Arithmetic intensity definition
I = (operations performed) / (bytes moved)
= FLOPs / Bytes
Large I -> fetched data is heavily reused -> compute-bound
Small I -> data is used once and discarded -> memory-bound
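To make the definition concrete, here is a minimal sketch that computes arithmetic intensity for a matrix multiply. It assumes an idealized case where each matrix crosses the memory interface exactly once; the function name and the 4096 x 4096 shape are illustrative choices, not taken from any particular model.

```python
def matmul_arithmetic_intensity(m: int, k: int, n: int,
                                bytes_per_element: float = 2.0) -> float:
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Idealized assumption: A, B, and C each move through memory exactly once.
    """
    flops = 2 * m * k * n  # one multiply and one add per (m, k, n) triple
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# One input row against a 4096 x 4096 FP16 weight matrix (decode-like).
single_row = matmul_arithmetic_intensity(1, 4096, 4096)
# 256 input rows sharing the same weight read (batched or prefill-like).
many_rows = matmul_arithmetic_intensity(256, 4096, 4096)

print(f"1 row:    {single_row:.2f} FLOPs/byte")
print(f"256 rows: {many_rows:.2f} FLOPs/byte")
```

Under these assumptions the single-row case lands just under 1 FLOP per byte, while 256 rows reach about 228 FLOPs per byte. The weights are the same in both cases; only how often each fetched byte is reused has changed.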
The roofline model draws the upper bound of achievable performance as a function of this arithmetic intensity.
Roofline model (concept)
Achievable performance
  ^
  |             ______________  compute roof (peak FLOPS)
  |            /
  |           /   <- slope of this ramp = memory bandwidth
  |          /
  |         /
  +------------+------------------> arithmetic intensity (FLOPs/Byte)
          ridge point
If arithmetic intensity is below the ridge (ramp region),
memory bandwidth limits performance.
If above (flat region), compute capability limits it.
LLM inference, especially the decoding stage that generates tokens one at a time, has very low arithmetic intensity. Each step reads the enormous weights once, multiplies them by a small input, and discards them. So LLM decoding almost always sits on the left ramp of the roofline, in the memory-bound region. This single fact explains much of inference optimization.
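The roofline itself is just the lower of two limits, which a few lines can sketch. The chip numbers below (1 PFLOP/s peak, 2 TB/s bandwidth) are hypothetical round figures chosen for easy arithmetic, not the specification of any real product.

```python
def attainable_flops(intensity: float, peak_flops: float,
                     bandwidth_bytes_per_s: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth ramp."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity)


def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the ramp meets the roof."""
    return peak_flops / bandwidth_bytes_per_s


PEAK = 1.0e15       # hypothetical peak compute, FLOP/s
BANDWIDTH = 2.0e12  # hypothetical memory bandwidth, bytes/s

print("ridge point:", ridge_point(PEAK, BANDWIDTH), "FLOPs/byte")

# A decode-like workload at about 1 FLOP/byte sits far left of the ridge.
decode = attainable_flops(1.0, PEAK, BANDWIDTH)
print(f"fraction of peak usable at 1 FLOP/byte: {decode / PEAK:.3%}")

# Doubling peak compute changes nothing here; doubling bandwidth doubles it.
print(attainable_flops(1.0, 2 * PEAK, BANDWIDTH) == decode)
print(attainable_flops(1.0, PEAK, 2 * BANDWIDTH) == 2 * decode)
```

With these assumed numbers the ridge sits at 500 FLOPs per byte, so a workload near 1 FLOP per byte can use only a fraction of a percent of the peak, no matter how large the peak is.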
The direction of optimization seen through the roofline
The real value of the roofline model is that it tells you "where to fix." If your workload sits in the ramp region (memory-bound), adding more compute units does no good. You must raise the slope of the ramp — that is, memory bandwidth — or push arithmetic intensity up to move rightward.
What helps and what does not when memory-bound
No help: adding compute units, a pricier chip that boasts peak FLOPS
(the compute units are already idle)
Helps: - a chip with higher memory bandwidth
- quantization to read fewer bytes
- raising data reuse to lift arithmetic intensity (use SRAM)
- removing unnecessary memory accesses
This simple insight becomes the compass of inference optimization. Before buying an expensive chip, first check where your workload sits on the roofline. Since LLM token generation usually sits on the left ramp, the answer is almost always "do not add compute; reduce data movement."
5. The KV Cache — The Hidden Protagonist of Inference Memory
When talking about memory in LLM inference, you cannot leave out the KV cache. As it generates tokens, a transformer stores and reuses the keys and values for prior tokens, so that it does not recompute everything from scratch.
The problem is that this cache grows in proportion to context length.
KV cache size (concept)
cache size approx. proportional to (batch size) x (context length)
x (number of layers) x (number of KV heads) x (head dimension)
x (2: key and value) x (precision bytes)
long context + many concurrent users -> the KV cache grows
as large as the weights, sometimes larger.
In 2026 services handling long context and many concurrent users, the KV cache becomes the culprit that eats both memory capacity and bandwidth. So techniques to compress the cache, lower its precision (such as an INT8 cache), or manage it efficiently in pages have emerged as core inference optimizations.
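The proportionality above can be turned into a quick estimate. This is a minimal sketch that assumes the per-layer cache width is (number of KV heads) x (head dimension); the model shape used in the example (32 layers, 8 KV heads, head dimension 128) is a hypothetical configuration, not a specific published model.

```python
def kv_cache_bytes(batch_size: int, context_length: int, num_layers: int,
                   num_kv_heads: int, head_dim: int,
                   bytes_per_element: float = 2.0) -> float:
    """Approximate KV cache size in bytes.

    The leading 2 accounts for storing both a key and a value per token.
    """
    return (2 * batch_size * context_length * num_layers
            * num_kv_heads * head_dim * bytes_per_element)


GIB = 2 ** 30

# Hypothetical model shape with an FP16 cache.
one_user = kv_cache_bytes(1, 8192, 32, 8, 128)
many_users = kv_cache_bytes(32, 32768, 32, 8, 128)
many_users_int8 = kv_cache_bytes(32, 32768, 32, 8, 128, bytes_per_element=1.0)

print(f"1 user, 8K context:          {one_user / GIB:.0f} GiB")
print(f"32 users, 32K context:       {many_users / GIB:.0f} GiB")
print(f"32 users, 32K context, INT8: {many_users_int8 / GIB:.0f} GiB")
```

For this assumed shape, one user at 8K context needs 1 GiB, while 32 users at 32K context need 128 GiB in FP16. That is the scale at which the cache rivals or exceeds the weights, and halving the cache precision halves the figure.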
How the KV cache eats bandwidth
The KV cache eats not only capacity but also bandwidth. Every time it generates a single token, it must re-read all of the keys and values accumulated so far. The longer the context, the larger the cache that must be read at each token, and this translates directly into a bandwidth burden.
The KV cache and bandwidth (concept)
short context: read a small cache when generating a token -> light
long context: read a huge cache when generating a token -> heavy, every token
-> Long context consumes not only "capacity" but also
"bandwidth at every token."
This is why per-token cost keeps rising as generation gets longer.
This is why, in long-context services, token generation slows down the deeper you get into a generation: the cache keeps growing. It is also why KV-cache optimization is not merely about saving memory but about speed and cost. A lower-precision or compressed cache reduces both the capacity used and the bandwidth consumed, flattening the cost curve of long-context inference.
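A rough way to see the per-token bandwidth burden is to count the bytes one decode step must read: the weights plus the cache accumulated so far. The sketch below is a bandwidth-only lower bound for a single sequence, and every number in it (16 GB of weights, 128 KiB of cache per token, 2 TB/s of bandwidth) is a hypothetical value chosen for illustration.

```python
def decode_step_seconds(weight_bytes: float, kv_bytes_per_token: float,
                        tokens_in_cache: int,
                        bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound on the time to generate one token."""
    bytes_read = weight_bytes + kv_bytes_per_token * tokens_in_cache
    return bytes_read / bandwidth_bytes_per_s


WEIGHT_BYTES = 16e9       # hypothetical: about 8B parameters in FP16
KV_PER_TOKEN = 131072.0   # hypothetical: 128 KiB of FP16 cache per token
BANDWIDTH = 2.0e12        # hypothetical: 2 TB/s

for tokens in (1_000, 32_000, 131_072):
    fp16 = decode_step_seconds(WEIGHT_BYTES, KV_PER_TOKEN, tokens, BANDWIDTH)
    int8 = decode_step_seconds(WEIGHT_BYTES, KV_PER_TOKEN / 2, tokens, BANDWIDTH)
    print(f"{tokens:>7} cached tokens: "
          f"FP16 cache {fp16 * 1e3:.2f} ms/token, "
          f"INT8 cache {int8 * 1e3:.2f} ms/token")
```

With these assumed numbers, the per-token time roughly doubles between a short context and a 128K-token context, and a half-precision cache recovers a good part of that difference.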
6. Quantization Saves Bandwidth
In a memory-bound world, quantization is not merely a memory-saving technique but a speed-up technique. This point matters precisely because it runs against intuition.
Dropping weights from FP16 to INT8 halves memory usage. But if inference is memory-bound, reading half the bytes also takes half the time, so the same weights stream through about twice as fast. In other words, quantization directly raises throughput in the memory-bound region.
The dual effect of quantization (memory-bound inference)
FP16 weight: [##] read 2 bytes -> 1 unit of time
INT8 weight: [#] read 1 byte -> 0.5 unit (2x faster)
FP4 weight: [|] read 0.5 byte -> 0.25 unit (4x faster)
When memory is the bottleneck, lowering precision saves
capacity AND speeds things up.
Of course, lowering precision risks accuracy loss. So in 2026, inference stacks combine sophisticated quantization techniques that minimize accuracy loss (per-tensor or per-channel scaling, and so on) with low-precision formats supported directly by the hardware (FP8/FP4). Blackwell's second-generation Transformer Engine from the earlier article is precisely the hardware answer to this trend.
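To show what per-channel scaling means in practice, here is a minimal pure-Python sketch of symmetric INT8 quantization with one scale per row. It is an illustration of the general idea only; production stacks use more elaborate schemes, and the function names here are hypothetical.

```python
def quantize_per_channel_int8(
    weights: list[list[float]],
) -> tuple[list[list[int]], list[float]]:
    """Symmetric INT8 quantization with one scale per row (output channel)."""
    quantized, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Map the largest magnitude in the row to 127; guard all-zero rows.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales


def dequantize(quantized: list[list[int]],
               scales: list[float]) -> list[list[float]]:
    """Recover approximate float weights from INT8 values and row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


weights = [[0.02, -0.5, 0.25],   # a row with small values
           [4.0, -1.0, 0.5]]     # a row with much larger values
q, scales = quantize_per_channel_int8(weights)
restored = dequantize(q, scales)

for original_row, restored_row, scale in zip(weights, restored, scales):
    worst = max(abs(a - b) for a, b in zip(original_row, restored_row))
    print(f"scale={scale:.5f}  worst error={worst:.5f}")
```

Because each row gets its own scale, the row with small values is not forced onto the coarse grid that the large-valued row needs. The rounding error per weight stays within half a scale step for that row.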
Estimating token generation speed from bandwidth
To get a feel for the power of being memory-bound, let us do a thought experiment that roughly estimates token generation speed from bandwidth alone. The key intuition is this: to generate one token in the decode stage, you must read every weight of the model once.
Estimating token rate from bandwidth (concept)
bytes to read per token approx. total model weight size
tokens per second approx. (memory bandwidth) / (model weight size)
e.g.) quantize the weights to half the size
-> half the bytes to read per token
-> roughly twice the tokens at the same bandwidth
This simple estimate makes clear why quantization is, in effect, speed. Halve the bytes you must read per token, and even with the same bandwidth you generate roughly twice as many tokens per second. Of course, in practice other factors intrude — the KV cache, activations, overhead — but the big picture does not change: "token generation speed is proportional to bandwidth and inversely proportional to weight size." If you want faster token generation, the answer is often wider bandwidth or smaller weights, not a pricier set of compute units.
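The estimate is short enough to write down directly. The sketch below assumes a dense model decoded one sequence at a time, ignores the KV cache and all overhead, and uses hypothetical figures (70 billion parameters, 2 TB/s of bandwidth), so treat the output as an upper bound on a made-up system rather than a benchmark.

```python
def decode_tokens_per_second(num_params: float, bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-sequence decode speed when memory-bound.

    Assumes every weight is read once per token and nothing else is read.
    """
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


NUM_PARAMS = 70e9    # hypothetical dense model size
BANDWIDTH = 2.0e12   # hypothetical memory bandwidth, bytes/s

for name, bytes_per_param in (("FP16", 2.0), ("INT8", 1.0), ("FP4", 0.5)):
    rate = decode_tokens_per_second(NUM_PARAMS, bytes_per_param, BANDWIDTH)
    print(f"{name}: about {rate:.1f} tokens/s")
```

Under these assumptions the bound is roughly 14 tokens per second in FP16, 29 in INT8, and 57 in FP4. Peak FLOPS never appears in the formula, which is the whole point.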
7. The Memory Hierarchy and On-Chip SRAM
Memory is not one thing but a hierarchy. It is stacked in layers, from fast-and-small to slow-and-large.
Memory hierarchy (fast/small -> slow/large)
Registers      fastest, tiny capacity
On-chip SRAM   very fast, small capacity (a few MB to tens of MB)
HBM            fast, large capacity (tens to hundreds of GB)
Host memory    slow, very large capacity
Other nodes    slowest, practically unlimited
The secret of a good inference kernel is to keep data in a faster layer for as long as possible. Once you fetch data from HBM into on-chip SRAM, reuse it as much as you can without going back to HBM. This is the core idea behind techniques that reorganize attention to be memory-efficient (splitting data into small blocks processed inside SRAM). In effect, you artificially raise arithmetic intensity to move rightward on the roofline.
The reason this idea is so powerful is that it produces exactly the same mathematical result while raising speed by changing only how memory is accessed. The amount of computation stays the same; merely changing, cleverly, how data is moved and reused makes a big difference. This is why good kernel design in the memory-wall era is closer to "the choreography of data movement" than to "mathematics."
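One well-known ingredient of such block-wise attention kernels is the online (streaming) softmax, sketched below for a single query with scalar values. This is a simplified illustration written for this article, not the code of any particular kernel: it shows that visiting the inputs one small block at a time, while keeping only three running numbers, gives the same answer as holding everything at once.

```python
import math


def attention_naive(scores: list[float], values: list[float]) -> float:
    """Softmax-weighted average of values, with all scores held at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_blockwise(scores: list[float], values: list[float],
                        block_size: int) -> float:
    """Same quantity, computed one block at a time with running statistics."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) seen so far
    running_out = 0.0  # sum of exp(score - running_max) * value seen so far
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale the accumulators to the new reference maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        running_out *= correction
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum


scores = [0.3, 2.1, -1.0, 4.2, 0.0, 3.3, -2.5, 1.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(attention_naive(scores, values))
print(attention_blockwise(scores, values, block_size=3))
```

The two results agree up to floating-point rounding. In a real kernel the block would be a tile that fits in on-chip SRAM, so the full score matrix never has to be written to or re-read from HBM.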
In an extreme case, some wafer-scale chips (such as Cerebras WSE-3) pair a giant on-chip SRAM (about 44 GB) with tremendous on-chip bandwidth (about 21 PB/s) to eliminate the cost of going to HBM in the first place. It is another kind of answer to the memory wall.
The idea behind this approach is fundamental. An ordinary chip must constantly shuttle data between a small SRAM and a large HBM, but a wafer-scale chip is so enormous (an entire wafer in one piece) that it creates room to place the whole model in on-chip SRAM. Then the most expensive data movement — the round trip to off-chip memory — disappears.
Ordinary chip vs. wafer-scale (memory-access view)
ordinary chip: compute <-> small SRAM <-> HBM (off-chip) <- costly HBM round trip
wafer-scale: compute <-> giant on-chip SRAM <- off-chip round trip minimized
-> Keep data inside the chip to bypass the most expensive
part of the memory wall.
Of course, this approach has its own price. A giant single chip is hard and expensive to manufacture, and it does not fit every workload and budget. But the idea — that "the most direct way to bypass the memory wall is to never send data off the chip at all" — is a fine example of how far memory-centric thinking can go.
8. Trends in Next-Generation Memory
Research to get past the memory wall also heads in more fundamental directions beyond HBM generation updates.
- In-memory computing: instead of bringing data to the compute unit, compute directly inside the memory. It is the most radical way to reduce data movement itself.
- Photonic interconnect: carry data with light instead of electricity to raise the bandwidth and energy efficiency of chip-to-chip communication (Lightmatter and related DARPA projects, for example).
- Optical/photonic tensor cores: a more distant research direction that performs matrix operations themselves with light.
- Chiplet and advanced packaging: pack several small chips (chiplets) densely into one package to shrink chip-to-chip distance and raise bandwidth.
Most of this research is still early in commercialization or at the research stage, but the direction is consistent. In the end it is about the essence of the memory wall: reducing the cost of data movement.
The big picture of three strategies
If we tie next-generation memory research into one big picture, the strategy against the memory wall comes down to three branches.
Three strategies against the memory wall (concept)
1. Move faster: HBM generation updates, photonic interconnect
(move data faster and more efficiently)
2. Move less: quantization, sparsity, data reuse, on-chip SRAM
(reduce the data that must move in the first place)
3. Do not move: in-memory computing, wafer-scale
(process data in place instead of moving it)
The interesting point is that these three strategies are not mutually exclusive. Real systems combine all three. They use fast HBM (strategy 1), reduce the bytes to read with quantization (strategy 2), and keep data inside on-chip SRAM for reuse (close to strategy 3). The one a developer can directly control is mostly strategy 2. Even when the hardware provides strategies 1 and 3, practicing "move less" through software choices — quantization, cache management, data locality — is on us. That is why the memory wall is not only a chip designer's problem but every AI developer's problem.
9. Implications a Developer Should Grasp
Even for a developer who does not design chips, understanding the memory wall is a practical weapon.
- Inference is usually memory-bound. So before buying a faster chip, saving bandwidth first via quantization, KV-cache management, and batching is more cost-effective.
- Quantization reduces memory and raises speed at the same time. Seriously evaluate FP8/INT8 serving.
- Context length is not free. Long context directly consumes memory and bandwidth through the KV cache. Question whether you truly need that length.
- Distinguish capacity from bandwidth. "Does the model fit" and "is it fast enough" are different questions.
- Be conscious of data locality. Structuring code to reuse the same data reduces expensive memory round trips.
Memory optimization checklist
When inference cost is a concern, checking the following in order — before swapping the chip — is highly effective.
- Is this workload memory-bound or compute-bound? (confirm by profiling)
- Have you applied low-precision (FP8/INT8) serving? Have you validated accuracy?
- Is there room to compress the KV cache or lower its precision?
- Are you using only the context length you truly need? Any excess context?
- Are you batching so multiple requests share the weight reads?
- Is there inefficiency from repeatedly reading the same data?
Most items on this checklist can be applied with software alone, without more expensive hardware. To understand the memory wall is to instinctively run through these items.
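The batching item on the checklist deserves a quick illustration, because it is the cheapest way to raise arithmetic intensity. The sketch below assumes a dense model, ignores the KV cache and overhead, and takes the step time as the larger of the weight-read time and the compute time. All figures (70 billion FP16 parameters, 1 PFLOP/s, 2 TB/s) are hypothetical.

```python
def batched_decode_tokens_per_second(batch_size: int, num_params: float,
                                     bytes_per_param: float,
                                     peak_flops: float,
                                     bandwidth_bytes_per_s: float) -> float:
    """Aggregate decode throughput when one weight read serves a whole batch."""
    # Weights are read once per step, regardless of batch size.
    memory_seconds = num_params * bytes_per_param / bandwidth_bytes_per_s
    # About 2 FLOPs per parameter per generated token.
    compute_seconds = 2 * num_params * batch_size / peak_flops
    return batch_size / max(memory_seconds, compute_seconds)


NUM_PARAMS = 70e9    # hypothetical dense model size
PEAK = 1.0e15        # hypothetical peak compute, FLOP/s
BANDWIDTH = 2.0e12   # hypothetical memory bandwidth, bytes/s

for batch in (1, 8, 64, 500, 1000):
    rate = batched_decode_tokens_per_second(batch, NUM_PARAMS, 2.0,
                                            PEAK, BANDWIDTH)
    print(f"batch {batch:>4}: about {rate:,.0f} tokens/s in total")
```

In this idealized model, throughput grows linearly with batch size until the batch reaches about 500, where the compute roof takes over and further batching stops helping. In a real service the KV cache of each request also has to be read, so the crossover arrives earlier and depends on context length.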
10. Frequently Asked Questions
Q. Will just buying a chip with large memory capacity solve the memory problem? A. No. Capacity and bandwidth are different resources. Even if capacity is large enough for the model to fit, token generation is slow if bandwidth is insufficient. In memory-bound inference, bandwidth often directly determines speed.
Q. Doesn't quantization hurt accuracy? A. Lowering precision does carry a risk of loss, but the sophisticated quantization techniques of 2026 keep that loss very small. In many cases it is hard for users to even perceive, and the speed and cost gains in exchange are far larger. That said, validating quality on your own tasks is essential.
Q. How do I know whether I am memory-bound or compute-bound? A. Look at memory-bandwidth utilization and compute-unit utilization with a profiling tool. If bandwidth is nearly saturated while the compute units are idle, you are memory-bound. Generally, an LLM's token generation stage is memory-bound, while the prefill stage of a long prompt is closer to compute-bound.
Q. Why is using long context expensive? A. Long context grows the KV cache, and that cache consumes capacity and bandwidth at once. On top of that, the grown cache must be re-read at every token generated, so per-token cost rises as generation gets longer. Questioning whether you truly need that context length is the starting point of cost reduction.
Q. Are wafer-scale chips the right answer to the memory wall? A. They are one interesting approach but not the only answer. Eliminating HBM round trips with a giant on-chip SRAM is powerful, but it does not fit every workload and budget. HBM generation updates, quantization, and in-memory and photonic research are several branches advancing toward the same problem at once.
Closing
The real bottleneck that divides AI performance is not the compute capability we usually imagine, but memory. In an era where compute is cheap and data movement is expensive, both good hardware and good code evolve toward reducing data movement. New memory like HBM4, software techniques like quantization, kernel designs like exploiting on-chip SRAM, and future research like in-memory and photonic computing all aim at this single problem.
What we as developers should do is not be dazzled by flashy TFLOPS numbers, but ask "is the real bottleneck of this workload compute or memory?" That one question can cut your inference cost in half. Understanding the memory wall is fundamental for every developer working with AI systems in 2026.
Let me close with one sentence to remember: compute is cheap, data movement is expensive. Engrave this single line, and whether a new chip or a new model appears, you will see through to the principle operating beneath it. The names of chips change every year, but this principle does not. And the developer who knows it can build a faster, cheaper AI system on the same budget.
References
- The roofline model (introductory paper, ACM): https://dl.acm.org/doi/10.1145/1498765.1498785
- NVIDIA Blackwell architecture: https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/
- Cerebras (wafer-scale engine): https://www.cerebras.ai/
- Lightmatter (photonic computing): https://lightmatter.co/
- Google Cloud TPU: https://cloud.google.com/tpu
- arXiv (AI systems and memory research): https://arxiv.org/list/cs.AR/recent
- SemiAnalysis (memory and AI infrastructure analysis): https://www.semianalysis.com/