Author: Youngju Kim (@fjvbn20031)
- Introduction
- 1. The Character of Inference Workloads
- 2. GPU — King of Generality and Ecosystem
- 3. Google TPU — A Tensor-Specialized Systolic Array
- 4. Cloud In-House Inference ASICs — Quiet, Rapid Growth
- 5. Trade-off Comparison Table
- 6. Compilers and Software Stacks — CUDA vs XLA
- 7. The Economics of ASICs — NRE and Break-even
- 8. Selection Criteria — What Should You Choose
- 9. Frequently Asked Questions
- 10. The Future
- 11. A Practical Summary for Developers
- Closing
- References
Introduction
Inference is fundamentally different work from training. Training pushes huge batches at once and burns the GPU at close to 100 percent — it is throughput-oriented work. Inference receives user requests intermittently, where response latency is the user experience itself, and it calls the same model hundreds of millions of times. Because of this difference, in inference the more important question is not "which chip is fastest" but "which chip is cheapest per token and most efficient per watt."
In 2026, three kinds of chips compete over this inference market: the GPU, armed with generality and ecosystem; Google's TPU, specialized for tensor operations; and the ASIC, extremely tailored to a specific workload. This article lays out the strengths and weaknesses of all three, and the criteria for choosing, from a developer's point of view.
Here is the broad arc of this article. First we make clear how the inference workload differs from training; then we examine the three kinds of chips in turn. Next we summarize the trade-offs in a single table, and compare the compiler and software stacks (CUDA and XLA), which matter as much as the hardware. Finally we close with practical selection criteria and a look at the future. To state the core message up front: "the answer depends on how fixed your workload is."
1. The Character of Inference Workloads
First, let us make clear why inference differs from training.
Training vs. inference (key differences)
| Item | Training | Inference |
|---|---|---|
| Batch size | large (thousands) | small (1 to tens), variable |
| Latency | low sensitivity | high sensitivity (user waiting) |
| Compute | high arithmetic intensity | low arithmetic intensity (memory-bound) |
| Precision | BF16/FP8 | FP8/FP4/INT8 |
| Repetitions | few (train once) | many (called endlessly) |
| Goal | throughput | cost per token + latency |
The key point is that inference is often memory-bound. Especially in the autoregressive decoding stage that generates tokens one at a time, the time to read enormous weights from memory exceeds the time to multiply them. So an inference chip's contest is often decided not by "how fast does it multiply" but by "how fast does it read memory, and how little power does it spend doing so."
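The memory-bound argument above can be turned into a back-of-the-envelope bound. The sketch below assumes a deliberately simplified model in which every generated token re-reads all weights exactly once and nothing else (KV cache, activations, overlap) matters. The function name and the example numbers (70e9 parameters, 1 byte per parameter, 2e12 bytes/s of bandwidth) are hypothetical and do not describe any specific chip.

```python
def decode_tokens_per_second_upper_bound(
    num_params: float,
    bytes_per_param: float,
    mem_bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound on single-request decode speed under a bandwidth-only model.

    Assumption: each token requires reading every weight once from memory,
    and compute time is negligible next to that read.
    """
    weight_bytes = num_params * bytes_per_param
    seconds_per_token = weight_bytes / mem_bandwidth_bytes_per_s
    return 1.0 / seconds_per_token


# Hypothetical example: 70e9 params stored in 1 byte each, 2e12 B/s bandwidth.
bound = decode_tokens_per_second_upper_bound(70e9, 1.0, 2e12)
print(f"bandwidth-limited decode bound: {bound:.1f} tokens/s")
```

Note that peak FLOPS does not appear in this bound at all, which is the point of the paragraph above: under this model, only the bytes read per token and the memory bandwidth decide decode speed.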
The two stages of inference — prefill and decode
To understand inference more precisely, you need to know that LLM inference splits into two stages with different characters.
The two stages of LLM inference (concept)
1. prefill (prompt processing)
Processes the entire input in parallel, all at once. Lots of compute
-> closer to compute-bound.
2. decode (token generation)
Generates tokens one at a time, sequentially. Re-reads the enormous
weights for every token -> memory-bound.
Even within one inference, the two stages have different bottlenecks.
This distinction matters because the two stages demand different things from the chip. Prefill demands compute capacity; decode demands memory bandwidth. A good inference system sometimes separates and optimizes the two (for example, placing the two stages on different resources). When picking a chip, too, "is my workload prefill-heavy or decode-heavy" turns out to be a surprisingly important variable. A long prompt with a short answer is prefill-heavy; a short prompt with long generation is decode-heavy.
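One way to make the prefill/decode contrast concrete is a roofline-style estimate of arithmetic intensity (FLOPs per byte moved) for a single weight matrix. This is a minimal sketch under stated assumptions: one dense layer, 2 bytes per value, and a hypothetical chip with 1e15 FLOP/s of peak compute and 2e12 bytes/s of bandwidth. All names and numbers are illustrative, not measurements.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int, bytes_per_value: float = 2.0) -> float:
    """FLOPs per byte for multiplying `tokens` activation rows by a d_in x d_out weight matrix."""
    flops = 2.0 * tokens * d_in * d_out  # one multiply and one add per weight per token
    bytes_moved = bytes_per_value * (
        d_in * d_out               # weights, read once
        + tokens * (d_in + d_out)  # input and output activations
    )
    return flops / bytes_moved


def bottleneck(tokens: int, d_in: int, d_out: int, peak_flops: float, bandwidth: float) -> str:
    """Compare the layer's intensity against the chip's ridge point (peak FLOP/s per byte/s)."""
    ridge = peak_flops / bandwidth
    if arithmetic_intensity(tokens, d_in, d_out) >= ridge:
        return "compute-bound"
    return "memory-bound"


# Hypothetical chip: 1e15 FLOP/s peak, 2e12 B/s bandwidth (ridge point = 500 FLOP/byte).
PEAK, BW = 1e15, 2e12
print("decode  (1 token):    ", bottleneck(1, 8192, 8192, PEAK, BW))
print("prefill (4096 tokens):", bottleneck(4096, 8192, 8192, PEAK, BW))
```

With one token the weights dominate the bytes moved, so intensity stays near 1 FLOP/byte; with thousands of prompt tokens the same weights are reused across every row, which is why prefill lands on the compute side of the ridge.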
2. GPU — King of Generality and Ecosystem
The GPU's strength is not its chip spec but its ecosystem.
- Anything runs: new model structures, custom operations, experimental quantization schemes — almost everything runs immediately on a GPU.
- CUDA as a moat: a decade-plus of accumulated libraries (cuBLAS, cuDNN, CUTLASS), kernels, profilers, and a vast community. Inference serving engines (various inference runtimes) support the GPU first.
- Flexible batching: advanced techniques such as continuous batching, which groups variable-length requests, are mature in the GPU ecosystem.
- A deep talent pool: there are many engineers in the market who know how to work with CUDA, making it easy to staff a team and solve problems. This is an often-underrated, very real advantage.
The GPU's weakness is precisely the price of that generality. A chip designed to do anything is inevitably less efficient per watt than an ASIC that does just one thing. In a scenario of running a fixed model at scale, that inefficiency accumulates into a cost gap.
To summarize this trade-off in one sentence: the GPU is an all-purpose tool that "does anything reasonably well but does no single thing to the extreme." An all-purpose tool is the best choice when you don't yet know what you'll do, and it yields its place to a dedicated tool once the job is decided. As the inference market matures and "decided-job" workloads multiply, the GPU's generality becomes a double-edged sword.
And yet the reason the GPU still dominates inference is that real-world workloads do not get fixed as quickly as you'd think. Models keep improving, new techniques appear, precision formats change. During this period of change, the value of a GPU that "runs anything immediately" more than offsets its cost inefficiency. You could even see the GPU's true strength as not the chip but the insurance it provides against change.
Batching — the hidden secret of GPU inference efficiency
One of the most powerful weapons for pushing up throughput in GPU inference is batching. If you process user requests one at a time, you read the enormous weights once, use them for a single request, and throw them away. That is a tremendous waste. Instead, if you gather several requests and process them all at once with the same weights, multiple requests share the weights you read once.
The effect of batching (concept)
No batching: read weights -> process 1 request -> discard (repeat)
low weight reuse, large memory waste
With batching: read weights -> process N requests at once -> discard
weights read once shared by N -> efficiency soars
The problem is that inference requests differ in length and arrive at different times. Techniques like continuous batching, which handle this efficiently, are mature in the GPU inference ecosystem, and this is one of the GPU's real strengths. That said, batching raises throughput but can slightly increase the latency of individual requests, so it takes operational judgment to balance throughput against latency.
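The throughput-versus-latency balance described above can be sketched with a toy cost model. It assumes one decode step costs the larger of (a) the time to read the weights once and (b) the compute time for all requests in the batch, and it ignores KV-cache traffic, scheduling overhead, and variable request lengths. The `CHIP` numbers are hypothetical.

```python
def step_seconds(batch_size: int, weight_bytes: float, bandwidth: float,
                 flops_per_token: float, peak_flops: float) -> float:
    """Time for one decode step that produces one token for each request in the batch."""
    read_time = weight_bytes / bandwidth                     # weights are read once per step
    compute_time = batch_size * flops_per_token / peak_flops
    return max(read_time, compute_time)


def tokens_per_second(batch_size: int, **chip: float) -> float:
    """Aggregate throughput across the whole batch."""
    return batch_size / step_seconds(batch_size, **chip)


# Hypothetical chip and model, chosen only to make the shape of the curve visible.
CHIP = dict(weight_bytes=70e9, bandwidth=2e12, flops_per_token=1.4e11, peak_flops=1e15)

for n in (1, 64, 512):
    print(f"batch={n:4d}  step={step_seconds(n, **CHIP) * 1e3:6.1f} ms  "
          f"throughput={tokens_per_second(n, **CHIP):8.0f} tok/s")
```

In this model, throughput grows almost for free while the step is bandwidth-limited; once compute time overtakes the weight read, every extra request lengthens the step, and per-request latency starts to pay for the added throughput.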
3. Google TPU — A Tensor-Specialized Systolic Array
The Google TPU is an interesting case that sits between the GPU and the ASIC. It is not as flexible as a general-purpose GPU, but not as rigid as a single-purpose ASIC either. At its core is the systolic array, a grid-shaped compute structure designed for matrix multiplication.
To explain the systolic array intuitively: it is a structure where data flows regularly, like a heartbeat, through compute units arranged in a grid. Once data enters the grid, it does not go back out to external memory; it is passed sideways from compute unit to compute unit, where multiplication and accumulation happen.
systolic array (concept)
input -> [PE]-[PE]-[PE]
          |    |    |
         [PE]-[PE]-[PE]
          |    |    |
         [PE]-[PE]-[PE]
                    |
                  result
PE = compute unit (multiply-accumulate)
Data flows sideways inside the grid -> reduces round-trips to external memory.
Extremely efficient for matrix multiply. High data reuse, low memory load.
The advantage of this structure is its high data-reuse rate. Once data is loaded into the grid, it is used by many operations, so there is less need to read the same data repeatedly from memory. That is why the systolic array fits deep-learning workloads dominated by matrix multiplication. The flip side: for irregular operations that do not fit the grid structure well, efficiency drops. This is also why the TPU is "strong at matrix multiplication but not as flexible as a GPU for arbitrary operations."
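To see the data flow rather than just read about it, here is a small cycle-by-cycle simulation of a systolic array in plain Python. It models one common textbook variant (output-stationary: each PE accumulates one output element while rows of A flow left to right and columns of B flow top to bottom). This is an illustrative sketch only and is not a claim about how the TPU's array is actually wired.

```python
from typing import List

Matrix = List[List[int]]


def systolic_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply a (n x k) by b (k x m) on a simulated n x m grid of PEs."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]    # accumulator held inside each PE
    a_reg = [[0] * m for _ in range(n)]  # A value currently sitting in each PE
    b_reg = [[0] * m for _ in range(n)]  # B value currently sitting in each PE

    for t in range(n + m + k - 2):       # cycles until the last operands meet
        # Pass values to neighbours: A moves one PE right, B moves one PE down.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Feed the edges with skewed inputs: row i and column j start i and j cycles late.
        for i in range(n):
            s = t - i
            a_reg[i][0] = a[i][s] if 0 <= s < k else 0
        for j in range(m):
            s = t - j
            b_reg[0][j] = b[s][j] if 0 <= s < k else 0
        # Every PE does one multiply-accumulate per cycle.
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each input value enters the grid once at an edge and is then handed from PE to PE, so it is reused by a whole row or column of multiply-accumulates without another trip to memory. The skewed feeding is also why shapes that do not fill the grid leave PEs idle.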
TPU v6 Trillium
One core of the 2026 TPU line is the v6 generation, Trillium. Google stated that Trillium raised peak compute performance per chip by about 4.7x over the prior generation. Memory bandwidth and interconnect were strengthened alongside, so it is used for both large-scale training and inference.
Ironwood — the inference-specialized 7th generation
More interesting is the 7th-generation Ironwood, which Google positions as a TPU built for the inference era rather than primarily for training. It focuses on serving giant models at low latency and high power efficiency, which aligns with the 2026 trend of inference capex overtaking training.
Ironwood is symbolic because an entire flagship TPU generation is dedicated to inference. The original TPU v1 was also an inference-only chip, but the generations after it were designed with training as the top priority, with inference following along as an afterthought. Once inference cost overtook training, the priority of chip design flipped. Ironwood marks the inflection point of crossing over from an era of squeezing inference into chips designed for training, to an era of chips designed for inference from the ground up. This is the same flow as NVIDIA's Blackwell taking aim squarely at inference, and it tells us the whole industry is heading in the same direction.
The TPU's trade-offs
- Strengths: extremely efficient at matrix multiplication, excellent scalability in large clusters, tight integration with the Google stack.
- Weaknesses: harder to run arbitrary operations as freely as a GPU, and its ecosystem is Google-cloud-centric, so portability is limited.
To revisit why we called the TPU "between the GPU and the ASIC": the TPU cannot run just anything the way a GPU can, but it is not pinned to a single model the way a single-purpose ASIC is. It is specialized for the broad category of matrix operations, so it handles a variety of models within that category efficiently. This "moderate specialization" is the TPU's identity. From a balance point that is neither too general nor too rigid, it handles both large-scale training and inference at reasonable efficiency. The price, however, is being tied to Google Cloud, which is a clear constraint for organizations with a multi-cloud strategy.
4. Cloud In-House Inference ASICs — Quiet, Rapid Growth
The fastest-growing category in the 2026 inference market is the cloud providers' in-house inference ASICs. The share that ASICs hold of inference workloads is projected to climb steeply from about 15 percent in 2024 to about 40 percent in 2026.
Why do the clouds build their own chips?
- Economics: the workloads they run most in their own data centers (specific recommendation, translation, LLM inference) are fixed. Building an ASIC for a fixed workload can push cost per token and energy per inference well below what a GPU achieves.
- Supply-chain control: they reduce sole dependence on NVIDIA and own their roadmap.
- Vertical integration: designing model, compiler, and chip together maximizes optimization headroom. When you know exactly what the model looks like and then build the chip, extreme optimizations become possible that a general-purpose chip cannot manage.
The power of this vertical integration is easy to underestimate. A general-purpose GPU vendor has to run every model in the world well, so it cannot optimize extremely for any single one. By contrast, a company that builds its own chip for its own model can design precisely for that specific model's compute pattern, precision, and memory access. This integration — one team refining model, compiler, and chip together — is the fundamental reason an in-house ASIC can outdo a general-purpose chip on the same workload.
The ASIC's weakness is clear. It has almost no flexibility. Stray from the model structure or precision assumed at chip-design time, and efficiency drops sharply or it simply will not run. It is unsuitable for the research stage where model structures change quickly, and it shines in sufficiently standardized, fixed bulk inference.
It is worth chewing on what this growth from 15 percent to 40 percent means. For the ASIC share of inference workloads to more than double in just two years is a signal that the market is moving fast toward "give up a little flexibility to cut cost a lot." It also means inference workloads are becoming that much more standardized and fixed. The industry is maturing past the experimental period when models changed often, into a stage of stably serving validated models in bulk.
Inference ASIC share trend (projection, concept)
2024 ### ~15%
2025 ###### (rising)
2026 ######## ~40%
-> The more standardized the workload, the faster the share of
specialized chips grows.
What an inference ASIC does well and poorly
To understand the ASIC more concretely, let us split what it does well from what it does poorly.
Strengths / weaknesses of an inference ASIC (concept)
Does well: - bulk processing of a fixed model at a fixed precision
- maximizing performance per watt and minimizing cost per token
- predictable, stable workloads
Does poorly: - reacting instantly to new model structures
- experimental ops and custom kernels
- workloads that change often
This split is the heart of the ASIC adoption decision. If your workload fits cleanly in the "does well" column, an ASIC delivers overwhelming economics. Conversely, if much of it spills into the "does poorly" column, then however cheap the cost per token may look, frequent redesigns and workarounds can make the total cost larger instead. So an ASIC decision should start not from the chip spec but from a self-diagnosis: how fixed is my workload?
5. Trade-off Comparison Table
A side-by-side look at the three categories.
| Criterion | GPU | TPU | Inference ASIC |
|---|---|---|---|
| Flexibility | very high | medium | low |
| Ecosystem maturity | highest (CUDA) | medium (XLA, Google) | low (vendor lock-in) |
| Perf per watt | medium | high | very high (when fixed) |
| Cost per token | medium | low | lowest (fixed workload) |
| Latency optimization | good | good | very good (when fixed) |
| New-model support | immediate | relatively fast | slow (needs redesign) |
| Portability | high | low | very low |
| Best scenario | research, varied loads | large-scale training and inference | standardized bulk inference |
The one-line reading of this table: as you move right, efficiency and cost improve, but flexibility and portability worsen. How fixed your workload is becomes the key decision variable.
A common mistake when reading this table is deciding from a single cell. For example, look only at "cost per token" and the ASIC is overwhelming, but if you don't also look at "new-model support" and "portability" beside it, you fall into a trap. No matter how cheap the cost per token, if you must redesign the chip every time you change the model, the total cost can actually grow. Every decision must be made by balancing several axes, not just one.
Another caution: the values in this table are not absolute. As compilers mature, the "ecosystem maturity" of TPU and ASIC rises; as a new GPU generation arrives, "perf per watt" changes. Remember that the table is only a snapshot at the 2026 moment — a shifting terrain that gets updated every year.
6. Compilers and Software Stacks — CUDA vs XLA
As important as the hardware is the software stack. The compiler that turns model code into instructions the chip executes governs both performance and productivity.
The CUDA camp
The GPU revolves around CUDA. Developers write models in a high-level framework, and beneath it, libraries like cuDNN and CUTLASS plus custom kernels optimize the operations for the GPU. The core strength is maturity and control. When needed, you can write your own kernel and squeeze out the last drop of performance.
GPU execution flow (concept)
Model code
|
v
Framework graph
|
v
CUDA kernels / cuDNN / CUTLASS <- you can hand-write kernels
|
v
GPU execution
The XLA camp
The TPU and many ASICs revolve around a compiler such as XLA. Developers usually do not write kernels by hand; the compiler sees the whole graph and automatically performs operation fusion, layout optimization, and memory scheduling.
TPU/ASIC execution flow (concept)
Model code
|
v
Graph (the whole thing seen at once)
|
v
XLA compiler <- auto fusion, layout, scheduling
|
v
TPU / ASIC execution
The difference between the two philosophies is clear. CUDA offers "powerful low-level control and a giant ecosystem"; XLA offers "automatic optimization you delegate to the compiler and a clean abstraction." ASIC vendors typically provide their own compiler closer to the latter, and the maturity of that compiler determines the real-world usability of the chip.
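Operation fusion, one of the automatic optimizations mentioned above, is easier to picture with a toy. The sketch below is not how XLA is implemented; it only contrasts running a chain of elementwise ops as separate passes (each writing an intermediate buffer) with applying the whole chain to each element in a single pass. All function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

ElementwiseOp = Callable[[float], float]


def run_unfused(ops: Sequence[ElementwiseOp], data: List[float]) -> Tuple[List[float], int]:
    """One full pass over the data per op; each op materializes a new buffer."""
    buffers_written = 0
    current = data
    for op in ops:
        current = [op(x) for x in current]
        buffers_written += 1
    return current, buffers_written


def run_fused(ops: Sequence[ElementwiseOp], data: List[float]) -> Tuple[List[float], int]:
    """Single pass: the whole op chain is applied to each element before moving on."""
    def fused(x: float) -> float:
        for op in ops:
            x = op(x)
        return x

    return [fused(x) for x in data], 1


# A typical chain: scale, add a bias, then ReLU.
chain = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: max(0.0, x)]
values = [-2.0, 0.5, 3.0]

unfused_out, unfused_buffers = run_unfused(chain, values)
fused_out, fused_buffers = run_fused(chain, values)
print(unfused_out == fused_out, unfused_buffers, fused_buffers)
```

Both paths compute the same result, but the fused path writes one buffer instead of three. On a memory-bound workload that difference in memory traffic is the win, and a compiler that sees the whole graph can find such chains without the developer writing a kernel.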
Why the compiler is the real battleground
Here is a core point I want to emphasize. The real battleground of the inference-hardware contest is not the chip's transistors but the compiler. No matter how outstanding a chip's theoretical performance, if the compiler is too immature to lift the model to that performance, it is useless.
Theoretical perf vs. effective perf (concept)
Chip A: theoretical 100, immature compiler -> effective 40
Chip B: theoretical 80, mature compiler -> effective 70
-> More than the spec-sheet theoretical performance, how much the
compiler draws out determines the actual experience.
This is also why NVIDIA's moat is so solid. CUDA is not a chip but a software stack refined over more than a decade. Even if a new ASIC arrives with better transistors, it takes a long time of grinding on the compiler before it can offer the experience of "bring your model and it just works." That is why, for a chip vendor, the compiler team has become as important as the chip-design team. From a developer's standpoint, too, when evaluating a new accelerator, it is far more important to directly benchmark "how well the compiler draws out performance when I put my model on it" than to read the "theoretical TFLOPS."
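The advice to benchmark your own model rather than read the spec sheet can be sketched as a small harness. The `ChipResult` numbers reuse the illustrative Chip A / Chip B figures from the concept box above. `measure_items_per_second` is a hypothetical helper that times any callable you supply; a real benchmark would also need warm-up runs, realistic inputs, and device synchronization.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ChipResult:
    name: str
    theoretical: float  # spec-sheet performance, arbitrary units
    measured: float     # what your own model achieved, same units

    @property
    def utilization(self) -> float:
        """Fraction of the theoretical number the software stack actually delivered."""
        return self.measured / self.theoretical


def measure_items_per_second(run_batch: Callable[[], object],
                             items_per_batch: int, repeats: int = 5) -> float:
    """Time run_batch() several times and report throughput from the fastest run."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_batch()
        best = min(best, time.perf_counter() - start)
    return items_per_batch / max(best, 1e-9)  # guard against a zero-length timing


def rank_by_measured(results: List[ChipResult]) -> List[ChipResult]:
    """Rank chips by delivered performance, not by the spec sheet."""
    return sorted(results, key=lambda r: r.measured, reverse=True)


results = [ChipResult("Chip A", 100.0, 40.0), ChipResult("Chip B", 80.0, 70.0)]
for r in rank_by_measured(results):
    print(f"{r.name}: measured {r.measured:.0f}, utilization {r.utilization:.0%}")
```

Ranking by `measured` instead of `theoretical` reverses the order of the two example chips, which is exactly the trap that a spec-sheet comparison hides.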
7. The Economics of ASICs — NRE and Break-even
The decision to build your own ASIC rests on interesting economics. Designing the chip and preparing it for production incurs an enormous one-time cost (NRE, Non-Recurring Engineering). To justify that cost, the cumulative workload you process on the chip must be large enough that the accumulated savings in unit cost, relative to the GPU alternative, exceed the NRE.
ASIC break-even (concept)
Axes: cost (vertical) vs. cumulative workload scale (horizontal)
GPU rental: cost rises steadily in proportion to usage
ASIC: large upfront NRE, but low unit cost after
X = break-even point, where the two cost lines cross
ASIC pays off only when the workload is large and fixed enough
to cross the break-even point.
What this graph explains is clear. When the workload is small or changes often, GPU rental is cheaper; when the workload is enormous and fixed, the ASIC becomes cheaper. That is why in-house ASICs are nearly the exclusive domain of hyperscale clouds and AI companies. Only they have workloads large and stable enough to justify the NRE. For most companies the rational choice is to rent that infrastructure. In other words, ASIC economics is ultimately a function of "scale."
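The break-even logic reduces to one line of arithmetic: the ASIC pays off once the volume multiplied by the unit-cost saving exceeds the NRE. The sketch below assumes simple linear costs on both sides. The dollar figures are hypothetical placeholders, not estimates of any real chip program.

```python
from typing import Optional


def total_cost_gpu(volume: float, gpu_cost_per_unit: float) -> float:
    """Pure usage-based cost: grows in proportion to the workload."""
    return volume * gpu_cost_per_unit


def total_cost_asic(volume: float, nre_cost: float, asic_cost_per_unit: float) -> float:
    """One-time NRE plus a lower per-unit cost."""
    return nre_cost + volume * asic_cost_per_unit


def break_even_volume(nre_cost: float, gpu_cost_per_unit: float,
                      asic_cost_per_unit: float) -> Optional[float]:
    """Workload volume at which the two cost lines cross, or None if they never do."""
    saving_per_unit = gpu_cost_per_unit - asic_cost_per_unit
    if saving_per_unit <= 0:
        return None  # the ASIC is not cheaper per unit, so the NRE is never recovered
    return nre_cost / saving_per_unit


# Hypothetical: $100M NRE; $2.00 vs $0.50 per unit of work (say, per million tokens).
NRE, GPU_UNIT, ASIC_UNIT = 100e6, 2.00, 0.50
volume = break_even_volume(NRE, GPU_UNIT, ASIC_UNIT)
print(f"break-even at {volume:,.0f} units of work")
```

Below that volume, renting GPUs is cheaper; above it, each additional unit widens the ASIC's advantage. If the model changes before the volume is reached, the NRE has to be paid again, which is why "fixed" matters as much as "large".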
8. Selection Criteria — What Should You Choose
A checklist for real decisions.
- Does the model structure change often? Yes -> GPU. In research and experimentation, flexibility matters above all.
- Are you running one fixed model at enormous scale? Yes -> ASIC or TPU. The more fixed, the greater a specialized chip's economics.
- Must you port across multiple clouds or on-prem? Yes -> GPU. Its portability and compatibility are overwhelming.
- Are power and cost your biggest constraints? Yes -> seriously evaluate specialized chips (TPU/ASIC).
- Does the team have low-level optimization skill? Yes -> custom kernels on a GPU can deliver big gains. No -> a stack where the compiler handles it is easier.
For most ordinary application teams, starting with a GPU (or a GPU-based managed inference service) is reasonable. Once the workload grows large and fixed enough, a switch to a TPU or ASIC becomes justified by cost savings.
Three real scenarios
Let us plug the abstract criteria into concrete situations.
Scenario A — a startup's new AI feature. It changes the model often, and traffic is hard to predict. The answer is GPU-based managed inference. This is a stage where flexibility and fast experimentation matter more than cost optimization. Bind yourself prematurely to a specialized chip and you get tripped up every time you change the model. Obsessing over cost optimization at this stage is a common mistake. If you harden your infrastructure onto a specialized chip before the product is even validated, that investment becomes a shackle exactly when you need to change direction.
Scenario B — the core inference of a mature service. The model structure has stabilized, and tens of billions of calls arrive daily. Now cost is the business itself. This is the time to move to a TPU, or if possible a specialized chip and low-precision serving matched to that workload. Even a small efficiency gain is enormous in absolute terms. The crux at this stage is the judgment of "is it stable enough?" Only when you can be confident the model and traffic patterns have settled is the shift to specialization safe.
Scenario C — a multi-cloud, on-prem product. You must deploy into different environments per customer. Portability is the top priority, so the GPU is practically the only realistic choice. A particular cloud's TPU or an in-house ASIC cannot be used outside that cloud.
Summarizing these three scenarios in a single table makes the decision even clearer.
| Scenario | Top value | Realistic choice |
|---|---|---|
| New, experimental | flexibility | GPU managed inference |
| Mature, bulk | cost | TPU or specialized chip with low-precision serving |
| Multi-cloud product | portability | GPU |
The common lesson of these three scenarios is that "the answer depends not on a chip's absolute performance but on our stage and constraints."
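The checklist and the three scenarios can be encoded as a small decision helper. This is a sketch of one possible ordering of the rules (portability first, then change rate, then scale). The ordering, the field names, and the returned labels are assumptions made for illustration, not a definitive policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    model_changes_often: bool
    fixed_model_at_huge_scale: bool
    needs_multi_cloud_or_on_prem: bool


def recommend(w: Workload) -> str:
    """Map the article's selection criteria to a starting recommendation."""
    if w.needs_multi_cloud_or_on_prem:
        return "GPU"                      # portability outranks everything else
    if w.model_changes_often:
        return "GPU managed inference"    # flexibility over cost while things move
    if w.fixed_model_at_huge_scale:
        return "TPU or specialized chip"  # cost dominates once the workload is fixed
    return "GPU managed inference"        # default starting point for most teams


scenario_a = Workload(model_changes_often=True, fixed_model_at_huge_scale=False,
                      needs_multi_cloud_or_on_prem=False)
scenario_b = Workload(model_changes_often=False, fixed_model_at_huge_scale=True,
                      needs_multi_cloud_or_on_prem=False)
scenario_c = Workload(model_changes_often=False, fixed_model_at_huge_scale=False,
                      needs_multi_cloud_or_on_prem=True)

for name, w in (("A", scenario_a), ("B", scenario_b), ("C", scenario_c)):
    print(f"Scenario {name}: {recommend(w)}")
```

Writing the criteria down as code forces the team to agree on which constraint wins when two of them conflict, which is usually the real disagreement behind a chip debate.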
9. Frequently Asked Questions
Q. Is the TPU unconditionally more efficient than the GPU? A. When matrix multiplication dominates and the workload fits the TPU well, it can be more efficient. But when there are many irregular operations or the model structure is unusual, the GPU can be better. There is no "unconditionally."
Q. Can our company build an inference ASIC too? A. Technically yes, but economically it is usually irrational. Without a workload large and fixed enough to justify the enormous NRE, renting the cloud's ASIC is far cheaper.
Q. Is being tied to CUDA really a problem? A. It is convenient for now, but over the long run it narrows your bargaining power and options. If you put an abstraction layer on your core inference path so you can swap backends, you leave room to switch when a cheaper option appears in the future.
Q. If the compiler optimizes automatically, why write kernels by hand? A. In most cases the compiler is enough. But on the extreme inference path where performance makes or breaks the business, hand-writing a kernel to squeeze out the last few percent can make a big difference. This is the benefit that the depth of the GPU ecosystem provides.
10. The Future
The directions beyond 2026.
- A continuing rise in ASIC share: the more standardized inference workloads become, the larger the share of in-house ASICs.
- Intensifying compiler competition: compiler and software-stack maturity, more than the chip itself, becomes the battleground. To beat the GPU, an ASIC must deliver an "it just works" experience.
- Advancing abstraction layers: a middle layer that lets you swap backends without binding to specific hardware grows important. As it matures, the barrier to ASIC adoption lowers.
- Mixed operation: it becomes common to mix GPUs and specialized chips within one service, by workload character.
To unpack this mixed operation a bit more: a future inference system most likely will not rely on a single chip. For instance, experimental traffic that changes models often goes to the GPU, the bulk traffic of a stabilized core model goes to a specialized chip, and some paths where latency is extremely critical go to yet another optimized resource.
The future mixed inference infrastructure (concept)
request --+-- experimental/new-model traffic --> GPU (flexibility)
+-- stable core traffic --> specialized chip (cost)
+-- ultra-low-latency path --> optimized resource (latency)
-> Evolves into a structure that routes by workload character,
not a single chip.
In such a structure, the skill a developer needs is less about knowing a particular chip deeply and more about designing the abstraction that classifies workloads by character and routes them to the right resource. In other words, the weight of future inference engineering shifts from the question of "which chip is best" to "which traffic should go where."
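A minimal sketch of such a routing abstraction follows. It assumes a hypothetical `InferenceBackend` interface with a single `generate` method and three named backends matching the concept diagram above. `EchoBackend` is a stand-in so the example runs without any accelerator; real backends would wrap actual serving clients.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


class InferenceBackend(Protocol):
    """Anything that can serve a prompt; callers never see which chip is behind it."""
    def generate(self, prompt: str) -> str: ...


@dataclass
class EchoBackend:
    """Stand-in backend for illustration; it only tags the prompt with its label."""
    label: str

    def generate(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


@dataclass(frozen=True)
class Request:
    prompt: str
    experimental: bool = False
    latency_critical: bool = False


class Router:
    """Classify each request by workload character and pick a backend for it."""

    def __init__(self, backends: Dict[str, InferenceBackend]) -> None:
        self._backends = backends

    def route(self, request: Request) -> str:
        if request.experimental:
            return "gpu"          # new or frequently changing models need flexibility
        if request.latency_critical:
            return "low_latency"  # dedicated latency-optimized resource
        return "specialized"      # stable bulk traffic goes where cost is lowest

    def handle(self, request: Request) -> str:
        return self._backends[self.route(request)].generate(request.prompt)


router = Router({
    "gpu": EchoBackend("gpu"),
    "specialized": EchoBackend("specialized"),
    "low_latency": EchoBackend("low_latency"),
})
print(router.handle(Request("summarize this", experimental=True)))
print(router.handle(Request("translate this")))
```

Because application code only calls `Router.handle`, adding or replacing a backend is a change to the dictionary and the routing rule, not to every call site.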
11. A Practical Summary for Developers
Even a developer who does not pick chips directly has practical takeaways from this comparison.
- Understand the prefill/decode split of your workload. Because the two have different bottlenecks, your optimization direction and chip choice change.
- Apply batching and low-precision serving first. Before swapping chips, there is large efficiency to squeeze out in software on the same chip.
- Benchmark with your own model, not theoretical performance. How much the compiler draws out that performance is the real metric.
- Leave room to swap backends with an abstraction layer. If you don't bind your code deeply to a particular chip, your future options widen.
- Decide according to your stage. Flexibility early (GPU), cost at maturity (TPU/ASIC). Optimization that skips a stage usually backfires.
Just being conscious of these five can make a big difference in accelerator choice and inference cost. And importantly, most of these items can be practiced without deep knowledge of a particular chip — simply by understanding your workload. In other words, the starting point of good inference engineering is not memorizing the chip catalog but knowing your own workload precisely.
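The second takeaway, applying low-precision serving before swapping chips, is easy to quantify for weight storage. The sketch below uses the nominal storage width of each format and ignores per-block scaling factors and other quantization metadata, so real footprints are somewhat larger. The 70e9-parameter model is a hypothetical example.

```python
# Nominal bytes per parameter for the precisions named in this article.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT8": 1.0, "FP4": 0.5}


def weight_gigabytes(num_params: float, fmt: str) -> float:
    """Approximate weight footprint in GB, ignoring scaling-factor overhead."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


def bytes_read_ratio(fmt: str, baseline: str = "BF16") -> float:
    """How many weight bytes a decode step reads, relative to the baseline format."""
    return BYTES_PER_PARAM[fmt] / BYTES_PER_PARAM[baseline]


NUM_PARAMS = 70e9  # hypothetical model size
for fmt in ("BF16", "FP8", "FP4"):
    print(f"{fmt}: {weight_gigabytes(NUM_PARAMS, fmt):6.1f} GB of weights, "
          f"{bytes_read_ratio(fmt):.2f}x the bytes read per decode step")
```

Since decode is often limited by how many weight bytes are read per token, halving the bytes per parameter roughly halves that traffic on the same chip, provided the hardware supports the format and the accuracy loss is acceptable for your model.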
Closing
The GPU vs TPU vs ASIC inference war is not a matter of "who wins" but of "what fits which workload." The GPU holds its place with flexibility and ecosystem, the TPU with balanced efficiency, and the ASIC with the extreme economics of fixed workloads.
What we as developers should remember is simple: the more fixed the workload, the greater the value of a specialized chip; the more change, the more a GPU's flexibility shines. And whichever chip you pick, understanding that inference, and the decode stage in particular, is often memory-bound can cut your cost dramatically. We dig into the identity of that memory bottleneck in the next article.
Finally, let me add one balanced perspective. Flat assertions like "ASIC replaces GPU" or "the GPU is finished" are mostly exaggerations. Reality is far more gradual and coexistent. The GPU holds its place in the realm of change and experimentation, the TPU in balanced large-scale workloads, and the ASIC in standardized bulk inference — each keeping its ground and growing together. More realistic than a future where one chip takes everything is a pluralistic future where you pick and use chips to match the workload. The ability to navigate that pluralistic world well will be a core competency of the post-2026 inference engineer.
References
- Google Cloud TPU: https://cloud.google.com/tpu
- Google Cloud TPU documentation: https://cloud.google.com/tpu/docs
- NVIDIA data center GPUs: https://www.nvidia.com/en-us/data-center/
- OpenXLA project: https://openxla.org/
- AWS in-house inference chip (Inferentia): https://aws.amazon.com/machine-learning/inferentia/
- SemiAnalysis (AI infrastructure analysis): https://www.semianalysis.com/
- arXiv (computer architecture): https://arxiv.org/list/cs.AR/recent