Chaos and Order


Introduction

For a long time we talked about chip progress as "how much smaller did the transistor get." Moore's Law was the rule of thumb that transistor density doubles every 18 to 24 months, and for decades that promise held. Yet in 2026, the question that decides AI accelerator performance is no longer only "what process node is it." An increasingly important question is this: "how do you attach and connect multiple dies inside a single package."

NVIDIA's Blackwell, as of GTC 2026, is a design that combines two enormous dies using TSMC's CoWoS-L packaging. The two dies are bound by a die-to-die link reaching roughly 10 TB/s, so they behave as if they were one chip. AMD's MI300 and MI350 families go a step further, placing several GPU and CPU dies on an interposer with HBM stacks alongside them, forming a chiplet assembly. The competitiveness of an accelerator is now decided not only by the silicon design of a single die, but by how those dies are packaged.

In this post we will look, step by step, at why the monolithic die ran into a wall, what problem the chiplet idea solves, and how CoWoS, 3D stacking, the UCIe standard, and HBM integration mesh together. At the end we will touch on the packaging supply bottleneck centered on TSMC, the physical challenges of heat and power, and the optical-integration future.

The Limits of the Monolithic Die — The Reticle and Yield Walls

Traditionally a single chip was made from one enormous silicon die. We call this a monolithic design. Whether CPU or GPU, every circuit was etched onto a single sheet of silicon, and if you wanted a more powerful chip, you just made the die larger. But this approach has two physical limits.

The Reticle Limit

The first is the reticle limit. Lithography tools have a fixed maximum area they can expose in a single shot. Today's EUV scanners cap a single exposure at roughly 26 mm by 33 mm, or about 858 square millimeters. A die larger than that cannot be made in a single exposure. In other words, the area of a single die has a hard physical ceiling.

NVIDIA's high-end GPUs were already pressed right against this reticle limit. The Hopper-generation H100 was about 814 square millimeters, essentially near the reticle ceiling. There was almost no room left to grow the die. That is precisely why Blackwell was split into two dies. They wanted a bigger chip but could not build it as one die, so they joined two reticle-class dies and used them as one.

The Yield Wall

The second limit is yield. Defects are scattered randomly across a silicon wafer. The larger the die, the higher the probability that at least one defect falls inside it. In a simplified model, as die area grows the good-die fraction drops exponentially.

Good-die yield ≈ exp(-defect density × die area)

Assuming defect density = 0.1 per square cm:

| Die area | Approximate yield |
| --- | --- |
| 100 square mm | about 90% |
| 400 square mm | about 67% |
| 800 square mm | about 45% |

When a die reaches 800 square millimeters, more than half are thrown away as defective. The number of good dies you can get from the same wafer drops sharply, and the cost per chip explodes. A die that is both large and low-yield reaches a point that is economically unsustainable.
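The yield figures above are easy to reproduce. The following is a minimal sketch, assuming the same simplified negative-exponential model and the 0.1 defects per square cm figure used in the table; the function name `poisson_yield` is our own and not taken from any library.

```python
import math


def poisson_yield(area_mm2: float, defect_density_per_cm2: float = 0.1) -> float:
    """Good-die fraction under the simplified model Y = exp(-D * A)."""
    area_cm2 = area_mm2 / 100.0  # 100 square mm = 1 square cm
    return math.exp(-defect_density_per_cm2 * area_cm2)


if __name__ == "__main__":
    for area in (100, 400, 800):
        print(f"{area:>4} square mm -> {poisson_yield(area):.1%}")
```

Run as a script, this prints 90.5%, 67.0%, and 44.9%, matching the rounded values in the table. Real foundry yield models are more elaborate, so treat this only as a way to build intuition.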

These two walls — the reticle limit and the yield wall — are the root cause of the monolithic era's decline.

Chiplets — Slicing the Big Die Into Pieces

The chiplet idea is simple. Instead of making one enormous die, you make several smaller dies divided by function, then connect them at high speed inside a single package. Each small die is called a chiplet.

Monolithic design

+-----------------------------+

| |

| one enormous die |

| (reticle limit, low yield) |

| |

+-----------------------------+

Chiplet design

+--------+ +--------+ +--------+

+--------+ +--------+ +--------+

\ | /

high-speed die-to-die interconnect

This approach has several advantages. First, small dies yield better. As the table showed, a 100-square-millimeter die yields about 90%, so you harvest far more good dies from the same wafer. Second, you can pick only verified good dies, known-good-die, for packaging. You test each die individually before attaching it to the package, screening out defective dies in advance. Third, you can mix dies built on different processes. Compute cores can use the most advanced node while I/O or memory controllers use a more mature and cheaper node, optimizing cost.

AMD is the company that has pushed this chiplet strategy most aggressively. The MI300 and MI350 families combine GPU compute dies, CPU dies, and I/O dies on an interposer, with HBM stacks placed alongside them. An accelerator of a scale impossible as a single die is realized as a combination of verified small dies.

2.5D and 3D — Two Ways to Attach Dies

There are two broad ways to connect chiplets inside a single package: 2.5D stacking and 3D stacking.

2.5D Stacking (CoWoS)

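In the 2.5D approach you place several dies side by side on a thin silicon substrate called an interposer. Fine wiring is etched into the interposer, connecting the dies with very short, dense traces. The dies sit side by side in a plane, as in 2D, but they rest on an extra silicon layer rather than directly on the package substrate, so the arrangement falls between 2D and true 3D stacking. That is why it is called "2.5 dimensional."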

TSMC's CoWoS (Chip-on-Wafer-on-Substrate) is the representative 2.5D technology. As the name says, you put the chip (die) on a wafer (interposer), and then put that on a substrate.

2.5D (CoWoS) cross section

[die A] [HBM stack] [die B]

=================================== <- interposer (silicon)

+---------------------------------+

| substrate |

+---------------------------------+

| | | | <- package pins (BGA, etc.)

3D Stacking

The 3D approach stacks dies upward rather than side by side. You place another die directly on top of a die and connect the upper and lower dies with vertical wiring called TSV (Through-Silicon Via). Stacking vertically shortens the wiring distance and maximizes density per unit area.

3D stacking cross section

[top die]

==================== <- TSV (vertical through wiring)

[bottom die]

+---------------------+

| substrate |

+---------------------+

TSMC's SoIC (System on Integrated Chips) and Intel's Foveros are examples of 3D stacking. Some AMD products stack a cache die on top of the compute die in 3D to dramatically increase cache capacity.

| Aspect | 2.5D (CoWoS, etc.) | 3D stacking (SoIC, Foveros, etc.) |

| --- | --- | --- |

| Die placement | side by side on interposer | die on die, vertically |

| Connection | interposer wiring | TSV vertical through vias |

| Wiring distance | short | very short |

| Thermal management | relatively favorable | difficult (top die hard to cool) |

| Area efficiency | moderate | very high |

| Typical use | GPU + HBM integration | cache stacking, logic stacking |

Interposers and Silicon Bridges — EMIB, InFO

The heart of 2.5D packaging is how you connect the dies. There are several different approaches here.

Full Silicon Interposer

The most intuitive approach is a large silicon interposer that covers every die. CoWoS-S is close to this. A large interposer offers high wiring density and stability, but the interposer itself is affected by the reticle limit. As the package grows, you must stitch several interposers together or use a more sophisticated process. The CoWoS-L that Blackwell used is an evolved approach that combines local silicon bridges with redistribution layers for exactly these large packages.

Silicon Bridge (EMIB)

Intel's EMIB (Embedded Multi-die Interconnect Bridge) embeds a small piece of silicon, a bridge, into the substrate only at the boundary where two dies meet, instead of a large interposer covering everything. Because you put high-density wiring only where it is needed, it can be more favorable than a large interposer in terms of cost and area.

Full interposer vs silicon bridge

Full interposer:

[die A]========[die B]

======entire interposer======

Silicon bridge (EMIB):

[die A]==[bridge]==[die B]

+--only a small bridge embedded in substrate--+

InFO (Integrated Fan-Out)

TSMC's InFO (Integrated Fan-Out) is a fan-out approach that connects dies using a redistribution layer (RDL) without an interposer. It can make a relatively thin and light package, used in mobile and some accelerator products.

In this way interposers, silicon bridges, and fan-out each offer a different balance of cost, wiring density, package size, and thermal behavior. Which one you choose becomes a design decision that directly determines the product's performance and cost.

HBM Integration — Breaking the Memory Wall With Packaging

One of the biggest bottlenecks in AI accelerators is memory bandwidth. You must continuously haul the weights of a huge model into the compute cores, and moving the data itself often consumes more energy and time than the computation. This is commonly called the memory wall.

HBM (High Bandwidth Memory) is the packaging-level answer to this memory wall. HBM is a memory stack of several DRAM dies stacked vertically and connected by TSVs. Placing this stack right next to the accelerator die, on the same interposer, dramatically shortens the distance between memory and compute cores.

HBM integration (2.5D)

[HBM stack] [GPU die] [HBM stack]

(DRAM 4-12 layers) (DRAM 4-12 layers)

===================================== <- interposer

very short, very wide wiring

short distance = high bandwidth + low transfer energy

The shorter the distance you move data, the wider the bus you can use, and the lower the energy per transferred bit. So HBM integration is not simply about attaching a lot of memory — it is a packaging technique that physically places memory close to the compute cores.
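To make "wide bus" concrete, peak bandwidth per stack is simply interface width times per-pin data rate. The sketch below is illustrative only: the 1024-bit width and 6.4 Gbit/s pin rate are assumed, roughly HBM3-class figures, and the eight-stack package is hypothetical rather than a description of any product named in this post.

```python
def stack_bandwidth_gb_per_s(bus_width_bits: int, pin_rate_gbit_per_s: float) -> float:
    """Peak bandwidth of one memory stack in GB/s (pins x per-pin rate, bits to bytes)."""
    return bus_width_bits * pin_rate_gbit_per_s / 8.0


# Assumed, illustrative figures: 1024-bit interface at 6.4 Gbit/s per pin.
per_stack = stack_bandwidth_gb_per_s(1024, 6.4)

# Hypothetical package carrying 8 such stacks next to the compute die.
total_tb_per_s = 8 * per_stack / 1000.0

print(f"per stack: {per_stack:.1f} GB/s, 8 stacks: {total_tb_per_s:.2f} TB/s")
```

A bus this wide is only practical because the interposer keeps the traces short and dense; the same pin count routed across a circuit board would be far harder to build.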

As of 2026, HBM4 is starting to appear. NVIDIA's next-generation Vera Rubin (expected late 2026) is reported to adopt HBM4. As HBM generations advance, you can stack more DRAM layers at higher bandwidth, which means processing larger models faster.

UCIe — A Common Language for Chiplets

The chiplet era brings a new problem. How do you connect dies made by different companies? To attach company A's compute die and company B's I/O die into one package, the die-to-die interconnect specification between them must be standardized.

UCIe (Universal Chiplet Interconnect Express) is an open standard for exactly this problem. If PCIe was the standard interface between chips on a board, UCIe aims to be the standard interface between dies inside a package. With the 2.0 specification following UCIe 1.0, a common language for chiplet-to-chiplet communication is being established, from the physical layer to the protocol layer.

UCIe layer structure (conceptual)

+-------------------------------+

| protocol layer (PCIe/CXL) |

+-------------------------------+

| adapter layer (reliability) |

+-------------------------------+

| physical layer (die-to-die) |

+-------------------------------+

| in-package wiring |

[die A] <-----------> [die B]

The significance of UCIe is not merely a technical specification. Once a standardized die-to-die interface takes hold, a chiplet ecosystem opens up where chip design houses can pick and combine chiplets from different suppliers as if assembling components. Compute from one company, the memory controller from another, I/O from yet another — this kind of heterogeneous combination becomes reality.

Yield and Cost — Small Dies Win

Let us restate why chiplets are not just a neat engineering trick but an economic inevitability. The core is yield and cost.

As we saw, the larger the die area, the more the good-die yield falls exponentially. Build one enormous die and a single defect forces you to scrap the whole die. Split the same function into four small dies, however, and you can scrap only the defective die and keep the rest.

Same total area, different partitioning strategy

Strategy A: one 800-square-mm die

yield about 45% -> over half scrapped

Strategy B: four 200-square-mm dies

each die yields about 82%

pick and combine good dies -> far more efficient

Known-good-die testing adds to this. Small dies can be tested individually before packaging, so only dies confirmed as good are committed to the expensive packaging process. This reduces the waste of discovering a defect at the expensive packaging stage and scrapping the whole thing.

Process mixing is also a big benefit. Not every circuit needs the most advanced process. Compute cores need the density of the latest node, but I/O and memory controllers are fine on a more mature and cheaper node. Chiplet design lets you assign the most cost-effective process to each function.

| Item | Monolithic | Chiplet |

| --- | --- | --- |

| Die size | large (reticle limit) | small |

| Die yield | low | high |

| Scrap on defect | whole die | only that chiplet |

| Pre-test | limited | known-good-die possible |

| Process mixing | not possible | possible |

| Packaging complexity | low | high |

| die-to-die overhead | none | present |

Of course chiplets have a cost too. The die-to-die interconnect adds power and latency, and the packaging itself becomes far more complex and expensive. But in the regime where dies grow large enough, the yield advantage of chiplets overwhelms this overhead.

Packaging Decides Performance

Now we reach the core claim. In the AI accelerators of 2026, packaging is performance.

Think about it: an accelerator's performance is not decided merely by how many compute cores it has. How fast you feed those cores with data, how fast the cores talk to each other, and how wide a bandwidth ties chips together — these determine real workload performance. All of this is the domain of packaging.

Consider the roughly 10 TB/s die-to-die link that joins Blackwell's two dies. If this bandwidth is not enough, the two dies cannot act as one, and software perceives them as two chips. Thanks to this wide link that packaging creates, the two dies appear as a single logical GPU.

The same goes for HBM integration. With the same compute cores, doubling memory bandwidth nearly doubles throughput for memory-bound workloads. Large language model inference is substantially memory-bound, so HBM bandwidth directly determines inference throughput.
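The claim that bandwidth sets throughput for memory-bound work can be sketched with the standard roofline bound: attainable performance is the smaller of peak compute and bandwidth times arithmetic intensity. Every number below is hypothetical and chosen only to show the shape of the relationship.

```python
def attainable_tflops(peak_tflops: float, mem_bw_tb_per_s: float,
                      flops_per_byte: float) -> float:
    """Roofline bound: min(compute peak, memory bandwidth x arithmetic intensity)."""
    return min(peak_tflops, mem_bw_tb_per_s * flops_per_byte)


# Hypothetical accelerator and workload, for illustration only.
PEAK_TFLOPS = 1000.0
LOW_INTENSITY = 2.0      # FLOPs per byte moved: a memory-bound kernel
HIGH_INTENSITY = 1000.0  # FLOPs per byte moved: a compute-bound kernel

base = attainable_tflops(PEAK_TFLOPS, 4.0, LOW_INTENSITY)      # 4 TB/s of bandwidth
doubled = attainable_tflops(PEAK_TFLOPS, 8.0, LOW_INTENSITY)   # 8 TB/s of bandwidth
capped = attainable_tflops(PEAK_TFLOPS, 8.0, HIGH_INTENSITY)   # limited by compute

print(base, doubled, capped)
```

In the memory-bound case, doubling bandwidth doubles the bound; in the compute-bound case, extra bandwidth changes nothing. Real kernels fall short of the bound, which is why the text says "nearly doubles."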

One interesting industry signal to add: 2026 is observed to be the year inference capital expenditure first overtakes training capital expenditure. It means more money is now going into serving models than into building them. For inference, bandwidth and efficiency translate directly into cost, so the importance of HBM and packaging will only grow.

Supply Chain — CoWoS Capacity as the Bottleneck

Here we must address one industrial reality. Advanced packaging cannot be done by just anyone; in practice a small number of foundries hold the capacity. In particular, TSMC's CoWoS capacity is cited as the key bottleneck of AI accelerator supply as of 2026.

The situation is this. Demand for AI accelerators is explosive, yet building those accelerators requires advanced 2.5D packaging like CoWoS. But laying down a new CoWoS line and ramping its yield takes time and enormous investment. As a result, you can make the compute dies but lack the capacity to package them, so shipments are constrained.

The bottleneck in AI accelerator supply

[compute die fab] --> [HBM supply] --> [CoWoS packaging] --> [ship]

^^^^^^^^^^^^^^^

the bottleneck (2026)

While NVIDIA holds roughly 75 to 80 percent accelerator market share, competing products like AMD's MI350X compete for the same packaging capacity. In other words, advanced packaging capacity has become not just a technical matter but a strategic resource that decides who can put how many accelerators into the market. HBM supply is in a similar tension.

This supply-chain perspective has very real implications for anyone designing systems or procuring infrastructure. An accelerator adoption plan can have its timeline driven not just by performance specs but by the availability of packaging and HBM capacity.

Heat, Power, Warpage — The Bill From Physics

Advanced packaging is not free. Cram several dies and HBM stacks into one package, and physics hands you a bill.

Thermal

The biggest challenge is heat. Because enormous power is concentrated in a small area, pulling out the generated heat becomes ever harder. Especially in 3D stacking, the heat of the top die must pass through the bottom die to escape, making heat removal even trickier. So high-performance accelerators increasingly demand sophisticated cooling, and beyond that, liquid cooling.

Power Delivery

The second is power delivery. To supply current stably to a huge bundle of dies, the package and substrate must withstand enormous current. A long current path or high resistance causes voltage drop, which leads directly to lower performance or instability. This is why techniques like backside power delivery, supplying power from the back of the chip, have recently drawn attention.
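A back-of-the-envelope calculation shows why path resistance matters so much. The sketch below applies Ohm's law to hypothetical values (a 1000 W package at 0.8 V with 50 micro-ohms of delivery-path resistance); none of these numbers describe a specific product.

```python
def supply_current_amps(power_watts: float, core_voltage: float) -> float:
    """Current the package must carry to deliver the given power at the core voltage."""
    return power_watts / core_voltage


def ir_drop_volts(current_amps: float, path_resistance_ohms: float) -> float:
    """Voltage lost across the power-delivery path (Ohm's law)."""
    return current_amps * path_resistance_ohms


# Hypothetical values, for illustration only.
POWER_W = 1000.0
VDD = 0.8
PATH_RESISTANCE_OHMS = 50e-6

current = supply_current_amps(POWER_W, VDD)
drop = ir_drop_volts(current, PATH_RESISTANCE_OHMS)

print(f"current: {current:.0f} A, drop: {drop * 1000:.1f} mV ({drop / VDD:.1%} of supply)")
```

Even tens of micro-ohms cost a noticeable share of the supply voltage at these currents, which is the motivation for shortening the current path.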

Warpage

The third, somewhat unexpected, is warpage. Silicon, interposer, and substrate have different coefficients of thermal expansion. As the package repeatedly heats and cools, they expand and contract at different rates, and the whole package warps slightly. The larger the package, the worse this warpage, and in severe cases the fine joints between dies break or reliability problems arise. In large CoWoS packages, warpage management is a very real engineering challenge.

The physical bill of advanced packaging

heat concentration ---> cooling/liquid cooling needed

power concentration --> backside power, thick power grid

CTE mismatch ---------> package warpage, joint reliability

The larger the package, the bigger all three bills.
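The CTE mismatch can be given a rough scale. The sketch below computes the unconstrained difference in expansion between two materials over a temperature swing. The coefficients (about 2.6 ppm/K for silicon, 15 ppm/K assumed for an organic substrate), the 60 K swing, and the 50 mm span are illustrative assumptions. Actual warpage depends on stiffness, layer thickness, and geometry, which this does not model.

```python
def differential_expansion_um(cte_a_ppm_per_k: float, cte_b_ppm_per_k: float,
                              delta_t_kelvin: float, length_mm: float) -> float:
    """Unconstrained length mismatch between two materials, in micrometers."""
    delta_alpha = abs(cte_a_ppm_per_k - cte_b_ppm_per_k) * 1e-6  # per kelvin
    return delta_alpha * delta_t_kelvin * length_mm * 1000.0     # mm to micrometers


# Assumed, illustrative values.
SILICON_CTE = 2.6      # ppm/K
SUBSTRATE_CTE = 15.0   # ppm/K, assumed for an organic substrate
DELTA_T = 60.0         # K
SPAN_MM = 50.0         # package span

mismatch = differential_expansion_um(SILICON_CTE, SUBSTRATE_CTE, DELTA_T, SPAN_MM)
print(f"mismatch over {SPAN_MM:.0f} mm: {mismatch:.1f} micrometers")
```

The mismatch scales linearly with the span, which is one way to see why larger packages are harder: the joints are tens of micrometers in size, the same order as the mismatch itself.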

The Future — Optical Integration and Beyond

There is a limit to connecting dies with electrical wiring. The longer the distance, the more energy an electrical signal uses and the more loss it suffers. Short distances inside a package are fine with electricity, but the longer distances between package and package, board and board, become an increasing burden.

This is why optical integration, optical I/O, is drawing attention. Transmitting signals with light suffers far less loss with distance, and can carry very high bandwidth at lower energy. Research into co-packaged optics, bringing silicon photonics into the package and placing an optical engine right next to the compute die, is active.

Electrical I/O vs optical I/O

Electrical: [die] ===copper wiring=== [die]

loss/energy rises with distance

Optical: [die]--[optical engine]~~~light~~~[optical engine]--[die]

distance-insensitive, high bandwidth, low energy

In the big picture, the evolution of packaging points in one direction: bringing compute, memory, and communication physically ever closer together. Memory came next to compute via HBM, dies gathered into one package via chiplets, and the next step is integrating communication (I/O) into the package via light. If Moore's Law pushed performance by shrinking transistors, in the world beyond it integration pushes performance.

A Developer and System Perspective

What does all this hardware talk mean for someone who writes software? More directly than you might think.

First, code that is conscious of data movement matters more and more. As long as the memory wall is real, reducing data movement often yields more performance than reducing computation. Patterns that keep data near the cores and reuse it — tiling, fusion, cache-friendly access — interlock with the hardware's packaging structure to determine performance.

Second, being conscious of die and chip boundaries matters. In an accelerator where two dies are bound by a die-to-die link, like Blackwell, communication crossing the link is more expensive than communication inside a die. In multi-GPU and multi-die environments, how you split and place a workload decides performance. Understanding the topology of chip-to-chip interconnects like NVLink or UALink becomes the starting point for optimizing distributed training and inference.

Third, the perspective of infrastructure procurement and capacity planning. The CoWoS and HBM capacity bottleneck we saw directly affects accelerator availability and price. If you are planning a large-scale inference service, you must consider not only performance specs but supply availability and lead time.

Three things for developers to remember

1. data movement is more expensive than compute (memory wall)

2. die-to-die / chip-to-chip is more expensive than intra-die

3. packaging/HBM capacity decides availability and cost

More concretely, here is a checklist you can run through when reasoning about performance.

Packaging-aware optimization checklist

[ ] Did you measure whether the workload is memory-bound or compute-bound first

[ ] For inference, do you know how much HBM bandwidth the KV cache consumes

[ ] Did data reuse (tiling, fusion) cut down HBM round trips

[ ] On a multi-die accelerator, did you minimize traffic crossing the die-to-die link

[ ] Does the tensor/pipeline parallel split match the chip-to-chip interconnect topology

[ ] Did you overlap communication with compute to hide link latency

[ ] At procurement time, did the schedule account for packaging/HBM capacity lead time

[ ] On an accelerator generation change, did you review how HBM capacity/bandwidth shifts affect batch size

The earlier items in this checklist live at the code level; the later ones at the system and procurement level. The interesting part is that the two are increasingly inseparable. Which accelerator you can secure decides which parallelization strategy is possible, and that parallelization strategy in turn decides the communication pattern on top of the packaging structure. The deeper the physical integration of the hardware, the more deeply software optimization must be conscious of that physical structure.
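For the KV cache item in the checklist, a quick estimate goes a long way. The sketch below uses the usual accounting of one key and one value vector per layer per token. The model shape (80 layers, 8 KV heads, head dimension 128, 2-byte values) and the serving parameters are hypothetical.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache stored per token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_read_gb_per_s(bytes_per_token: int, context_len: int, batch_size: int,
                     decode_steps_per_s: float) -> float:
    """HBM read traffic if the full KV cache is re-read at every decode step."""
    return bytes_per_token * context_len * batch_size * decode_steps_per_s / 1e9


# Hypothetical model shape and serving parameters.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
traffic = kv_read_gb_per_s(per_token, context_len=8192, batch_size=16, decode_steps_per_s=20.0)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV read traffic:    {traffic:.0f} GB/s")
```

Comparing that traffic figure against the accelerator's HBM bandwidth tells you how much headroom remains for reading the weights themselves.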

Working the Yield Math — One 800-Square-mm Die vs Four 200-Square-mm Chiplets

Earlier we said "small dies win," but the claim does not really land until you push the numbers through yourself. Let us take two strategies of the same total area and compute them to the end under identical assumptions.

The assumptions are simple. Defect density is 0.1 per square centimeter, and the yield model is the same negative-exponential model we used before. 800 square millimeters is 8 square centimeters; 200 square millimeters is 2 square centimeters.

Shared assumptions

defect density D = 0.1 / square cm

yield model Y = exp(-D × A), A is die area in square cm

Strategy A — one monolithic 800-square-mm die

A = 8.0 square cm

Y = exp(-0.1 × 8.0) = exp(-0.8) ≈ 0.449

-> good-die probability about 44.9%

Strategy B — four 200-square-mm chiplets (800 square mm total)

each chiplet A = 2.0 square cm

per-chiplet yield = exp(-0.1 × 2.0) = exp(-0.2) ≈ 0.819

-> per-chiplet good probability about 81.9%

There is one trap to watch here. If you simply multiply "all four chiplets are good" in strategy B, you get 0.819 to the fourth power, about 45%, which looks no better than monolithic. But that framing is the trap. The real advantage of chiplets is known-good-die selection: you discard defective chiplets before packaging and assemble only good ones, so you never need "all four good at once."

Good silicon harvested from the same wafer (intuitive comparison)

If usable silicon area on one wafer is 100:

Strategy A (800-square-mm die):

fewer dies fit, and only about 45% of them are good

-> every scrapped die throws away a full 800 square mm

Strategy B (200-square-mm chiplet):

four times as many dies fit in the same area

about 82% are good, and a defect scraps only 200 square mm

-> the scrap unit shrinks to a quarter, so effective good area rises sharply

In numbers: the expected scrapped silicon area is about 0.55 × 800 = 440 square millimeters per die under strategy A, but only about 0.18 × 200 = 36 square millimeters per chiplet under strategy B. Building the same total area needs four chiplets, so 36 × 4 = 144 square millimeters, still far below strategy A's 440. On an equal-area basis, the chiplet route throws away roughly a third as much silicon.
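The scrap comparison can be checked in a few lines. This sketch reuses the same assumptions as the worked example (0.1 defects per square cm, negative-exponential yield); the helper names are our own.

```python
import math


def poisson_yield(area_mm2: float, defect_density_per_cm2: float = 0.1) -> float:
    """Good-die fraction under Y = exp(-D * A), with A converted to square cm."""
    return math.exp(-defect_density_per_cm2 * area_mm2 / 100.0)


def expected_scrap_mm2(die_area_mm2: float, dies_needed: int) -> float:
    """Expected scrapped silicon per set of dies started, in square mm."""
    return dies_needed * die_area_mm2 * (1.0 - poisson_yield(die_area_mm2))


scrap_a = expected_scrap_mm2(800, 1)  # strategy A: one 800-square-mm die
scrap_b = expected_scrap_mm2(200, 4)  # strategy B: four 200-square-mm chiplets

print(f"strategy A: {scrap_a:.1f} square mm scrapped")
print(f"strategy B: {scrap_b:.1f} square mm scrapped")
print(f"ratio B/A:  {scrap_b / scrap_a:.2f}")
```

With unrounded yields the results are about 440.5 and 145.0 square millimeters, a ratio of about 0.33, consistent with the rounded figures in the text.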

Then packaging cost enters. Chiplets require an interposer, extra bonding, and inspection steps, so packaging cost per die is higher than monolithic. The conclusion is therefore area-dependent. When dies are small, monolithic is cheaper; as the die approaches the reticle limit, the yield advantage of chiplets overwhelms the added packaging cost. Giant accelerators like Blackwell or MI350 go chiplet precisely because they have crossed this break-even point.
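The break-even idea can be sketched with a toy cost model. Everything here is an assumption made for illustration: the silicon cost of 0.2 units per square mm, the fixed packaging premium of 30 units, the four-way split, and the omission of assembly yield and test cost. Only the yield model comes from the text.

```python
import math


def poisson_yield(area_mm2: float, defect_density_per_cm2: float = 0.1) -> float:
    """Good-die fraction under Y = exp(-D * A), with A converted to square cm."""
    return math.exp(-defect_density_per_cm2 * area_mm2 / 100.0)


def monolithic_cost(total_area_mm2: float, cost_per_mm2: float = 0.2) -> float:
    """Silicon cost per good monolithic die (arbitrary units)."""
    return cost_per_mm2 * total_area_mm2 / poisson_yield(total_area_mm2)


def chiplet_cost(total_area_mm2: float, n_chiplets: int = 4,
                 cost_per_mm2: float = 0.2, packaging_premium: float = 30.0) -> float:
    """Cost of n known-good chiplets plus an assumed fixed packaging premium."""
    chiplet_area = total_area_mm2 / n_chiplets
    silicon = n_chiplets * cost_per_mm2 * chiplet_area / poisson_yield(chiplet_area)
    return silicon + packaging_premium


def break_even_area_mm2(step: int = 10, limit: int = 1000):
    """Smallest total area, on a coarse grid, where the chiplet route is cheaper."""
    for area in range(step, limit + 1, step):
        if chiplet_cost(area) < monolithic_cost(area):
            return area
    return None


print("break-even near", break_even_area_mm2(), "square mm")
```

With these particular assumptions the crossover lands near 400 square millimeters. A different packaging premium or defect density moves it, which is exactly the point: the answer depends on area and on process maturity, not on a universal rule.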

A Deeper Look at UCIe — Standard Package vs Advanced Package

Earlier we introduced UCIe as "a common language for chiplets." Now let us go one level deeper. UCIe's physical layer defines two broad package classes: the standard package and the advanced package.

The standard package connects dies on an ordinary organic substrate at a relatively generous bump pitch. Wiring density is lower, but manufacturing is cheap and easy. The advanced package connects on a high-density medium such as a silicon interposer or bridge, at a much finer bump pitch. Manufacturing is expensive, but it can route far more wires per unit of edge length, the shoreline.

The key concept here is shoreline bandwidth. The bandwidth of a die-to-die connection is constrained not by die area but by the length of the edge where two dies meet. So the industry measures bandwidth density as "gigabytes per second per millimeter of edge."

Shoreline bandwidth (die-to-die bandwidth per mm of edge)

+----------+

| die A |

+----------+

^^^^^^^^^^^^ <- only this much edge length can carry wiring

the touching edge = shoreline

UCIe standard package: roughly tens of GB/s per mm of edge

UCIe advanced package: roughly hundreds of GB/s per mm of edge and up

For the same edge, the advanced package is an order of magnitude denser
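Shoreline arithmetic is linear, so it is easy to turn a target bandwidth into a required edge length. The densities below (50 and 500 GB/s per mm) are assumed round numbers picked from the "tens" and "hundreds" ranges above, not values from the UCIe specification, and the 10 TB/s target is a hypothetical design goal.

```python
def link_bandwidth_gb_per_s(edge_mm: float, density_gb_per_s_per_mm: float) -> float:
    """Die-to-die bandwidth available along a shared edge."""
    return edge_mm * density_gb_per_s_per_mm


def required_edge_mm(target_gb_per_s: float, density_gb_per_s_per_mm: float) -> float:
    """Edge length needed to reach a target bandwidth at a given density."""
    return target_gb_per_s / density_gb_per_s_per_mm


# Assumed, illustrative densities in GB/s per mm of edge.
STANDARD_DENSITY = 50.0
ADVANCED_DENSITY = 500.0

TARGET_GB_PER_S = 10_000.0  # hypothetical 10 TB/s die-to-die goal

standard_edge = required_edge_mm(TARGET_GB_PER_S, STANDARD_DENSITY)
advanced_edge = required_edge_mm(TARGET_GB_PER_S, ADVANCED_DENSITY)

print(f"standard package: {standard_edge:.0f} mm of edge")
print(f"advanced package: {advanced_edge:.0f} mm of edge")
```

Under these assumptions the standard package would need far more edge than a reticle-sized die has on one side, while the advanced package fits comfortably. That is the practical meaning of being shoreline-limited.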

When signals must travel far, a retimer becomes necessary. Electrical signals on a standard package attenuate with distance, so a retimer chip reshapes and re-amplifies the signal partway, extending reach. A retimer adds latency and power, though, so where possible it is better to keep distances short with an advanced package.

Placing UCIe alongside competing interconnects clarifies where it sits. NVLink-C2C and Infinity Fabric are proprietary specifications from NVIDIA and AMD respectively, highly optimized inside their own ecosystems. UCIe aims to compete on performance while, as an open standard, opening a path to mixing chiplets from different suppliers.

[Table: UCIe compared with NVLink-C2C and Infinity Fabric]

The table makes it look as if UCIe will replace the proprietary specs outright, but reality is more nuanced. A proprietary spec lets one company optimize everything from the physical layer to the software stack, so it tends to hold the performance lead for a while. UCIe's real weapon is not raw performance but standardization itself — turning chiplets into components that can be bought and sold on the market.

Packaging Techniques at a Glance — 2.5D, 3D, InFO, EMIB, Foveros

We have now introduced many packaging approaches, so let us gather them into one table. The key axes are stacking direction, bump pitch (finer means higher wiring density), relative cost, and typical use.

[Table: packaging techniques (2.5D, 3D, InFO, EMIB, Foveros) by stacking direction, bump pitch, relative cost, and typical use]

Read the table across and one trend appears: as cost rises, bump pitch tightens and wiring density climbs. More expensive packaging means shorter, wider die-to-die connections, which translate directly into bandwidth and efficiency. Hybrid bonding sits at the end of this trend: it bonds copper pads directly without bumps, creating extremely dense connections at pitches of a few microns, with roadmaps heading toward the sub-micron class.

The designer is, in effect, choosing a point on this table. Save cost but give up bandwidth, or pour in cost for the densest possible connection. The nature of the product — mobile or datacenter accelerator — drives that choice.

CoWoS Capacity and the Supply Ramp — Why Blackwell and MI350 Supply Is Coupled

We said earlier that CoWoS capacity is the bottleneck, but placing it on a time axis makes the severity clearer. The core point is simple: the rate at which advanced packaging capacity grows cannot keep up with the rate at which AI accelerator demand explodes.

CoWoS capacity ramp vs demand (conceptual trend, 2024-2026)

demand ---------------------------/

capacity ------------/------/

/ /

2024 2025 2026

------------------------------------

gap (demand - capacity) = shipment constraint = allocation

To grow capacity you must lay down new interposer fabrication lines, bonding equipment, and inspection tools, then ramp the yield — all of which takes months to years of lead time and enormous capital. So even though foundries have aggressively expanded CoWoS capacity from 2024 through 2026, the added capacity gets absorbed almost as fast as it appears.

This bottleneck spills straight into product supply. NVIDIA's Blackwell and AMD's MI350 family both depend on the same kind of advanced 2.5D packaging and HBM. In other words, both companies' flagship products compete for the same constrained resource. No matter how fast you fabricate the compute dies, without securing packaging slots and HBM volume you cannot ship the finished product.

How supply gets coupled

compute die (relatively comfortable)

HBM volume (tight) ------+

| |

CoWoS slot (bottleneck) -+--> both must be secured to ship

finished accelerator (rationed by allocation)
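The coupling in the diagram is a minimum over the constrained inputs. The sketch below uses hypothetical monthly supply figures and a hypothetical package recipe (two compute dies and eight HBM stacks per accelerator); the function name and defaults are our own.

```python
def shippable_units(compute_dies: int, hbm_stacks: int, cowos_slots: int,
                    dies_per_package: int = 2, stacks_per_package: int = 8) -> int:
    """Finished accelerators limited by the scarcest of dies, HBM, and packaging slots."""
    return min(
        compute_dies // dies_per_package,
        hbm_stacks // stacks_per_package,
        cowos_slots,
    )


# Hypothetical monthly supply, for illustration only.
units = shippable_units(compute_dies=1000, hbm_stacks=3200, cowos_slots=300)
print("shippable accelerators:", units)

# Adding compute dies alone does not help while packaging is the constraint.
more_dies = shippable_units(compute_dies=2000, hbm_stacks=3200, cowos_slots=300)
print("with twice the dies:   ", more_dies)
```

Here packaging slots cap output at 300 units even though the dies would support 500 and the HBM 400, and doubling die supply changes nothing.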

The resulting reality is allocation. Wanting to buy an accelerator does not mean you can buy it immediately; volume is distributed by the priority the supplier sets. Large cloud providers secure supply first, and the rest of demand queues behind them. For anyone procuring infrastructure, this allocation structure and lead time become variables as important as the performance spec.

Closing

People say Moore's Law is over, but more accurately the lead actor on stage has changed. The race to make transistors smaller still goes on, but the larger performance gains increasingly come from how you split, attach, and connect dies — that is, from advanced packaging.

The monolithic die was blocked by the reticle and yield walls, and the chiplet that emerged as the answer opened a new economics of verifying and combining small dies. CoWoS and 3D stacking bound these dies into something like one chip, HBM broke the memory wall with packaging, and UCIe is paving the way for chiplets to talk in a common language. Behind all of it stand the supply bottleneck of CoWoS capacity and the bill from physics: heat, power, and warpage.

The 2026 AI accelerator competition is less a fight over who designs the faster core and more a fight over who can integrate more cleverly and package more reliably. The world beyond Moore's Law is, in the end, the world of packaging.

References

- [TSMC — 3DFabric / Advanced Packaging](https://www.tsmc.com/)

- [NVIDIA — Data Center GPU Platforms](https://www.nvidia.com/)

- [AMD — Instinct Accelerators](https://www.amd.com/)

- [UCIe — Universal Chiplet Interconnect Express](https://www.uciexpress.org/)

- [Intel — Foveros / EMIB Advanced Packaging](https://www.intel.com/)

- [SemiAnalysis — Semiconductor Supply Chain and Packaging Analysis](https://www.semianalysis.com/)

- [IEEE Spectrum — Chiplets and Advanced Packaging](https://spectrum.ieee.org/)

- [Chips and Cheese — Microarchitecture Deep Dives](https://chipsandcheese.com/)

- [Synopsys — Multi-Die Systems / UCIe IP](https://www.synopsys.com/)