AI Interconnect: NVLink, NVSwitch, UALink, and the Art of Scaling Up

Introduction

When we talk about the compute performance of a single GPU, we tend to fixate on numbers like TFLOPS. Yet as we enter an era of harnessing thousands or even tens of thousands of GPUs to train a single colossal model, what actually decides the performance of the whole system is no longer the raw compute of any individual GPU. It is the **interconnect** — the communication path that stitches GPU to GPU.

To put it bluntly, the large-scale AI infrastructure race of 2026 is not a contest over who builds the fastest chip. It is a contest over who can lash the largest number of chips together, losslessly, so they behave as one. That is why NVIDIA put a rack-scale system called GB200 NVL72 front and center in the Blackwell generation, why AMD, Broadcom, Google and others formed the UALink Consortium, and why Google brought optical circuit switches (OCS) into its TPUs. All of them point at the same problem: how do we shrink the communication bottleneck?

This article starts from why communication becomes the bottleneck in distributed training and inference, then walks through the distinction between scale-up and scale-out, the principles of NVLink and NVSwitch, NVLink domains like GB200 NVL72, open alternatives such as UALink and Ultra Ethernet, network topologies and collective communication, and finally the overlap of compute and communication. I write this from the vantage point not of a hardware engineer but of a developer who trains and serves models, with the goal of answering one nagging question: "Why doesn't my training get faster when I add more GPUs?"

The communication bottleneck in distributed training and inference

Let us build intuition first. When a single model is split across multiple GPUs for training, each GPU finishes its share of the computation and must then exchange results with the others. In data parallelism, each GPU computes gradients on a different batch of data, and those gradients must be summed and averaged across all GPUs. That operation is the famous all-reduce. In tensor parallelism, a single matrix multiply is carved across several GPUs, so intermediate results must be exchanged at every layer.

The trouble is that compute keeps getting faster while communication does not keep pace. The floating-point throughput of GPUs has climbed steeply generation over generation, but the bandwidth for shipping data off the chip cannot follow at the same rate, owing to physical constraints. As a result the compute-to-communication ratio worsens, and the time a GPU spends idle, waiting on communication, grows.

We can capture this with a simple model.

    time per step ≈ max(compute time, comm time)    (when perfectly overlapped)
    time per step ≈ compute time + comm time        (when not overlapped at all)
    comm time     ≈ bytes to transfer / effective bandwidth + latency

The key phrase here is "effective bandwidth." There is always a gap between the peak bandwidth printed in a spec sheet and the bandwidth you actually realize when you run a collective. Topology, message size, the algorithm the communication library chooses, and contention with other traffic all conspire to create that gap.
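The simple model above can be turned into a few lines of Python to play with. This is only a sketch: the `overlap` fraction and the linear interpolation between the two extremes are assumptions added here for illustration, and the example numbers are made up.

```python
def comm_time(bytes_to_transfer: float, effective_bw: float, latency: float = 0.0) -> float:
    """Seconds to move `bytes_to_transfer` at `effective_bw` bytes/s, plus a fixed latency."""
    return bytes_to_transfer / effective_bw + latency


def step_time(compute_s: float, comm_s: float, overlap: float) -> float:
    """Step time for an overlap fraction in [0, 1] (assumed linear interpolation).

    overlap = 0 gives compute + comm (nothing is hidden).
    overlap = 1 gives max(compute, comm) (as much as possible is hidden).
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    hidden = overlap * min(compute_s, comm_s)
    return compute_s + comm_s - hidden


# Illustrative numbers only: 1.0 s of compute, 10 GB to move at 50 GB/s effective.
comm_s = comm_time(10e9, 50e9)
for overlap in (0.0, 0.5, 1.0):
    print(f"overlap={overlap:.1f}: step time {step_time(1.0, comm_s, overlap):.2f} s")
```

Lowering `effective_bw` in the example shows how quickly communication starts to dominate once it no longer fits under the compute time.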

In large-scale training, communication is by no means a small share. At the scale of thousands of GPUs, when you add up data-parallel all-reduce and tensor-parallel communication, an untuned setup can see communication devour anywhere from 30 percent to more than half of total step time. The same holds for inference. When you spread a single giant model across many GPUs to serve it, the latency to generate one token is heavily governed by tensor-parallel communication latency. And in 2026, with inference capex overtaking training capex for the first time, the communication efficiency of the inference path has become a direct cost issue.

Scale-up vs scale-out

The first axis for understanding interconnect is the distinction between scale-up and scale-out.

**Scale-up** means increasing the number of GPUs within a single tightly coupled domain. Several GPUs inside one server node, or dozens of GPUs inside a single rack, are bound together with very fast dedicated links so that they act like one enormous GPU. NVIDIA's NVLink and NVSwitch are the canonical examples, and UALink is the open alternative in this space. Inside a scale-up domain, bandwidth is very high and latency very low, so you can lean heavily on communication-intensive parallelism schemes like tensor parallelism.

**Scale-out** means connecting many nodes and many racks over a datacenter network to reach hundreds, thousands, or tens of thousands of GPUs. InfiniBand or Ethernet are used here, and on the open-standard side the Ultra Ethernet Consortium (UEC) targets this domain. Scale-out links have lower bandwidth and higher latency than scale-up links. So it is customary to place less communication-intensive schemes — data parallelism or pipeline parallelism — at this tier.

The two axes compare as follows.

| Aspect | Scale-up | Scale-out |
| --- | --- | --- |
| Scope | Within a node or rack | Across nodes, racks, datacenter |
| Key tech | NVLink, NVSwitch, UALink | InfiniBand, Ethernet, Ultra Ethernet |
| Bandwidth | Very high (TB/s class per GPU) | Lower (hundreds of Gb/s per port) |
| Latency | Very low | Relatively high |
| Suited parallelism | Tensor parallel, expert parallel | Data parallel, pipeline parallel |
| Coupling | Tight (close to shared memory) | Loose (message passing) |

In practice the two are combined hierarchically. You place tensor parallelism inside one NVLink domain, and stitch domains together with data and pipeline parallelism. This layout decides how much you communicate and where, and in the end it governs training efficiency.
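One way to picture the hierarchical layout is to compute which ranks end up in which communication group. The sketch below assumes that consecutive global ranks share a scale-up domain and ignores pipeline parallelism for brevity; `build_groups` and `crosses_domain` are hypothetical helper names, not part of any framework.

```python
def build_groups(world_size: int, domain_size: int, tp_size: int):
    """Return (tp_groups, dp_groups) as lists of rank lists.

    Assumes ranks [0, domain_size) form scale-up domain 0, the next
    domain_size ranks form domain 1, and so on.
    """
    if world_size % tp_size or domain_size % tp_size:
        raise ValueError("tp_size must divide both world_size and domain_size")
    # Tensor-parallel groups use consecutive ranks, so each stays inside one domain.
    tp_groups = [list(range(s, s + tp_size)) for s in range(0, world_size, tp_size)]
    # Data-parallel groups link ranks that hold the same position in their TP group.
    dp_groups = [list(range(i, world_size, tp_size)) for i in range(tp_size)]
    return tp_groups, dp_groups


def crosses_domain(group, domain_size: int) -> bool:
    """True if the group spans more than one scale-up domain."""
    return len({rank // domain_size for rank in group}) > 1


tp_groups, dp_groups = build_groups(world_size=16, domain_size=8, tp_size=4)
print("TP groups:", tp_groups)
print("DP groups:", dp_groups)
print("any TP group leaves its domain:", any(crosses_domain(g, 8) for g in tp_groups))
```

With this numbering the chatty tensor-parallel traffic never leaves a domain, while the data-parallel groups are the ones that cross the scale-out network.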

The principles of NVLink and NVSwitch

Now let us examine NVLink and NVSwitch, the flagship of scale-up interconnect.

**NVLink** is NVIDIA's high-speed link that connects GPUs directly. Rather than routing through the CPU over PCIe, GPUs exchange data point to point. Each GPU has several NVLink links, each a bundle of lanes, that connect to other GPUs or to a switch. Across generations, both the per-link signaling speed and the number of links have grown, so the total bandwidth a single GPU can exchange with the outside has climbed steadily.

The approximate bidirectional NVLink bandwidth per GPU, by generation, is below. Exact figures vary by product and configuration, so treat these as a sense of the trend.

| NVLink generation | Representative GPU generation | Total bandwidth per GPU (bidirectional, approx.) |
| --- | --- | --- |
| 1st gen | Pascal | about 160 GB/s |
| 2nd gen | Volta | about 300 GB/s |
| 3rd gen | Ampere | about 600 GB/s |
| 4th gen | Hopper | about 900 GB/s |
| 5th gen | Blackwell | about 1.8 TB/s |

It is worth noting that bandwidth has grown by roughly 1.5x to 2x each generation. The 5th-generation NVLink in Blackwell reaches roughly 1.8 TB/s per GPU, one to two orders of magnitude more than a typical scale-out network port. This makes vivid why being inside the same domain versus outside it makes such a difference.
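To put a number on that gap, here is a quick back-of-the-envelope comparison. The 400 Gb/s port speed is an assumption chosen for illustration, and note that the NVLink figure is a bidirectional total while port speeds are usually quoted per direction, so the ratio is only a rough guide.

```python
nvlink_bytes_per_s = 1.8e12      # 5th-gen NVLink per GPU, bidirectional (from the table)
port_gbit_per_s = 400            # assumed scale-out port speed in Gb/s
port_bytes_per_s = port_gbit_per_s * 1e9 / 8   # convert bits/s to bytes/s

ratio = nvlink_bytes_per_s / port_bytes_per_s
print(f"one port: {port_bytes_per_s / 1e9:.0f} GB/s, NVLink is about {ratio:.0f}x larger")
```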

But connecting GPUs only two at a time has its limits. With even 8 GPUs, directly wiring every pair is inefficient, and with more it becomes nearly impossible. Enter **NVSwitch**. NVSwitch is a dedicated switch chip that exchanges NVLink traffic, letting every GPU in the domain communicate at uniform bandwidth as if directly connected to all the others. It implements a so-called "all-to-all" non-blocking topology inside the NVLink domain.

A simplified view of a node with NVSwitch looks like this.

    +--------+   +--------+   +--------+   +--------+
    | GPU 0  |   | GPU 1  |   | GPU 2  |   | GPU 3  |
    +---+----+   +---+----+   +---+----+   +---+----+
        |            |            |            |
    ====+============+====NVLink==+============+====
        |            |            |            |
    +---+------------+------------+------------+---+
    |               NVSwitch fabric                |
    +---+------------+------------+------------+---+
        |            |            |            |
    ====+============+====NVLink==+============+====
        |            |            |            |
    +---+----+   +---+----+   +---+----+   +---+----+
    | GPU 4  |   | GPU 5  |   | GPU 6  |   | GPU 7  |
    +--------+   +--------+   +--------+   +--------+

Thanks to this structure, any pair of GPUs in the node can communicate at the same full bandwidth, and collectives like all-reduce are not hamstrung by topology. This is why placing tensor parallelism inside the NVLink domain became the textbook approach.

NVLink domains and rack scale — GB200 NVL72

The true power of NVSwitch shows when it extends beyond a single node to an entire rack. NVIDIA's GB200 NVL72 sits at the apex of that.

GB200 NVL72 packs 72 Blackwell GPUs into a single rack and binds them into one NVLink domain on an NVSwitch fabric. That is, all 72 GPUs in the rack can communicate at NVLink bandwidth with one another. Traditionally an NVLink domain was confined to a single node (typically 8 GPUs), and everything beyond it was a slower scale-out network. NVL72 pushes that boundary out to the entire rack. In effect, 72 GPUs behave like one large GPU.

Why does this matter? When training or serving a giant model, tensor parallelism and expert parallelism (the expert parallelism of MoE) communicate so frequently that their efficiency collapses over a scale-out network. But when the NVLink domain grows to 72 GPUs, all these heavy communication patterns can be confined to the fast domain. In particular, when the all-to-all communication that routes tokens to multiple experts in an MoE model happens inside the NVLink domain, inference latency and throughput improve dramatically.

A simplified picture of the rack-scale system looks like this.

    GB200 NVL72 (single NVLink domain, 72 GPUs)
    +-----------------------------------------------+
    |  compute trays x N (each tray: Grace CPU +    |
    |                     Blackwell GPU)            |
    |         |        |        |        |          |
    |    +----+--------+--------+--------+----+     |
    |    |     NVSwitch tray (backplane)      |     |
    |    +----+--------+--------+--------+----+     |
    |         |        |        |        |          |
    |  ...every GPU connected at full NVLink BW...  |
    +-----------------------------------------------+
                            |   (scale-out: Ethernet / InfiniBand)
                            v
     other racks (cluster at thousands-of-GPU scale)

At the rack level, the NVLink domain is the boundary of scale-up, and stitching multiple racks beyond it is the realm of scale-out. The next generation, Vera Rubin, is announced for late 2026, and together with HBM4 memory it aims to lift performance-per-watt substantially. The interconnect, too, is evolving in step toward larger domains and higher bandwidth.

UALink and Ultra Ethernet — open alternatives

NVLink and NVSwitch are powerful, but they are NVIDIA proprietary technology. Camps that worry about being locked to a single vendor have been pushing open standards.

**UALink (Ultra Accelerator Link)** is the open alternative for scale-up interconnect. Several companies including AMD, Broadcom, and Google formed the UALink Consortium to build a common specification that can bind accelerators from multiple vendors into one scale-up domain. The goal is clear: to do what NVLink does inside NVIDIA's ecosystem, but in an open, multi-vendor way. As UALink matures, an ecosystem opens up in which accelerator makers and switch makers can compete as separate players.

**Ultra Ethernet (UEC, Ultra Ethernet Consortium)** is an open standard aimed at the scale-out domain. It is an attempt to improve conventional Ethernet for AI and HPC workloads, bringing the high-performance scale-out territory long held by InfiniBand into the standard Ethernet ecosystem. Better congestion control, more efficient multipath transport, and optimizations for collectives are central to it.

The two camps line up by tier as follows.

| Tier | NVIDIA proprietary | Open standard |
| --- | --- | --- |
| Scale-up (within node/rack) | NVLink, NVSwitch | UALink |
| Scale-out (across nodes) | NVIDIA Quantum InfiniBand, Spectrum-X | Ultra Ethernet (UEC) |
| Orientation | Vertical integration, optimized single stack | Multi-vendor, open competition |

One more intriguing approach is Google's TPU. Google connects TPUs with ICI (Inter-Chip Interconnect) and combines it with an optical circuit switch (OCS) to dynamically reconfigure the topology. By swapping the optical path itself instead of relying on electrical packet switches, it gains the flexibility to route around a failed node or compose a topology suited to the job. TPU v6 Trillium lifted peak compute per chip roughly 4.7x over TPU v5e, and the 7th generation, Ironwood, has positioned itself as an inference-focused generation.

Behind this trend is the spread of custom ASICs from cloud providers. The share of inference ASICs is projected to climb fast, from roughly 15 percent in 2024 to about 40 percent in 2026. NVIDIA holds roughly 75 to 80 percent of the accelerator market while AMD MI350X and others challenge it, but the competition around interconnect standards is far more layered than that single statistic.

Network topology — fat-tree, dragonfly, rail-optimized

Once you step into the scale-out domain, topology design becomes important. Topology is the blueprint for how nodes and switches are wired, and with the same count of GPUs and cables, collective performance can vary greatly depending on it.

**Fat-tree (Clos)** is the most widely used structure. It stacks several stages of switches in a hierarchy, thickening the links as you climb so that any two nodes are guaranteed sufficient bandwidth. Ideally it offers non-blocking bisection bandwidth, so all nodes can communicate at once without a bottleneck. The drawback is that as stages grow, switch and cable costs balloon.

**Dragonfly** builds groups, wiring densely within a group and connecting groups with a small number of long links. It helps cut cable and switch costs at scale, but because inter-group links are few, certain traffic patterns can cause congestion, so adaptive routing matters.

**Rail-optimized** topology is a layout specialized for AI training. It gathers the i-th NIC (rail) of each node onto a switch dedicated to that rail. Then a pattern like data-parallel all-reduce, where "GPUs of the same ordinal" communicate, completes in a single hop, cutting switch stages and maximizing bandwidth.

The three topologies compare as follows.

| Topology | Strength | Weakness | Best fit |
| --- | --- | --- | --- |
| Fat-tree (Clos) | Uniform full bandwidth, simple routing | Cost balloons at scale | General-purpose cluster |
| Dragonfly | Cable/switch cost efficiency | Inter-group congestion risk | Ultra-large HPC |
| Rail-optimized | Optimal for AI collectives, few hops | Specialized to certain patterns | Dedicated large-scale training |

Topology choice is not an abstract matter. With the same GPU count, real training throughput can differ by tens of percent depending on topology.
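Two of these layouts are easy to reason about with a little arithmetic. The sketch below uses the textbook sizing of a three-tier k-ary fat-tree built from k-port switches, and a rail mapping that assumes one NIC per GPU with ranks numbered node by node. Both are simplifications for intuition, not a description of any particular cluster.

```python
def fat_tree_size(k: int) -> dict:
    """Host and switch counts for a three-tier k-ary fat-tree of k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even number >= 2")
    return {
        "hosts": k ** 3 // 4,
        "edge_switches": k * k // 2,
        "aggregation_switches": k * k // 2,
        "core_switches": (k // 2) ** 2,
    }


def rail_of(global_rank: int, gpus_per_node: int) -> int:
    """Rail index of a rank, assuming one NIC per GPU and node-major rank numbering."""
    return global_rank % gpus_per_node


print("k=4 fat-tree:", fat_tree_size(4))
print("k=64 fat-tree hosts:", fat_tree_size(64)["hosts"])

# Ranks with the same local ordinal share a rail switch, so they are one hop apart.
same_rail = [r for r in range(32) if rail_of(r, gpus_per_node=8) == 3]
print("ranks on rail 3:", same_rail)
```

The fat-tree numbers show why cost balloons with scale, while the rail mapping shows why same-ordinal GPUs on different nodes can reach each other through a single switch.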

Collective communication — all-reduce, ring vs tree

The protagonist of distributed-training communication is collective communication. Among collectives, all-reduce matters most. All-reduce sums (or averages) the tensors held by every GPU element-wise, then distributes that result back to all GPUs. It is exactly the operation that combines gradients in data-parallel training.

There are several ways to implement all-reduce, the canonical ones being the ring and the tree approaches.

**Ring all-reduce** arranges the GPUs in a logical ring and circulates data, chunked, around the ring. It splits into two phases. First, in the reduce-scatter phase, each GPU gathers the sum of the chunk it is responsible for; then, in the all-gather phase, that summed result is spread to everyone. The advantage of the ring approach is that the volume each GPU has to send and receive is nearly independent of the GPU count.

The flow of ring all-reduce in pseudocode is below.

    # ring all-reduce concept (N GPUs, each holds an array of N chunks)

    # phase 1: reduce-scatter
    for step in range(N - 1):
        send_chunk = (rank - step) % N
        recv_chunk = (rank - step - 1) % N
        send(buffer[send_chunk], dst=(rank + 1) % N)
        incoming = recv(src=(rank - 1) % N)
        buffer[recv_chunk] += incoming   # accumulate partial sums
    # at this point each GPU holds the 'complete sum' of one distinct chunk

    # phase 2: all-gather
    for step in range(N - 1):
        send_chunk = (rank - step + 1) % N
        recv_chunk = (rank - step) % N
        send(buffer[send_chunk], dst=(rank + 1) % N)
        buffer[recv_chunk] = recv(src=(rank - 1) % N)
    # after this, all GPUs hold the same 'total sum'

The total volume each GPU sends and receives in ring all-reduce is roughly as follows.

    volume per GPU ≈ 2 x (N - 1) / N x (tensor size)
                   ≈ 2 x (tensor size)               (when N is large)

That is, no matter how large the GPU count N grows, the per-GPU volume converges to about twice the tensor size. It is highly efficient from a bandwidth standpoint, so it is widely used for large-scale data parallelism. The drawback is that the number of steps grows in proportion to N, so for small messages latency accumulates.
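The two phases and the volume formula can be checked with a small single-process simulation. This is a teaching sketch in plain Python, with lists standing in for GPU buffers and no real network. It uses the same chunk indexing as the pseudocode above and counts how many elements each rank sends.

```python
def ring_all_reduce(tensors):
    """Simulate ring all-reduce over equal-length lists, one list per rank.

    Returns (results, elements_sent_per_rank).
    """
    n = len(tensors)
    size = len(tensors[0])
    if size % n:
        raise ValueError("tensor length must be divisible by the number of ranks")
    chunk = size // n
    # bufs[r][c] is rank r's local copy of chunk c.
    bufs = [[list(t[c * chunk:(c + 1) * chunk]) for c in range(n)] for t in tensors]
    sent = [0] * n

    # Phase 1: reduce-scatter. Snapshot outgoing chunks so all ranks send "at once".
    for step in range(n - 1):
        outgoing = [list(bufs[r][(r - step) % n]) for r in range(n)]
        for r in range(n):
            src = (r - 1) % n
            c = (r - step - 1) % n
            bufs[r][c] = [a + b for a, b in zip(bufs[r][c], outgoing[src])]
            sent[src] += chunk

    # Phase 2: all-gather. Completed chunks travel around the ring unchanged.
    for step in range(n - 1):
        outgoing = [list(bufs[r][(r - step + 1) % n]) for r in range(n)]
        for r in range(n):
            src = (r - 1) % n
            c = (r - step) % n
            bufs[r][c] = outgoing[src]
            sent[src] += chunk

    results = [[x for c in b for x in c] for b in bufs]
    return results, sent


n_ranks, length = 4, 8
tensors = [[10 * r + i for i in range(length)] for r in range(n_ranks)]
results, sent = ring_all_reduce(tensors)
print("rank 0 result:", results[0])
print("elements sent per rank:", sent, "vs 2*(N-1)/N*size =", 2 * (n_ranks - 1) * length // n_ranks)
```

Every rank ends with the same element-wise sum, and the per-rank send count matches 2 x (N - 1) / N x (tensor size).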

**Tree all-reduce** performs the summation and distribution over a tree structure. Its step count grows in proportion to the logarithm of N, which is advantageous on latency when messages are small or the GPU count is high. On the other hand, bandwidth efficiency can be lower than the ring.

| Approach | Step count | Bandwidth efficiency | Best fit |
| --- | --- | --- | --- |
| Ring | Proportional to N | Very high | Large messages, bandwidth-bound |
| Tree | Proportional to log N | Moderate | Small messages, latency-bound |

In practice the library looks at message size and topology and picks between the two automatically. NVIDIA's NCCL, and other collective libraries, do this for you. Rather than memorizing it, what matters more is "configuring the environment so the library recognizes the topology accurately."
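The ring-versus-tree trade-off can be illustrated with a simple latency-bandwidth cost model. This is a rough sketch under stated assumptions: `alpha` is a fixed per-step latency, `bw` is the per-link bandwidth, and the tree forwards the whole message at each level without pipelining. Production libraries pipeline their trees, so real tree bandwidth is better than this model suggests. The numbers are illustrative, not measurements.

```python
import math


def ring_time(n: int, size_bytes: float, bw: float, alpha: float) -> float:
    """2*(N-1) steps, each moving one chunk of size/N bytes."""
    steps = 2 * (n - 1)
    return steps * alpha + steps * (size_bytes / n) / bw


def tree_time(n: int, size_bytes: float, bw: float, alpha: float) -> float:
    """Reduce up then broadcast down a binary tree, full message per level."""
    steps = 2 * math.ceil(math.log2(n))
    return steps * (alpha + size_bytes / bw)


n, bw, alpha = 64, 50e9, 5e-6   # assumed: 64 GPUs, 50 GB/s links, 5 us per step
for size in (4 * 1024, 1024 ** 3):
    r, t = ring_time(n, size, bw, alpha), tree_time(n, size, bw, alpha)
    winner = "ring" if r < t else "tree"
    print(f"{size:>12d} bytes: ring {r * 1e3:.3f} ms, tree {t * 1e3:.3f} ms -> {winner}")
```

Under these assumptions the tree wins for the small message because it takes far fewer steps, and the ring wins for the large one because each step moves only a fraction of the tensor.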

The pattern of using reduce-scatter and all-gather separately is also growing in importance. Memory-saving techniques of the ZeRO and FSDP family decompose all-reduce into reduce-scatter and all-gather, performing communication while sharding parameters and gradients. Once you understand the communication pattern, it becomes natural to see why a given parallelism strategy is fast on a given topology.
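The decomposition is easy to verify on toy data. In the sketch below, plain lists stand in for per-rank gradient tensors, and the three functions are simplified stand-ins for the real collectives. After `reduce_scatter`, each rank holds only its own shard of the summed gradient, which is the state sharded training schemes exploit before gathering again.

```python
def reduce_scatter(tensors):
    """Rank r ends up with the element-wise sum of shard r only."""
    n = len(tensors)
    if len(tensors[0]) % n:
        raise ValueError("tensor length must be divisible by the number of ranks")
    shard = len(tensors[0]) // n
    return [
        [sum(t[i] for t in tensors) for i in range(r * shard, (r + 1) * shard)]
        for r in range(n)
    ]


def all_gather(shards):
    """Every rank ends up with the concatenation of all shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]


def all_reduce(tensors):
    """Every rank ends up with the full element-wise sum."""
    total = [sum(col) for col in zip(*tensors)]
    return [list(total) for _ in tensors]


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]   # two ranks, four gradient elements each
shards = reduce_scatter(grads)
print("after reduce-scatter:", shards)
print("after all-gather:    ", all_gather(shards))
print("matches all-reduce:  ", all_gather(shards) == all_reduce(grads))
```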

Overlapping compute and communication

No matter how fast you make communication, it never drops to zero. So a second strategy appears: hiding communication behind computation — that is, the overlap of compute and communication.

The core idea is simple. While a GPU computes the gradients of the next layer, it sends the already-finished gradients of a previous layer in the background via all-reduce. By running communication asynchronously on a separate stream, you hide it behind the time computation is in progress. Overlap it well and the communication time feels essentially free.

A simplified view of overlap in backpropagation looks like this.

    # hiding gradient communication behind compute in backprop (concept)
    pending = []
    for layer in reversed(layers):
        grad = compute_gradient(layer)     # compute stream
        handle = async_all_reduce(grad)    # start async on comm stream
        pending.append((layer, handle))
        # the next layer's compute follows immediately, comm runs in background

    for layer, handle in pending:
        handle.wait()                      # wait for completion only when needed
        apply_update(layer)

For overlap to work well, several conditions must hold. The communication stream and the compute stream must not contend heavily for the same resources; gradients must be grouped into reasonable sizes (bucketing) so that very small communications are not frequent; and there must be enough computation between the moment communication starts and the moment its result is awaited. Frameworks provide such bucketing and asynchronous communication by default, but tuning is needed depending on model structure and parallelism settings.

Drawing the effect of overlap on a time axis looks like this.

    without overlap:
      [comp L1][comm L1][comp L2][comm L2][comp L3][comm L3]   <- comm adds up directly

    with overlap:
      [comp L1][comp L2][comp L3]
               [comm L1][comm L2][comm L3]   <- comm hides behind comp
      the total time ends up shorter

Here the interconnect appears again. If the in-domain bandwidth is large enough, communication shortens and hides more easily within the compute time. Conversely, if bandwidth is scarce, communication runs longer than compute, and no matter how asynchronously you run it, a communication tail is exposed. In the end, hardware bandwidth and software overlap are a complementary pair.
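The exposed communication tail can be made concrete with a tiny timeline model. This sketch assumes a single compute stream and a single communication stream, that a layer's communication can start only after that layer's compute finishes, and that the two streams do not slow each other down. The durations are invented for illustration.

```python
def step_timeline(compute_s, comm_s, overlap=True):
    """Return (total_time, exposed_comm) for per-layer compute and comm durations.

    Layers are listed in the order backprop visits them.
    """
    if not overlap:
        return sum(compute_s) + sum(comm_s), sum(comm_s)
    compute_end = 0.0
    comm_end = 0.0
    for comp, comm in zip(compute_s, comm_s):
        compute_end += comp
        # Comm for this layer waits for its gradient and for the previous comm.
        comm_end = max(comm_end, compute_end) + comm
    total = max(compute_end, comm_end)
    return total, total - sum(compute_s)


compute = [1.0, 1.0, 1.0]
for label, comm in (("fast link", [0.5, 0.5, 0.5]), ("slow link", [2.0, 2.0, 2.0])):
    total, exposed = step_timeline(compute, comm)
    print(f"{label}: total {total:.1f}, exposed comm {exposed:.1f}")
print("no overlap, fast link:", step_timeline(compute, [0.5, 0.5, 0.5], overlap=False))
```

With the fast link only the last layer's communication sticks out, while with the slow link most of the communication is exposed even though everything runs asynchronously.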

The network determines training efficiency

If we sum up the story so far in one sentence, it is that **the network is what decides scaling efficiency**. Let us think of this through a metric called scaling efficiency.

    scaling efficiency = (throughput on N GPUs) / (1-GPU throughput x N)
    ideal   = 1.0 (linear scaling)
    reality = less than 1.0 due to comm overhead, load imbalance, overlap limits

When you double the GPUs but throughput does not double, it is almost always because of communication. The larger the share communication takes, the smaller the gain you get from adding the same GPUs. So a cluster with a weak interconnect, no matter how many GPUs you amass, sees efficiency crumble beyond a certain scale.
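The metric takes only a few lines to compute from measured throughput. In the sketch below the throughput figures are made up purely for illustration, and the 0.7 threshold in `first_breakdown` is an arbitrary choice, not a recommended value.

```python
def scaling_efficiency(throughput_n: float, n_gpus: int, throughput_1: float) -> float:
    """Measured throughput divided by ideal linear throughput."""
    return throughput_n / (throughput_1 * n_gpus)


def first_breakdown(measurements: dict, threshold: float = 0.7):
    """Smallest GPU count whose efficiency falls below `threshold`, or None."""
    base = measurements[1]
    for n in sorted(measurements):
        if scaling_efficiency(measurements[n], n, base) < threshold:
            return n
    return None


# Made-up throughput numbers (samples/s), for illustration only.
measurements = {1: 100.0, 8: 760.0, 64: 5200.0, 512: 30000.0}
for n in sorted(measurements):
    eff = scaling_efficiency(measurements[n], n, measurements[1])
    print(f"{n:4d} GPUs: efficiency {eff:.2f}")
print("efficiency drops below 0.7 at", first_breakdown(measurements), "GPUs")
```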

This makes the meaning of the strategy of growing the NVLink domain clear. By confining the most communication-frequent tensor and expert parallelism inside the fast NVLink domain, you leave only the less communication-frequent data and pipeline parallelism for the scale-out network. The reason GB200 NVL72 put 72 GPUs in one domain is precisely a design to raise scaling efficiency by confining heavy communication inside a larger, faster region.

A rough intuition, in table form, looks like this.

| Communication share | Scaling efficiency (approx.) | Meaning |
| --- | --- | --- |
| Very low | 0.9 and above | Nearly linear, ideal |
| Moderate | 0.7 to 0.85 | Common well-tuned training |
| High | around 0.5 | Severe comm bottleneck, needs work |

This table is not a precise measurement but a way to gain a feel. The point is that every effort to reduce the communication share — a larger domain, a better topology, smarter collectives, tighter overlap — translates directly into scaling efficiency.

The future — larger domains, optics, openness

The future of interconnect converges on a few directions.

First, **domain expansion**. The NVLink domain, which grew from 8 GPUs in one node to 72 GPUs in a rack, is likely to grow further. The larger the domain, the more heavy communication can be confined inside the fast region. The Vera Rubin generation foreshadows the next chapter of this trend.

Second, **optical interconnect**. Driving electrical signals over long cables runs into limits of power and reliability. Optics connect over longer distances with less power, and even grant the flexibility to dynamically reconfigure topology, as with Google's OCS. As co-packaged optics move close to switches and accelerators, a new chapter of bandwidth and power efficiency opens.

Third, **the maturing of open standards**. As UALink and Ultra Ethernet take hold, an ecosystem forms in which accelerators and switches are separated and compete. For cloud providers who want to reduce single-vendor lock-in, this is strategically important. Coupled with the expanding share of inference ASICs, the competition over interconnect standards will only intensify.

Fourth, **the shift of weight toward inference**. As inference capex overtakes training capex in 2026, the priorities of interconnect design are changing too. Training cares about throughput, inference about latency, so low-latency communication of small messages and the all-to-all routing of MoE become ever more important design goals.

Implications for developers

Even if you are not a hardware engineer, the practical implications a developer who trains and serves models should take from interconnect are clear.

First, **match your parallelism strategy to the topology.** Place the most communication-frequent tensor and expert parallelism inside the NVLink domain, and lay out data and pipeline parallelism across domains. Reverse this layout and you idle the fast domain while overworking the slow network.

Second, **let the communication library recognize the topology properly.** Libraries like NCCL pick algorithms by looking at topology. If environment variables and topology information are wrong, they take a suboptimal path. Make a habit of measuring real collective bandwidth through profiling and treating any gap against spec-sheet figures with suspicion.

Third, **measure the overlap.** Just because the framework turned overlap on does not mean it actually overlaps well. Confirm with a timeline profiler that communication truly hides behind computation, and tune bucket size and async settings.

Fourth, **make scaling efficiency your metric.** Measure whether throughput rises proportionally each time you add GPUs, and find the point where efficiency breaks down. That point is exactly where the interconnect becomes the bottleneck, a signal that you need to re-lay parallelism or move to a larger domain.

Fifth, **consider algorithms that reduce the communication volume itself.** Gradient compression, larger microbatches, and training techniques that reduce communication frequency all relieve the communication bottleneck. There is room here to lift efficiency without changing hardware.

Working through an all-reduce communication time

Putting actual numbers to the communication time we have treated abstractly makes it far more concrete. Let us estimate the time the gradient all-reduce of a single step takes in data-parallel training.

As we saw, the total volume each GPU sends and receives in ring all-reduce is about twice the tensor size. If we call the number of model parameters P and the gradient bytes per parameter B, then the tensor to all-reduce is P times B bytes. So the per-GPU volume is close to twice that.

Plugging in effective bandwidth and latency lets us estimate the communication time of one step. The process of computing this with concrete numbers, written as code, is below.

    # estimating all-reduce communication time (conceptual calculation)
    params = 70e9                # number of parameters (e.g., a 70B model)
    bytes_per_param = 2          # a bf16 gradient is 2 bytes per parameter
    tensor_bytes = params * bytes_per_param   # size of the all-reduce target

    # in ring all-reduce, per-GPU volume ~ 2x
    per_gpu_bytes = 2 * tensor_bytes

    link_bw = 1.8e12             # effective bandwidth 1.8 TB/s (assume Blackwell NVLink)
    scaleout_bw = 50e9           # assume scale-out effective bandwidth of 50 GB/s

    t_nvlink = per_gpu_bytes / link_bw          # inside the NVLink domain
    t_scaleout = per_gpu_bytes / scaleout_bw    # over the scale-out network

    print("all-reduce target:", tensor_bytes / 1e9, "GB")
    print("per-GPU volume:", per_gpu_bytes / 1e9, "GB")
    print("NVLink comm time:", t_nvlink * 1000, "ms")
    print("scale-out comm time:", t_scaleout * 1000, "ms")

A 70B model's bf16 gradient is about 140 GB, and the per-GPU volume is about 280 GB. Inside an NVLink domain with 1.8 TB/s of effective bandwidth, this takes about 0.16 seconds; over a scale-out network with 50 GB/s of effective bandwidth, it takes about 5.6 seconds. The same communication differs by a factor of about 36 depending on whether it is inside the domain or outside it. Treating the 1.8 TB/s bidirectional peak as effective bandwidth is optimistic, so the real NVLink time will be somewhat longer, but the order-of-magnitude gap remains. This one simple calculation lays bare why everyone wants to grow the NVLink domain.

Of course, in practice gradients are split into buckets and overlapped with compute, so the entire communication time above is not added to step time directly. But if the communication volume itself does not shrink, there is a clear limit to what overlap can hide.

NVLink generations and NVSwitch domain size

If we extend the per-generation bandwidth table seen earlier to also include the domain size that NVSwitch binds, the trend becomes even clearer. The figures below vary by product and configuration, so treat them as a sense of the trend.

| NVLink generation | Representative era | Bandwidth per GPU (bidirectional) | NVSwitch domain GPU count (approx.) |
| --- | --- | --- | --- |
| 1st gen (Pascal) | around 2016 | about 160 GB/s | no switch, point-to-point |
| 2nd gen (Volta) | around 2017 | about 300 GB/s | 8 per node without a switch, 16 with NVSwitch (DGX-2) |
| 3rd gen (Ampere) | around 2020 | about 600 GB/s | 8 GPUs per node |
| 4th gen (Hopper) | around 2022 | about 900 GB/s | 8 per node, up to 256 with extensions |
| 5th gen (Blackwell) | around 2024 | about 1.8 TB/s | 72 GPUs per rack (NVL72) |

What is worth noting is that not only did bandwidth grow, but the number of GPUs in one domain grew alongside it. The NVLink domain, once confined to an 8-GPU node, expanding to an entire rack of 72 GPUs in the Blackwell generation is a more fundamental change than a mere bandwidth increase. The size of the fast region that can confine heavy communication grew wholesale.

Google TPU's optical circuit switch (OCS)

While NVIDIA grows the domain with electrical NVSwitch, Google took an entirely different road: the optical circuit switch (OCS).

A typical datacenter switch is an electrical packet switch. It converts an incoming optical signal to an electrical one, reads each packet's destination, and sends it out the appropriate port, converting back to optical on the way out. Because these conversions and a routing decision happen per packet, the power and latency burden grows as bandwidth grows. An OCS, by contrast, does not look inside packets. It merely connects an input fiber to an output fiber physically with tiny mirrors (MEMS). Once a path is set, light passes straight through with no conversion, so power stays nearly constant regardless of bandwidth and latency is very low.

In exchange, an OCS cannot switch paths per packet. To change a path it must realign the mirrors, which takes on the order of milliseconds. So an OCS is used not to switch traffic moment to moment but to reconfigure the topology itself to suit a job. When a job arrives in a TPU pod, the OCS connects the needed TPUs into the desired shape (for instance, a 3D torus), and when the job finishes it unties them and reassigns them to another job.

Drawing the difference between electrical switching and optical circuit switching looks like this.

    electrical packet switch:
      optical -> [to electrical] -> [packet routing] -> [to optical] -> optical
      (conversion and decision per packet, power rises with bandwidth)

    optical circuit switch (OCS):
      optical ============ [path linked by MEMS mirrors] ============ optical
      (light passes straight, no conversion; reconfigures topology)

      job A: connect TPUs into a 3D torus
      job B: connect a different set of TPUs into a different shape
      failed node: realign mirrors to route around it

Thanks to this flexibility, Google can remove a failed TPU from the topology and swap in a spare, or carve a pod to fit a job's scale, almost as if in software. Compared to the fixed topology of electrical-switch-centric designs, this is a clear strength in operational flexibility and power efficiency.
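To make the 3D torus shape mentioned above concrete, here is a small sketch that lists the neighbors of a chip at a given coordinate. It shows only the logical topology. Which of these links would physically pass through an OCS is not something this article specifies, and `torus_neighbors` is a hypothetical helper.

```python
def torus_neighbors(coord, shape):
    """Six neighbors of `coord` in a 3D torus with the given (X, Y, Z) shape.

    The modulo provides the wraparound links that distinguish a torus from a mesh.
    If a dimension has size 2, the +1 and -1 neighbors on that axis coincide.
    """
    neighbors = []
    for axis in range(3):
        for delta in (-1, 1):
            nxt = list(coord)
            nxt[axis] = (nxt[axis] + delta) % shape[axis]
            neighbors.append(tuple(nxt))
    return neighbors


print(torus_neighbors((0, 0, 0), (4, 4, 4)))
print(torus_neighbors((3, 1, 2), (4, 4, 4)))
```

Carving a pod for a job then amounts to picking a `shape` and wiring each chip to its neighbors, and a different job can get a different shape from the same physical chips.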

Compute-comm overlap, deeper — FSDP prefetch

We saw overlap in backpropagation earlier, but in modern memory-saving training, overlap matters in the forward pass too. FSDP (Fully Sharded Data Parallel) shards parameters across GPUs, then all-gathers a layer's full parameters just before computing that layer. Once the computation finishes, it discards them again to save memory.

The key here is to "prefetch" the next layer's parameters. While computing the current layer, if you start the all-gather of the parameters needed for the next layer in the background ahead of time, the communication is already done when the next layer's computation begins. Written as code, the concept is below.

    # parameter prefetch overlap in FSDP forward pass (concept)
    def forward(layers, x):
        handle = async_all_gather(layers[0].shard)    # gather first layer params
        for i, layer in enumerate(layers):
            handle.wait()                             # current layer params ready
            params = layer.full_params
            if i + 1 < len(layers):
                # prefetch next layer params in the background
                handle = async_all_gather(layers[i + 1].shard)
            x = layer.compute(x, params)              # compute current layer (overlaps comm)
            layer.free_full_params()                  # reclaim memory
        return x

This way the communication (all-gather) hides behind the computation (layer.compute), so its latency is absorbed into compute time. Prefetching too far ahead, however, holds that many more layers of full parameters in memory, so typically only one or two layers are fetched ahead. The same rule applies here as before: the domain bandwidth must be high enough that each all-gather finishes within the compute time it is meant to hide behind.
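The effect of one-layer-ahead prefetch can be estimated with a small timing model. The sketch below is an illustration under simplifying assumptions, not a measurement: the function name `forward_time_ms` and the millisecond figures are hypothetical, and it assumes compute and communication overlap perfectly with no contention.

```python
def forward_time_ms(compute_ms, gather_ms, prefetch):
    """Estimate forward-pass time from per-layer compute and all-gather costs."""
    if len(compute_ms) != len(gather_ms):
        raise ValueError("need one all-gather cost per layer")
    if not compute_ms:
        return 0.0
    if not prefetch:
        # Serial case: every all-gather is fully exposed before its layer runs.
        return float(sum(compute_ms) + sum(gather_ms))
    # The first all-gather has no earlier compute to hide behind.
    total = float(gather_ms[0])
    for i, compute in enumerate(compute_ms):
        # While layer i computes, the all-gather for layer i + 1 runs in parallel.
        next_gather = gather_ms[i + 1] if i + 1 < len(gather_ms) else 0.0
        # The next layer can start only when both have finished.
        total += max(compute, next_gather)
    return total


if __name__ == "__main__":
    print(forward_time_ms([4, 4, 4], [2, 2, 2], prefetch=False))
    print(forward_time_ms([4, 4, 4], [2, 2, 2], prefetch=True))
    print(forward_time_ms([4, 4, 4], [6, 6, 6], prefetch=True))
```

With 4 ms of compute and 2 ms of all-gather per layer, the model gives 18 ms without prefetch and 14 ms with it, because only the first all-gather stays exposed. If the all-gather takes 6 ms, longer than the compute, the prefetched pass takes 22 ms: part of every gather is now exposed, which is the case where bandwidth is too low for the overlap to work.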

A practical checklist — NCCL tuning and topology awareness

So that the developer implications above translate straight into action, here is a checklist of items to verify when setting up distributed training.

- **Confirm topology awareness**: verify that the communication library accurately recognizes the in-node NVLink/NVSwitch structure and the inter-node network. If the topology file or auto-detection is off, you take a suboptimal path.

- **Align NIC and GPU affinity**: set affinity so each GPU uses a physically nearby NIC. If misaligned, communication needlessly crosses CPU sockets.

- **Measure collective bandwidth for real**: before running your training code, measure all-reduce bandwidth with a microbenchmark and check how much you get relative to catalog figures. If effective bandwidth is lower than expected, it is likely a configuration problem.

- **Check algorithm selection**: confirm that the appropriate algorithm, ring or tree, is chosen for the message size, and steer it with environment variables if needed.

- **Tune bucket size**: if gradient buckets are too small, communication overhead rises; too large, and overlap opportunity shrinks. Find the right size for your model.

- **Verify the overlap timeline**: use a profiler to confirm with your own eyes that communication truly hides behind computation. If a communication tail is exposed, revisit bucket and async settings.

- **Check parallelism dimension placement**: verify that tensor/expert parallelism falls inside the NVLink domain boundary and that data/pipeline parallelism sits on scale-out.

- **Track scaling efficiency**: record whether throughput rises proportionally each time you add GPUs, and treat the point where efficiency bends as a bottleneck signal.
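Two of the items above, measuring collective bandwidth and tracking scaling efficiency, come down to simple arithmetic. The sketch below shows one way to compute them. The bus bandwidth formula follows the convention used by the nccl-tests benchmarks for all-reduce; the function names and all input figures are hypothetical.

```python
def allreduce_bus_bandwidth(message_bytes, seconds, num_ranks):
    """Convert a measured all-reduce time into bus bandwidth (bytes/s).

    Algorithm bandwidth is size / time. Scaling it by 2 * (n - 1) / n gives a
    figure that can be compared with link bandwidth regardless of rank count.
    """
    algorithm_bw = message_bytes / seconds
    return algorithm_bw * 2 * (num_ranks - 1) / num_ranks


def scaling_efficiency(throughput, num_gpus, base_throughput, base_gpus=1):
    """Ratio of measured throughput to ideal linear scaling from a baseline."""
    ideal = base_throughput * num_gpus / base_gpus
    return throughput / ideal


if __name__ == "__main__":
    # Hypothetical measurement: 1 GB all-reduce across 8 GPUs in 10 ms.
    bus_bw = allreduce_bus_bandwidth(1e9, 0.010, 8)
    print(f"bus bandwidth: {bus_bw / 1e9:.1f} GB/s")
    # Hypothetical throughput: 1000 samples/s on 1 GPU, 7200 samples/s on 8.
    print(f"scaling efficiency: {scaling_efficiency(7200, 8, 1000):.2f}")
```

Logging both numbers each time the cluster or job configuration changes makes it easier to see whether a slowdown comes from the interconnect or from somewhere else.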

This checklist is a set of general principles, not tied to any one library. What matters most is the habit of doubting, and then measuring, whether the software is actually extracting the full bandwidth the hardware offers.
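As an illustration of why algorithm selection depends on message size, here is a rough latency-bandwidth cost model for ring and tree all-reduce. It is a textbook-style simplification, not how any particular library decides. The per-hop latency, the link bandwidth, and the assumption that the tree is fully pipelined are all illustrative, and the function names are hypothetical. In NCCL, the choice can be overridden with the `NCCL_ALGO` environment variable.

```python
import math


def ring_allreduce_s(size_bytes, n, latency_s, bandwidth_Bps):
    """Ring: 2 * (n - 1) latency-bound steps, each moving size / n bytes."""
    return 2 * (n - 1) * latency_s + 2 * (n - 1) / n * size_bytes / bandwidth_Bps


def tree_allreduce_s(size_bytes, n, latency_s, bandwidth_Bps):
    """Tree (assumed pipelined): reduce up and broadcast down log2(n) levels."""
    depth = math.ceil(math.log2(n))
    return 2 * depth * latency_s + 2 * size_bytes / bandwidth_Bps


def faster_algorithm(size_bytes, n, latency_s, bandwidth_Bps):
    """Return which of the two modeled algorithms is faster for this message."""
    ring = ring_allreduce_s(size_bytes, n, latency_s, bandwidth_Bps)
    tree = tree_allreduce_s(size_bytes, n, latency_s, bandwidth_Bps)
    return "ring" if ring <= tree else "tree"


if __name__ == "__main__":
    # Illustrative values: 64 ranks, 5 microseconds per hop, 50 GB/s per link.
    for size in (1024, 1 << 20, 1 << 30):
        print(size, faster_algorithm(size, 64, 5e-6, 50e9))
```

Under these assumptions, small messages favor the tree because its latency term grows with log2(n) rather than n, while very large messages favor the ring because it moves slightly less data per link.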

Closing

As the competition in AI infrastructure shifts from the chip to the system, the interconnect is no longer background infrastructure but a core variable that decides performance. The scale-up domains built by NVLink and NVSwitch, rack-scale systems like GB200 NVL72, open alternatives such as UALink and Ultra Ethernet, and even Google TPU's optical circuit switch — all of them give different answers to the same question: "How do we bind more accelerators together, losslessly, as one?"

The lesson all of this offers a developer converges on one thing: do not look only at compute, look at communication. Once you understand communication, you begin to see why some training does not speed up even as you add GPUs, why some inference grows long in latency, and why the same model yields entirely different efficiency on a different cluster. And that understanding is the starting point for building faster, cheaper AI systems.

