Systolic Arrays and Dataflow Architecture — The Heart of the TPU

Introduction
Why Matrix Multiplication Is the Core Operation in AI
What Is a Systolic Array?
Data Flows and Gets Multiplied — Tracing the Operation
The TPU's Systolic Design
Weight-Stationary vs Output-Stationary
Data Reuse and Energy
Dataflow Architecture in General
Comparison with GPU Tensor Cores
Utilization and Tiling
Compiler Mapping
Pros and Cons
Takeaways for Developers
Conclusion
References

Introduction

Training and serving today's enormous language models ultimately reduces to a staggering amount of matrix multiplication. The attention blocks of a Transformer, its MLP layers, and convolutions are all, underneath, huge piles of multiply-accumulate (MAC) operations.

Yet the MAC operation itself is not particularly expensive. What is expensive is moving data from memory to the compute units. The energy to read a number out of DRAM often dwarfs the energy of the multiplication that consumes it, sometimes by hundreds of times. This is the so-called memory wall.

The systolic array is a structure designed to attack this problem head-on. Once a piece of data is read in, it flows across a grid of compute units inside the chip and is reused many times along the way. Google's TPU (Tensor Processing Unit) is the most famous implementation of this idea, and NVIDIA's tensor cores share much of the same spirit.

In this article we will build up, step by step, what a systolic array is and how it works, what dataflow strategies like weight-stationary and output-stationary decide, and how a compiler maps a large matrix onto a small grid.

Why Matrix Multiplication Is the Core Operation in AI

Let us set a baseline. Consider multiplying an M×K matrix A by a K×N matrix B to produce an M×N matrix C.

C[m][n] = sum over k of  A[m][k] * B[k][n]

The number of multiply-accumulate operations required for this single line is M×N×K. Multiplying two 1024×1024 matrices, for instance, triggers close to a billion (10^9) MACs. In large models, matrix multiplications like this repeat thousands of times per second.

The key insight is this: the same data is used many times.

A single row A[m][:] is reused to compute an entire row of C (N elements).
A single column B[:][n] is reused to compute an entire column of C (M elements).

If you naively re-read from memory every time, the data movement explodes. So you must hold data on-chip once you have read it and reuse it as much as possible. The systolic array bakes this reuse directly into the hardware structure.

What Is a Systolic Array?

The name comes from a concept proposed in 1978 by H. T. Kung and Charles Leiserson. "Systolic" evokes the heart pumping blood in rhythmic beats (systole). Data marches one cell per clock, regularly, like a heartbeat, across the grid.

The structure is simple. Tiny processing elements (PEs) are packed into a two-dimensional grid. Each PE does only this:

Receive a value coming in from above (or the left).
Multiply it by the value it holds.
Add to its accumulator.
Pass the received value on, unchanged, to the neighbor below (or to the right).

A PE never touches external memory directly. It talks only to its neighbors. Data enters at the edge of the grid and advances one cell at a time, passing through every internal PE. While one piece of data crosses the whole grid, it is reused in countless MACs.

        B weights flow top to bottom
        b00   b01   b02
         |     |     |
   a0 → [PE]→[PE]→[PE]→   (A inputs flow left to right)
         |     |     |
   a1 → [PE]→[PE]→[PE]→
         |     |     |
   a2 → [PE]→[PE]→[PE]→
         |     |     |
        partial sums accumulate downward and exit

The inside of each PE, in pseudocode, looks like this:

What a PE does every clock cycle:
  in_left    = activation received from the left neighbor
  in_top     = partial sum received from the top neighbor
  product    = in_left * weight_held_here
  out_bottom = in_top + product
  out_right  = in_left          (forward the activation to the right)

The decisive point is that weight_held_here stays inside the PE. Load a weight once, and the same weight is reused against every activation that flows in. This is the weight-stationary strategy we cover in the next section.

Data Flows and Gets Multiplied — Tracing the Operation

Let us trace the concept by hand. We will multiply two 2×2 matrices.

A = | a00  a01 |      B = | b00  b01 |
    | a10  a11 |          | b10  b11 |

Desired result:
C[0][0] = a00*b00 + a01*b10
C[0][1] = a00*b01 + a01*b11
C[1][0] = a10*b00 + a11*b10
C[1][1] = a10*b01 + a11*b11

We use a 2×2 grid of PEs and preload B's weights into each PE.

Weights loaded per PE position:
  [PE00=b00] [PE01=b01]
  [PE10=b10] [PE11=b11]

Now we stream A's rows in from the left edge. The important detail is that the data is fed skewed: each row is delayed by one clock so the accumulation timing lines up.

time →
Input stream (left edge):
  row 0 (into PE00, PE01):  a00, a01
  row 1 (into PE10, PE11):  a10, a11  (enters one cycle later)

Tracing PE00 cycle by cycle:

cycle 1: a00 arrives → accum = a00*b00
cycle 2: a01 arrives → accum = a00*b00 + a01*b10  ← C[0][0] complete

As the partial sums flowing above PE00 combine downward and exit one edge, each element of C we wanted is completed. The crux: values like a00 and a01 were read from memory exactly once, and weights like b00 and b10 were never re-read at all — they resided inside the PEs.

Scale this tiny example to a 256×256 grid and you perform 256×256 = 65,536 MACs simultaneously every single clock. That is precisely why a TPU achieves overwhelming throughput on matrix multiplication.

The TPU's Systolic Design

The heart of the first-generation TPU, which Google described in a 2016 paper, was a 256×256 systolic Matrix Multiply Unit (MXU). 65,536 eight-bit integer MAC units formed a grid that processed enormous matrix tiles with a single instruction.

The key features of the TPU design:

One giant MXU: Rather than many small cores, compute is concentrated in one large grid. Control overhead is tiny, so most of the chip area can go to actual arithmetic units.
On-chip data reuse: Weights and activations flow and are reused inside the grid, dramatically cutting the number of DRAM accesses.
Deterministic execution: With almost no nondeterministic factors like cache misses or branch prediction, performance can be predicted precisely and scheduled at compile time.

The TPU has evolved across generations. As of 2026, Google has pushed the 6th-generation Trillium (TPU v6) to roughly 4.7x the peak performance of the prior generation and runs it alongside the inference-specialized 7th-generation Ironwood. Yet the fundamental principle — flow data across a grid-shaped systolic compute unit and reuse it — has not changed since the first generation.

Weight-Stationary vs Output-Stationary

The most important decision when designing a systolic array is what to keep stationary inside the PE. Which of the three operands (weights, activations, partial sums) you pin down gives the dataflow strategy its name.

Weight-Stationary

Load the weights into the PE, stream the activations through, and push the partial sums out of the grid.

Stays in the PE:   weights (W)
Flows in:          activations (A)
Flows out:         partial sums (psum)

Strength: the same weights can be reused across many input batches, so weight reuse efficiency is highest when the batch is large.
Good for: workloads like inference where weights are fixed and only the inputs keep changing.

Output-Stationary

Each PE holds the accumulator for a single output element all the way through. Both weights and activations flow in.

Stays in the PE:   partial sum (psum) — the output element
Flows in:          weights (W) and activations (A)
Flows out:         the finished output (once, at the end)

Strength: partial sums are never shuffled around, so there is no accumulation precision loss and psum movement energy approaches zero.
Good for: when the accumulation depth (K) is very long and the cost of moving partial sums would be a burden.

Input-Stationary

Pin the activations into the PE and flow the weights. Useful for convolution patterns where the same input is reused across many output channels.

A side-by-side comparison:

Strategy	Stationary data	Flowing data	Strength	Weak case
Weight-stationary	Weights	Activations, partial sums	Maximizes weight reuse, inference-friendly	Partial-sum movement energy
Output-stationary	Partial sums	Weights, activations	Accumulation precision, minimal psum motion	Weights must be re-streamed each time
Input-stationary	Activations	Weights, partial sums	Good for convolutions with large input reuse	Burden when many output channels

Real accelerators choose, or mix, these strategies depending on the workload. No strategy is best in every case; the shape of the matrices (the ratio of M, N, K) and the batch size drive the optimal choice.

Data Reuse and Energy

Why does the dataflow strategy matter so much? The answer lies in energy. A consistent trend across many measurements is that one data movement is far more expensive than one computation.

Rough energy cost comparison (relative, directional):
  register / inside-PE access   : 1x
  on-chip SRAM access           : a few to tens of x
  off-chip DRAM access          : hundreds to thousands of x

Exact numbers vary by process and design, but the direction is consistent: a round trip to DRAM is overwhelmingly costly. So the goal of accelerator design is not simply to pack in many multipliers, but to perform as many MACs as possible per piece of data read.

The concept that quantifies this efficiency is arithmetic intensity.

arithmetic intensity = operations performed / bytes moved  (unit: FLOP/byte)

The higher the arithmetic intensity, the less you are bound by memory bandwidth and the higher your compute utilization. Maximizing this metric is exactly what a systolic array aims for. Read one weight, load it into the grid, then multiply it by hundreds of activations, and that weight's arithmetic intensity soars to hundreds of FLOP/byte.

Dataflow Architecture in General

The systolic array is one instance of a broader idea: dataflow architecture. The traditional von Neumann structure is control-driven: instructions pull in data. Dataflow, by contrast, is data-driven: a computation fires once its data is ready.

von Neumann:  PC points at an instruction → fetch data → execute → store result
dataflow   :  operands arrive → fire immediately → forward result to the next node

The appeal of dataflow:

Explicit program counters and complex control flow are reduced.
Data dependencies naturally expose parallelism.
Because data flows directly between compute units, you waste less storing intermediate results to memory and reading them back.

Modern AI accelerators share this dataflow philosophy to varying degrees. The core idea is to lay a computation graph out spatially on the chip and stream tensors over it.

Comparison with GPU Tensor Cores

NVIDIA GPUs have traditionally been thousands of small cores running in a SIMT fashion. But starting with the Volta generation in 2017, a dedicated matrix-multiply unit called the tensor core was added. A tensor core is hardware that multiply-accumulates a small matrix tile (for example 4×4 or 16×16) in a single instruction.

Comparing systolic arrays and tensor cores:

Aspect	TPU systolic array	GPU tensor core
Grid scale	One giant grid (e.g. 256×256)	Many small tile units spread across cores
Control style	Deterministic schedule at compile time	Runtime thread/warp scheduling
Flexibility	Highly specialized for matrix multiply	Coexists with general GPU compute, flexible
Data reuse	Hardware enforces it via grid flow	Relies on register/shared-memory usage

Roughly summarized: the TPU designs the whole chip as one block for the single job of matrix multiplication, while the GPU keeps its general-purpose nature and slots in matrix-multiply acceleration units. As of 2026, the NVIDIA Blackwell generation and the next-generation Vera Rubin carry a second-generation Transformer Engine and larger tensor cores, showing major strides in low-precision compute and memory bandwidth.

Rather than one being superior, the trade-offs differ by workload and operating environment. In data center settings for large-scale training and inference, both approaches survive and compete.

Utilization and Tiling

Just because a systolic array can theoretically do 65,536 MACs per clock does not mean it actually works that hard all the time. Real utilization is often lower, because the shape of the matrix does not divide evenly into the grid.

Stream a 9×9 matrix into a 256×256 grid, for instance, and only a sliver of the grid does work while the rest sits idle. There is also a pipeline fill/drain region as data flows in and out of the grid, during which some PEs are idle.

The technique that addresses this is tiling: split the large matrix into small tiles sized to the grid and stream them in turn.

A large matrix C (1024 x 1024)
split into 256 x 256 tiles matching the grid:

  +------+------+------+------+
  | T00  | T01  | T02  | T03  |
  +------+------+------+------+
  | T10  | T11  | T12  | T13  |
  +------+------+------+------+
  |  ... 4 x 4 = 16 tiles ... |
  +------+------+------+------+

Each tile fills the grid, so utilization is high.

Aligning tile sizes to the grid and streaming enough tiles back to back to keep the pipeline deeply filled are the keys to raising utilization. Utilization is best when the matrix dimensions are multiples of the grid size.

Compiler Mapping

Developers do not feed data into each PE by hand. The compiler does. For the TPU it is XLA; for the GPU it is tools like cuBLAS/cuDNN or Triton that map high-level tensor operations onto the hardware grid.

The compiler's core work:

1. Find matrix-multiply nodes in the computation graph.
2. Split the matrix dimensions (M, N, K) into tiles sized to the grid.
3. Choose a dataflow strategy (weight-stationary, etc.).
4. Schedule the order in which tiles stream in.
5. Build a pipeline so data loads overlap with compute.

The quality of this process determines final performance. On the same hardware, utilization can vary widely depending on how well the compiler picks tile sizes and flows. That is why accelerator companies invest as heavily in the compiler stack as in the hardware.

The practical takeaway for developers: keep matrix dimensions hardware-friendly (multiples of 8, multiples of 128) and the compiler finds better mappings more easily.

Pros and Cons

Let us summarize the trade-offs of the systolic array approach.

Pros:

Very high throughput and energy efficiency for matrix multiplication.
The hardware structure forcibly reduces data movement, easing the memory wall.
Simple control lets most of the chip area go to actual arithmetic units.
Deterministic, so performance is easy to predict and schedule.

Cons:

Poorly suited to irregular operations that are not matrix multiplication.
Utilization drops when the matrix shape does not match the grid.
Pipeline fill/drain overhead stands out on small matrices.
Fixed grid size means less flexibility than a GPU.

Because of these trade-offs, it is accurate to view the systolic array not as a universal solution but as a tool specialized for deep learning workloads where matrix multiplication is overwhelmingly dominant.

Takeaways for Developers

Even for developers who never design hardware, these principles are practically useful.

Increasing batch size raises weight reuse on a weight-stationary accelerator, improving efficiency. This is one reason batching matters in inference serving.
Aligning matrix dimensions (e.g. padding to multiples of 128) makes it easy for the compiler to build tiles that fill the grid, raising utilization.
Using low-precision compute (FP8, FP4, etc.) lets the same grid process more MACs and reduces the bytes moved, raising arithmetic intensity.
Being mindful of memory access patterns — keeping tensors in contiguous, aligned layouts — favors the data flow.

In short, keep "multiplication is cheap, data movement is expensive" in mind, and you develop intuition for shaping your model structure and serving setup to be hardware-friendly.

Conclusion

The systolic array is the result of pushing a simple idea all the way. The principle — once you read data, reuse it as much as possible — is etched directly into a structure that lays compute units out in a grid and streams data over them. Like a beating heart, data advances one cell per clock, multiplying and accumulating in an elegant flow that underpins the TPU's overwhelming matrix-multiply performance.

Understand the three axes — choice of dataflow strategy, tiling and utilization, and compiler mapping — and you gain solid intuition for why a given accelerator shines on a given workload, and how to shape your model to be hardware-friendly. The progress of AI hardware is ultimately an endless fight with the memory wall, and the systolic array is one of the longest-proven weapons in that fight.