Mechanical Sympathy: Writing Code the Hardware Likes

Introduction — Same Complexity, Different Speed
The Memory Hierarchy — Cliffs of Speed
Cache Lines — The Cache Does Not Move Bytes One at a Time
Data Locality — Array vs Linked List
Access Pattern Is Everything — Row-Major vs Column-Major
Branch Prediction — The CPU Guesses the Future
False Sharing — Invisible Contention
Data Layout — SoA vs AoS
Predictable Access Patterns Are the Core
When to Care and When to Ignore
Conclusion
References

Introduction — Same Complexity, Different Speed

Algorithms courses teach us to judge code by its time complexity. O(n) beats O(n squared); ignore constant factors. This view is mostly right and useful. But dig into real-world performance and you hit a strange wall. Two pieces of code are clearly both O(n), yet one runs 10 times, sometimes 50 times, faster than the other, even though both touch each element exactly once.

This difference comes not from the algorithm but from the hardware. A modern CPU is not the simple calculator we imagine. It is a complex machine full of multiple cache layers, a predictor that guesses branches ahead of time, and a pipeline that overlaps the execution of many instructions. The attitude of understanding how this machine works and writing code that fits it is called mechanical sympathy. The phrase originally came from F1 driver Jackie Stewart, meaning "a driver who understands the machine drives it faster," and it applies just as well to software.

This post is about why hardware-friendly code is fast and how to write it. The core topics are caches, data locality, branch prediction, false sharing, and data layout.

The Memory Hierarchy — Cliffs of Speed

The starting point for understanding performance is the fact that memory is not one thing. Between the CPU and main memory (RAM) sit several layers of cache, and each layer differs dramatically in speed and size. Higher up, faster and smaller; lower down, slower and larger.

  Layer          approx latency   relative speed   size
  ------------------------------------------------------------
  Register       ~0.3 ns          1x (baseline)    hundreds of bytes
  L1 cache       ~1 ns            about 3x slower  32-64 KB
  L2 cache       ~4 ns            about 10x slower 256KB-1MB
  L3 cache       ~12 ns           about 40x slower a few to tens of MB
  Main memory    ~100 ns          about 300x slow  a few to hundreds of GB
  SSD            ~100,000 ns      about 300,000x   hundreds of GB to TBs

The most important thing in this table is that the gaps between layers are as steep as cliffs. Reading data from the L1 cache versus from main memory is roughly a 100x difference. To the CPU, 100 ns is an eternity. In that time it could execute hundreds of instructions.

Scaling this latency to human time makes it vivid.

  If one CPU cycle were 1 second:
  L1 cache access   = about 3 seconds
  L2 cache access   = about 10 seconds
  L3 cache access   = about 40 seconds
  Main memory access = about 5 minutes
  SSD access        = about several days

So a CPU waiting on main memory is, in human terms, like stepping out for a cup of coffee. That is why much of modern performance optimization is not about "reducing computation" but about "reducing memory waits," that is, keeping data in cache as much as possible. Computation is already fast enough. What is slow is fetching the data.

Cache Lines — The Cache Does Not Move Bytes One at a Time

The key concept for understanding caches is the cache line. When the CPU brings data from memory into cache, it does not fetch only the single byte you asked for. Instead it fetches the whole fixed-size chunk that contains that byte. This chunk is the cache line, and on most modern CPUs it is 64 bytes.

  Even if you request a single int (4 bytes):

  Memory:  [.. the whole 64-byte line containing the requested int ..]
                        |
                        v
  Cache:   [the entire 64-byte line is loaded]

  So the 60 nearby bytes come into cache "for free" too

This fact has enormous consequences. Access some data, and the data next to it rides into cache for free. So accessing data that sits close together in memory in succession is very fast, because most of it is already in cache. Conversely, accessing data scattered all over memory is slow, because each access must fetch a new cache line.

This is the physical basis for data locality, our next topic. The advice to "use nearby data in succession" ultimately means "get multiple uses out of a single cache line."

Data Locality — Array vs Linked List

The starkest demonstration of the power of data locality is the comparison between arrays and linked lists. In textbook terms, the two have a trade-off: arrays are fast for random access and slow for insertion, linked lists are fast for insertion and slow for random access. But measure real traversal performance and the linked list is often far slower than the array, and even in insertion/deletion-heavy scenarios the array frequently wins. The reason is the cache.

  Array: elements sit contiguously in memory
  [e0][e1][e2][e3][e4][e5][e6][e7] ...
   |-- one cache line brings in several elements --|
   traversal is mostly cache hits -> fast

  Linked list: nodes are scattered all over memory
  [node2] ......... [node0] ..... [node3] ... [node1]
      each node requires following a pointer to jump
      mostly cache misses -> slow (a memory round trip every time)

An array lays its elements out contiguously in memory. So sweeping an array from start to end, a single cache line brings in several elements, and most accesses are cache hits. On top of that, the CPU detects this sequential pattern and prefetches data it will need next.

A linked list is different. Each node is allocated scattered across the heap, and nodes connect only by pointers. So traversing the list means following pointers and jumping all over memory. Each node requires a new cache line, and worse, you cannot know where the next node is until you read the current one, which defeats prefetching. The result is a string of cache misses.

The key lesson is this. Big-O notation counts only the number of accesses, but real speed depends heavily on whether those accesses are cache hits or misses. For work that traverses sequentially, an array (or vector, or dynamic array) that uses contiguous memory wins most of the time.

Access Pattern Is Everything — Row-Major vs Column-Major

Even for the same data, the order in which you access it can change speed dramatically. Consider the classic example of traversing a 2D array. Most languages (C, NumPy's default, and so on) store 2D arrays in row-major order, meaning the elements of a row sit contiguously in memory.

import numpy as np
A = np.zeros((10000, 10000))

# way 1: traverse in row-major order (matches memory order)
total = 0.0
for i in range(10000):
    for j in range(10000):
        total += A[i][j]     # inner loop sweeps contiguous memory -> fast

# way 2: traverse in column-major order (against memory order)
total = 0.0
for j in range(10000):
    for i in range(10000):
        total += A[i][j]     # each access jumps a full row -> cache misses explode

The two pieces of code add exactly the same elements exactly the same number of times. They are algorithmically identical. Yet way 1 runs several times faster than way 2. Depending on the language and size, the difference can be 5x to more than 10x.

The reason is cache lines. In way 1, the inner loop sweeps a row laid out contiguously in memory, so a single cache line brings in several elements and most accesses hit. In way 2, the inner loop walks down a column, and under row-major storage the elements of the same column are a full row apart in memory (here, 10,000 elements apart). So each access must fetch a new cache line, and the remaining 63 bytes of that line are thrown away unused.

The lesson is clear: access data in the order it is stored. For a multidimensional array, know the memory layout (row-major or column-major) and arrange your loop order so that the innermost loop sweeps contiguous memory. This single simple principle often yields a bigger performance gain than any algorithmic improvement.

Branch Prediction — The CPU Guesses the Future

Another secret behind a modern CPU's speed is pipelining. Rather than finishing one instruction before starting the next, the CPU overlaps many instructions like an assembly line. But there is a problem. When it meets a branch like an if, the CPU must decide which side's instructions to feed into the pipeline before the condition's result is known.

So the CPU uses a branch predictor. Looking at past patterns, it guesses "this branch is probably taken (or not taken)," fills the pipeline with that side ahead of time, and begins executing. If the guess is right, it proceeds fast with no penalty. But if it is wrong, it must throw away all the speculatively executed instructions (a pipeline flush) and restart from the correct side. That penalty runs to tens of cycles.

This fact leads to a famous observation: processing sorted data can be faster than processing unsorted data.

# sum only the values above a threshold in a large array
threshold = 128
total = 0
for x in data:
    if x >= threshold:      # the predictability of this branch drives the speed
        total += x

If data is sorted, this branch stays false for a while and then becomes true from some point on. The predictor easily learns this regular pattern and gets it right almost every time. But if data is random, the branch result jumps unpredictably and the predictor is wrong about half the time. Each misprediction flushes the pipeline, and the same code can run several times slower.

There are a few practical insights here. First, predictable branches are cheap and unpredictable branches are expensive. Second, sometimes it pays to eliminate the branch entirely (branchless code). For example, expressing "add the value or not, depending on a condition" with arithmetic instead of a branch avoids the misprediction penalty.

# arithmetic instead of a branch (no misprediction)
# (x >= threshold) is treated as True/False = 1/0
total += x * (x >= threshold)

Of course, branchless code is not always faster and can hurt readability, so it is not something to apply blindly. The key is the awareness that an unpredictable branch inside a hot loop can be a hidden performance thief.

An especially subtle performance trap in multithreaded code is false sharing. Several threads each modify different variables, so logically there is no conflict at all, yet performance collapses. The culprit is again the cache line.

As we saw, the cache moves in units of 64-byte lines. But each CPU core has its own cache, and when several cores cache the same data they must maintain coherence. When one core modifies a cache line, the copies of that same line held by other cores are invalidated.

The problem is that even different variables, if they happen to sit within the same cache line, are treated as if the threads were fighting over the same data.

  Suppose thread A modifies counter[0] and thread B modifies counter[1].
  If the two variables sit in the same 64-byte cache line:

  [ counter[0] | counter[1] | ... same 64-byte line ... ]
       ^A writes      ^B writes

  Every time A writes, B's cache line is invalidated,
  and every time B writes, A's cache line is invalidated.
  The two threads are logically independent, yet they ping-pong the
  cache line and slow each other down.

This is false sharing. It is not real data sharing (each has a different variable) but a fake contention that arises because they share a cache line. The symptom is nasty: the code works perfectly correctly, but adding threads does not improve performance or even makes it worse.

The fix is to push the offending variables onto different cache lines. Commonly, you add padding so that each thread's data occupies an independent cache line.

  Fix: add padding between each thread's data to separate the lines

  [ counter[0] + padding (fills the line) ][ counter[1] + padding ]
        line dedicated to thread A             line dedicated to thread B

  now they no longer touch each other's line, so the ping-pong disappears

False sharing is hard to see and tricky to diagnose. If you made something multithreaded but it does not scale, it is worth suspecting whether the data the threads touch happens to be crowded into the same cache line.

Data Layout — SoA vs AoS

The design-level topic of how to lay data out in memory is Array of Structures (AoS) versus Structure of Arrays (SoA). Even holding the same data, the two differ greatly in cache efficiency.

AoS is the way we naturally think of. Make each entity a struct, and put those structs in an array.

  AoS (Array of Structures):
  struct Particle { float x, y, z; float mass; ... };
  Particle particles[N];

  Memory: [x0 y0 z0 m0][x1 y1 z1 m1][x2 y2 z2 m2] ...
          all fields of one particle sit together

SoA is the opposite: make a separate array per field.

  SoA (Structure of Arrays):
  struct Particles { float x[N]; float y[N]; float z[N]; float mass[N]; };

  Memory: [x0 x1 x2 x3 ...][y0 y1 y2 ...][z0 z1 z2 ...] ...
          the same field sits contiguously together

Why does this difference matter? Consider work that sweeps only a specific field in bulk. Suppose a physics simulation loop updates only the x coordinate of every particle.

In AoS, the x coordinates are far apart in memory (with y, z, and mass wedged between them). So sweeping only x drags the unused y, z, and mass into the cache line along with it, wasting cache space. In SoA, the x coordinates sit contiguously in one array, so the cache line comes packed with nothing but x. The cache is used frugally, and because access is sequential, prefetching works well too. SoA is also friendlier to SIMD (processing multiple data with one instruction) vectorization.

  When processing only x coordinates in bulk:

  AoS: cache line holds one x + wasted y,z,mass  -> low cache efficiency
  SoA: cache line packed with only x             -> high cache efficiency, SIMD-friendly

That said, SoA is not always right. If you frequently handle all fields of one entity together (for example, reading one whole particle and processing it), AoS is better because it gathers that entity's fields into a single cache line. The key is the access pattern. Decide the layout by whether you sweep by field or by entity. The reason SoA is so often chosen in game engines, physics engines, and data-processing systems is that the hot loops of such systems usually sweep only a specific field in bulk.

Predictable Access Patterns Are the Core

There is a single principle running through everything so far: CPUs and caches love predictable sequential access and hate random, scattered access.

CPUs have a hardware prefetcher. It watches memory access patterns, and when it detects regular sequential access, it pulls the data you will need next into cache ahead of time. So for a predictable pattern like sweeping an array in order, the data has already arrived in cache before you need it. With the prefetcher's help, memory latency is almost hidden.

Conversely, if access is random, the prefetcher cannot read the pattern and cannot help. Data structures that chase pointers around (linked lists, trees, hash-table chaining) or accesses with jumbled indices defeat the prefetcher.

From here we can gather practical principles.

Access sequentially. Whenever possible, sweep data in the order it is stored.
Prefer contiguous memory. A contiguous array beats scattered nodes for both the cache and the prefetcher.
Lay out data to match the access pattern. Keep frequently co-used data close (in the same cache line), and consider SoA if you sweep by field.
Make hot-loop branches predictable. Sort or eliminate the branch if needed.
Avoid false sharing in multithreading. Separate per-thread data onto different cache lines.

These principles all converge on one attitude: being conscious, as you write code, of how data is laid out in memory and in what order it is accessed. That is the practice of mechanical sympathy.

When to Care and When to Ignore

Having read this far, you might think "so I should optimize all my code this way," but no. Mechanical sympathy is not a rule to apply always; it is a tool to reach for when needed.

Most code is not performance-bound. Readability and correctness matter far more, and premature low-level optimization only makes code complex. As Donald Knuth's famous line goes, "premature optimization is the root of all evil." So the first principle is always to measure. Use a profiler to find the real hot spots, and focus only there.

Where mechanical sympathy truly matters is clear: hot loops that process vast amounts of data repeatedly, real-time rendering in games, physics, and graphics, and systems where latency is cost, such as high-frequency trading or large-scale data processing. In such places, even after you have refined the algorithmic complexity, cache optimization can yield several times more gain.

The balanced attitude is this: normally write clear, simple code, but understand how the hardware works. That way, at the moment performance truly matters, you know where and how to make changes. Mechanical sympathy is not an obsession to apply to all code, but a discernment to exercise when the moment calls for it.

Conclusion

The reason two pieces of code with the same time complexity can differ by tens of times in reality is that a computer is not the abstract calculator of textbooks but a physical machine full of caches, pipelines, and predictors. The memory hierarchy has cliffs of speed, the cache moves in units of lines, data locality governs traversal speed, branch prediction decides the fate of hot loops, false sharing blocks multithreaded scaling, and data layout (AoS versus SoA) determines cache efficiency.

Mechanical sympathy is the attitude of writing code with regard for this machine's mind. Keep data contiguous, access it sequentially, make branches predictable, lay it out to match the access pattern. All of this ultimately reduces to one principle: handle data in the way the hardware likes.

Big-O notation is still a precious starting point. But beneath it lies a world of performance it cannot explain. The developer who understands that world writes far faster code with the same algorithm, just as the driver who understands the machine drives it faster.

References

Mechanical Sympathy (Martin Thompson's blog): https://mechanical-sympathy.blogspot.com/
What Every Programmer Should Know About Memory (Ulrich Drepper): https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
Latency Numbers Every Programmer Should Know: https://gist.github.com/jboner/2841832
Data-Oriented Design (Richard Fabian): https://www.dataorienteddesign.com/dodbook/
Agner Fog — Software optimization resources: https://www.agner.org/optimize/