
CPU Cache & Memory Hierarchy Complete Guide 2025: L1/L2/L3, Cache Line, False Sharing, Performance Tuning

TL;DR

  • Memory hierarchy = the essence of performance: a miss that goes to RAM costs ~100 cycles vs ~4 for an L1 hit
  • Cache Line (64 bytes): the unit of memory access
  • False Sharing: the silent killer of multi-threaded performance
  • Cache-friendly structures: Array > LinkedList, SoA > AoS
  • Mechanical Sympathy: understand the hardware and code runs 100x faster

1. Memory Hierarchy

1.1 Why a hierarchy?

Ideal: all memory would be fast, big, and cheap.

Reality — tradeoffs:

  • Fast = small = expensive
  • Slow = large = cheap

A layered hierarchy balances them.

1.2 Hierarchy (2025)

| Tier           | Size         | Latency             | Notes               |
|----------------|--------------|---------------------|---------------------|
| Register       | ~32 x 64-bit | 1 cycle             | Inside the CPU core |
| L1 cache       | 32-64 KB     | 4 cycles (~1 ns)    | Per core            |
| L2 cache       | 256 KB-1 MB  | 10 cycles (~3 ns)   | Per core or shared  |
| L3 cache (LLC) | 4-128 MB     | 40 cycles (~12 ns)  | Shared across cores |
| RAM (DDR5)     | 16-512 GB    | 100 cycles (~30 ns) | System-wide         |
| NVMe SSD       | 1-100 TB     | ~100 microseconds   | Persistent          |
| HDD            | 1-20 TB      | ~10 milliseconds    | Almost unused       |

1.3 Scale visualization

If L1 = 1 second:

  • L2 = 3 seconds
  • L3 = 12 seconds
  • RAM = 30 seconds
  • SSD = 27 hours
  • HDD = 115 days

RAM is roughly 30x slower than L1. SSDs are about 100,000x slower, and HDDs are millions of times slower.

1.4 Implication

Memory access = cost
Cache hit = cheap
Cache miss = expensive

Good code = code with many cache hits.
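
To see this cost directly, here is a minimal C++ sketch (not from the original article) that sums the same number of integers twice: once sequentially, and once with a 64-byte stride so that every load touches a fresh cache line. On a typical x86-64 machine the strided loop is many times slower; exact numbers vary by CPU.

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t kLoads = 1 << 22;          // 4M loads per experiment
    std::vector<int> dense(kLoads, 1);           // sequential: 16 ints per cache line
    std::vector<int> sparse(kLoads * 16, 1);     // strided: 1 int per cache line

    auto time_it = [](const char* label, auto&& fn) {
        auto t0 = std::chrono::steady_clock::now();
        long sum = fn();
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%-10s sum=%ld  %lld ms\n", label, sum, (long long)ms);
    };

    time_it("sequential", [&] { long s = 0; for (std::size_t i = 0; i < kLoads; i++) s += dense[i];       return s; });
    time_it("strided",    [&] { long s = 0; for (std::size_t i = 0; i < kLoads; i++) s += sparse[i * 16]; return s; });
}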


2. How CPU Caches Work

2.1 Cache Line

A cache line is the smallest unit the cache transfers to and from memory, typically 64 bytes.

Read 1 byte from RAM?
-> The CPU fetches a full 64-byte cache line

Effect:

  • Neighboring memory arrives together
  • Natural exploitation of locality
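
If you want to check the line size on your own machine, a small sketch follows. sysconf(_SC_LEVEL1_DCACHE_LINESIZE) is a Linux/glibc extension and std::hardware_destructive_interference_size needs a recent C++17 toolchain, so treat both as assumptions about the environment.

#include <cstdio>
#include <new>        // std::hardware_destructive_interference_size (C++17)
#include <unistd.h>   // sysconf (Linux/glibc)

int main() {
    // What the OS reports for the L1 data cache line size (0 or -1 if unknown)
    long os_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::printf("L1 dcache line (sysconf): %ld bytes\n", os_line);

    // What the compiler assumes when padding against false sharing
    std::printf("hardware_destructive_interference_size: %zu bytes\n",
                std::hardware_destructive_interference_size);
}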

2.2 Two kinds of locality

  • Temporal locality: something just accessed is likely to be accessed again soon (loop counters).
  • Spatial locality: memory near a recent access is likely to be accessed soon (array traversal).

Good code exploits both.

2.3 Cache mapping

  • Direct Mapped: each address maps to exactly one line.
  • Set Associative: pick among N lines (8-way is common).
  • Fully Associative: anywhere (hard to implement).

Most modern CPUs are 8-way set associative.
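
To make the mapping concrete, consider a hypothetical 32 KB, 8-way L1 with 64-byte lines: 32768 / (64 * 8) = 64 sets, so bits 0-5 of an address select the byte within the line and bits 6-11 select the set. A small sketch of that arithmetic:

#include <cstdint>
#include <cstdio>

// Hypothetical L1: 32 KB, 8-way set associative, 64-byte lines -> 64 sets.
constexpr uint64_t kLineSize  = 64;
constexpr uint64_t kWays      = 8;
constexpr uint64_t kCacheSize = 32 * 1024;
constexpr uint64_t kSets      = kCacheSize / (kLineSize * kWays);   // 64

int main() {
    uint64_t addr   = 0x7ffd1234abcd;               // example address
    uint64_t offset = addr % kLineSize;             // byte within the line (bits 0-5)
    uint64_t set    = (addr / kLineSize) % kSets;   // which set (bits 6-11)
    uint64_t tag    = addr / (kLineSize * kSets);   // everything above the set bits
    std::printf("offset=%llu  set=%llu  tag=%#llx\n",
                (unsigned long long)offset, (unsigned long long)set,
                (unsigned long long)tag);
}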

2.4 Replacement

  • LRU: evict the least recently used line (a software sketch of the idea follows this list)
  • Pseudo-LRU: a cheap approximation using a few bits per set
  • Random: surprisingly close to LRU in practice
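
Real hardware tracks recency with only a few bits per set; the sketch below is just a software illustration of the LRU idea (a hash map plus a recency list), not how a CPU implements it.

#include <cstdio>
#include <list>
#include <unordered_map>

// Toy LRU "cache" with a fixed capacity: most recently used keys sit at the front.
class LruCache {
    std::size_t capacity_;
    std::list<int> order_;                                   // keys, MRU first
    std::unordered_map<int, std::list<int>::iterator> pos_;  // key -> position in order_
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns true on a hit; on a miss, inserts the key, evicting the LRU key if full.
    bool access(int key) {
        auto it = pos_.find(key);
        if (it != pos_.end()) {                      // hit: move to the front
            order_.splice(order_.begin(), order_, it->second);
            return true;
        }
        if (order_.size() == capacity_) {            // miss on a full cache: evict LRU
            pos_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(key);
        pos_[key] = order_.begin();
        return false;
    }
};

int main() {
    LruCache cache(2);
    int keys[] = {1, 2, 1, 3, 2};                    // the final 2 misses: it was evicted by 3
    for (int key : keys)
        std::printf("key %d: %s\n", key, cache.access(key) ? "hit" : "miss");
}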

3. Cache Miss Types

3.1 Compulsory (Cold) Miss

First access. Unavoidable.

Mitigate: prefetching.

// Compiler / CPU does this automatically
for (int i = 0; i < n; i++) {
    process(arr[i]);
    // CPU prefetches arr[i+1], arr[i+2]
}
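
When the pattern is predictable but too irregular for the hardware prefetcher (for example, loads through an index array), you can hint explicitly. __builtin_prefetch is a GCC/Clang builtin; whether it helps is workload-dependent, so measure rather than assume.

// Sketch: software prefetch while gathering through an index array (GCC/Clang).
long sum_indexed(const int* data, const int* idx, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&data[idx[i + 8]]);  // start the load ~8 iterations early
        sum += data[idx[i]];
    }
    return sum;
}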

3.2 Capacity Miss

Cache is too small for the working set.

Fix: fit the working set, use cache blocking (tiling).

// Without blocking (many misses)
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];

// With blocking (fits in cache)
for (int ii = 0; ii < n; ii += BLOCK)
    for (int jj = 0; jj < n; jj += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    for (int k = kk; k < kk + BLOCK; k++)
                        C[i][j] += A[i][k] * B[k][j];

Can be 10x+ faster.

3.3 Conflict Miss

Multiple items map to the same cache set slot. Fix with padding or alignment.
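
A classic trigger, shown as a sketch: walking down a column of a matrix whose row size is a power of two makes consecutive accesses land in the same cache sets; padding each row by a few elements staggers the set indices.

// Rows are 1024 floats = 4096 bytes apart, so a column walk keeps hitting the
// same cache sets and conflict-misses once the set's ways are exhausted.
float column_sum(float (*a)[1024], int rows, int col) {
    float s = 0;
    for (int i = 0; i < rows; i++) s += a[i][col];
    return s;
}

// Padding each row (1024 + 16 floats = 4160 bytes) shifts the set index per row.
float column_sum_padded(float (*a)[1024 + 16], int rows, int col) {
    float s = 0;
    for (int i = 0; i < rows; i++) s += a[i][col];
    return s;
}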


4. Cache-Friendly Data Structures

4.1 Array vs LinkedList

// Array - cache friendly
int arr[1000];
for (int i = 0; i < 1000; i++)
    sum += arr[i];  // contiguous, prefetcher wins

// LinkedList - cache hostile
struct Node {
    int value;
    struct Node* next;
};

Node* curr = head;
while (curr) {
    sum += curr->value;
    curr = curr->next;  // unpredictable, cache miss
}

Benchmark: Array is 10-100x faster than LinkedList on large datasets.

4.2 AoS vs SoA

AoS (traditional):

struct Point {
    float x, y, z;
    int color;
};

Point points[1000];

for (int i = 0; i < 1000; i++)
    sum += points[i].x;
// Fetches 16 bytes each iter, uses only 4

SoA (cache friendly):

struct Points {
    float x[1000];
    float y[1000];
    float z[1000];
    int color[1000];
};

Points points;

for (int i = 0; i < 1000; i++)
    sum += points.x[i];
// 4 bytes fetched, 100% utilized, SIMD-ready

Result: 4-10x speedup. Used in game engines (ECS), analytics, ML.

4.3 Struct padding and alignment

struct Bad {
    char a;     // 1 byte + 7 padding
    double b;   // 8 bytes
    char c;     // 1 byte + 7 padding
};
// sizeof(Bad) = 24 bytes (14 bytes of padding!)

struct Good {
    double b;   // 8 bytes
    char a;     // 1 byte
    char c;     // 1 byte + 6 padding
};
// sizeof(Good) = 16 bytes

Rule of thumb: order fields from largest to smallest.

4.4 Hash map cache behavior

  • Open addressing (Robin Hood, Swiss Table): data in one array, cache friendly.
  • Separate chaining (traditional): per-bucket linked lists, cache hostile.

Google's Swiss Table searches 16 slots at a time with SIMD.
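
A minimal sketch of why open addressing is cache friendly: with linear probing, the whole lookup usually stays within one or two cache lines of a single flat array. This is illustrative only; it is not how Swiss Table's SIMD group probing works.

#include <cstdint>
#include <cstdio>
#include <vector>

// Toy open-addressing set with linear probing. All slots live in one flat array,
// so a probe sequence is a short, contiguous scan instead of pointer chasing.
class FlatSet {
    static constexpr int64_t kEmpty = INT64_MIN;   // sentinel for an unused slot
    std::vector<int64_t> slots_;
public:
    // capacity must be a power of two and should never be filled completely (toy code).
    explicit FlatSet(std::size_t capacity) : slots_(capacity, kEmpty) {}

    void insert(int64_t key) {
        std::size_t mask = slots_.size() - 1;
        for (std::size_t i = hash(key) & mask; ; i = (i + 1) & mask)
            if (slots_[i] == kEmpty || slots_[i] == key) { slots_[i] = key; return; }
    }

    bool contains(int64_t key) const {
        std::size_t mask = slots_.size() - 1;
        for (std::size_t i = hash(key) & mask; ; i = (i + 1) & mask) {
            if (slots_[i] == key)    return true;
            if (slots_[i] == kEmpty) return false;   // end of the probe chain
        }
    }
private:
    static std::size_t hash(int64_t key) {
        return static_cast<std::size_t>(key) * 0x9E3779B97F4A7C15ULL;
    }
};

int main() {
    FlatSet set(1024);
    set.insert(42);
    std::printf("%d %d\n", (int)set.contains(42), (int)set.contains(7));   // 1 0
}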


5. False Sharing

5.1 The problem

struct Counters {
    atomic<int> a;  // 4 bytes
    atomic<int> b;  // 4 bytes
};
// Both sit on the same 64-byte cache line

Thread 1 writes a, Thread 2 writes b. The writes are independent, but they share a cache line, so each write invalidates the other core's copy and the line ping-pongs between cores.

Slowdown: 10-100x.
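
A minimal two-thread demo of the effect, assuming 64-byte lines; the padded version is usually several times faster, though the exact ratio depends on the CPU:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct SharedLine { std::atomic<long> a{0}, b{0}; };        // a and b share one cache line
struct PaddedLine { alignas(64) std::atomic<long> a{0};
                    alignas(64) std::atomic<long> b{0}; };  // separate lines

template <class Counters>
long long run() {
    Counters c;
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&] {
        for (int i = 0; i < 10'000'000; i++) c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (int i = 0; i < 10'000'000; i++) c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    auto t1end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1end - t0).count();
}

int main() {                                   // build with -O2 -pthread
    std::printf("false sharing: %lld ms\n", run<SharedLine>());
    std::printf("padded:        %lld ms\n", run<PaddedLine>());
}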

5.2 Detection

perf c2c record ./my_program
perf c2c report

5.3 Fixes

Padding:

struct Counters {
    alignas(64) atomic<int> a;
    alignas(64) atomic<int> b;
};

C++17 hardware_destructive_interference_size:

#include <new>
struct Counters {
    alignas(std::hardware_destructive_interference_size) atomic<int> a;
    alignas(std::hardware_destructive_interference_size) atomic<int> b;
};

Thread-local:

thread_local int a;
thread_local int b;

5.4 LongAdder (Java)

// AtomicLong - false sharing possible
AtomicLong counter = new AtomicLong();
counter.incrementAndGet();

// LongAdder - avoids false sharing by sharding
LongAdder counter = new LongAdder();
counter.increment();

Internally it shards the count across padded cells, so threads don't contend on one cache line. 100x+ faster than AtomicLong at 16+ threads.


6. MESI Protocol

6.1 Cache coherence

Multi-core caches need coherence. MESI tracks four states per line:

  • M (Modified): dirty, only this core
  • E (Exclusive): clean, only this core
  • S (Shared): clean, multiple cores
  • I (Invalid): stale

6.2 Transitions

[I] -> read  -> [E] (alone) or [S] (shared)
[E] -> write -> [M]
[S] -> write -> [M] (invalidate others)
[M] -> other read -> [S] + writeback

Variants: MOESI, MESIF. Coherence traffic is what makes false sharing expensive.
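
As a toy sketch only (no bus, no other cores modelled), the transitions above can be written as a next-state function for a single line in a single core:

#include <cstdio>

enum class Mesi  { Modified, Exclusive, Shared, Invalid };
enum class Event { LocalRead, LocalWrite, RemoteRead, RemoteWrite };

// Next state of one cache line in one core. Writebacks, bus messages and the
// Exclusive-vs-Shared choice on a fill (depends on other sharers) are simplified.
Mesi next(Mesi s, Event e) {
    switch (e) {
        case Event::LocalRead:   return s == Mesi::Invalid ? Mesi::Shared : s;
        case Event::LocalWrite:  return Mesi::Modified;          // S/E/I -> M, others invalidated
        case Event::RemoteRead:  return (s == Mesi::Modified || s == Mesi::Exclusive)
                                            ? Mesi::Shared : s;  // M -> S implies a writeback
        case Event::RemoteWrite: return Mesi::Invalid;           // another core took ownership
    }
    return s;
}

int main() {
    Mesi s = Mesi::Invalid;
    s = next(s, Event::LocalRead);    // I -> S
    s = next(s, Event::LocalWrite);   // S -> M
    s = next(s, Event::RemoteRead);   // M -> S (+ writeback)
    std::printf("final state: %d\n", (int)s);   // 2 == Shared
}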


7. NUMA — Non-Uniform Memory Access

7.1 What is NUMA?

Large servers (40+ cores) have multiple sockets, each with its own memory bank.

[Socket 1]      [Socket 2]
   |               |
[Memory 1]    [Memory 2]
   |               |
   +--- Bus -------+

  • Local memory: ~100ns
  • Remote memory: ~200ns

7.2 Impact

Wrong placement = 50% slowdown.

7.3 NUMA-aware programming

# Run pinned to NUMA node 0 (CPU and memory)
numactl --cpunodebind=0 --membind=0 ./my_program

# Inspect the NUMA topology
numactl --hardware

// From C/C++: allocate on the node this thread runs on (libnuma, link with -lnuma)
#include <numa.h>
void* mem = numa_alloc_local(size);
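
Putting it together, a minimal allocation sketch using libnuma (numa_available, numa_alloc_local and numa_free are real libnuma calls; link with -lnuma). Touching the memory from the allocating thread keeps it on the local node:

#include <cstdio>
#include <cstring>
#include <numa.h>      // libnuma; build with -lnuma

int main() {
    if (numa_available() < 0) {                 // kernel or library lacks NUMA support
        std::printf("NUMA not available\n");
        return 1;
    }
    std::size_t size = 64 * 1024 * 1024;
    void* mem = numa_alloc_local(size);         // allocate on this thread's current node
    if (!mem) return 1;
    std::memset(mem, 0, size);                  // first touch from the same thread stays local
    numa_free(mem, size);
    return 0;
}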

7.4 Use cases

  • Databases (PostgreSQL, MySQL): NUMA-aware tuning is mandatory
  • JVM: -XX:+UseNUMA
  • Redis: single-threaded, huge NUMA sensitivity
  • Kubernetes: NUMA topology manager

8. Mechanical Sympathy

8.1 The concept

Martin Thompson (LMAX Disruptor) popularized this idea:

"If you understand how the machine works, you can write optimal code."

Like a race-car driver who understands engines, programmers should understand CPUs.

8.2 LMAX Disruptor

  • Problem: 6 million financial transactions per second.
  • Solution: mechanical sympathy to the extreme.
  • Tools: ring buffer (array-based), false-sharing avoidance (padding), single-writer principle (lock-free), pre-allocation (no GC).
  • Result: Java at C++ speeds.

8.3 Seven principles

  1. Single Writer Principle — no locks
  2. Write combining
  3. Pad to avoid false sharing
  4. Immutable objects
  5. Avoid GC pauses (pre-allocate)
  6. Cache-friendly data structures
  7. Understand the hardware

8.4 Example

// Bad - false sharing
class BadCounter {
    long count;
}

// Good - padding
class GoodCounter {
    long p1, p2, p3, p4, p5, p6, p7;
    long count;
    long p9, p10, p11, p12, p13, p14, p15;
}

@Contended annotation (Java 8+):

// sun.misc.Contended requires the JVM flag -XX:-RestrictContended for non-JDK classes
@sun.misc.Contended
class Counter {
    long count;
}

9. Measurement and Profiling

9.1 perf

perf stat -e cache-misses,cache-references ./my_program

9.2 L1 miss rate

perf stat -e L1-dcache-load-misses,L1-dcache-loads ./my_program

L1 miss rate above 5% usually means there's room to optimize.

9.3 Intel VTune

Commercial (free tier available). Memory access patterns, false sharing, SIMD utilization.

9.4 cachegrind (Valgrind)

valgrind --tool=cachegrind ./my_program
cg_annotate cachegrind.out.PID

Per-function cache miss reports.

9.5 Microbenchmarks

Google Benchmark (C++):

#include <benchmark/benchmark.h>   // build: g++ bench.cc -lbenchmark -lpthread

static void BM_Array(benchmark::State& state) {
    int arr[1000] = {};            // zero-initialize so the loop reads defined values
    for (auto _ : state) {
        long sum = 0;
        for (int i = 0; i < 1000; i++)
            sum += arr[i];
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_Array);
BENCHMARK_MAIN();

JMH (Java) does the same for JVM code.


10. Real-World Optimization

10.1 Matrix multiplication

Naive:

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

The inner loop walks B[k][j] down a column (a stride of N elements per step), which is cache-hostile.

Loop interchange (i-k-j order) makes the inner loop stride through B row-wise:

for (i = 0; i < N; i++)
    for (k = 0; k < N; k++)
        for (j = 0; j < N; j++)
            C[i][j] += A[i][k] * B[k][j];

Tiling:

for (ii = 0; ii < N; ii += T)
    for (kk = 0; kk < N; kk += T)
        for (jj = 0; jj < N; jj += T)
            for (i = ii; i < ii+T; i++)
                for (k = kk; k < kk+T; k++)
                    for (j = jj; j < jj+T; j++)
                        C[i][j] += A[i][k] * B[k][j];

Add SIMD + threading (BLAS): 100x+ over naive.

10.2 Hash tables

  • std::unordered_map: separate chaining, cache hostile.
  • Swiss Table / dense_hash_map: open addressing + SIMD, 2-5x faster.
  • Robin Hood hashing: another cache-friendly variant.

10.3 Game engine ECS

Traditional OOP drags entire objects through cache. ECS stores components as parallel arrays, using only what the system needs. Unity DOTS, Unreal Mass: 10x+ speedups.

10.4 Columnar databases

  • Row stores (PostgreSQL, MySQL): read full rows even for one column.
  • Column stores (ClickHouse, Druid): read only the needed column.

For SELECT AVG(age), columnar is 100x+ faster.


Quiz

1. How much slower is RAM than L1 cache?

Answer: L1 is ~1ns (4 cycles), RAM is ~30ns (100 cycles), so roughly 30x. If L1 = 1 second, RAM = 30 seconds and SSD = 27 hours. Each tier of the hierarchy is several times to an order of magnitude slower than the one above, which is why cache-friendly code can be 100x faster: it isn't micro-optimization, it's the difference between tiers.

2. What is false sharing and how do you prevent it?

Answer: Two variables on the same 64-byte cache line ping-pong between cores whenever either is written, even if threads use different variables. 10-100x slowdown. Prevent with (1) padding via alignas(64), (2) C++17 hardware_destructive_interference_size, (3) thread-local variables, (4) Java @Contended. Detect with perf c2c.

3. AoS vs SoA?

Answer: Array of Structures stores each record together; Structure of Arrays stores each field as its own array. SoA lets you read only the needed field (100% cache utilization) and enables SIMD/vectorization. 4-10x faster for field-wise workloads. Used by ECS game engines (Unity DOTS, Unreal Mass), analytics, ML.

4. Why does NUMA matter on multi-socket servers?

Answer: Each socket has its own memory bank. Local accesses are ~100ns, remote ~200ns. Bad placement costs 50% performance. Fix with numactl binding, numa_alloc_local, JVM -XX:+UseNUMA. Essential for PostgreSQL, MySQL, Redis tuning.

5. What is Mechanical Sympathy and why does LMAX Disruptor use it?

Answer: A term from Martin Thompson — understand how the machine works to write optimal code. LMAX Disruptor hits 6M tx/sec in Java via: ring buffer (cache-friendly array), single-writer principle (lock-free), padding (no false sharing), pre-allocation (no GC), immutable objects. Result: Java at C++ speeds.

