CPU Cache & Memory Hierarchy Complete Guide 2025: L1/L2/L3, Cache Line, False Sharing, Performance Tuning
- Youngju Kim
- @fjvbn20031
TL;DR
- Memory hierarchy = the essence of performance: a miss that falls all the way to RAM costs roughly 30x an L1 hit
- Cache Line (64 bytes): the unit of memory access
- False Sharing: the silent killer of multi-threaded performance
- Cache-friendly structures: Array > LinkedList, SoA > AoS
- Mechanical Sympathy: understand the hardware and code runs 100x faster
1. Memory Hierarchy
1.1 Why a hierarchy?
Ideal: all memory would be fast, big, and cheap.
Reality — tradeoffs:
- Fast = small = expensive
- Slow = large = cheap
A layered hierarchy balances them.
1.2 Hierarchy (2025)
| Tier | Size | Latency | Notes |
|---|---|---|---|
| Register | ~32 x 64-bit | 1 cycle | Inside the CPU core |
| L1 cache | 32-64 KB | 4 cycles (~1ns) | Per core |
| L2 cache | 256 KB-1 MB | 10 cycles (~3ns) | Per core or shared |
| L3 cache (LLC) | 4-128 MB | 40 cycles (~12ns) | Shared across cores |
| RAM (DDR5) | 16-512 GB | 100 cycles (~30ns) | System-wide |
| NVMe SSD | 1-100 TB | 100 microseconds | Persistent |
| HDD | 1-20 TB | 10 milliseconds | Almost unused |
1.3 Scale visualization
If L1 = 1 second:
- L2 = 3 seconds
- L3 = 12 seconds
- RAM = 30 seconds
- SSD = 27 hours
- HDD = 115 days
RAM is roughly 30x slower than L1. SSD is ~100,000x slower, and HDD ~10 million times slower.
1.4 Implication
Memory access = cost
Cache hit = cheap
Cache miss = expensive
Good code = code with many cache hits.
2. How CPU Caches Work
2.1 Cache Line
A cache line is the minimum unit of data transfer between memory and the cache, typically 64 bytes.
Read 1 byte from RAM?
-> The CPU fetches a full 64-byte cache line
Effect:
- Neighboring memory arrives together
- Natural exploitation of locality
2.2 Two kinds of locality
- Temporal locality: something accessed just now is likely to be accessed again soon (loop counters, hot variables).
- Spatial locality: memory near a recent access is likely to be accessed soon (array traversal).
Good code exploits both.
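The classic illustration is 2D array traversal order. A minimal sketch (N = 4096 is an arbitrary size; C/C++ arrays are row-major): row-by-row touches consecutive bytes and rides the prefetcher, column-by-column jumps a full row every step and misses constantly.
// Sketch: same work, very different cache behavior (row-major arrays).
#include <stddef.h>

#define N 4096
static int grid[N][N];

long sum_row_major(void) {           // spatial locality: consecutive addresses
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += grid[i][j];       // next element is 4 bytes away
    return sum;
}

long sum_col_major(void) {           // cache hostile: stride of N*4 bytes
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += grid[i][j];       // next element is a whole row away
    return sum;
}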
2.3 Cache mapping
- Direct Mapped: each address maps to exactly one line.
- Set Associative: pick among N lines (8-way is common).
- Fully Associative: anywhere (hard to implement).
L1 data caches in modern CPUs are typically 8- to 12-way set associative; L2 and L3 are usually more associative.
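To make the mapping concrete, here is a small sketch under an assumed geometry (32 KB cache, 8-way, 64-byte lines, so 64 sets) of how an address picks its set:
#include <cstdint>
#include <cstdio>

// Assumed geometry: 32 KB, 8-way set associative, 64-byte lines -> 64 sets.
constexpr uint64_t LINE_BYTES = 64;
constexpr uint64_t NUM_SETS   = 32 * 1024 / (LINE_BYTES * 8);

uint64_t set_index(uint64_t addr) {
    return (addr / LINE_BYTES) % NUM_SETS;  // drop the offset bits, keep the set bits
}

int main() {
    // Addresses exactly 4096 bytes apart land in the same set (64 lines * 64 sets = 4096):
    std::printf("%llu %llu\n",
                (unsigned long long)set_index(0x10000),
                (unsigned long long)set_index(0x11000));
}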
2.4 Replacement
- LRU: least recently used
- Pseudo-LRU: cheap approximation
- Random: surprisingly close to LRU in practice
3. Cache Miss Types
3.1 Compulsory (Cold) Miss
First access. Unavoidable.
Mitigate: prefetching.
// The hardware prefetcher does this automatically for sequential access
for (int i = 0; i < n; i++) {
    process(arr[i]);
    // CPU prefetches arr[i+1], arr[i+2], ... ahead of use
}
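Hardware prefetching falls apart on irregular patterns (e.g. gathering through an index array). An explicit software prefetch can help; a sketch using the GCC/Clang builtin __builtin_prefetch (the prefetch distance of 8 is an assumed tuning knob, not a universal constant):
// Sketch: manual prefetching for an indirect (gather-style) access pattern.
void sum_indirect(const int* data, const int* idx, int n, long* out) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&data[idx[i + 8]]);  // hint: we'll need this soon
        sum += data[idx[i]];
    }
    *out = sum;
}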
3.2 Capacity Miss
Cache is too small for the working set.
Fix: fit the working set, use cache blocking (tiling).
// Without blocking: B streams through the whole matrix for every i (many misses)
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];
// With blocking: work on BLOCK x BLOCK tiles
// (pick BLOCK so the three active tiles fit comfortably in L1/L2)
for (int ii = 0; ii < n; ii += BLOCK)
    for (int jj = 0; jj < n; jj += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    for (int k = kk; k < kk + BLOCK; k++)
                        C[i][j] += A[i][k] * B[k][j];
Can be 10x+ faster.
3.3 Conflict Miss
Multiple items map to the same cache set slot. Fix with padding or alignment.
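The classic trigger is a power-of-two stride: with 64-byte lines and 64 sets, addresses 4 KB apart land in the same set (see the set_index sketch above), so walking a column of a 1024-wide float matrix hammers one set over and over. A sketch of the padding fix (the extra 16 floats, one cache line per row, are an assumed pad size):
// Conflict-prone: rows are exactly 4096 bytes, so a column walk maps every
// element to the same cache set and evicts constantly.
static float a[1024][1024];

// Padded: each row is 4096 + 64 bytes, so consecutive rows fall into
// different sets and a column walk spreads across the cache.
static float a_padded[1024][1024 + 16];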
4. Cache-Friendly Data Structures
4.1 Array vs LinkedList
// Array - cache friendly
int arr[1000];
for (int i = 0; i < 1000; i++)
    sum += arr[i];           // contiguous, the prefetcher wins

// LinkedList - cache hostile
struct Node {
    int value;
    struct Node* next;
};
Node* curr = head;
while (curr) {
    sum += curr->value;
    curr = curr->next;       // pointer chase to an unpredictable address: likely a miss
}
Benchmark: Array is 10-100x faster than LinkedList on large datasets.
4.2 AoS vs SoA
AoS (traditional):
struct Point {
    float x, y, z;
    int color;
};
Point points[1000];
for (int i = 0; i < 1000; i++)
    sum += points[i].x;
// Each iteration touches a 16-byte Point but uses only 4 bytes of it
SoA (cache friendly):
struct Points {
    float x[1000];
    float y[1000];
    float z[1000];
    int color[1000];
};
Points points;
for (int i = 0; i < 1000; i++)
    sum += points.x[i];
// Only the x array is fetched: 100% of each cache line is used, and the loop is SIMD-ready
Result: 4-10x speedup. Used in game engines (ECS), analytics, ML.
4.3 Struct padding and alignment
struct Bad {
    char a;    // 1 byte + 7 bytes padding (to align b)
    double b;  // 8 bytes
    char c;    // 1 byte + 7 bytes tail padding
};
// sizeof(Bad) == 24 bytes (16 wasted!)
struct Good {
    double b;  // 8 bytes
    char a;    // 1 byte
    char c;    // 1 byte + 6 bytes tail padding
};
// sizeof(Good) == 16 bytes
Rule: biggest fields first.
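A cheap way to keep the layout honest is a compile-time check; a sketch assuming the Good struct above and a typical LP64 ABI (sizes can differ on other ABIs):
#include <cstddef>

// Compile-time guard so a reordered or newly added field can't silently
// reintroduce padding.
static_assert(sizeof(Good) == 16, "Good grew: check field order/padding");
static_assert(offsetof(Good, a) == 8, "unexpected layout");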
4.4 Hash map cache behavior
- Open addressing (Robin Hood, Swiss Table): data in one array, cache friendly.
- Separate chaining (traditional): per-bucket linked lists, cache hostile.
Google's Swiss Table searches 16 slots at a time with SIMD.
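To see why open addressing is cache friendly, here is a deliberately minimal linear-probing set, not Swiss Table's actual layout (the power-of-two capacity, the multiplicative hash, and uint32_t keys are assumptions for brevity). All data sits in one flat array, so each probe step lands on the same or the next cache line:
#include <cstdint>
#include <vector>

// Minimal open-addressing (linear probing) table: all data in one flat array.
// Sketch only: no resizing, assumes the table never fills up and 0 is never inserted.
struct FlatSet {
    static constexpr uint32_t EMPTY = 0;
    std::vector<uint32_t> slots;
    explicit FlatSet(size_t capacity_pow2) : slots(capacity_pow2, EMPTY) {}

    void insert(uint32_t key) {
        size_t mask = slots.size() - 1;
        size_t i = (key * 2654435761u) & mask;      // multiplicative hash
        while (slots[i] != EMPTY && slots[i] != key)
            i = (i + 1) & mask;                     // next slot: same or adjacent cache line
        slots[i] = key;
    }

    bool contains(uint32_t key) const {
        size_t mask = slots.size() - 1;
        size_t i = (key * 2654435761u) & mask;
        while (slots[i] != EMPTY) {
            if (slots[i] == key) return true;
            i = (i + 1) & mask;
        }
        return false;
    }
};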
5. False Sharing
5.1 The problem
struct Counters {
    std::atomic<int> a;  // 4 bytes
    std::atomic<int> b;  // 4 bytes
};
// Both sit on the same 64-byte cache line
Thread 1 writes a, Thread 2 writes b. Independent writes, but same cache line, so each write invalidates the other's copy. Infinite ping-pong.
Slowdown: 10-100x.
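A minimal sketch that reproduces the problem (the iteration count is arbitrary): two threads increment independent counters that happen to share a line. Time it, then compare against the padded version from 5.3.
#include <atomic>
#include <thread>

// Sketch: two threads, two independent counters, one shared cache line.
struct Counters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};   // sits right next to a -> same 64-byte line
};

int main() {
    Counters c;
    auto hammer = [](std::atomic<long>& x) {
        for (long i = 0; i < 100000000; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(hammer, std::ref(c.a));
    std::thread t2(hammer, std::ref(c.b));
    t1.join();
    t2.join();
    return 0;
}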
5.2 Detection
perf c2c record ./my_program
perf c2c report
5.3 Fixes
Padding:
struct Counters {
    alignas(64) std::atomic<int> a;  // starts its own cache line
    alignas(64) std::atomic<int> b;  // pushed to a separate line
};
C++17 hardware_destructive_interference_size:
#include <atomic>
#include <new>
struct Counters {
    alignas(std::hardware_destructive_interference_size) std::atomic<int> a;
    alignas(std::hardware_destructive_interference_size) std::atomic<int> b;
};
Thread-local:
thread_local int a;
thread_local int b;
5.4 LongAdder (Java)
// AtomicLong - false sharing possible
AtomicLong counter = new AtomicLong();
counter.incrementAndGet();
// LongAdder - avoids false sharing by sharding
LongAdder counter = new LongAdder();
counter.increment();
Internally it shards the count across padded cells. Under heavy contention (16+ threads) it can be 100x+ faster than AtomicLong.
6. MESI Protocol
6.1 Cache coherence
Multi-core caches need coherence. MESI tracks four states per line:
- M (Modified): dirty, only this core
- E (Exclusive): clean, only this core
- S (Shared): clean, multiple cores
- I (Invalid): stale
6.2 Transitions
[I] -> read -> [E] (alone) or [S] (shared)
[E] -> write -> [M]
[S] -> write -> [M] (invalidate others)
[M] -> other read -> [S] + writeback
Variants: MOESI, MESIF. Coherence traffic is what makes false sharing expensive.
7. NUMA — Non-Uniform Memory Access
7.1 What is NUMA?
Large servers (40+ cores) have multiple sockets, each with its own memory bank.
[Socket 1] <--- interconnect ---> [Socket 2]
     |                                 |
[Memory 1]                       [Memory 2]
- Local memory: ~100ns
- Remote memory: ~200ns
7.2 Impact
Wrong placement = 50% slowdown.
7.3 NUMA-aware programming
numactl --cpunodebind=0 --membind=0 ./my_program
numactl --hardware
#include <numa.h>
void* mem = numa_alloc_local(size);
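A slightly fuller sketch using libnuma (link with -lnuma; node 0 and the 1 GiB size are arbitrary choices for illustration):
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    if (numa_available() < 0) {              // kernel/libnuma support check
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    size_t size = 1UL << 30;                 // 1 GiB, arbitrary for the example
    numa_run_on_node(0);                     // keep this thread on node 0
    void* mem = numa_alloc_onnode(size, 0);  // allocate from node 0's memory
    if (!mem) return 1;
    // ... touch and use mem from threads running on node 0 ...
    numa_free(mem, size);
    return 0;
}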
7.4 Use cases
- Databases (PostgreSQL, MySQL): NUMA-aware tuning is mandatory
- JVM: -XX:+UseNUMA
- Redis: single-threaded, so which node it runs on and allocates from matters a great deal
- Kubernetes: NUMA topology manager
8. Mechanical Sympathy
8.1 The concept
Martin Thompson (LMAX Disruptor) popularized this idea:
"If you understand how the machine works, you can write optimal code."
Like a race-car driver who understands engines, programmers should understand CPUs.
8.2 LMAX Disruptor
- Problem: 6 million financial transactions per second.
- Solution: mechanical sympathy to the extreme.
- Tools: ring buffer (array-based), false-sharing avoidance (padding), single-writer principle (lock-free), pre-allocation (no GC); a simplified ring-buffer sketch follows after this list.
- Result: Java at C++ speeds.
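To make the single-writer and pre-allocation ideas concrete, here is a tiny single-producer/single-consumer ring buffer sketch. It illustrates the principles, not the Disruptor's actual design; the 64-byte padding and power-of-two capacity are assumptions:
#include <atomic>
#include <cstddef>
#include <optional>

// SPSC ring: one writer owns `tail`, one reader owns `head` (single-writer
// principle), the slot array is pre-allocated, and the indices are padded apart
// so producer and consumer never fight over a cache line.
template <typename T, size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    T slots[N];                                   // pre-allocated storage
    alignas(64) std::atomic<size_t> head{0};      // consumer-owned
    alignas(64) std::atomic<size_t> tail{0};      // producer-owned

public:
    bool push(const T& v) {                       // called only by the producer
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;  // full
        slots[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    std::optional<T> pop() {                      // called only by the consumer
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return std::nullopt;  // empty
        T v = slots[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);
        return v;
    }
};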
8.3 Seven principles
- Single Writer Principle — no locks
- Write combining
- Pad to avoid false sharing
- Immutable objects
- Avoid GC pauses (pre-allocate)
- Cache-friendly data structures
- Understand the hardware
8.4 Example
// Bad - false sharing with neighboring fields
class BadCounter {
    long count;
}
// Good - manual padding keeps count on its own cache line
class GoodCounter {
    long p1, p2, p3, p4, p5, p6, p7;
    long count;
    long p9, p10, p11, p12, p13, p14, p15;
}
@Contended annotation (sun.misc.Contended on Java 8, jdk.internal.vm.annotation.Contended on Java 9+; outside the JDK it only takes effect with -XX:-RestrictContended):
@sun.misc.Contended
class Counter {
    long count;
}
9. Measurement and Profiling
9.1 perf
perf stat -e cache-misses,cache-references ./my_program
9.2 L1 miss rate
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./my_program
L1 miss rate above 5% usually means there's room to optimize.
9.3 Intel VTune
Commercial (free tier available). Memory access patterns, false sharing, SIMD utilization.
9.4 cachegrind (Valgrind)
valgrind --tool=cachegrind ./my_program
cg_annotate cachegrind.out.PID
Per-function cache miss reports.
9.5 Microbenchmarks
Google Benchmark (C++):
#include <benchmark/benchmark.h>

static void BM_Array(benchmark::State& state) {
    int arr[1000];
    for (int i = 0; i < 1000; i++) arr[i] = i;  // initialize so the loop sums real data
    for (auto _ : state) {
        long sum = 0;
        for (int i = 0; i < 1000; i++)
            sum += arr[i];
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_Array);
BENCHMARK_MAIN();
JMH (Java) does the same for JVM code.
10. Real-World Optimization
10.1 Matrix multiplication
Naive:
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];
B[k][j] walks down a column (stride N), which is cache-hostile.
Loop swap:
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            C[i][j] += A[i][k] * B[k][j];
// the inner loop now streams C[i][*] and B[k][*] sequentially
Tiling:
for (int ii = 0; ii < N; ii += T)
    for (int kk = 0; kk < N; kk += T)
        for (int jj = 0; jj < N; jj += T)
            for (int i = ii; i < ii + T; i++)
                for (int k = kk; k < kk + T; k++)
                    for (int j = jj; j < jj + T; j++)
                        C[i][j] += A[i][k] * B[k][j];
Add SIMD + threading (BLAS): 100x+ over naive.
10.2 Hash tables
- std::unordered_map: separate chaining, cache hostile.
- Swiss Table / dense_hash_map: open addressing + SIMD, 2-5x faster.
- Robin Hood hashing: another cache-friendly variant.
10.3 Game engine ECS
Traditional OOP drags entire objects through cache. ECS stores components as parallel arrays, using only what the system needs. Unity DOTS, Unreal Mass: 10x+ speedups.
10.4 Columnar databases
- Row stores (PostgreSQL, MySQL): read full rows even for one column.
- Column stores (ClickHouse, Druid): read only the needed column.
For SELECT AVG(age), columnar is 100x+ faster.
Quiz
1. How much slower is RAM than L1 cache?
Answer: L1 is ~1ns (4 cycles), RAM is ~30ns (100 cycles), so roughly 30x. If L1 = 1 second, RAM = 30 seconds and SSD = 27 hours. Each tier of the hierarchy is roughly an order of magnitude slower than the one above, which is why cache-friendly code can be orders of magnitude faster: the gap between tiers is structural, not an implementation detail.
2. What is false sharing and how do you prevent it?
Answer: Two variables on the same 64-byte cache line ping-pong between cores whenever either is written, even if threads use different variables. 10-100x slowdown. Prevent with (1) padding via alignas(64), (2) C++17 hardware_destructive_interference_size, (3) thread-local variables, (4) Java @Contended. Detect with perf c2c.
3. AoS vs SoA?
Answer: Array of Structures stores each record together; Structure of Arrays stores each field as its own array. SoA lets you read only the needed field (100% cache utilization) and enables SIMD/vectorization. 4-10x faster for field-wise workloads. Used by ECS game engines (Unity DOTS, Unreal Mass), analytics, ML.
4. Why does NUMA matter on multi-socket servers?
Answer: Each socket has its own memory bank. Local accesses are ~100ns, remote ~200ns. Bad placement costs 50% performance. Fix with numactl binding, numa_alloc_local, JVM -XX:+UseNUMA. Essential for PostgreSQL, MySQL, Redis tuning.
5. What is Mechanical Sympathy and why does LMAX Disruptor use it?
Answer: A term from Martin Thompson — understand how the machine works to write optimal code. LMAX Disruptor hits 6M tx/sec in Java via: ring buffer (cache-friendly array), single-writer principle (lock-free), padding (no false sharing), pre-allocation (no GC), immutable objects. Result: Java at C++ speeds.