Split View: CPU 캐시 & 메모리 계층 완전 가이드 2025: L1/L2/L3, Cache Line, False Sharing, 성능 최적화

✨ Learn with Quiz

CPU 캐시 & 메모리 계층 완전 가이드 2025: L1/L2/L3, Cache Line, False Sharing, 성능 최적화

TL;DR

메모리 계층 = 성능의 본질: L1 캐시 미스가 100배 느림
Cache Line (64 bytes): 메모리 접근의 단위
False Sharing: 멀티스레드 성능 살인자
Cache-friendly 자료구조: Array > LinkedList, SoA > AoS
Mechanical Sympathy: 하드웨어를 이해하면 100배 빠름

1. 메모리 계층

1.1 왜 계층이 필요한가?

이상: 모든 메모리가 빠르고 크고 싸면 좋겠지만...

현실의 트레이드오프:

빠름 = 작음 = 비쌈
느림 = 큼 = 쌈

→ 계층 구조로 균형.

1.2 메모리 계층 (2025)

단계	크기	지연시간	비고
레지스터	~32개 × 64bit	1 cycle	CPU 코어 안
L1 캐시	32-64 KB	4 cycles (~1ns)	코어당
L2 캐시	256 KB-1 MB	10 cycles (~3ns)	코어당 또는 공유
L3 캐시	4-128 MB	40 cycles (~12ns)	모든 코어 공유
RAM (DDR5)	16-512 GB	100 cycles (~30ns)	시스템 전체
NVMe SSD	1-100 TB	100 microseconds	영구 저장
HDD	1-20 TB	10 milliseconds	거의 안 씀

1.3 차이의 시각화

L1 캐시 = 1초라면:

L2 = 3초
L3 = 12초
RAM = 30초
SSD = 27시간
HDD = 115일

메모리 접근이 100배 느림. 디스크는 백만 배 느림.

1.4 의미

메모리 접근 = 비용
캐시 hit = 저렴
캐시 miss = 비쌈

좋은 코드 = 캐시 hit이 많은 코드.

2. CPU 캐시 작동 원리

2.1 캐시 라인

Cache Line = 캐시의 최소 단위. 보통 64 bytes.

RAM에서 1 byte 읽으면?
→ 64 bytes를 캐시 라인 단위로 가져옴

효과:

가까운 메모리가 함께 캐시에 옴
"지역성"의 자연스러운 활용

2.2 두 가지 지역성

시간적 지역성 (Temporal Locality):

방금 접근한 것을 곧 다시 접근
예: 루프 변수

공간적 지역성 (Spatial Locality):

가까운 메모리를 곧 접근
예: 배열 순회

좋은 코드는 둘 다 활용.

2.3 캐시 매핑

Direct Mapped: 각 메모리 주소가 한 캐시 라인에만. Set Associative: 여러 라인 중 선택 (8-way가 일반적). Fully Associative: 어디든 가능 (구현 어려움).

대부분의 현대 CPU는 8-way set associative.

2.4 캐시 교체

캐시가 가득 차면 무엇을 버릴까?

LRU (Least Recently Used): 가장 오래 안 쓴 것 Pseudo-LRU: LRU의 근사 (구현 쉬움) Random: 무작위 (놀랍게도 LRU와 비슷)

3. 캐시 미스의 종류

3.1 Compulsory Miss (Cold Miss)

처음 접근하는 데이터. 피할 수 없음.

최소화: Prefetching — 미리 가져오기.

// 컴파일러/CPU가 자동
for (int i = 0; i < n; i++) {
    process(arr[i]);
    // CPU가 arr[i+1], arr[i+2]를 미리 가져옴
}

3.2 Capacity Miss

캐시가 너무 작아서 발생.

해결:

작업 데이터 크기를 캐시에 맞춤
Cache blocking (Tiling)

// 블로킹 없이 (캐시 미스 많음)
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];

// 블로킹 (캐시에 맞춤)
for (int ii = 0; ii < n; ii += BLOCK)
    for (int jj = 0; jj < n; jj += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    for (int k = kk; k < kk + BLOCK; k++)
                        C[i][j] += A[i][k] * B[k][j];

10배+ 빨라질 수 있음.

3.3 Conflict Miss

여러 데이터가 같은 캐시 라인 슬롯에 매핑됨.

해결:

패딩 추가
데이터 정렬

4. 캐시 친화적 자료구조

4.1 Array vs LinkedList

// Array - 캐시 친화적
int arr[1000];
for (int i = 0; i < 1000; i++)
    sum += arr[i];  // 연속 메모리, prefetch 효과

// LinkedList - 캐시 적대적
struct Node {
    int value;
    struct Node* next;
};

Node* curr = head;
while (curr) {
    sum += curr->value;
    curr = curr->next;  // 어디 있을지 모름, cache miss
}

벤치마크: Array가 LinkedList보다 10-100배 빠름 (큰 데이터에서).

4.2 Array of Structures (AoS) vs Structure of Arrays (SoA)

AoS (전통적):

struct Point {
    float x, y, z;
    int color;
};

Point points[1000];

// x만 처리하고 싶을 때
for (int i = 0; i < 1000; i++)
    sum += points[i].x;
// 매번 16 bytes (Point 크기) 가져옴, 그 중 4 bytes만 사용

SoA (캐시 친화적):

struct Points {
    float x[1000];
    float y[1000];
    float z[1000];
    int color[1000];
};

// x만 처리
for (int i = 0; i < 1000; i++)
    sum += points.x[i];
// 4 bytes만 가져옴, 100% 활용

효과: SIMD까지 가능 → 4-10배 빠름.

사용: 게임 엔진 (ECS), 데이터 분석, 머신러닝.

4.3 Struct Padding과 정렬

struct Bad {
    char a;     // 1 byte + 7 padding
    double b;   // 8 bytes
    char c;     // 1 byte + 7 padding
};
// sizeof(Bad) = 24 bytes (16 wasted!)

struct Good {
    double b;   // 8 bytes
    char a;     // 1 byte
    char c;     // 1 byte + 6 padding
};
// sizeof(Good) = 16 bytes

규칙: 큰 필드를 먼저, 작은 필드를 나중에.

4.4 Hash Map의 캐시 친화성

Open Addressing (Robin Hood, Swiss Table):

모든 데이터가 한 배열에
캐시 친화적

Separate Chaining (전통적):

충돌 시 LinkedList
캐시 적대적

Google의 Swiss Table: SIMD로 16개 슬롯을 한 번에 검색.

5.1 문제

struct Counters {
    atomic<int> a;  // 4 bytes
    atomic<int> b;  // 4 bytes
};
// 같은 cache line (64 bytes)에 위치

시나리오:

Thread 1: a 수정
Thread 2: b 수정
둘은 독립적이지만, 같은 cache line

결과:

Thread 1이 a를 쓰면 cache line 무효화
Thread 2의 캐시도 무효화
Thread 2가 b 읽을 때 다시 가져옴
무한 ping-pong

성능 저하: 10-100배.

5.2 탐지

perf c2c record ./my_program
perf c2c report

perf c2c가 false sharing을 감지해줍니다.

5.3 해결

1. 패딩:

struct Counters {
    alignas(64) atomic<int> a;
    alignas(64) atomic<int> b;
};

2. C++17 hardware_destructive_interference_size:

#include <new>
struct Counters {
    alignas(std::hardware_destructive_interference_size) atomic<int> a;
    alignas(std::hardware_destructive_interference_size) atomic<int> b;
};

3. Thread-local:

thread_local int a;
thread_local int b;

각 스레드가 자체 메모리 → false sharing 없음.

5.4 LongAdder (Java)

AtomicLong보다 LongAdder가 빠른 이유:

// AtomicLong - false sharing 가능
AtomicLong counter = new AtomicLong();
counter.incrementAndGet();  // 모든 스레드가 같은 메모리

// LongAdder - false sharing 회피
LongAdder counter = new LongAdder();
counter.increment();  // 스레드별로 분산된 셀에

내부적으로 padding 사용. 16+ 스레드에서 100배+ 빠름.

6. MESI 프로토콜

6.1 캐시 일관성

멀티 코어에서 같은 데이터를 캐시하면? 일관성 문제.

Core 1: cache line X = 5 (수정)
Core 2: cache line X = 3 (옛 값)

해결: MESI 프로토콜.

6.2 4가지 상태

M (Modified): 수정됨, 이 코어만 가짐
E (Exclusive): 깨끗함, 이 코어만 가짐
S (Shared): 깨끗함, 여러 코어가 가짐
I (Invalid): 무효 (다른 코어가 수정)

6.3 상태 전이

[I] → 읽기 → [E] (혼자) 또는 [S] (공유)
[E] → 쓰기 → [M]
[S] → 쓰기 → [M] (다른 코어에 invalidate 신호)
[M] → 다른 코어 읽기 → [S] + writeback

6.4 Coherency Traffic

MOESI, MESIF 등 변형도 있음.

핵심: 캐시 무효화 트래픽이 성능에 영향. False sharing의 원인.

7. NUMA — Non-Uniform Memory Access

7.1 NUMA란?

큰 서버 (40+ 코어)에서:

CPU가 여러 소켓
각 소켓이 자체 메모리

[Socket 1]      [Socket 2]
   ↓               ↓
[Memory 1]    [Memory 2]
   ↓               ↓
   └─── Bus ───────┘

비대칭 접근:

로컬 메모리: 빠름 (~100ns)
원격 메모리: 느림 (~200ns)

7.2 영향

Thread on Socket 1 → Memory on Socket 1: 빠름
Thread on Socket 1 → Memory on Socket 2: 느림 (2배)

잘못된 배치 = 성능 50% 저하.

7.3 NUMA 인식 프로그래밍

Linux numactl:

# 메모리를 특정 노드에 할당
numactl --cpunodebind=0 --membind=0 ./my_program

# 정보 확인
numactl --hardware

프로그래밍:

#include <numa.h>

void* mem = numa_alloc_local(size);  // 로컬 노드에 할당

7.4 NUMA 사용 사례

Databases (PostgreSQL, MySQL): NUMA-aware 설정 필수
JVM: -XX:+UseNUMA
Redis: 단일 스레드라 큰 영향
Kubernetes: NUMA topology manager

8. Mechanical Sympathy

8.1 개념

Martin Thompson (LMAX, Disruptor 창시자)이 강조:

"기계가 어떻게 작동하는지 이해하면, 어떻게 최적의 코드를 작성할지 이해할 수 있다."

자동차 레이서가 엔진을 이해하듯, 프로그래머도 CPU를 이해해야 합니다.

8.2 LMAX Disruptor

LMAX: 금융 거래 시스템. 문제: 초당 600만 거래 처리. 해결: Mechanical sympathy를 극단까지.

Disruptor의 비결:

Ring buffer (배열 기반, 캐시 친화적)
False sharing 회피 (padding)
Single-writer 원칙 (락 없음)
Pre-allocation (GC 회피)

결과: Java로 C++ 수준 성능.

8.3 7가지 원칙

The Single Writer Principle: 한 스레드만 쓰기 → 락 없음
Write Combining: 여러 작은 쓰기를 모아서
Pad to avoid false sharing: 패딩으로 격리
Use immutable objects: 캐시 친화적
Avoid GC pauses: pre-allocate
Cache friendly data structures: Array, ring buffer
Mechanical sympathy: 하드웨어 이해

8.4 적용 예

// 잘못 - false sharing
class BadCounter {
    long count;
}

// 좋음 - padding
class GoodCounter {
    long p1, p2, p3, p4, p5, p6, p7;  // padding
    long count;
    long p9, p10, p11, p12, p13, p14, p15;  // padding
}

@Contended annotation (Java 8+):

@sun.misc.Contended
class Counter {
    long count;
}

9. 측정과 프로파일링

9.1 perf

Linux의 표준 도구:

# 캐시 미스 측정
perf stat -e cache-misses,cache-references ./my_program

# 결과
   1,234,567   cache-misses    #   12.345 % of all cache refs
  10,000,000   cache-references

9.2 Cache Miss Rate

perf stat -e L1-dcache-load-misses,L1-dcache-loads ./my_program

L1 miss rate가 5%+ 이면 최적화 여지.

9.3 Intel VTune

상용 도구 (무료 옵션 있음). 더 풍부한 분석:

메모리 접근 패턴
false sharing 감지
SIMD 활용도

9.4 cachegrind (Valgrind)

valgrind --tool=cachegrind ./my_program
cg_annotate cachegrind.out.PID

함수별 캐시 미스 보고.

9.5 Microbenchmark

Google Benchmark (C++):

static void BM_Array(benchmark::State& state) {
    int arr[1000];
    for (auto _ : state) {
        long sum = 0;
        for (int i = 0; i < 1000; i++)
            sum += arr[i];
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_Array);

JMH (Java): 동일한 기능.

10. 실전 최적화 사례

10.1 Matrix Multiplication

Naive:

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

문제: B[k][j]가 cache-unfriendly.

개선 1 — Loop swap:

for (i = 0; i < N; i++)
    for (k = 0; k < N; k++)
        for (j = 0; j < N; j++)
            C[i][j] += A[i][k] * B[k][j];

이제 B도 순차 접근.

개선 2 — Tiling:

for (ii = 0; ii < N; ii += T)
    for (kk = 0; kk < N; kk += T)
        for (jj = 0; jj < N; jj += T)
            for (i = ii; i < ii+T; i++)
                for (k = kk; k < kk+T; k++)
                    for (j = jj; j < jj+T; j++)
                        C[i][j] += A[i][k] * B[k][j];

개선 3 — SIMD + Threading: BLAS 라이브러리 사용.

결과: Naive 대비 100배+ 빠름.

10.2 Hash Table 비교

std::unordered_map (C++):

Separate chaining
캐시 적대적
일반 사용

Google dense_hash_map / Swiss Table:

Open addressing
Cache-friendly
SIMD 사용
2-5배 빠름

Robin Hood hashing: 또 다른 cache-friendly 변형.

10.3 게임 엔진 ECS

전통적 OOP:

class GameObject {
    Vector3 position;
    Vector3 velocity;
    Mesh mesh;
    Texture texture;
};

vector<GameObject> objects;

// 모든 객체의 position 업데이트
for (auto& obj : objects) {
    obj.position += obj.velocity * dt;
    // 매번 GameObject 전체를 cache에 가져옴
    // mesh, texture는 사용 안 하지만 캐시 차지
}

ECS (Entity Component System):

vector<Vector3> positions;
vector<Vector3> velocities;
vector<Mesh> meshes;
vector<Texture> textures;

// position 업데이트
for (int i = 0; i < entities.size(); i++) {
    positions[i] += velocities[i] * dt;
    // position과 velocity만 캐시에
    // 100% cache 활용
}

Unity DOTS, Unreal Mass: ECS 채택. 10배+ 빠름.

10.4 데이터베이스 컬럼 저장

Row-oriented (PostgreSQL, MySQL):

[id=1, name="Alice", age=30, country="KR"]
[id=2, name="Bob", age=25, country="US"]

Column-oriented (ClickHouse, Druid):

id:      [1, 2, 3, ...]
name:    ["Alice", "Bob", "Charlie", ...]
age:     [30, 25, 28, ...]
country: ["KR", "US", "JP", ...]

SELECT AVG(age):

Row: 모든 컬럼 읽음 (낭비)
Column: age만 읽음 (캐시 친화적)

→ ClickHouse가 분석에서 100배+ 빠름.

퀴즈

1. L1 캐시와 RAM의 접근 시간 차이는?

답: L1 캐시: ~1ns (4 cycles), RAM: ~30ns (100 cycles). 약 30배 차이. L1을 1초로 비유하면 RAM은 30초. 더 큰 차이: SSD는 100,000ns (100 microseconds) → L1을 1초로 비유하면 SSD는 27시간. 메모리 계층의 각 단계가 10배+ 느림. 이게 캐시 친화적 코드가 100배 빠른 이유. 단순히 최적화가 아니라 본질적 차이.

2. False Sharing이 무엇이고 어떻게 방지하나요?

답: 두 변수가 같은 cache line (64 bytes)에 있을 때, 한 스레드가 한 변수를 수정하면 다른 스레드의 캐시도 무효화됩니다. 두 변수가 독립적이어도 무한 ping-pong 발생, 성능 10-100배 저하. 방지: (1) Padding — alignas(64) 또는 7개 long 추가, (2) C++17 hardware_destructive_interference_size, (3) Thread-local 변수, (4) Java @Contended annotation. perf c2c로 탐지 가능.

3. AoS와 SoA의 차이는?

답: Array of Structures (AoS): Point[N] — 각 Point가 x, y, z, color를 함께. 전통적 OOP 방식. Structure of Arrays (SoA): x[N], y[N], z[N], color[N] — 같은 종류 데이터를 같이. SoA의 장점: (1) 한 필드만 처리할 때 캐시 100% 활용, (2) SIMD 가능, (3) 벡터화 친화적. 결과: 4-10배 빠름. 게임 엔진 ECS, 데이터 분석, 머신러닝이 SoA를 사용. Unity DOTS, Unreal Mass.

4. NUMA가 멀티 소켓 서버에서 왜 중요한가요?

답: 큰 서버 (40+ 코어)는 여러 CPU 소켓을 가지며, 각 소켓이 자체 메모리. 같은 소켓의 메모리는 빠르고 (~100ns), 다른 소켓은 느림 (~200ns). 잘못된 배치 = 성능 50% 저하. 해결: (1) numactl로 프로세스를 특정 노드에 바인딩, (2) numa_alloc_local()로 로컬 메모리 할당, (3) JVM -XX:+UseNUMA. PostgreSQL, MySQL, Redis 같은 DB는 NUMA-aware 설정 필수.

5. Mechanical Sympathy가 무엇이고 LMAX Disruptor의 비결은?

답: Martin Thompson이 강조한 개념 — "기계의 작동 방식을 이해하면 최적의 코드를 작성할 수 있다." LMAX Disruptor는 Java로 초당 600만 거래 처리. 비결: (1) Ring buffer — 배열 기반, 캐시 친화적, (2) Single-writer 원칙 — 한 스레드만 쓰기, 락 없음, (3) False sharing 회피 — padding, (4) Pre-allocation — GC pauses 회피, (5) 불변 객체 — 캐시 친화적. 결과: Java로 C++ 수준 성능. Disruptor 패턴은 모든 고성능 시스템의 본보기.

참고 자료

What Every Programmer Should Know About Memory — Ulrich Drepper
LMAX Disruptor — Martin Thompson
Mechanical Sympathy Blog
Cache-Oblivious Algorithms
Intel Optimization Manual
Agner Fog's Optimization Guides
perf
perf c2c — false sharing 감지
Google Benchmark
Cachegrind
Computer Systems: A Programmer's Perspective

CPU Cache & Memory Hierarchy Complete Guide 2025: L1/L2/L3, Cache Line, False Sharing, Performance Tuning

TL;DR

Memory hierarchy = the essence of performance: an L1 cache miss is 100x slower
Cache Line (64 bytes): the unit of memory access
False Sharing: the silent killer of multi-threaded performance
Cache-friendly structures: Array > LinkedList, SoA > AoS
Mechanical Sympathy: understand the hardware and code runs 100x faster

1. Memory Hierarchy

1.1 Why a hierarchy?

Ideal: all memory would be fast, big, and cheap.

Reality — tradeoffs:

Fast = small = expensive
Slow = large = cheap

A layered hierarchy balances them.

1.2 Hierarchy (2025)

Tier	Size	Latency	Notes
Register	~32 x 64-bit	1 cycle	Inside the CPU core
L1 cache	32-64 KB	4 cycles (~1ns)	Per core
L2 cache	256 KB-1 MB	10 cycles (~3ns)	Per core or shared
L3 cache (LLC)	4-128 MB	40 cycles (~12ns)	Shared across cores
RAM (DDR5)	16-512 GB	100 cycles (~30ns)	System-wide
NVMe SSD	1-100 TB	100 microseconds	Persistent
HDD	1-20 TB	10 milliseconds	Almost unused

1.3 Scale visualization

If L1 = 1 second:

L2 = 3 seconds
L3 = 12 seconds
RAM = 30 seconds
SSD = 27 hours
HDD = 115 days

Memory is ~100x slower than L1. Disk is ~1 million times slower.

1.4 Implication

Memory access = cost
Cache hit = cheap
Cache miss = expensive

Good code = code with many cache hits.

2. How CPU Caches Work

2.1 Cache Line

A cache line is the minimum unit of the cache, typically 64 bytes.

Read 1 byte from RAM?
-> The CPU fetches a full 64-byte cache line

Effect:

Neighboring memory arrives together
Natural exploitation of locality

2.2 Two kinds of locality

Temporal locality: something just accessed is likely accessed again (loop counters). Spatial locality: nearby memory is likely accessed soon (array traversal).

Good code exploits both.

2.3 Cache mapping

Direct Mapped: each address maps to exactly one line.
Set Associative: pick among N lines (8-way is common).
Fully Associative: anywhere (hard to implement).

Most modern CPUs are 8-way set associative.

2.4 Replacement

LRU: least recently used
Pseudo-LRU: cheap approximation
Random: surprisingly close to LRU in practice

3. Cache Miss Types

3.1 Compulsory (Cold) Miss

First access. Unavoidable.

Mitigate: prefetching.

// Compiler / CPU does this automatically
for (int i = 0; i < n; i++) {
    process(arr[i]);
    // CPU prefetches arr[i+1], arr[i+2]
}

3.2 Capacity Miss

Cache is too small for the working set.

Fix: fit the working set, use cache blocking (tiling).

// Without blocking (many misses)
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];

// With blocking (fits in cache)
for (int ii = 0; ii < n; ii += BLOCK)
    for (int jj = 0; jj < n; jj += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    for (int k = kk; k < kk + BLOCK; k++)
                        C[i][j] += A[i][k] * B[k][j];

Can be 10x+ faster.

3.3 Conflict Miss

Multiple items map to the same cache set slot. Fix with padding or alignment.

4. Cache-Friendly Data Structures

4.1 Array vs LinkedList

// Array - cache friendly
int arr[1000];
for (int i = 0; i < 1000; i++)
    sum += arr[i];  // contiguous, prefetcher wins

// LinkedList - cache hostile
struct Node {
    int value;
    struct Node* next;
};

Node* curr = head;
while (curr) {
    sum += curr->value;
    curr = curr->next;  // unpredictable, cache miss
}

Benchmark: Array is 10-100x faster than LinkedList on large datasets.

4.2 AoS vs SoA

AoS (traditional):

struct Point {
    float x, y, z;
    int color;
};

Point points[1000];

for (int i = 0; i < 1000; i++)
    sum += points[i].x;
// Fetches 16 bytes each iter, uses only 4

SoA (cache friendly):

struct Points {
    float x[1000];
    float y[1000];
    float z[1000];
    int color[1000];
};

for (int i = 0; i < 1000; i++)
    sum += points.x[i];
// 4 bytes fetched, 100% utilized, SIMD-ready

Result: 4-10x speedup. Used in game engines (ECS), analytics, ML.

4.3 Struct padding and alignment

struct Bad {
    char a;     // 1 byte + 7 padding
    double b;   // 8 bytes
    char c;     // 1 byte + 7 padding
};
// sizeof(Bad) = 24 bytes (16 wasted!)

struct Good {
    double b;   // 8 bytes
    char a;     // 1 byte
    char c;     // 1 byte + 6 padding
};
// sizeof(Good) = 16 bytes

Rule: biggest fields first.

4.4 Hash map cache behavior

Open addressing (Robin Hood, Swiss Table): data in one array, cache friendly.
Separate chaining (traditional): per-bucket linked lists, cache hostile.

Google's Swiss Table searches 16 slots at a time with SIMD.

5.1 The problem

struct Counters {
    atomic<int> a;  // 4 bytes
    atomic<int> b;  // 4 bytes
};
// Both sit on the same 64-byte cache line

Thread 1 writes a, Thread 2 writes b. Independent writes, but same cache line, so each write invalidates the other's copy. Infinite ping-pong.

Slowdown: 10-100x.

5.2 Detection

perf c2c record ./my_program
perf c2c report

5.3 Fixes

Padding:

struct Counters {
    alignas(64) atomic<int> a;
    alignas(64) atomic<int> b;
};

C++17 hardware_destructive_interference_size:

#include <new>
struct Counters {
    alignas(std::hardware_destructive_interference_size) atomic<int> a;
    alignas(std::hardware_destructive_interference_size) atomic<int> b;
};

Thread-local:

thread_local int a;
thread_local int b;

5.4 LongAdder (Java)

// AtomicLong - false sharing possible
AtomicLong counter = new AtomicLong();
counter.incrementAndGet();

// LongAdder - avoids false sharing by sharding
LongAdder counter = new LongAdder();
counter.increment();

Internally padded. 100x+ faster at 16+ threads.

6. MESI Protocol

6.1 Cache coherence

Multi-core caches need coherence. MESI tracks four states per line:

M (Modified): dirty, only this core
E (Exclusive): clean, only this core
S (Shared): clean, multiple cores
I (Invalid): stale

6.2 Transitions

[I] -> read  -> [E] (alone) or [S] (shared)
[E] -> write -> [M]
[S] -> write -> [M] (invalidate others)
[M] -> other read -> [S] + writeback

Variants: MOESI, MESIF. Coherence traffic is what makes false sharing expensive.

7. NUMA — Non-Uniform Memory Access

7.1 What is NUMA?

Large servers (40+ cores) have multiple sockets, each with its own memory bank.

[Socket 1]      [Socket 2]
   |               |
[Memory 1]    [Memory 2]
   |               |
   +--- Bus -------+

Local memory: ~100ns
Remote memory: ~200ns

7.2 Impact

Wrong placement = 50% slowdown.

7.3 NUMA-aware programming

numactl --cpunodebind=0 --membind=0 ./my_program
numactl --hardware

#include <numa.h>
void* mem = numa_alloc_local(size);

7.4 Use cases

Databases (PostgreSQL, MySQL): NUMA-aware tuning is mandatory
JVM: -XX:+UseNUMA
Redis: single-threaded, huge NUMA sensitivity
Kubernetes: NUMA topology manager

8. Mechanical Sympathy

8.1 The concept

Martin Thompson (LMAX Disruptor) popularized this idea:

"If you understand how the machine works, you can write optimal code."

Like a race-car driver who understands engines, programmers should understand CPUs.

8.2 LMAX Disruptor

Problem: 6 million financial transactions per second.
Solution: mechanical sympathy to the extreme.
Tools: ring buffer (array-based), false-sharing avoidance (padding), single-writer principle (lock-free), pre-allocation (no GC).
Result: Java at C++ speeds.

8.3 Seven principles

Single Writer Principle — no locks
Write combining
Pad to avoid false sharing
Immutable objects
Avoid GC pauses (pre-allocate)
Cache-friendly data structures
Understand the hardware

8.4 Example

// Bad - false sharing
class BadCounter {
    long count;
}

// Good - padding
class GoodCounter {
    long p1, p2, p3, p4, p5, p6, p7;
    long count;
    long p9, p10, p11, p12, p13, p14, p15;
}

@Contended annotation (Java 8+):

@sun.misc.Contended
class Counter {
    long count;
}

9. Measurement and Profiling

9.1 perf

perf stat -e cache-misses,cache-references ./my_program

9.2 L1 miss rate

perf stat -e L1-dcache-load-misses,L1-dcache-loads ./my_program

L1 miss rate above 5% usually means there's room to optimize.

9.3 Intel VTune

Commercial (free tier available). Memory access patterns, false sharing, SIMD utilization.

9.4 cachegrind (Valgrind)

valgrind --tool=cachegrind ./my_program
cg_annotate cachegrind.out.PID

Per-function cache miss reports.

9.5 Microbenchmarks

Google Benchmark (C++):

static void BM_Array(benchmark::State& state) {
    int arr[1000];
    for (auto _ : state) {
        long sum = 0;
        for (int i = 0; i < 1000; i++)
            sum += arr[i];
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_Array);

JMH (Java) does the same for JVM code.

10. Real-World Optimization

10.1 Matrix multiplication

Naive:

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[k][j];

B[k][j] is cache-hostile.

Loop swap:

for (i = 0; i < N; i++)
    for (k = 0; k < N; k++)
        for (j = 0; j < N; j++)
            C[i][j] += A[i][k] * B[k][j];

Tiling:

for (ii = 0; ii < N; ii += T)
    for (kk = 0; kk < N; kk += T)
        for (jj = 0; jj < N; jj += T)
            for (i = ii; i < ii+T; i++)
                for (k = kk; k < kk+T; k++)
                    for (j = jj; j < jj+T; j++)
                        C[i][j] += A[i][k] * B[k][j];

Add SIMD + threading (BLAS): 100x+ over naive.

10.2 Hash tables

std::unordered_map: separate chaining, cache hostile.
Swiss Table / dense_hash_map: open addressing + SIMD, 2-5x faster.
Robin Hood hashing: another cache-friendly variant.

10.3 Game engine ECS

Traditional OOP drags entire objects through cache. ECS stores components as parallel arrays, using only what the system needs. Unity DOTS, Unreal Mass: 10x+ speedups.

10.4 Columnar databases

Row stores (PostgreSQL, MySQL): read full rows even for one column.
Column stores (ClickHouse, Druid): read only the needed column.

For SELECT AVG(age), columnar is 100x+ faster.

Quiz

1. How much slower is RAM than L1 cache?

Answer: L1 is ~1ns (4 cycles), RAM is ~30ns (100 cycles). About 30x. If L1 = 1 second, RAM = 30 seconds, SSD = 27 hours. Every tier of the hierarchy is ~10x slower than the one above. This is why cache-friendly code is 100x faster — it isn't "optimization", it's the fundamental difference between tiers.

2. What is false sharing and how do you prevent it?

Answer: Two variables on the same 64-byte cache line ping-pong between cores whenever either is written, even if threads use different variables. 10-100x slowdown. Prevent with (1) padding via alignas(64), (2) C++17 hardware_destructive_interference_size, (3) thread-local variables, (4) Java @Contended. Detect with perf c2c.

3. AoS vs SoA?

Answer: Array of Structures stores each record together; Structure of Arrays stores each field as its own array. SoA lets you read only the needed field (100% cache utilization) and enables SIMD/vectorization. 4-10x faster for field-wise workloads. Used by ECS game engines (Unity DOTS, Unreal Mass), analytics, ML.

4. Why does NUMA matter on multi-socket servers?

Answer: Each socket has its own memory bank. Local accesses are ~100ns, remote ~200ns. Bad placement costs 50% performance. Fix with numactl binding, numa_alloc_local, JVM -XX:+UseNUMA. Essential for PostgreSQL, MySQL, Redis tuning.

5. What is Mechanical Sympathy and why does LMAX Disruptor use it?

Answer: A term from Martin Thompson — understand how the machine works to write optimal code. LMAX Disruptor hits 6M tx/sec in Java via: ring buffer (cache-friendly array), single-writer principle (lock-free), padding (no false sharing), pre-allocation (no GC), immutable objects. Result: Java at C++ speeds.

References

What Every Programmer Should Know About Memory — Ulrich Drepper
LMAX Disruptor — Martin Thompson
Mechanical Sympathy Blog
Cache-Oblivious Algorithms
Intel Optimization Manual
Agner Fog's Optimization Guides
perf
perf c2c
Google Benchmark
Cachegrind
Computer Systems: A Programmer's Perspective

CPU 캐시 & 메모리 계층 완전 가이드 2025: L1/L2/L3, Cache Line, False Sharing, 성능 최적화

TL;DR

1. 메모리 계층

1.1 왜 계층이 필요한가?

1.2 메모리 계층 (2025)

1.3 차이의 시각화

1.4 의미

2. CPU 캐시 작동 원리

2.1 캐시 라인

2.2 두 가지 지역성

2.3 캐시 매핑

2.4 캐시 교체

3. 캐시 미스의 종류

3.1 Compulsory Miss (Cold Miss)

3.2 Capacity Miss

3.3 Conflict Miss

4. 캐시 친화적 자료구조

4.1 Array vs LinkedList

4.2 Array of Structures (AoS) vs Structure of Arrays (SoA)

4.3 Struct Padding과 정렬

4.4 Hash Map의 캐시 친화성

5. False Sharing

5.1 문제

5.2 탐지

5.3 해결

5.4 LongAdder (Java)

6. MESI 프로토콜

6.1 캐시 일관성

6.2 4가지 상태

6.3 상태 전이

6.4 Coherency Traffic

7. NUMA — Non-Uniform Memory Access

7.1 NUMA란?

7.2 영향

7.3 NUMA 인식 프로그래밍

7.4 NUMA 사용 사례

8. Mechanical Sympathy

8.1 개념

8.2 LMAX Disruptor

8.3 7가지 원칙

8.4 적용 예

9. 측정과 프로파일링

9.1 perf

9.2 Cache Miss Rate

9.3 Intel VTune

9.4 cachegrind (Valgrind)

9.5 Microbenchmark

10. 실전 최적화 사례

10.1 Matrix Multiplication

10.2 Hash Table 비교

10.3 게임 엔진 ECS

10.4 데이터베이스 컬럼 저장

퀴즈

참고 자료

CPU Cache & Memory Hierarchy Complete Guide 2025: L1/L2/L3, Cache Line, False Sharing, Performance Tuning

TL;DR

1. Memory Hierarchy

1.1 Why a hierarchy?

1.2 Hierarchy (2025)

1.3 Scale visualization

1.4 Implication

2. How CPU Caches Work

2.1 Cache Line

2.2 Two kinds of locality

2.3 Cache mapping

2.4 Replacement

3. Cache Miss Types

3.1 Compulsory (Cold) Miss

3.2 Capacity Miss

3.3 Conflict Miss

4. Cache-Friendly Data Structures

4.1 Array vs LinkedList

4.2 AoS vs SoA

4.3 Struct padding and alignment

4.4 Hash map cache behavior

5. False Sharing

5.1 The problem

5.2 Detection

5.3 Fixes

5.4 LongAdder (Java)