데이터베이스 내부 구조 완전 가이드 2025: 스토리지 엔진, B-Tree, LSM-Tree, MVCC, WAL
목차
1. 들어가며: 왜 데이터베이스 내부 구조를 알아야 하는가
데이터베이스는 현대 소프트웨어 시스템의 심장이다. 단순히 SQL 쿼리를 작성하는 것을 넘어서, 내부에서 데이터가 어떻게 저장되고 검색되는지 이해하면 성능 문제를 근본적으로 해결할 수 있다.
이 가이드에서 다루는 핵심 주제:
- 스토리지 엔진의 두 가지 패러다임: 페이지 지향(B-Tree) vs 로그 구조(LSM-Tree)
- Buffer Pool과 캐시 전략
- WAL(Write-Ahead Log)과 장애 복구
- MVCC와 트랜잭션 격리 수준
- 락 메커니즘과 교착 상태(Deadlock)
- 쿼리 옵티마이저의 동작 원리
- InnoDB, RocksDB 내부 구조
데이터베이스 내부 구조 전체 아키텍처
============================================
Client Query
|
v
[SQL Parser] --> [Query Optimizer] --> [Execution Engine]
| |
v v
[Buffer Pool / Cache] [Transaction Manager]
| |
v v
[Storage Engine] [Lock Manager + MVCC]
|
+---> [B-Tree Index] (페이지 지향)
+---> [LSM-Tree] (로그 구조)
|
v
[WAL (Write-Ahead Log)]
|
v
[Disk (데이터 파일 + 로그 파일)]
2. 스토리지 엔진 개요: 페이지 지향 vs 로그 구조
스토리지 엔진은 데이터를 디스크에 저장하고 읽는 핵심 컴포넌트다. 크게 두 가지 패러다임이 존재한다.
2.1 페이지 지향 스토리지 (Page-Oriented)
전통적인 RDBMS가 채택하는 방식이다. 고정 크기 페이지(보통 4KB~16KB) 단위로 데이터를 관리한다.
페이지 지향 스토리지 구조
=========================
[Page 0: File Header]
[Page 1: B-Tree Root]
[Page 2: B-Tree Internal]
[Page 3: B-Tree Leaf (data)]
[Page 4: B-Tree Leaf (data)]
[Page 5: Free Space Map]
...
[Page N: B-Tree Leaf (data)]
각 페이지 내부 구조:
+----------------------------------+
| Page Header (LSN, checksum, ...) |
| Slot Array (오프셋 포인터) |
| ...free space... |
| Tuple 3 | Tuple 2 | Tuple 1 |
+----------------------------------+
특징:
- 읽기 최적화: 인덱스를 통한 빠른 포인트 쿼리
- In-place 업데이트: 데이터가 저장된 위치에서 직접 수정
- 대표 구현: MySQL InnoDB, PostgreSQL, SQL Server, Oracle
2.2 로그 구조 스토리지 (Log-Structured)
모든 쓰기를 순차적 로그로 기록하는 방식이다. 쓰기 성능에 최적화되어 있다.
로그 구조 스토리지 구조
=======================
Write Path:
[Client Write] --> [MemTable (in-memory, sorted)]
|
(flush when full)
|
v
[SSTable L0] (immutable, sorted)
[SSTable L0]
|
(compaction)
|
v
[SSTable L1] (merged, sorted)
|
(compaction)
|
v
[SSTable L2] (larger, sorted)
특징:
- 쓰기 최적화: 순차 쓰기만 수행하여 높은 쓰기 처리량
- 읽기 시 다수의 SSTable을 확인해야 할 수 있음
- 대표 구현: RocksDB, LevelDB, Cassandra, HBase
2.3 비교 요약
| 특성 | 페이지 지향 (B-Tree) | 로그 구조 (LSM-Tree) |
|---|---|---|
| 쓰기 성능 | 보통 (랜덤 I/O) | 우수 (순차 I/O) |
| 읽기 성능 | 우수 (인덱스 직접 접근) | 보통 (여러 레벨 탐색) |
| 공간 효율성 | 보통 (페이지 내 단편화) | 우수 (컴팩션으로 정리) |
| 쓰기 증폭 | 낮음 | 높음 (컴팩션) |
| 읽기 증폭 | 낮음 | 높음 (여러 SSTable) |
| 대표 DB | MySQL, PostgreSQL | RocksDB, Cassandra |
3. B-Tree 심층 분석
B-Tree는 1972년 Rudolf Bayer와 Edward McCreight가 발명한 자기 균형 트리 자료구조다. 거의 모든 RDBMS 인덱스의 기반이 된다.
3.1 B-Tree 기본 구조
B-Tree 구조 (order = 4)
========================
[ 30 | 60 ] <-- Root Node
/ | \
v v v
[10|20] [40|50] [70|80|90] <-- Internal Nodes
/ | \ / | \ / | | \
v v v v v v v v v v
[1] [15] [25] [35] [45] [55] [65] [75] [85] [95] <-- Leaf Nodes
5 17 28 38 48 58 68 78 88 98
8 19 39 59 79 99
B-Tree 속성 (order = m):
- 루트는 최소 2개의 자식을 가짐
- 내부 노드는 ceil(m/2) ~ m개의 자식을 가짐
- 모든 리프 노드는 같은 깊이에 위치
- 검색/삽입/삭제: O(log N)
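위 속성에 따른 검색 과정을 코드로 스케치하면 다음과 같다. 노드를 (keys, children) 튜플로 표현하는 것은 설명을 위한 가정이며, 실제 DBMS의 페이지 기반 노드 구조를 크게 단순화한 예시다.

```python
# B-Tree 검색 스케치: 각 노드는 (keys, children) 튜플, 리프는 children=None 으로 가정.
import bisect

def btree_search(node, key):
    keys, children = node
    i = bisect.bisect_left(keys, key)      # 노드 내부는 정렬되어 있으므로 이진 탐색
    if i < len(keys) and keys[i] == key:
        return True                         # 내부 노드에서도 키를 찾을 수 있음 (B-Tree)
    if children is None:
        return False                        # 리프까지 내려왔지만 키 없음
    return btree_search(children[i], key)   # i번째 자식으로 하강

# 본문 그림의 루트 [30|60]을 축소한 예시 트리
leaf1 = ([1, 15, 25], None)
leaf2 = ([40, 50], None)
leaf3 = ([70, 80, 90], None)
root = ([30, 60], [leaf1, leaf2, leaf3])
```

트리 높이가 O(log N)이고 각 노드에서의 탐색이 O(log m)이므로 전체 비용은 O(log N)이 된다.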
3.2 노드 분할 (Node Splitting)
노드가 가득 차면 분할이 발생한다.
노드 분할 과정 (order = 4, 키 25 삽입)
========================================
Before:
Parent: [20 | 40]
Child: [21 | 23 | 24] (full!)
Step 1: 25를 삽입하면 overflow
Temp: [21 | 23 | 24 | 25]
Step 2: 중간값(23)을 부모로 올림
Parent: [20 | 23 | 40]
Left: [21]
Right: [24 | 25]
After:
[20 | 23 | 40]
/ | | \
... [21] [24|25] ...
3.3 B+Tree vs B-Tree
대부분의 데이터베이스는 순수 B-Tree가 아닌 B+Tree를 사용한다.
B+Tree 구조
============
Internal Nodes: 키만 저장 (데이터 없음)
[30 | 60]
/ | \
v v v
Leaf: [10|20|30] --> [40|50|60] --> [70|80|90]
data data data
핵심 차이:
- B-Tree: 모든 노드에 데이터 저장
- B+Tree: 리프 노드에만 데이터 저장
- B+Tree: 리프 노드가 연결 리스트로 연결 (범위 스캔 최적화)
B+Tree의 장점:
- 내부 노드에 더 많은 키를 저장 가능 (fanout 증가)
- 범위 쿼리 시 리프 노드를 순차적으로 순회
- 더 일관된 검색 성능 (항상 리프까지 내려감)
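리프 연결 리스트가 범위 스캔을 어떻게 돕는지 간단히 스케치하면 다음과 같다. Leaf 클래스와 range_scan 함수는 설명용 가정이며, 실제로는 트리 하강으로 시작 리프를 먼저 찾는다.

```python
# B+Tree 리프 체인을 따라가는 범위 스캔 스케치 (단순화한 예시)
class Leaf:
    def __init__(self, keys):
        self.keys = keys      # 리프 내부의 정렬된 키들
        self.next = None      # 오른쪽 이웃 리프로의 링크

def range_scan(first_leaf, lo, hi):
    """lo <= key <= hi 인 키를 리프 체인을 따라 수집한다."""
    result, leaf = [], first_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return result          # 정렬되어 있으므로 조기 종료
            if k >= lo:
                result.append(k)
        leaf = leaf.next               # 루트 재탐색 없이 다음 리프로 이동
    return result

a, b, c = Leaf([10, 20, 30]), Leaf([40, 50, 60]), Leaf([70, 80, 90])
a.next, b.next = b, c
```

리프 간 링크가 없다면 범위의 다음 키마다 루트부터 다시 내려가야 하므로, 이 체인이 범위 쿼리 성능의 핵심이다.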
3.4 클러스터드 vs 논클러스터드 인덱스
클러스터드 인덱스 (InnoDB Primary Key)
======================================
B+Tree 리프 = 실제 테이블 데이터
[PK: 1] --> {id:1, name:'Alice', age:30}
[PK: 2] --> {id:2, name:'Bob', age:25}
[PK: 3] --> {id:3, name:'Carol', age:35}
논클러스터드 인덱스 (Secondary Index)
======================================
B+Tree 리프 = 프라이머리 키 참조
[name:'Alice'] --> PK: 1 (다시 클러스터드 인덱스 조회 필요!)
[name:'Bob'] --> PK: 2
[name:'Carol'] --> PK: 3
클러스터드 인덱스 특징:
- 테이블당 하나만 존재 (보통 PK)
- 데이터가 PK 순서대로 물리적으로 정렬
- 범위 쿼리에 매우 효율적
논클러스터드 인덱스의 이중 조회 문제:
- Secondary Index 조회 후 클러스터드 인덱스를 다시 조회해야 함
- 이를 "bookmark lookup" 또는 "table lookup"이라 부름
- Covering Index로 이 문제를 회피할 수 있음
4. LSM-Tree 심층 분석
LSM-Tree(Log-Structured Merge Tree)는 쓰기 집약적 워크로드에 최적화된 자료구조다.
4.1 LSM-Tree 아키텍처
LSM-Tree 전체 구조
==================
[Write Request]
|
v
[WAL (디스크)] <--- 장애 복구용 선행 기록
|
v
[MemTable (메모리)] <--- Red-Black Tree 또는 Skip List
|
(MemTable이 임계값 도달)
|
v
[Immutable MemTable]
|
(디스크로 flush)
|
v
[Level 0 SSTables] <--- 키 범위 겹칠 수 있음
|
(Compaction)
|
v
[Level 1 SSTables] <--- 키 범위 겹치지 않음
|
(Compaction)
|
v
[Level 2 SSTables] <--- 더 큰 파일들
|
...
4.2 MemTable
MemTable은 메모리에 상주하는 정렬된 자료구조다.
// MemTable 구현 예시 (Skip List 기반)
// 주의: byte[]는 Comparable이 아니므로 바이트 사전순 비교자를 지정해야 한다 (Java 9+의 Arrays::compare).
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

class MemTable {
    private final ConcurrentSkipListMap<byte[], byte[]> data =
            new ConcurrentSkipListMap<>(Arrays::compare);
    private final AtomicLong approximateSize = new AtomicLong();
    private static final long FLUSH_THRESHOLD = 64L * 1024 * 1024; // 64MB

    public void put(byte[] key, byte[] value) {
        data.put(key, value);
        approximateSize.addAndGet(key.length + value.length);
    }

    public byte[] get(byte[] key) {
        return data.get(key);
    }

    public boolean shouldFlush() {
        return approximateSize.get() >= FLUSH_THRESHOLD;
    }

    // MemTable -> SSTable 변환
    public SSTable flush(String sstablePath) {
        SSTableBuilder builder = new SSTableBuilder(sstablePath);
        for (Map.Entry<byte[], byte[]> entry : data.entrySet()) {
            builder.add(entry.getKey(), entry.getValue());
        }
        return builder.build(); // 정렬된 상태로 디스크에 기록
    }
}
4.3 SSTable (Sorted String Table)
SSTable 내부 구조
=================
+-------------------------------------------+
| Data Block 0 |
| key1:value1 | key2:value2 | key3:value3 |
+-------------------------------------------+
| Data Block 1 |
| key4:value4 | key5:value5 | key6:value6 |
+-------------------------------------------+
| ... |
+-------------------------------------------+
| Meta Block (Bloom Filter) |
+-------------------------------------------+
| Index Block |
| block0_last_key -> offset0 |
| block1_last_key -> offset1 |
+-------------------------------------------+
| Footer |
| index_offset | meta_offset | magic_num |
+-------------------------------------------+
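위 구조에서 Index Block이 어떻게 쓰이는지 스케치하면 다음과 같다. 파일 I/O는 생략하고, (block_last_key, offset) 목록만으로 어떤 Data Block을 읽을지 결정하는 부분만 보인 가정 기반 예시다.

```python
# SSTable Index Block 조회 스케치: 각 Data Block의 마지막 키로 이진 탐색한다.
import bisect

def find_block(index_entries, key):
    """index_entries: (last_key, offset) 목록(last_key 오름차순).
    key가 속할 수 있는 블록의 offset을 반환, 모든 블록 범위 밖이면 None."""
    last_keys = [k for k, _ in index_entries]
    i = bisect.bisect_left(last_keys, key)   # key <= last_key 인 첫 블록
    if i == len(index_entries):
        return None                          # 모든 블록의 last_key보다 큼
    return index_entries[i][1]

# 본문 그림의 구조를 가정한 예시 인덱스
index = [("key3", 0), ("key6", 4096), ("key9", 8192)]
```

실제 구현은 이 offset의 블록 하나만 디스크에서 읽어 블록 내부를 다시 탐색하므로, 키 하나를 찾는 데 필요한 I/O가 블록 단위로 제한된다.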
4.4 컴팩션 전략
Size-Tiered Compaction (STCS)
Size-Tiered Compaction
======================
크기가 비슷한 SSTable들을 합병
Level 0: [T1] [T2] [T3] [T4] (각 64MB)
|
(4개가 쌓이면 합병)
|
v
Level 1: [ T1+T2+T3+T4 ] (약 256MB)
장점: 쓰기 증폭이 낮음
단점: 공간 증폭이 높음 (일시적으로 2배 공간 필요)
읽기 시 여러 SSTable 확인 필요
Leveled Compaction (LCS)
Leveled Compaction
==================
각 레벨은 키 범위가 겹치지 않는 SSTable로 구성
레벨 N+1의 크기는 레벨 N의 10배
L0: [a-z] [a-z] [a-z] (키 범위 겹침 허용, 각 64MB)
|
L1: [a-e] [f-k] [l-p] [q-z] (키 범위 겹침 없음, 총 640MB)
|
L2: [a-b] [c-d] ... [y-z] (키 범위 겹침 없음, 총 6.4GB)
장점: 읽기 증폭 낮음, 공간 증폭 낮음
단점: 쓰기 증폭 높음 (최대 10x per level)
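위의 "레벨 N+1은 레벨 N의 10배" 규칙을 그대로 수식화하면 레벨별 목표 크기와 대략적인 쓰기 증폭을 어림할 수 있다. 아래는 그 계산만 보이는 단순화된 스케치다 (base_mb 등 수치는 본문 예시에서 가져온 가정값).

```python
# Leveled Compaction: 레벨별 목표 크기와 대략적 쓰기 증폭 추정 스케치
def level_target_sizes(base_mb=640, fanout=10, levels=3):
    """L1부터 시작하는 각 레벨의 목표 크기(MB) 목록."""
    return [base_mb * fanout ** i for i in range(levels)]

def rough_write_amp(levels, fanout=10):
    """레벨 이동마다 최대 fanout배 재작성된다고 보는 단순 추정치."""
    return fanout * levels
```

본문 4.6절의 "5개 레벨이면 약 50x"는 이 단순 추정(`rough_write_amp(5) == 50`)과 일치한다.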
4.5 Bloom Filter
Bloom Filter는 SSTable에 특정 키가 존재하는지 빠르게 판단하는 확률적 자료구조다.
# Bloom Filter 동작 원리
# (파이썬 내장 hash()는 시드 인자를 받지 않으므로 (key, seed) 튜플 해시로 대체)
class BloomFilter:
    def __init__(self, size, num_hash_functions):
        self.bit_array = [0] * size
        self.size = size
        self.num_hash = num_hash_functions

    def _index(self, key, seed):
        return hash((key, seed)) % self.size

    def add(self, key):
        for i in range(self.num_hash):
            self.bit_array[self._index(key, i)] = 1

    def might_contain(self, key):
        """
        True  -> 키가 존재할 수 있음 (false positive 가능)
        False -> 키가 확실히 없음 (false negative 없음)
        """
        for i in range(self.num_hash):
            if self.bit_array[self._index(key, i)] == 0:
                return False
        return True
# 사용 예시: SSTable 읽기 최적화
def get(key):
    # 1. MemTable 확인
    result = memtable.get(key)
    if result is not None:
        return result
    # 2. 각 SSTable의 Bloom Filter로 빠르게 필터링
    #    (최신 SSTable부터 확인해야 같은 키의 최신 버전을 반환한다)
    for sstable in sorted_sstables:  # 최신 순으로 정렬되어 있다고 가정
        if sstable.bloom_filter.might_contain(key):
            result = sstable.search(key)  # 실제 디스크 I/O
            if result is not None:
                return result
    return None
4.6 쓰기 증폭 (Write Amplification)
쓰기 증폭 분석
==============
원본 데이터: 1 byte
B-Tree 쓰기 증폭:
WAL 기록(1) + 페이지 쓰기(1) = 약 2x
(실제로는 작은 변경에도 페이지 전체(예: 16KB)를 다시 써야 하므로 증폭이 훨씬 클 수 있음)
LSM-Tree 쓰기 증폭 (Leveled Compaction):
WAL 기록(1) + MemTable flush(1) + L0->L1(10) + L1->L2(10) + ...
= 약 10 * level_count
예: 5개 레벨이면 약 50x 쓰기 증폭!
5. Buffer Pool 관리
Buffer Pool은 디스크 페이지를 메모리에 캐싱하는 핵심 컴포넌트다.
5.1 Buffer Pool 아키텍처
Buffer Pool 구조
================
[Hash Table: page_id -> frame_id]
|
v
+------+------+------+------+------+
|Frame0|Frame1|Frame2|Frame3|Frame4| <-- Buffer Pool (메모리)
|Page5 |Page12|Page3 |empty |Page8 |
|dirty |clean |dirty | |clean |
+------+------+------+------+------+
[Free List]: [Frame3]
[LRU List]: Frame4 <-> Frame1 <-> Frame0 <-> Frame2
(least) (most)
Page Table (매핑):
Page 5 -> Frame 0 (dirty, pin_count=2)
Page 12 -> Frame 1 (clean, pin_count=0)
Page 3 -> Frame 2 (dirty, pin_count=1)
Page 8 -> Frame 4 (clean, pin_count=0)
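위 구조에서 페이지 조회가 어떻게 흐르는지(히트 확인 → 미스 시 victim 교체 → 디스크 읽기)를 스케치하면 다음과 같다. 디스크를 dict로 대체하고 dirty page write-back과 pin을 생략한 가정 기반 예시다.

```python
# Buffer Pool 페이지 조회 흐름 스케치 (LRU 교체, clean page 가정)
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                     # page_id -> 페이지 내용 (디스크 대역)
        self.frames = OrderedDict()          # page_id -> 내용, LRU 순서 유지
        self.hits = self.misses = 0

    def fetch_page(self, page_id):
        if page_id in self.frames:           # 1) 페이지 테이블 히트
            self.hits += 1
            self.frames.move_to_end(page_id) # 최근 사용으로 갱신
            return self.frames[page_id]
        self.misses += 1
        if len(self.frames) >= self.capacity:
            self.frames.popitem(last=False)  # 2) LRU victim evict
        data = self.disk[page_id]            # 3) 디스크에서 읽기 (실제로는 I/O)
        self.frames[page_id] = data
        return data

pool = BufferPool(2, disk={1: "p1", 2: "p2", 3: "p3"})
```

실제 InnoDB 등은 pin_count가 0인 프레임만 evict 대상으로 삼고, dirty 프레임은 먼저 디스크에 기록한 뒤 교체한다.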
5.2 페이지 교체 알고리즘
LRU (Least Recently Used)
from collections import OrderedDict

class LRUReplacer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def access(self, frame_id):
        if frame_id in self.cache:
            self.cache.move_to_end(frame_id)
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # 가장 오래된 것 제거
            self.cache[frame_id] = True

    def evict(self):
        if self.cache:
            frame_id, _ = self.cache.popitem(last=False)
            return frame_id
        return None
Clock Algorithm (Second Chance)
Clock 알고리즘
==============
시계 바늘이 순환하며 교체할 프레임을 찾음
[Frame0: ref=1]
/ \
[Frame4: ref=0] [Frame1: ref=1] <-- clock hand 위치
\ /
[Frame3: ref=1]
|
[Frame2: ref=0]
교체 과정:
1. clock hand가 가리키는 Frame1의 ref bit 확인
2. ref=1이면 ref=0으로 설정하고 다음으로 이동
3. ref=0인 프레임을 찾으면 교체
4. Frame2(ref=0)를 교체 대상으로 선택
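위 교체 과정을 코드로 옮기면 다음과 같다. ref bit 배열과 시계 바늘만 남긴 단순화된 스케치이며, 프레임 pin 여부 확인 등은 생략한 가정 기반 예시다.

```python
# Clock(Second Chance) 교체 알고리즘 스케치
class ClockReplacer:
    def __init__(self, num_frames):
        self.ref = [0] * num_frames     # 프레임별 reference bit
        self.hand = 0                   # 시계 바늘 위치

    def access(self, frame_id):
        self.ref[frame_id] = 1          # 접근 시 second chance 부여

    def evict(self):
        while True:
            if self.ref[self.hand] == 0:
                victim = self.hand      # ref=0 인 프레임을 victim으로 선택
                self.hand = (self.hand + 1) % len(self.ref)
                return victim
            self.ref[self.hand] = 0     # ref=1 이면 기회를 소진시키고 전진
            self.hand = (self.hand + 1) % len(self.ref)

clock = ClockReplacer(5)
```

본문 그림처럼 Frame 0, 1, 3의 ref=1, 바늘이 Frame 1에 있으면 첫 evict는 Frame 2를 선택한다.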
5.3 Dirty Page와 Checkpoint
Checkpoint 과정
===============
1. Dirty Page 목록 수집
Frame0(Page5): dirty -> 디스크에 기록 필요
Frame2(Page3): dirty -> 디스크에 기록 필요
2. Fuzzy Checkpoint (InnoDB 방식)
- 모든 dirty page를 한 번에 쓰지 않음
- 백그라운드에서 점진적으로 기록
- 시스템 부하 분산
3. Checkpoint LSN 기록
- WAL에서 이 LSN 이전의 로그는 복구 시 불필요
- WAL 파일 재활용 가능
Timeline:
WAL: [LSN:100][LSN:101]...[LSN:200][LSN:201]...
^
Checkpoint LSN=200
(LSN 200 이전은 안전하게 디스크에 기록됨)
6. WAL (Write-Ahead Log) 심층 분석
WAL은 데이터 변경 전에 로그를 먼저 기록하는 프로토콜이다. 장애 복구의 핵심이다.
6.1 WAL 동작 원리
WAL 기록 흐름
=============
1. 트랜잭션 시작
BEGIN;
2. UPDATE users SET name='Bob' WHERE id=1;
a) WAL에 로그 레코드 기록:
[LSN:101, TxID:5, Table:users, PK:1,
old:{name:'Alice'}, new:{name:'Bob'}]
b) Buffer Pool의 페이지 수정 (메모리에서만)
3. COMMIT;
a) WAL에 커밋 레코드 기록:
[LSN:102, TxID:5, COMMIT]
b) WAL을 디스크에 fsync (이것이 COMMIT의 내구성 보장!)
c) 클라이언트에 성공 응답
핵심: 데이터 페이지는 나중에 lazy하게 디스크에 기록됨
WAL만 디스크에 있으면 장애 후 복구 가능
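위 흐름에서 핵심인 "커밋 시점에만 fsync로 내구성을 확정한다"는 규칙을 스케치하면 다음과 같다. 디스크를 리스트 두 개로 대체한 단순화된 가정 예시로, 실제 WAL의 바이너리 포맷이나 체크섬은 다루지 않는다.

```python
# WAL 프로토콜 스케치: append는 버퍼에, commit 시 fsync(버퍼 내구화)
class WAL:
    def __init__(self):
        self.buffer = []          # 아직 fsync 전인 로그 레코드
        self.durable = []         # fsync 완료(디스크 도달) 레코드
        self.next_lsn = 100

    def append(self, record):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.buffer.append((lsn, record))
        return lsn

    def commit(self, txid):
        self.append(("COMMIT", txid))
        self.durable.extend(self.buffer)   # fsync에 해당: 버퍼 전체를 내구화
        self.buffer.clear()

wal = WAL()
wal.append(("UPDATE", "tx5", "users", 1, {"name": "Alice"}, {"name": "Bob"}))
wal.commit("tx5")
```

데이터 페이지 자체는 이 시점에 디스크에 없어도 되며, durable한 로그만으로 장애 후 재실행(redo)이 가능하다는 점이 WAL의 요지다.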
6.2 Log Sequence Number (LSN)
LSN 구조와 역할
===============
WAL File:
[LSN:100|INSERT|TxID:3|...][LSN:101|UPDATE|TxID:5|...][LSN:102|COMMIT|TxID:5]
Buffer Pool Page Header:
+------------------+
| Page LSN: 101 | <-- 이 페이지에 마지막으로 적용된 WAL 레코드
| Checksum: 0xAB12 |
| ... |
+------------------+
Checkpoint LSN: 100
-> LSN 100 이전의 변경은 모두 디스크에 반영됨
-> 복구 시 LSN 100 이후부터 재실행(redo)
Flushed LSN: 102
-> WAL 파일에서 디스크에 기록된 마지막 LSN
6.3 Group Commit
Group Commit 최적화
===================
개별 fsync는 비용이 매우 높음 (SSD에서도 수백 us)
Without Group Commit:
Tx1: write WAL -> fsync -> respond (10ms)
Tx2: write WAL -> fsync -> respond (10ms)
Tx3: write WAL -> fsync -> respond (10ms)
Total: 30ms, 3 fsync calls
With Group Commit:
Tx1: write WAL --|
Tx2: write WAL --|--> single fsync -> respond to all
Tx3: write WAL --|
Total: 10ms, 1 fsync call
효과: fsync 횟수를 대폭 줄여 처리량 크게 향상
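모아서 한 번에 fsync하는 효과를 코드로 스케치하면 다음과 같다. 실제 구현에 필요한 스레드 동기화와 대기/깨우기는 생략하고, "N개의 커밋 요청 → 1회 fsync"라는 구조만 보이는 가정 예시다.

```python
# Group Commit 스케치: 대기 중인 커밋들을 단일 fsync로 묶어 처리
class GroupCommitLog:
    def __init__(self):
        self.pending = []        # fsync를 기다리는 커밋 요청
        self.fsync_calls = 0

    def commit_request(self, txid):
        self.pending.append(txid)

    def flush_group(self):
        """모인 커밋들을 fsync 1회로 내구화하고, 응답할 트랜잭션 목록을 반환."""
        if not self.pending:
            return []
        self.fsync_calls += 1    # fsync는 그룹당 한 번만
        done, self.pending = self.pending, []
        return done

log = GroupCommitLog()
for tx in ["tx1", "tx2", "tx3"]:
    log.commit_request(tx)
```

세 트랜잭션이 fsync 3회 대신 1회만 지불하므로, 커밋 지연은 비슷하게 유지하면서 처리량이 fsync 비용에 비례해 늘어난다.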
6.4 장애 복구 (Crash Recovery)
ARIES 복구 프로토콜 (Analysis-Redo-Undo)
==========================================
Phase 1: Analysis (분석)
WAL을 처음부터 스캔하여:
- 장애 시점의 활성 트랜잭션 목록 파악
- dirty page 목록 파악
Phase 2: Redo (재실행)
Checkpoint LSN 이후의 모든 WAL 레코드를 재실행
- 커밋된 트랜잭션의 변경 복원
- 미커밋 트랜잭션의 변경도 일단 복원
Phase 3: Undo (취소)
미커밋 트랜잭션의 변경을 역순으로 취소
예시:
WAL: [Tx1:INSERT][Tx2:UPDATE][Tx1:COMMIT][Tx2:DELETE][CRASH!]
^
Analysis: Tx1=committed, Tx2=active(uncommitted)
Redo: Tx1의 INSERT 재실행, Tx2의 UPDATE/DELETE도 재실행
Undo: Tx2의 DELETE 취소, Tx2의 UPDATE 취소
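위 3단계를 아주 작은 코드로 옮기면 다음과 같다. 로그 레코드를 ("SET", tx, key, old, new) 튜플로 표현한 것은 설명용 가정이며, 실제 ARIES의 CLR(보상 로그 레코드)이나 페이지 LSN 비교는 생략했다.

```python
# ARIES 스케치: Analysis -> Redo(전부 재실행) -> Undo(미커밋만 역순 취소)
def recover(wal):
    # Phase 1: Analysis - 커밋된 트랜잭션 파악
    committed = {rec[1] for rec in wal if rec[0] == "COMMIT"}
    db = {}
    # Phase 2: Redo - 커밋 여부와 무관하게 모든 변경 재실행 (repeating history)
    for rec in wal:
        if rec[0] == "SET":
            _, tx, key, old, new = rec
            db[key] = new
    # Phase 3: Undo - 미커밋 트랜잭션의 변경을 역순으로 취소
    for rec in reversed(wal):
        if rec[0] == "SET" and rec[1] not in committed:
            _, tx, key, old, new = rec
            if old is None:
                db.pop(key, None)    # INSERT 취소
            else:
                db[key] = old        # UPDATE 취소
    return db

# Tx1은 커밋, Tx2는 크래시 시점에 미커밋인 예시 WAL
wal = [
    ("SET", "Tx1", "a", None, 1),
    ("SET", "Tx2", "b", 10, 20),
    ("COMMIT", "Tx1"),
    ("SET", "Tx2", "c", None, 5),
]
```

복구 결과에는 Tx1의 변경만 남고, Tx2의 변경은 모두 역순으로 되돌려진다.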
7. MVCC (다중 버전 동시성 제어)
MVCC는 데이터의 여러 버전을 유지하여 읽기와 쓰기가 서로 블로킹하지 않도록 하는 동시성 제어 기법이다.
7.1 PostgreSQL MVCC: xmin/xmax
PostgreSQL MVCC 구현
====================
각 튜플(행)에 메타데이터 포함:
Tuple Header:
+----------+----------+--------+
| xmin | xmax | data |
+----------+----------+--------+
xmin: 이 튜플을 생성(INSERT)한 트랜잭션 ID
xmax: 이 튜플을 삭제(DELETE/UPDATE)한 트랜잭션 ID (0이면 유효)
예시: UPDATE users SET age=31 WHERE id=1;
Before (xmin=100, xmax=0):
+----------+----------+-------------------+
| xmin=100 | xmax=0 | id=1, age=30 | <-- 현재 버전
+----------+----------+-------------------+
After (Tx 200이 UPDATE 실행):
+----------+----------+-------------------+
| xmin=100 | xmax=200 | id=1, age=30 | <-- 이전 버전 (soft delete)
+----------+----------+-------------------+
+----------+----------+-------------------+
| xmin=200 | xmax=0 | id=1, age=31 | <-- 새 버전
+----------+----------+-------------------+
가시성 판단 규칙:
PostgreSQL 가시성 규칙 (Snapshot Isolation)
==========================================
현재 트랜잭션의 snapshot: active_tx = {150, 180}
snapshot_xmin = 150 (가장 오래된 활성 트랜잭션)
튜플이 보이는 조건:
1. xmin이 커밋됨 AND xmin이 snapshot 이전
2. xmax가 0이거나, xmax가 커밋되지 않았거나,
xmax가 snapshot 이후
예시:
- (xmin=100, xmax=0): 보임 (100은 커밋됨, 삭제 안 됨)
- (xmin=100, xmax=200): Tx200이 커밋되면 안 보임
- (xmin=150, xmax=0): 안 보임 (150은 아직 활성)
- (xmin=120, xmax=180): 180이 활성이므로 보임
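위 가시성 규칙을 함수로 옮기면 다음과 같다. active는 스냅숏 생성 시점의 활성 트랜잭션 집합, next_xid는 다음에 할당될 트랜잭션 ID라는 가정이며, 실제 PostgreSQL의 판정(서브트랜잭션, hint bit 등)은 훨씬 복잡하다.

```python
# PostgreSQL 스냅숏 가시성 판단 스케치 (본문 규칙을 그대로 코드화한 가정 예시)
def tuple_visible(xmin, xmax, active, next_xid=201):
    def committed(xid):
        # 스냅숏 기준: 이미 할당되었고 활성 목록에 없으면 커밋된 것으로 본다
        return xid < next_xid and xid not in active

    if not committed(xmin):        # 생성 트랜잭션이 미커밋/미래 -> 안 보임
        return False
    if xmax == 0:                  # 삭제되지 않음 -> 보임
        return True
    return not committed(xmax)     # 삭제가 아직 커밋 전이면 여전히 보임

active = {150, 180}                # 본문 예시의 활성 트랜잭션
```

본문 예시 네 가지 모두 이 함수의 판정과 일치한다.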
7.2 MySQL InnoDB MVCC: Read View
InnoDB Read View 구조
=====================
Read View 생성 시점의 스냅샷:
{
creator_trx_id: 200, // 현재 트랜잭션 ID
trx_ids: [150, 180], // 생성 시점 활성 트랜잭션 목록
up_limit_id: 150, // trx_ids 중 최솟값
low_limit_id: 201 // 다음 할당될 트랜잭션 ID
}
Undo Log를 통한 버전 체인:
Current Row: {id:1, name:'Carol', trx_id:200}
|
v (undo log pointer)
Undo v1: {id:1, name:'Bob', trx_id:180}
|
v (undo log pointer)
Undo v2: {id:1, name:'Alice', trx_id:100}
가시성 판단:
- trx_id=200 (creator): 자신이 수정 -> 보임
- trx_id가 up_limit_id(150) 미만: 커밋 확정 -> 보임
- trx_id가 low_limit_id(201) 이상: 미래 트랜잭션 -> 안 보임
- trx_id가 trx_ids에 포함: 활성 트랜잭션 -> 안 보임
- 그 외: 커밋됨 -> 보임
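위의 네 가지 판정 규칙을 순서대로 코드로 옮기면 다음과 같다. 기본값들은 본문 Read View 예시에서 가져온 가정값이다.

```python
# InnoDB Read View 가시성 판단 스케치 (본문 규칙을 그대로 코드화한 가정 예시)
def row_visible(trx_id, creator_trx_id=200, trx_ids=(150, 180),
                up_limit_id=150, low_limit_id=201):
    if trx_id == creator_trx_id:
        return True                 # 자신의 변경은 항상 보임
    if trx_id < up_limit_id:
        return True                 # Read View 생성 전에 커밋 확정
    if trx_id >= low_limit_id:
        return False                # 미래 트랜잭션
    return trx_id not in trx_ids    # 활성 목록에 없으면 커밋된 것
```

보이지 않는 버전을 만나면 InnoDB는 undo log 포인터를 따라 이전 버전으로 내려가 같은 판정을 반복한다.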
8. 트랜잭션 격리 수준
8.1 네 가지 격리 수준
격리 수준과 허용되는 이상 현상
===============================
격리 수준 | Dirty Read | Non-Repeatable | Phantom Read
================== | ========== | ============== | ============
READ UNCOMMITTED | 허용 | 허용 | 허용
READ COMMITTED | 방지 | 허용 | 허용
REPEATABLE READ | 방지 | 방지 | 허용*
SERIALIZABLE | 방지 | 방지 | 방지
* MySQL InnoDB의 REPEATABLE READ는 Gap Lock으로 Phantom Read도 방지
8.2 이상 현상 예시
-- Dirty Read (READ UNCOMMITTED)
-- Tx1:
BEGIN;
UPDATE accounts SET balance = 500 WHERE id = 1; -- 원래 1000
-- 아직 COMMIT 안 함!
-- Tx2 (READ UNCOMMITTED):
SELECT balance FROM accounts WHERE id = 1;
-- 결과: 500 (커밋되지 않은 데이터를 읽음!)
-- Tx1:
ROLLBACK; -- 500은 무효화됨
-- Tx2는 잘못된 데이터를 기반으로 결정을 내릴 수 있음
-- Non-Repeatable Read (READ COMMITTED)
-- Tx1:
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
-- 결과: 1000
-- Tx2:
BEGIN;
UPDATE accounts SET balance = 500 WHERE id = 1;
COMMIT;
-- Tx1:
SELECT balance FROM accounts WHERE id = 1;
-- 결과: 500 (같은 트랜잭션 내에서 다른 결과!)
-- Phantom Read (REPEATABLE READ, non-InnoDB)
-- Tx1:
BEGIN;
SELECT COUNT(*) FROM orders WHERE status = 'pending';
-- 결과: 5
-- Tx2:
BEGIN;
INSERT INTO orders (status) VALUES ('pending');
COMMIT;
-- Tx1:
SELECT COUNT(*) FROM orders WHERE status = 'pending';
-- 결과: 6 (새로운 행이 나타남 - phantom!)
9. 락(Lock) 메커니즘
9.1 락 종류
InnoDB 락 종류
==============
1. Record Lock (레코드 락)
특정 인덱스 레코드를 잠금
예: WHERE id = 5 -> id=5 레코드만 잠금
2. Gap Lock (갭 락)
인덱스 레코드 사이의 간격을 잠금
예: WHERE id BETWEEN 5 AND 10
-> (5, 10) 사이에 새로운 레코드 삽입 방지
3. Next-Key Lock (넥스트키 락)
Record Lock + Gap Lock
예: WHERE id = 5
-> id=5 레코드 + 이전 레코드와의 갭
InnoDB REPEATABLE READ의 기본 잠금 방식
4. Intention Lock (인텐션 락)
테이블 수준의 의도 표시
- IS (Intention Shared): 행 수준 S lock을 획득할 의도
- IX (Intention Exclusive): 행 수준 X lock을 획득할 의도
9.2 Intention Lock 호환성 매트릭스
Intention Lock 호환성
=====================
| IS | IX | S | X
---------|-------|-------|-------|-------
IS | 호환 | 호환 | 호환 | 충돌
IX | 호환 | 호환 | 충돌 | 충돌
S | 호환 | 충돌 | 호환 | 충돌
X | 충돌 | 충돌 | 충돌 | 충돌
예시:
Tx1: SELECT ... FOR SHARE -> IS(테이블) + S(행)
Tx2: SELECT ... FOR UPDATE -> IX(테이블) + X(행)
IS와 IX는 호환 -> 테이블 수준에서 블로킹 없음
같은 행에 S와 X -> 행 수준에서 Tx2가 대기
9.3 교착 상태 (Deadlock) 탐지
-- 교착 상태 시나리오
-- Tx1:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1; -- Lock A
-- Tx2:
BEGIN;
UPDATE accounts SET balance = balance - 200 WHERE id = 2; -- Lock B
-- Tx1:
UPDATE accounts SET balance = balance + 100 WHERE id = 2; -- Wait for Lock B
-- Tx2:
UPDATE accounts SET balance = balance + 200 WHERE id = 1; -- Wait for Lock A
-- DEADLOCK! Tx1 -> Lock B -> Tx2 -> Lock A -> Tx1
Deadlock 탐지: Wait-for Graph
==============================
Tx1 ---waits for---> Tx2
^ |
| |
+---waits for--------+
사이클 발견! -> 비용이 낮은 트랜잭션을 롤백
InnoDB 방식:
- Wait-for graph를 주기적으로 검사
- 사이클 발견 시 undo log가 적은 트랜잭션을 victim으로 선택
- ERROR 1213 (40001): Deadlock found when trying to get lock
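Wait-for graph의 사이클 탐지는 그래프 DFS로 스케치할 수 있다. 아래는 교착 여부만 판정하는 단순화된 가정 예시로, 실제 InnoDB처럼 victim 선택까지는 다루지 않는다.

```python
# Wait-for Graph 사이클 탐지 스케치: DFS 중 회색(방문 중) 노드를 다시 만나면 사이클
def find_deadlock(wait_for):
    """wait_for: {트랜잭션: 대기 대상 트랜잭션 집합}. 사이클(교착) 존재 여부 반환."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {tx: WHITE for tx in wait_for}

    def dfs(tx):
        color[tx] = GRAY
        for nxt in wait_for[tx]:
            if color[nxt] == GRAY or (color[nxt] == WHITE and dfs(nxt)):
                return True          # 방문 중인 노드 재방문 = 사이클
        color[tx] = BLACK
        return False

    return any(color[tx] == WHITE and dfs(tx) for tx in wait_for)
```

본문 시나리오(Tx1 → Tx2 → Tx1)를 넣으면 사이클이 탐지된다.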
10. 쿼리 옵티마이저
10.1 쿼리 처리 파이프라인
쿼리 처리 과정
==============
SQL Query
|
v
[Parser] --> AST (Abstract Syntax Tree)
|
v
[Binder/Analyzer] --> 테이블/컬럼 존재 확인, 타입 검사
|
v
[Query Rewriter] --> 뷰 확장, 서브쿼리 변환
|
v
[Optimizer] --> 최적 실행 계획 선택
|
+---> [Cost Estimator] --> 각 계획의 비용 추정
+---> [Plan Enumerator] --> 가능한 실행 계획 열거
|
v
[Execution Engine] --> 실행 계획에 따라 데이터 접근
10.2 비용 기반 최적화 (Cost-Based Optimization)
비용 추정 요소
=============
1. I/O 비용: 디스크에서 읽어야 할 페이지 수
- Sequential I/O: 1.0 (기본 단위)
- Random I/O: 4.0 (SSD) ~ 40.0 (HDD)
2. CPU 비용: 튜플 처리 비용
- 비교 연산: 0.01
- 해시 계산: 0.02
3. 메모리 비용: 정렬, 해시 테이블 등
비용 공식 예시 (간략화):
Total Cost = (pages * seq_page_cost) + (tuples * cpu_tuple_cost)
예: 풀 테이블 스캔
pages = 1000, seq_page_cost = 1.0
tuples = 100000, cpu_tuple_cost = 0.01
Total = 1000 * 1.0 + 100000 * 0.01 = 2000
10.3 조인 알고리즘과 비용
조인 알고리즘 비교
==================
1. Nested Loop Join
for each row r in R: -- |R| iterations
for each row s in S: -- |S| iterations
if r.key == s.key:
emit (r, s)
비용: O(|R| * |S|)
적합: 작은 테이블, 인덱스가 있을 때
2. Hash Join
Phase 1 (Build):
for each row r in R:
hash_table[hash(r.key)] = r
Phase 2 (Probe):
for each row s in S:
if hash_table[hash(s.key)] exists:
emit (match, s)
비용: O(|R| + |S|)
적합: 대용량 테이블, 등가 조인
3. Sort-Merge Join
Phase 1: Sort R by key
Phase 2: Sort S by key
Phase 3: Merge sorted R and S
비용: O(|R|log|R| + |S|log|S| + |R| + |S|)
적합: 이미 정렬된 데이터, 범위 조인
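위 Hash Join 의사코드를 실행 가능한 형태로 옮기면 다음과 같다. users/orders 데이터는 설명을 위한 가정이며, 실제 엔진처럼 메모리를 넘는 경우의 파티셔닝(Grace Hash Join)은 다루지 않는다.

```python
# Hash Join 스케치: 작은 쪽(R)으로 build, 큰 쪽(S)으로 probe
from collections import defaultdict

def hash_join(r_rows, s_rows, r_key, s_key):
    # Phase 1 (Build): R의 조인 키 -> 행 목록 해시 테이블
    table = defaultdict(list)
    for r in r_rows:
        table[r[r_key]].append(r)
    # Phase 2 (Probe): S의 각 행으로 해시 테이블을 조회하여 매치 방출
    return [(r, s) for s in s_rows for r in table.get(s[s_key], [])]

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"user_id": 1, "item": "book"}, {"user_id": 1, "item": "pen"},
          {"user_id": 3, "item": "cup"}]
```

양쪽 테이블을 각각 한 번씩만 순회하므로 비용이 O(|R| + |S|)가 되는 것을 확인할 수 있다.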
10.4 인덱스 선택과 통계 정보
-- PostgreSQL 통계 정보 확인
SELECT
tablename,
attname,
n_distinct, -- 고유값 수 (-1이면 모두 고유)
correlation, -- 물리적 정렬과 논리적 정렬의 상관관계
most_common_vals,
most_common_freqs,
histogram_bounds
FROM pg_stats
WHERE tablename = 'users';
-- 결과 예시:
-- attname: status
-- n_distinct: 5
-- most_common_vals: {active, inactive, pending, banned, deleted}
-- most_common_freqs: {0.65, 0.20, 0.10, 0.03, 0.02}
-- correlation: 0.12 (물리적 순서와 무관)
인덱스 선택 판단 기준
====================
쿼리: SELECT * FROM users WHERE status = 'active'
옵션 1: Full Table Scan
pages: 1000, selectivity: 0.65 (65%의 행이 'active')
비용: 1000 * 1.0 = 1000
옵션 2: Index Scan (status 인덱스)
index_pages: 10, table_pages: 650 (random I/O)
비용: 10 * 1.0 + 650 * 4.0 = 2610
결론: selectivity가 높아서(65%) Full Table Scan이 더 효율적!
일반적으로: selectivity가 5~15% 이하일 때 인덱스 사용이 유리
11. InnoDB 내부 구조
11.1 InnoDB 아키텍처 개요
InnoDB 아키텍처
===============
In-Memory:
+-------------------------------------------------------+
| Buffer Pool |
| [Data Pages] [Index Pages] [Undo Pages] [Lock Info] |
| [Adaptive Hash Index] |
| [Change Buffer] |
+-------------------------------------------------------+
| Log Buffer |
+-------------------------------------------------------+
On-Disk:
+----------------------------+----------------------------+
| System Tablespace | Redo Log Files |
| (ibdata1) | (ib_logfile0, ib_logfile1)|
| - Data Dictionary | - WAL records |
| - Doublewrite Buffer | |
| - Change Buffer | |
| - Undo Logs | |
+----------------------------+----------------------------+
| Per-Table Tablespace (.ibd files) |
| - Table Data + Index |
+--------------------------------------------------------+
11.2 Change Buffer
Change Buffer 동작
==================
상황: Secondary Index 페이지가 Buffer Pool에 없을 때
Without Change Buffer:
1. 디스크에서 페이지 읽기 (Random I/O!)
2. 페이지 수정
3. dirty로 표시
With Change Buffer:
1. 변경 사항을 Change Buffer에 기록 (메모리)
2. 나중에 페이지가 읽힐 때 merge
효과: 랜덤 I/O를 순차 I/O로 변환
적합: 쓰기가 많고 secondary index가 많은 워크로드
부적합: unique index (중복 확인 필요)
설정:
innodb_change_buffering = all -- inserts, deletes, purges 모두
innodb_change_buffer_max_size = 25 -- buffer pool의 25%
11.3 Adaptive Hash Index (AHI)
Adaptive Hash Index
===================
InnoDB가 자동으로 자주 접근하는 B+Tree 페이지에 해시 인덱스를 구축
일반 B+Tree 접근: O(log N) - 3~4번의 페이지 접근
AHI 접근: O(1) - 해시 테이블 직접 참조
동작 방식:
1. 특정 인덱스 패턴의 접근 빈도 모니터링
2. 임계값 초과 시 자동으로 해시 인덱스 생성
3. 패턴이 변경되면 해시 인덱스 자동 재구축/삭제
모니터링:
SHOW ENGINE INNODB STATUS;
-- Hash table size 34679, used cells 1234
-- 10.52 hash searches/s, 3.48 non-hash searches/s
주의: 워크로드에 따라 오히려 성능 저하 가능
innodb_adaptive_hash_index = ON|OFF
11.4 Doublewrite Buffer
Doublewrite Buffer 목적
=======================
문제: Partial Page Write (Torn Page)
- OS 페이지 크기(4KB)와 InnoDB 페이지 크기(16KB)가 다름
- 장애 시 16KB 페이지의 일부만 기록될 수 있음
- WAL로도 복구 불가 (WAL은 페이지의 변경 부분만 기록)
해결: Doublewrite Buffer
1. dirty page를 먼저 doublewrite buffer(연속 영역)에 기록
2. 그 다음 실제 위치에 기록
3. 장애 시: doublewrite buffer에서 완전한 페이지 복구
흐름:
[Dirty Page] --> [Doublewrite Buffer (디스크)] --> [실제 테이블 파일]
(순차 쓰기, 2MB 단위) (랜덤 쓰기)
12. RocksDB 내부 구조
12.1 RocksDB 아키텍처
RocksDB 아키텍처
================
Write Path:
[Put/Delete] --> [WAL] --> [MemTable (Active)]
|
(switch when full)
|
v
[MemTable (Immutable)]
|
(background flush)
|
v
[L0 SST Files]
|
(background compaction)
|
v
[L1 SST Files]
|
...
|
v
[Ln SST Files]
Read Path:
[Get] --> [MemTable] --> [L0 SSTs] --> [L1 SSTs] --> ... --> [Ln SSTs]
(bloom) (bloom) (bloom) (bloom)
12.2 Column Families
Column Families 개념
===================
하나의 DB 인스턴스에서 논리적으로 분리된 키-값 네임스페이스
CF "default": user:1 -> {"name":"Alice"}
CF "metadata": schema_version -> "3.2"
CF "timeseries": ts:1710000000 -> {"temp":23.5}
각 Column Family는 독립적인:
- MemTable
- SSTable 세트
- Compaction 설정
- Compression 설정
사용 예:
Options options;
ColumnFamilyOptions cf_opts;
cf_opts.compaction_style = kCompactionStyleLevel;
cf_opts.write_buffer_size = 128 << 20; // 128MB MemTable
// Timeseries CF는 FIFO compaction (오래된 데이터 자동 삭제)
ColumnFamilyOptions ts_opts;
ts_opts.compaction_style = kCompactionStyleFIFO;
ts_opts.compaction_options_fifo.max_table_files_size = 1 << 30; // 1GB
12.3 Rate Limiter
Rate Limiter 설정
=================
목적: 컴팩션과 플러시의 I/O를 제한하여
전면(foreground) 읽기/쓰기 성능 보호
설정 예시:
options.rate_limiter.reset(
NewGenericRateLimiter(
100 * 1024 * 1024, // rate_bytes_per_sec: 100MB/s
100 * 1000, // refill_period_us: 100ms
10, // fairness
RateLimiter::Mode::kWritesOnly
)
);
동작:
- 컴팩션/플러시가 100MB/s 이상 쓰지 못하도록 제한
- 전면 요청이 I/O 대역폭을 확보할 수 있음
- 피크 시간에는 낮게, 오프피크에는 높게 동적 조절 가능
13. 실전 성능 튜닝 팁
13.1 인덱스 최적화
-- 커버링 인덱스로 이중 조회 제거
-- Before: secondary index -> clustered index (bookmark lookup)
SELECT name, email FROM users WHERE status = 'active';
-- index(status)만 있으면 name, email을 위해 테이블 접근 필요
-- After: 커버링 인덱스
-- (INCLUDE 구문은 PostgreSQL 11+ / SQL Server 문법. MySQL에서는
--  CREATE INDEX idx_status_covering ON users(status, name, email); 처럼 복합 인덱스로 대체)
CREATE INDEX idx_status_covering ON users(status) INCLUDE (name, email);
-- 인덱스만으로 쿼리 완료 (Index Only Scan)
-- 복합 인덱스 순서의 중요성 (Leftmost Prefix Rule)
CREATE INDEX idx_abc ON orders(customer_id, order_date, status);
-- 이 인덱스를 사용할 수 있는 쿼리:
SELECT * FROM orders WHERE customer_id = 100; -- O
SELECT * FROM orders WHERE customer_id = 100 AND order_date > '2025-01-01'; -- O
SELECT * FROM orders WHERE customer_id = 100 AND status = 'shipped'; -- O (customer_id 프리픽스로 탐색, status는 인덱스 내 필터링)
-- 이 인덱스를 사용할 수 없는 쿼리:
SELECT * FROM orders WHERE order_date > '2025-01-01'; -- X
SELECT * FROM orders WHERE status = 'shipped'; -- X
13.2 Buffer Pool 튜닝
# MySQL InnoDB Buffer Pool 설정
[mysqld]
# 전체 메모리의 70-80%를 Buffer Pool에 할당
innodb_buffer_pool_size = 12G
# 여러 인스턴스로 분할하여 경합 감소
innodb_buffer_pool_instances = 8
# Buffer Pool 내용을 재시작 시 보존
innodb_buffer_pool_dump_at_shutdown = ON
innodb_buffer_pool_load_at_startup = ON
# 모니터링
SHOW ENGINE INNODB STATUS;
-- Buffer pool hit rate: 999/1000 (99.9% 이상이 이상적)
13.3 WAL/Redo Log 튜닝
# InnoDB Redo Log 설정
[mysqld]
# Redo log 크기 (너무 작으면 checkpoint 빈번, 너무 크면 복구 시간 증가)
innodb_redo_log_capacity = 4G
# fsync 전략
innodb_flush_log_at_trx_commit = 1 # 가장 안전 (기본값)
# = 0: 1초마다 fsync (최대 1초 데이터 손실 가능)
# = 1: 매 커밋마다 fsync (가장 안전, 가장 느림)
# = 2: 매 커밋마다 OS buffer에 쓰기, 1초마다 fsync
14. 퀴즈
데이터베이스 내부 구조에 대한 이해를 점검해 보자.
Q1. B+Tree에서 리프 노드가 연결 리스트로 연결되어 있는 이유는?
A: 범위 쿼리(Range Scan) 최적화를 위해서다. 예를 들어 WHERE age BETWEEN 20 AND 30 같은 쿼리에서, B+Tree는 age=20인 리프 노드를 찾은 뒤 연결 리스트를 따라 age=30까지 순차적으로 스캔할 수 있다. 리프 노드 간 연결이 없다면 매번 루트부터 다시 탐색해야 한다. 이것이 B+Tree가 순수 B-Tree보다 범위 쿼리에 훨씬 효율적인 핵심 이유다.
Q2. LSM-Tree의 Leveled Compaction에서 쓰기 증폭이 발생하는 이유는?
A: Leveled Compaction에서 Level N의 SSTable이 Level N+1로 합병될 때, Level N+1의 겹치는 키 범위에 해당하는 모든 SSTable을 읽고 병합하여 다시 써야 한다. Level N+1의 크기가 Level N의 약 10배이므로, 하나의 SSTable이 최대 10개의 SSTable과 합병될 수 있다. 이 과정이 각 레벨에서 반복되므로 전체 쓰기 증폭은 약 10 x 레벨 수가 된다. 대신 읽기 증폭과 공간 증폭은 낮아지는 트레이드오프가 있다.
Q3. PostgreSQL에서 VACUUM이 필요한 이유를 MVCC 관점에서 설명하시오.
A: PostgreSQL의 MVCC는 UPDATE나 DELETE 시 이전 버전의 튜플을 즉시 제거하지 않고 xmax를 설정하여 "soft delete" 처리한다. 이렇게 남겨진 이전 버전 튜플을 "dead tuple"이라 한다. 시간이 지나면 어떤 활성 트랜잭션도 참조하지 않는 dead tuple이 쌓여 디스크 공간을 낭비하고 인덱스 성능을 저하시킨다. VACUUM은 이런 dead tuple을 정리하여 공간을 재활용 가능하게 만든다. autovacuum이 이를 자동으로 수행하지만, 대량 갱신 후에는 수동 VACUUM이 필요할 수 있다.
Q4. InnoDB의 Next-Key Lock이 Phantom Read를 방지하는 원리를 설명하시오.
A: Next-Key Lock은 Record Lock과 Gap Lock의 조합이다. 예를 들어 SELECT * FROM orders WHERE amount BETWEEN 100 AND 200 FOR UPDATE 쿼리에서, InnoDB는 조건에 해당하는 레코드 자체(Record Lock)뿐 아니라 레코드 사이의 간격(Gap Lock)도 잠근다. 이렇게 하면 다른 트랜잭션이 해당 범위에 새로운 행을 삽입할 수 없으므로 Phantom Read가 방지된다. 이것이 MySQL InnoDB가 REPEATABLE READ 격리 수준에서도 Phantom Read를 방지할 수 있는 이유다.
Q5. Buffer Pool의 hit rate가 낮을 때(예: 90%) 어떤 조치를 취해야 하는가?
A: Buffer Pool hit rate가 낮다는 것은 디스크 I/O가 빈번하게 발생한다는 뜻이다. 조치 방법은 다음과 같다. (1) innodb_buffer_pool_size를 증가시킨다. 일반적으로 전용 DB 서버에서는 물리 메모리의 70-80%를 할당한다. (2) 쿼리를 최적화하여 불필요한 풀 테이블 스캔을 줄인다. 적절한 인덱스를 추가하면 접근하는 페이지 수가 줄어든다. (3) innodb_buffer_pool_instances를 늘려 mutex 경합을 줄인다. (4) 워킹 셋(자주 접근하는 데이터)의 크기가 Buffer Pool보다 크다면 데이터 파티셔닝이나 캐시 레이어(Redis)를 고려한다. 목표는 99% 이상의 hit rate이다.
15. 참고 자료
- Designing Data-Intensive Applications - Martin Kleppmann (O'Reilly, 2017)
- Database Internals - Alex Petrov (O'Reilly, 2019)
- MySQL Internals Manual - https://dev.mysql.com/doc/internals/en/
- PostgreSQL Documentation: MVCC - https://www.postgresql.org/docs/current/mvcc.html
- RocksDB Wiki - https://github.com/facebook/rocksdb/wiki
- InnoDB Buffer Pool - https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool.html
- B-Tree Visualizer - https://www.cs.usfca.edu/~galles/visualization/BTree.html
- The Log-Structured Merge-Tree (LSM-Tree) - Patrick O'Neil et al. (1996)
- ARIES: A Transaction Recovery Method - C. Mohan et al. (1992)
- Architecture of a Database System - Hellerstein, Stonebraker, Hamilton (2007)
- LevelDB Implementation Notes - https://github.com/google/leveldb/blob/main/doc/impl.md
- CMU Database Group Lectures - https://15445.courses.cs.cmu.edu/
- Use The Index, Luke - https://use-the-index-luke.com/
Database Internals Complete Guide 2025: Storage Engines, B-Tree, LSM-Tree, MVCC, WAL
Table of Contents
1. Introduction: Why Understand Database Internals
Databases are the heart of modern software systems. Going beyond simply writing SQL queries, understanding how data is stored and retrieved internally allows you to solve performance problems at their root.
Key topics covered in this guide:
- Two paradigms of storage engines: page-oriented (B-Tree) vs log-structured (LSM-Tree)
- Buffer Pool and caching strategies
- WAL (Write-Ahead Log) and crash recovery
- MVCC and transaction isolation levels
- Lock mechanisms and deadlock detection
- How the query optimizer works
- InnoDB and RocksDB internals
Database Internal Architecture Overview
============================================
Client Query
|
v
[SQL Parser] --> [Query Optimizer] --> [Execution Engine]
| |
v v
[Buffer Pool / Cache] [Transaction Manager]
| |
v v
[Storage Engine] [Lock Manager + MVCC]
|
+---> [B-Tree Index] (page-oriented)
+---> [LSM-Tree] (log-structured)
|
v
[WAL (Write-Ahead Log)]
|
v
[Disk (Data Files + Log Files)]
2. Storage Engine Overview: Page-Oriented vs Log-Structured
The storage engine is the core component responsible for storing and reading data from disk. There are two major paradigms.
2.1 Page-Oriented Storage
The approach adopted by traditional RDBMS. Data is managed in fixed-size pages (typically 4KB to 16KB).
Page-Oriented Storage Structure
================================
[Page 0: File Header]
[Page 1: B-Tree Root]
[Page 2: B-Tree Internal]
[Page 3: B-Tree Leaf (data)]
[Page 4: B-Tree Leaf (data)]
[Page 5: Free Space Map]
...
[Page N: B-Tree Leaf (data)]
Internal structure of each page:
+----------------------------------+
| Page Header (LSN, checksum, ...) |
| Slot Array (offset pointers) |
| ...free space... |
| Tuple 3 | Tuple 2 | Tuple 1 |
+----------------------------------+
Characteristics:
- Read-optimized: fast point queries through indexes
- In-place updates: data is modified directly at its storage location
- Notable implementations: MySQL InnoDB, PostgreSQL, SQL Server, Oracle
2.2 Log-Structured Storage
An approach that records all writes as sequential log entries. Optimized for write performance.
Log-Structured Storage Architecture
====================================
Write Path:
[Client Write] --> [MemTable (in-memory, sorted)]
|
(flush when full)
|
v
[SSTable L0] (immutable, sorted)
[SSTable L0]
|
(compaction)
|
v
[SSTable L1] (merged, sorted)
|
(compaction)
|
v
[SSTable L2] (larger, sorted)
Characteristics:
- Write-optimized: only sequential writes yield high write throughput
- Reads may need to check multiple SSTables
- Notable implementations: RocksDB, LevelDB, Cassandra, HBase
2.3 Comparison Summary
| Property | Page-Oriented (B-Tree) | Log-Structured (LSM-Tree) |
|---|---|---|
| Write Performance | Moderate (random I/O) | Excellent (sequential I/O) |
| Read Performance | Excellent (direct index access) | Moderate (multi-level search) |
| Space Efficiency | Moderate (page fragmentation) | Excellent (cleaned by compaction) |
| Write Amplification | Low | High (compaction) |
| Read Amplification | Low | High (multiple SSTables) |
| Notable DBs | MySQL, PostgreSQL | RocksDB, Cassandra |
3. B-Tree Deep Dive
B-Tree is a self-balancing tree data structure invented in 1972 by Rudolf Bayer and Edward McCreight. It forms the foundation of nearly all RDBMS indexes.
3.1 B-Tree Basic Structure
B-Tree Structure (order = 4)
=============================
[ 30 | 60 ] <-- Root Node
/ | \
v v v
[10|20] [40|50] [70|80|90] <-- Internal Nodes
/ | \ / | \ / | | \
v v v v v v v v v v
[1] [15] [25] [35] [45] [55] [65] [75] [85] [95] <-- Leaf Nodes
5 17 28 38 48 58 68 78 88 98
8 19 39 59 79 99
B-Tree properties (order = m):
- Root has at least 2 children
- Internal nodes have ceil(m/2) to m children
- All leaf nodes are at the same depth
- Search/Insert/Delete: O(log N)
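The search path implied by these properties can be sketched as follows. Representing a node as a (keys, children) tuple is an illustrative assumption that greatly simplifies the page-based node layout of a real DBMS.

```python
# B-Tree search sketch: each node is (keys, children); leaves have children=None.
import bisect

def btree_search(node, key):
    keys, children = node
    i = bisect.bisect_left(keys, key)      # binary search within the sorted node
    if i < len(keys) and keys[i] == key:
        return True                         # B-Tree: keys can match in internal nodes too
    if children is None:
        return False                        # reached a leaf without finding the key
    return btree_search(children[i], key)   # descend into the i-th child

# A reduced version of the diagram's tree rooted at [30|60]
leaf1 = ([1, 15, 25], None)
leaf2 = ([40, 50], None)
leaf3 = ([70, 80, 90], None)
root = ([30, 60], [leaf1, leaf2, leaf3])
```

With tree height O(log N) and an O(log m) search per node, the total cost is O(log N).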
3.2 Node Splitting
When a node becomes full, a split occurs.
Node Splitting Process (order = 4, inserting key 25)
=====================================================
Before:
Parent: [20 | 40]
Child: [21 | 23 | 24] (full!)
Step 1: Inserting 25 causes overflow
Temp: [21 | 23 | 24 | 25]
Step 2: Promote median (23) to parent
Parent: [20 | 23 | 40]
Left: [21]
Right: [24 | 25]
After:
[20 | 23 | 40]
/ | | \
... [21] [24|25] ...
3.3 B+Tree vs B-Tree
Most databases use B+Tree rather than pure B-Tree.
B+Tree Structure
=================
Internal Nodes: keys only (no data)
[30 | 60]
/ | \
v v v
Leaf: [10|20|30] --> [40|50|60] --> [70|80|90]
data data data
Key differences:
- B-Tree: data stored in all nodes
- B+Tree: data stored only in leaf nodes
- B+Tree: leaf nodes linked as linked list (range scan optimization)
B+Tree advantages:
- Internal nodes can store more keys (higher fanout)
- Range queries traverse leaf nodes sequentially
- More consistent search performance (always traverses to leaves)
3.4 Clustered vs Non-Clustered Index
Clustered Index (InnoDB Primary Key)
=====================================
B+Tree leaf = actual table data
[PK: 1] --> {id:1, name:'Alice', age:30}
[PK: 2] --> {id:2, name:'Bob', age:25}
[PK: 3] --> {id:3, name:'Carol', age:35}
Non-Clustered Index (Secondary Index)
======================================
B+Tree leaf = primary key reference
[name:'Alice'] --> PK: 1 (requires another clustered index lookup!)
[name:'Bob'] --> PK: 2
[name:'Carol'] --> PK: 3
Clustered index characteristics:
- Only one per table (typically the PK)
- Data is physically sorted in PK order
- Very efficient for range queries
Non-clustered index double lookup problem:
- After secondary index lookup, must look up the clustered index again
- This is called "bookmark lookup" or "table lookup"
- Can be avoided with a Covering Index
4. LSM-Tree Deep Dive
LSM-Tree (Log-Structured Merge Tree) is a data structure optimized for write-intensive workloads.
4.1 LSM-Tree Architecture
LSM-Tree Full Architecture
===========================
[Write Request]
|
v
[WAL (disk)] <--- Write-ahead for crash recovery
|
v
[MemTable (memory)] <--- Red-Black Tree or Skip List
|
(MemTable reaches threshold)
|
v
[Immutable MemTable]
|
(flush to disk)
|
v
[Level 0 SSTables] <--- Key ranges may overlap
|
(Compaction)
|
v
[Level 1 SSTables] <--- Key ranges do not overlap
|
(Compaction)
|
v
[Level 2 SSTables] <--- Larger files
|
...
4.2 MemTable
The MemTable is a sorted, in-memory data structure.
// MemTable implementation example (Skip List based)
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

class MemTable {
    // byte[] has no natural ordering, so an explicit comparator is required;
    // unsigned lexicographic order matches how LSM engines compare keys
    private final ConcurrentSkipListMap<byte[], byte[]> data =
            new ConcurrentSkipListMap<>(Arrays::compareUnsigned);
    private final AtomicLong approximateSize = new AtomicLong();
    private static final long FLUSH_THRESHOLD = 64L * 1024 * 1024; // 64MB

    public void put(byte[] key, byte[] value) {
        data.put(key, value);
        approximateSize.addAndGet(key.length + value.length);
    }

    public byte[] get(byte[] key) {
        return data.get(key);
    }

    public boolean shouldFlush() {
        return approximateSize.get() >= FLUSH_THRESHOLD;
    }

    // MemTable -> SSTable conversion
    public SSTable flush(String sstablePath) {
        SSTableBuilder builder = new SSTableBuilder(sstablePath);
        for (Map.Entry<byte[], byte[]> entry : data.entrySet()) {
            builder.add(entry.getKey(), entry.getValue());
        }
        return builder.build(); // Written to disk in sorted order
    }
}
4.3 SSTable (Sorted String Table)
SSTable Internal Structure
===========================
+-------------------------------------------+
| Data Block 0 |
| key1:value1 | key2:value2 | key3:value3 |
+-------------------------------------------+
| Data Block 1 |
| key4:value4 | key5:value5 | key6:value6 |
+-------------------------------------------+
| ... |
+-------------------------------------------+
| Meta Block (Bloom Filter) |
+-------------------------------------------+
| Index Block |
| block0_last_key -> offset0 |
| block1_last_key -> offset1 |
+-------------------------------------------+
| Footer |
| index_offset | meta_offset | magic_num |
+-------------------------------------------+
4.4 Compaction Strategies
Size-Tiered Compaction (STCS)
Size-Tiered Compaction
======================
Merges SSTables of similar size
Level 0: [T1] [T2] [T3] [T4] (each 64MB)
|
(merge when 4 accumulate)
|
v
Level 1: [ T1+T2+T3+T4 ] (approx 256MB)
Pros: Low write amplification
Cons: High space amplification (temporarily needs 2x space)
Reads may need to check multiple SSTables
Leveled Compaction (LCS)
Leveled Compaction
==================
Each level consists of SSTables with non-overlapping key ranges
Level N+1 is 10x the size of Level N
L0: [a-z] [a-z] [a-z] (key range overlap allowed, each 64MB)
|
L1: [a-e] [f-k] [l-p] [q-z] (no key range overlap, total 640MB)
|
L2: [a-b] [c-d] ... [y-z] (no key range overlap, total 6.4GB)
Pros: Low read amplification, low space amplification
Cons: High write amplification (up to 10x per level)
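At its core, compaction is a merge of sorted runs in which the newest value for each key wins (and tombstones get dropped). The sketch below is a deliberately simplified illustration of that idea; real engines stream a k-way merge rather than materializing a dict.

```python
# A minimal sketch of what compaction does: merge several sorted runs
# (SSTables) into one sorted run, letting the newest value per key win.
def compact(sstables):
    """sstables: list of dicts ordered oldest -> newest.
    Returns one merged, sorted list of (key, value) pairs."""
    merged = {}
    for table in sstables:      # later (newer) tables overwrite older keys
        merged.update(table)
    # A real compaction would also drop tombstones (deletion markers) here.
    return sorted(merged.items())

old = {"a": 1, "c": 3}
new = {"b": 2, "c": 30}         # newer version of key "c"
# compact([old, new]) -> [("a", 1), ("b", 2), ("c", 30)]
```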
4.5 Bloom Filter
A Bloom Filter is a probabilistic data structure that quickly determines whether a key exists in an SSTable.
# Bloom Filter operation principle
class BloomFilter:
    def __init__(self, size, num_hash_functions):
        self.bit_array = [0] * size
        self.size = size
        self.num_hash = num_hash_functions

    def _hash(self, key, seed):
        # Python's built-in hash() takes no seed argument, so derive
        # k independent-ish hash functions by hashing (seed, key) pairs
        return hash((seed, key)) % self.size

    def add(self, key):
        for i in range(self.num_hash):
            self.bit_array[self._hash(key, i)] = 1

    def might_contain(self, key):
        """
        True  -> key might exist (false positives possible)
        False -> key definitely does not exist (no false negatives)
        """
        for i in range(self.num_hash):
            if self.bit_array[self._hash(key, i)] == 0:
                return False
        return True

# Usage example: SSTable read optimization
def get(key):
    # 1. Check MemTable
    result = memtable.get(key)
    if result:
        return result
    # 2. Quickly filter using each SSTable's Bloom Filter
    for sstable in sorted_sstables:
        if sstable.bloom_filter.might_contain(key):
            result = sstable.search(key)  # Actual disk I/O
            if result:
                return result
    return None
4.6 Write Amplification
Write Amplification Analysis
==============================
Original data: 1 byte
B-Tree write amplification:
WAL write(1) + page write(1) = approx 2x
(can actually be higher since entire page may be written)
LSM-Tree write amplification (Leveled Compaction):
WAL write(1) + MemTable flush(1) + L0->L1(10) + L1->L2(10) + ...
= approx 10 * level_count
Example: 5 levels = approx 50x write amplification!
5. Buffer Pool Management
The Buffer Pool is the core component that caches disk pages in memory.
5.1 Buffer Pool Architecture
Buffer Pool Structure
=====================
[Hash Table: page_id -> frame_id]
|
v
+------+------+------+------+------+
|Frame0|Frame1|Frame2|Frame3|Frame4| <-- Buffer Pool (memory)
|Page5 |Page12|Page3 |empty |Page8 |
|dirty |clean |dirty | |clean |
+------+------+------+------+------+
[Free List]: [Frame3]
[LRU List]: Frame4 <-> Frame1 <-> Frame0 <-> Frame2
(least) (most)
Page Table (mapping):
Page 5 -> Frame 0 (dirty, pin_count=2)
Page 12 -> Frame 1 (clean, pin_count=0)
Page 3 -> Frame 2 (dirty, pin_count=1)
Page 8 -> Frame 4 (clean, pin_count=0)
5.2 Page Replacement Algorithms
LRU (Least Recently Used)
from collections import OrderedDict

class LRUReplacer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def access(self, frame_id):
        if frame_id in self.cache:
            self.cache.move_to_end(frame_id)
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # Remove oldest
            self.cache[frame_id] = True

    def evict(self):
        if self.cache:
            frame_id, _ = self.cache.popitem(last=False)
            return frame_id
        return None
Clock Algorithm (Second Chance)
Clock Algorithm
================
A clock hand sweeps around finding frames to replace
[Frame0: ref=1]
/ \
[Frame4: ref=0] [Frame1: ref=1] <-- clock hand position
\ /
[Frame3: ref=1]
|
[Frame2: ref=0]
Replacement process:
1. Check ref bit of Frame1 where clock hand points
2. If ref=1, set ref=0 and move to next
3. When a frame with ref=0 is found, select it for replacement
4. Frame2 (ref=0) is selected for eviction
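The sweep described above can be expressed compactly. This is a minimal sketch (five frames, hand starting at frame 0 rather than frame 1 as in the diagram), not any particular engine's code.

```python
# A minimal sketch of the Clock (second-chance) replacement algorithm.
class ClockReplacer:
    def __init__(self, num_frames):
        self.ref = [0] * num_frames       # reference bits
        self.hand = 0                     # clock hand position

    def access(self, frame_id):
        self.ref[frame_id] = 1            # grant a second chance

    def evict(self):
        while True:
            if self.ref[self.hand] == 0:  # found a victim
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.ref)
                return victim
            self.ref[self.hand] = 0       # clear ref bit and move on
            self.hand = (self.hand + 1) % len(self.ref)

r = ClockReplacer(5)
for f in (0, 1, 3):                       # frames 0, 1, 3 recently referenced
    r.access(f)
# evict(): clears ref bits of frames 0 and 1, then selects frame 2 (ref=0)
```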
5.3 Dirty Pages and Checkpoints
Checkpoint Process
==================
1. Collect dirty page list
Frame0(Page5): dirty -> needs to be written to disk
Frame2(Page3): dirty -> needs to be written to disk
2. Fuzzy Checkpoint (InnoDB approach)
- Does not write all dirty pages at once
- Writes incrementally in the background
- Distributes system load
3. Record Checkpoint LSN
- WAL logs before this LSN are not needed for recovery
- WAL files can be recycled
Timeline:
WAL: [LSN:100][LSN:101]...[LSN:200][LSN:201]...
^
Checkpoint LSN=200
(all changes before LSN 200 safely on disk)
6. WAL (Write-Ahead Log) Deep Dive
WAL is a protocol that writes logs before modifying data. It is the cornerstone of crash recovery.
6.1 How WAL Works
WAL Write Flow
===============
1. Transaction begins
BEGIN;
2. UPDATE users SET name='Bob' WHERE id=1;
a) Write log record to WAL:
[LSN:101, TxID:5, Table:users, PK:1,
old:{name:'Alice'}, new:{name:'Bob'}]
b) Modify page in Buffer Pool (in memory only)
3. COMMIT;
a) Write commit record to WAL:
[LSN:102, TxID:5, COMMIT]
b) fsync WAL to disk (this guarantees COMMIT durability!)
c) Respond success to client
Key point: Data pages are lazily written to disk later
As long as WAL is on disk, recovery after crash is possible
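The discipline above (log record first, fsync at COMMIT, data pages later) can be sketched as follows. This is a toy illustration under simplifying assumptions: JSON-per-line records, a single log file, no checksums or log recycling.

```python
import json
import os
import tempfile

# A minimal sketch of the WAL rule: append and fsync the log BEFORE
# acknowledging COMMIT; dirty data pages can be flushed lazily later.
class WriteAheadLog:
    def __init__(self, path):
        self.f = open(path, "ab")
        self.lsn = 0

    def append(self, record):
        self.lsn += 1
        record = dict(record, lsn=self.lsn)
        self.f.write((json.dumps(record) + "\n").encode())
        return self.lsn

    def commit(self, tx_id):
        self.append({"tx": tx_id, "type": "COMMIT"})
        self.f.flush()
        os.fsync(self.f.fileno())  # durability point: COMMIT survives a crash

wal = WriteAheadLog(os.path.join(tempfile.gettempdir(), "demo.wal"))
wal.append({"tx": 5, "type": "UPDATE", "table": "users",
            "pk": 1, "old": {"name": "Alice"}, "new": {"name": "Bob"}})
wal.commit(5)  # only after this fsync may we respond success to the client
```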
6.2 Log Sequence Number (LSN)
LSN Structure and Role
=======================
WAL File:
[LSN:100|INSERT|TxID:3|...][LSN:101|UPDATE|TxID:5|...][LSN:102|COMMIT|TxID:5]
Buffer Pool Page Header:
+------------------+
| Page LSN: 101 | <-- Last WAL record applied to this page
| Checksum: 0xAB12 |
| ... |
+------------------+
Checkpoint LSN: 100
-> All changes before LSN 100 are reflected on disk
-> During recovery, redo from LSN 100 onwards
Flushed LSN: 102
-> Last LSN written to disk in WAL file
6.3 Group Commit
Group Commit Optimization
==========================
Individual fsync calls are very expensive (hundreds of microseconds even on SSD)
Without Group Commit:
Tx1: write WAL -> fsync -> respond (10ms)
Tx2: write WAL -> fsync -> respond (10ms)
Tx3: write WAL -> fsync -> respond (10ms)
Total: 30ms, 3 fsync calls
With Group Commit:
Tx1: write WAL --|
Tx2: write WAL --|--> single fsync -> respond to all
Tx3: write WAL --|
Total: 10ms, 1 fsync call
Effect: Dramatically reduces fsync count, greatly improving throughput
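As a toy, single-threaded illustration of the batching idea (real group commit coordinates concurrent sessions with condition variables, which is omitted here), the log below buffers commit records and pays for one fsync per group:

```python
import os
import tempfile

# Toy illustration of group commit: buffer several commit records,
# then make all of them durable with a single fsync. Not thread-safe.
class GroupCommitLog:
    def __init__(self, path):
        self.f = open(path, "ab")
        self.pending = []
        self.fsync_count = 0

    def commit(self, tx_id):
        self.pending.append(tx_id)                 # write WAL, don't sync yet
        self.f.write(f"COMMIT tx={tx_id}\n".encode())

    def flush_group(self):
        """One fsync acknowledges every buffered transaction."""
        self.f.flush()
        os.fsync(self.f.fileno())
        self.fsync_count += 1
        acked, self.pending = self.pending, []
        return acked

log = GroupCommitLog(os.path.join(tempfile.gettempdir(), "group.wal"))
for tx in (1, 2, 3):
    log.commit(tx)
log.flush_group()   # 3 transactions made durable with a single fsync
```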
6.4 Crash Recovery
ARIES Recovery Protocol (Analysis-Redo-Undo)
==============================================
Phase 1: Analysis
Scan WAL from the beginning to:
- Identify active transactions at crash time
- Identify dirty page list
Phase 2: Redo
Replay all WAL records after Checkpoint LSN
- Restore committed transaction changes
- Also restore uncommitted transaction changes (temporarily)
Phase 3: Undo
Reverse uncommitted transaction changes in reverse order
Example:
WAL: [Tx1:INSERT][Tx2:UPDATE][Tx1:COMMIT][Tx2:DELETE][CRASH!]
^
Analysis: Tx1=committed, Tx2=active(uncommitted)
Redo: Replay Tx1's INSERT, also replay Tx2's UPDATE/DELETE
Undo: Reverse Tx2's DELETE, reverse Tx2's UPDATE
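The three phases over the example log can be traced with a toy replay. Records are simplified to (tx, op) tuples; a real ARIES log carries LSNs, page ids, and prevLSN chains, all omitted here.

```python
# A toy replay of the three ARIES recovery phases over the example log.
def recover(log):
    # Phase 1: Analysis - transactions with no COMMIT record are losers
    committed = {tx for tx, op in log if op == "COMMIT"}
    # Phase 2: Redo - replay ALL changes, including uncommitted ones
    redo = [(tx, op) for tx, op in log if op != "COMMIT"]
    # Phase 3: Undo - reverse uncommitted changes, newest first
    undo = [(tx, op) for tx, op in reversed(redo) if tx not in committed]
    return committed, redo, undo

log = [("Tx1", "INSERT"), ("Tx2", "UPDATE"), ("Tx1", "COMMIT"), ("Tx2", "DELETE")]
committed, redo, undo = recover(log)
# committed == {"Tx1"}
# redo replays Tx1's INSERT and Tx2's UPDATE/DELETE
# undo reverses Tx2's DELETE, then Tx2's UPDATE
```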
7. MVCC (Multi-Version Concurrency Control)
MVCC is a concurrency control technique that maintains multiple versions of data so that reads and writes do not block each other.
7.1 PostgreSQL MVCC: xmin/xmax
PostgreSQL MVCC Implementation
================================
Each tuple (row) includes metadata:
Tuple Header:
+----------+----------+--------+
| xmin | xmax | data |
+----------+----------+--------+
xmin: Transaction ID that created (INSERTed) this tuple
xmax: Transaction ID that deleted (DELETE/UPDATE) this tuple (0 if valid)
Example: UPDATE users SET age=31 WHERE id=1;
Before (xmin=100, xmax=0):
+----------+----------+-------------------+
| xmin=100 | xmax=0 | id=1, age=30 | <-- Current version
+----------+----------+-------------------+
After (Tx 200 executes UPDATE):
+----------+----------+-------------------+
| xmin=100 | xmax=200 | id=1, age=30 | <-- Previous version (soft delete)
+----------+----------+-------------------+
+----------+----------+-------------------+
| xmin=200 | xmax=0 | id=1, age=31 | <-- New version
+----------+----------+-------------------+
Visibility rules:
PostgreSQL Visibility Rules (Snapshot Isolation)
=================================================
Current transaction's snapshot: active_tx = {150, 180}
snapshot_xmin = 150 (oldest active transaction)
Conditions for a tuple to be visible:
1. xmin is committed AND xmin is before snapshot
2. xmax is 0, or xmax is not committed, or
xmax is after snapshot
Examples:
- (xmin=100, xmax=0): Visible (100 is committed, not deleted)
- (xmin=100, xmax=200): Not visible if Tx200 is committed
- (xmin=150, xmax=0): Not visible (150 is still active)
- (xmin=120, xmax=180): Visible (180 is active, so delete not yet visible)
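The rules above translate into a short predicate. This is a deliberate simplification: transaction state is modeled with explicit `active` and `committed` sets, whereas real PostgreSQL consults the commit log (CLOG) and hint bits.

```python
# Simplified tuple visibility check following the snapshot rules above.
def is_visible(xmin, xmax, active, committed):
    # Rule 1: the creating transaction must be committed and
    # not in our snapshot's active set
    if xmin in active or xmin not in committed:
        return False
    # Rule 2: not deleted, or deleted by a transaction we cannot see yet
    return xmax == 0 or xmax in active or xmax not in committed

active = {150, 180}
committed = {100, 120, 200}
# is_visible(100, 0,   active, committed) -> True   (committed, not deleted)
# is_visible(100, 200, active, committed) -> False  (delete by Tx200 committed)
# is_visible(150, 0,   active, committed) -> False  (Tx150 still active)
# is_visible(120, 180, active, committed) -> True   (Tx180 active, delete hidden)
```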
7.2 MySQL InnoDB MVCC: Read View
InnoDB Read View Structure
===========================
Snapshot at Read View creation time:
{
creator_trx_id: 200, // Current transaction ID
trx_ids: [150, 180], // Active transactions at creation time
up_limit_id: 150, // Minimum of trx_ids
low_limit_id: 201 // Next transaction ID to be assigned
}
Version chain through Undo Log:
Current Row: {id:1, name:'Carol', trx_id:200}
|
v (undo log pointer)
Undo v1: {id:1, name:'Bob', trx_id:180}
|
v (undo log pointer)
Undo v2: {id:1, name:'Alice', trx_id:100}
Visibility check:
- trx_id=200 (creator): modified by self -> visible
- trx_id less than up_limit_id(150): definitely committed -> visible
- trx_id at or above low_limit_id(201): future transaction -> not visible
- trx_id in trx_ids: active transaction -> not visible
- Otherwise: committed -> visible
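The Read View check plus the undo-chain walk can be sketched directly from the rules above (a simplification: rows and undo records are plain dicts, and undo pointers are nested dicts rather than rollback-segment addresses).

```python
# A sketch of the InnoDB Read View visibility check described above.
def visible(trx_id, view):
    if trx_id == view["creator_trx_id"]:
        return True                       # our own modification
    if trx_id < view["up_limit_id"]:
        return True                       # committed before any active trx
    if trx_id >= view["low_limit_id"]:
        return False                      # started after this snapshot
    return trx_id not in view["trx_ids"]  # active -> invisible, else visible

def read_version(row, view):
    """Walk the undo chain until a visible version is found."""
    while row is not None and not visible(row["trx_id"], view):
        row = row.get("undo")             # follow the undo log pointer
    return row

view = {"creator_trx_id": 200, "trx_ids": [150, 180],
        "up_limit_id": 150, "low_limit_id": 201}
row = {"name": "Carol", "trx_id": 200,
       "undo": {"name": "Bob", "trx_id": 180,
                "undo": {"name": "Alice", "trx_id": 100, "undo": None}}}
# read_version(row, view)["name"] -> "Carol" (creator sees its own write)
```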
8. Transaction Isolation Levels
8.1 Four Isolation Levels
Isolation Levels and Allowed Anomalies
========================================
Isolation Level | Dirty Read | Non-Repeatable | Phantom Read
=================== | ========== | ============== | ============
READ UNCOMMITTED | Allowed | Allowed | Allowed
READ COMMITTED | Prevented | Allowed | Allowed
REPEATABLE READ | Prevented | Prevented | Allowed*
SERIALIZABLE | Prevented | Prevented | Prevented
* MySQL InnoDB's REPEATABLE READ also prevents Phantom Read via Gap Locks
8.2 Anomaly Examples
-- Dirty Read (READ UNCOMMITTED)
-- Tx1:
BEGIN;
UPDATE accounts SET balance = 500 WHERE id = 1; -- originally 1000
-- Not yet COMMITted!
-- Tx2 (READ UNCOMMITTED):
SELECT balance FROM accounts WHERE id = 1;
-- Result: 500 (reading uncommitted data!)
-- Tx1:
ROLLBACK; -- 500 is now invalidated
-- Tx2 may have made decisions based on incorrect data
-- Non-Repeatable Read (READ COMMITTED)
-- Tx1:
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
-- Result: 1000
-- Tx2:
BEGIN;
UPDATE accounts SET balance = 500 WHERE id = 1;
COMMIT;
-- Tx1:
SELECT balance FROM accounts WHERE id = 1;
-- Result: 500 (different result within same transaction!)
-- Phantom Read (REPEATABLE READ, non-InnoDB)
-- Tx1:
BEGIN;
SELECT COUNT(*) FROM orders WHERE status = 'pending';
-- Result: 5
-- Tx2:
BEGIN;
INSERT INTO orders (status) VALUES ('pending');
COMMIT;
-- Tx1:
SELECT COUNT(*) FROM orders WHERE status = 'pending';
-- Result: 6 (new row appeared - phantom!)
9. Lock Mechanisms
9.1 Lock Types
InnoDB Lock Types
==================
1. Record Lock
Locks a specific index record
Example: WHERE id = 5 -> locks only the id=5 record
2. Gap Lock
Locks the gap between index records
Example: WHERE id BETWEEN 5 AND 10
-> prevents new record insertion between (5, 10)
3. Next-Key Lock
Record Lock + Gap Lock
Example: WHERE id = 5
-> locks id=5 record + gap to previous record
Default locking mode for InnoDB REPEATABLE READ
4. Intention Lock
Table-level intent declaration
- IS (Intention Shared): intends to acquire row-level S lock
- IX (Intention Exclusive): intends to acquire row-level X lock
9.2 Intention Lock Compatibility Matrix
Intention Lock Compatibility
=============================
| IS | IX | S | X
---------|------------|------------|------------|----------
IS | Compatible | Compatible | Compatible | Conflict
IX | Compatible | Compatible | Conflict | Conflict
S | Compatible | Conflict | Compatible | Conflict
X | Conflict | Conflict | Conflict | Conflict
Example:
Tx1: SELECT ... FOR SHARE -> IS(table) + S(row)
Tx2: SELECT ... FOR UPDATE -> IX(table) + X(row)
IS and IX are compatible -> no blocking at table level
S and X on same row -> Tx2 waits at row level
9.3 Deadlock Detection
-- Deadlock scenario
-- Tx1:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1; -- Lock A
-- Tx2:
BEGIN;
UPDATE accounts SET balance = balance - 200 WHERE id = 2; -- Lock B
-- Tx1:
UPDATE accounts SET balance = balance + 100 WHERE id = 2; -- Wait for Lock B
-- Tx2:
UPDATE accounts SET balance = balance + 200 WHERE id = 1; -- Wait for Lock A
-- DEADLOCK! Tx1 -> Lock B -> Tx2 -> Lock A -> Tx1
Deadlock Detection: Wait-for Graph
====================================
Tx1 ---waits for---> Tx2
^ |
| |
+---waits for--------+
Cycle detected! -> Roll back the lower-cost transaction
InnoDB approach:
- Periodically checks the wait-for graph
- Selects victim with least undo log when cycle is found
- ERROR 1213 (40001): Deadlock found when trying to get lock
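Cycle detection in the wait-for graph is a standard depth-first search; here is a minimal sketch (victim selection by undo-log size, as InnoDB does, is omitted).

```python
# Minimal wait-for graph cycle detection via DFS, as in the step above.
def has_cycle(waits_for):
    """waits_for: dict mapping a transaction to transactions it waits on."""
    visiting, done = set(), set()

    def dfs(tx):
        if tx in visiting:
            return True                  # back edge -> cycle -> deadlock
        if tx in done:
            return False
        visiting.add(tx)
        for nxt in waits_for.get(tx, ()):
            if dfs(nxt):
                return True
        visiting.remove(tx)
        done.add(tx)
        return False

    return any(dfs(tx) for tx in waits_for)

# The scenario above: Tx1 waits for Tx2, Tx2 waits for Tx1
# has_cycle({"Tx1": ["Tx2"], "Tx2": ["Tx1"]}) -> True  (deadlock)
# has_cycle({"Tx1": ["Tx2"], "Tx2": []})      -> False
```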
10. Query Optimizer
10.1 Query Processing Pipeline
Query Processing Flow
======================
SQL Query
|
v
[Parser] --> AST (Abstract Syntax Tree)
|
v
[Binder/Analyzer] --> table/column existence check, type validation
|
v
[Query Rewriter] --> view expansion, subquery transformation
|
v
[Optimizer] --> select optimal execution plan
|
+---> [Cost Estimator] --> estimate cost of each plan
+---> [Plan Enumerator] --> enumerate possible execution plans
|
v
[Execution Engine] --> access data according to execution plan
10.2 Cost-Based Optimization
Cost Estimation Factors
========================
1. I/O cost: Number of pages to read from disk
- Sequential I/O: 1.0 (base unit)
- Random I/O: 4.0 (SSD) ~ 40.0 (HDD)
2. CPU cost: Tuple processing cost
- Comparison operation: 0.01
- Hash computation: 0.02
3. Memory cost: Sorting, hash tables, etc.
Cost formula example (simplified):
Total Cost = (pages * seq_page_cost) + (tuples * cpu_tuple_cost)
Example: Full table scan
pages = 1000, seq_page_cost = 1.0
tuples = 100000, cpu_tuple_cost = 0.01
Total = 1000 * 1.0 + 100000 * 0.01 = 2000
10.3 Join Algorithms and Costs
Join Algorithm Comparison
==========================
1. Nested Loop Join
for each row r in R: -- |R| iterations
for each row s in S: -- |S| iterations
if r.key == s.key:
emit (r, s)
Cost: O(|R| * |S|)
Best for: small tables, when indexes exist
2. Hash Join
Phase 1 (Build):
for each row r in R:
hash_table[hash(r.key)] = r
Phase 2 (Probe):
for each row s in S:
if hash_table[hash(s.key)] exists:
emit (match, s)
Cost: O(|R| + |S|)
Best for: large tables, equi-joins
3. Sort-Merge Join
Phase 1: Sort R by key
Phase 2: Sort S by key
Phase 3: Merge sorted R and S
Cost: O(|R|log|R| + |S|log|S| + |R| + |S|)
Best for: pre-sorted data, range joins
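The two-phase hash join above is short enough to run as-is. The sketch below assumes in-memory rows as dicts and an equi-join key; a real executor partitions to disk when the build side exceeds memory (Grace hash join).

```python
# Runnable sketch of the two-phase (build/probe) hash join above.
def hash_join(r_rows, s_rows, r_key, s_key):
    # Phase 1 (Build): hash the smaller relation R
    table = {}
    for r in r_rows:
        table.setdefault(r[r_key], []).append(r)
    # Phase 2 (Probe): stream S once and emit matches -> O(|R| + |S|)
    out = []
    for s in s_rows:
        for r in table.get(s[s_key], ()):
            out.append({**r, **s})
    return out

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"user_id": 1, "item": "book"}, {"user_id": 1, "item": "pen"}]
# hash_join(users, orders, "id", "user_id") yields Alice's two orders
```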
10.4 Index Selection and Statistics
-- Checking PostgreSQL statistics
SELECT
tablename,
attname,
n_distinct, -- Number of distinct values (-1 means all unique)
correlation, -- Correlation between physical and logical ordering
most_common_vals,
most_common_freqs,
histogram_bounds
FROM pg_stats
WHERE tablename = 'users';
-- Example result:
-- attname: status
-- n_distinct: 5
-- most_common_vals: {active, inactive, pending, banned, deleted}
-- most_common_freqs: {0.65, 0.20, 0.10, 0.03, 0.02}
-- correlation: 0.12 (weak correlation with physical row order)
Index Selection Criteria
=========================
Query: SELECT * FROM users WHERE status = 'active'
Option 1: Full Table Scan
pages: 1000, selectivity: 0.65 (65% of rows are 'active')
cost: 1000 * 1.0 = 1000
Option 2: Index Scan (status index)
index_pages: 10, table_pages: 650 (random I/O)
cost: 10 * 1.0 + 650 * 4.0 = 2610
Conclusion: High selectivity (65%) makes Full Table Scan more efficient!
General rule: Index use is beneficial when selectivity is below 5-15%
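The comparison above can be reproduced with a tiny cost model; the weights and page counts follow the example in this section, not any specific optimizer's defaults.

```python
# Tiny cost model reproducing the full-scan vs index-scan example above.
SEQ_PAGE_COST = 1.0      # sequential page read (base unit)
RANDOM_PAGE_COST = 4.0   # random page read (SSD-like weight)

def full_scan_cost(table_pages):
    return table_pages * SEQ_PAGE_COST

def index_scan_cost(index_pages, row_fetch_pages):
    # index traversal is cheap; per-row table fetches are random I/O
    return index_pages * SEQ_PAGE_COST + row_fetch_pages * RANDOM_PAGE_COST

# Query: WHERE status = 'active', selectivity 0.65 on a 1000-page table
full = full_scan_cost(1000)        # 1000.0
index = index_scan_cost(10, 650)   # 10 + 2600 = 2610.0
# full < index -> the optimizer chooses the full table scan
```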
11. InnoDB Internals
11.1 InnoDB Architecture Overview
InnoDB Architecture
====================
In-Memory:
+-------------------------------------------------------+
| Buffer Pool |
| [Data Pages] [Index Pages] [Undo Pages] [Lock Info] |
| [Adaptive Hash Index] |
| [Change Buffer] |
+-------------------------------------------------------+
| Log Buffer |
+-------------------------------------------------------+
On-Disk:
+----------------------------+----------------------------+
| System Tablespace | Redo Log Files |
| (ibdata1) | (ib_logfile0, ib_logfile1)|
| - Data Dictionary | - WAL records |
| - Doublewrite Buffer | |
| - Change Buffer | |
| - Undo Logs | |
+----------------------------+----------------------------+
| Per-Table Tablespace (.ibd files) |
| - Table Data + Index |
+--------------------------------------------------------+
11.2 Change Buffer
Change Buffer Operation
========================
Scenario: Secondary Index page is not in Buffer Pool
Without Change Buffer:
1. Read page from disk (Random I/O!)
2. Modify page
3. Mark as dirty
With Change Buffer:
1. Record change in Change Buffer (memory)
2. Merge later when page is read
Effect: Converts random I/O to sequential I/O
Good for: write-heavy workloads with many secondary indexes
Bad for: unique indexes (need duplicate checking)
Configuration:
innodb_change_buffering = all -- inserts, deletes, purges
innodb_change_buffer_max_size = 25 -- 25% of buffer pool
11.3 Adaptive Hash Index (AHI)
Adaptive Hash Index
====================
InnoDB automatically builds hash indexes on frequently accessed B+Tree pages
Normal B+Tree access: O(log N) - 3-4 page accesses
AHI access: O(1) - direct hash table lookup
How it works:
1. Monitors access frequency of specific index patterns
2. Automatically creates hash index when threshold exceeded
3. Automatically rebuilds/removes hash index when patterns change
Monitoring:
SHOW ENGINE INNODB STATUS;
-- Hash table size 34679, used cells 1234
-- 10.52 hash searches/s, 3.48 non-hash searches/s
Note: Can actually degrade performance depending on workload
innodb_adaptive_hash_index = ON|OFF
11.4 Doublewrite Buffer
Doublewrite Buffer Purpose
============================
Problem: Partial Page Write (Torn Page)
- OS page size (4KB) differs from InnoDB page size (16KB)
- During a crash, only part of a 16KB page may have been written
- WAL cannot recover this (WAL only records the changed portion)
Solution: Doublewrite Buffer
1. Write dirty pages first to doublewrite buffer (contiguous area)
2. Then write to actual location
3. On crash: recover complete page from doublewrite buffer
Flow:
[Dirty Page] --> [Doublewrite Buffer (disk)] --> [Actual Table File]
(sequential write, 2MB chunks) (random write)
12. RocksDB Internals
12.1 RocksDB Architecture
RocksDB Architecture
=====================
Write Path:
[Put/Delete] --> [WAL] --> [MemTable (Active)]
|
(switch when full)
|
v
[MemTable (Immutable)]
|
(background flush)
|
v
[L0 SST Files]
|
(background compaction)
|
v
[L1 SST Files]
|
...
|
v
[Ln SST Files]
Read Path:
[Get] --> [MemTable] --> [L0 SSTs] --> [L1 SSTs] --> ... --> [Ln SSTs]
(bloom) (bloom) (bloom) (bloom)
12.2 Column Families
Column Families Concept
========================
Logically separated key-value namespaces within a single DB instance
CF "default": user:1 -> {"name":"Alice"}
CF "metadata": schema_version -> "3.2"
CF "timeseries": ts:1710000000 -> {"temp":23.5}
Each Column Family has independent:
- MemTable
- SSTable set
- Compaction settings
- Compression settings
Usage example:
Options options;
ColumnFamilyOptions cf_opts;
cf_opts.compaction_style = kCompactionStyleLevel;
cf_opts.write_buffer_size = 128 << 20; // 128MB MemTable
// Timeseries CF uses FIFO compaction (auto-delete old data)
ColumnFamilyOptions ts_opts;
ts_opts.compaction_style = kCompactionStyleFIFO;
ts_opts.compaction_options_fifo.max_table_files_size = 1 << 30; // 1GB
12.3 Rate Limiter
Rate Limiter Configuration
============================
Purpose: Limit compaction and flush I/O to protect
foreground read/write performance
Configuration example:
options.rate_limiter.reset(
NewGenericRateLimiter(
100 * 1024 * 1024, // rate_bytes_per_sec: 100MB/s
100 * 1000, // refill_period_us: 100ms
10, // fairness
RateLimiter::Mode::kWritesOnly
)
);
Behavior:
- Prevents compaction/flush from writing more than 100MB/s
- Ensures foreground requests can secure I/O bandwidth
- Can be dynamically adjusted: low during peak, high during off-peak
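Conceptually this is a token bucket: each refill period grants a byte budget, and background writes must acquire tokens before proceeding. The sketch below mirrors the parameter names of the RocksDB options above but is an illustration, not RocksDB's implementation (which also blocks callers and supports auto-tuning).

```python
# A sketch of the rate-limiter idea: a token bucket refilled each period.
class TokenBucket:
    def __init__(self, rate_bytes_per_sec, refill_period_us=100_000):
        # byte budget per refill window
        self.capacity = rate_bytes_per_sec * refill_period_us // 1_000_000
        self.tokens = self.capacity

    def refill(self):
        self.tokens = self.capacity      # called once per refill period

    def try_request(self, nbytes):
        """Compaction/flush asks before each write; denied writes wait."""
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

bucket = TokenBucket(100 * 1024 * 1024)  # 100MB/s with a 100ms period
# each 100ms window allows ~10MB of background writes:
# bucket.try_request(8 * 1024 * 1024) -> True
# bucket.try_request(8 * 1024 * 1024) -> False (window budget exhausted)
```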
13. Practical Performance Tuning Tips
13.1 Index Optimization
-- Eliminate double lookup with covering index
-- Before: secondary index -> clustered index (bookmark lookup)
SELECT name, email FROM users WHERE status = 'active';
-- With only index(status), table access needed for name, email
-- After: covering index (INCLUDE is PostgreSQL 11+ syntax)
CREATE INDEX idx_status_covering ON users(status) INCLUDE (name, email);
-- MySQL has no INCLUDE; use a composite index: ON users(status, name, email)
-- Query completed using index only (Index Only Scan)
-- Importance of composite index order (Leftmost Prefix Rule)
CREATE INDEX idx_abc ON orders(customer_id, order_date, status);
-- Queries that CAN use this index:
SELECT * FROM orders WHERE customer_id = 100; -- O
SELECT * FROM orders WHERE customer_id = 100 AND order_date > '2025-01-01'; -- O
SELECT * FROM orders WHERE customer_id = 100 AND status = 'shipped'; -- O (customer_id prefix used; status filtered within index entries)
-- Queries that CANNOT use this index:
SELECT * FROM orders WHERE order_date > '2025-01-01'; -- X
SELECT * FROM orders WHERE status = 'shipped'; -- X
13.2 Buffer Pool Tuning
# MySQL InnoDB Buffer Pool configuration
[mysqld]
# Allocate 70-80% of total memory to Buffer Pool
innodb_buffer_pool_size = 12G
# Split into multiple instances to reduce contention
innodb_buffer_pool_instances = 8
# Preserve Buffer Pool contents across restarts
innodb_buffer_pool_dump_at_shutdown = ON
innodb_buffer_pool_load_at_startup = ON
# Monitoring
SHOW ENGINE INNODB STATUS;
-- Buffer pool hit rate: 999/1000 (99.9% or higher is ideal)
13.3 WAL/Redo Log Tuning
# InnoDB Redo Log configuration
[mysqld]
# Redo log size (too small = frequent checkpoints, too large = longer recovery)
innodb_redo_log_capacity = 4G
# fsync strategy
innodb_flush_log_at_trx_commit = 1 # Safest (default)
# = 0: fsync every second (up to 1 second of data loss possible)
# = 1: fsync per commit (safest, slowest)
# = 2: write to OS buffer per commit, fsync every second
14. Quiz
Test your understanding of database internals.
Q1. Why are B+Tree leaf nodes linked together as a linked list?
A: For range scan optimization. For example, in a query like WHERE age BETWEEN 20 AND 30, the B+Tree can find the leaf node for age=20 and then sequentially scan through the linked list until age=30. Without linked leaf nodes, it would need to traverse from the root for each value. This is the key reason B+Tree is far more efficient than pure B-Tree for range queries.
Q2. Why does write amplification occur in LSM-Tree's Leveled Compaction?
A: In Leveled Compaction, when an SSTable from Level N is merged into Level N+1, all SSTables in Level N+1 that have overlapping key ranges must be read, merged, and rewritten. Since Level N+1 is approximately 10x the size of Level N, a single SSTable may merge with up to 10 SSTables. This process repeats at each level, so total write amplification is approximately 10 times the number of levels. The tradeoff is lower read amplification and lower space amplification.
Q3. Explain why VACUUM is needed in PostgreSQL from an MVCC perspective.
A: PostgreSQL's MVCC does not immediately remove previous tuple versions during UPDATE or DELETE operations. Instead, it sets the xmax field to perform a "soft delete." These leftover previous versions are called "dead tuples." Over time, dead tuples that no active transaction references accumulate, wasting disk space and degrading index performance. VACUUM cleans up these dead tuples to make space reclaimable. While autovacuum performs this automatically, manual VACUUM may be needed after heavy bulk updates.
Q4. Explain how InnoDB's Next-Key Lock prevents Phantom Reads.
A: Next-Key Lock is a combination of Record Lock and Gap Lock. For example, in the query SELECT * FROM orders WHERE amount BETWEEN 100 AND 200 FOR UPDATE, InnoDB locks not only the records matching the condition (Record Lock) but also the gaps between records (Gap Lock). This prevents other transactions from inserting new rows into that range, thereby preventing Phantom Reads. This is why MySQL InnoDB can prevent Phantom Reads even at the REPEATABLE READ isolation level.
Q5. What actions should you take when Buffer Pool hit rate is low (e.g., 90%)?
A: A low Buffer Pool hit rate means frequent disk I/O is occurring. Remediation steps include: (1) Increase innodb_buffer_pool_size. On dedicated DB servers, typically allocate 70-80% of physical memory. (2) Optimize queries to reduce unnecessary full table scans. Adding proper indexes reduces the number of pages accessed. (3) Increase innodb_buffer_pool_instances to reduce mutex contention. (4) If the working set (frequently accessed data) is larger than the Buffer Pool, consider data partitioning or adding a cache layer (Redis). The target should be 99% or higher hit rate.
15. References
- Designing Data-Intensive Applications - Martin Kleppmann (O'Reilly, 2017)
- Database Internals - Alex Petrov (O'Reilly, 2019)
- MySQL Internals Manual - https://dev.mysql.com/doc/internals/en/
- PostgreSQL Documentation: MVCC - https://www.postgresql.org/docs/current/mvcc.html
- RocksDB Wiki - https://github.com/facebook/rocksdb/wiki
- InnoDB Buffer Pool - https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool.html
- B-Tree Visualizer - https://www.cs.usfca.edu/~galles/visualization/BTree.html
- The Log-Structured Merge-Tree (LSM-Tree) - Patrick O'Neil et al. (1996)
- ARIES: A Transaction Recovery Method - C. Mohan et al. (1992)
- Architecture of a Database System - Hellerstein, Stonebraker, Hamilton (2007)
- LevelDB Implementation Notes - https://github.com/google/leveldb/blob/main/doc/impl.md
- CMU Database Group Lectures - https://15445.courses.cs.cmu.edu/
- Use The Index, Luke - https://use-the-index-luke.com/