Database Internals Complete Guide 2025: Storage Engines, B-Tree, LSM-Tree, MVCC, WAL


1. Introduction: Why Understand Database Internals

Databases are the heart of modern software systems. Going beyond simply writing SQL queries, understanding how data is stored and retrieved internally allows you to solve performance problems at their root.

Key topics covered in this guide:

  • Two paradigms of storage engines: page-oriented (B-Tree) vs log-structured (LSM-Tree)
  • Buffer Pool and caching strategies
  • WAL (Write-Ahead Log) and crash recovery
  • MVCC and transaction isolation levels
  • Lock mechanisms and deadlock detection
  • How the query optimizer works
  • InnoDB and RocksDB internals

Database Internal Architecture Overview
============================================

  Client Query
       |
       v
  [SQL Parser] --> [Query Optimizer] --> [Execution Engine]
       |                                        |
       v                                        v
  [Buffer Pool / Cache]              [Transaction Manager]
       |                                        |
       v                                        v
  [Storage Engine]                    [Lock Manager + MVCC]
       |
       +---> [B-Tree Index]  (page-oriented)
       +---> [LSM-Tree]      (log-structured)
       |
       v
  [WAL (Write-Ahead Log)]
       |
       v
  [Disk (Data Files + Log Files)]

2. Storage Engine Overview: Page-Oriented vs Log-Structured

The storage engine is the core component responsible for storing and reading data from disk. There are two major paradigms.

2.1 Page-Oriented Storage

The approach adopted by traditional RDBMS. Data is managed in fixed-size pages (typically 4KB to 16KB).

Page-Oriented Storage Structure
================================

  [Page 0: File Header]
  [Page 1: B-Tree Root]
  [Page 2: B-Tree Internal]
  [Page 3: B-Tree Leaf (data)]
  [Page 4: B-Tree Leaf (data)]
  [Page 5: Free Space Map]
  ...
  [Page N: B-Tree Leaf (data)]

  Internal structure of each page:
  +----------------------------------+
  | Page Header (LSN, checksum, ...) |
  | Slot Array (offset pointers)     |
  |         ...free space...          |
  | Tuple 3 | Tuple 2 | Tuple 1     |
  +----------------------------------+

Characteristics:

  • Read-optimized: fast point queries through indexes
  • In-place updates: data is modified directly at its storage location
  • Notable implementations: MySQL InnoDB, PostgreSQL, SQL Server, Oracle

2.2 Log-Structured Storage

An approach that records all writes as sequential log entries. Optimized for write performance.

Log-Structured Storage Architecture
====================================

  Write Path:
  [Client Write] --> [MemTable (in-memory, sorted)]
                          |
                     (flush when full)
                          |
                          v
                    [SSTable L0] (immutable, sorted)
                    [SSTable L0]
                          |
                     (compaction)
                          |
                          v
                    [SSTable L1] (merged, sorted)
                          |
                     (compaction)
                          |
                          v
                    [SSTable L2] (larger, sorted)

Characteristics:

  • Write-optimized: performs only sequential writes, yielding high write throughput
  • Reads may need to check multiple SSTables
  • Notable implementations: RocksDB, LevelDB, Cassandra, HBase

2.3 Comparison Summary

  Property            | Page-Oriented (B-Tree)          | Log-Structured (LSM-Tree)
  =================== | =============================== | =================================
  Write Performance   | Moderate (random I/O)           | Excellent (sequential I/O)
  Read Performance    | Excellent (direct index access) | Moderate (multi-level search)
  Space Efficiency    | Moderate (page fragmentation)   | Excellent (cleaned by compaction)
  Write Amplification | Low                             | High (compaction)
  Read Amplification  | Low                             | High (multiple SSTables)
  Notable DBs         | MySQL, PostgreSQL               | RocksDB, Cassandra

3. B-Tree Deep Dive

B-Tree is a self-balancing tree data structure invented in 1972 by Rudolf Bayer and Edward McCreight. It forms the foundation of nearly all RDBMS indexes.

3.1 B-Tree Basic Structure

B-Tree Structure (order = 4)
=============================

              [  30  |  60  ]           <-- Root Node
             /       |       \
            v        v        v
     [10|20]    [40|50]    [70|80|90]   <-- Internal Nodes
     / | \      / | \      / |  |  \
    v  v  v    v  v  v    v  v  v   v
   [1] [15] [25] [35] [45] [55] [65] [75] [85] [95]  <-- Leaf Nodes
    5   17   28   38   48   58   68   78   88   98
    8   19        39        59        79        99

B-Tree properties (order = m):

  • Root has at least 2 children
  • Internal nodes have ceil(m/2) to m children
  • All leaf nodes are at the same depth
  • Search/Insert/Delete: O(log N)
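
The O(log N) descent implied by these properties can be sketched in a few lines of Python. This is an illustrative toy, not an engine implementation; the `Node` class and `search` function are hypothetical names:

```python
# Toy classic B-Tree lookup: data may live in any node (unlike B+Tree).
from bisect import bisect_left

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted keys within the node
        self.children = children  # None for leaf nodes
        self.values = values      # payload stored alongside each key

def search(node, key):
    """Descend from the root; each node visit is one page read."""
    while True:
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]     # found: classic B-Tree stores data here
        if node.children is None:
            return None               # reached a leaf without a match
        node = node.children[i]       # descend into the i-th subtree
```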

3.2 Node Splitting

When a node becomes full, a split occurs.

Node Splitting Process (order = 4, inserting key 25)
=====================================================

Before:
  Parent: [20 | 40]
  Child:  [21 | 23 | 24]  (full!)

Step 1: Inserting 25 causes overflow
  Temp:  [21 | 23 | 24 | 25]

Step 2: Promote median (23) to parent
  Parent: [20 | 23 | 40]
  Left:   [21]
  Right:  [24 | 25]

After:
         [20 | 23 | 40]
        /    |     |    \
     ...  [21]  [24|25]  ...
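
The promotion step above can be written directly. This sketch handles only the key list of a leaf; a real split also moves child pointers and tuple payloads:

```python
def split(keys):
    """Split an overflowing sorted key list into
    (left_keys, promoted_median, right_keys)."""
    mid = (len(keys) - 1) // 2          # median index: 1 for 4 keys
    return keys[:mid], keys[mid], keys[mid + 1:]
```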

3.3 B+Tree vs B-Tree

Most databases use B+Tree rather than pure B-Tree.

B+Tree Structure
=================

  Internal Nodes: keys only (no data)
              [30 | 60]
             /    |    \
            v     v     v
  Leaf:  [10|20|30] --> [40|50|60] --> [70|80|90]
          data          data           data

  Key differences:
  - B-Tree: data stored in all nodes
  - B+Tree: data stored only in leaf nodes
  - B+Tree: leaf nodes linked as linked list (range scan optimization)

B+Tree advantages:

  • Internal nodes can store more keys (higher fanout)
  • Range queries traverse leaf nodes sequentially
  • More consistent search performance (always traverses to leaves)
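
The range-scan advantage follows directly from the leaf sibling pointers. A minimal sketch, assuming a hypothetical `Leaf` class with a `next` pointer as in the diagram:

```python
# B+Tree range scan: after locating the first leaf, follow sibling links.
class Leaf:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values
        self.next = None   # link to the right sibling leaf

def range_scan(first_leaf, lo, hi):
    """Emit (key, value) pairs with lo <= key <= hi, walking the leaf chain."""
    out = []
    leaf = first_leaf
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > hi:
                return out             # keys are sorted: we can stop early
            if k >= lo:
                out.append((k, v))
        leaf = leaf.next               # sequential hop, no re-descent needed
    return out
```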

3.4 Clustered vs Non-Clustered Index

Clustered Index (InnoDB Primary Key)
=====================================

  B+Tree leaf = actual table data
  
  [PK: 1] --> {id:1, name:'Alice', age:30}
  [PK: 2] --> {id:2, name:'Bob',   age:25}
  [PK: 3] --> {id:3, name:'Carol', age:35}

Non-Clustered Index (Secondary Index)
======================================

  B+Tree leaf = primary key reference
  
  [name:'Alice'] --> PK: 1  (requires another clustered index lookup!)
  [name:'Bob']   --> PK: 2
  [name:'Carol'] --> PK: 3

Clustered index characteristics:

  • Only one per table (typically the PK)
  • Data is physically sorted in PK order
  • Very efficient for range queries

Non-clustered index double lookup problem:

  • After secondary index lookup, must look up the clustered index again
  • This is called "bookmark lookup" or "table lookup"
  • Can be avoided with a Covering Index

4. LSM-Tree Deep Dive

LSM-Tree (Log-Structured Merge Tree) is a data structure optimized for write-intensive workloads.

4.1 LSM-Tree Architecture

LSM-Tree Full Architecture
===========================

  [Write Request]
       |
       v
  [WAL (disk)] <--- Write-ahead for crash recovery
       |
       v
  [MemTable (memory)]  <--- Red-Black Tree or Skip List
       |
  (MemTable reaches threshold)
       |
       v
  [Immutable MemTable]
       |
  (flush to disk)
       |
       v
  [Level 0 SSTables]  <--- Key ranges may overlap
       |
  (Compaction)
       |
       v
  [Level 1 SSTables]  <--- Key ranges do not overlap
       |
  (Compaction)
       |
       v
  [Level 2 SSTables]  <--- Larger files
       |
       ...

4.2 MemTable

The MemTable is a sorted, in-memory data structure.

// MemTable implementation example (Skip List based)
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

class MemTable {
    // byte[] has no natural ordering, so the skip list needs an explicit
    // unsigned lexicographic comparator
    private final ConcurrentSkipListMap<byte[], byte[]> data =
            new ConcurrentSkipListMap<>(Arrays::compareUnsigned);
    private final AtomicLong approximateSize = new AtomicLong(0);
    private static final long FLUSH_THRESHOLD = 64 * 1024 * 1024; // 64MB

    public void put(byte[] key, byte[] value) {
        data.put(key, value);
        approximateSize.addAndGet(key.length + value.length);
    }

    public byte[] get(byte[] key) {
        return data.get(key);
    }

    public boolean shouldFlush() {
        return approximateSize.get() >= FLUSH_THRESHOLD;
    }

    // MemTable -> SSTable conversion
    public SSTable flush(String sstablePath) {
        SSTableBuilder builder = new SSTableBuilder(sstablePath);
        for (Map.Entry<byte[], byte[]> entry : data.entrySet()) {
            builder.add(entry.getKey(), entry.getValue());
        }
        return builder.build(); // Written to disk in sorted order
    }
}

4.3 SSTable (Sorted String Table)

SSTable Internal Structure
===========================

  +-------------------------------------------+
  | Data Block 0                               |
  |   key1:value1 | key2:value2 | key3:value3  |
  +-------------------------------------------+
  | Data Block 1                               |
  |   key4:value4 | key5:value5 | key6:value6  |
  +-------------------------------------------+
  | ...                                        |
  +-------------------------------------------+
  | Meta Block (Bloom Filter)                  |
  +-------------------------------------------+
  | Index Block                                |
  |   block0_last_key -> offset0               |
  |   block1_last_key -> offset1               |
  +-------------------------------------------+
  | Footer                                     |
  |   index_offset | meta_offset | magic_num   |
  +-------------------------------------------+
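
A point read first binary-searches the Index Block to find the single Data Block that could contain the key. A sketch, assuming `index` is a sorted list of `(block_last_key, offset)` pairs as in the diagram:

```python
from bisect import bisect_left

def find_block(index, key):
    """Return the offset of the first block whose last key >= key,
    i.e. the only block that could contain the key; None if past the end."""
    last_keys = [k for k, _ in index]
    i = bisect_left(last_keys, key)
    return index[i][1] if i < len(index) else None
```

Only that one block is then read from disk and scanned, so a lookup costs a single data-block I/O once the (small, cacheable) index is in memory.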

4.4 Compaction Strategies

Size-Tiered Compaction (STCS)

Size-Tiered Compaction
======================

  Merges SSTables of similar size

  Level 0: [T1] [T2] [T3] [T4]  (each 64MB)
                   |
          (merge when 4 accumulate)
                   |
                   v
  Level 1: [   T1+T2+T3+T4   ]  (approx 256MB)

  Pros: Low write amplification
  Cons: High space amplification (temporarily needs 2x space)
        Reads may need to check multiple SSTables

Leveled Compaction (LCS)

Leveled Compaction
==================

  Each level consists of SSTables with non-overlapping key ranges
  Level N+1 is 10x the size of Level N

  L0: [a-z] [a-z] [a-z]    (key range overlap allowed, each 64MB)
       |
  L1: [a-e] [f-k] [l-p] [q-z]    (no key range overlap, total 640MB)
       |
  L2: [a-b] [c-d] ... [y-z]      (no key range overlap, total 6.4GB)

  Pros: Low read amplification, low space amplification
  Cons: High write amplification (up to 10x per level)
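
The essence of any compaction, regardless of strategy, is "newest version of a key wins, tombstones eventually disappear." Real engines stream-merge sorted files; this dict-based sketch (with `None` standing in for a tombstone) only captures the merge semantics:

```python
def compact(sstables):
    """sstables: newest-first list of dicts mapping key -> value,
    where value None marks a deletion (tombstone).
    Returns the merged, sorted contents; tombstones are dropped,
    as they would be when compacting into the bottommost level."""
    merged = {}
    for table in reversed(sstables):   # apply oldest first...
        merged.update(table)           # ...so newer versions overwrite
    return sorted((k, v) for k, v in merged.items() if v is not None)
```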

4.5 Bloom Filter

A Bloom Filter is a probabilistic data structure that quickly determines whether a key exists in an SSTable.

# Bloom Filter operation principle
# (Python's built-in hash() takes no seed, so we derive one hash
#  function per seed from a cryptographic digest)
import hashlib

class BloomFilter:
    def __init__(self, size, num_hash_functions):
        self.bit_array = [0] * size
        self.size = size
        self.num_hash = num_hash_functions

    def _hash(self, key, seed):
        digest = hashlib.md5(f"{seed}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.size

    def add(self, key):
        for i in range(self.num_hash):
            self.bit_array[self._hash(key, i)] = 1

    def might_contain(self, key):
        """
        True  -> key might exist (false positives possible)
        False -> key definitely does not exist (no false negatives)
        """
        for i in range(self.num_hash):
            if self.bit_array[self._hash(key, i)] == 0:
                return False
        return True

# Usage example: SSTable read optimization
def get(key):
    # 1. Check MemTable
    result = memtable.get(key)
    if result:
        return result

    # 2. Quickly filter using each SSTable's Bloom Filter
    for sstable in sorted_sstables:
        if sstable.bloom_filter.might_contain(key):
            result = sstable.search(key)  # Actual disk I/O
            if result:
                return result
    return None

4.6 Write Amplification

Write Amplification Analysis
==============================

  Original data: 1 byte

  B-Tree write amplification:
    WAL write(1) + page write(1) = approx 2x
    (can actually be higher since entire page may be written)

  LSM-Tree write amplification (Leveled Compaction):
    WAL write(1) + MemTable flush(1) + L0->L1(10) + L1->L2(10) + ...
    = approx 10 * level_count

  Example: 5 levels = approx 50x write amplification!

5. Buffer Pool Management

The Buffer Pool is the core component that caches disk pages in memory.

5.1 Buffer Pool Architecture

Buffer Pool Structure
=====================

  [Hash Table: page_id -> frame_id]
       |
       v
  +------+------+------+------+------+
  |Frame0|Frame1|Frame2|Frame3|Frame4|  <-- Buffer Pool (memory)
  |Page5 |Page12|Page3 |empty |Page8 |
  |dirty |clean |dirty |      |clean |
  +------+------+------+------+------+

  [Free List]: [Frame3]
  [LRU List]:  Frame4 <-> Frame1 <-> Frame0 <-> Frame2
               (least)                           (most)

  Page Table (mapping):
    Page 5  -> Frame 0 (dirty, pin_count=2)
    Page 12 -> Frame 1 (clean, pin_count=0)
    Page 3  -> Frame 2 (dirty, pin_count=1)
    Page 8  -> Frame 4 (clean, pin_count=0)
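
The fetch path sketched below ties the pieces together: hash lookup, LRU eviction of an unpinned frame, and flushing dirty victims before reuse. All names (`BufferPool`, `fetch`, `unpin`) are hypothetical; this is a toy model, not any engine's API:

```python
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                # toy "disk": page_id -> data
        self.frames = OrderedDict()     # page_id -> [data, dirty, pin_count]

    def fetch(self, page_id):
        if page_id in self.frames:
            self.frames.move_to_end(page_id)     # hit: mark most recent
        else:
            if len(self.frames) >= self.capacity:
                # evict the least recently used *unpinned* frame
                for victim, (data, dirty, pin) in self.frames.items():
                    if pin == 0:
                        if dirty:
                            self.disk[victim] = data   # flush before reuse
                        del self.frames[victim]
                        break
                else:
                    raise RuntimeError("all frames pinned")
            self.frames[page_id] = [self.disk[page_id], False, 0]
        self.frames[page_id][2] += 1    # pin for the caller
        return self.frames[page_id][0]

    def unpin(self, page_id, dirty=False):
        frame = self.frames[page_id]
        frame[2] -= 1
        frame[1] = frame[1] or dirty
```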

5.2 Page Replacement Algorithms

LRU (Least Recently Used)

from collections import OrderedDict

class LRUReplacer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def access(self, frame_id):
        if frame_id in self.cache:
            self.cache.move_to_end(frame_id)
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # Remove oldest
            self.cache[frame_id] = True

    def evict(self):
        if self.cache:
            frame_id, _ = self.cache.popitem(last=False)
            return frame_id
        return None

Clock Algorithm (Second Chance)

Clock Algorithm
================

  A clock hand sweeps around finding frames to replace

      [Frame0: ref=1]
     /                \
  [Frame4: ref=0]    [Frame1: ref=1]   <-- clock hand position
     \                /
      [Frame3: ref=1]
         |
      [Frame2: ref=0]

  Replacement process:
  1. Check ref bit of Frame1 where clock hand points
  2. If ref=1, set ref=0 and move to next
  3. When a frame with ref=0 is found, select it for replacement
  4. Frame2 (ref=0) is selected for eviction
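
The sweep above is a handful of lines in code. A sketch of the Clock replacer (frames identified by index; names are illustrative):

```python
class ClockReplacer:
    def __init__(self, num_frames):
        self.ref = [0] * num_frames   # one reference bit per frame
        self.hand = 0

    def access(self, frame_id):
        self.ref[frame_id] = 1        # grant the frame a second chance

    def evict(self):
        """Sweep until a frame with ref=0 is found; clear ref=1 bits on the way."""
        while True:
            if self.ref[self.hand] == 0:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.ref)
                return victim
            self.ref[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.ref)
```

Compared with strict LRU, Clock needs no list reordering on every access (just a bit set), which is why buffer pools under heavy read traffic often prefer it or a variant.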

5.3 Dirty Pages and Checkpoints

Checkpoint Process
==================

  1. Collect dirty page list
     Frame0(Page5): dirty -> needs to be written to disk
     Frame2(Page3): dirty -> needs to be written to disk

  2. Fuzzy Checkpoint (InnoDB approach)
     - Does not write all dirty pages at once
     - Writes incrementally in the background
     - Distributes system load

  3. Record Checkpoint LSN
     - WAL logs before this LSN are not needed for recovery
     - WAL files can be recycled

  Timeline:
  WAL: [LSN:100][LSN:101]...[LSN:200][LSN:201]...
                              ^
                         Checkpoint LSN=200
                         (all changes before LSN 200 safely on disk)

6. WAL (Write-Ahead Log) Deep Dive

WAL is a protocol that writes logs before modifying data. It is the cornerstone of crash recovery.

6.1 How WAL Works

WAL Write Flow
===============

  1. Transaction begins
     BEGIN;

  2. UPDATE users SET name='Bob' WHERE id=1;
     a) Write log record to WAL:
        [LSN:101, TxID:5, Table:users, PK:1, 
         old:{name:'Alice'}, new:{name:'Bob'}]
     b) Modify page in Buffer Pool (in memory only)

  3. COMMIT;
     a) Write commit record to WAL:
        [LSN:102, TxID:5, COMMIT]
     b) fsync WAL to disk (this guarantees COMMIT durability!)
     c) Respond success to client

  Key point: Data pages are lazily written to disk later
  As long as WAL is on disk, recovery after crash is possible

6.2 Log Sequence Number (LSN)

LSN Structure and Role
=======================

  WAL File:
  [LSN:100|INSERT|TxID:3|...][LSN:101|UPDATE|TxID:5|...][LSN:102|COMMIT|TxID:5]

  Buffer Pool Page Header:
  +------------------+
  | Page LSN: 101    |  <-- Last WAL record applied to this page
  | Checksum: 0xAB12 |
  | ...               |
  +------------------+

  Checkpoint LSN: 100
    -> All changes before LSN 100 are reflected on disk
    -> During recovery, redo from LSN 100 onwards

  Flushed LSN: 102
    -> Last LSN written to disk in WAL file

6.3 Group Commit

Group Commit Optimization
==========================

  Individual fsync calls are very expensive (hundreds of microseconds even on SSD)

  Without Group Commit:
    Tx1: write WAL -> fsync -> respond   (10ms)
    Tx2: write WAL -> fsync -> respond   (10ms)
    Tx3: write WAL -> fsync -> respond   (10ms)
    Total: 30ms, 3 fsync calls

  With Group Commit:
    Tx1: write WAL --|
    Tx2: write WAL --|--> single fsync -> respond to all
    Tx3: write WAL --|
    Total: 10ms, 1 fsync call

  Effect: Dramatically reduces fsync count, greatly improving throughput

6.4 Crash Recovery

ARIES Recovery Protocol (Analysis-Redo-Undo)
==============================================

  Phase 1: Analysis
    Scan WAL forward from the last checkpoint to:
    - Identify active transactions at crash time
    - Identify dirty page list

  Phase 2: Redo
    Replay all WAL records after Checkpoint LSN
    - Restore committed transaction changes
    - Also restore uncommitted transaction changes (temporarily)

  Phase 3: Undo
    Reverse uncommitted transaction changes in reverse order

  Example:
  WAL: [Tx1:INSERT][Tx2:UPDATE][Tx1:COMMIT][Tx2:DELETE][CRASH!]
                                                         ^
  Analysis: Tx1=committed, Tx2=active(uncommitted)
  Redo:     Replay Tx1's INSERT, also replay Tx2's UPDATE/DELETE
  Undo:     Reverse Tx2's DELETE, reverse Tx2's UPDATE
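
The three passes can be simulated over a toy log. This sketch omits everything real ARIES relies on (LSNs, page-LSN comparisons, compensation log records) and only shows the redo-everything-then-undo-losers shape:

```python
def recover(wal):
    """wal: list of records, each either
    {"tx": id, "op": "SET", "key": k, "old": prev, "new": val}
    or {"tx": id, "op": "COMMIT"}. Returns the recovered key-value state."""
    # Analysis: which transactions reached COMMIT before the crash?
    committed = {r["tx"] for r in wal if r["op"] == "COMMIT"}
    db = {}
    # Redo: replay every change, committed or not
    for r in wal:
        if r["op"] == "SET":
            db[r["key"]] = r["new"]
    # Undo: reverse uncommitted ("loser") changes in reverse log order
    for r in reversed(wal):
        if r["op"] == "SET" and r["tx"] not in committed:
            db[r["key"]] = r["old"]
    # None models "value does not exist" in this toy
    return {k: v for k, v in db.items() if v is not None}
```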

7. MVCC (Multi-Version Concurrency Control)

MVCC is a concurrency control technique that maintains multiple versions of data so that reads and writes do not block each other.

7.1 PostgreSQL MVCC: xmin/xmax

PostgreSQL MVCC Implementation
================================

  Each tuple (row) includes metadata:

  Tuple Header:
  +----------+----------+--------+
  | xmin     | xmax     | data   |
  +----------+----------+--------+

  xmin: Transaction ID that created (INSERTed) this tuple
  xmax: Transaction ID that deleted (DELETE/UPDATE) this tuple (0 if valid)

  Example: UPDATE users SET age=31 WHERE id=1;

  Before (xmin=100, xmax=0):
  +----------+----------+-------------------+
  | xmin=100 | xmax=0   | id=1, age=30      |  <-- Current version
  +----------+----------+-------------------+

  After (Tx 200 executes UPDATE):
  +----------+----------+-------------------+
  | xmin=100 | xmax=200 | id=1, age=30      |  <-- Previous version (soft delete)
  +----------+----------+-------------------+
  +----------+----------+-------------------+
  | xmin=200 | xmax=0   | id=1, age=31      |  <-- New version
  +----------+----------+-------------------+

Visibility rules:

PostgreSQL Visibility Rules (Snapshot Isolation)
=================================================

  Current transaction's snapshot: active_tx = {150, 180}
  snapshot_xmin = 150 (oldest active transaction)

  Conditions for a tuple to be visible:
  1. xmin is committed AND xmin is before snapshot
  2. xmax is 0, or xmax is not committed, or 
     xmax is after snapshot

  Examples:
  - (xmin=100, xmax=0): Visible (100 is committed, not deleted)
  - (xmin=100, xmax=200): Not visible if Tx200 is committed
  - (xmin=150, xmax=0): Not visible (150 is still active)
  - (xmin=120, xmax=180): Visible (180 is active, so delete not yet visible)
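
The rules above condense into a small predicate. A simplified sketch that takes the snapshot's active set and the set of committed transactions as explicit inputs (real PostgreSQL consults the commit log and snapshot xmin/xmax instead):

```python
def visible(xmin, xmax, active, committed):
    """Is a tuple with this (xmin, xmax) visible to the snapshot?"""
    if xmin in active or xmin not in committed:
        return False              # creator not committed before snapshot
    if xmax == 0:
        return True               # tuple was never deleted
    if xmax in active or xmax not in committed:
        return True               # deletion not yet visible to us
    return False                  # deleted by a committed transaction
```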

7.2 MySQL InnoDB MVCC: Read View

InnoDB Read View Structure
===========================

  Snapshot at Read View creation time:
  {
    creator_trx_id: 200,     // Current transaction ID
    trx_ids: [150, 180],     // Active transactions at creation time
    up_limit_id: 150,        // Minimum of trx_ids
    low_limit_id: 201        // Next transaction ID to be assigned
  }

  Version chain through Undo Log:
  
  Current Row:  {id:1, name:'Carol', trx_id:200}
       |
       v (undo log pointer)
  Undo v1:     {id:1, name:'Bob',   trx_id:180}
       |
       v (undo log pointer)
  Undo v2:     {id:1, name:'Alice', trx_id:100}

  Visibility check:
  - trx_id=200 (creator): modified by self -> visible
  - trx_id less than up_limit_id(150): definitely committed -> visible
  - trx_id at or above low_limit_id(201): future transaction -> not visible
  - trx_id in trx_ids: active transaction -> not visible
  - Otherwise: committed -> visible
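
The Read View check is a straight translation of those five cases into code (a sketch; InnoDB additionally walks the undo chain until it finds the first visible version):

```python
def rv_visible(trx_id, creator, trx_ids, up_limit, low_limit):
    """Visibility of a row version stamped trx_id under the given Read View."""
    if trx_id == creator:
        return True                    # own changes are always visible
    if trx_id < up_limit:
        return True                    # committed before every active tx
    if trx_id >= low_limit:
        return False                   # started after the snapshot was taken
    return trx_id not in trx_ids       # in range: visible unless still active
```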

8. Transaction Isolation Levels

8.1 Four Isolation Levels

Isolation Levels and Allowed Anomalies
========================================

  Isolation Level     | Dirty Read | Non-Repeatable | Phantom Read
  =================== | ========== | ============== | ============
  READ UNCOMMITTED    |  Allowed   |    Allowed     |   Allowed
  READ COMMITTED      | Prevented  |    Allowed     |   Allowed
  REPEATABLE READ     | Prevented  |   Prevented    |   Allowed*
  SERIALIZABLE        | Prevented  |   Prevented    |  Prevented

  * MySQL InnoDB's REPEATABLE READ also prevents Phantom Read via Gap Locks

8.2 Anomaly Examples

-- Dirty Read (READ UNCOMMITTED)
-- Tx1:
BEGIN;
UPDATE accounts SET balance = 500 WHERE id = 1; -- originally 1000
-- Not yet COMMITted!

-- Tx2 (READ UNCOMMITTED):
SELECT balance FROM accounts WHERE id = 1;
-- Result: 500 (reading uncommitted data!)

-- Tx1:
ROLLBACK; -- 500 is now invalidated
-- Tx2 may have made decisions based on incorrect data

-- Non-Repeatable Read (READ COMMITTED)
-- Tx1:
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
-- Result: 1000

-- Tx2:
BEGIN;
UPDATE accounts SET balance = 500 WHERE id = 1;
COMMIT;

-- Tx1:
SELECT balance FROM accounts WHERE id = 1;
-- Result: 500 (different result within same transaction!)

-- Phantom Read (REPEATABLE READ, non-InnoDB)
-- Tx1:
BEGIN;
SELECT COUNT(*) FROM orders WHERE status = 'pending';
-- Result: 5

-- Tx2:
BEGIN;
INSERT INTO orders (status) VALUES ('pending');
COMMIT;

-- Tx1:
SELECT COUNT(*) FROM orders WHERE status = 'pending';
-- Result: 6 (new row appeared - phantom!)

9. Lock Mechanisms

9.1 Lock Types

InnoDB Lock Types
==================

  1. Record Lock
     Locks a specific index record
     Example: WHERE id = 5 -> locks only the id=5 record

  2. Gap Lock
     Locks the gap between index records
     Example: WHERE id BETWEEN 5 AND 10
         -> prevents new record insertion between (5, 10)

  3. Next-Key Lock
     Record Lock + Gap Lock
     Example: WHERE id = 5 
         -> locks id=5 record + gap to previous record
     Default locking mode for InnoDB REPEATABLE READ

  4. Intention Lock
     Table-level intent declaration
     - IS (Intention Shared): intends to acquire row-level S lock
     - IX (Intention Exclusive): intends to acquire row-level X lock

9.2 Intention Lock Compatibility Matrix

Intention Lock Compatibility
=============================

           | IS         | IX         | S          | X
  ---------|------------|------------|------------|----------
  IS       | Compatible | Compatible | Compatible | Conflict
  IX       | Compatible | Compatible | Conflict   | Conflict
  S        | Compatible | Conflict   | Compatible | Conflict
  X        | Conflict   | Conflict   | Conflict   | Conflict

  Example:
  Tx1: SELECT ... FOR SHARE  -> IS(table) + S(row)
  Tx2: SELECT ... FOR UPDATE -> IX(table) + X(row)
  
  IS and IX are compatible -> no blocking at table level
  S and X on same row -> Tx2 waits at row level
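
The matrix is small enough to encode as a lookup table, which is roughly how a lock manager answers "can this request be granted?" (a sketch; sets of compatible modes per held mode):

```python
# For each held table lock, the set of request modes it is compatible with.
COMPAT = {
    "IS": {"IS", "IX", "S"},
    "IX": {"IS", "IX"},
    "S":  {"IS", "S"},
    "X":  set(),           # exclusive conflicts with everything
}

def compatible(held, requested):
    """True if a lock in `requested` mode can coexist with one in `held` mode."""
    return requested in COMPAT[held]
```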

9.3 Deadlock Detection

-- Deadlock scenario
-- Tx1:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- Lock A

-- Tx2:
BEGIN;
UPDATE accounts SET balance = balance - 200 WHERE id = 2;  -- Lock B

-- Tx1:
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- Wait for Lock B

-- Tx2:
UPDATE accounts SET balance = balance + 200 WHERE id = 1;  -- Wait for Lock A
-- DEADLOCK! Tx1 -> Lock B -> Tx2 -> Lock A -> Tx1

Deadlock Detection: Wait-for Graph
====================================

  Tx1 ---waits for---> Tx2
   ^                    |
   |                    |
   +---waits for--------+

  Cycle detected! -> Roll back the lower-cost transaction

  InnoDB approach:
  - Periodically checks the wait-for graph
  - Selects victim with least undo log when cycle is found
  - ERROR 1213 (40001): Deadlock found when trying to get lock
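
Cycle detection in the wait-for graph is a standard depth-first search for a back edge. A sketch (victim selection by undo-log size, as InnoDB does, is omitted):

```python
def has_cycle(waits_for):
    """waits_for: dict mapping a transaction to the set of transactions
    it is waiting on. Returns True if the graph contains a cycle."""
    color = {}  # absent=unvisited, "gray"=on current path, "black"=done

    def dfs(tx):
        color[tx] = "gray"
        for nxt in waits_for.get(tx, ()):
            c = color.get(nxt)
            if c == "gray":
                return True            # back edge: deadlock cycle
            if c is None and dfs(nxt):
                return True
        color[tx] = "black"
        return False

    return any(color.get(tx) is None and dfs(tx) for tx in waits_for)
```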

10. Query Optimizer

10.1 Query Processing Pipeline

Query Processing Flow
======================

  SQL Query
     |
     v
  [Parser] --> AST (Abstract Syntax Tree)
     |
     v
  [Binder/Analyzer] --> table/column existence check, type validation
     |
     v
  [Query Rewriter] --> view expansion, subquery transformation
     |
     v
  [Optimizer] --> select optimal execution plan
     |
     +---> [Cost Estimator] --> estimate cost of each plan
     +---> [Plan Enumerator] --> enumerate possible execution plans
     |
     v
  [Execution Engine] --> access data according to execution plan

10.2 Cost-Based Optimization

Cost Estimation Factors
========================

  1. I/O cost: Number of pages to read from disk
     - Sequential I/O: 1.0 (base unit)
     - Random I/O: 4.0 (SSD) ~ 40.0 (HDD)

  2. CPU cost: Tuple processing cost
     - Comparison operation: 0.01
     - Hash computation: 0.02

  3. Memory cost: Sorting, hash tables, etc.

  Cost formula example (simplified):
  Total Cost = (pages * seq_page_cost) + (tuples * cpu_tuple_cost)

  Example: Full table scan
  pages = 1000, seq_page_cost = 1.0
  tuples = 100000, cpu_tuple_cost = 0.01
  Total = 1000 * 1.0 + 100000 * 0.01 = 2000
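
The simplified formula above, as a function (constants mirror the worked example; real optimizers use many more terms):

```python
def scan_cost(pages, tuples, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    """Simplified sequential-scan cost: I/O term plus per-tuple CPU term."""
    return pages * seq_page_cost + tuples * cpu_tuple_cost
```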

10.3 Join Algorithms and Costs

Join Algorithm Comparison
==========================

  1. Nested Loop Join
     for each row r in R:        -- |R| iterations
       for each row s in S:      -- |S| iterations
         if r.key == s.key:
           emit (r, s)
     Cost: O(|R| * |S|)
     Best for: small tables, when indexes exist

  2. Hash Join
     Phase 1 (Build):
       for each row r in R:
         hash_table[hash(r.key)] = r
     Phase 2 (Probe):
       for each row s in S:
         if hash_table[hash(s.key)] exists:
           emit (match, s)
     Cost: O(|R| + |S|)
     Best for: large tables, equi-joins

  3. Sort-Merge Join
     Phase 1: Sort R by key
     Phase 2: Sort S by key
     Phase 3: Merge sorted R and S
     Cost: O(|R|log|R| + |S|log|S| + |R| + |S|)
     Best for: pre-sorted data, range joins

10.4 Index Selection and Statistics

-- Checking PostgreSQL statistics
SELECT
    tablename,
    attname,
    n_distinct,    -- Number of distinct values (-1 means all unique)
    correlation,   -- Correlation between physical and logical ordering
    most_common_vals,
    most_common_freqs,
    histogram_bounds
FROM pg_stats
WHERE tablename = 'users';

-- Example result:
-- attname: status
-- n_distinct: 5
-- most_common_vals: {active, inactive, pending, banned, deleted}
-- most_common_freqs: {0.65, 0.20, 0.10, 0.03, 0.02}
-- correlation: 0.12 (unrelated to physical order)

Index Selection Criteria
=========================

  Query: SELECT * FROM users WHERE status = 'active'

  Option 1: Full Table Scan
    pages: 1000, selectivity: 0.65 (65% of rows are 'active')
    cost: 1000 * 1.0 = 1000

  Option 2: Index Scan (status index)
    index_pages: 10, table_pages: 650 (random I/O)
    cost: 10 * 1.0 + 650 * 4.0 = 2610

  Conclusion: High selectivity (65%) makes Full Table Scan more efficient!

  General rule: Index use is beneficial when selectivity is below 5-15%
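
The comparison above generalizes once selectivity is a parameter. A sketch with the same cost constants as the example (`full_scan_cost` and `index_scan_cost` are illustrative names):

```python
def full_scan_cost(table_pages, seq_cost=1.0):
    """Sequential read of every table page."""
    return table_pages * seq_cost

def index_scan_cost(index_pages, table_pages, selectivity,
                    seq_cost=1.0, rand_cost=4.0):
    """Read the index, then fetch matching table pages via random I/O."""
    return index_pages * seq_cost + table_pages * selectivity * rand_cost
```

Plugging in the numbers from the example shows the crossover: at 65% selectivity the index scan costs 2610 against 1000 for the full scan, while at 5% it would cost only 210.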

11. InnoDB Internals

11.1 InnoDB Architecture Overview

InnoDB Architecture
====================

  In-Memory:
  +-------------------------------------------------------+
  | Buffer Pool                                            |
  |  [Data Pages] [Index Pages] [Undo Pages] [Lock Info]   |
  |  [Adaptive Hash Index]                                 |
  |  [Change Buffer]                                       |
  +-------------------------------------------------------+
  | Log Buffer                                             |
  +-------------------------------------------------------+

  On-Disk:
  +----------------------------+----------------------------+
  | System Tablespace          | Redo Log Files             |
  |  (ibdata1)                 |  (ib_logfile0, ib_logfile1)|
  |  - Data Dictionary         |  - WAL records             |
  |  - Doublewrite Buffer      |                            |
  |  - Change Buffer           |                            |
  |  - Undo Logs               |                            |
  +----------------------------+----------------------------+
  | Per-Table Tablespace (.ibd files)                       |
  |  - Table Data + Index                                   |
  +--------------------------------------------------------+

11.2 Change Buffer

Change Buffer Operation
========================

  Scenario: Secondary Index page is not in Buffer Pool

  Without Change Buffer:
    1. Read page from disk (Random I/O!)
    2. Modify page
    3. Mark as dirty

  With Change Buffer:
    1. Record change in Change Buffer (memory)
    2. Merge later when page is read
    
  Effect: Converts random I/O to sequential I/O
  Good for: write-heavy workloads with many secondary indexes
  Bad for: unique indexes (need duplicate checking)

  Configuration:
  innodb_change_buffering = all  -- inserts, deletes, purges
  innodb_change_buffer_max_size = 25  -- 25% of buffer pool

11.3 Adaptive Hash Index (AHI)

Adaptive Hash Index
====================

  InnoDB automatically builds hash indexes on frequently accessed B+Tree pages

  Normal B+Tree access: O(log N) - 3-4 page accesses
  AHI access: O(1) - direct hash table lookup

  How it works:
  1. Monitors access frequency of specific index patterns
  2. Automatically creates hash index when threshold exceeded
  3. Automatically rebuilds/removes hash index when patterns change

  Monitoring:
  SHOW ENGINE INNODB STATUS;
  -- Hash table size 34679, used cells 1234
  -- 10.52 hash searches/s, 3.48 non-hash searches/s

  Note: Can actually degrade performance depending on workload
  innodb_adaptive_hash_index = ON|OFF

11.4 Doublewrite Buffer

Doublewrite Buffer Purpose
============================

  Problem: Partial Page Write (Torn Page)
  - OS page size (4KB) differs from InnoDB page size (16KB)
  - During a crash, only part of a 16KB page may have been written
  - WAL cannot recover this (WAL only records the changed portion)

  Solution: Doublewrite Buffer
  1. Write dirty pages first to doublewrite buffer (contiguous area)
  2. Then write to actual location
  3. On crash: recover complete page from doublewrite buffer

  Flow:
  [Dirty Page] --> [Doublewrite Buffer (disk)] --> [Actual Table File]
                   (sequential write, 2MB chunks)   (random write)

12. RocksDB Internals

12.1 RocksDB Architecture

RocksDB Architecture
=====================

  Write Path:
  [Put/Delete] --> [WAL] --> [MemTable (Active)]
                                    |
                              (switch when full)
                                    |
                                    v
                            [MemTable (Immutable)]
                                    |
                              (background flush)
                                    |
                                    v
                            [L0 SST Files]
                                    |
                           (background compaction)
                                    |
                                    v
                            [L1 SST Files]
                                    |
                                   ...
                                    |
                                    v
                            [Ln SST Files]

  Read Path:
  [Get] --> [MemTable] --> [L0 SSTs] --> [L1 SSTs] --> ... --> [Ln SSTs]
             (bloom)        (bloom)       (bloom)               (bloom)

12.2 Column Families

Column Families Concept
========================

  Logically separated key-value namespaces within a single DB instance

  CF "default":     user:1 -> {"name":"Alice"}
  CF "metadata":    schema_version -> "3.2"
  CF "timeseries":  ts:1710000000 -> {"temp":23.5}

  Each Column Family has independent:
  - MemTable
  - SSTable set
  - Compaction settings
  - Compression settings

  Usage example:
  Options options;
  ColumnFamilyOptions cf_opts;
  cf_opts.compaction_style = kCompactionStyleLevel;
  cf_opts.write_buffer_size = 128 << 20;  // 128MB MemTable

  // Timeseries CF uses FIFO compaction (auto-delete old data)
  ColumnFamilyOptions ts_opts;
  ts_opts.compaction_style = kCompactionStyleFIFO;
  ts_opts.compaction_options_fifo.max_table_files_size = 1 << 30; // 1GB

12.3 Rate Limiter

Rate Limiter Configuration
============================

  Purpose: Limit compaction and flush I/O to protect
           foreground read/write performance

  Configuration example:
  options.rate_limiter.reset(
      NewGenericRateLimiter(
          100 * 1024 * 1024,   // rate_bytes_per_sec: 100MB/s
          100 * 1000,          // refill_period_us: 100ms
          10,                  // fairness
          RateLimiter::Mode::kWritesOnly
      )
  );

  Behavior:
  - Prevents compaction/flush from writing more than 100MB/s
  - Ensures foreground requests can secure I/O bandwidth
  - Can be dynamically adjusted: low during peak, high during off-peak
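
The behavior described above amounts to a token bucket: each refill period adds a budget of bytes, and background I/O must draw from it. A minimal Python sketch (the `TokenBucket` class is a hypothetical illustration mirroring the `rate_bytes_per_sec` and `refill_period_us` parameters, not the actual RocksDB implementation):

```python
class TokenBucket:
    """Toy rate limiter: compaction/flush I/O must acquire byte tokens."""
    def __init__(self, rate_bytes_per_sec: int, refill_period_us: int = 100_000):
        # Bytes granted per refill period, e.g. 100MB/s -> 10MB per 100ms
        self.refill_bytes = rate_bytes_per_sec * refill_period_us // 1_000_000
        self.available = self.refill_bytes

    def refill(self):
        self.available = self.refill_bytes  # called once per refill period

    def request(self, nbytes: int) -> bool:
        """True if the I/O may proceed now, False if it must wait."""
        if nbytes <= self.available:
            self.available -= nbytes
            return True
        return False

rl = TokenBucket(100 * 1024 * 1024)         # 100MB/s
print(rl.refill_bytes)                      # 10485760 bytes = 10MB per 100ms
assert rl.request(8 * 1024 * 1024)          # 8MB granted
assert not rl.request(8 * 1024 * 1024)      # budget exhausted -> must wait
rl.refill()                                 # next period restores the budget
assert rl.request(8 * 1024 * 1024)
```

In the real limiter, a writer that cannot get tokens blocks until the next refill, which is exactly how background I/O is smoothed below the configured rate.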

13. Practical Performance Tuning Tips

13.1 Index Optimization

-- Eliminate double lookup with a covering index
-- Before: secondary index -> clustered index (bookmark lookup)
SELECT name, email FROM users WHERE status = 'active';
-- With only index(status), table access is still needed for name, email

-- After: covering index
-- (INCLUDE is PostgreSQL 11+ syntax; in MySQL, add the columns to the key:
--  CREATE INDEX idx_status_covering ON users(status, name, email);)
CREATE INDEX idx_status_covering ON users(status) INCLUDE (name, email);
-- Query answered from the index alone (Index Only Scan)

-- Composite index column order matters (Leftmost Prefix Rule)
CREATE INDEX idx_abc ON orders(customer_id, order_date, status);

-- Queries that CAN use this index:
SELECT * FROM orders WHERE customer_id = 100;                    -- O
SELECT * FROM orders WHERE customer_id = 100 AND order_date > '2025-01-01'; -- O
SELECT * FROM orders WHERE customer_id = 100 AND status = 'shipped';        -- O (customer_id prefix used; status filtered within the index)

-- Queries that CANNOT use this index:
SELECT * FROM orders WHERE order_date > '2025-01-01';            -- X
SELECT * FROM orders WHERE status = 'shipped';                   -- X
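
The leftmost-prefix rule behind these O/X marks can be expressed as a small matcher (hypothetical helper, Python for illustration: matching stops at the first indexed column missing from the WHERE clause):

```python
def usable_prefix(index_cols: list, where_cols: set) -> list:
    """Return the index columns usable as an access path under the
    leftmost-prefix rule. Columns after the first gap can still be
    filtered against index entries, but cannot narrow the scan range."""
    prefix = []
    for col in index_cols:
        if col not in where_cols:
            break  # gap found: the remaining columns cannot seed the lookup
        prefix.append(col)
    return prefix

idx = ["customer_id", "order_date", "status"]
assert usable_prefix(idx, {"customer_id"}) == ["customer_id"]
assert usable_prefix(idx, {"customer_id", "order_date"}) == ["customer_id", "order_date"]
assert usable_prefix(idx, {"customer_id", "status"}) == ["customer_id"]  # status filtered later
assert usable_prefix(idx, {"order_date"}) == []   # no leftmost column -> index unusable
assert usable_prefix(idx, {"status"}) == []
```

This is why the column you filter on most often belongs first in a composite index.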

13.2 Buffer Pool Tuning

# MySQL InnoDB Buffer Pool configuration
[mysqld]
# Allocate 70-80% of total memory to Buffer Pool
innodb_buffer_pool_size = 12G

# Split into multiple instances to reduce contention
innodb_buffer_pool_instances = 8

# Preserve Buffer Pool contents across restarts
innodb_buffer_pool_dump_at_shutdown = ON
innodb_buffer_pool_load_at_startup = ON

# Monitoring
SHOW ENGINE INNODB STATUS;
-- Buffer pool hit rate: 999/1000 (99.9% or higher is ideal)
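
The hit rate can also be computed from the `Innodb_buffer_pool_read_requests` and `Innodb_buffer_pool_reads` counters in SHOW GLOBAL STATUS; a short sketch:

```python
def buffer_pool_hit_rate(read_requests: int, disk_reads: int) -> float:
    """Hit rate from InnoDB status counters:
    Innodb_buffer_pool_read_requests = logical page reads (memory + disk)
    Innodb_buffer_pool_reads         = reads that missed and went to disk."""
    return 1.0 - disk_reads / read_requests

# Example: 1,000,000 logical reads, 800 of which had to hit disk
rate = buffer_pool_hit_rate(1_000_000, 800)
print(f"{rate:.2%}")  # 99.92% -> above the 99.9% target
```

Sampling the two counters twice and computing the rate over the delta gives the current hit rate rather than the average since server start.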

13.3 WAL/Redo Log Tuning

# InnoDB Redo Log configuration
[mysqld]
# Redo log size (too small = frequent checkpoints, too large = longer recovery)
innodb_redo_log_capacity = 4G

# fsync strategy
innodb_flush_log_at_trx_commit = 1  # Safest (default)
# = 0: fsync every second (up to 1 second of data loss possible)
# = 1: fsync per commit (safest, slowest)
# = 2: write to OS buffer per commit, fsync every second

14. Quiz

Test your understanding of database internals.

Q1. Why are B+Tree leaf nodes linked together as a linked list?

A: For range scan optimization. For example, in a query like WHERE age BETWEEN 20 AND 30, the B+Tree can find the leaf node for age=20 and then sequentially scan through the linked list until age=30. Without linked leaf nodes, it would need to traverse from the root for each value. This is the key reason B+Tree is far more efficient than pure B-Tree for range queries.
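
The answer can be illustrated with a minimal sketch (hypothetical `Leaf`/`range_scan`, Python): one root-to-leaf descent finds the starting leaf, and the rest of the range is read by following sibling pointers.

```python
class Leaf:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.next = None  # sibling pointer that makes range scans sequential

def range_scan(first_leaf: Leaf, lo, hi) -> list:
    """Collect all keys in [lo, hi] starting from the leaf that a single
    root-to-leaf descent located, without re-descending from the root."""
    out, leaf = [], first_leaf
    while leaf:
        for k in leaf.keys:
            if k > hi:
                return out  # keys are ordered, so we can stop early
            if k >= lo:
                out.append(k)
        leaf = leaf.next  # hop to the right sibling
    return out

a, b, c = Leaf([18, 20, 22]), Leaf([24, 27, 29]), Leaf([30, 33])
a.next, b.next = b, c  # leaves linked left to right
assert range_scan(a, 20, 30) == [20, 22, 24, 27, 29, 30]
```

Without `next` pointers, fetching each successor would cost another O(log n) descent, which is the inefficiency the answer describes.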

Q2. Why does write amplification occur in LSM-Tree's Leveled Compaction?

A: In Leveled Compaction, when an SSTable from Level N is merged into Level N+1, all SSTables in Level N+1 that have overlapping key ranges must be read, merged, and rewritten. Since Level N+1 is approximately 10x the size of Level N, a single SSTable may merge with up to 10 SSTables. This process repeats at each level, so total write amplification is approximately 10 times the number of levels. The tradeoff is lower read amplification and lower space amplification.
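
The arithmetic in this answer can be made concrete with a rough estimate (illustrative defaults of a 256MB L1 and a fanout of 10; real write amplification depends on workload and tuning):

```python
import math

def leveled_write_amp(total_bytes: int, l1_bytes: int = 256 << 20, fanout: int = 10):
    """Rough estimate for leveled compaction: each byte is rewritten about
    `fanout` times per level it passes through, so WA ~ fanout * levels."""
    levels = max(1, math.ceil(math.log(total_bytes / l1_bytes, fanout)))
    return levels, fanout * levels

levels, wa = leveled_write_amp(1 << 40)  # 1TB of data
print(levels, wa)  # 4 levels beyond L0, ~40x write amplification
```

This is why write-heavy workloads sometimes prefer universal (tiered) compaction, trading the lower read/space amplification mentioned above for less rewriting.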

Q3. Explain why VACUUM is needed in PostgreSQL from an MVCC perspective.

A: PostgreSQL's MVCC does not immediately remove previous tuple versions during UPDATE or DELETE operations. Instead, it sets the xmax field to perform a "soft delete." These leftover previous versions are called "dead tuples." Over time, dead tuples that no active transaction references accumulate, wasting disk space and degrading index performance. VACUUM cleans up these dead tuples to make space reclaimable. While autovacuum performs this automatically, manual VACUUM may be needed after heavy bulk updates.

Q4. Explain how InnoDB's Next-Key Lock prevents Phantom Reads.

A: Next-Key Lock is a combination of Record Lock and Gap Lock. For example, in the query SELECT * FROM orders WHERE amount BETWEEN 100 AND 200 FOR UPDATE, InnoDB locks not only the records matching the condition (Record Lock) but also the gaps between records (Gap Lock). This prevents other transactions from inserting new rows into that range, thereby preventing Phantom Reads. This is why MySQL InnoDB can prevent Phantom Reads even at the REPEATABLE READ isolation level.

Q5. What actions should you take when Buffer Pool hit rate is low (e.g., 90%)?

A: A low Buffer Pool hit rate means frequent disk I/O is occurring. Remediation steps include: (1) Increase innodb_buffer_pool_size. On dedicated DB servers, typically allocate 70-80% of physical memory. (2) Optimize queries to reduce unnecessary full table scans. Adding proper indexes reduces the number of pages accessed. (3) Increase innodb_buffer_pool_instances to reduce mutex contention. (4) If the working set (frequently accessed data) is larger than the Buffer Pool, consider data partitioning or adding a cache layer (Redis). The target should be 99% or higher hit rate.


15. References

  1. Designing Data-Intensive Applications - Martin Kleppmann (O'Reilly, 2017)
  2. Database Internals - Alex Petrov (O'Reilly, 2019)
  3. MySQL Internals Manual - https://dev.mysql.com/doc/internals/en/
  4. PostgreSQL Documentation: MVCC - https://www.postgresql.org/docs/current/mvcc.html
  5. RocksDB Wiki - https://github.com/facebook/rocksdb/wiki
  6. InnoDB Buffer Pool - https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool.html
  7. B-Tree Visualizer - https://www.cs.usfca.edu/~galles/visualization/BTree.html
  8. The Log-Structured Merge-Tree (LSM-Tree) - Patrick O'Neil et al. (1996)
  9. ARIES: A Transaction Recovery Method - C. Mohan et al. (1992)
  10. Architecture of a Database System - Hellerstein, Stonebraker, Hamilton (2007)
  11. LevelDB Implementation Notes - https://github.com/google/leveldb/blob/main/doc/impl.md
  12. CMU Database Group Lectures - https://15445.courses.cs.cmu.edu/
  13. Use The Index, Luke - https://use-the-index-luke.com/