LSM-Tree Deep Dive — RocksDB, LevelDB, Compaction, Bloom Filter, Write Amplification (2025)

Why yet another storage engine

The previous post dove into PostgreSQL and the elegance of B-trees — the data structure that dominated databases for decades. Then in the 2010s a different family quietly took over: Cassandra, HBase, LevelDB, RocksDB, ScyllaDB, InfluxDB, TiKV, CockroachDB, FoundationDB, ClickHouse MergeTree. All built on LSM-Tree (Log-Structured Merge Tree) or a variant.

Why LSM? B-tree's weakness is clear — random writes. Every update finds a page, modifies it, writes WAL, and fsyncs to an arbitrary disk location. Deadly on HDDs; still painful on SSDs because of write amplification and wear leveling.

LSM's answer: write only sequentially to disk. Never touch existing data. Express updates as new records, deletes as tombstones. Merge (compact) later in the background.

This simple idea boosts write throughput 10–100x and powers modern NoSQL. This post covers LSM from the 1996 paper through RocksDB's latest compaction strategies, Bloom filters, the RUM conjecture, real systems (Cassandra, ScyllaDB, TiKV), and coexistence with B-trees.


1. Origins — the 1996 insight

1.1 The O'Neil paper

O'Neil, Cheng, Gawlick, and O'Neil proposed LSM-Tree in 1996. Problem: B-tree updates were random I/O, HDD seeks were ~10ms (100 IOPS), TPC-B benchmarks outgrew B-trees.

Key idea: batch writes in memory, flush to disk as large sequential chunks. Sequential I/O is 100–1000x faster than random on HDDs. Merging with on-disk data is itself a sequential merge.

The idea reached industry through Google's Bigtable (paper published 2006), then LevelDB (Google OSS, 2011), RocksDB (Facebook's fork, 2013), Cassandra, and HBase.

1.2 The RUM conjecture

Athanassoulis et al. (2016). Every storage engine tries to optimize:

  • R (Read amp): disk reads per logical read.
  • U (Update amp = Write amp): bytes written per logical write.
  • M (Memory amp = Space amp): disk usage vs logical size.

You cannot minimize all three simultaneously. Reducing one increases another.

  • B-tree: good R (log N), bad U (random), medium M.
  • LSM: bad R (multiple levels), good U (sequential), variable M.
  • Hash table: best R (O(1)), bad U, bad M.

Engine design is picking where to sit in this triangle.


2. Structure — three simple layers

2.1 Basic anatomy

+----------+
| Memtable |  <- in-memory, sorted (skiplist)
+----------+
     |
     | flush
     v
+------------+------------+------------+
| SSTable L0 | SSTable L0 | SSTable L0 |
+------------+------------+------------+
     |
     | compaction
     v
+----------+----------+----------+
| SST L1   | SST L1   | SST L1   |
+----------+----------+----------+
     |
     v
+-----+-----+-----+
| L2  | L2  | L2  | ...
+-----+-----+-----+

  • Memtable: in-memory sorted structure, typically a skiplist (RocksDB, LevelDB).
  • WAL: disk log protecting the memtable against crashes.
  • SSTable (Sorted String Table): immutable sorted file.
  • Compaction: merges SSTables from one level into the next, deeper level (L0 → L1 → …).

2.2 Write path

PUT(key, value)
  1. Append (key, value) to WAL + optional fsync
  2. Insert into Memtable
  3. If Memtable exceeds threshold:
       - Allocate new Memtable (writes continue)
       - Mark old one immutable (flush candidate)
       - Background thread flushes to new L0 SSTable
       - Discard WAL portion for that memtable

Key insight: writes only touch memory. Disk writes are sequential (WAL append, then SSTable flush).

2.3 WAL

Same concept as Postgres — reconstructs memtable on crash. Differences:

  • LSM's WAL is discarded after memtable flush (Postgres keeps until checkpoint).
  • SSTables themselves preserve past state; WAL only protects memtable.

RocksDB's WriteOptions.sync controls per-write durability (fsync the WAL before acknowledging); group commit batches many writers' fsyncs into one.
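
A minimal sketch of the write-side knobs, assuming the standard RocksDB C++ API (the path and keys are illustrative):

#include <rocksdb/db.h>

rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;
rocksdb::DB::Open(options, "/tmp/lsm-demo", &db);

rocksdb::WriteOptions wo;
wo.sync = true;          // fsync the WAL before acknowledging: durable but slower
// wo.disableWAL = true; // skip the WAL entirely: fast, but the memtable is lost on crash
db->Put(wo, "key1", "value1");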

2.4 SSTable — the virtue of immutability

+-----------------+
| Data Block 1    |  sorted key-value pairs
+-----------------+
| Data Block 2    |
+-----------------+
| ...             |
+-----------------+
| Meta Block      |  Bloom filter, stats
+-----------------+
| Meta Index      |
+-----------------+
| Data Index      |  first key of each data block
+-----------------+
| Footer          |  fixed size, points to meta index
+-----------------+

Immutability is decisive. Nothing writes while you read → lock-free reads. Backup is trivial file copy. Caching is efficient — blocks never change.

2.5 Why a skiplist memtable

RocksDB's default memtable is a skiplist. Why not B+tree or red-black tree?

  1. Lock-free concurrency: CAS-based inserts work naturally.
  2. Simple implementation: no rebalancing.
  3. Range-scan friendly: leaf already a linked list.

Downside: worse cache locality than a B+tree for random lookups. RocksDB offers alternative memtable representations: skiplist (default), hash-skiplist (prefix lookups, requires a prefix extractor), vector (bulk loading).
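
A sketch of switching the memtable representation through options (the 8-byte prefix length is an arbitrary example):

#include <rocksdb/options.h>
#include <rocksdb/memtablerep.h>
#include <rocksdb/slice_transform.h>

rocksdb::Options options;
// Default: skiplist memtable.
options.memtable_factory.reset(new rocksdb::SkipListFactory());
// Hash-skiplist: needs a prefix extractor; good for lookups within a key prefix.
options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));
options.memtable_factory.reset(rocksdb::NewHashSkipListRepFactory());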


3. Read path — LSM's Achilles heel

3.1 Search many places

GET(key)
  1. Check active memtable
  2. Check immutable memtables
  3. Check ALL L0 SSTables (overlapping ranges)
  4. Check L1, L2, ... — one SSTable per level (non-overlapping)

Worst case: 1 memtable + 5 L0 files + 6 deeper levels = 12 lookups. Compare to B-tree's O(log N) page reads: brutal.

3.2 Bloom filter — fast "not here"

Probabilistic: "definitely absent" or "probably present".

- k hash functions
- m-bit array
- insert: set k hashed positions to 1
- query: all k positions 1 → "maybe", else → "no"

False-positive rate P = (1 - e^(-kn/m))^k. Typical: 10 bits/key → ~1% FPR. 1B keys costs ~1.25GB of memory; worth it.
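
Configuring this in RocksDB, a sketch assuming the block-based table format (10 bits/key ≈ 1% FPR as above):

#include <rocksdb/table.h>
#include <rocksdb/filter_policy.h>

rocksdb::BlockBasedTableOptions table_opts;
table_opts.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));  // ~1% false positives
rocksdb::Options options;
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));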

3.3 Block cache

Hot blocks cached separately from OS page cache. RocksDB:

block_cache_size = 16GB
cache_index_and_filter_blocks = true
pin_l0_filter_and_index_blocks_in_cache = true

Blocks stored decompressed in cache even if compressed on disk.
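
The equivalent C++ configuration, a sketch (the 16GB figure mirrors the example above):

#include <rocksdb/cache.h>
#include <rocksdb/table.h>

rocksdb::Options options;
rocksdb::BlockBasedTableOptions table_opts;
table_opts.block_cache = rocksdb::NewLRUCache(16UL << 30);            // 16GB LRU cache
table_opts.cache_index_and_filter_blocks = true;                      // count index/filter blocks against the cache
table_opts.pin_l0_filter_and_index_blocks_in_cache = true;            // never evict L0 metadata
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));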

3.4 Partitioned index

A 100GB DB may have 1GB of index. Solution: two-level index — a small top-level in memory, detailed index blocks on disk.

3.5 Point lookup vs range scan

Bloom filters help point lookups only. Range scans must visit every level. LSM is weaker for range-heavy workloads; TSDBs partially offset this via hot-recent query patterns.


4. Compaction — LSM's heart and nightmare

4.1 Why compact

Without compaction:

  1. Read amp explodes — L0 grows unbounded.
  2. Space amp — deleted/old versions linger forever.
  3. File-descriptor exhaustion.

Compaction merges SSTables, keeping only latest versions, removing tombstones eventually, and resolving overlaps.

4.2 Size-Tiered (STCS) — Cassandra default

When N (default 4) same-size-class SSTables accumulate, merge into one.

start:  [10MB] [10MB] [10MB] [10MB]
merge:  [40MB]
later:  [40MB] [40MB] [40MB] [40MB]
merge:  [160MB]

Pros: low write amp. Cons: high space amp — up to 2x during compaction.

4.3 Leveled (LCS) — LevelDB/RocksDB default

Size limits per level; each level has non-overlapping key ranges (except L0).

L0: [a-k] [f-p] [l-z]   (can overlap)
L1: 10MB cap — [a-c] [c-g] [g-m] ...   (non-overlapping)
L2: 100MB cap
L3: 1GB cap

When Ln exceeds cap, pick an SSTable and merge with overlapping SSTables in L(n+1).

Pros: low space amp (~1.1x). Cons: high write amp (10–30x worst case).

4.4 Universal (UCS) — RocksDB-only

Like STCS but more sophisticated: all SSTables conceptually sit in L0 as sorted runs and are merged when size-ratio conditions trigger. Suited to write-heavy, read-light workloads (caching, log ingest). Space amp 2–3x.

4.5 FIFO

Time-series only. Old SSTables just deleted, no compaction. Perfect for logs/metrics with TTL.

4.6 Tombstones and the awkward dance

Tombstones mark deletions. A tombstone must persist until every older copy of its key in deeper levels has been purged, or deleted data "resurrects."

Cassandra's famous zombie rows: a tombstone is GC'd after gc_grace_seconds, then a node that was down (or unrepaired) longer than that replays its stale copy → resurrection. Fix: repair every node within gc_grace_seconds and keep it larger than the maximum node downtime (default 10 days).

4.7 Subcompaction

Split big compactions across threads. RocksDB's max_subcompactions. Partition input key ranges into N sub-ranges, merge in parallel. Downside: CPU contention with queries.

4.8 Scheduling — an art

Bad scheduling:

  • Compaction too slow → L0 piles up → write stall (app-level throttle).
  • Compaction too aggressive → I/O contention → latency spikes.

RocksDB tunables:

max_background_compactions = 4
level0_file_num_compaction_trigger = 4
level0_slowdown_writes_trigger = 20
level0_stop_writes_trigger = 36
compaction_readahead_size = 2MB
rate_limiter_bytes_per_sec = 100MB

rate_limiter prevents compaction from stealing foreground I/O.
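
The same knobs on the C++ Options struct, a sketch (the rate limiter is attached as an object rather than a flag; max_background_compactions is superseded by max_background_jobs in newer releases):

#include <rocksdb/options.h>
#include <rocksdb/rate_limiter.h>

rocksdb::Options options;
options.max_background_compactions = 4;
options.level0_file_num_compaction_trigger = 4;
options.level0_slowdown_writes_trigger = 20;
options.level0_stop_writes_trigger = 36;
options.compaction_readahead_size = 2 << 20;                             // 2MB
options.rate_limiter.reset(rocksdb::NewGenericRateLimiter(100 << 20));   // cap compaction/flush I/O at 100MB/s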


5. Write / Space / Read amplification by the numbers

5.1 Write amp

Leveled: WA ≈ 1 + (L-1) * T, where L=levels, T=level multiplier (default 10). Six levels, T=10 → WA=51. One byte written = 51 bytes on disk. Hard on SSDs.

Size-tiered: WA = L + 1 — much smaller. That's why Cassandra defaults to STCS.

5.2 Space amp

Leveled: ~1.1x steady, up to 2x during compaction. Size-tiered: ~1.5x avg, up to 2x worst.

5.3 Read amp

Leveled: best case Bloom filters catch all negatives; typical 1–2 file accesses; worst = levels + L0 files. B-tree: O(log N) page accesses with smaller constants.

5.4 Hybrid tiered + leveled

Modern trend: lower levels tiered (write speed), upper levels leveled (space). level_compaction_dynamic_level_bytes gives a similar effect in RocksDB.


6. RocksDB internals

6.1 Architecture

Facebook's LevelDB fork: an embedded KV store only; networking and distribution are layered on top. Used by:

  • MyRocks: MySQL storage engine.
  • TiKV: distributed KV using RocksDB per shard.
  • CockroachDB: migrated to Pebble (Go port).
  • Kafka Streams: RocksDB state store.

6.2 Column Families

Multiple logical LSMs inside one DB instance:

rocksdb::ColumnFamilyOptions cf_opts;
rocksdb::ColumnFamilyHandle* cf;
db->CreateColumnFamily(cf_opts, "metadata", &cf);
db->Put(WriteOptions(), cf, "key1", "value1");

Each CF can have its own compaction, Bloom filter, block size — analogous to "tables."

6.3 Write batch — atomicity

WriteBatch batch;
batch.Put("key1", "value1");
batch.Put("key2", "value2");
batch.Delete("key3");
db->Write(WriteOptions(), &batch);

Atomically applied — one WAL record.

6.4 Snapshot

const Snapshot* snap = db->GetSnapshot();
ReadOptions ro;
ro.snapshot = snap;
std::string value;
db->Get(ro, "key1", &value);
db->ReleaseSnapshot(snap);

Snapshots are cheap in LSM due to immutable SSTables. Compaction won't drop sequence numbers beneath a live snapshot.

6.5 Transactions

Optimistic (check at write-time) or pessimistic (2PL). Upper layers like MyRocks/TiKV build richer TX semantics.
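
A pessimistic-transaction sketch using RocksDB's TransactionDB utility (optimistic transactions use OptimisticTransactionDB analogously); path and keys are illustrative:

#include <rocksdb/utilities/transaction_db.h>

rocksdb::Options options;
options.create_if_missing = true;
rocksdb::TransactionDBOptions txn_db_opts;
rocksdb::TransactionDB* txn_db;
rocksdb::TransactionDB::Open(options, txn_db_opts, "/tmp/txn-demo", &txn_db);

rocksdb::Transaction* txn = txn_db->BeginTransaction(rocksdb::WriteOptions());
txn->Put("key1", "value1");           // takes a lock on key1 until commit/rollback
rocksdb::Status s = txn->Commit();    // fails if a conflicting writer got there first
delete txn;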

6.6 Merge operator

Turn a counter increment (read, modify, write) into a single write:

db->Merge(WriteOptions(), "counter", "1");

Resolved at Get/compaction time. Cassandra counters use this idea.
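
A sketch of a hand-rolled counter merge operator (RocksDB's utilities also ship ready-made numeric add operators; this version is for illustration only):

#include <rocksdb/merge_operator.h>
#include <string>

// Associative merge: new = existing + operand, both stored as decimal strings.
class CounterMergeOperator : public rocksdb::AssociativeMergeOperator {
 public:
  bool Merge(const rocksdb::Slice& /*key*/, const rocksdb::Slice* existing_value,
             const rocksdb::Slice& value, std::string* new_value,
             rocksdb::Logger* /*logger*/) const override {
    uint64_t base = existing_value ? std::stoull(existing_value->ToString()) : 0;
    *new_value = std::to_string(base + std::stoull(value.ToString()));
    return true;
  }
  const char* Name() const override { return "CounterMergeOperator"; }
};

// options.merge_operator = std::make_shared<CounterMergeOperator>();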

6.7 TTL

Expiry metadata in SSTables; filtered at read, removed at compaction.
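
RocksDB exposes this through the DBWithTTL utility; a sketch (TTL is set per open, in seconds, and the key is illustrative):

#include <rocksdb/utilities/db_ttl.h>

rocksdb::DBWithTTL* db;
rocksdb::Options options;
options.create_if_missing = true;
rocksdb::DBWithTTL::Open(options, "/tmp/ttl-demo", &db, 86400);  // entries expire ~24h after write
db->Put(rocksdb::WriteOptions(), "session:42", "payload");       // expired keys are dropped at compaction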


7. Cassandra — distributed LSM archetype

7.1 Dynamo + Bigtable

2008, Facebook. Dynamo's partitioning/replication + Bigtable's data model (column family, SSTable). Consistent hashing partitions keys; each key replicated to RF nodes; each node runs a local LSM.

7.2 Write path

  1. Coordinator receives.
  2. Sends to RF replicas in parallel.
  3. Each writes local commit log + memtable.
  4. W acks → client ack.

W is the write consistency level (ONE, QUORUM, ALL). QUORUM writes plus QUORUM reads give strong consistency because R + W > RF guarantees the read set overlaps the write set (e.g. RF=3, W=2, R=2).

7.3 Read repair

Read R replicas, reconcile differences, return latest — eventual consistency that converges.

7.4 Caches

  • Row cache: full row data.
  • Key cache: key → SSTable position.
  • Bloom filter: per SSTable.

bloom_filter_fp_chance: 0.01 default for STCS, 0.1 for LCS.

7.5 Compaction strategies

  • STCS: default, write-centric.
  • LCS: read-centric, needs SSDs.
  • TWCS: time-series; whole windows droppable.
  • DTCS: TWCS ancestor, deprecated.

7.6 Cassandra's pains

JVM GC pauses; thread-per-request overhead; intricate compaction tuning; no read-before-write (upserts-only; LWT exists but expensive). These motivated ScyllaDB.


8. ScyllaDB

8.1 Shard-per-core

Cassandra-protocol-compatible rewrite in C++:

  • Each CPU core = independent shard.
  • Message passing between shards.
  • Almost no context switches, few locks.

Built on Seastar, DPDK, io_uring, NUMA-aware.

8.2 Performance

~10x Cassandra throughput on identical hardware; dramatically lower p99 (no GC pauses).

8.3 LSM reimplementation

Same logic with C++ allocator tuning, per-shard LSM, async I/O. SSTable format remains Cassandra-compatible for easy migration.


9. TiKV and CockroachDB

9.1 TiKV

CNCF graduate, Rust. TiDB storage layer.

  • Data sharded into Regions (96MB default).
  • Each region replicated via Raft (3 copies).
  • Each node stores region data in RocksDB.
  • Transactions via Percolator (Google paper): 2PC + optimistic.

9.2 CockroachDB

Go. Spanner-style multi-region DB. Originally on RocksDB; migrated to Pebble, its own Go engine, to drop CGo overhead and integrate better with the Go runtime.

9.3 Why LSM for distributed DBs

Distributed systems are write-heavy (Raft log apply, metadata, compaction). Sequential-write LSM fits well. Snapshots are cheap — vital for region moves and replica bootstrap.


10. B-tree vs LSM

10.1 At a glance

Trait               B-tree (Postgres, InnoDB)    LSM (RocksDB, Cassandra)
Writes              in-place                     append-only
Write speed         lower (random)               higher (sequential)
Read speed          higher (O(log N))            lower (multi-level)
Range scans         excellent                    decent (merge iterator)
Write amp           ~3x                          10–30x (leveled)
Space amp           1.2–1.4x                     1.1–2x
Backup              harder                       easy (copy immutable files)
Compaction/VACUUM   needed                       needed (heavier)
SSD wear            friendlier                   harsher
Ops complexity      lower                        higher

10.2 Sweet spots

  • B-tree: balanced R/W, complex queries, ACID, OLTP.
  • LSM: write floods, time-series, logs, KV, distributed shards.

10.3 Convergence

B-tree camp adopts COW (LMDB, WiredTiger) — no longer in-place. LSM camp adopts hybrid compaction. Shared goal: sequential-friendly with good reads.


11. Running LSM in production

11.1 RocksDB tuning checklist

write_buffer_size = 256MB
max_write_buffer_number = 4
min_write_buffer_number_to_merge = 2
target_file_size_base = 128MB
max_bytes_for_level_base = 1GB
max_bytes_for_level_multiplier = 10
compression_per_level = [none, none, zstd, zstd, zstd, zstd, zstd]
bloom_filter_bits_per_key = 10
cache_index_and_filter_blocks = true
block_cache = 16GB
rate_limiter = 100MB/s

Note L0/L1 no compression (flush speed), L2+ zstd (space).
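
A sketch of the same checklist applied to the C++ Options struct (the Bloom filter and block cache go through BlockBasedTableOptions as in section 3):

#include <rocksdb/options.h>

rocksdb::Options options;
options.write_buffer_size = 256UL << 20;                 // 256MB memtable
options.max_write_buffer_number = 4;
options.min_write_buffer_number_to_merge = 2;
options.target_file_size_base = 128UL << 20;             // 128MB SSTables
options.max_bytes_for_level_base = 1UL << 30;            // 1GB L1
options.max_bytes_for_level_multiplier = 10;
options.compression_per_level = {
    rocksdb::kNoCompression, rocksdb::kNoCompression,    // L0/L1: flush speed
    rocksdb::kZSTD, rocksdb::kZSTD, rocksdb::kZSTD,      // L2+: space
    rocksdb::kZSTD, rocksdb::kZSTD};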

11.2 Metrics

rocksdb.compaction.write.bytes
rocksdb.estimate.num.keys
rocksdb.block.cache.hit
rocksdb.block.cache.miss
rocksdb.bloom.filter.useful
rocksdb.write.stall
rocksdb.pending.compaction.bytes

Watch pending.compaction.bytes — if it climbs, write stall looms.
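
Statistics must be enabled explicitly. A sketch of wiring them up and polling one property, assuming db is an open rocksdb::DB* (ticker names above come from the Statistics object; GetProperty uses RocksDB's dashed property names):

#include <rocksdb/statistics.h>

rocksdb::Options options;
options.statistics = rocksdb::CreateDBStatistics();      // enables the rocksdb.* tickers listed above

std::string pending;
db->GetProperty("rocksdb.estimate-pending-compaction-bytes", &pending);
std::string report = options.statistics->ToString();     // dump all ticker/histogram values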

11.3 Incident patterns

  • Compaction can't keep up → L0 piles → stall. Fix: more compaction threads, throttle writes.
  • Range scans without Bloom benefit → p99 spike. Fix: revisit query patterns, prefix Bloom.
  • Long-lived snapshot blocks compaction — "LSM's idle in transaction."

11.4 Backup

Immutable SSTables allow hardlink-based backups.

BackupEngine::CreateNewBackup(db)
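
A fuller sketch of the call above; the options class is BackupableDBOptions in older releases and BackupEngineOptions in newer ones, and the backup directory is illustrative:

#include <rocksdb/utilities/backupable_db.h>

rocksdb::BackupEngine* backup_engine;
rocksdb::BackupEngine::Open(rocksdb::Env::Default(),
                            rocksdb::BackupableDBOptions("/backups/lsm-demo"),
                            &backup_engine);
backup_engine->CreateNewBackup(db);    // incremental: unchanged SSTables are shared, not recopied
delete backup_engine;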

11.5 DR

Need WAL + SSTables only: checkpoint, copy, load + WAL replay. Cassandra's nodetool snapshot does this.


12. Advanced

12.1 WiscKey — KV separation

Big values dominate compaction write amp. Solution: keep keys in the LSM and values in an append-only log; the LSM stores (key, value_pointer). RocksDB's BlobDB, Pebble, and Badger use this.
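
In RocksDB this surfaces as the integrated BlobDB options; a sketch (thresholds are arbitrary):

rocksdb::Options options;
options.enable_blob_files = true;                 // values above the threshold go to blob files
options.min_blob_size = 4096;                     // small values stay inline in the LSM
options.enable_blob_garbage_collection = true;    // rewrite blob files as stale versions accumulate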

12.2 Monkey — optimized Bloom filters

The same FPR across levels is suboptimal. Use a tight FPR for upper levels (small, hot) and a looser one for deeper levels; roughly 2x read throughput at the same memory.

12.3 REMIX / Dostoevsky

Mathematical hybrids of leveled/tiered; RocksDB is moving this way.

12.4 Learned indexes

Kraska 2018. Index as ML model. Practical gains for predictable distributions.


13. Choosing an engine

13.1 Decision tree

  1. Need ACID? → Postgres/MySQL.
  2. Time-series/logs? → InfluxDB/ClickHouse/TimescaleDB.
  3. Distributed KV/wide-column? → Cassandra/ScyllaDB/DynamoDB.
  4. Large-scale analytics? → ClickHouse.
  5. Embedded KV? → RocksDB/LevelDB.
  6. Otherwise: start with Postgres.

13.2 Emerging

FoundationDB, ScyllaDB, TiKV/TiDB, CockroachDB, Neon/PlanetScale/Crunchy — all use LSM, B-tree, or variants underneath. Fundamentals unchanged for 30 years.


Closing — in praise of sequential I/O

Every LSM choice flows from one sentence: sequential I/O is orders of magnitude faster than random. Batch in memory, flush sequentially, merge in background. That simple idea powered the NoSQL revolution.

LSM isn't free — read amp, write amp, compaction overhead, complex tuning. B-tree isn't dead — Postgres and InnoDB still thrive.

The answer is understanding your workload. Write floods → LSM. Complex queries and transactions → B-tree. Need both → hybrid (Postgres hot, ClickHouse cold, RocksDB as state store).

Three operator instincts:

  1. Measure write amplification — SSD life depends on it.
  2. Watch compaction speed — if writes outpace it, disaster.
  3. Trust Bloom filters, but verify — FPR is step one of tuning.

Next up: ClickHouse MergeTree internals — LSM's cousin and OLAP champion.