[Prometheus] TSDB Internals: WAL, Chunks, Blocks, Compaction

1. Overview

The Prometheus TSDB (Time Series Database) is a local storage engine optimized for time series data. It combines compression techniques inspired by Facebook's Gorilla paper with an LSM-tree-inspired block-based structure to deliver high write performance and efficient storage.

This post analyzes the entire storage hierarchy from the TSDB directory structure to WAL, Head Block, persistent blocks, and compaction.

2. TSDB Directory Structure

The Prometheus data directory has the following structure:

data/
  |-- wal/
  |     |-- 00000001
  |     |-- 00000002
  |     +-- 00000003
  |
  |-- chunks_head/
  |     |-- 000001
  |     +-- 000002
  |
  |-- 01BKGV7JBM69T2G1BGBGM6KB12/   (Block ULID)
  |     |-- meta.json
  |     |-- index
  |     |-- chunks/
  |     |     |-- 000001
  |     |     +-- 000002
  |     +-- tombstones
  |
  |-- 01BKGTZQ1SYQJTR4PB43C8PD98/   (Block ULID)
  |     |-- meta.json
  |     |-- index
  |     |-- chunks/
  |     +-- tombstones
  |
  |-- lock
  +-- queries.active

Role of each directory and file:

  • wal/: Write-Ahead Log segment files
  • chunks_head/: Memory-mapped chunk files for the Head Block
  • <ULID>/ directories: One persistent block each (the ULID is a sortable, time-based unique ID)
  • lock: Lock file ensuring exclusive process access
  • queries.active: Tracks queries currently in flight, so queries running at crash time can be reported after an unclean shutdown

3. Write-Ahead Log (WAL)

3.1 Purpose of WAL

The WAL is the core mechanism ensuring data durability. All incoming data is first written to the WAL before being applied to memory (Head Block). If Prometheus crashes, the WAL is replayed to recover the Head Block.

3.2 Segment File Structure

The WAL consists of fixed-size (default 128MB) segment files:

Segment File Internal Structure:
+----------+----------+----------+-----+
| Record 1 | Record 2 | Record 3 | ... |
+----------+----------+----------+-----+

Each Record:
+--------+---------+---------+------+
| Type   | Length  | CRC32   | Data |
| 1 byte | 2 bytes | 4 bytes | ...  |
+--------+---------+---------+------+

Segments are written in 32 KB pages. A record that does not fit into the
remaining space of a page is split into fragments, and the framing type
byte above marks whether an entry is a full record or the first, middle,
or last fragment. The logical record type (Series, Samples, etc., see
3.3) is stored in the first byte of the record data itself.
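
As a sketch, this framing can be implemented in a few lines. It is a simplification: real segments are split into 32 KB pages with large records fragmented across them, and Prometheus uses the Castagnoli CRC-32 polynomial, whereas zlib.crc32 below is the IEEE variant.

```python
import struct
import zlib

FULL_RECORD = 1  # framing type byte for an unfragmented record

def encode_record(data: bytes, rec_type: int = FULL_RECORD) -> bytes:
    """Frame a record: type (1 byte) + length (2 bytes, big-endian)
    + CRC32 of the data (4 bytes) + data."""
    crc = zlib.crc32(data) & 0xFFFFFFFF
    return struct.pack(">BHI", rec_type, len(data), crc) + data

def decode_record(buf: bytes):
    """Parse one framed record and verify its checksum."""
    rec_type, length, crc = struct.unpack(">BHI", buf[:7])
    data = buf[7:7 + length]
    if zlib.crc32(data) & 0xFFFFFFFF != crc:
        raise ValueError("corrupt WAL record")
    return rec_type, data
```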

3.3 Record Types

The WAL has 4 main record types:

Series Record (type 1): New time series registration

Series Record:
+----------+----------------------------+
| Series ID| Labels (name/value pairs)  |
+----------+----------------------------+

When a new time series is first encountered, a Series Record is written. The Series ID is a unique reference number within the Head Block.

Samples Record (type 2): Sample data

Samples Record:
+----------+-----------+-------+
| Series ID| Timestamp | Value |
+----------+-----------+-------+

Each scraped sample is recorded with its corresponding Series ID.

Tombstones Record (type 3): Deletion markers

Recorded when deletion of a specific time range of series data is requested. Actual deletion is performed during compaction.

Exemplars Record (type 4): Exemplar data

Exemplar samples containing additional metadata such as trace IDs.

3.4 WAL Management

  • Segment rotation: New segments are created when a segment reaches 128MB
  • Segment cleanup: When the Head Block is compacted into a persistent block, WAL segments for that time range are deleted
  • Checkpointing: Periodic checkpoints are created for faster WAL replay. Checkpoints contain only the latest state of active series

WAL Checkpoint Process:
1. Collect current active series list
2. Create checkpoint directory (checkpoint.NNNNN)
3. Rewrite records: keep Series Records for active series, and drop samples and tombstones older than the checkpoint time
4. Delete previous checkpoint and corresponding WAL segments
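
The filtering at the heart of checkpointing can be sketched over a hypothetical in-memory record representation (the tuple shapes below are illustrative, not Prometheus's actual encoding):

```python
def create_checkpoint(wal_records, active_series, min_time):
    """Keep only what a replay still needs: Series records for series
    that are still active, and Samples at or after min_time. Records
    are hypothetical tuples: ("series", series_id, labels) and
    ("samples", series_id, timestamp, value)."""
    kept = []
    for rec in wal_records:
        if rec[0] == "series" and rec[1] in active_series:
            kept.append(rec)        # series may still receive samples
        elif rec[0] == "samples" and rec[2] >= min_time:
            kept.append(rec)        # sample not yet in a persisted block
    return kept
```

Once the filtered records are written to the checkpoint directory, the WAL segments they came from can be deleted.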

3.5 WAL Compression

WAL compression can be enabled with the --storage.tsdb.wal-compression flag. It uses Snappy compression and typically reduces WAL size by about half. CPU overhead is minimal.

4. Head Block

4.1 Head Block Structure

The Head Block is the in-memory structure holding the most recent data in the TSDB:

Head Block:
+------------------------------------------+
| memSeries Map                             |
|   series_id_1 -> memSeries_1             |
|   series_id_2 -> memSeries_2             |
|   ...                                     |
+------------------------------------------+
| Posting Lists (in-memory index)          |
+------------------------------------------+
| Stripe Lock Pool                         |
+------------------------------------------+

4.2 memSeries Structure

Each time series is represented by a memSeries struct:

memSeries:
+------------+
| ref (ID)   |
| labels     |
| chunks     | --> [chunk_0] -> [chunk_1] -> [chunk_current]
| headChunk  | --> currently active chunk (writable)
| firstTs    |
| lastTs     |
| lastValue  |
+------------+

Key fields:

  • ref: Unique reference number for the series
  • labels: Sorted list of metric name and labels
  • chunks: Linked list of completed chunks
  • headChunk: The currently active chunk receiving new data

4.3 Chunk Encoding

Prometheus uses compression techniques from the Gorilla paper:

Timestamp Encoding (Delta-of-Delta):

Sample 1: t1 (raw value stored)
Sample 2: d1 = t2 - t1 (first delta)
Sample 3: dd = (t3 - t2) - (t2 - t1) (delta-of-delta)

With regular scraping, dd is mostly near 0
-> Can be represented with very few bits

Bit encoding:
dd == 0: 1 bit ('0')
-63 <= dd <= 64: '10' + 7 value bits = 9 bits
-255 <= dd <= 256: '110' + 9 value bits = 12 bits
-2047 <= dd <= 2048: '1110' + 12 value bits = 16 bits
Otherwise: '1111' + 32 value bits = 36 bits

(These are the buckets from the Gorilla paper; Prometheus's own encoder
uses the same control bits with wider value ranges of 14, 17, 20, and
64 bits.)
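
A small calculator over the Gorilla paper's buckets shows why regular scrape intervals compress so well. The header handling in `encoded_timestamp_bits` is simplified: it charges 64 bits each for the first timestamp and first delta, where Prometheus actually uses varint encodings.

```python
def dod_bits(dod: int) -> int:
    """Bits used to store one delta-of-delta (control + value bits),
    following the Gorilla paper's buckets."""
    if dod == 0:
        return 1            # '0'
    if -63 <= dod <= 64:
        return 2 + 7        # '10' + 7-bit value
    if -255 <= dod <= 256:
        return 3 + 9        # '110' + 9-bit value
    if -2047 <= dod <= 2048:
        return 4 + 12       # '1110' + 12-bit value
    return 4 + 32           # '1111' + 32-bit value

def encoded_timestamp_bits(timestamps):
    """Total bits for a timestamp stream: first raw, second as a delta,
    the rest as delta-of-deltas (header sizes simplified)."""
    bits = 64  # first timestamp, stored raw (simplified)
    if len(timestamps) > 1:
        bits += 64  # first delta (simplified; Prometheus uses a varint)
        prev_delta = timestamps[1] - timestamps[0]
        for a, b in zip(timestamps[1:], timestamps[2:]):
            delta = b - a
            bits += dod_bits(delta - prev_delta)
            prev_delta = delta
    return bits
```

With a perfectly regular 15-second scrape, every delta-of-delta is zero, so each sample after the second costs a single bit.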

Value Encoding (XOR):

Sample 1: v1 (raw float64 stored, 64 bits)
Sample 2: xor = v2 XOR v1
Sample 3: xor = v3 XOR v2

Consecutive similar values produce XOR results mostly zeros
-> Compression using leading zeros and trailing zeros

Bit encoding:
xor == 0: 1 bit ('0')
Fits in previous leading/trailing window: '10' + significant bits
Otherwise: '11' + 5 bits (leading-zero count) + 6 bits (significant length) + significant bits

In the Gorilla paper, this scheme compressed samples to an average of 1.37 bytes each; Prometheus achieves similar ratios in practice.
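
The quantities the XOR value encoder works with can be computed directly from the IEEE-754 bit patterns. This is a sketch of the bookkeeping only, not the full bit-stream writer:

```python
import struct

def float_xor(a: float, b: float) -> int:
    """XOR of the IEEE-754 bit patterns of two float64 values."""
    (ua,) = struct.unpack(">Q", struct.pack(">d", a))
    (ub,) = struct.unpack(">Q", struct.pack(">d", b))
    return ua ^ ub

def significant_window(x: int):
    """Leading zeros, trailing zeros, and significant-bit count of a
    64-bit XOR result -- the quantities the value encoder stores."""
    if x == 0:
        return 64, 64, 0
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1
    return leading, trailing, 64 - leading - trailing
```

Nearby values share sign, exponent, and most mantissa bits, so the XOR has long runs of leading and trailing zeros and only a narrow window of significant bits needs to be written.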

4.4 Chunk Lifecycle

1. New sample arrives
2. Append to headChunk
3. headChunk reaches 120 samples or chunkRange (2 hours)
4. headChunk completion (becomes immutable)
5. Written to chunks_head/ directory as mmap file
6. New headChunk created
7. Previous chunks accessed via mmap (can be freed from memory)
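
The cut rule in steps 3-6 can be sketched as follows; the class and field names are illustrative, and closed chunks are kept in a plain list rather than being memory-mapped:

```python
class HeadChunkAppender:
    """Sketch of the head-chunk cut rule: a chunk is closed once it
    holds 120 samples or its time span reaches chunk_range."""
    MAX_SAMPLES = 120

    def __init__(self, chunk_range_ms=2 * 60 * 60 * 1000):
        self.chunk_range = chunk_range_ms
        self.closed_chunks = []   # stand-ins for mmapped chunks
        self.head = []            # (timestamp, value) pairs

    def append(self, ts, value):
        if self.head and (len(self.head) >= self.MAX_SAMPLES
                          or ts - self.head[0][0] >= self.chunk_range):
            self.closed_chunks.append(self.head)  # becomes immutable
            self.head = []                        # cut a new head chunk
        self.head.append((ts, value))
```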

4.5 Memory Mapped Chunks

Since Prometheus 2.19, completed Head Block chunks are written as mmap files in the chunks_head/ directory. This provides:

  • Significantly reduced memory usage (OS manages via page cache)
  • Faster crash recovery with reduced WAL replay time
  • Chunk data loaded into memory only when needed

5. Persistent Block Structure

5.1 Block Overview

Head Block data is compacted into a persistent block once the head spans 1.5 times the block range, i.e. after 3 hours for the default 2-hour blocks:

Block Directory:
01BKGV7JBM69T2G1BGBGM6KB12/
  |-- meta.json      (block metadata)
  |-- index          (series index)
  |-- chunks/
  |     |-- 000001   (chunk data)
  |     +-- 000002
  +-- tombstones      (deletion markers)

5.2 meta.json

Contains block metadata:

{
  "ulid": "01BKGV7JBM69T2G1BGBGM6KB12",
  "minTime": 1602547200000,
  "maxTime": 1602554400000,
  "stats": {
    "numSamples": 1234567,
    "numSeries": 5678,
    "numChunks": 9012
  },
  "compaction": {
    "level": 1,
    "sources": ["01BKGV7JBM69T2G1BGBGM6KB12"]
  },
  "version": 1
}

5.3 Index Structure

The index file contains structures for fast series lookup:

Index File Structure:
+------------------+
| Symbol Table     |  (dictionary of all label names/values)
+------------------+
| Series           |  (per-series labels and chunk references)
+------------------+
| Label Index      |  (label name -> possible values)
+------------------+
| Postings         |  (label pair -> series ID list)
+------------------+
| Postings Offset  |  (offset table for posting lists)
+------------------+
| TOC              |  (Table of Contents)
+------------------+

5.4 Posting List

The Posting List is the core of the inverted index. It maps each label name-value pair to a list of corresponding series IDs:

Example:
job="prometheus" -> [1, 3, 5, 7, 9]
job="node"       -> [2, 4, 6, 8, 10]
instance="localhost:9090" -> [1, 2]
instance="localhost:9100" -> [3, 4]

Query: job="prometheus" AND instance="localhost:9090"
  -> [1, 3, 5, 7, 9] INTERSECT [1, 2]
  -> [1]

Posting Lists are stored sorted, enabling O(n) intersection and union operations.
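
The intersection step can be sketched as a classic sorted-list merge, which is what makes the operation linear in the combined list lengths:

```python
def intersect_postings(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```

Running it on the example above reproduces the result: intersecting [1, 3, 5, 7, 9] with [1, 2] yields [1]. Unions work the same way with a merge that keeps elements from both lists.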

5.5 Chunk Files

Chunk files store the actual compressed time series sample data:

Chunk File Format:
+--------+--------+--------+-----+
| Chunk 1| Chunk 2| Chunk 3| ... |
+--------+--------+--------+-----+

Each Chunk:
+----------+----------+--------+------+
| Length   | Encoding | Data   | CRC  |
| uvarint  | 1 byte   | ...    | 4B   |
+----------+----------+--------+------+

Encoding types:

  • 0: Raw (uncompressed)
  • 1: XOR (for float samples, default)
  • 2: Histogram
  • 3: Float Histogram
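
A minimal encoder for this chunk framing, assuming the checksum covers the encoding byte plus the data; zlib's IEEE CRC-32 stands in for the Castagnoli variant Prometheus uses.

```python
import struct
import zlib

def write_uvarint(n: int) -> bytes:
    """Unsigned LEB128 varint, as used for the chunk length field."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_chunk(encoding: int, data: bytes) -> bytes:
    """Frame one chunk: length (uvarint) + encoding (1 byte) + data
    + CRC32 (4 bytes) over the encoding byte and data."""
    body = bytes([encoding]) + data
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return write_uvarint(len(data)) + body + struct.pack(">I", crc)
```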

6. Compaction

6.1 Compaction Overview

Compaction merges smaller blocks into larger blocks:

Level 0: [2h] [2h] [2h] [2h] [2h] [2h]
                     |
                     v (compaction)
Level 1: [  6h  ] [  6h  ] [  6h  ]
                     |
                     v (compaction)
Level 2: [      18h       ] [  6h  ]

6.2 Level-based Compaction

Prometheus uses a level-based compaction strategy:

Compaction decision criteria:
1. When 3 or more blocks at the same level exist
2. When merged result time range does not exceed 10% of retention
3. When merged result does not exceed max-block-duration

Compaction process:
1. Select blocks to merge
2. Create new block directory (temporary)
3. Merge all series from source blocks
4. Apply tombstones (remove deleted data)
5. Generate new index
6. Generate new chunk files
7. Update meta.json (increment level, record sources)
8. Atomically activate new block
9. Delete previous blocks (deferred)
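
A toy version of the block-selection step, assuming blocks are (min_time, max_time, level) tuples sorted by min_time; Prometheus's actual planner is considerably more involved:

```python
def plan_compaction(blocks, max_block_range):
    """Pick the first run of 3 adjacent blocks at the same compaction
    level whose merged time span stays within max_block_range.
    Returns the chosen group, or None if nothing qualifies."""
    for i in range(len(blocks) - 2):
        group = blocks[i:i + 3]
        level = group[0][2]
        if all(b[2] == level for b in group) and \
           group[-1][1] - group[0][0] <= max_block_range:
            return group
    return None
```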

6.3 Vertical Compaction

Special compaction performed when blocks with overlapping time ranges exist:

Occurs when:
- Backfilling past data
- Out-of-order sample setting enabled
- Duplicate blocks after restoration

Processing:
1. Detect overlapping blocks
2. Merge samples of same series (sort by timestamp)
3. Remove duplicate samples
4. Consolidate into single block
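
The merge-and-deduplicate step can be sketched as a heap-based k-way merge over sorted sample streams, keeping the first value seen for a duplicate timestamp (Prometheus's exact duplicate policy may differ):

```python
import heapq

def merge_ooo_samples(*chunks):
    """Merge sorted (timestamp, value) streams from overlapping blocks
    into one sorted stream with duplicate timestamps dropped."""
    merged, last_ts = [], None
    for ts, v in heapq.merge(*chunks):
        if ts != last_ts:
            merged.append((ts, v))
            last_ts = ts
    return merged
```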

6.4 Deletion Processing

Data deletion is handled in two stages:

Stage 1 - Deletion request (API call):
  - Record deletion range in tombstones file
  - Actual data still exists

Stage 2 - During compaction:
  - Apply tombstones to exclude data in deleted ranges
  - New block does not contain deleted data

7. Retention Management

7.1 Time-based Retention

--storage.tsdb.retention.time=15d (default)

Behavior:
1. Blocks with maxTime older than current time minus retention are deletion candidates
2. Checked on each compaction cycle
3. Deleted at block granularity (no partial deletion)

7.2 Size-based Retention

--storage.tsdb.retention.size=10GB

Behavior:
1. Calculate total disk usage of all blocks
2. Delete oldest blocks first when limit exceeded
3. WAL and chunks_head excluded from size calculation

7.3 Combined Retention

When both conditions are set, whichever is reached first applies:

--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB

- Blocks older than 30 days deleted regardless of size
- If exceeding 50GB, oldest blocks deleted even within 30 days
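
The combined retention rules can be sketched as a pure function over block metadata, assuming (ulid, max_time, size_bytes) tuples ordered oldest first:

```python
def retention_victims(blocks, now_ms, retention_time_ms=None,
                      retention_size_bytes=None):
    """Return the set of block ULIDs eligible for deletion under
    time-based and/or size-based retention. Whichever limit a block
    trips first makes it a victim."""
    victims = set()
    if retention_time_ms is not None:
        for ulid, max_time, _ in blocks:
            if max_time < now_ms - retention_time_ms:
                victims.add(ulid)
    if retention_size_bytes is not None:
        total = sum(size for _, _, size in blocks)
        for ulid, _, size in blocks:   # oldest first
            if total <= retention_size_bytes:
                break
            total -= size
            victims.add(ulid)
    return victims
```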

8. mmap and Memory Management

8.1 mmap Usage

Prometheus TSDB extensively uses mmap (memory-mapped files):

mmap usage areas:
1. Persistent block index files
2. Persistent block chunk files
3. Head Block completed chunks (chunks_head/)

mmap benefits:
- OS automatically manages memory via page cache
- Only needed portions loaded into memory (lazy loading)
- OS automatically frees pages under memory pressure
- Reduces Go process heap memory burden

8.2 Memory Usage Analysis

Major memory consumers:

1. Head Block memSeries (largest proportion):
   - Approximately 500 bytes to 1KB per series
   - Proportional to active series count

2. Head Block headChunks:
   - One active chunk per series
   - Approximately 100-200 bytes per chunk

3. Posting Lists (in-memory index):
   - Proportional to label cardinality

4. WAL replay buffer:
   - Temporarily high memory usage at startup
   - Released after replay completes

5. Query execution memory:
   - Proportional to concurrent queries and result sizes

9. Out-of-Order Sample Handling

9.1 Out-of-Order Support

Since Prometheus 2.39, out-of-order (OOO) samples are supported. The OOO window is set in the configuration file (not via a CLI flag):

storage:
  tsdb:
    out_of_order_time_window: 30m

Behavior:
1. Past sample arrives with timestamp older than Head Block's latest
2. If within OOO window, written to separate WBL (Write-Behind Log)
3. Stored as out-of-order chunks in Head Block
4. Merged with in-order chunks during compaction
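
The routing decision in steps 1-2 reduces to a comparison against the head's maximum timestamp and the configured window (a sketch with illustrative names):

```python
def classify_sample(ts, head_max_time, ooo_window_ms):
    """Route an incoming sample: in-order samples go through the WAL,
    samples older than the head's max time but within the OOO window
    go through the WBL, and anything older is rejected."""
    if ts >= head_max_time:
        return "wal"
    if head_max_time - ts <= ooo_window_ms:
        return "wbl"
    return "too_old"
```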

9.2 WBL (Write-Behind Log)

A dedicated WAL for out-of-order samples:

WBL vs WAL:
- WAL: for in-order samples only
- WBL: for out-of-order samples only
- Both logs used for crash recovery
- WBL only active when OOO window is configured

10. Performance Characteristics

10.1 Write Performance

Write path:
WAL Write (sequential I/O) -> Head Block Update (memory)

Characteristics:
- WAL performs only sequential writes for disk I/O optimization
- Head Block updates are memory operations, extremely fast
- Capable of ingesting millions of samples per second

10.2 Read Performance

Read path:
1. Query parsing and optimization
2. Posting list lookup for relevant series
3. Load chunks for the time range
4. Chunk decoding and result computation

Optimizations:
- Fast series filtering via posting list intersection
- Load only necessary chunks via mmap
- Skip irrelevant blocks by time range

11. Summary

The Prometheus TSDB demonstrates a design that exploits the characteristics of time series data to the fullest. It compresses samples to well under two bytes each on average with delta-of-delta and XOR encoding, ensures data durability with the WAL, and maintains query efficiency with level-based compaction.

In the next post, we will analyze the PromQL engine internals, covering the lexer and parser, AST structure, and query evaluation engine behavior.