- Author: Youngju Kim (@fjvbn20031)

- 1. Overview
- 2. TSDB Directory Structure
- 3. Write-Ahead Log (WAL)
- 4. Head Block
- 5. Persistent Block Structure
- 6. Compaction
- 7. Retention Management
- 8. mmap and Memory Management
- 9. Out-of-Order Sample Handling
- 10. Performance Characteristics
- 11. Summary
1. Overview
The Prometheus TSDB (Time Series Database) is a local storage engine optimized for time series data. It combines compression techniques inspired by Facebook's Gorilla paper with an LSM-tree-inspired block-based structure to deliver high write performance and efficient storage.
This post analyzes the entire storage hierarchy from the TSDB directory structure to WAL, Head Block, persistent blocks, and compaction.
2. TSDB Directory Structure
The Prometheus data directory has the following structure:
data/
|-- wal/
| |-- 00000001
| |-- 00000002
| +-- 00000003
|
|-- chunks_head/
| |-- 000001
| +-- 000002
|
|-- 01BKGV7JBM69T2G1BGBGM6KB12/ (Block ULID)
| |-- meta.json
| |-- index
| |-- chunks/
| | |-- 000001
| | +-- 000002
| +-- tombstones
|
|-- 01BKGTZQ1SYQJTR4PB43C8PD98/ (Block ULID)
| |-- meta.json
| |-- index
| |-- chunks/
| +-- tombstones
|
|-- lock
+-- queries.active
Role of each directory and file:
- wal/: Write-Ahead Log segment files
- chunks_head/: Memory-mapped chunk files for the Head Block
- ULID directories/: One directory per persistent block (a ULID is a time-ordered, lexicographically sortable unique ID)
- lock: Lock file ensuring exclusive process access
- queries.active: Tracks queries currently being executed (used to log queries that were still running if Prometheus crashes)
3. Write-Ahead Log (WAL)
3.1 Purpose of WAL
The WAL is the core mechanism ensuring data durability. All incoming data is first written to the WAL before being applied to memory (Head Block). If Prometheus crashes, the WAL is replayed to recover the Head Block.
3.2 Segment File Structure
The WAL consists of fixed-size (default 128MB) segment files:
Segment File Internal Structure:
+----------+----------+----------+-----+
| Record 1 | Record 2 | Record 3 | ... |
+----------+----------+----------+-----+
Each Record:
+--------+---------+---------+------+
| Type   | Length  | CRC32   | Data |
| 1 byte | 2 bytes | 4 bytes | ...  |
+--------+---------+---------+------+
Records are packed into 32KB pages; a record larger than a page is split into fragments.
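As an illustration of the framing above, here is a minimal Python sketch of record encoding and decoding. This is a simplification: the real implementation uses the Castagnoli CRC-32C polynomial (plain `zlib.crc32` stands in here) and the page/fragment handling is omitted.

```python
import struct
import zlib

# Record type constants as listed in section 3.3.
RECORD_SERIES, RECORD_SAMPLES, RECORD_TOMBSTONES, RECORD_EXEMPLARS = 1, 2, 3, 4

def frame_record(rec_type: int, data: bytes) -> bytes:
    """Build a record: type (1B) | length (2B big-endian) | CRC32 (4B) | data."""
    header = struct.pack(">BH", rec_type, len(data))
    crc = zlib.crc32(data) & 0xFFFFFFFF
    return header + struct.pack(">I", crc) + data

def parse_record(buf: bytes):
    """Parse one record, verifying its checksum."""
    rec_type, length = struct.unpack_from(">BH", buf, 0)
    (crc,) = struct.unpack_from(">I", buf, 3)
    data = buf[7:7 + length]
    assert zlib.crc32(data) & 0xFFFFFFFF == crc, "corrupt record"
    return rec_type, data

raw = frame_record(RECORD_SAMPLES, b"\x01\x02\x03")
print(parse_record(raw))  # (2, b'\x01\x02\x03')
```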
3.3 Record Types
The WAL has 4 main record types:
Series Record (type 1): New time series registration
Series Record:
+----------+----------------------------+
| Series ID| Labels (name/value pairs) |
+----------+----------------------------+
When a new time series is first encountered, a Series Record is written. The Series ID is a unique reference number within the Head Block.
Samples Record (type 2): Sample data
Samples Record:
+----------+-----------+-------+
| Series ID| Timestamp | Value |
+----------+-----------+-------+
Each scraped sample is recorded with its corresponding Series ID.
Tombstones Record (type 3): Deletion markers
Recorded when deletion of a specific time range of series data is requested. Actual deletion is performed during compaction.
Exemplars Record (type 4): Exemplar data
Exemplar samples containing additional metadata such as trace IDs.
3.4 WAL Management
- Segment rotation: New segments are created when a segment reaches 128MB
- Segment cleanup: When the Head Block is compacted into a persistent block, WAL segments for that time range are deleted
- Checkpointing: Periodic checkpoints are created for faster WAL replay. Checkpoints contain only the latest state of active series
WAL Checkpoint Process:
1. Collect current active series list
2. Create checkpoint directory (checkpoint.NNNNN)
3. Write Series Records for active series
4. Delete previous checkpoint and corresponding WAL segments
3.5 WAL Compression
WAL compression can be enabled with the --storage.tsdb.wal-compression flag, and has been enabled by default since Prometheus 2.20. It uses Snappy compression and typically reduces WAL size by about half, with minimal CPU overhead.
4. Head Block
4.1 Head Block Structure
The Head Block is the in-memory structure holding the most recent data in the TSDB:
Head Block:
+------------------------------------------+
| memSeries Map |
| series_id_1 -> memSeries_1 |
| series_id_2 -> memSeries_2 |
| ... |
+------------------------------------------+
| Posting Lists (in-memory index) |
+------------------------------------------+
| Stripe Lock Pool |
+------------------------------------------+
4.2 memSeries Structure
Each time series is represented by a memSeries struct:
memSeries:
+------------+
| ref (ID) |
| labels |
| chunks | --> [chunk_0] -> [chunk_1] -> [chunk_current]
| headChunk | --> currently active chunk (writable)
| firstTs |
| lastTs |
| lastValue |
+------------+
Key fields:
- ref: Unique reference number for the series
- labels: Sorted list of metric name and labels
- chunks: Linked list of completed chunks
- headChunk: The currently active chunk receiving new data
4.3 Chunk Encoding
Prometheus uses compression techniques from the Gorilla paper:
Timestamp Encoding (Delta-of-Delta):
Sample 1: t1 (raw value stored)
Sample 2: d1 = t2 - t1 (first delta)
Sample 3: dd = (t3 - t2) - (t2 - t1) (delta-of-delta)
With regular scraping, dd is mostly near 0
-> Can be represented with very few bits
Bit encoding (the Gorilla paper's scheme):
dd == 0:             1 bit ('0')
-63 <= dd <= 64:     2 + 7 = 9 bits ('10' prefix)
-255 <= dd <= 256:   3 + 9 = 12 bits ('110' prefix)
-2047 <= dd <= 2048: 4 + 12 = 16 bits ('1110' prefix)
Otherwise:           4 + 32 = 36 bits ('1111' prefix)
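A small sketch of the delta-of-delta computation above, showing why regularly scraped series compress so well (illustrative code, not the actual encoder):

```python
def delta_of_deltas(timestamps):
    """Compute the delta-of-delta stream for a sorted list of timestamps (ms)."""
    out = []
    prev_t, prev_delta = None, None
    for t in timestamps:
        if prev_t is None:
            out.append(("raw", t))            # first timestamp stored raw
        elif prev_delta is None:
            out.append(("delta", t - prev_t)) # second stores the first delta
        else:
            out.append(("dod", (t - prev_t) - prev_delta))
        if prev_t is not None:
            prev_delta = t - prev_t
        prev_t = t
    return out

# A 15s scrape interval with slight jitter: nearly every dod is 0 or tiny,
# so most timestamps cost only 1 bit (or a few bits) to store.
ts = [1_000, 16_000, 31_000, 46_002, 61_002]
print(delta_of_deltas(ts))
# [('raw', 1000), ('delta', 15000), ('dod', 0), ('dod', 2), ('dod', -2)]
```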
Value Encoding (XOR):
Sample 1: v1 (raw float64 stored, 64 bits)
Sample 2: xor = v2 XOR v1
Sample 3: xor = v3 XOR v2
XOR results of consecutive, similar values are mostly zero bits
-> Compressed by storing only the significant bits between the leading and trailing zeros
Bit encoding:
xor == 0: 1 bit ('0')
Same leading/trailing as previous: 2 + significant bits
Otherwise: 2 + 5(leading) + 6(significant_length) + significant bits
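The bit-level savings come from the shape of the XOR result. The sketch below XORs the IEEE-754 bit patterns of two adjacent samples and counts the leading and trailing zeros (illustrative only):

```python
import struct

def float_xor(a: float, b: float) -> int:
    """XOR the IEEE-754 bit patterns of two float64 samples."""
    (ab,) = struct.unpack(">Q", struct.pack(">d", a))
    (bb,) = struct.unpack(">Q", struct.pack(">d", b))
    return ab ^ bb

# Identical consecutive values XOR to 0 -> stored as a single '0' bit.
print(float_xor(100.0, 100.0))  # 0

# Close values share sign, exponent, and most mantissa bits, so the XOR
# has long runs of leading and trailing zeros around a short significant
# window; only that window needs to be stored.
y = float_xor(100.0, 101.0)
bits = f"{y:064b}"
leading = len(bits) - len(bits.lstrip("0"))
trailing = len(bits) - len(bits.rstrip("0"))
print(leading, trailing, 64 - leading - trailing)  # 17 46 1
```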
This compression averages about 1.37 bytes per sample, the figure reported in the Gorilla paper.
4.4 Chunk Lifecycle
1. New sample arrives
2. Append to headChunk
3. headChunk reaches 120 samples or chunkRange (2 hours)
4. headChunk completion (becomes immutable)
5. Written to chunks_head/ directory as mmap file
6. New headChunk created
7. Previous chunks accessed via mmap (can be freed from memory)
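The cut rule in steps 3-4 can be sketched as follows; the class and field names here are illustrative, not the actual Go identifiers:

```python
# A head chunk is closed once it holds 120 samples or its time span
# reaches the chunk range (2 hours), then a new writable chunk is started.
MAX_SAMPLES = 120
CHUNK_RANGE_MS = 2 * 60 * 60 * 1000  # 2 hours

class MemSeries:
    def __init__(self):
        self.head = []    # currently writable chunk: list of (ts, value)
        self.sealed = []  # completed chunks (candidates for chunks_head/ mmap)

    def append(self, ts, value):
        if self.head and (len(self.head) >= MAX_SAMPLES or
                          ts - self.head[0][0] >= CHUNK_RANGE_MS):
            self.sealed.append(self.head)  # step 4-5: seal, write as mmap file
            self.head = []                 # step 6: new head chunk
        self.head.append((ts, value))

s = MemSeries()
for i in range(250):               # 250 samples at a 15s scrape interval
    s.append(i * 15_000, float(i))
print(len(s.sealed), len(s.head))  # 2 10: two sealed 120-sample chunks, 10 in head
```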
4.5 Memory Mapped Chunks
Since Prometheus 2.19, completed Head Block chunks are written as mmap files in the chunks_head/ directory. This provides:
- Significantly reduced memory usage (OS manages via page cache)
- Faster crash recovery with reduced WAL replay time
- Chunk data loaded into memory only when needed
5. Persistent Block Structure
5.1 Block Overview
Head Block data is compacted into persistent blocks after a set period (default 2 hours):
Block Directory:
01BKGV7JBM69T2G1BGBGM6KB12/
|-- meta.json (block metadata)
|-- index (series index)
|-- chunks/
| |-- 000001 (chunk data)
| +-- 000002
+-- tombstones (deletion markers)
5.2 meta.json
Contains block metadata:
{
"ulid": "01BKGV7JBM69T2G1BGBGM6KB12",
"minTime": 1602547200000,
"maxTime": 1602554400000,
"stats": {
"numSamples": 1234567,
"numSeries": 5678,
"numChunks": 9012
},
"compaction": {
"level": 1,
"sources": ["01BKGV7JBM69T2G1BGBGM6KB12"]
},
"version": 1
}
5.3 Index Structure
The index file contains structures for fast series lookup:
Index File Structure:
+------------------+
| Symbol Table | (dictionary of all label names/values)
+------------------+
| Series | (per-series labels and chunk references)
+------------------+
| Label Index | (label name -> possible values)
+------------------+
| Postings | (label pair -> series ID list)
+------------------+
| Postings Offset | (offset table for posting lists)
+------------------+
| TOC | (Table of Contents)
+------------------+
5.4 Posting List
The Posting List is the core of the inverted index. It maps each label name-value pair to a list of corresponding series IDs:
Example:
job="prometheus" -> [1, 3, 5, 7, 9]
job="node" -> [2, 4, 6, 8, 10]
instance="localhost:9090" -> [1, 2]
instance="localhost:9100" -> [3, 4]
Query: job="prometheus" AND instance="localhost:9090"
-> [1, 3, 5, 7, 9] INTERSECT [1, 2]
-> [1]
Posting lists are stored in sorted order, so intersections and unions can be computed with a linear-time merge scan.
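The intersection step can be sketched as a standard merge scan over two sorted lists:

```python
def intersect(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# The example from above: job="prometheus" AND instance="localhost:9090"
print(intersect([1, 3, 5, 7, 9], [1, 2]))  # [1]
```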
5.5 Chunk Files
Chunk files store the actual compressed time series sample data:
Chunk File Format:
+--------+--------+--------+-----+
| Chunk 1| Chunk 2| Chunk 3| ... |
+--------+--------+--------+-----+
Each Chunk:
+----------+----------+--------+------+
| Length | Encoding | Data | CRC |
| uvarint | 1 byte | ... | 4B |
+----------+----------+--------+------+
Encoding types:
- 0: Raw (uncompressed)
- 1: XOR (for float samples, default)
- 2: Histogram
- 3: Float Histogram
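A minimal sketch of this framing, assuming the uvarint length covers only the data bytes and the CRC covers the encoding byte plus data (the real format uses CRC-32C; `zlib.crc32` stands in here):

```python
import zlib

def write_uvarint(n: int) -> bytes:
    """Encode an unsigned varint (LEB128), as used for the chunk length."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

XOR_ENCODING = 1  # encoding byte for float samples

def frame_chunk(encoding: int, data: bytes) -> bytes:
    """Frame a chunk: length (uvarint) | encoding (1B) | data | CRC (4B)."""
    crc = zlib.crc32(bytes([encoding]) + data) & 0xFFFFFFFF
    return (write_uvarint(len(data)) + bytes([encoding]) + data +
            crc.to_bytes(4, "big"))

chunk = frame_chunk(XOR_ENCODING, b"\xde\xad\xbe\xef")
print(chunk.hex())
```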
6. Compaction
6.1 Compaction Overview
Compaction merges smaller blocks into larger blocks:
Level 0: [2h] [2h] [2h] [2h] [2h] [2h] [2h] [2h] [2h]
|
v (compaction)
Level 1: [ 6h ] [ 6h ] [ 6h ]
|
v (compaction)
Level 2: [ 18h ]
6.2 Level-based Compaction
Prometheus uses a level-based compaction strategy:
Compaction decision criteria:
1. When 3 or more blocks at the same level exist
2. When merged result time range does not exceed 10% of retention
3. When merged result does not exceed max-block-duration
Compaction process:
1. Select blocks to merge
2. Create new block directory (temporary)
3. Merge all series from source blocks
4. Apply tombstones (remove deleted data)
5. Generate new index
6. Generate new chunk files
7. Update meta.json (increment level, record sources)
8. Atomically activate new block
9. Delete previous blocks (deferred)
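The selection step (criterion 1 above) can be sketched roughly as follows; the real planner aligns blocks to a fixed range ladder (2h, 6h, 18h, ...), which this simplification omits:

```python
def plan(blocks, min_count=3, max_range_ms=6 * 3_600_000):
    """Pick a run of contiguous same-level blocks to merge.

    blocks: list of (min_time, max_time, level) tuples, sorted by time.
    Returns the first run of `min_count` same-level blocks whose combined
    time range fits within `max_range_ms`, or None if nothing qualifies.
    """
    for i in range(len(blocks) - min_count + 1):
        group = blocks[i:i + min_count]
        same_level = len({b[2] for b in group}) == 1
        span = group[-1][1] - group[0][0]
        if same_level and span <= max_range_ms:
            return group
    return None

H = 3_600_000  # one hour in ms
blocks = [(0, 2 * H, 1), (2 * H, 4 * H, 1), (4 * H, 6 * H, 1), (6 * H, 8 * H, 1)]
print(plan(blocks))  # the first three 2h level-1 blocks -> one 6h block
```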
6.3 Vertical Compaction
Special compaction performed when blocks with overlapping time ranges exist:
Occurs when:
- Backfilling past data
- Out-of-order sample setting enabled
- Duplicate blocks after restoration
Processing:
1. Detect overlapping blocks
2. Merge samples of same series (sort by timestamp)
3. Remove duplicate samples
4. Consolidate into single block
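The merge-and-deduplicate step can be sketched with a k-way merge over sorted sample runs:

```python
import heapq

def merge_series(*chunks):
    """Merge overlapping sorted (timestamp, value) runs, dropping duplicates."""
    out = []
    for ts, val in heapq.merge(*chunks):  # k-way merge keeps timestamp order
        if out and out[-1][0] == ts:
            continue  # duplicate timestamp: keep the first sample seen
        out.append((ts, val))
    return out

in_order = [(10, 1.0), (20, 2.0), (30, 3.0)]
backfilled = [(15, 1.5), (20, 2.0), (25, 2.5)]  # overlapping block
print(merge_series(in_order, backfilled))
# [(10, 1.0), (15, 1.5), (20, 2.0), (25, 2.5), (30, 3.0)]
```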
6.4 Deletion Processing
Data deletion is handled in two stages:
Stage 1 - Deletion request (API call):
- Record deletion range in tombstones file
- Actual data still exists
Stage 2 - During compaction:
- Apply tombstones to exclude data in deleted ranges
- New block does not contain deleted data
7. Retention Management
7.1 Time-based Retention
--storage.tsdb.retention.time=15d (default)
Behavior:
1. Blocks with maxTime older than current time minus retention are deletion candidates
2. Checked on each compaction cycle
3. Deleted at block granularity (no partial deletion)
7.2 Size-based Retention
--storage.tsdb.retention.size=10GB
Behavior:
1. Calculate total disk usage of all blocks
2. Delete oldest blocks first when limit exceeded
3. WAL and chunks_head excluded from size calculation
7.3 Combined Retention
When both conditions are set, whichever is reached first applies:
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
- Blocks older than 30 days deleted regardless of size
- If exceeding 50GB, oldest blocks deleted even within 30 days
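Combining both rules, a simplified retention pass might look like this; 'maxTime' mirrors the meta.json field, while 'size' stands in for the block's on-disk size (computed from its files, not stored in meta.json):

```python
def apply_retention(blocks, now_ms, retention_ms, max_bytes):
    """Return the blocks that survive both retention rules.

    blocks: list of dicts with 'maxTime' (ms) and 'size' (bytes), oldest first.
    """
    # Time rule: a block whose maxTime is older than (now - retention) is dropped.
    kept = [b for b in blocks if b["maxTime"] > now_ms - retention_ms]
    # Size rule: drop the oldest remaining blocks until under the limit.
    while kept and sum(b["size"] for b in kept) > max_bytes:
        kept.pop(0)
    return kept

blocks = [{"maxTime": 100, "size": 40}, {"maxTime": 200, "size": 40},
          {"maxTime": 300, "size": 40}]
print(apply_retention(blocks, now_ms=400, retention_ms=250, max_bytes=70))
# time rule drops the first block; size rule then drops the second
```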
8. mmap and Memory Management
8.1 mmap Usage
Prometheus TSDB extensively uses mmap (memory-mapped files):
mmap usage areas:
1. Persistent block index files
2. Persistent block chunk files
3. Head Block completed chunks (chunks_head/)
mmap benefits:
- OS automatically manages memory via page cache
- Only needed portions loaded into memory (lazy loading)
- OS automatically frees pages under memory pressure
- Reduces Go process heap memory burden
8.2 Memory Usage Analysis
Major memory consumers:
1. Head Block memSeries (largest proportion):
- Approximately 500 bytes to 1KB per series
- Proportional to active series count
2. Head Block headChunks:
- One active chunk per series
- Approximately 100-200 bytes per chunk
3. Posting Lists (in-memory index):
- Proportional to label cardinality
4. WAL replay buffer:
- Temporarily high memory usage at startup
- Released after replay completes
5. Query execution memory:
- Proportional to concurrent queries and result sizes
9. Out-of-Order Sample Handling
9.1 Out-of-Order Support
Since Prometheus 2.39, out-of-order samples are supported:
--storage.tsdb.out-of-order-time-window=30m
Behavior:
1. Past sample arrives with timestamp older than Head Block's latest
2. If within OOO window, written to separate WBL (Write-Behind Log)
3. Stored as out-of-order chunks in Head Block
4. Merged with in-order chunks during compaction
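The ingest decision in steps 1-2 can be sketched as follows (illustrative logic, not the actual implementation):

```python
OOO_WINDOW_MS = 30 * 60 * 1000  # --storage.tsdb.out-of-order-time-window=30m

def classify(sample_ts, head_max_ts, window=OOO_WINDOW_MS):
    """Decide where an incoming sample goes relative to the head's newest ts."""
    if sample_ts >= head_max_ts:
        return "in-order (WAL + head chunk)"
    if sample_ts >= head_max_ts - window:
        return "out-of-order (WBL + OOO chunk)"
    return "rejected (too old)"

head_max = 10_000_000
print(classify(10_000_015, head_max))              # newer: in-order
print(classify(head_max - 60_000, head_max))       # 1m old: within OOO window
print(classify(head_max - 3_600_000, head_max))    # 1h old: outside the window
```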
9.2 WBL (Write-Behind Log)
A dedicated WAL for out-of-order samples:
WBL vs WAL:
- WAL: for in-order samples only
- WBL: for out-of-order samples only
- Both logs used for crash recovery
- WBL only active when OOO window is configured
10. Performance Characteristics
10.1 Write Performance
Write path:
WAL Write (sequential I/O) -> Head Block Update (memory)
Characteristics:
- WAL performs only sequential writes for disk I/O optimization
- Head Block updates are memory operations, extremely fast
- Capable of ingesting millions of samples per second
10.2 Read Performance
Read path:
1. Query parsing and optimization
2. Posting list lookup for relevant series
3. Load chunks for the time range
4. Chunk decoding and result computation
Optimizations:
- Fast series filtering via posting list intersection
- Load only necessary chunks via mmap
- Skip irrelevant blocks by time range
11. Summary
The Prometheus TSDB demonstrates a design that maximally exploits the characteristics of time series data. It achieves 1.37 bytes per sample compression with delta-of-delta and XOR encoding, ensures data durability with WAL, and maintains query efficiency with level-based compaction.
In the next post, we will analyze the PromQL engine internals, covering the lexer and parser, AST structure, and query evaluation engine behavior.