[Prometheus] TSDB Internals: WAL, Chunks, Blocks, Compaction

1. Overview
2. TSDB Directory Structure
3. Write-Ahead Log (WAL)
4. Head Block
5. Persistent Block Structure
6. Compaction
7. Retention Management
8. mmap and Memory Management
- 8.1 mmap Usage
- 8.2 Memory Usage Analysis
9. Out-of-Order Sample Handling
- 9.1 Out-of-Order Support
- 9.2 WBL (Write-Behind Log)
10. Performance Characteristics
- 10.1 Write Performance
- 10.2 Read Performance
11. Summary

1. Overview

The Prometheus TSDB (Time Series Database) is a local storage engine optimized for time series data. It combines compression techniques inspired by Facebook's Gorilla paper with an LSM-tree-inspired block-based structure to deliver high write performance and efficient storage.

This post analyzes the entire storage hierarchy from the TSDB directory structure to WAL, Head Block, persistent blocks, and compaction.

2. TSDB Directory Structure

The Prometheus data directory has the following structure:

data/
  |-- wal/
  |     |-- 00000001
  |     |-- 00000002
  |     +-- 00000003
  |
  |-- chunks_head/
  |     |-- 000001
  |     +-- 000002
  |
  |-- 01BKGV7JBM69T2G1BGBGM6KB12/   (Block ULID)
  |     |-- meta.json
  |     |-- index
  |     |-- chunks/
  |     |     |-- 000001
  |     |     +-- 000002
  |     +-- tombstones
  |
  |-- 01BKGTZQ1SYQJTR4PB43C8PD98/   (Block ULID)
  |     |-- meta.json
  |     |-- index
  |     |-- chunks/
  |     +-- tombstones
  |
  |-- lock
  +-- queries.active

Role of each directory and file:

wal/: Write-Ahead Log segment files
chunks_head/: Memory-mapped chunk files for the Head Block
ULID directories/: Each persistent block (ULID is a time-based unique ID)
lock: Lock file ensuring exclusive process access
queries.active: Tracking currently active queries

3. Write-Ahead Log (WAL)

3.1 Purpose of WAL

The WAL is the core mechanism ensuring data durability. All incoming data is first written to the WAL before being applied to memory (Head Block). If Prometheus crashes, the WAL is replayed to recover the Head Block.

3.2 Segment File Structure

The WAL consists of fixed-size (default 128MB) segment files:

Segment File Internal Structure:
+----------+----------+----------+-----+
| Record 1 | Record 2 | Record 3 | ... |
+----------+----------+----------+-----+

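Records are written in 32KB pages, and a record that does not fit in the remaining page space is split into fragments. Each fragment carries a 7-byte header:

Each Record Fragment:
+--------+---------+---------+------+
| Type   | Length  | CRC32   | Data |
| 1 byte | 2 bytes | 4 bytes | ...  |
+--------+---------+---------+------+

The Type byte here marks the fragment kind (full, first, middle, last); the logical record type described in 3.3 is the first byte of Data.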
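As a rough illustration of this framing, the sketch below packs and unpacks one fragment with a fixed 7-byte header (1-byte type, 2-byte big-endian length, 4-byte CRC32), which is how the upstream WAL format describes it. The function names and the `REC_FULL` constant are hypothetical, and `zlib.crc32` (IEEE polynomial) is only a stand-in: Prometheus itself uses the Castagnoli polynomial, so these bytes would not match a real segment.

```python
import struct
import zlib

# Header layout: fragment type (1 byte), payload length (uint16), CRC32 (uint32).
HEADER = struct.Struct(">BHI")
REC_FULL = 1  # assumed value for "the whole record fits in one fragment"


def encode_fragment(frag_type: int, data: bytes) -> bytes:
    """Frame one payload as header + data."""
    if len(data) > 0xFFFF:
        raise ValueError("payload must fit in a 16-bit length field")
    # Stand-in checksum: zlib.crc32 is CRC-32/IEEE, not Castagnoli.
    return HEADER.pack(frag_type, len(data), zlib.crc32(data)) + data


def decode_fragment(buf: bytes):
    """Return (fragment type, payload), verifying length and checksum."""
    frag_type, length, crc = HEADER.unpack_from(buf, 0)
    data = buf[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("truncated fragment")
    if zlib.crc32(data) != crc:
        raise ValueError("checksum mismatch")
    return frag_type, data


frame = encode_fragment(REC_FULL, b"hello")
print(len(frame), decode_fragment(frame))
```

A checksum per fragment is what lets replay stop cleanly at the first torn write instead of applying corrupted data.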

3.3 Record Types

The WAL has 4 main record types:

Series Record (type 1): New time series registration

Series Record:
+----------+----------------------------+
| Series ID| Labels (name/value pairs)  |
+----------+----------------------------+

When a new time series is first encountered, a Series Record is written. The Series ID is a unique reference number within the Head Block.

Samples Record (type 2): Sample data

Samples Record:
+----------+-----------+-------+
| Series ID| Timestamp | Value |
+----------+-----------+-------+

Each scraped sample is recorded with its corresponding Series ID.

Tombstones Record (type 3): Deletion markers

Recorded when deletion of a specific time range of series data is requested. Actual deletion is performed during compaction.

Exemplars Record (type 4): Exemplar data

Exemplar samples containing additional metadata such as trace IDs.

3.4 WAL Management

Segment rotation: A new segment is created when the current one reaches 128MB
Segment cleanup: After the Head Block is compacted into a persistent block, the WAL is truncated and the older segments whose data the block now covers are deleted
Checkpointing: Before old segments are deleted, a checkpoint is written so that replay stays fast. It keeps only what is still needed: records for series that are still in the Head Block and samples newer than the truncation point

WAL Checkpoint Process:
1. Collect current active series list
2. Create checkpoint directory (checkpoint.NNNNN)
3. Write Series Records for active series
4. Delete previous checkpoint and corresponding WAL segments

3.5 WAL Compression

WAL compression is controlled by the --storage.tsdb.wal-compression flag, which has been enabled by default since Prometheus 2.20. It uses Snappy compression and typically reduces WAL size by about half with little extra CPU load.

4. Head Block

4.1 Head Block Structure

The Head Block is the in-memory structure holding the most recent data in the TSDB:

Head Block:
+------------------------------------------+
| memSeries Map                            |
|   series_id_1 -> memSeries_1             |
|   series_id_2 -> memSeries_2             |
|   ...                                    |
+------------------------------------------+
| Posting Lists (in-memory index)          |
+------------------------------------------+
| Stripe Lock Pool                         |
+------------------------------------------+

4.2 memSeries Structure

Each time series is represented by a memSeries struct:

memSeries:
+------------+
| ref (ID)   |
| labels     |
| chunks     | --> [chunk_0] -> [chunk_1] -> ... (completed, read-only)
| headChunk  | --> currently active chunk (writable)
| firstTs    |
| lastTs     |
| lastValue  |
+------------+

Key fields:

ref: Unique reference number for the series
labels: Sorted list of metric name and labels
chunks: Linked list of completed chunks
headChunk: The currently active chunk receiving new data

4.3 Chunk Encoding

Prometheus uses compression techniques from the Gorilla paper:

Timestamp Encoding (Delta-of-Delta):

Sample 1: t1 (raw value stored)
Sample 2: d1 = t2 - t1 (first delta)
Sample 3: dd = (t3 - t2) - (t2 - t1) (delta-of-delta)

With regular scraping, dd is mostly near 0
-> Can be represented with very few bits

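Bit encoding (Gorilla paper buckets, control prefix + payload):
dd == 0: 1 bit ('0')
-63 <= dd <= 64: 2 + 7 = 9 bits (prefix '10')
-255 <= dd <= 256: 3 + 9 = 12 bits (prefix '110')
-2047 <= dd <= 2048: 4 + 12 = 16 bits (prefix '1110')
Otherwise: 4 + 32 = 36 bits (prefix '1111')

Prometheus keeps the same prefix scheme but, because its timestamps are in milliseconds, widens the payloads to 14, 17, and 20 bits, with a 64-bit fallback.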
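To make the delta-of-delta idea concrete, here is a small sketch that turns a list of millisecond timestamps into the first value, the first delta, and the delta-of-delta stream, and then restores the original list. It only shows the arithmetic; the bit packing is left out, and the function names are made up for this example.

```python
def delta_of_delta(timestamps):
    """Split timestamps into (first value, first delta, list of delta-of-deltas)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    first = timestamps[0]
    first_delta = timestamps[1] - timestamps[0]
    dods = []
    prev_delta = first_delta
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        delta = cur - prev
        dods.append(delta - prev_delta)
        prev_delta = delta
    return first, first_delta, dods


def restore(first, first_delta, dods):
    """Inverse of delta_of_delta."""
    out = [first, first + first_delta]
    delta = first_delta
    for dd in dods:
        delta += dd
        out.append(out[-1] + delta)
    return out


# A 15s scrape interval with 1ms of jitter on the fourth sample.
ts = [1000, 16000, 31000, 46001, 61000]
first, first_delta, dods = delta_of_delta(ts)
print(first, first_delta, dods)  # 1000 15000 [0, 1, -2]
```

Even with jitter, the stream stays within a few units of zero, which is why most timestamps fall into the cheapest buckets.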

Value Encoding (XOR):

Sample 1: v1 (raw float64 stored, 64 bits)
Sample 2: xor = v2 XOR v1
Sample 3: xor = v3 XOR v2

Consecutive similar values produce XOR results mostly zeros
-> Compression using leading zeros and trailing zeros

Bit encoding:
xor == 0: 1 bit ('0')
Same leading/trailing as previous: 2 + significant bits
Otherwise: 2 + 5(leading) + 6(significant_length) + significant bits
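The following sketch shows what the encoder looks at for each value: it XORs the raw 64-bit patterns of two floats and reports the leading zeros, trailing zeros, and the width of the significant window in between. The helper names are hypothetical, and the actual bit writing is omitted.

```python
import struct


def float_bits(x: float) -> int:
    """Raw IEEE 754 bit pattern of a float64 as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_window(prev: float, cur: float):
    """Return (leading zeros, trailing zeros, significant bits), or None if equal."""
    x = float_bits(prev) ^ float_bits(cur)
    if x == 0:
        return None  # identical value: a single '0' bit is enough
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1  # index of the lowest set bit
    return leading, trailing, 64 - leading - trailing


print(xor_window(12.0, 12.0))  # None
print(xor_window(12.0, 24.0))  # (11, 52, 1)
```

Going from 12.0 to 24.0 changes only one exponent bit, so a single significant bit plus the window description is all that has to be stored.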

In the Gorilla paper this scheme averages 1.37 bytes per sample; Prometheus itself typically reports around 1 to 2 bytes per sample, depending on how regular the data is.

4.4 Chunk Lifecycle

1. New sample arrives
2. Append to headChunk
3. headChunk reaches 120 samples or the chunkRange boundary (2 hours by default)
4. headChunk is completed (becomes immutable)
5. Completed chunk is written to a file in chunks_head/ and memory-mapped
6. New headChunk created
7. Previous chunks are read through mmap, so their heap copies can be freed
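The cut decision in this lifecycle can be sketched as a toy model. This is a deliberate simplification, not the real appender: it cuts when the head chunk holds 120 samples or when a sample crosses the next chunkRange boundary, and it keeps completed chunks in a plain list instead of writing them to chunks_head/. The class and attribute names are invented for the example.

```python
class SeriesSketch:
    """Toy model of one series: a writable head chunk plus completed chunks."""

    def __init__(self, max_samples=120, chunk_range_ms=2 * 60 * 60 * 1000):
        self.max_samples = max_samples
        self.chunk_range_ms = chunk_range_ms
        self.completed = []  # stands in for the mmapped chunks
        self.head = []       # current head chunk as (timestamp, value) pairs

    def append(self, t_ms, value):
        if self.head:
            first_t = self.head[0][0]
            # End of the chunkRange window that the head chunk started in.
            boundary = (first_t // self.chunk_range_ms + 1) * self.chunk_range_ms
            if len(self.head) >= self.max_samples or t_ms >= boundary:
                self.completed.append(self.head)  # chunk becomes immutable
                self.head = []
        self.head.append((t_ms, value))


series = SeriesSketch()
for i in range(250):  # 250 samples at a 15s interval
    series.append(i * 15_000, float(i))
print(len(series.completed), len(series.head))  # 2 10
```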

4.5 Memory Mapped Chunks

Since Prometheus 2.19, completed Head Block chunks are written as mmap files in the chunks_head/ directory. This provides:

Significantly reduced memory usage (OS manages via page cache)
Faster crash recovery with reduced WAL replay time
Chunk data loaded into memory only when needed

5. Persistent Block Structure

5.1 Block Overview

Head Block data is compacted into persistent blocks after a set period (default 2 hours):

Block Directory:
01BKGV7JBM69T2G1BGBGM6KB12/
  |-- meta.json      (block metadata)
  |-- index          (series index)
  |-- chunks/
  |     |-- 000001   (chunk data)
  |     +-- 000002
  +-- tombstones      (deletion markers)

5.2 meta.json

Contains block metadata:

{
  "ulid": "01BKGV7JBM69T2G1BGBGM6KB12",
  "minTime": 1602547200000,
  "maxTime": 1602554400000,
  "stats": {
    "numSamples": 1234567,
    "numSeries": 5678,
    "numChunks": 9012
  },
  "compaction": {
    "level": 1,
    "sources": ["01BKGV7JBM69T2G1BGBGM6KB12"]
  },
  "version": 1
}
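A quick way to read such a file is to derive the block's time span from minTime and maxTime, which are Unix timestamps in milliseconds. The sketch below parses the example above from a string; the `block_summary` helper is hypothetical, and in practice the text would come from the meta.json file inside a block directory.

```python
import json

META_TEXT = """
{
  "ulid": "01BKGV7JBM69T2G1BGBGM6KB12",
  "minTime": 1602547200000,
  "maxTime": 1602554400000,
  "stats": {"numSamples": 1234567, "numSeries": 5678, "numChunks": 9012},
  "compaction": {"level": 1, "sources": ["01BKGV7JBM69T2G1BGBGM6KB12"]},
  "version": 1
}
"""


def block_summary(meta_text: str) -> dict:
    """Extract a few derived numbers from a block's meta.json content."""
    meta = json.loads(meta_text)
    duration_ms = meta["maxTime"] - meta["minTime"]
    stats = meta["stats"]
    return {
        "ulid": meta["ulid"],
        "duration_hours": duration_ms / 3_600_000,
        "samples_per_chunk": stats["numSamples"] / stats["numChunks"],
        "level": meta["compaction"]["level"],
    }


summary = block_summary(META_TEXT)
print(summary["duration_hours"], summary["level"])  # 2.0 1
```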

5.3 Index Structure

The index file contains structures for fast series lookup:

Index File Structure:
+------------------+
| Symbol Table     |  (dictionary of all label names/values)
+------------------+
| Series           |  (per-series labels and chunk references)
+------------------+
| Label Index      |  (label name -> possible values)
+------------------+
| Postings         |  (label pair -> series ID list)
+------------------+
| Postings Offset  |  (offset table for posting lists)
+------------------+
| TOC              |  (Table of Contents)
+------------------+

5.4 Posting List

The Posting List is the core of the inverted index. It maps each label name-value pair to a list of corresponding series IDs:

Example:
job="prometheus" -> [1, 3, 5, 7, 9]
job="node"       -> [2, 4, 6, 8, 10]
instance="localhost:9090" -> [1, 2]
instance="localhost:9100" -> [3, 4]

Query: job="prometheus" AND instance="localhost:9090"
  -> [1, 3, 5, 7, 9] INTERSECT [1, 2]
  -> [1]

Posting Lists are stored sorted, enabling O(n) intersection and union operations.
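The linear-time intersection can be sketched with two cursors that advance over the sorted lists. This is a minimal illustration with plain Python lists, not the iterator-based code Prometheus uses.

```python
def intersect(a, b):
    """Intersect two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, advance it
        else:
            j += 1  # b is behind, advance it
    return out


postings = {
    ("job", "prometheus"): [1, 3, 5, 7, 9],
    ("job", "node"): [2, 4, 6, 8, 10],
    ("instance", "localhost:9090"): [1, 2],
    ("instance", "localhost:9100"): [3, 4],
}
result = intersect(postings[("job", "prometheus")],
                   postings[("instance", "localhost:9090")])
print(result)  # [1]
```

Because neither cursor ever moves backwards, each list is scanned at most once, which is where the O(n) bound comes from.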

5.5 Chunk Files

Chunk files store the actual compressed time series sample data:

Chunk File Format:
+--------+--------+--------+-----+
| Chunk 1| Chunk 2| Chunk 3| ... |
+--------+--------+--------+-----+

Each Chunk:
+----------+----------+--------+------+
| Length   | Encoding | Data   | CRC  |
| uvarint  | 1 byte   | ...    | 4B   |
+----------+----------+--------+------+
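The framing above can be sketched as a round trip: a uvarint length, one encoding byte, the payload, and a 4-byte checksum. Two assumptions are made here and should be checked against the upstream format document: the length counts only the data bytes, and the checksum covers the encoding byte plus the data. As a stand-in, `zlib.crc32` (IEEE) replaces the Castagnoli CRC32 that Prometheus uses, and all function names are hypothetical.

```python
import zlib


def put_uvarint(n: int) -> bytes:
    """Encode an unsigned integer as a little-endian base-128 varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_uvarint(buf: bytes, pos: int = 0):
    """Decode a uvarint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def frame_chunk(encoding: int, data: bytes) -> bytes:
    body = bytes([encoding]) + data
    # Assumption: checksum over encoding + data; IEEE CRC32 as a stand-in.
    return put_uvarint(len(data)) + body + zlib.crc32(body).to_bytes(4, "big")


def parse_chunk(buf: bytes, pos: int = 0):
    """Return (encoding, data, position of the next chunk)."""
    length, pos = read_uvarint(buf, pos)
    encoding = buf[pos]
    data = buf[pos + 1:pos + 1 + length]
    crc = int.from_bytes(buf[pos + 1 + length:pos + 5 + length], "big")
    if zlib.crc32(bytes([encoding]) + data) != crc:
        raise ValueError("checksum mismatch")
    return encoding, data, pos + 5 + length


payload = bytes(300)
framed = frame_chunk(1, payload)
print(len(framed), parse_chunk(framed)[0])
```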

Encoding types:

0: None (reserved, not used for stored chunks)
1: XOR (for float samples, default)
2: Histogram
3: Float Histogram

6. Compaction

6.1 Compaction Overview

Compaction merges smaller blocks into larger blocks:

Level 1: [2h] [2h] [2h] [2h] [2h] [2h] [2h] [2h] [2h]
                     |
                     v (compaction)
Level 2: [  6h  ] [  6h  ] [  6h  ]
                     |
                     v (compaction)
Level 3: [      18h       ]

Blocks cut from the Head Block start at level 1, matching the meta.json example in 5.2.

6.2 Level-based Compaction

Prometheus uses a level-based compaction strategy:

Compaction decision criteria:
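1. Adjacent blocks that together fill the next larger time range are merged; each range is 3 times the previous one (2h, 6h, 18h, ...)
2. The merged block must not exceed max-block-duration
3. By default, max-block-duration is 10% of the retention period, capped at 31 days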
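As a rough sketch of how block sizes grow and how candidates line up, the code below generates an exponential series of ranges and groups blocks by the aligned parent window they fall into. The first function mirrors the idea of exponentially growing block ranges; the grouping helper is a simplified, hypothetical stand-in for the real planner, which also handles gaps, overlaps, and the newest block.

```python
HOUR_MS = 3_600_000


def exponential_block_ranges(min_size, steps, step_size):
    """Return `steps` ranges, each `step_size` times larger than the previous."""
    ranges = []
    cur = min_size
    for _ in range(steps):
        ranges.append(cur)
        cur *= step_size
    return ranges


def group_by_parent(blocks, parent_range):
    """Group (min_time, max_time) blocks by the aligned parent window they start in."""
    groups = {}
    for mint, maxt in blocks:
        start = mint - mint % parent_range
        groups.setdefault(start, []).append((mint, maxt))
    return groups


def full_groups(blocks, parent_range):
    """Keep only the groups whose blocks cover the whole parent window."""
    return {
        start: members
        for start, members in group_by_parent(blocks, parent_range).items()
        if sum(maxt - mint for mint, maxt in members) == parent_range
    }


ranges = exponential_block_ranges(2 * HOUR_MS, 4, 3)
print([r // HOUR_MS for r in ranges])  # [2, 6, 18, 54]

# Five 2h blocks: the first three fill a 6h window, the last two do not yet.
blocks = [(i * 2 * HOUR_MS, (i + 1) * 2 * HOUR_MS) for i in range(5)]
print(sorted(full_groups(blocks, 6 * HOUR_MS)))  # [0]
```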

Compaction process:
1. Select blocks to merge
2. Create new block directory (temporary)
3. Merge all series from source blocks
4. Apply tombstones (remove deleted data)
5. Generate new index
6. Generate new chunk files
7. Update meta.json (increment level, record sources)
8. Atomically activate new block
9. Delete previous blocks (deferred)

6.3 Vertical Compaction

Special compaction performed when blocks with overlapping time ranges exist:

Occurs when:
- Backfilling past data
- Out-of-order sample setting enabled
- Duplicate blocks after restoration

Processing:
1. Detect overlapping blocks
2. Merge samples of same series (sort by timestamp)
3. Remove duplicate samples
4. Consolidate into single block
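Steps 2 and 3 amount to a k-way merge with deduplication. A minimal sketch for a single series, assuming each input is already sorted by timestamp and that the first sample seen for a timestamp wins (the real merge rules may differ):

```python
import heapq


def merge_series(*sample_lists):
    """Merge sorted (timestamp, value) lists and drop repeated timestamps."""
    merged = []
    for t, v in heapq.merge(*sample_lists, key=lambda s: s[0]):
        if merged and merged[-1][0] == t:
            continue  # duplicate timestamp: keep the sample seen first
        merged.append((t, v))
    return merged


block_a = [(10, 1.0), (20, 2.0), (30, 3.0)]
block_b = [(20, 2.0), (25, 2.5), (40, 4.0)]
print(merge_series(block_a, block_b))
# [(10, 1.0), (20, 2.0), (25, 2.5), (30, 3.0), (40, 4.0)]
```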

6.4 Deletion Processing

Data deletion is handled in two stages:

Stage 1 - Deletion request (API call):
  - Record deletion range in tombstones file
  - Actual data still exists

Stage 2 - During compaction:
  - Apply tombstones to exclude data in deleted ranges
  - New block does not contain deleted data
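Stage 2 can be pictured as a filter over each series' samples. The sketch below assumes tombstones are closed intervals [min_time, max_time] attached to one series; the function name is made up for illustration.

```python
def apply_tombstones(samples, tombstones):
    """Drop every (timestamp, value) sample that falls inside a tombstone interval."""
    return [
        (t, v)
        for t, v in samples
        if not any(lo <= t <= hi for lo, hi in tombstones)
    ]


samples = [(10, 1.0), (20, 2.0), (30, 3.0), (40, 4.0), (50, 5.0)]
tombstones = [(15, 30), (50, 60)]
print(apply_tombstones(samples, tombstones))  # [(10, 1.0), (40, 4.0)]
```

Until compaction rewrites the block, the same filter has to be applied at query time, which is why tombstoned data disappears from results immediately even though it still occupies disk space.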

7. Retention Management

7.1 Time-based Retention

--storage.tsdb.retention.time=15d (default)

Behavior:
1. Blocks with maxTime older than current time minus retention are deletion candidates
2. Checked on each compaction cycle
3. Deleted at block granularity (no partial deletion)

7.2 Size-based Retention

--storage.tsdb.retention.size=10GB

Behavior:
1. Calculate total disk usage of all blocks
2. Delete oldest blocks first when limit exceeded
3. WAL and chunks_head count toward the total size, but only persistent blocks are deleted to get back under the limit

7.3 Combined Retention

When both conditions are set, whichever is reached first applies:

--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB

- Blocks older than 30 days deleted regardless of size
- If exceeding 50GB, oldest blocks deleted even within 30 days
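How the two limits combine can be sketched as a single pass over the blocks from newest to oldest. This follows the description above rather than the exact upstream logic: the time check uses "current time minus retention" as the cutoff, and every name here, including the `other_bytes` parameter for non-block disk usage, is a hypothetical choice for the example.

```python
def blocks_to_delete(blocks, now_ms, retention_ms=None, max_bytes=None, other_bytes=0):
    """Return the ULIDs of blocks that violate the time or the size limit."""
    doomed = set()
    used = other_bytes
    # Walk from newest to oldest so the size budget is spent on recent data first.
    for block in sorted(blocks, key=lambda b: b["max_time"], reverse=True):
        used += block["size"]
        too_old = retention_ms is not None and block["max_time"] < now_ms - retention_ms
        too_big = max_bytes is not None and used > max_bytes
        if too_old or too_big:
            doomed.add(block["ulid"])
    return doomed


blocks = [
    {"ulid": "A", "max_time": 100, "size": 40},
    {"ulid": "B", "max_time": 200, "size": 40},
    {"ulid": "C", "max_time": 300, "size": 40},
]
print(sorted(blocks_to_delete(blocks, now_ms=350, retention_ms=200)))                # ['A']
print(sorted(blocks_to_delete(blocks, now_ms=350, retention_ms=200, max_bytes=70)))  # ['A', 'B']
```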

8. mmap and Memory Management

8.1 mmap Usage

Prometheus TSDB extensively uses mmap (memory-mapped files):

mmap usage areas:
1. Persistent block index files
2. Persistent block chunk files
3. Head Block completed chunks (chunks_head/)

mmap benefits:
- OS automatically manages memory via page cache
- Only needed portions loaded into memory (lazy loading)
- OS automatically frees pages under memory pressure
- Reduces Go process heap memory burden

8.2 Memory Usage Analysis

Major memory consumers:

1. Head Block memSeries (largest proportion):
   - Approximately 500 bytes to 1KB per series
   - Proportional to active series count

2. Head Block headChunks:
   - One active chunk per series
   - Approximately 100-200 bytes per chunk

3. Posting Lists (in-memory index):
   - Proportional to label cardinality

4. WAL replay buffer:
   - Temporarily high memory usage at startup
   - Released after replay completes

5. Query execution memory:
   - Proportional to concurrent queries and result sizes
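For capacity planning, the first two items can be turned into a back-of-the-envelope range. The sketch below simply multiplies the per-series figures quoted above by the number of active series; the defaults are those rough figures, not measured values, and the function name is hypothetical.

```python
def head_memory_estimate(active_series, series_bytes=(500, 1024), head_chunk_bytes=(100, 200)):
    """Return a (low, high) byte estimate for memSeries plus head chunks."""
    low = active_series * (series_bytes[0] + head_chunk_bytes[0])
    high = active_series * (series_bytes[1] + head_chunk_bytes[1])
    return low, high


low, high = head_memory_estimate(1_000_000)
print(f"{low / 1e9:.1f} GB to {high / 1e9:.1f} GB")  # 0.6 GB to 1.2 GB
```

The estimate ignores posting lists, WAL replay, and query memory, so it is a floor for the Head Block rather than a sizing rule for the whole process.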

9. Out-of-Order Sample Handling

9.1 Out-of-Order Support

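Since Prometheus 2.39, out-of-order samples are supported. The window is set in the configuration file rather than with a command-line flag:

storage:
  tsdb:
    out_of_order_time_window: 30m

Behavior:
1. A sample arrives with a timestamp older than the latest sample already stored for its series
2. If it falls within the OOO window, it is written to a separate WBL (Write-Behind Log)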
3. Stored as out-of-order chunks in Head Block
4. Merged with in-order chunks during compaction
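The routing decision can be sketched as a small classifier. This is a simplified model under stated assumptions: in-order is judged against the series' own latest timestamp, and the lower bound of the window is measured back from the Head Block's newest timestamp. The function and its return labels are invented for the example.

```python
def classify(t_ms, series_last_ms, head_max_ms, ooo_window_ms):
    """Decide where an incoming sample would go."""
    if t_ms > series_last_ms:
        return "in-order"       # goes to the WAL and the head chunk
    if t_ms == series_last_ms:
        return "duplicate"      # same timestamp as the latest sample
    if t_ms < head_max_ms - ooo_window_ms:
        return "rejected"       # too old even for the OOO window
    return "out-of-order"       # goes to the WBL and an OOO chunk


WINDOW = 30 * 60 * 1000  # 30m
print(classify(10_015_000, 10_000_000, 10_000_000, WINDOW))  # in-order
print(classify(9_000_000, 10_000_000, 10_000_000, WINDOW))   # out-of-order
print(classify(8_000_000, 10_000_000, 10_000_000, WINDOW))   # rejected
```

With a window of 0, which corresponds to the feature being disabled, every sample older than the series' latest one ends up rejected in this model.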

9.2 WBL (Write-Behind Log)

A dedicated WAL for out-of-order samples:

WBL vs WAL:
- WAL: for in-order samples only
- WBL: for out-of-order samples only
- Both logs used for crash recovery
- WBL only active when OOO window is configured

10. Performance Characteristics

10.1 Write Performance

Write path:
WAL Write (sequential I/O) -> Head Block Update (memory)

Characteristics:
- The WAL is append-only, so disk writes are sequential
- Head Block updates are in-memory operations and therefore very fast
- A single server can ingest on the order of a million samples per second, depending on hardware and series cardinality

10.2 Read Performance

Read path:
1. Query parsing and optimization
2. Posting list lookup for relevant series
3. Load chunks for the time range
4. Chunk decoding and result computation

Optimizations:
- Fast series filtering via posting list intersection
- Load only necessary chunks via mmap
- Skip irrelevant blocks by time range
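The last optimization is a plain interval overlap test on each block's meta.json. A minimal sketch, assuming a block covers the half-open range [min_time, max_time) and the query range is inclusive on both ends; the function name and dictionary keys are hypothetical:

```python
def overlapping_blocks(blocks, query_min, query_max):
    """Return the blocks whose time range intersects [query_min, query_max]."""
    return [
        b for b in blocks
        if b["min_time"] <= query_max and query_min < b["max_time"]
    ]


HOUR_MS = 3_600_000
blocks = [
    {"ulid": "old", "min_time": 0, "max_time": 2 * HOUR_MS},
    {"ulid": "mid", "min_time": 2 * HOUR_MS, "max_time": 4 * HOUR_MS},
    {"ulid": "new", "min_time": 4 * HOUR_MS, "max_time": 6 * HOUR_MS},
]
hits = overlapping_blocks(blocks, 3 * HOUR_MS, 5 * HOUR_MS)
print([b["ulid"] for b in hits])  # ['mid', 'new']
```

Only the surviving blocks need their index opened and postings intersected, so a query over the last hour never touches week-old data.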

11. Summary

The Prometheus TSDB is designed around the characteristics of time series data. Delta-of-delta and XOR encoding bring storage down to roughly 1 to 2 bytes per sample (the Gorilla paper reports 1.37), the WAL ensures durability, and compaction into progressively larger blocks keeps queries efficient.

In the next post, we will analyze the PromQL engine internals, covering the lexer and parser, AST structure, and query evaluation engine behavior.