etcd Storage Engine: BoltDB and MVCC

This post examines the internal structure of etcd's storage engine responsible for data storage and version management. We analyze BoltDB's (bbolt) B+ tree-based storage mechanism and MVCC's multi-version management in detail.

1. BoltDB (bbolt) Internal Structure

1.1 B+ Tree Overview

BoltDB uses a B+ tree as its core data structure with these characteristics:

All data stored in leaf nodes
Internal nodes contain only keys for branching decisions
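Leaf nodes are linked as a list in a textbook B+ tree for efficient range scans; bbolt itself stores no sibling pointers, and its cursor walks a range using a stack of the path from the root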
Balanced tree where all leaves are at the same depth

1.2 Page Types

BoltDB uses 4 page types:

Meta Page: Database metadata (version, page size, root bucket, etc.). Two meta pages are updated alternately
Freelist Page: List of available (freed) pages
Branch Page: B+ tree internal nodes storing keys and child page pointers
Leaf Page: B+ tree leaf nodes storing key-value pairs or sub-bucket information

Page Layout:
+----------+-------+-------+----------+
| Page ID  | Flags | Count | Overflow |
+----------+-------+-------+----------+
| Ptr/Data | Ptr/Data | ...           |
+----------+-------+-------+----------+
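
As a rough illustration of the header fields, here is a minimal Go sketch. The flag values follow the constants in bbolt's source; the struct name pageHeader and the helper pageType are our own names for this example, not part of bbolt's API.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Page type flags, using the values defined in bbolt's source.
const (
	branchPageFlag   = 0x01
	leafPageFlag     = 0x02
	metaPageFlag     = 0x04
	freelistPageFlag = 0x10
)

// pageHeader is a hypothetical mirror of the fixed header that starts a page.
type pageHeader struct {
	id       uint64 // page ID
	flags    uint16 // one of the page type flags above
	count    uint16 // number of elements stored in the page
	overflow uint32 // extra contiguous pages used when the data exceeds one page
}

// pageType returns a readable name for a page type flag.
func pageType(flags uint16) string {
	switch {
	case flags&branchPageFlag != 0:
		return "branch"
	case flags&leafPageFlag != 0:
		return "leaf"
	case flags&metaPageFlag != 0:
		return "meta"
	case flags&freelistPageFlag != 0:
		return "freelist"
	}
	return "unknown"
}

func main() {
	h := pageHeader{id: 3, flags: leafPageFlag, count: 12}
	fmt.Println(pageType(h.flags), unsafe.Sizeof(h))
}
```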

1.3 Transaction Model

BoltDB supports ACID transactions:

Read transactions: Multiple concurrent reads possible; snapshot-based reading
Write transactions: Only one at a time; a single writer lock serializes writers without blocking readers
Copy-on-Write: Modified pages are copied to new locations during writes

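// BoltDB transaction usage example; the bucket is created on demand because tx.Bucket returns nil when it is missing
if err := db.Update(func(tx *bolt.Tx) error {
    b, err := tx.CreateBucketIfNotExists([]byte("myBucket"))
    if err != nil {
        return err
    }
    return b.Put([]byte("key"), []byte("value"))
}); err != nil {
    log.Fatal(err)
}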

1.4 Copy-on-Write Mechanism

BoltDB writes copy pages to new locations rather than modifying existing ones:

Start write transaction
Copy pages requiring modification to new locations
Perform modifications on copied pages
Write the new data pages to disk and fsync them
Write the alternate meta page, pointing to the new root, and fsync it to complete the commit

This allows read transactions to safely read previous snapshots.
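
The idea can be sketched with a toy in-memory store. This is only a model under simplifying assumptions: pages are strings in a map, there is no tree and no disk I/O, and the names db, meta, update, and newDB are invented for the example. What it keeps from BoltDB is that old pages are never overwritten and that the two meta slots are written alternately.

```go
package main

import "fmt"

// meta records which root page a transaction published.
type meta struct {
	txid int // transaction that wrote this meta
	root int // page ID of the root for that transaction
}

// db is a toy copy-on-write store: pages are never modified in place.
type db struct {
	pages map[int]string // page ID -> contents
	meta  [2]meta        // two meta slots, written alternately
	next  int            // next unused page ID
}

func newDB() *db {
	return &db{pages: map[int]string{0: "v0"}, next: 1}
}

// current returns the meta slot with the highest transaction ID.
func (d *db) current() meta {
	if d.meta[0].txid > d.meta[1].txid {
		return d.meta[0]
	}
	return d.meta[1]
}

// update writes value to a fresh page, then publishes it by
// overwriting the older of the two meta slots.
func (d *db) update(value string) {
	cur := d.current()
	id := d.next
	d.next++
	d.pages[id] = value // the previous root page is left untouched
	txid := cur.txid + 1
	d.meta[txid%2] = meta{txid: txid, root: id}
}

func main() {
	d := newDB()
	snapshot := d.current() // a reader pins the meta it started with
	d.update("v1")
	fmt.Println(d.pages[snapshot.root], d.pages[d.current().root])
}
```

The reader that pinned the old meta still resolves to the old page after the update, which is the property that makes snapshot reads safe.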

2. MVCC Detailed Analysis

2.1 Revision Concept

The most important concept in etcd's MVCC is the Revision:

Global monotonically increasing counter
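Increments by 1 for every transaction that modifies the key space (a standalone Put or Delete counts as one transaction); read-only transactions do not advance it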
Each revision consists of main and sub parts

Revision = (main, sub)
main: Transaction number (globally increasing)
sub: Operation number within transaction (starts from 0)

Example:
Put("a", "1")  -> revision (2, 0)
Txn:
  Put("b", "2")  -> revision (3, 0)
  Put("c", "3")  -> revision (3, 1)
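
The numbering in the example can be reproduced with a small sketch. The store type and its txn method are hypothetical names for this illustration; the only behavior taken from etcd is that a fresh store starts at revision 1 and that all writes in one transaction share the same main revision.

```go
package main

import "fmt"

type revision struct {
	main int64
	sub  int64
}

// store tracks the current revision; a fresh store starts at 1.
type store struct {
	currentRev int64
}

// txn applies n write operations as one transaction and returns the
// revision assigned to each of them.
func (s *store) txn(n int) []revision {
	if n == 0 {
		return nil // read-only transactions do not advance the revision
	}
	s.currentRev++
	revs := make([]revision, n)
	for i := range revs {
		revs[i] = revision{main: s.currentRev, sub: int64(i)}
	}
	return revs
}

func main() {
	s := &store{currentRev: 1}
	fmt.Println(s.txn(1)) // Put("a", "1")
	fmt.Println(s.txn(2)) // Txn with Put("b", "2") and Put("c", "3")
}
```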

2.2 Key Index

The Key Index maps key names to all revision information for that key:

// keyIndex structure (simplified)
type keyIndex struct {
    key         []byte
    modified    revision    // last modified revision
    generations []generation
}

type generation struct {
    ver     int64       // version within current generation
    created revision    // generation creation revision
    revs    []revision  // all revisions in this generation
}

Key lifecycle:

Key creation (Put) -> New generation starts
Key modification (Put) -> Revision added to current generation
Key deletion (Delete) -> Tombstone added to current generation, generation ends
Key recreation (Put) -> New generation starts
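
The lifecycle above can be sketched on top of the simplified structs. The put and tombstone methods below are a reduced illustration, not etcd's implementation: they skip compaction, error reporting, and concurrency, and they represent a closed generation by appending an empty one for a possible re-creation.

```go
package main

import "fmt"

type revision struct{ main, sub int64 }

type generation struct {
	ver     int64
	created revision
	revs    []revision
}

type keyIndex struct {
	key         []byte
	modified    revision
	generations []generation
}

// put records a write; it opens a generation if none is open.
func (ki *keyIndex) put(main, sub int64) {
	rev := revision{main, sub}
	if len(ki.generations) == 0 {
		ki.generations = append(ki.generations, generation{})
	}
	g := &ki.generations[len(ki.generations)-1]
	if len(g.revs) == 0 {
		g.created = rev // first write of this generation
	}
	g.revs = append(g.revs, rev)
	g.ver++
	ki.modified = rev
}

// tombstone records a delete and closes the current generation.
// It reports false when the key has no open generation to delete.
func (ki *keyIndex) tombstone(main, sub int64) bool {
	n := len(ki.generations)
	if n == 0 || len(ki.generations[n-1].revs) == 0 {
		return false
	}
	ki.put(main, sub)
	ki.generations = append(ki.generations, generation{})
	return true
}

func main() {
	ki := &keyIndex{key: []byte("a")}
	ki.put(2, 0)       // create
	ki.put(3, 0)       // modify
	ki.tombstone(4, 0) // delete
	ki.put(5, 0)       // re-create
	for i, g := range ki.generations {
		fmt.Println(i, g.created, g.ver, g.revs)
	}
}
```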

2.3 Data Storage in BoltDB

etcd stores data in BoltDB's key bucket as follows:

Key: Revision byte sequence (8-byte big-endian main, a '_' separator, then 8-byte big-endian sub, so keys sort in revision order)
Value: KeyValue protocol buffer (key name, value, create_revision, mod_revision, version, lease, etc.)

BoltDB key bucket:
  key=(2,0) -> KeyValue{key="a", value="1", create_revision=2, mod_revision=2, version=1}
  key=(3,0) -> KeyValue{key="a", value="2", create_revision=2, mod_revision=3, version=2}
  key=(4,0) -> KeyValue{key="b", value="x", create_revision=4, mod_revision=4, version=1}
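
To make the key encoding concrete, here is a small sketch that lays out a revision as 8 bytes of main, a '_' separator, and 8 bytes of sub, all big-endian. The function names encodeRev, decodeRev, and before are chosen for this example and are not etcd's own. Big-endian encoding matters because BoltDB orders keys byte by byte, so numeric order and byte order must agree.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeRev lays out a revision as 8 bytes of main, '_', 8 bytes of sub.
func encodeRev(main, sub int64) []byte {
	b := make([]byte, 17)
	binary.BigEndian.PutUint64(b[0:8], uint64(main))
	b[8] = '_'
	binary.BigEndian.PutUint64(b[9:17], uint64(sub))
	return b
}

// decodeRev is the inverse of encodeRev.
func decodeRev(b []byte) (main, sub int64) {
	return int64(binary.BigEndian.Uint64(b[0:8])), int64(binary.BigEndian.Uint64(b[9:17]))
}

// before reports whether key a sorts ahead of key b in the bucket.
func before(a, b []byte) bool { return bytes.Compare(a, b) < 0 }

func main() {
	k := encodeRev(3, 1)
	fmt.Printf("%x\n", k)
	fmt.Println(before(encodeRev(2, 0), encodeRev(3, 0)), before(encodeRev(3, 0), encodeRev(3, 1)))
}
```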

2.4 Range Query Processing

How a Range query is processed:

Look up the requested key in the Key Index to find its revisions
Select the latest revision or, if the request names a specific revision, the newest revision at or before it
Read the actual data from BoltDB using the selected revision as the key
Return results to client
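
A minimal sketch of this two-step lookup, using the sample data from section 2.3. It assumes a single key per request and ignores sub revisions, tombstones, and compaction; the store type with its index and backend maps is a stand-in for the Key Index and the BoltDB bucket.

```go
package main

import (
	"fmt"
	"sort"
)

// store pairs an index (key name -> ascending main revisions) with a
// backend (main revision -> stored value).
type store struct {
	index   map[string][]int64
	backend map[int64]string
}

// get returns the value of key as of atRev; atRev <= 0 means latest.
func (s *store) get(key string, atRev int64) (string, bool) {
	revs := s.index[key]
	if len(revs) == 0 {
		return "", false
	}
	if atRev <= 0 {
		atRev = revs[len(revs)-1]
	}
	// Find the first revision greater than atRev; the one before it is
	// the newest write visible at atRev.
	i := sort.Search(len(revs), func(i int) bool { return revs[i] > atRev })
	if i == 0 {
		return "", false // the key did not exist yet at atRev
	}
	v, ok := s.backend[revs[i-1]]
	return v, ok
}

func sample() *store {
	return &store{
		index:   map[string][]int64{"a": {2, 3}, "b": {4}},
		backend: map[int64]string{2: "1", 3: "2", 4: "x"},
	}
}

func main() {
	s := sample()
	fmt.Println(s.get("a", 0)) // latest value of "a"
	fmt.Println(s.get("a", 2)) // value of "a" as of revision 2
	fmt.Println(s.get("b", 3)) // "b" did not exist at revision 3
}
```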

3. Compaction

3.1 Need for Compaction

MVCC maintains all versions, causing continuous data growth. Compaction removes old versions before a specified revision to reclaim space.

3.2 Auto-Compaction Modes

etcd supports two auto-compaction modes:

Periodic mode:

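Retains a time window of history and compacts away everything older
Example: --auto-compaction-mode=periodic --auto-compaction-retention=1h (keep roughly the last hour of history)
The compaction target is the revision that was current one retention period ago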

Revision mode:

Maintains history for a specified number of revisions
Example: --auto-compaction-mode=revision --auto-compaction-retention=1000 (keep last 1000 revisions)
Compacts up to current revision minus specified number
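
How each mode picks its target revision can be sketched as follows. revisionTarget follows the subtraction described above. periodicTarget is our own simplification: it assumes the store revision is sampled at a fixed tick and that the retention period spans a whole number of ticks, which is not necessarily how etcd schedules its samples.

```go
package main

import "fmt"

// revisionTarget returns the revision to compact to in revision mode,
// or false when there is not yet more history than the retention.
func revisionTarget(current, retention int64) (int64, bool) {
	target := current - retention
	if target <= 0 {
		return 0, false
	}
	return target, true
}

// periodicTarget returns the revision sampled one retention window ago.
// samples holds the store revision observed at each tick, oldest first,
// and window is the number of ticks in the retention period.
func periodicTarget(samples []int64, window int) (int64, bool) {
	if window < 0 || len(samples) <= window {
		return 0, false // not enough history collected yet
	}
	return samples[len(samples)-1-window], true
}

func main() {
	fmt.Println(revisionTarget(5000, 1000))
	fmt.Println(revisionTarget(800, 1000))
	fmt.Println(periodicTarget([]int64{10, 20, 30, 40}, 2))
}
```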

3.3 Compaction Process

Determine the compaction revision and record it as the scheduled compact revision
Remove unnecessary revisions before that revision from Key Index
Clean up generations of deleted keys (tombstones)
Delete key-value entries at or before that revision from BoltDB, keeping the latest version of each live key
Record the finished compact revision after completion

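// Compaction processing (simplified sketch, not the actual etcd source)
func (s *store) compact(rev int64) {
    // Remove old revisions from Key Index; keep is the set of revisions at or before rev that must survive
    keep := s.kvindex.Compact(rev)

    tx := s.b.BatchTx()
    tx.Lock()
    defer tx.Unlock()

    // Collect stale keys first: a bucket must not be modified while iterating it
    var stale [][]byte
    tx.UnsafeForEach(keyBucketName, func(k, v []byte) error {
        if r := bytesToRev(k); r.main <= rev {
            if _, ok := keep[r]; !ok {
                stale = append(stale, append([]byte(nil), k...))
            }
        }
        return nil
    })
    for _, k := range stale {
        tx.UnsafeDelete(keyBucketName, k)
    }
}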

4. Defragmentation

4.1 Space Issues After Compaction

Deleting data via compaction does not immediately return disk space. BoltDB adds the freed pages to its freelist for reuse by later writes, but it never shrinks the file.

4.2 Defragmentation Process

Defragmentation rewrites the BoltDB file to reclaim unused space:

Create a new temporary BoltDB file
Copy all valid data from existing database to new file
Replace existing file with new file
Result is a reduced file size
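
The copy-and-replace pattern can be sketched with plain files. This is not etcd's defragmentation code: the live slice stands in for "all valid data", records are written back to back without any page structure, and the names defrag and demo are invented for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// defrag rewrites path so that it contains only the live records.
func defrag(path string, live [][]byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), "db.tmp.*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename
	for _, rec := range live {
		if _, err := tmp.Write(rec); err != nil {
			tmp.Close()
			return err
		}
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rename replaces the old file in one step on POSIX filesystems.
	return os.Rename(tmp.Name(), path)
}

// demo builds a mostly empty 1 MiB file, defragments it, and reports
// the file size before and after.
func demo() (before, after int64, err error) {
	dir, err := os.MkdirTemp("", "defrag")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "db")
	if err := os.WriteFile(path, make([]byte, 1<<20), 0o600); err != nil {
		return 0, 0, err
	}
	if err := defrag(path, [][]byte{[]byte("live-record")}); err != nil {
		return 0, 0, err
	}
	st, err := os.Stat(path)
	if err != nil {
		return 0, 0, err
	}
	return 1 << 20, st.Size(), nil
}

func main() {
	before, after, err := demo()
	if err != nil {
		fmt.Println("defrag failed:", err)
		return
	}
	fmt.Println(before, after)
}
```

Because the temporary file sits next to the original, both copies exist on the same disk until the rename, which is why extra space is needed for the duration of the operation.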

4.3 Defragmentation Considerations

Defragmentation blocks reads and writes on the member while it rebuilds the file
Recommended to run one member at a time
Avoid peak hours
Temporarily requires additional disk space

5. Backend Batch Optimization

5.1 Write Batching

etcd batches multiple write operations into a single BoltDB transaction for performance:

Default batch interval: 100ms
Default batch limit: 10000 operations
Batch commits when either interval or limit is reached first
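
The "whichever comes first" rule can be sketched with a small counter. This is a simplified model, not etcd's backend: it checks the interval only when a write arrives and takes the time as a parameter so the behavior is deterministic. The batcher type and the simulate helper are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// batcher counts pending writes and commits when either the operation
// limit or the time interval is reached.
type batcher struct {
	limit      int
	interval   time.Duration
	pending    int
	lastCommit time.Time
	commits    int
}

// write records one operation at time now and commits if a threshold is hit.
func (b *batcher) write(now time.Time) {
	b.pending++
	if b.pending >= b.limit || now.Sub(b.lastCommit) >= b.interval {
		b.commit(now)
	}
}

// commit flushes the pending operations as one backend transaction.
func (b *batcher) commit(now time.Time) {
	if b.pending > 0 {
		b.commits++
	}
	b.pending = 0
	b.lastCommit = now
}

// simulate replays writes separated by the given gaps (in milliseconds)
// and returns how many commits they triggered.
func simulate(limit, intervalMs int, gapsMs []int) int {
	start := time.Unix(0, 0)
	b := &batcher{
		limit:      limit,
		interval:   time.Duration(intervalMs) * time.Millisecond,
		lastCommit: start,
	}
	now := start
	for _, g := range gapsMs {
		now = now.Add(time.Duration(g) * time.Millisecond)
		b.write(now)
	}
	return b.commits
}

func main() {
	fmt.Println(simulate(3, 100, []int{1, 1, 1, 1, 1, 1})) // limit reached twice
	fmt.Println(simulate(10000, 100, []int{50, 60}))       // interval reached once
}
```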

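5.2 Batch Transaction Structure

// batchTx handles read-write transactions as a batch (simplified)
type batchTx struct {
    sync.Mutex
    tx      *bolt.Tx
    backend *backend
    pending int  // number of uncommitted operations
}

// Commit conditions: the batch is committed when pending reaches batchLimit,
// or when batchInterval elapses, whichever comes first.
// safePending reports the current number of uncommitted operations.
func (t *batchTx) safePending() int {
    t.Lock()
    defer t.Unlock()
    return t.pending
}

5.3 Performance Tuning Parameters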

--backend-batch-interval: Batch commit interval (default 100ms)
--backend-batch-limit: Maximum operations per batch (default 10000)
--quota-backend-bytes: Maximum backend DB size (default 2GB; 8GB is the suggested maximum)

6. Storage Monitoring

6.1 Key Metrics

Key metrics to monitor for etcd storage:

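etcd_mvcc_db_total_size_in_bytes: Physically allocated size of the DB file
etcd_mvcc_db_total_size_in_use_in_bytes: Size logically in use; the gap to the total size is what defragmentation can reclaim
etcd_debugging_mvcc_keys_total: Number of stored keys
etcd_debugging_mvcc_db_compaction_keys_total: Number of keys removed by compaction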
etcd_disk_backend_commit_duration_seconds: Backend commit latency
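
The two size metrics are most useful together. As a small sketch, the function below estimates how much of the file defragmentation could give back; the name reclaimable is our own, and any alerting threshold built on the ratio is a choice for the operator rather than something etcd defines.

```go
package main

import "fmt"

// reclaimable returns the bytes and the fraction of the DB file that
// defragmentation could give back, from the two size metrics.
func reclaimable(totalBytes, inUseBytes int64) (int64, float64) {
	if totalBytes <= 0 || inUseBytes > totalBytes {
		return 0, 0
	}
	free := totalBytes - inUseBytes
	return free, float64(free) / float64(totalBytes)
}

func main() {
	// etcd_mvcc_db_total_size_in_bytes = 2 GiB, in use = 512 MiB
	free, ratio := reclaimable(2<<30, 512<<20)
	fmt.Printf("%d bytes reclaimable (%.0f%%)\n", free, ratio*100)
}
```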

6.2 Handling Space Exhaustion

When the etcd backend reaches its quota:

NOSPACE alarm is raised
Write requests are rejected; the cluster accepts only reads and deletes
Compact old revisions, then defragment each member
Clear the alarm with etcdctl alarm disarm
Consider increasing the quota (--quota-backend-bytes)

7. Summary

etcd's storage engine combines BoltDB's stable B+ tree storage with MVCC's multi-version management to achieve both consistency and performance. Proper space management through compaction and defragmentation is critical for operations. The next post covers etcd cluster operations and disaster recovery.