Author: Youngju Kim (@fjvbn20031)
etcd Storage Engine: BoltDB and MVCC
This post examines the internals of etcd's storage engine, the layer responsible for data storage and version management. We look in detail at BoltDB's (bbolt's) B+ tree-based storage mechanism and at how MVCC manages multiple versions of each key.
1. BoltDB (bbolt) Internal Structure
1.1 B+ Tree Overview
BoltDB uses a B+ tree as its core data structure with these characteristics:
- All data stored in leaf nodes
- Internal nodes contain only keys for branching decisions
- Leaf nodes linked as a linked list for efficient range scans
- Balanced tree where all leaves are at the same depth
1.2 Page Types
BoltDB uses four page types:
- Meta Page: Database metadata (version, page size, root bucket, etc.). Two meta pages are updated alternately
- Freelist Page: List of available (freed) pages
- Branch Page: B+ tree internal nodes storing keys and child page pointers
- Leaf Page: B+ tree leaf nodes storing key-value pairs or sub-bucket information
Page Layout:

```
+----------+----------+-------+
| Page ID  | Flags    | Count |
+----------+----------+-------+
| Ptr/Data | Ptr/Data | ...   |
+----------+----------+-------+
```
1.3 Transaction Model
BoltDB supports ACID transactions:
- Read transactions: Multiple concurrent reads possible; snapshot-based reading
- Write transactions: Only one at a time; exclusive lock on entire database
- Copy-on-Write: Modified pages are copied to new locations during writes
```go
// BoltDB transaction usage example: Update runs the function in a
// single writable transaction and commits if it returns nil.
db.Update(func(tx *bolt.Tx) error {
	// tx.Bucket returns nil for a missing bucket, so create it on first use.
	b, err := tx.CreateBucketIfNotExists([]byte("myBucket"))
	if err != nil {
		return err
	}
	return b.Put([]byte("key"), []byte("value"))
})
```
1.4 Copy-on-Write Mechanism
Rather than modifying pages in place, BoltDB copies them to new locations on write:
- Start write transaction
- Copy pages requiring modification to new locations
- Perform modifications on copied pages
- Update meta page to point to new root
- fsync new meta page to disk at commit
This allows read transactions to safely read previous snapshots.
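The steps above can be condensed into a toy model: a write allocates a fresh page for the modified data and only then flips the "meta" root pointer, so a reader that pinned the old root keeps seeing the old page. (A sketch with an int-keyed page map standing in for bbolt's real 4 KB page file.)

```go
package main

import "fmt"

type page struct{ data string }

// Toy copy-on-write store: pages are never mutated in place.
type db struct {
	pages  map[int]*page
	root   int // "meta page": id of the current root
	nextID int
}

// write puts the modified data in a newly allocated page, then
// atomically points the meta root at it (steps 2-4 above).
func (d *db) write(newData string) {
	d.nextID++
	d.pages[d.nextID] = &page{data: newData} // copy-on-write allocation
	d.root = d.nextID                        // meta update: flip the root
}

func main() {
	d := &db{pages: map[int]*page{1: {data: "v1"}}, root: 1, nextID: 1}
	snapshot := d.root // a read transaction pins the old root
	d.write("v2")      // the write never touches page 1
	fmt.Println(d.pages[snapshot].data, d.pages[d.root].data) // v1 v2
}
```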
2. MVCC Detailed Analysis
2.1 Revision Concept
The most important concept in etcd's MVCC is the Revision:
- Global monotonically increasing counter
- Increments by 1 for every transaction (write operation)
- Each revision consists of main and sub parts
```
Revision = (main, sub)
  main: transaction number (globally increasing)
  sub:  operation number within the transaction (starts at 0)
```

Example:

```
Put("a", "1")   -> revision (2, 0)
Txn:
  Put("b", "2") -> revision (3, 0)
  Put("c", "3") -> revision (3, 1)
```
2.2 Key Index
The Key Index maps key names to all revision information for that key:
```go
// keyIndex structure (simplified)
type keyIndex struct {
	key         []byte
	modified    revision // last modified revision
	generations []generation
}

type generation struct {
	ver     int64      // version within current generation
	created revision   // revision at which this generation was created
	revs    []revision // all revisions in this generation
}
```
Key lifecycle:
- Key creation (Put) -> New generation starts
- Key modification (Put) -> Revision added to current generation
- Key deletion (Delete) -> Tombstone added to current generation, generation ends
- Key recreation (Put) -> New generation starts
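This lifecycle can be exercised with a pared-down stand-in for keyIndex (plain int64 revisions, no version or created tracking): a Put on a tombstoned key opens a new generation, and a Delete appends the tombstone revision and closes the current one.

```go
package main

import "fmt"

// gen is a stripped-down generation: its revisions plus a flag
// marking whether it ended in a tombstone.
type gen struct {
	revs      []int64
	tombstone bool
}

type keyHistory struct{ generations []*gen }

// put records a write; it opens a new generation if the key is new
// or the previous generation ended with a tombstone.
func (h *keyHistory) put(rev int64) {
	n := len(h.generations)
	if n == 0 || h.generations[n-1].tombstone {
		h.generations = append(h.generations, &gen{})
	}
	g := h.generations[len(h.generations)-1]
	g.revs = append(g.revs, rev)
}

// del appends the tombstone revision and ends the current generation.
func (h *keyHistory) del(rev int64) {
	g := h.generations[len(h.generations)-1]
	g.revs = append(g.revs, rev)
	g.tombstone = true
}

func main() {
	h := &keyHistory{}
	h.put(2) // create   -> generation 1 starts
	h.put(3) // modify   -> appended to generation 1
	h.del(4) // delete   -> tombstone, generation 1 ends
	h.put(5) // recreate -> generation 2 starts
	fmt.Println(len(h.generations)) // 2
}
```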
2.3 Data Storage in BoltDB
etcd stores data in BoltDB's key bucket as follows:
- Key: Revision byte sequence (main + sub encoded as bytes)
- Value: KeyValue protocol buffer (key name, value, create_revision, mod_revision, version, lease, etc.)
BoltDB key bucket:

```
key=(2,0) -> KeyValue{key="a", value="1", create_revision=2, mod_revision=2, version=1}
key=(3,0) -> KeyValue{key="a", value="2", create_revision=2, mod_revision=3, version=2}
key=(4,0) -> KeyValue{key="b", value="x", create_revision=4, mod_revision=4, version=1}
```
2.4 Range Query Processing
How a Range query is processed:
- Look up the latest revision for the requested key in the Key Index
- If a specific revision is requested, look up that revision
- Read actual data from BoltDB using revision as key
- Return results to client
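The steps above amount to a two-stage lookup: the in-memory key index resolves a key name to a revision, and that revision is then the lookup key into the BoltDB bucket. A sketch with plain maps standing in for both structures (names are illustrative, not etcd's API):

```go
package main

import "fmt"

// rangeGet resolves key -> revision in the index (step 1), then reads
// the value by revision from the bucket (step 3) and returns it (step 4).
func rangeGet(index map[string]int64, bucket map[int64]string, key string) (string, bool) {
	rev, ok := index[key]
	if !ok {
		return "", false // key never written (or compacted away)
	}
	v, ok := bucket[rev]
	return v, ok
}

func main() {
	// In-memory key index: key name -> latest mod revision.
	index := map[string]int64{"a": 3, "b": 4}
	// Stand-in for the BoltDB key bucket: revision -> stored value.
	bucket := map[int64]string{2: "1", 3: "2", 4: "x"}
	fmt.Println(rangeGet(index, bucket, "a")) // 2 true
}
```

For a revision-pinned read, step 2 would pick the largest indexed revision less than or equal to the requested one instead of the latest.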
3. Compaction
3.1 Need for Compaction
MVCC maintains all versions, causing continuous data growth. Compaction removes old versions before a specified revision to reclaim space.
3.2 Auto-Compaction Modes
etcd supports two auto-compaction modes:
Periodic mode:
- Runs compaction at specified time intervals
- Example: --auto-compaction-retention=1h (every hour)
- Based on elapsed time since last compaction
Revision mode:
- Maintains history for a specified number of revisions
- Example: --auto-compaction-retention=1000 (keep last 1000 revisions)
- Compacts up to current revision minus specified number
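The revision-mode target is simple arithmetic: compact everything older than the current revision minus the retention count, unless history is still shorter than the retention window. (Illustrative helper, not etcd's compactor code.)

```go
package main

import "fmt"

// compactTarget returns the revision up to which history may be
// compacted in revision mode, or 0 when there is nothing to compact.
func compactTarget(currentRev, retention int64) int64 {
	t := currentRev - retention
	if t < 1 {
		return 0 // history is still shorter than the retention window
	}
	return t
}

func main() {
	fmt.Println(compactTarget(5000, 1000)) // 4000: keep the last 1000 revisions
	fmt.Println(compactTarget(500, 1000))  // 0: nothing to compact yet
}
```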
3.3 Compaction Process
- Determine compaction revision
- Remove unnecessary revisions before that revision from Key Index
- Clean up generations of deleted keys (tombstones)
- Delete key-value entries before that revision from BoltDB
- Update scheduled compact revision after completion
```go
// Compaction processing (simplified)
func (s *store) compact(rev int64) {
	// Drop old revisions from the in-memory key index; keep is the
	// set of revisions that must survive (e.g. each key's latest).
	keep := s.kvindex.Compact(rev)

	// Delete every BoltDB entry the index no longer needs.
	tx := s.b.BatchTx()
	tx.UnsafeForEach(keyBucketName, func(k, v []byte) error {
		if _, ok := keep[bytesToRev(k)]; !ok {
			tx.UnsafeDelete(keyBucketName, k)
		}
		return nil
	})
}
```
4. Defragmentation
4.1 Space Issues After Compaction
Due to BoltDB's Copy-on-Write nature, deleting data via compaction does not immediately return disk space. Deleted pages are added to the freelist for reuse but the file size does not shrink.
4.2 Defragmentation Process
Defragmentation rewrites the BoltDB file to reclaim unused space:
- Create a new temporary BoltDB file
- Copy all valid data from existing database to new file
- Replace existing file with new file
- Result is a reduced file size
4.3 Defragmentation Considerations
- Write performance may degrade during defragmentation
- Recommended to run one member at a time
- Avoid peak hours
- Temporarily requires additional disk space
5. Backend Batch Optimization
5.1 Write Batching
etcd batches multiple write operations into a single BoltDB transaction for performance:
- Default batch interval: 100ms
- Default batch limit: 10000 operations
- The batch commits when either the interval elapses or the limit is reached, whichever comes first
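The size-triggered half of this can be sketched deterministically: operations accumulate in one pending transaction and a commit fires when the batch limit is hit. (In etcd a ticker also forces a commit every interval; the timer path is omitted here for determinism, and the names are illustrative.)

```go
package main

import "fmt"

// batchTx accumulates writes and commits them as one BoltDB
// transaction once the batch limit is reached.
type batchTx struct {
	limit   int // max operations per batch (etcd default: 10000)
	pending int // operations waiting in the current transaction
	commits int // completed (fsync'd) transactions
}

// unsafePut records one write and commits if the limit is hit.
// A background ticker (etcd default: 100ms) would also call commit.
func (t *batchTx) unsafePut() {
	t.pending++
	if t.pending >= t.limit {
		t.commit()
	}
}

func (t *batchTx) commit() {
	if t.pending == 0 {
		return // nothing buffered; skip the fsync
	}
	t.pending = 0
	t.commits++ // one fsync'd BoltDB transaction per commit
}

func main() {
	tx := &batchTx{limit: 3}
	for i := 0; i < 7; i++ {
		tx.unsafePut()
	}
	fmt.Println(tx.commits, tx.pending) // 2 1
}
```

Batching trades a little durability latency for far fewer fsyncs, which is why the commit latency metric in section 6.1 is worth watching when tuning these flags.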
5.2 Performance Tuning Parameters
- --backend-batch-interval: Batch commit interval (default 100ms)
- --backend-batch-limit: Maximum operations per batch (default 10000)
- --quota-backend-bytes: Maximum backend DB size (default 2 GB; 8 GB is the suggested upper limit)
6. Storage Monitoring
6.1 Key Metrics
Key metrics to monitor for etcd storage:
- etcd_mvcc_db_total_size_in_bytes: Current DB file size
- etcd_mvcc_db_total_size_in_use_in_bytes: Actual size in use
- etcd_debugging_mvcc_keys_total: Number of stored keys
- etcd_disk_backend_commit_duration_seconds: Backend commit latency
6.2 Handling Space Exhaustion
When etcd backend reaches its quota:
- NOSPACE alarm is raised
- Write requests are rejected
- Perform compaction and defragmentation
- Clear alarm with etcdctl alarm disarm
- Consider increasing quota (--quota-backend-bytes)
7. Summary
etcd's storage engine combines BoltDB's stable B+ tree storage with MVCC's multi-version management to achieve both consistency and performance. Proper space management through compaction and defragmentation is critical for operations. The next post covers etcd cluster operations and disaster recovery.