- Author: Youngju Kim (@fjvbn20031)
etcd Architecture Internals: Raft Consensus Algorithm
etcd is a strongly consistent distributed key-value store that plays a critical role in distributed systems. All Kubernetes cluster state is stored in etcd, and its reliability is built on the Raft consensus algorithm.
1. Raft Consensus Algorithm Overview
Raft is a consensus algorithm designed for understandability, providing safety equivalent to Paxos while being significantly simpler to implement.
1.1 Core Concepts
Raft decomposes consensus into three sub-problems:
- Leader Election: Electing a single leader in the cluster
- Log Replication: Leader replicates log entries to followers
- Safety: Ensures committed entries are never lost
1.2 Node States
Each node in a Raft cluster is in one of three states:
- Leader: Handles client requests and replicates logs
- Follower: Passive role, responds to leader requests
- Candidate: Intermediate state participating in leader election
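The three states and their legal transitions can be sketched as a small state machine. This is an illustrative model, not etcd's actual identifiers (etcd's raft package uses names like `StateFollower`):

```go
package main

import "fmt"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// legalTransitions captures Raft's allowed state changes: a Follower
// times out and becomes a Candidate; a Candidate wins an election and
// becomes Leader, loses (or sees a valid leader) and reverts to
// Follower, or restarts the election; a Leader steps down to Follower
// when it observes a higher term.
var legalTransitions = map[State][]State{
	Follower:  {Candidate},
	Candidate: {Leader, Follower, Candidate},
	Leader:    {Follower},
}

func canTransition(from, to State) bool {
	for _, s := range legalTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Follower, Candidate)) // true
	fmt.Println(canTransition(Follower, Leader))    // false: must pass through Candidate
}
```

Note that a node can never jump from Follower directly to Leader; every leadership change goes through an election.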
1.3 Terms
Raft divides time into logical units called Terms. Each Term begins with an election, and if successful, one leader exists for that Term. Terms are monotonically increasing, and messages from older Terms are ignored.
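The term rules above boil down to a short comparison on every incoming message. A minimal sketch (field names are illustrative):

```go
package main

import "fmt"

// node holds the minimal per-node term state.
type node struct {
	term     int
	isLeader bool
}

// handleTerm applies Raft's term rules to an incoming message:
// messages from older terms are dropped; a newer term forces the node
// to adopt it and step down to follower. Returns false if the message
// should be ignored.
func (n *node) handleTerm(msgTerm int) bool {
	if msgTerm < n.term {
		return false // stale message from a previous term: ignore
	}
	if msgTerm > n.term {
		n.term = msgTerm
		n.isLeader = false // any leader seeing a higher term steps down
	}
	return true
}

func main() {
	n := &node{term: 5, isLeader: true}
	fmt.Println(n.handleTerm(3)) // false: term 3 < 5, message ignored
	fmt.Println(n.handleTerm(7)) // true: node adopts term 7 and steps down
	fmt.Println(n.term, n.isLeader)
}
```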
2. Leader Election Details
2.1 Election Process
- A Follower transitions to Candidate after not receiving heartbeats for an election timeout period
- The Candidate increments the current Term, votes for itself, and sends RequestVote RPCs to other nodes
- If it receives majority votes, it becomes leader
- Upon becoming leader, it immediately sends empty heartbeats to establish authority
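The voter's side of RequestVote can be sketched as follows. Beyond "one vote per term", Raft's safety argument also requires the voter to refuse candidates whose log is behind its own; the names here are illustrative, not etcd's:

```go
package main

import "fmt"

type voter struct {
	term        int
	votedFor    string // "" means no vote cast in this term
	lastLogTerm int
	lastLogIdx  int
}

// grantVote implements the follower side of RequestVote (simplified):
// at most one vote per term, and only for a candidate whose log is at
// least as up-to-date as the voter's own.
func (v *voter) grantVote(candID string, candTerm, candLastTerm, candLastIdx int) bool {
	if candTerm < v.term {
		return false // stale candidate
	}
	if candTerm > v.term {
		v.term, v.votedFor = candTerm, "" // a new term resets the vote
	}
	if v.votedFor != "" && v.votedFor != candID {
		return false // already voted for someone else this term
	}
	upToDate := candLastTerm > v.lastLogTerm ||
		(candLastTerm == v.lastLogTerm && candLastIdx >= v.lastLogIdx)
	if !upToDate {
		return false // candidate's log is behind: refuse the vote
	}
	v.votedFor = candID
	return true
}

func main() {
	v := &voter{term: 2, lastLogTerm: 2, lastLogIdx: 10}
	fmt.Println(v.grantVote("n2", 3, 2, 10)) // true: new term, log up to date
	fmt.Println(v.grantVote("n3", 3, 2, 12)) // false: already voted for n2 in term 3
}
```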
2.2 Election Timeout
Election timeouts are randomized so that multiple nodes rarely become Candidates at the same time. In etcd, the base election timeout defaults to 1000ms (--election-timeout), and each node randomizes its effective timeout to a value between 1x and 2x the base.
2.3 Pre-Vote Mechanism
etcd implements Pre-Vote to prevent unnecessary leader changes after network partition recovery. Before becoming a Candidate, a node first requests Pre-Votes, and only begins the actual election if a majority agrees.
```go
// Simplified raft state transition in etcd
// Follower -> PreCandidate -> Candidate -> Leader
func (r *raft) tickElection() {
	r.electionElapsed++
	if r.electionElapsed >= r.randomizedElectionTimeout {
		r.electionElapsed = 0
		r.Step(pb.Message{Type: pb.MsgHup})
	}
}
```
2.4 Learner Nodes
Since etcd 3.4, Learner nodes are supported. Learners are non-voting members that only receive log replication. Keeping new members as Learners until they catch up with existing logs avoids impacting cluster availability.
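The reason learners are safe to add is that they never count toward quorum. A minimal sketch of that rule (types are illustrative):

```go
package main

import "fmt"

type member struct {
	id        string
	isLearner bool
}

// quorum returns the number of acknowledgements needed to commit,
// counting only voting members: learners replicate the log but do not
// count toward majorities, so adding one never changes the quorum size.
func quorum(members []member) int {
	voters := 0
	for _, m := range members {
		if !m.isLearner {
			voters++
		}
	}
	return voters/2 + 1
}

func main() {
	cluster := []member{{"a", false}, {"b", false}, {"c", false}}
	fmt.Println(quorum(cluster)) // 2: majority of 3 voters

	// A new member joins as a learner: quorum is unchanged until promotion.
	cluster = append(cluster, member{"d", true})
	fmt.Println(quorum(cluster)) // still 2
}
```

If the new member joined as a voter immediately, a 3-node cluster would become a 4-node cluster needing 3 acknowledgements while the newcomer is still far behind, hurting availability.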
3. Log Replication
3.1 Log Entry Structure
Each log entry contains:
- Index: Position in the log (monotonically increasing)
- Term: The Term when the entry was created
- Data: Command to apply to the state machine (e.g., Put key=value)
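The three fields above map directly onto a struct. This is an illustrative shape; etcd's raft actually uses a protobuf-generated Entry type:

```go
package main

import "fmt"

// Entry mirrors the three fields described above.
type Entry struct {
	Index uint64 // position in the log, monotonically increasing
	Term  uint64 // term in which the leader created the entry
	Data  []byte // opaque command, e.g. an encoded Put request
}

func main() {
	log := []Entry{
		{Index: 1, Term: 1, Data: []byte("PUT key=a val=1")},
		{Index: 2, Term: 1, Data: []byte("PUT key=b val=2")},
	}
	fmt.Println(len(log), log[1].Index, log[1].Term)
}
```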
3.2 Replication Process
- Client sends request to leader
- Leader appends log entry to its log
- Leader replicates to followers via AppendEntries RPC
- Entry is committed once a majority of the cluster (the leader plus enough followers) has stored it
- Committed entry is applied to the state machine
- Response returned to client
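The commit step above can be sketched as a computation over the leader's view of each node's replication progress (often called matchIndex). A sketch, not etcd's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex computes the highest log index stored on a majority of
// nodes, given each node's match index (the leader's own last index is
// included in the slice). After sorting descending, the value at
// position quorum-1 is guaranteed to be on at least quorum nodes.
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// 5-node cluster: leader at index 9, followers at 9, 7, 4, 4.
	// Index 7 is stored on 3 of 5 nodes, so it is committed.
	fmt.Println(commitIndex([]uint64{9, 9, 7, 4, 4})) // 7
}
```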
3.3 Log Consistency Guarantees
Raft ensures log consistency through two properties:
- Log Matching: Two entries with the same index and term store the same command
- Leader Completeness: Committed entries exist in all subsequent leaders' logs
```
Leader Log:  [1:1][2:1][3:2][4:2][5:3]
                                    ^-- committed up to here
Follower A:  [1:1][2:1][3:2][4:2][5:3]  (up to date)
Follower B:  [1:1][2:1][3:2]            (lagging, will catch up)
```
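The Log Matching property is enforced by a consistency check in AppendEntries: the follower accepts new entries only if its log already matches the leader's at the point just before them. A sketch with 1-based indices as in the Raft paper:

```go
package main

import "fmt"

type entry struct{ term uint64 }

// acceptAppend performs the AppendEntries consistency check: the
// follower accepts new entries only if its log contains an entry at
// prevLogIndex whose term equals prevLogTerm. Index 0 (before the
// first entry) always matches.
func acceptAppend(log []entry, prevLogIndex, prevLogTerm uint64) bool {
	if prevLogIndex == 0 {
		return true
	}
	if prevLogIndex > uint64(len(log)) {
		return false // follower's log is too short
	}
	return log[prevLogIndex-1].term == prevLogTerm
}

func main() {
	// A lagging follower holding [1:1][2:1][3:2]:
	followerB := []entry{{1}, {1}, {2}}
	fmt.Println(acceptAppend(followerB, 3, 2)) // true: entry 3 has term 2
	fmt.Println(acceptAppend(followerB, 4, 2)) // false: leader must retry from index 3
}
```

On rejection, the leader decrements its notion of the follower's progress and retries, so the follower eventually catches up exactly where its log diverged.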
4. etcd Server Architecture
4.1 EtcdServer Structure
EtcdServer is the core struct of etcd, coordinating the following components:
- RaftNode: Handles Raft consensus logic
- Backend: BoltDB-based storage
- KV Store: MVCC key-value store
- Lessor: Lease management
- Watchable Store: Watch mechanism
4.2 Request Processing Flow
How a client Put request is processed:
- gRPC server receives the request
- Authentication and authorization check
- Request proposed to Raft
- Committed after Raft consensus
- Apply loop applies committed entry to state machine
- Persisted to backend (BoltDB)
- Response returned to client
4.3 Apply Loop
The apply loop is etcd's core loop, applying Raft-committed entries to the state machine in order:
// Simplified apply loop structure
func (s *EtcdServer) applyAll(ep *etcdProgress, apply *apply) {
s.applySnapshot(ep, apply)
s.applyEntries(ep, apply)
// Update index after application completes
s.setAppliedIndex(apply.snapshot.Metadata.Index)
}
4.4 Linearizable Read
etcd serves Linearizable Reads by default. For a read request, the leader first confirms with a quorum that it is still the current leader (the ReadIndex mechanism, backed by a round of heartbeats), which prevents a deposed leader from serving stale reads.
With the Serializable Read option, a member serves the read from its local state without leader confirmation, offering better performance but potentially returning stale data.
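The ReadIndex rule can be reduced to two conditions: leadership is quorum-confirmed, and the serving member has applied at least up to the commit index recorded when the read arrived. A sketch under those assumptions:

```go
package main

import "fmt"

// readIndexReady sketches the ReadIndex rule: after the leader records
// readIndex = commitIndex and re-confirms leadership with a quorum, a
// member may serve the read once its applied index reaches readIndex.
func readIndexReady(appliedIndex, readIndex uint64, quorumConfirmed bool) bool {
	return quorumConfirmed && appliedIndex >= readIndex
}

func main() {
	readIndex := uint64(42) // commit index at the moment the read arrived
	fmt.Println(readIndexReady(40, readIndex, true))  // false: must apply up to 42 first
	fmt.Println(readIndexReady(42, readIndex, true))  // true: safe to serve the read
	fmt.Println(readIndexReady(42, readIndex, false)) // false: leadership not confirmed
}
```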
5. Key-Value Store: BoltDB and MVCC
5.1 BoltDB (bbolt) Backend
etcd uses bbolt, a fork of BoltDB maintained under the etcd project, as its backend store. bbolt is a B+ tree-based embedded key-value database featuring:
- Read-optimized B+ tree structure
- Page-based storage (typically 4KB, matching the OS page size)
- ACID transaction support
- Single-writer, multiple-reader model
5.2 MVCC (Multi-Version Concurrency Control)
etcd uses MVCC to maintain all versions of keys:
- Revision: Global monotonically increasing counter; the main revision increments once per write transaction, with a sub-revision ordering changes within a transaction
- Key Index: Mapping from key name to revision list
- Generation: Lifecycle of a key from creation to deletion
```
Key: "/app/config"
Generation 1: [create:rev5] [modify:rev8] [modify:rev12] [tombstone:rev15]
Generation 2: [create:rev20] [modify:rev25]
                                  ^-- current
```
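The generation structure above can be modeled with a small sketch of a key index. This is a simplification of etcd's mvcc keyIndex; names are illustrative:

```go
package main

import "fmt"

// generation models one create-to-tombstone lifecycle of a key;
// revs holds the revisions at which the key was written.
type generation struct {
	revs      []int64
	tombstone bool // true once the key was deleted in this generation
}

type keyIndex struct {
	key  string
	gens []generation
}

// latestRev returns the newest revision of the key and whether the key
// currently exists (its last generation has no tombstone yet).
func (ki *keyIndex) latestRev() (rev int64, exists bool) {
	if len(ki.gens) == 0 {
		return 0, false
	}
	g := ki.gens[len(ki.gens)-1]
	if len(g.revs) == 0 {
		return 0, false
	}
	return g.revs[len(g.revs)-1], !g.tombstone
}

func main() {
	ki := &keyIndex{
		key: "/app/config",
		gens: []generation{
			{revs: []int64{5, 8, 12, 15}, tombstone: true}, // deleted at rev 15
			{revs: []int64{20, 25}},                        // recreated at rev 20
		},
	}
	rev, ok := ki.latestRev()
	fmt.Println(rev, ok) // 25 true
}
```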
5.3 BoltDB Internal Bucket Structure
etcd maintains two main buckets in BoltDB:
- key bucket: Stores actual key-value data with revision as key
- meta bucket: Stores metadata (consistent index, scheduled compact revision, etc.)
6. Watch Mechanism
6.1 Watchable Store
etcd's Watch streams changes to keys or key ranges in real-time:
- synced watchers: Caught up to current revision, receiving new events immediately
- unsynced watchers: Not yet caught up, replaying events from history
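The synced/unsynced split can be sketched as a dispatch rule: only synced watchers receive new events directly, while unsynced watchers pick them up during history replay. A simplification of etcd's watchable store, with illustrative types:

```go
package main

import "fmt"

type event struct {
	key string
	rev int64
}

type watcher struct {
	key      string
	startRev int64
	synced   bool
}

// dispatch delivers an event to every synced watcher whose key matches
// and whose start revision is not ahead of the event. Unsynced watchers
// are skipped here: they are still replaying older events from the
// MVCC history and will encounter this event during catch-up.
func dispatch(ws []watcher, ev event) (delivered int) {
	for _, w := range ws {
		if w.synced && w.key == ev.key && ev.rev >= w.startRev {
			delivered++
		}
	}
	return delivered
}

func main() {
	ws := []watcher{
		{key: "/app/config", startRev: 1, synced: true},
		{key: "/app/config", startRev: 1, synced: false}, // still catching up
		{key: "/other", startRev: 1, synced: true},
	}
	fmt.Println(dispatch(ws, event{key: "/app/config", rev: 30})) // 1
}
```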
6.2 Event Generation
When a write operation occurs in the MVCC store:
- Events are generated when the transaction commits
- The watchable store delivers events to relevant watchers
- Events are sent to clients via gRPC streams
6.3 Watch Recovery
When a client reconnects after disconnection, it can resume the Watch from the last received revision. If that revision has already been compacted, an error is returned and the client must reload all data.
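The resume decision reduces to one comparison against the compacted revision. A sketch (the error name is illustrative; etcd's actual error is its own type):

```go
package main

import (
	"errors"
	"fmt"
)

var errCompacted = errors.New("required revision has been compacted")

// resumeWatch checks whether a watch can resume from the client's last
// seen revision: if that revision predates the compacted revision, the
// intervening history is gone and the client must re-read the full
// current state before watching again.
func resumeWatch(lastSeenRev, compactedRev int64) error {
	if lastSeenRev < compactedRev {
		return errCompacted
	}
	return nil
}

func main() {
	fmt.Println(resumeWatch(120, 100)) // nil: resume from revision 120
	fmt.Println(resumeWatch(80, 100))  // compacted: reload, then re-watch
}
```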
7. Network and gRPC Communication
7.1 Peer Communication
Communication between etcd cluster members uses two channels:
- Stream: Long-lived connections for Raft messages (heartbeats, append entries, etc.)
- Pipeline: Large data transfers (snapshots, etc.)
7.2 Client Communication
Clients communicate with etcd via gRPC. Key services:
- KV: Put, Range, DeleteRange, Txn
- Watch: Watch
- Lease: LeaseGrant, LeaseRevoke, LeaseKeepAlive
- Cluster: MemberAdd, MemberRemove, MemberList
- Maintenance: Alarm, Status, Defragment, Snapshot
8. Summary
etcd's architecture ensures strong consistency through the Raft consensus algorithm while achieving high performance through BoltDB's efficient storage and MVCC's multi-version management. In the next post, we will dive deeper into the BoltDB and MVCC storage engine internals.