- Author: Youngju Kim (@fjvbn20031)
etcd Architecture Internals: Raft Consensus Algorithm
etcd is a strongly consistent distributed key-value store that plays a critical role in distributed systems. All Kubernetes cluster state is stored in etcd, and its reliability is built on the Raft consensus algorithm.
1. Raft Consensus Algorithm Overview
Raft is a consensus algorithm designed for understandability, providing safety equivalent to Paxos while being significantly simpler to implement.
1.1 Core Concepts
Raft decomposes consensus into three sub-problems:
- Leader Election: Electing a single leader in the cluster
- Log Replication: Leader replicates log entries to followers
- Safety: Ensures committed entries are never lost
1.2 Node States
Each node in a Raft cluster is in one of three states:
- Leader: Handles client requests and replicates logs
- Follower: Passive role, responds to leader requests
- Candidate: Intermediate state participating in leader election
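The three states and their legal transitions can be sketched as a small state machine. This is an illustrative model, not etcd's actual identifiers (etcd's raft package uses names like `StateFollower`):

```go
package main

import "fmt"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// legalTransitions captures Raft's allowed state changes: a Follower
// times out and becomes a Candidate; a Candidate wins an election and
// becomes Leader, loses (or sees a valid leader) and reverts to
// Follower, or restarts the election; a Leader steps down to Follower
// when it observes a higher term.
var legalTransitions = map[State][]State{
	Follower:  {Candidate},
	Candidate: {Leader, Follower, Candidate},
	Leader:    {Follower},
}

func canTransition(from, to State) bool {
	for _, s := range legalTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Follower, Candidate)) // true
	fmt.Println(canTransition(Follower, Leader))    // false: must pass through Candidate
}
```

Note that a node can never jump from Follower directly to Leader; every leadership change goes through an election.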
1.3 Terms
Raft divides time into logical units called Terms. Each Term begins with an election, and if successful, one leader exists for that Term. Terms are monotonically increasing, and messages from older Terms are ignored.
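The term rules above boil down to a short comparison on every incoming message. A minimal sketch (field names are illustrative):

```go
package main

import "fmt"

// node holds the minimal per-node term state.
type node struct {
	term     int
	isLeader bool
}

// handleTerm applies Raft's term rules to an incoming message:
// messages from older terms are dropped; a newer term forces the node
// to adopt it and step down to follower. Returns false if the message
// should be ignored.
func (n *node) handleTerm(msgTerm int) bool {
	if msgTerm < n.term {
		return false // stale message from a previous term: ignore
	}
	if msgTerm > n.term {
		n.term = msgTerm
		n.isLeader = false // any leader seeing a higher term steps down
	}
	return true
}

func main() {
	n := &node{term: 5, isLeader: true}
	fmt.Println(n.handleTerm(3)) // false: term 3 < 5, message ignored
	fmt.Println(n.handleTerm(7)) // true: node adopts term 7 and steps down
	fmt.Println(n.term, n.isLeader)
}
```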
2. Leader Election Details
2.1 Election Process
- A Follower transitions to Candidate after not receiving heartbeats for an election timeout period
- The Candidate increments the current Term, votes for itself, and sends RequestVote RPCs to other nodes
- If it receives majority votes, it becomes leader
- Upon becoming leader, it immediately sends empty heartbeats to establish authority
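The voter's side of RequestVote can be sketched as follows. Beyond "one vote per term", Raft's safety argument also requires the voter to refuse candidates whose log is behind its own; the names here are illustrative, not etcd's:

```go
package main

import "fmt"

type voter struct {
	term        int
	votedFor    string // "" means no vote cast in this term
	lastLogTerm int
	lastLogIdx  int
}

// grantVote implements the follower side of RequestVote (simplified):
// at most one vote per term, and only for a candidate whose log is at
// least as up-to-date as the voter's own.
func (v *voter) grantVote(candID string, candTerm, candLastTerm, candLastIdx int) bool {
	if candTerm < v.term {
		return false // stale candidate
	}
	if candTerm > v.term {
		v.term, v.votedFor = candTerm, "" // a new term resets the vote
	}
	if v.votedFor != "" && v.votedFor != candID {
		return false // already voted for someone else this term
	}
	upToDate := candLastTerm > v.lastLogTerm ||
		(candLastTerm == v.lastLogTerm && candLastIdx >= v.lastLogIdx)
	if !upToDate {
		return false // candidate's log is behind: refuse the vote
	}
	v.votedFor = candID
	return true
}

func main() {
	v := &voter{term: 2, lastLogTerm: 2, lastLogIdx: 10}
	fmt.Println(v.grantVote("n2", 3, 2, 10)) // true: new term, log up to date
	fmt.Println(v.grantVote("n3", 3, 2, 12)) // false: already voted for n2 in term 3
}
```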
2.2 Election Timeout
Election timeouts are randomized so that multiple nodes rarely become Candidates at the same time. In etcd, the base election timeout defaults to 1000ms (--election-timeout), and each node randomizes its effective timeout to a value between 1x and 2x the base.
2.3 Pre-Vote Mechanism
etcd implements Pre-Vote to prevent unnecessary leader changes after network partition recovery. Before becoming a Candidate, a node first requests Pre-Votes, and only begins the actual election if a majority agrees.
```go
// Simplified raft state transition in etcd
// Follower -> PreCandidate -> Candidate -> Leader
func (r *raft) tickElection() {
	r.electionElapsed++
	if r.electionElapsed >= r.randomizedElectionTimeout {
		r.electionElapsed = 0
		r.Step(pb.Message{Type: pb.MsgHup})
	}
}
```
2.4 Learner Nodes
Since etcd 3.4, Learner nodes are supported. Learners are non-voting members that only receive log replication. Keeping new members as Learners until they catch up with existing logs avoids impacting cluster availability.
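The reason learners are safe to add is that they never count toward quorum. A minimal sketch of that rule (types are illustrative):

```go
package main

import "fmt"

type member struct {
	id        string
	isLearner bool
}

// quorum returns the number of acknowledgements needed to commit,
// counting only voting members: learners replicate the log but do not
// count toward majorities, so adding one never changes the quorum size.
func quorum(members []member) int {
	voters := 0
	for _, m := range members {
		if !m.isLearner {
			voters++
		}
	}
	return voters/2 + 1
}

func main() {
	cluster := []member{{"a", false}, {"b", false}, {"c", false}}
	fmt.Println(quorum(cluster)) // 2: majority of 3 voters

	// A new member joins as a learner: quorum is unchanged until promotion.
	cluster = append(cluster, member{"d", true})
	fmt.Println(quorum(cluster)) // still 2
}
```

If the new member joined as a voter immediately, a 3-node cluster would become a 4-node cluster needing 3 acknowledgements while the newcomer is still far behind, hurting availability.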
3. Log Replication
3.1 Log Entry Structure
Each log entry contains:
- Index: Position in the log (monotonically increasing)
- Term: The Term when the entry was created
- Data: Command to apply to the state machine (e.g., Put key=value)
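The three fields above map directly onto a struct. This is an illustrative shape; etcd's raft actually uses a protobuf-generated Entry type:

```go
package main

import "fmt"

// Entry mirrors the three fields described above.
type Entry struct {
	Index uint64 // position in the log, monotonically increasing
	Term  uint64 // term in which the leader created the entry
	Data  []byte // opaque command, e.g. an encoded Put request
}

func main() {
	log := []Entry{
		{Index: 1, Term: 1, Data: []byte("PUT key=a val=1")},
		{Index: 2, Term: 1, Data: []byte("PUT key=b val=2")},
	}
	fmt.Println(len(log), log[1].Index, log[1].Term)
}
```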
3.2 Replication Process
- Client sends request to leader
- Leader appends log entry to its log
- Leader replicates to followers via AppendEntries RPC
- Entry is committed once a majority of the cluster (the leader plus enough followers) has stored it
- Committed entry is applied to the state machine
- Response returned to client
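The commit step above can be sketched as a computation over the leader's view of each node's replication progress (often called matchIndex). A sketch, not etcd's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex computes the highest log index stored on a majority of
// nodes, given each node's match index (the leader's own last index is
// included in the slice). After sorting descending, the value at
// position quorum-1 is guaranteed to be on at least quorum nodes.
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// 5-node cluster: leader at index 9, followers at 9, 7, 4, 4.
	// Index 7 is stored on 3 of 5 nodes, so it is committed.
	fmt.Println(commitIndex([]uint64{9, 9, 7, 4, 4})) // 7
}
```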
3.3 Log Consistency Guarantees
Raft ensures log consistency through two properties:
- Log Matching: Two entries with the same index and term store the same command
- Leader Completeness: Committed entries exist in all subsequent leaders' logs
```
Leader Log:  [1:1][2:1][3:2][4:2][5:3]
                                    ^-- committed up to here
Follower A:  [1:1][2:1][3:2][4:2][5:3]  (up to date)
Follower B:  [1:1][2:1][3:2]            (lagging, will catch up)
```
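The Log Matching property is enforced by a consistency check in AppendEntries: the follower accepts new entries only if its log already matches the leader's at the point just before them. A sketch with 1-based indices as in the Raft paper:

```go
package main

import "fmt"

type entry struct{ term uint64 }

// acceptAppend performs the AppendEntries consistency check: the
// follower accepts new entries only if its log contains an entry at
// prevLogIndex whose term equals prevLogTerm. Index 0 (before the
// first entry) always matches.
func acceptAppend(log []entry, prevLogIndex, prevLogTerm uint64) bool {
	if prevLogIndex == 0 {
		return true
	}
	if prevLogIndex > uint64(len(log)) {
		return false // follower's log is too short
	}
	return log[prevLogIndex-1].term == prevLogTerm
}

func main() {
	// A lagging follower holding [1:1][2:1][3:2]:
	followerB := []entry{{1}, {1}, {2}}
	fmt.Println(acceptAppend(followerB, 3, 2)) // true: entry 3 has term 2
	fmt.Println(acceptAppend(followerB, 4, 2)) // false: leader must retry from index 3
}
```

On rejection, the leader decrements its notion of the follower's progress and retries, so the follower eventually catches up exactly where its log diverged.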
4. etcd Server Architecture
4.1 EtcdServer Structure
EtcdServer is the core struct of etcd, coordinating the following components:
- RaftNode: Handles Raft consensus logic
- Backend: BoltDB-based storage
- KV Store: MVCC key-value store
- Lessor: Lease management
- Watchable Store: Watch mechanism
4.2 Request Processing Flow
How a client Put request is processed:
- gRPC server receives the request
- Authentication and authorization check
- Request proposed to Raft
- Committed after Raft consensus
- Apply loop applies committed entry to state machine
- Persisted to backend (BoltDB)
- Response returned to client
4.3 Apply Loop
The apply loop is etcd's core loop, applying Raft-committed entries to the state machine in order:
// Simplified apply loop structure
func (s *EtcdServer) applyAll(ep *etcdProgress, apply *apply) {
s.applySnapshot(ep, apply)
s.applyEntries(ep, apply)
// Update index after application completes
s.setAppliedIndex(apply.snapshot.Metadata.Index)
}
4.4 Linearizable Read
etcd serves Linearizable Reads by default. For a read request, the leader first confirms with a quorum that it is still the current leader (the ReadIndex mechanism, backed by a round of heartbeats), which prevents a deposed leader from serving stale reads.
With the Serializable Read option, a member serves the read from its local state without leader confirmation, offering better performance but potentially returning stale data.
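The ReadIndex rule can be reduced to two conditions: leadership is quorum-confirmed, and the serving member has applied at least up to the commit index recorded when the read arrived. A sketch under those assumptions:

```go
package main

import "fmt"

// readIndexReady sketches the ReadIndex rule: after the leader records
// readIndex = commitIndex and re-confirms leadership with a quorum, a
// member may serve the read once its applied index reaches readIndex.
func readIndexReady(appliedIndex, readIndex uint64, quorumConfirmed bool) bool {
	return quorumConfirmed && appliedIndex >= readIndex
}

func main() {
	readIndex := uint64(42) // commit index at the moment the read arrived
	fmt.Println(readIndexReady(40, readIndex, true))  // false: must apply up to 42 first
	fmt.Println(readIndexReady(42, readIndex, true))  // true: safe to serve the read
	fmt.Println(readIndexReady(42, readIndex, false)) // false: leadership not confirmed
}
```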
5. Key-Value Store: BoltDB and MVCC
5.1 BoltDB (bbolt) Backend
etcd uses bbolt, a fork of BoltDB maintained under the etcd project, as its backend store. bbolt is a B+ tree-based embedded key-value database featuring:
- Read-optimized B+ tree structure
- Page-based storage (typically 4KB, matching the OS page size)
- ACID transaction support
- Single-writer, multiple-reader model
5.2 MVCC (Multi-Version Concurrency Control)
etcd uses MVCC to maintain all versions of keys:
- Revision: Global monotonically increasing counter; the main revision increments once per write transaction, with a sub-revision ordering changes within a transaction
- Key Index: Mapping from key name to revision list
- Generation: Lifecycle of a key from creation to deletion
```
Key: "/app/config"
Generation 1: [create:rev5] [modify:rev8] [modify:rev12] [tombstone:rev15]
Generation 2: [create:rev20] [modify:rev25]
                                  ^-- current
```
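The generation structure above can be modeled with a small sketch of a key index. This is a simplification of etcd's mvcc keyIndex; names are illustrative:

```go
package main

import "fmt"

// generation models one create-to-tombstone lifecycle of a key;
// revs holds the revisions at which the key was written.
type generation struct {
	revs      []int64
	tombstone bool // true once the key was deleted in this generation
}

type keyIndex struct {
	key  string
	gens []generation
}

// latestRev returns the newest revision of the key and whether the key
// currently exists (its last generation has no tombstone yet).
func (ki *keyIndex) latestRev() (rev int64, exists bool) {
	if len(ki.gens) == 0 {
		return 0, false
	}
	g := ki.gens[len(ki.gens)-1]
	if len(g.revs) == 0 {
		return 0, false
	}
	return g.revs[len(g.revs)-1], !g.tombstone
}

func main() {
	ki := &keyIndex{
		key: "/app/config",
		gens: []generation{
			{revs: []int64{5, 8, 12, 15}, tombstone: true}, // deleted at rev 15
			{revs: []int64{20, 25}},                        // recreated at rev 20
		},
	}
	rev, ok := ki.latestRev()
	fmt.Println(rev, ok) // 25 true
}
```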
5.3 BoltDB Internal Bucket Structure
etcd maintains two main buckets in BoltDB:
- key bucket: Stores actual key-value data with revision as key
- meta bucket: Stores metadata (consistent index, scheduled compact revision, etc.)
6. Watch Mechanism
6.1 Watchable Store
etcd's Watch streams changes to keys or key ranges in real-time:
- synced watchers: Caught up to current revision, receiving new events immediately
- unsynced watchers: Not yet caught up, replaying events from history
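The synced/unsynced split can be sketched as a dispatch rule: only synced watchers receive new events directly, while unsynced watchers pick them up during history replay. A simplification of etcd's watchable store, with illustrative types:

```go
package main

import "fmt"

type event struct {
	key string
	rev int64
}

type watcher struct {
	key      string
	startRev int64
	synced   bool
}

// dispatch delivers an event to every synced watcher whose key matches
// and whose start revision is not ahead of the event. Unsynced watchers
// are skipped here: they are still replaying older events from the
// MVCC history and will encounter this event during catch-up.
func dispatch(ws []watcher, ev event) (delivered int) {
	for _, w := range ws {
		if w.synced && w.key == ev.key && ev.rev >= w.startRev {
			delivered++
		}
	}
	return delivered
}

func main() {
	ws := []watcher{
		{key: "/app/config", startRev: 1, synced: true},
		{key: "/app/config", startRev: 1, synced: false}, // still catching up
		{key: "/other", startRev: 1, synced: true},
	}
	fmt.Println(dispatch(ws, event{key: "/app/config", rev: 30})) // 1
}
```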
6.2 Event Generation
When a write operation occurs in the MVCC store:
- Events are generated when the transaction commits
- The watchable store delivers events to relevant watchers
- Events are sent to clients via gRPC streams
6.3 Watch Recovery
When a client reconnects after disconnection, it can resume the Watch from the last received revision. If that revision has already been compacted, an error is returned and the client must reload all data.
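The resume decision reduces to one comparison against the compacted revision. A sketch (the error name is illustrative; etcd's actual error is its own type):

```go
package main

import (
	"errors"
	"fmt"
)

var errCompacted = errors.New("required revision has been compacted")

// resumeWatch checks whether a watch can resume from the client's last
// seen revision: if that revision predates the compacted revision, the
// intervening history is gone and the client must re-read the full
// current state before watching again.
func resumeWatch(lastSeenRev, compactedRev int64) error {
	if lastSeenRev < compactedRev {
		return errCompacted
	}
	return nil
}

func main() {
	fmt.Println(resumeWatch(120, 100)) // nil: resume from revision 120
	fmt.Println(resumeWatch(80, 100))  // compacted: reload, then re-watch
}
```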
7. Network and gRPC Communication
7.1 Peer Communication
Communication between etcd cluster members uses two channels:
- Stream: Long-lived connections for Raft messages (heartbeats, append entries, etc.)
- Pipeline: Large data transfers (snapshots, etc.)
7.2 Client Communication
Clients communicate with etcd via gRPC. Key services:
- KV: Put, Range, DeleteRange, Txn
- Watch: Watch
- Lease: LeaseGrant, LeaseRevoke, LeaseKeepAlive
- Cluster: MemberAdd, MemberRemove, MemberList
- Maintenance: Alarm, Status, Defragment, Snapshot
8. Summary
etcd's architecture ensures strong consistency through the Raft consensus algorithm while achieving high performance through BoltDB's efficient storage and MVCC's multi-version management. In the next post, we will dive deeper into the BoltDB and MVCC storage engine internals.