Raft Consensus Complete Guide 2025: Leader Election, Log Replication, Safety, etcd/Consul in Practice
Author: Youngju Kim (@fjvbn20031)
Introduction: Why Raft?
The Fundamental Problem of Distributed Consensus
Imagine five servers, each answering "what is the current balance?". Networks partition, servers crash, some respond slowly, and requests reach only some servers — yet all servers must agree on the same state. This is the distributed consensus problem.
Paxos, and Its Pain
Leslie Lamport first described Paxos in 1989 (the paper, "The Part-Time Parliament," only appeared in 1998). Mathematically rigorous, but notoriously hard to understand. Google's Chubby paper famously noted:
"There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system."
Everyone implementing Paxos invented their own variant, none fully verified.
Enter Raft
Stanford's Diego Ongaro and John Ousterhout published Raft in 2014 with one goal:
"Understandability first."
Raft offers the same safety as Paxos but decomposes the problem into three independent sub-problems:
- Leader Election
- Log Replication
- Safety
This decomposition made Raft ubiquitous: etcd (Kubernetes backend), Consul, CockroachDB, TiDB, YugabyteDB, MongoDB (internally), RedPanda, and Kafka KRaft. If you use Kubernetes, you already depend on Raft daily.
1. The Raft Model
Node States
Each node is in exactly one state:
```
              timeout                 wins majority
  Follower ───────────▶ Candidate ─────────────────▶ Leader
      ▲                     │                           │
      │  leader discovered  │                           │
      ├──── higher term ────┘                           │
      │                                                 │
      └───────────────── higher term ───────────────────┘
```
- Follower: Passive. Replicates from Leader.
- Candidate: Running an election.
- Leader: Sole write entry point. Appends and replicates.
Normal operation: 1 Leader + N-1 Followers.
Term: Logical Time
A Term is a monotonically increasing integer. Each Term starts with an election, then normal operation. Higher-Term messages force stale Leaders to step down.
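The step-down rule can be sketched in a few lines (a minimal illustration with hypothetical names, not any real library's API): every RPC carries the sender's Term, and a receiver that sees a higher Term adopts it and reverts to Follower before doing anything else.

```python
class Node:
    def __init__(self):
        self.current_term = 0
        self.state = "follower"
        self.voted_for = None

    def observe_term(self, rpc_term):
        """Apply the universal step-down rule before processing any RPC."""
        if rpc_term > self.current_term:
            self.current_term = rpc_term
            self.state = "follower"
            self.voted_for = None  # votes are per-Term, so reset on a new Term
```

This single rule is what guarantees a stale Leader, isolated during a partition, cannot keep acting as Leader once it hears from the new Term.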
Two RPCs
Raft uses only two RPCs:
- RequestVote: Candidate requests votes.
- AppendEntries: Leader replicates logs + heartbeat.
2. Leader Election
Election Timeout
Each Follower has a random election timeout (typically 150-300ms). On timeout:
- Transition to Candidate.
- Increment currentTerm.
- Vote for self.
- Send RequestVote to all peers.
- Reset the election timer.
Vote Granting Rules
A voter grants a vote iff:
- Candidate's term >= currentTerm.
- The voter has not voted (or voted for the same candidate) this Term.
- Candidate's log is at least as up-to-date as voter's.
Up-to-date check: the log with the higher last-entry Term wins; if tied, the longer log wins. This rule enforces Leader Completeness.
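The up-to-date comparison is small enough to write out directly (a sketch with assumed parameter names, comparing the Candidate's last log entry against the voter's):

```python
def candidate_log_is_up_to_date(cand_last_term, cand_last_index,
                                my_last_term, my_last_index):
    """Raft's up-to-date rule: higher last-entry term wins;
    on a term tie, the longer log wins."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```

Note that a short log with a newer last Term beats a long log with an older one — length only breaks ties.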
Three Outcomes
- Majority vote → become Leader, send heartbeats immediately.
- Discover another Leader (same or higher Term) → revert to Follower.
- Split vote / timeout → increment Term, try again.
Preventing Split Votes
Randomized timeouts in [150ms, 300ms] make it likely that one Candidate times out first and wins before the others even start. Split votes remain possible but rare, and they resolve in the next Term.
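Drawing the timeout is trivial, but the key detail is that each node draws a fresh value every election cycle (sketch below; the 150-300ms range is the paper's suggestion, and real deployments scale it up):

```python
import random

def next_election_timeout(base_ms=150, spread_ms=150):
    """Fresh random timeout per election cycle, uniform in [base, base + spread]."""
    return base_ms + random.uniform(0, spread_ms)
```

Re-drawing each cycle matters: if a node kept one fixed random value, two nodes with near-identical values would collide in every Term instead of just once.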
Pseudocode
```python
class RaftNode:
    def start_election(self):
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.id
        self.votes_received = {self.id}
        self.reset_election_timer()
        for peer in self.peers:
            self.send_request_vote(peer)

    def handle_vote_response(self, response):
        if response.term > self.current_term:
            self.become_follower(response.term)
            return
        if self.state != "candidate":
            return  # stale response from an old election
        if response.vote_granted:
            self.votes_received.add(response.voter_id)
            # Cluster size is len(peers) + 1, since peers excludes self.
            if 2 * len(self.votes_received) > len(self.peers) + 1:
                self.become_leader()

    def become_leader(self):
        self.state = "leader"
        self.reset_heartbeat_timer()
        self.send_heartbeats()
```
3. Log Replication
Log Entry Structure
```
index:   1     2     3     4     5     6
term:    1     1     1     2     3     3
cmd:   [x=3] [y=1] [x=2] [z=0] [x=9] [y=5]
```
Replication Flow
- Client request → Leader appends locally (uncommitted).
- Leader sends AppendEntries to all Followers in parallel.
- On majority ack → commit.
- Leader applies to state machine, responds to client.
- The next AppendEntries propagates commitIndex so Followers apply too.
AppendEntries RPC
```
AppendEntries(
    term,          # Leader's Term
    leaderId,
    prevLogIndex,  # index of the entry immediately preceding the new ones
    prevLogTerm,   # Term of that entry
    entries[],     # entries to store (empty for heartbeat)
    leaderCommit   # Leader's commitIndex
)
```
Follower returns success=true only if the entry at prevLogIndex has prevLogTerm.
Log Matching Property
- If two logs have an entry at the same index/term, commands are identical.
- If two logs have an entry at the same index/term, all preceding entries are identical.
Enforced by the prevLogIndex/prevLogTerm consistency check.
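The Follower side of this check can be sketched as follows (assuming a 1-indexed log with a sentinel entry at index 0, which is a common implementation trick; names are illustrative):

```python
class Entry:
    def __init__(self, term, cmd=None):
        self.term, self.cmd = term, cmd

def handle_append_entries(log, prev_log_index, prev_log_term, entries):
    """Return (success, new_log). log[0] is a sentinel so indices
    match Raft's 1-based numbering."""
    # Reject if we have no entry at prev_log_index, or its term disagrees:
    # the Leader will back off nextIndex and retry.
    if prev_log_index >= len(log) or log[prev_log_index].term != prev_log_term:
        return False, log
    # Check passed: drop any conflicting suffix and append the Leader's entries.
    return True, log[:prev_log_index + 1] + entries
```

The truncation step is what makes divergent Follower suffixes converge back to the Leader's log.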
Resolving Log Inconsistencies
Leader maintains nextIndex[peer]. On AppendEntries failure, decrement and retry until a match is found; then overwrite the Follower's suffix with the Leader's log.
```python
def send_append_entries(self, follower_id):
    next_idx = self.next_index[follower_id]
    prev_idx = next_idx - 1
    # log is 1-indexed here; index 0 is a sentinel with term 0
    prev_term = self.log[prev_idx].term if prev_idx > 0 else 0
    entries = self.log[next_idx:]
    response = self.rpc(follower_id, "AppendEntries", {
        "term": self.current_term,
        "prev_log_index": prev_idx,
        "prev_log_term": prev_term,
        "entries": entries,
        "leader_commit": self.commit_index,
    })
    if response.success:
        self.next_index[follower_id] = next_idx + len(entries)
        self.match_index[follower_id] = self.next_index[follower_id] - 1
        self.maybe_advance_commit()
    else:
        if response.term > self.current_term:
            self.become_follower(response.term)
        else:
            # back off one entry and retry on the next send (never below 1)
            self.next_index[follower_id] = max(1, self.next_index[follower_id] - 1)
```
4. Safety Properties
1. Election Safety
At most one Leader per Term. Each node votes at most once per Term, and any two majorities must intersect, so two Candidates cannot both win the same Term.
2. Leader Append-Only
Leaders never overwrite or delete entries — only append.
3. Log Matching
Same-index/same-term entries imply identical prefixes.
4. Leader Completeness
Any committed entry is present in all future Leaders. Enforced by the up-to-date check in RequestVote: committed entries live on a majority, and any winning Candidate must have a log at least as up-to-date as that majority.
5. State Machine Safety
If one node applies an entry at index i, no other node applies a different entry at i. Follows from 1-4.
The Subtle Bug: Committing Previous-Term Entries
Raft paper's Figure 8: a Leader from Term 2 replicates to a majority but crashes before committing. A new Leader in Term 3 could overwrite it.
Rule: A Leader counts replicas only for entries from its current Term. Once a current-Term entry commits, earlier entries commit transitively via Log Matching.
```python
def maybe_advance_commit(self):
    # log is 1-indexed; log[0] is a sentinel, so valid indices are 1..len-1
    for n in range(self.commit_index + 1, len(self.log)):
        # Only current-Term entries may be committed by counting replicas
        if self.log[n].term != self.current_term:
            continue
        count = 1  # the Leader itself
        for peer in self.peers:
            if self.match_index[peer] >= n:
                count += 1
        # Cluster size is len(peers) + 1
        if 2 * count > len(self.peers) + 1:
            self.commit_index = n
```
5. Cluster Membership Change
The Danger of Naive Changes
Switching from 3 to 5 nodes atomically is impossible — some nodes use the old config, some the new. Majorities in C_old and C_new may not overlap → two Leaders possible.
Solution 1: Joint Consensus
Leader logs C_old,new (union). Decisions during transition require majorities in both C_old AND C_new. Once committed, Leader logs C_new.
Solution 2: Single-Server Changes (etcd)
Add/remove one node at a time. Majorities always overlap (e.g., 3-node majority=2 and 4-node majority=3 share at least one node). etcd uses this.
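The overlap argument can be checked exhaustively: a majority of n nodes and a majority of n+1 nodes together need more than n+1 slots, so they must share at least one node.

```python
def majority(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

# For every cluster size, an old-config majority and a new-config
# majority (one node added) cannot be disjoint:
for n in range(1, 50):
    assert majority(n) + majority(n + 1) > n + 1
```

This shared node is what prevents two Leaders from being elected during a single-server membership change — no joint consensus machinery required.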
6. Log Compaction (Snapshot)
Logs cannot grow forever. Each node independently snapshots:
```
Snapshot {
    last_included_index: 5,
    last_included_term: 2,
    state_machine_state: {...}
}
```
Entries before last_included_index are discarded.
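Compaction itself is a simple truncation — the subtlety is keeping the snapshot's index/term around so later prevLogIndex/prevLogTerm checks at the boundary still work. A sketch, representing the log as a list of (term, cmd) tuples with a sentinel at index 0:

```python
def compact(log, last_included_index, last_included_term):
    """Discard entries up to last_included_index. The new sentinel carries
    the snapshot's term so consistency checks at the boundary still pass."""
    return [(last_included_term, None)] + log[last_included_index + 1:]
```

After compaction, index i in the new list corresponds to absolute index last_included_index + i; a real implementation tracks that offset alongside the list.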
InstallSnapshot RPC
When a Follower falls too far behind, the Leader sends a snapshot:
```
InstallSnapshot(
    term,
    leaderId,
    lastIncludedIndex,
    lastIncludedTerm,
    offset,    # byte offset of this chunk in the snapshot
    data[],    # snapshot chunk
    done       # true if this is the final chunk
)
```
etcd snapshots every 10,000 entries by default (--snapshot-count).
7. Client Interaction
Linearizable Reads
A stale Leader can exist (partitioned from peers). Options:
- Read through log: append the read as a log entry — safe but slow.
- ReadIndex: record commitIndex, heartbeat to confirm leadership, respond once the state machine has applied up to it. etcd's default.
- Lease Read: Leader holds a time-bounded lease. Fast, but needs bounded clock drift.
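The ReadIndex flow is short enough to sketch end-to-end (the callable parameters are hypothetical stand-ins for the node's internals, not etcd's API):

```python
def read_index(commit_index, confirm_leadership, wait_applied, do_read):
    """ReadIndex: serve a linearizable read without appending a log entry."""
    read_idx = commit_index        # 1. record current commitIndex
    if not confirm_leadership():   # 2. heartbeat round-trip to a majority
        raise RuntimeError("not leader; client should retry elsewhere")
    wait_applied(read_idx)         # 3. wait until applied index >= read_idx
    return do_read()               # 4. now safe to answer from local state
```

Step 2 is the whole point: it proves no newer Leader has committed writes this node hasn't seen, which is exactly what a stale partitioned Leader cannot prove.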
Exactly-Once Writes
Clients tag requests with (client_id, sequence_no). Leader stores processed pairs in the state machine and returns cached results for duplicates.
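The de-duplication table lives inside the replicated state machine, so every replica makes the same decision. A sketch (names assumed):

```python
class Session:
    """Per-client request de-duplication inside the state machine."""
    def __init__(self):
        self.last_seq = {}      # client_id -> highest sequence_no applied
        self.last_result = {}   # client_id -> cached result for that request

    def apply(self, client_id, seq, command, execute):
        # A retry of an already-applied request returns the cached result
        # instead of executing the command a second time.
        if self.last_seq.get(client_id) == seq:
            return self.last_result[client_id]
        result = execute(command)
        self.last_seq[client_id] = seq
        self.last_result[client_id] = result
        return result
```

This is what turns Raft's at-least-once delivery (a client retries after a timeout) into exactly-once application.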
8. Real Systems
etcd
- Go library: go.etcd.io/raft. Stores all Kubernetes state.
- MVCC KV store, Watch API, Leases. Ports 2379 (client) / 2380 (peer).
- 3-node: ~10,000 writes/s (fsync-bound), ~40,000 linearizable reads/s.
- Recommended size: 3 or 5.
Consul
HashiCorp's Consul: Raft-based consistent KV, multi-DC (each DC is its own Raft cluster), DNS service discovery.
CockroachDB / TiDB / YugabyteDB
NewSQL databases split data into ranges/regions, each with its own Raft group (Multi-Raft):
- CockroachDB: 64MB ranges.
- TiKV: 96MB regions.
- Optimizations: message batching, Follower Replication, region merge/split.
MongoDB & Kafka KRaft
MongoDB replica sets use a "Raft-like" protocol. Kafka 2.8+ KRaft replaces ZooKeeper with a Raft-based controller quorum.
9. Performance Optimizations
- Pipeline AppendEntries — don't wait for ack.
- Batching — amortize fsync.
- Parallel Disk Write — append and send concurrently.
- PreVote — avoid spurious Term increments after partition recovery.
- Follower Read — for stale reads, serve from Follower.
- Witness Replicas — vote-only replicas (CockroachDB).
10. Operational Issues
Election Storm
Symptoms: Leader churn, throughput near zero. Causes: network latency exceeding the election timeout, GC pauses, disk I/O stalls. Fix: raise --election-timeout (etcd default 1000ms; 5000ms is common on flaky networks), keep --heartbeat-interval well below it, and enable PreVote.
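For illustration, the relevant etcd flags look roughly like this — treat the values as examples for a high-latency environment, not recommendations, and check your etcd version's documentation before copying:

```shell
# Example only: widen the election window for a high-latency environment.
# --heartbeat-interval : ms between Leader heartbeats (default 100)
# --election-timeout   : ms of silence before a Follower starts an election (default 1000)
# --snapshot-count     : committed entries between snapshots (default 10000)
etcd --name node1 \
     --heartbeat-interval=100 \
     --election-timeout=5000 \
     --snapshot-count=10000
```

Keep the heartbeat interval around a tenth of the election timeout, and use the same values on every member of the cluster.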
Unbounded Log Growth
Verify --snapshot-count and monitor Follower lag; restart stuck Followers.
Quorum Loss
3-node cluster loses 2 nodes → no majority. Recovery: force-restart the remaining node as a single-node cluster (etcdctl snapshot restore), then add nodes one by one. Prevention: odd node counts, spread across AZs.
Slow Disk
Raft fsyncs every write:
- HDD: ~10ms → 100 ops/s
- SSD: ~0.1ms → 10,000 ops/s
- NVMe: ~0.01ms → 100,000 ops/s
Never run Raft on HDD.
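The numbers above are just the reciprocal of the fsync latency: with one fsync per committed write and no batching, throughput is capped at 1/latency — which is exactly the bound that batching (Section 9) exists to beat.

```python
def max_ops_per_sec(fsync_latency_ms):
    """Upper bound on serialized, one-fsync-per-write throughput."""
    return 1000.0 / fsync_latency_ms
```

So a 10ms HDD fsync caps an unbatched Leader at about 100 writes/s regardless of CPU or network.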
Network Partitions
Majority-based quorum prevents split-brain. Minority side rejects reads/writes — Raft is a CP system.
11. Raft vs Paxos vs Zab
| Aspect | Raft | Multi-Paxos | Zab |
|---|---|---|---|
| Difficulty | Easy | Hard | Medium |
| Leader-based | Yes | Optional | Yes |
| Election | Timeout + RequestVote | Proposal numbers | FLE |
| Log consistency | Enforced identical | Loose | Enforced |
| Used by | etcd, Consul, CockroachDB | Spanner, Chubby | ZooKeeper |
Why Raft Won
- Pseudocode and implementation guide in the paper.
- Clear state model (Follower/Candidate/Leader).
- Many reference implementations.
- Teachable in university courses.
Google internal systems still favor Paxos; open source is overwhelmingly Raft.
12. Implementing Raft Yourself
Learning Resources
- Original paper: "In Search of an Understandable Consensus Algorithm".
- raft.github.io — interactive visualization.
- MIT 6.824 Lab 2 — implement Raft in Go (2A: Election, 2B: Replication, 2C: Persistence, 2D: Snapshot).
Common Student Mistakes
- Heartbeats missing piggy-backed log data.
- Forgetting Term checks on RPCs.
- Misusing commitIndex for previous-Term entries.
- Missing persistence (currentTerm, votedFor, log must be fsync'd).
- Wrong timer-reset placement.
Quiz Review
Q1. What design choice makes Raft more understandable than Paxos?
A. Decomposition into Leader Election, Log Replication, and Safety as independent sub-problems; a small state space (3 states); and a strong-Leader model that makes log flow unidirectional.
Q2. Why are Raft clusters sized at 3, 5, 7 (odd)?
A. 2N+1 nodes tolerate N failures. 4 nodes tolerate only 1 failure — the same as 3, but with a larger quorum (3 of 4) and more cost. An even count adds expense without adding fault tolerance.
Q3. Why does RequestVote need the up-to-date log check?
A. To guarantee Leader Completeness. Since committed entries live on a majority, requiring winners to have a log at least as up-to-date as a majority member ensures they carry all committed entries.
Q4. Why are previous-Term entries not committed purely by replica count?
A. Figure 8 shows such an entry can still be overwritten. Raft commits a previous-Term entry only transitively, once a current-Term entry commits.
Q5. Trade-offs of increasing etcd election timeout?
A. Pro: fewer spurious elections from GC/network blips. Con: longer write-unavailability after a true Leader death. Default 1000ms suits on-prem; cloud and high-load systems often need 3000-5000ms.
Closing Thoughts
Raft is not just an algorithm — it is a research philosophy: "If an existing algorithm is hard, invent one that is understandable."
Key Ideas
- Leader-based consensus.
- Term-based logical time.
- Election Safety via majority voting.
- Log Matching via consistency checks.
- Leader Completeness via up-to-date checks.
You already rely on Raft daily via kubectl apply → etcd, Consul KV, CockroachDB, and Kafka KRaft. Understanding Raft makes their behavior, failures, and performance transparent.
Next Steps
EPaxos, Flexible Paxos, HotStuff, Byzantine Raft — but first, implement Raft by hand.