Raft Consensus Complete Guide 2025: Leader Election, Log Replication, Safety, etcd/Consul in Practice

Introduction: Why Raft?

The Fundamental Problem of Distributed Consensus

Five servers must each answer "what is the current balance?" Networks partition, servers crash, some respond slowly, and requests reach only a subset of servers. Yet all servers must agree on the same state. This is distributed consensus.

Paxos, and Its Pain

Leslie Lamport devised Paxos around 1989, though the paper ("The Part-Time Parliament") was not published until 1998. Mathematically rigorous, but notoriously hard to understand. Google's Chubby paper famously noted:

"There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system."

Everyone implementing Paxos invented their own variant, none fully verified.

Enter Raft

Stanford's Diego Ongaro and John Ousterhout published Raft in 2014 with one goal:

"Understandability first."

Raft offers the same safety as Paxos but decomposes the problem into three independent sub-problems:

  1. Leader Election
  2. Log Replication
  3. Safety

This decomposition made Raft ubiquitous: etcd (Kubernetes backend), Consul, CockroachDB, TiDB, YugabyteDB, MongoDB (internally), RedPanda, and Kafka KRaft. If you use Kubernetes, you already depend on Raft daily.


1. The Raft Model

Node States

Each node is in exactly one state:

Follower  ──timeout──▶  Candidate  ──majority──▶  Leader
   ▲                        │                        │
   └────higher term / leader discovered──────────────┘
  • Follower: Passive. Replicates from Leader.
  • Candidate: Running an election.
  • Leader: Sole write entry point. Appends and replicates.

Normal operation: 1 Leader + N-1 Followers.

Term: Logical Time

A Term is a monotonically increasing integer. Each Term starts with an election, then normal operation. Higher-Term messages force stale Leaders to step down.

Two RPCs

Raft uses only two RPCs:

  1. RequestVote: Candidate requests votes.
  2. AppendEntries: Leader replicates logs + heartbeat.

2. Leader Election

Election Timeout

Each Follower has a random election timeout (typically 150-300ms). On timeout:

  1. Transition to Candidate.
  2. Increment currentTerm.
  3. Vote for self.
  4. Send RequestVote to all peers.
  5. Reset election timer.

Vote Granting Rules

A voter grants a vote iff:

  1. Candidate's term >= currentTerm.
  2. Has not voted (or voted for same candidate) this Term.
  3. Candidate's log is at least as up-to-date as voter's.

Up-to-date check: the log with the higher last-entry Term wins; if tied, the longer log wins. This rule enforces Leader Completeness.
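The three granting rules plus the up-to-date check can be sketched as a vote handler. This is a minimal sketch; the dict field names (current_term, voted_for, last_log_term, last_log_index) are illustrative, not taken from any particular implementation:

```python
def handle_request_vote(state, req):
    """Decide whether this voter grants its vote.
    `state` holds the voter's persistent fields; `req` carries the
    candidate's RequestVote arguments."""
    # Rule 1: reject candidates from an older Term.
    if req["term"] < state["current_term"]:
        return False
    # A higher Term forces us to adopt it and forget any old vote.
    if req["term"] > state["current_term"]:
        state["current_term"] = req["term"]
        state["voted_for"] = None
    # Rule 2: at most one vote per Term.
    if state["voted_for"] not in (None, req["candidate_id"]):
        return False
    # Rule 3: higher last-entry Term wins; on a tie, the longer log wins.
    # Python's lexicographic tuple comparison encodes exactly this.
    candidate = (req["last_log_term"], req["last_log_index"])
    mine = (state["last_log_term"], state["last_log_index"])
    if candidate < mine:
        return False
    state["voted_for"] = req["candidate_id"]
    return True
```

Note that a real node must persist current_term and voted_for to disk before replying, or a restart could let it vote twice in the same Term.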

Three Outcomes

  1. Majority vote → become Leader, send heartbeats immediately.
  2. Discover another Leader (same or higher Term) → revert to Follower.
  3. Split vote / timeout → increment Term, try again.

Preventing Split Votes

Randomized timeouts in [150ms, 300ms] ensure one Candidate usually times out first. Split votes are statistically rare and resolve in the next Term.
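The randomization is nothing more than a fresh uniform draw on every timer reset. A tiny sketch, using the paper's suggested [150ms, 300ms] window:

```python
import random

ELECTION_TIMEOUT_MIN = 0.150  # seconds
ELECTION_TIMEOUT_MAX = 0.300

def next_election_timeout():
    # A fresh random draw on each reset de-synchronizes the Followers,
    # so one of them usually times out well before the others and wins
    # the election before a second Candidate even starts.
    return random.uniform(ELECTION_TIMEOUT_MIN, ELECTION_TIMEOUT_MAX)
```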

Pseudocode

class RaftNode:
    def start_election(self):
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.id          # must be persisted before sending RPCs
        self.votes_received = {self.id}
        self.reset_election_timer()
        for peer in self.peers:
            self.send_request_vote(peer)

    def handle_vote_response(self, response):
        if response.term > self.current_term:
            self.become_follower(response.term)
            return
        # Ignore stale responses from an earlier election.
        if self.state != "candidate" or response.term < self.current_term:
            return
        if response.vote_granted:
            self.votes_received.add(response.voter_id)
            # peers excludes self, so the cluster size is len(peers) + 1;
            # a strict majority of that is what we need.
            if len(self.votes_received) > (len(self.peers) + 1) // 2:
                self.become_leader()

    def become_leader(self):
        self.state = "leader"
        self.reset_heartbeat_timer()
        self.send_heartbeats()            # assert leadership immediately

3. Log Replication

Log Entry Structure

index:    1     2     3     4     5     6
term:     1     1     1     2     3     3
cmd:    [x=3] [y=1] [x=2] [z=0] [x=9] [y=5]
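The entry shape above maps directly onto a small record type. A sketch (the command encoding is illustrative):

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class LogEntry:
    index: int   # 1-based position in the log
    term: int    # Term of the Leader that created the entry
    cmd: Any     # opaque state-machine command, e.g. ("set", "x", 3)

# The first four entries of the example log above:
log = [
    LogEntry(1, 1, ("set", "x", 3)),
    LogEntry(2, 1, ("set", "y", 1)),
    LogEntry(3, 1, ("set", "x", 2)),
    LogEntry(4, 2, ("set", "z", 0)),
]
```

Storing the Term in every entry is what makes the consistency checks below possible: index alone cannot distinguish entries written by different Leaders.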

Replication Flow

  1. Client request → Leader appends locally (uncommitted).
  2. Leader sends AppendEntries to all Followers in parallel.
  3. On majority ack → commit.
  4. Leader applies to state machine, responds to client.
  5. Next AppendEntries propagates commitIndex so Followers apply too.

AppendEntries RPC

AppendEntries(
    term,
    leaderId,
    prevLogIndex,
    prevLogTerm,
    entries[],
    leaderCommit
)

Follower returns success=true only if the entry at prevLogIndex has prevLogTerm.

Log Matching Property

  1. If two logs have an entry at the same index/term, commands are identical.
  2. If two logs have an entry at the same index/term, all preceding entries are identical.

Enforced by the prevLogIndex/prevLogTerm consistency check.
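The Follower side of that check can be sketched as below. The log is a plain list of (index, term, cmd) tuples where list position i holds entry i+1; for simplicity the sketch truncates at prevLogIndex, which assumes in-order RPC delivery (the paper's rule truncates only from the first conflicting entry, which matters for reordered RPCs):

```python
def handle_append_entries(log, prev_log_index, prev_log_term, entries):
    """Follower-side consistency check and append.
    Returns (success, new_log)."""
    # Consistency check: we must hold the Leader's previous entry.
    if prev_log_index > 0:
        if len(log) < prev_log_index:
            return False, log   # hole in our log: Leader will back up nextIndex
        if log[prev_log_index - 1][1] != prev_log_term:
            return False, log   # Term mismatch at prevLogIndex
    # Drop any conflicting suffix, then adopt the Leader's entries.
    new_log = log[:prev_log_index] + list(entries)
    return True, new_log
```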

Resolving Log Inconsistencies

Leader maintains nextIndex[peer]. On AppendEntries failure, decrement and retry until a match is found; then overwrite the Follower's suffix with the Leader's log.

def send_append_entries(self, follower_id):
    # `self.log` is 1-indexed pseudocode (self.log[1] is the first entry);
    # a real Python list would need an offset of one.
    next_idx = self.next_index[follower_id]
    prev_idx = next_idx - 1
    prev_term = self.log[prev_idx].term if prev_idx > 0 else 0
    entries = self.log[next_idx:]
    response = self.rpc(follower_id, "AppendEntries", {
        "term": self.current_term,
        "prev_log_index": prev_idx,
        "prev_log_term": prev_term,
        "entries": entries,
        "leader_commit": self.commit_index,
    })
    if response.success:
        self.next_index[follower_id] = next_idx + len(entries)
        self.match_index[follower_id] = self.next_index[follower_id] - 1
        self.maybe_advance_commit()
    else:
        if response.term > self.current_term:
            self.become_follower(response.term)
        else:
            # Back up one entry and retry; never below 1.
            self.next_index[follower_id] = max(1, next_idx - 1)

4. Safety Properties

1. Election Safety

At most one Leader per Term. Each node votes at most once per Term; two majorities cannot both exist.

2. Leader Append-Only

Leaders never overwrite or delete entries — only append.

3. Log Matching

Same-index/same-term entries imply identical prefixes.

4. Leader Completeness

Any committed entry is present in all future Leaders. Enforced by the up-to-date check in RequestVote: committed entries live on a majority, and any winning Candidate must have a log at least as up-to-date as that majority.

5. State Machine Safety

If one node applies an entry at index i, no other node applies a different entry at i. Follows from 1-4.

The Subtle Bug: Committing Previous-Term Entries

Raft paper's Figure 8: an entry from Term 2 reaches a majority of logs, yet before it is committed, a Candidate whose log ends in a higher Term can still win an election and overwrite it.

Rule: A Leader counts replicas only for entries from its current Term. Once a current-Term entry commits, earlier entries commit transitively via Log Matching.

def maybe_advance_commit(self):
    # `self.log` is 1-indexed pseudocode: valid indices are 1..len(log).
    for n in range(self.commit_index + 1, len(self.log) + 1):
        # Figure 8 rule: count replicas only for current-Term entries;
        # earlier-Term entries commit transitively once this one does.
        if self.log[n].term != self.current_term:
            continue
        count = 1  # the Leader itself
        for peer in self.peers:
            if self.match_index[peer] >= n:
                count += 1
        if count > (len(self.peers) + 1) // 2:
            self.commit_index = n

5. Cluster Membership Change

The Danger of Naive Changes

Switching from 3 to 5 nodes atomically is impossible — some nodes use the old config, some the new. Majorities in C_old and C_new may not overlap → two Leaders possible.

Solution 1: Joint Consensus

Leader logs C_old,new (union). Decisions during transition require majorities in both C_old AND C_new. Once committed, Leader logs C_new.

Solution 2: Single-Server Changes (etcd)

Add/remove one node at a time. Majorities always overlap (e.g., 3-node majority=2 and 4-node majority=3 share at least one node). etcd uses this.
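The overlap guarantee is small enough to check exhaustively for a single-node addition. A throwaway sanity check, not part of any Raft implementation:

```python
from itertools import combinations

def majorities(nodes):
    """All node subsets that form a strict majority of `nodes`."""
    need = len(nodes) // 2 + 1
    return [set(c)
            for k in range(need, len(nodes) + 1)
            for c in combinations(sorted(nodes), k)]

old = {"a", "b", "c"}   # 3-node config, majority = 2
new = old | {"d"}       # 4-node config, majority = 3

# Every old-config majority intersects every new-config majority,
# so two disjoint Leaders can never be elected during the change.
assert all(m_old & m_new
           for m_old in majorities(old)
           for m_new in majorities(new))
```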


6. Log Compaction (Snapshot)

Logs cannot grow forever. Each node independently snapshots:

Snapshot {
    last_included_index: 5,
    last_included_term:  2,
    state_machine_state: {...}
}

Entries before last_included_index are discarded.
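Compaction itself is just dropping the covered prefix while remembering where it ended. A sketch over a 1-indexed log of (index, term, cmd) tuples; function and field names are illustrative:

```python
def compact(log, snapshot_state, last_included_index):
    """Discard entries covered by the snapshot.
    `log` holds contiguous (index, term, cmd) tuples; `snapshot_state`
    is the state-machine image as of last_included_index."""
    first_index = log[0][0] if log else 1
    keep_from = last_included_index - first_index + 1
    last_included_term = log[keep_from - 1][1]
    snapshot = {
        "last_included_index": last_included_index,
        "last_included_term": last_included_term,   # needed for the
        "state_machine_state": snapshot_state,      # prevLogTerm check
    }
    return snapshot, log[keep_from:]
```

Keeping last_included_term in the snapshot is essential: the AppendEntries consistency check may name an index that now lives only in the snapshot.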

InstallSnapshot RPC

When a Follower falls too far behind, the Leader sends a snapshot:

InstallSnapshot(
    term,
    leaderId,
    lastIncludedIndex,
    lastIncludedTerm,
    offset,
    data[],
    done
)

etcd snapshots every 100,000 applied entries by default (--snapshot-count; pre-v3.2 releases defaulted to 10,000).


7. Client Interaction

Linearizable Reads

A stale Leader can exist (partitioned from peers). Options:

  1. Read through Log: safe but slow.
  2. ReadIndex: record commitIndex, heartbeat to confirm leadership, then respond. etcd default.
  3. Lease Read: Leader holds a time-bounded lease. Fast; needs clock sync.
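The ReadIndex path can be sketched as below. This is illustrative pseudocode, not etcd's implementation (which lives in go.etcd.io/raft); `node` is assumed to expose the fields and helpers used here:

```python
def read_index(node):
    """Serve a linearizable read without appending a log entry."""
    read_idx = node.commit_index           # 1. record the current commitIndex
    acks = 1                               # the Leader counts itself
    for peer in node.peers:                # 2. one round of heartbeats proves
        if node.send_heartbeat(peer):      #    we are still the Leader
            acks += 1
    if acks <= (len(node.peers) + 1) // 2:
        raise RuntimeError("lost leadership; client should retry")
    while node.last_applied < read_idx:    # 3. wait for the state machine to
        node.apply_next_entry()            #    catch up to read_idx
    return node.state_machine_read()       # 4. now safe to answer
```

The heartbeat round replaces a full log append, which is why ReadIndex is much cheaper than reading through the log while remaining linearizable.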

Exactly-Once Writes

Clients tag requests with (client_id, sequence_no). Leader stores processed pairs in the state machine and returns cached results for duplicates.
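A minimal session table for that scheme might look as follows. A sketch assuming each client has at most one outstanding request; crucially, the dedup state lives inside the replicated state machine, so every replica rebuilds it identically from the log:

```python
class DedupStateMachine:
    """A toy KV state machine with per-client duplicate suppression."""

    def __init__(self):
        self.kv = {}
        self.sessions = {}   # client_id -> (last_seq, cached_result)

    def apply(self, client_id, seq, command):
        last = self.sessions.get(client_id)
        if last is not None and last[0] == seq:
            # Retried request: return the cached result, apply nothing.
            return last[1]
        key, value = command               # command is ("key", value)
        self.kv[key] = value
        self.sessions[client_id] = (seq, value)
        return value
```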


8. Real Systems

etcd

  • Written in Go on the go.etcd.io/raft library. Stores all Kubernetes state.
  • MVCC KV store, Watch API, Leases. Ports 2379/2380.
  • 3-node: ~10,000 writes/s (fsync-bound), ~40,000 linearizable reads/s.
  • Recommended size: 3 or 5.

Consul

HashiCorp's Consul: Raft-based consistent KV, multi-DC (each DC is its own Raft cluster), DNS service discovery.

CockroachDB / TiDB / YugabyteDB

NewSQL databases split data into ranges/regions, each with its own Raft group (Multi-Raft):

  • CockroachDB: 512MB ranges by default (earlier releases used 64MB).
  • TiKV: 96MB regions.
  • Optimizations: message batching, Follower Replication, region merge/split.

MongoDB & Kafka KRaft

MongoDB replica sets use a "Raft-like" protocol. Kafka 2.8+ KRaft replaces ZooKeeper with a Raft-based controller quorum.


9. Performance Optimizations

  1. Pipeline AppendEntries — don't wait for ack.
  2. Batching — amortize fsync.
  3. Parallel Disk Write — append and send concurrently.
  4. PreVote — avoid spurious Term increments after partition recovery.
  5. Follower Read — for stale reads, serve from Follower.
  6. Witness Replicas — vote-only replicas (CockroachDB).

10. Operational Issues

Election Storm

Symptoms: Leader churn, throughput near zero. Causes: network latency > election timeout, GC pauses, disk I/O stalls. Fix: raise election-timeout (etcd default 1000ms → 5000ms), lower heartbeat-interval, enable PreVote.

Unbounded Log Growth

Verify --snapshot-count and monitor Follower lag; restart stuck Followers.

Quorum Loss

3-node cluster loses 2 nodes → no majority. Recovery: force-restart the remaining node as a single-node cluster (etcdctl snapshot restore), then add nodes one by one. Prevention: odd node counts, spread across AZs.

Slow Disk

Raft fsyncs every write:

  • HDD: ~10ms → 100 ops/s
  • SSD: ~0.1ms → 10,000 ops/s
  • NVMe: ~0.01ms → 100,000 ops/s

Never run Raft on HDD.

Network Partitions

Majority-based quorum prevents split-brain. Minority side rejects reads/writes — Raft is a CP system.


11. Raft vs Paxos vs Zab

Aspect           Raft                       Multi-Paxos       Zab
Difficulty       Easy                       Hard              Medium
Leader-based     Yes                        Optional          Yes
Election         Timeout + RequestVote      Proposal numbers  FLE
Log consistency  Enforced identical         Loose             Enforced
Used by          etcd, Consul, CockroachDB  Spanner, Chubby   ZooKeeper

Why Raft Won

  1. Pseudocode and implementation guide in the paper.
  2. Clear state model (Follower/Candidate/Leader).
  3. Many reference implementations.
  4. Teachable in university courses.

Google internal systems still favor Paxos; open source is overwhelmingly Raft.


12. Implementing Raft Yourself

Learning Resources

  1. Original paper: "In Search of an Understandable Consensus Algorithm".
  2. raft.github.io — interactive visualization.
  3. MIT 6.824 Lab 2 — implement Raft in Go (2A: Election, 2B: Replication, 2C: Persistence, 2D: Snapshot).

Common Student Mistakes

  1. Heartbeats missing piggy-backed log data.
  2. Forgetting Term checks on RPCs.
  3. Misusing commitIndex for previous-Term entries.
  4. Missing persistence (currentTerm, votedFor, log must be fsync'd).
  5. Wrong timer-reset placement.

Quiz Review

Q1. What design choice makes Raft more understandable than Paxos?

A. Decomposition into Leader Election, Log Replication, and Safety as independent sub-problems; a small state space (3 states); and a strong-Leader model that makes log flow unidirectional.

Q2. Why are Raft clusters sized at 3, 5, 7 (odd)?

A. 2N+1 nodes tolerate N failures. 4 nodes tolerate only 1 failure — same as 3 but with more cost. Even counts are strictly worse.
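The arithmetic behind that answer, as a throwaway check:

```python
def failures_tolerated(n):
    # A cluster of n nodes needs a majority of n // 2 + 1 alive,
    # so it tolerates n - (n // 2 + 1) failures.
    return n - (n // 2 + 1)

# 4 nodes tolerate no more failures than 3; 6 no more than 5.
print([failures_tolerated(n) for n in (3, 4, 5, 6, 7)])  # → [1, 1, 2, 2, 3]
```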

Q3. Why does RequestVote need the up-to-date log check?

A. To guarantee Leader Completeness. Since committed entries live on a majority, requiring winners to have a log at least as up-to-date as a majority member ensures they carry all committed entries.

Q4. Why are previous-Term entries not committed purely by replica count?

A. Figure 8 shows such an entry can still be overwritten. Raft commits a previous-Term entry only transitively, once a current-Term entry commits.

Q5. Trade-offs of increasing etcd election timeout?

A. Pro: fewer spurious elections from GC/network blips. Con: longer write-unavailability after a true Leader death. Default 1000ms suits on-prem; cloud and high-load systems often need 3000-5000ms.


Closing Thoughts

Raft is not just an algorithm — it is a research philosophy: "If an existing algorithm is hard, invent one that is understandable."

Key Ideas

  1. Leader-based consensus.
  2. Term-based logical time.
  3. Election Safety via majority voting.
  4. Log Matching via consistency checks.
  5. Leader Completeness via up-to-date checks.

You already rely on Raft daily via kubectl apply → etcd, Consul KV, CockroachDB, and Kafka KRaft. Understanding Raft makes their behavior, failures, and performance transparent.

Next Steps

EPaxos, Flexible Paxos, HotStuff, Byzantine Raft — but first, implement Raft by hand.

