✍️ Transcription Mode: Distributed Consensus Deep Dive — Paxos, Raft, ZAB, FLP, etcd, ZooKeeper, KRaft, BFT, CRDT (2025)
English"The problem with distributed systems is that the parts of the system you don't think about are the parts that will fail." — Leslie Lamport
Why does Kubernetes depend on etcd? Why did Kafka replace ZooKeeper with KRaft? Why does the PostgreSQL HA solution Patroni require a "DCS" (Distributed Configuration Store)? The answer is the same in every case: distributed consensus.
Getting several nodes to "agree" on a single decision is trivial locally, but as soon as a network enters the picture it becomes a domain of mathematically proven impossibility. This post is a map that starts from the shock of the FLP result, walks through the elegance of Paxos and Raft, and reaches the 2025 consensus landscape (KRaft, BFT, CRDTs) in one sweep.
1. Problem Definition — Why "Agreement" Is Hard
What Consensus Must Guarantee
- Agreement — every correct node reaches the same value
- Validity — the decided value must be one that some node proposed
- Termination — eventually a decision is made
The Hostile Reality of Distributed Environments
- Network delay: you don't know when a message will arrive
- Message loss: messages can vanish
- Node crash: no response
- Network partition: only some nodes can talk to each other
- Asynchronous setting: no global clock
The FLP Impossibility Result (1985)
The shocking proof by Fischer, Lynch, and Paterson:
In an asynchronous distributed system, if even a single node can fail-stop, no deterministic consensus algorithm exists.
That is, an algorithm that perfectly satisfies "Agreement + Validity + Termination" cannot exist mathematically.
So What Do We Do?
- Probabilistic termination: terminates with probability 1, but with no deterministic guarantee
- Partial synchrony assumption: "eventually message delays will be bounded"
- Timeout-based failure detector
- Liveness vs safety trade-off — safety always, liveness only in good conditions
Both Paxos and Raft go in this direction.
2. Paxos — Lamport's Monument
Arrival
Leslie Lamport submitted "The Part-Time Parliament" in 1990, framed as a satire about the parliament of the ancient Greek island of Paxos. The reviewers didn't get the joke, the paper was rejected, and it was only published in 1998.
In 2001 he re-explained it in "Paxos Made Simple," but... it is still hard. Even Lamport himself:
"I have never actually implemented Multi-Paxos."
Basic Paxos Roles
- Proposer — proposes values
- Acceptor — accepts/rejects proposals
- Learner — learns the decided value
The 2-Phase Protocol
Phase 1 (Prepare)
- A proposer picks a number `n` and broadcasts `Prepare(n)`
- An acceptor promises if it has never seen a number larger than `n`, and returns any value it previously accepted
Phase 2 (Accept)
- Once a majority responds, the proposer broadcasts `Accept(n, v)` — `v` is the previously accepted value if one exists, otherwise its own
- An acceptor accepts if it hasn't since promised a higher `n`
- On majority acceptance, the value is decided (a sketch of the acceptor side follows)
Why It's Hard
- With multiple proposers racing concurrently, progress can stall (the liveness issue)
- Multi-Paxos — efficient via leader election, but the engineering details are hell
- Scarce practical implementation material — a huge gap between pseudocode and a real implementation
Real-World Uses
- Google Chubby (a Paxos-based lock service, the grandfather of ZooKeeper)
- Google Spanner (evolved Paxos)
- CockroachDB (Multi-Raft, but influenced by it)
3. Raft — "Understandable Consensus"
Motivation
In 2014, Diego Ongaro and John Ousterhout (Stanford) published "In Search of an Understandable Consensus Algorithm." The goal was exactly one thing: more understandable than Paxos.
Key Decomposition
Raft breaks consensus into three parts:
- Leader Election — pick a leader
- Log Replication — the leader replicates the log
- Safety — consistency even when a leader dies
States
Each node is in one of three states:
- Follower — passively receives leader messages
- Candidate — in an election
- Leader — the leader
Leader Election
- If a follower receives no heartbeat from the leader within `electionTimeout` (randomized, 150-300 ms), it transitions to Candidate
- It increments its term, votes for itself, and broadcasts `RequestVote`
- Majority of votes → Leader; on a split vote, retry after a fresh timeout
- Thanks to the randomized timeout, split votes naturally resolve
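A toy sketch of the follower-to-candidate transition (illustrative only; production implementations such as `go.etcd.io/raft` handle far more edge cases):

```python
import random

def election_timeout_ms():
    # Randomized per node (the paper suggests 150-300 ms): staggered
    # timeouts make it likely exactly one candidate starts each election,
    # so split votes resolve on their own.
    return random.uniform(150, 300)

class RaftNode:
    def __init__(self, node_id):
        self.id = node_id
        self.term = 0
        self.state = "follower"
        self.voted_for = None

    def on_heartbeat_timeout(self):
        # No AppendEntries heartbeat arrived within electionTimeout.
        self.state = "candidate"
        self.term += 1              # start a new term
        self.voted_for = self.id    # vote for itself, then send RequestVote
```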
Log Replication
- Client → request to leader
- The leader appends to its own log and broadcasts `AppendEntries` to followers
- Once a majority of followers confirm, the entry is "committed"
- The leader applies to the state machine and replies to the client
- Followers also receive the commit index and apply
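The commit rule fits in a few lines. A sketch of how a leader might advance its commit index, assuming `match_index` is a hypothetical dict mapping each follower to the highest log index it has replicated:

```python
def advance_commit_index(match_index, log, current_term, commit_index):
    """Return the highest log index replicated on a majority.

    Only entries from the leader's own term are committed directly;
    earlier-term entries ride along with them (the rule that prevents
    the Raft paper's Figure 8 anomaly, discussed below).
    """
    cluster_size = len(match_index) + 1  # followers + the leader itself
    for n in range(len(log) - 1, commit_index, -1):  # newest first
        acks = 1 + sum(1 for m in match_index.values() if m >= n)
        if acks > cluster_size // 2 and log[n]["term"] == current_term:
            return n
    return commit_index
```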
Safety — Log Matching Property
- If two logs have an entry with the same term at the same index, all preceding entries are identical
- At election time, "only the node with the most up-to-date log can become leader" (Election Restriction)
- Entries from earlier terms are committed together with an entry of the leader's current term (preventing the Figure 8 scenario)
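The election restriction boils down to a single comparison. A sketch of the check a voter applies before granting its vote, following the paper's definition of "up-to-date":

```python
def candidate_log_is_up_to_date(cand_last_term, cand_last_index,
                                my_last_term, my_last_index):
    # Grant a vote only if the candidate's log ends in a later term,
    # or the same term with at least as many entries. This guarantees
    # the winner already holds every committed entry.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```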
Real-World Raft Implementations
- etcd (CoreOS → CNCF) — the heart of the Kubernetes control plane
- Consul (HashiCorp)
- TiKV (Multi-Raft)
- Kafka KRaft (ZooKeeper replacement in 2022)
- CockroachDB (Multi-Raft)
Joint Consensus — The Elegance of Membership Changes
When adding/removing nodes, you pass through a Joint Consensus stage in which the two configurations coexist, enabling zero-downtime changes. Mistakes here are a shortcut to split-brain, so the paper emphasizes this heavily.
4. ZooKeeper and ZAB
ZooKeeper's History
- Developed by Yahoo in 2007, the distributed coordinator of the Hadoop ecosystem
- An Apache top-level project
- Countless systems depend on it: Kafka, Hadoop, HBase, Solr, etc.
ZAB (ZooKeeper Atomic Broadcast)
Similar to Paxos, but:
- FIFO order guarantee — preserves the request order of a given client
- Primary-backup model — the leader mediates all writes
- Epoch-based — a generation number analogous to a term
The Znode Model
- Hierarchical paths (`/app/config/foo`)
- Ephemeral — auto-deleted when the session ends (used for leader election)
- Sequential — auto-incremented numbers (for lock implementations)
- Watch — change notifications (pub/sub-like)
ZooKeeper Use Cases
- Leader Election (ephemeral + sequential)
- Distributed locks (sequential znode contention)
- Configuration store
- Service discovery
- Kafka broker metadata (pre-KRaft)
Limitations
- JVM-based — on GC pause, session expiration cascades into failures
- Hard to scale writes — single leader bottleneck
- Operational complexity — odd number of servers, quorum maintenance
- Session model pitfalls — a beginner's debugging hell
5. KRaft — Why Kafka Dropped ZooKeeper
Motivation
- Kafka stored broker/topic/partition/ACL metadata in ZooKeeper
- At scale, ZooKeeper became the bottleneck (registering/unregistering tens of thousands of partitions)
- The burden of running two distributed systems
- Duplicate auth/security configuration
KRaft Arrives
- KIP-500 (proposed in 2019)
- Production Ready in Kafka 3.3 in 2022
- ZooKeeper deprecated in 3.5 in 2023
- Fully removed in 4.0 in 2025
Internals
- A special metadata topic, `__cluster_metadata`
- Controller nodes run Raft to agree on this topic
- Brokers follow this topic as a replica (zero-downtime)
- Leader transitions, topic creation, etc. are expressed as appends to this log
Benefits
- 10x or more faster startup
- Millions of partitions become feasible
- Simplified operations
- Kafka-only deployment
Migration
ZooKeeper → KRaft migration happened en masse at large enterprises in 2024-2025. "Every Kafka cluster eventually converges to KRaft."
6. etcd — The Heart of Kubernetes
Why Kubernetes Uses etcd
All state of the K8s API Server (Pods, Services, ConfigMaps, Secrets) is stored in etcd. Leader election and distributed locks are etcd-based as well.
Raft Implementation
- Raft implemented in Go (`go.etcd.io/raft`)
- Distributed key-value store
- Version-based snapshots (`revision`)
- A watch feature for change streams
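A quick taste of the KV + watch model. This sketch assumes the community python-etcd3 client; the client choice and exact call shapes are assumptions, so check the library docs before relying on them:

```python
import etcd3  # community Python client (assumed); not part of etcd itself

client = etcd3.client(host="127.0.0.1", port=2379)
client.put("/app/config/feature_flag", "on")

# A watch is a server-pushed change stream — the same primitive the
# Kubernetes API server builds its own watches on.
events, cancel = client.watch("/app/config/feature_flag")
for event in events:
    print(event.key, event.value)  # PutEvent or DeleteEvent
    break
cancel()
```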
Performance Characteristics
- Write latency: 10-30ms (replication + fsync)
- Writes per second: ~10K with a single leader
- Capacity: 2GB default quota, ~8GB recommended maximum (millions of keys)
- etcd is not a high-performance DB — it is a highly available KV store
Operational Tips
- SSDs are mandatory — fsync latency sets the floor on write latency
- Dedicated network, low latency
- Defrag periodically (separate from compaction)
- Automate backups — `etcdctl snapshot save`
- Full Kubernetes cluster recovery rides on this
etcd vs ZooKeeper vs Consul
| Aspect | etcd | ZooKeeper | Consul |
|---|---|---|---|
| Language | Go | Java | Go |
| Consensus | Raft | ZAB | Raft |
| Data model | flat KV | hierarchical tree | KV + service discovery |
| Watch | stream | one-shot | stream |
| Health checks | external | ephemeral | built-in |
| Main usage | Kubernetes | Kafka (legacy), Hadoop | Service mesh, legacy HA |
The 2025 mainstream: new projects use etcd or Consul, Kafka moves to KRaft.
7. Byzantine Fault Tolerance (BFT)
The Byzantine Generals Problem
Lamport's classic problem (1982): generals besieging an enemy position try to agree on attack/retreat via messengers, but traitor generals may send false messages. How can the majority reach a correct decision?
Are Paxos/Raft BFT?
No. They only assume Crash Fault Tolerance (CFT):
- Nodes may halt and disappear, but do not lie
- Messages may be lost but are not tampered with
If malicious nodes (hacking, bugs) exist, Paxos/Raft break.
PBFT (Practical Byzantine Fault Tolerance)
- Proposed by Castro and Liskov in 1999
- Requires `n >= 3f + 1` nodes (where `f` is the number of Byzantine nodes)
- To tolerate 1 malicious node you need 4 machines; 7 for 2 (the arithmetic is spelled out below)
- A 3-phase protocol: Pre-prepare → Prepare → Commit
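The quorum arithmetic in a trivial sketch:

```python
def pbft_min_cluster(f):
    # PBFT needs n >= 3f + 1 replicas to tolerate f Byzantine ones.
    return 3 * f + 1

def pbft_quorum(n):
    # Each phase waits for 2f + 1 matching messages: any two such quorums
    # overlap in at least f + 1 replicas, at least one of which is honest.
    f = (n - 1) // 3
    return 2 * f + 1

assert pbft_min_cluster(1) == 4  # tolerate 1 traitor -> 4 machines
assert pbft_min_cluster(2) == 7  # tolerate 2 traitors -> 7 machines
assert pbft_quorum(4) == 3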
HotStuff (2018)
- The basis for Facebook's Libra/Diem (whose lineage continues in Aptos and Sui)
- Gives BFT a chain-like simplicity
- Linear message complexity
Blockchain and BFT
- Tendermint (Cosmos) — BFT-based
- Algorand, Avalanche, Aptos, Sui — all BFT variants
- Bitcoin/Ethereum PoW/PoS are probabilistic consensus (a different category)
Do Enterprises Need BFT?
- Typical internal systems: Raft/Paxos is enough
- Infrastructure shared across organizations (inter-bank, inter-government): consider BFT
- In 2025, BFT is a practical target in enterprise blockchains like Hyperledger Fabric and Corda
8. CRDT — Convergence Magic Without Consensus
Motivation
Consensus is expensive. It needs network round-trips + majority agreement. Offline, it doesn't even work. And yet...
Collaborative documents (Google Docs), messaging (WhatsApp), shopping carts all "eventually converge to the same state" without consensus. How?
CRDT (Conflict-free Replicated Data Type)
- Guarantees Strong Eventual Consistency
- Any merge order produces the same result
- Commutativity + Associativity + Idempotency
Two Flavors
State-based (CvRDT): share full state; merge(a, b) is a semilattice (join) operation
- G-Counter (increment-only), PN-Counter, OR-Set, LWW-Register
Operation-based (CmRDT): share operations, messages delivered over ordered/reliable channels
- More efficient but with strict prerequisites
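A state-based G-Counter is small enough to show whole. Its merge is a pointwise max — commutative, associative, and idempotent — so replicas converge no matter how states are exchanged. A minimal sketch:

```python
class GCounter:
    """State-based (CvRDT) grow-only counter."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {node_id: 0}  # per-replica increment counts

    def increment(self):
        self.counts[self.node_id] += 1

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # The semilattice join: pointwise max over all replicas.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas diverge, then converge regardless of merge order.
a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```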
Famous CRDT Implementations
- Automerge — collaboration on JSON-like data
- Yjs — document collaboration (VS Code Live Share, a candidate reviewed by Notion)
- Riak (2010s) — CRDT-based distributed DB
- Redis Enterprise — CRDT-based Active-Active geo-replication
- Figma — its own CRDT variant
Post-2024 — Local-First
- The Local-First Software movement
- Offline-first + CRDT synchronization
- Works without servers
- Jamsocket, Automerge, and Yjs lead the way
9. Weak Consistency Models
The Consistency Spectrum
Strong → Weak: Linearizable → Sequential → Causal → Eventual
Linearizable (Strong Consistency)
- All operations share a single global order
- Once written, every reader sees it
- Provided by Raft/Paxos-based systems
- Cost: higher latency, lower availability
Causal Consistency
- Preserves only cause-and-effect relationships (concurrent ops are free)
- Uses Vector Clocks / Version Vectors
- COPS, Eiger papers
- Sufficient for social-media timelines
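Causality checks reduce to comparing vector clocks componentwise. A sketch, with clocks as plain dicts mapping replica id to counter:

```python
def happened_before(a, b):
    # a -> b iff a <= b on every component and a != b.
    keys = set(a) | set(b)
    return a != b and all(a.get(k, 0) <= b.get(k, 0) for k in keys)

def concurrent(a, b):
    # Neither happened before the other: the operations commute freely.
    return a != b and not happened_before(a, b) and not happened_before(b, a)

assert happened_before({"x": 1}, {"x": 2, "y": 1})
assert concurrent({"x": 2}, {"y": 1})
```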
Eventual Consistency
- "Eventually" converges
- Default in DynamoDB and Cassandra
- Higher performance, lower consistency
- The tragedy: many teams still use AP systems as if they were CA
Misunderstandings of the CAP Theorem
- Brewer (2000) — "under a network partition, sacrifice Consistency OR Availability"
- Common misreading: "pick any two of C/A/P"
- Reality: partitions (P) are unavoidable, so you choose C vs A
- Modern extension: PACELC — even without partitions, you trade Latency vs Consistency
10. Distributed Locks — An Application of Consensus
The ZooKeeper Approach
- Create a sequential + ephemeral znode under `/locks/resource/`
- Watch the znode with the next-smaller sequence number than yours
- When that znode is deleted, it's your turn (see the kazoo sketch below)
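kazoo, the de-facto Python ZooKeeper client, ships this exact recipe as `Lock`. A minimal sketch (connection details are placeholders):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Lock implements the dance above: an ephemeral+sequential znode per
# contender, each watching only its immediate predecessor (no herd effect).
lock = zk.Lock("/locks/resource", identifier="worker-1")
with lock:  # blocks until our znode has the lowest sequence number
    pass    # critical section; the znode vanishes if our session dies

zk.stop()
```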
The etcd Approach
```python
lease = client.grant(10)          # 10-second TTL
client.put("/lock", "me", lease)  # acquire the lock on success
```
- Atomic acquisition via compare-and-swap (sketched below)
- TTL prevents zombie locks
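The `put` above is shorthand; a real acquisition should be a single atomic step. A sketch of the compare-and-swap as a transaction, assuming the community python-etcd3 client (library choice and exact API are assumptions):

```python
import etcd3  # assumed client library

client = etcd3.client()
lease = client.lease(10)  # 10-second TTL keeps the lock from going zombie

# Compare-and-swap: succeed only if /lock has never been created
# (create revision == 0); otherwise the transaction is a no-op.
acquired, _ = client.transaction(
    compare=[client.transactions.create("/lock") == 0],
    success=[client.transactions.put("/lock", "me", lease)],
    failure=[],
)
```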
Redlock (Redis) — Controversial
As covered in the earlier Redis post:
- Fine if for performance
- If correctness matters, use ZooKeeper/etcd
Fencing Tokens
- On lock acquisition, issue a monotonically increasing token
- When writing to the resource, check the token → reject old tokens
- Prevents zombie writes after GC pauses / network delay recoveries
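A sketch of the resource-side check — note it is the storage service, not the lock service, that enforces the token; all names are illustrative:

```python
class StaleToken(Exception):
    pass

class FencedResource:
    """Storage that rejects writes carrying an outdated fencing token."""

    def __init__(self):
        self.highest_token = -1
        self.data = None

    def write(self, token, value):
        # A client that held the lock, stalled through a long GC pause,
        # and lost its lease will arrive with an old token: reject it.
        if token <= self.highest_token:
            raise StaleToken(f"token {token} <= {self.highest_token}")
        self.highest_token = token
        self.data = value
```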
11. Leader Election in Practice — Pitfall Guide
Split Vote
In Raft, if multiple Candidates stand for election in the same term simultaneously, no one gets a majority
- Solution: randomized timeout (150-300ms)
- Ideally only one candidate stands per election round; otherwise voting rounds are wasted
Leader Flapping
Frequent leader changes due to network instability:
- Connection resets → commit delays
- Symptoms: a flood of `leader changed` logs
- Remedies: stabilize the network, tune the heartbeat interval, enable PreVote (a trial election held before incrementing the term)
Network Partition (Split-Brain)
- The side without a majority rejects reads/writes (if well-designed)
- Badly designed: both sides have a leader → data forks
- Recovery: forcibly discard one side, re-replicate from the other
Clock Drift
- Raft makes only weak timing assumptions (local timeouts; no synchronized clocks required)
- Still, NTP is a baseline
- Spanner provides global linearizability with TrueTime (GPS + atomic clocks)
12. The 2025 Consensus Landscape
Mainstream
- Raft — the de-facto default for new systems
- Multi-Raft — a Raft group per shard (CockroachDB, TiKV, YugabyteDB)
- Paxos/EPaxos — Google internals, some DBs
- ZAB — ZooKeeper, hardly any new adoption
Emerging
- HotStuff — Meta (Libra) origin, spreading as enterprise BFT
- Narwhal and Bullshark (Sui/Aptos) — DAG-based designs separating data dissemination from ordering
- Paxos-like Quorum Read/Write — the real read path in etcd and Spanner
Frontier
- Fast Paxos / Mencius — a single round trip on the fast path
- Generalized Paxos — consensus based on command dependencies
- Rollup consensus — sequencer consensus in Ethereum L2s
13. Top 10 Distributed-Consensus Anti-patterns
- "Roll your own Paxos" — guaranteed failure
- Fork a Raft library and patch on top — you'll miss upstream fixes
- An even node count (4 instead of 3 or 5) — quorum becomes 3, so you still tolerate only one failure
- Consensus over the WAN — latency explosions, regional splits
- Leaving ZooKeeper session timeout at default — long GCs lead to cascades
- Ignoring etcd's 8GB limit — one day, a sudden outage
- Raft in an environment that needs Byzantine tolerance — one malicious node ruins everything
- Heavy writes to a consensus system — KV only, values over 1MB forbidden
- Home-grown distributed locks — 99% buggy
- Using Eventual Consistency as if it were Strong — data forks
14. Using Distributed Consensus Wisely — Checklist
- Do not implement it yourself — use proven etcd/ZK/Consul
- Odd node count (3, 5, 7)
- Low-latency network or the same region
- SSD + a dedicated disk for the consensus log
- Backup & recovery automation —
etcdctl snapshot - Monitoring — leader changes, commit latency, disk fsync
- Respect size limits — etcd 8GB, ZK heap
- Drill blackout/partition scenarios (see Chaos Engineering)
- Document your consistency needs — strong vs eventual
- Adopt fencing tokens (when using distributed locks)
- Plan KRaft migration (if you use Kafka)
- Consider CRDTs — do you really need consensus?
Closing — The Paradox of Consensus
Every interesting problem in distributed systems ultimately reduces to consensus. Kubernetes agreeing on pod state, Kafka electing partition leaders, blockchains ordering transactions, collaborative docs merging edits — all of it.
As the FLP theorem says, "perfect consensus" is mathematically impossible. Every algorithm we use is a pragmatic compromise between safety and liveness. Paxos is elegant but hard, Raft is understandable but full of details, BFT is powerful but expensive, and CRDTs are an alternative that avoids consensus.
"The only way to make distributed systems simple is to understand that they're fundamentally hard, and stop pretending otherwise." — Kyle Kingsbury (Jepsen)
Next Up — Modern CI/CD Pipelines, Fully Disassembled
If consensus is the heart of distributed systems, CI/CD is the circulatory system of modern development. In the next post:
- The history of CI/CD — Hudson → Jenkins, Travis, CircleCI, GitHub Actions
- Pipeline as Code — the philosophy of `.github/workflows`
- Build caching — remote caches in Turborepo, Nx, Bazel
- Container-based builds — BuildKit, Nixpacks, Bun
- Test parallelization — sharding, defending against flaky tests
- Artifact management — OCI Registry, GHCR, Artifactory
- Secrets management — Vault, AWS Secrets Manager, OIDC
- Deployment strategies — Blue/Green, Canary (connected to the previous post)
- GitOps — Argo CD, Flux, pull-based deployments
- Supply chain security — SLSA, Sigstore, SBOM, provenance
- Dev loop speed — the "commit to deploy in 5 minutes" philosophy
A journey to complete the last piece of the developer productivity puzzle.
"The impossibility of consensus is not a bug in distributed systems — it's the fundamental feature that makes them interesting." — Leslie Lamport (Turing Award lecture, 2013)