✍️ Transcription Mode: Distributed Consensus Deep Dive — Paxos, Raft, ZAB, FLP, etcd, ZooKeeper, KRaft, BFT, CRDT (2025)
English"The problem with distributed systems is that the parts of the system you don't think about are the parts that will fail." — Leslie Lamport
Why does Kubernetes depend on etcd? Why did Kafka replace ZooKeeper with KRaft? Why does the PostgreSQL HA solution Patroni require a "DCS" (Distributed Configuration Store)? The answer is the same in every case: distributed consensus.
Getting several nodes to "agree" on a single decision is trivial locally, but as soon as a network enters the picture it becomes a domain of mathematically proven impossibility. This post is a map that starts from the shock of the FLP result, walks through the elegance of Paxos and Raft, and reaches the 2025 consensus landscape (KRaft, BFT, CRDTs) in one sweep.
1. Problem Definition — Why "Agreement" Is Hard
What Consensus Must Guarantee
- Agreement — every correct node reaches the same value
- Validity — the decided value must be one that some node proposed
- Termination — eventually a decision is made
The Hostile Reality of Distributed Environments
- Network delay: you don't know when a message will arrive
- Message loss: messages can vanish
- Node crash: no response
- Network partition: only some nodes can talk to each other
- Asynchronous setting: no global clock
The FLP Impossibility Result (1985)
The shocking proof by Fischer, Lynch, and Paterson:
In an asynchronous distributed system, if even a single node can fail-stop, no deterministic consensus algorithm exists.
That is, an algorithm that perfectly satisfies "Agreement + Validity + Termination" cannot exist mathematically.
So What Do We Do?
- Probabilistic termination: terminates with probability 1, but with no deterministic guarantee
- Partial synchrony assumption: "eventually message delays will be bounded"
- Timeout-based failure detector
- Liveness vs safety trade-off — safety always, liveness only in good conditions
Both Paxos and Raft go in this direction.
2. Paxos — Lamport's Monument
Arrival
Leslie Lamport submitted "The Part-Time Parliament" in 1990, framed as a satire about the parliament of the ancient Greek island of Paxos. The reviewers didn't get the joke, the paper was rejected, and it was only published in 1998.
In 2001 he re-explained it in "Paxos Made Simple," but... it is still hard. Even Lamport himself:
"I have never actually implemented Multi-Paxos."
Basic Paxos Roles
- Proposer — proposes values
- Acceptor — accepts/rejects proposals
- Learner — learns the decided value
The 2-Phase Protocol
Phase 1 (Prepare)
- A proposer picks a number `n` and broadcasts `Prepare(n)`
- An acceptor promises if it has never seen a number larger than `n`, and returns any value it previously accepted
Phase 2 (Accept)
- Once a majority responds, the proposer broadcasts `Accept(n, v)` — `v` is the previously accepted value if one exists, otherwise its own
- An acceptor accepts if it hasn't since promised a higher `n`
- On majority acceptance, the value is decided (a sketch of the acceptor side follows)
Why It's Hard
- With multiple proposers racing concurrently, progress can stall (the liveness issue)
- Multi-Paxos — efficient via leader election, but the engineering details are hell
- Scarce practical implementation material — a huge gap between pseudocode and a real implementation
Real-World Uses
- Google Chubby (a Paxos-based lock service, the grandfather of ZooKeeper)
- Google Spanner (evolved Paxos)
- CockroachDB (Multi-Raft, but influenced by it)
3. Raft — "Understandable Consensus"
Motivation
In 2014, Diego Ongaro and John Ousterhout (Stanford) published "In Search of an Understandable Consensus Algorithm." The goal was exactly one thing: more understandable than Paxos.
Key Decomposition
Raft breaks consensus into three parts:
- Leader Election — pick a leader
- Log Replication — the leader replicates the log
- Safety — consistency even when a leader dies
States
Each node is in one of three states:
- Follower — passively receives leader messages
- Candidate — in an election
- Leader — the leader
Leader Election
- If a follower receives no heartbeat from the leader within `electionTimeout` (randomized, 150-300 ms), it transitions to Candidate
- It increments its term, votes for itself, and broadcasts `RequestVote`
- Majority of votes → Leader; on a split vote, retry after a fresh timeout
- Thanks to the randomized timeout, split votes naturally resolve
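A toy sketch of the follower-to-candidate transition (illustrative only; production implementations such as `go.etcd.io/raft` handle far more edge cases):

```python
import random

def election_timeout_ms():
    # Randomized per node (the paper suggests 150-300 ms): staggered
    # timeouts make it likely exactly one candidate starts each election,
    # so split votes resolve on their own.
    return random.uniform(150, 300)

class RaftNode:
    def __init__(self, node_id):
        self.id = node_id
        self.term = 0
        self.state = "follower"
        self.voted_for = None

    def on_heartbeat_timeout(self):
        # No AppendEntries heartbeat arrived within electionTimeout.
        self.state = "candidate"
        self.term += 1              # start a new term
        self.voted_for = self.id    # vote for itself, then send RequestVote
```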
Log Replication
- Client → request to leader
- The leader appends to its own log and broadcasts `AppendEntries` to followers
- Once a majority of followers confirm, the entry is "committed"
- The leader applies to the state machine and replies to the client
- Followers also receive the commit index and apply
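The commit rule fits in a few lines. A sketch of how a leader might advance its commit index, assuming `match_index` is a hypothetical dict mapping each follower to the highest log index it has replicated:

```python
def advance_commit_index(match_index, log, current_term, commit_index):
    """Return the highest log index replicated on a majority.

    Only entries from the leader's own term are committed directly;
    earlier-term entries ride along with them (the rule that prevents
    the Raft paper's Figure 8 anomaly, discussed below).
    """
    cluster_size = len(match_index) + 1  # followers + the leader itself
    for n in range(len(log) - 1, commit_index, -1):  # newest first
        acks = 1 + sum(1 for m in match_index.values() if m >= n)
        if acks > cluster_size // 2 and log[n]["term"] == current_term:
            return n
    return commit_index
```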
Safety — Log Matching Property
- If two logs have an entry with the same term at the same index, all preceding entries are identical
- At election time, "only the node with the most up-to-date log can become leader" (Election Restriction)
- Entries from earlier terms are committed together with an entry of the leader's current term (preventing the Figure 8 scenario)
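The election restriction boils down to a single comparison. A sketch of the check a voter applies before granting its vote, following the paper's definition of "up-to-date":

```python
def candidate_log_is_up_to_date(cand_last_term, cand_last_index,
                                my_last_term, my_last_index):
    # Grant a vote only if the candidate's log ends in a later term,
    # or the same term with at least as many entries. This guarantees
    # the winner already holds every committed entry.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```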
Real-World Raft Implementations
- etcd (CoreOS → CNCF) — the heart of the Kubernetes control plane
- Consul (HashiCorp)
- TiKV (Multi-Raft)
- Kafka KRaft (ZooKeeper replacement in 2022)
- CockroachDB (Multi-Raft)
Joint Consensus — The Elegance of Membership Changes
When adding/removing nodes, you pass through a Joint Consensus stage in which the two configurations coexist, enabling zero-downtime changes. Mistakes here are a shortcut to split-brain, so the paper emphasizes this heavily.
4. ZooKeeper and ZAB
ZooKeeper's History
- Developed by Yahoo in 2007, the distributed coordinator of the Hadoop ecosystem
- An Apache top-level project
- Countless systems depend on it: Kafka, Hadoop, HBase, Solr, etc.
ZAB (ZooKeeper Atomic Broadcast)
Similar to Paxos, but:
- FIFO order guarantee — preserves the request order of a given client
- Primary-backup model — the leader mediates all writes
- Epoch-based — a generation number analogous to a term
The Znode Model
- Hierarchical paths (`/app/config/foo`)
- Ephemeral — auto-deleted when the session ends (used for leader election)
- Sequential — auto-incremented numbers (for lock implementations)
- Watch — change notifications (pub/sub-like)
ZooKeeper Use Cases
- Leader Election (ephemeral + sequential)
- Distributed locks (sequential znode contention)
- Configuration store
- Service discovery
- Kafka broker metadata (pre-KRaft)
Limitations
- JVM-based — on GC pause, session expiration cascades into failures
- Hard to scale writes — single leader bottleneck
- Operational complexity — odd number of servers, quorum maintenance
- Session model pitfalls — a beginner's debugging hell
5. KRaft — Why Kafka Dropped ZooKeeper
Motivation
- Kafka stored broker/topic/partition/ACL metadata in ZooKeeper
- At scale, ZooKeeper became the bottleneck (registering/unregistering tens of thousands of partitions)
- The burden of running two distributed systems
- Duplicate auth/security configuration
KRaft Arrives
- KIP-500 (proposed in 2019)
- Production Ready in Kafka 3.3 in 2022
- ZooKeeper deprecated in 3.5 in 2023
- Fully removed in 4.0 in 2025
Internals
- A special metadata topic, `__cluster_metadata`
- Controller nodes run Raft to agree on this topic
- Brokers follow this topic as a replica (zero-downtime)
- Leader transitions, topic creation, etc. are expressed as appends to this log
Benefits
- 10x or more faster startup
- Millions of partitions become feasible
- Simplified operations
- Kafka-only deployment
Migration
ZooKeeper → KRaft migration happened en masse at large enterprises in 2024-2025. "Every Kafka cluster eventually converges to KRaft."
6. etcd — The Heart of Kubernetes
Why Kubernetes Uses etcd
All state of the K8s API Server (Pods, Services, ConfigMaps, Secrets) is stored in etcd. Leader election and distributed locks are etcd-based as well.
Raft Implementation
- Raft implemented in Go (`go.etcd.io/raft`)
- Distributed key-value store
- Version-based snapshots (`revision`)
- A watch feature for change streams
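A quick taste of the KV + watch model. This sketch assumes the community python-etcd3 client; the client choice and exact call shapes are assumptions, so check the library docs before relying on them:

```python
import etcd3  # community Python client (assumed); not part of etcd itself

client = etcd3.client(host="127.0.0.1", port=2379)
client.put("/app/config/feature_flag", "on")

# A watch is a server-pushed change stream — the same primitive the
# Kubernetes API server builds its own watches on.
events, cancel = client.watch("/app/config/feature_flag")
for event in events:
    print(event.key, event.value)  # PutEvent or DeleteEvent
    break
cancel()
```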
Performance Characteristics
- Write latency: 10-30ms (replication + fsync)
- Writes per second: ~10K with a single leader
- Capacity: 2GB default quota, ~8GB recommended maximum (millions of keys)
- etcd is not a high-performance DB — it is a highly available KV store
Operational Tips
- SSDs are mandatory — fsync latency sets the floor on write latency
- Dedicated network, low latency
- Defrag periodically (separate from compaction)
- Automate backups — `etcdctl snapshot save`
- Full Kubernetes cluster recovery rides on this
etcd vs ZooKeeper vs Consul
| Aspect | etcd | ZooKeeper | Consul |
|---|---|---|---|
| Language | Go | Java | Go |
| Consensus | Raft | ZAB | Raft |
| Data model | flat KV | hierarchical tree | KV + service discovery |
| Watch | stream | one-shot | stream |
| Health checks | external | ephemeral | built-in |
| Main usage | Kubernetes | Kafka (legacy), Hadoop | Service mesh, legacy HA |
The 2025 mainstream: new projects use etcd or Consul, Kafka moves to KRaft.
7. Byzantine Fault Tolerance (BFT)
The Byzantine Generals Problem
Lamport's classic problem (1982): generals besieging an enemy position try to agree on attack/retreat via messengers, but traitor generals may send false messages. How can the majority reach a correct decision?
Are Paxos/Raft BFT?
No. They only assume Crash Fault Tolerance (CFT):
- Nodes may halt and disappear, but do not lie
- Messages may be lost but are not tampered with
If malicious nodes (hacking, bugs) exist, Paxos/Raft break.
PBFT (Practical Byzantine Fault Tolerance)
- Proposed by Castro and Liskov in 1999
- Requires `n >= 3f + 1` nodes (where `f` is the number of Byzantine nodes)
- To tolerate 1 malicious node you need 4 machines; 7 for 2 (the arithmetic is spelled out below)
- A 3-phase protocol: Pre-prepare → Prepare → Commit
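The quorum arithmetic in a trivial sketch:

```python
def pbft_min_cluster(f):
    # PBFT needs n >= 3f + 1 replicas to tolerate f Byzantine ones.
    return 3 * f + 1

def pbft_quorum(n):
    # Each phase waits for 2f + 1 matching messages: any two such quorums
    # overlap in at least f + 1 replicas, at least one of which is honest.
    f = (n - 1) // 3
    return 2 * f + 1

assert pbft_min_cluster(1) == 4  # tolerate 1 traitor -> 4 machines
assert pbft_min_cluster(2) == 7  # tolerate 2 traitors -> 7 machines
assert pbft_quorum(4) == 3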
HotStuff (2018)
- The basis for Facebook's Libra/Diem (whose lineage continues in Aptos and Sui)
- Gives BFT a chain-like simplicity
- Linear message complexity
Blockchain and BFT
- Tendermint (Cosmos) — BFT-based
- Algorand, Avalanche, Aptos, Sui — all BFT variants
- Bitcoin/Ethereum PoW/PoS are probabilistic consensus (a different category)
Do Enterprises Need BFT?
- Typical internal systems: Raft/Paxos is enough
- Infrastructure shared across organizations (inter-bank, inter-government): consider BFT
- In 2025, BFT is a practical target in enterprise blockchains like Hyperledger Fabric and Corda
8. CRDT — Convergence Magic Without Consensus
Motivation
Consensus is expensive. It needs network round-trips + majority agreement. Offline, it doesn't even work. And yet...
Collaborative documents (Google Docs), messaging (WhatsApp), shopping carts all "eventually converge to the same state" without consensus. How?
CRDT (Conflict-free Replicated Data Type)
- Guarantees Strong Eventual Consistency
- Any merge order produces the same result
- Commutativity + Associativity + Idempotency
Two Flavors
State-based (CvRDT): share full state; merge(a, b) is a semilattice (join) operation
- G-Counter (increment-only), PN-Counter, OR-Set, LWW-Register
Operation-based (CmRDT): share operations, messages delivered over ordered/reliable channels
- More efficient but with strict prerequisites
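A state-based G-Counter is small enough to show whole. Its merge is a pointwise max — commutative, associative, and idempotent — so replicas converge no matter how states are exchanged. A minimal sketch:

```python
class GCounter:
    """State-based (CvRDT) grow-only counter."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {node_id: 0}  # per-replica increment counts

    def increment(self):
        self.counts[self.node_id] += 1

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # The semilattice join: pointwise max over all replicas.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

# Two replicas diverge, then converge regardless of merge order.
a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```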
Famous CRDT Implementations
- Automerge — collaboration on JSON-like data
- Yjs — document collaboration (VS Code Live Share, a candidate reviewed by Notion)
- Riak (2010s) — CRDT-based distributed DB
- Redis Enterprise — CRDT-based Active-Active geo-replication
- Figma — its own CRDT variant
Post-2024 — Local-First
- The Local-First Software movement
- Offline-first + CRDT synchronization
- Works without servers
- Jamsocket, Automerge, and Yjs lead the way
9. Weak Consistency Models
The Consistency Spectrum
Strong → Weak: Linearizable → Sequential → Causal → Eventual
Linearizable (Strong Consistency)
- All operations share a single global order
- Once written, every reader sees it
- Provided by Raft/Paxos-based systems
- Cost: higher latency, lower availability
Causal Consistency
- Preserves only cause-and-effect relationships (concurrent ops are free)
- Uses Vector Clocks / Version Vectors
- COPS, Eiger papers
- Sufficient for social-media timelines
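Causality checks reduce to comparing vector clocks componentwise. A sketch, with clocks as plain dicts mapping replica id to counter:

```python
def happened_before(a, b):
    # a -> b iff a <= b on every component and a != b.
    keys = set(a) | set(b)
    return a != b and all(a.get(k, 0) <= b.get(k, 0) for k in keys)

def concurrent(a, b):
    # Neither happened before the other: the operations commute freely.
    return a != b and not happened_before(a, b) and not happened_before(b, a)

assert happened_before({"x": 1}, {"x": 2, "y": 1})
assert concurrent({"x": 2}, {"y": 1})
```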
Eventual Consistency
- "Eventually" converges
- Default in DynamoDB and Cassandra
- Higher performance, lower consistency
- The tragedy: many teams still use AP systems as if they were CA
Misunderstandings of the CAP Theorem
- Brewer (2000) — "under a network partition, sacrifice Consistency OR Availability"
- Common misreading: "pick any two of C/A/P"
- Reality: partitions (P) are unavoidable, so you choose C vs A
- Modern extension: PACELC — even without partitions, you trade Latency vs Consistency
10. Distributed Locks — An Application of Consensus
The ZooKeeper Approach
- Create a sequential + ephemeral znode under `/locks/resource/`
- Watch the znode with the next-smaller sequence number than yours
- When that znode is deleted, it's your turn (see the kazoo sketch below)
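kazoo, the de-facto Python ZooKeeper client, ships this exact recipe as `Lock`. A minimal sketch (connection details are placeholders):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Lock implements the dance above: an ephemeral+sequential znode per
# contender, each watching only its immediate predecessor (no herd effect).
lock = zk.Lock("/locks/resource", identifier="worker-1")
with lock:  # blocks until our znode has the lowest sequence number
    pass    # critical section; the znode vanishes if our session dies

zk.stop()
```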
The etcd Approach
```python
lease = client.grant(10)          # 10-second TTL
client.put("/lock", "me", lease)  # acquire the lock on success
```
- Atomic acquisition via compare-and-swap (sketched below)
- TTL prevents zombie locks
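The `put` above is shorthand; a real acquisition should be a single atomic step. A sketch of the compare-and-swap as a transaction, assuming the community python-etcd3 client (library choice and exact API are assumptions):

```python
import etcd3  # assumed client library

client = etcd3.client()
lease = client.lease(10)  # 10-second TTL keeps the lock from going zombie

# Compare-and-swap: succeed only if /lock has never been created
# (create revision == 0); otherwise the transaction is a no-op.
acquired, _ = client.transaction(
    compare=[client.transactions.create("/lock") == 0],
    success=[client.transactions.put("/lock", "me", lease)],
    failure=[],
)
```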
Redlock (Redis) — Controversial
As covered in the earlier Redis post:
- Fine if for performance
- If correctness matters, use ZooKeeper/etcd
Fencing Tokens
- On lock acquisition, issue a monotonically increasing token
- When writing to the resource, check the token → reject old tokens
- Prevents zombie writes after GC pauses / network delay recoveries
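A sketch of the resource-side check — note it is the storage service, not the lock service, that enforces the token; all names are illustrative:

```python
class StaleToken(Exception):
    pass

class FencedResource:
    """Storage that rejects writes carrying an outdated fencing token."""

    def __init__(self):
        self.highest_token = -1
        self.data = None

    def write(self, token, value):
        # A client that held the lock, stalled through a long GC pause,
        # and lost its lease will arrive with an old token: reject it.
        if token <= self.highest_token:
            raise StaleToken(f"token {token} <= {self.highest_token}")
        self.highest_token = token
        self.data = value
```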
11. Leader Election in Practice — Pitfall Guide
Split Vote
In Raft, if multiple Candidates stand for election in the same term simultaneously, no one gets a majority
- Solution: randomized timeout (150-300ms)
- Ideally only one candidate stands per election round; otherwise voting rounds are wasted
Leader Flapping
Frequent leader changes due to network instability:
- Connection resets → commit delays
- Symptoms: a flood of `leader changed` logs
- Remedies: stabilize the network, tune the heartbeat interval, enable PreVote (a trial election held before incrementing the term)
Network Partition (Split-Brain)
- The side without a majority rejects reads/writes (if well-designed)
- Badly designed: both sides have a leader → data forks
- Recovery: forcibly discard one side, re-replicate from the other
Clock Drift
- Raft makes only weak timing assumptions (local timeouts; no synchronized clocks required)
- Still, NTP is a baseline
- Spanner provides global linearizability with TrueTime (GPS + atomic clocks)
12. The 2025 Consensus Landscape
Mainstream
- Raft — the de-facto default for new systems
- Multi-Raft — a Raft group per shard (CockroachDB, TiKV, YugabyteDB)
- Paxos/EPaxos — Google internals, some DBs
- ZAB — ZooKeeper, hardly any new adoption
Emerging
- HotStuff — Meta (Libra) origin, spreading as enterprise BFT
- Narwhal and Bullshark (Sui/Aptos) — DAG-based designs separating data dissemination from ordering
- Paxos-like Quorum Read/Write — the real read path in etcd and Spanner
Frontier
- Fast Paxos / Mencius — a single round trip on the fast path
- Generalized Paxos — consensus based on command dependencies
- Rollup consensus — sequencer consensus in Ethereum L2s
13. Top 10 Distributed-Consensus Anti-patterns
- "Roll your own Paxos" — guaranteed failure
- Fork a Raft library and patch on top — you'll miss upstream fixes
- An even node count (4 instead of 3 or 5) — quorum becomes 3, so you still tolerate only one failure
- Consensus over the WAN — latency explosions, regional splits
- Leaving ZooKeeper session timeout at default — long GCs lead to cascades
- Ignoring etcd's 8GB limit — one day, a sudden outage
- Raft in an environment that needs Byzantine tolerance — one malicious node ruins everything
- Heavy writes to a consensus system — KV only, values over 1MB forbidden
- Home-grown distributed locks — 99% buggy
- Using Eventual Consistency as if it were Strong — data forks
14. Using Distributed Consensus Wisely — Checklist
- Do not implement it yourself — use proven etcd/ZK/Consul
- Odd node count (3, 5, 7)
- Low-latency network or the same region
- SSD + a dedicated disk for the consensus log
- Backup & recovery automation —
etcdctl snapshot - Monitoring — leader changes, commit latency, disk fsync
- Respect size limits — etcd 8GB, ZK heap
- Drill blackout/partition scenarios (see Chaos Engineering)
- Document your consistency needs — strong vs eventual
- Adopt fencing tokens (when using distributed locks)
- Plan KRaft migration (if you use Kafka)
- Consider CRDTs — do you really need consensus?
Closing — The Paradox of Consensus
Every interesting problem in distributed systems ultimately reduces to consensus. Kubernetes agreeing on pod state, Kafka electing partition leaders, blockchains ordering transactions, collaborative docs merging edits — all of it.
As the FLP theorem says, "perfect consensus" is mathematically impossible. Every algorithm we use is a pragmatic compromise between safety and liveness. Paxos is elegant but hard, Raft is understandable but full of details, BFT is powerful but expensive, and CRDTs are an alternative that avoids consensus.
"The only way to make distributed systems simple is to understand that they're fundamentally hard, and stop pretending otherwise." — Kyle Kingsbury (Jepsen)
Next Up — Modern CI/CD Pipelines, Fully Disassembled
If consensus is the heart of distributed systems, CI/CD is the circulatory system of modern development. In the next post:
- The history of CI/CD — Hudson → Jenkins, Travis, CircleCI, GitHub Actions
- Pipeline as Code — the philosophy of `.github/workflows`
- Build caching — remote caches in Turborepo, Nx, Bazel
- Container-based builds — BuildKit, Nixpacks, Bun
- Test parallelization — sharding, defending against flaky tests
- Artifact management — OCI Registry, GHCR, Artifactory
- Secrets management — Vault, AWS Secrets Manager, OIDC
- Deployment strategies — Blue/Green, Canary (connected to the previous post)
- GitOps — Argo CD, Flux, pull-based deployments
- Supply chain security — SLSA, Sigstore, SBOM, provenance
- Dev loop speed — the "commit to deploy in 5 minutes" philosophy
A journey to complete the last piece of the developer productivity puzzle.
"The impossibility of consensus is not a bug in distributed systems — it's the fundamental feature that makes them interesting." — Leslie Lamport (Turing Award lecture, 2013)