✍️ Transcription Mode: The Essential Difficulty of Distributed Systems — CAP, PACELC, Raft/Paxos, Vector Clock, Saga, Event Sourcing (2025)
Why Distributed Systems Now — In 2026, You Can't Avoid It
Until 2015, "distributed systems were for Netflix and Google." In 2026, even 3-person startups deal with them.
- A single PostgreSQL instance gets a Read Replica after 6 months, then sharding a year later.
- The moment you adopt Kubernetes you are riding consensus (etcd/Raft) between API Server, Scheduler, kubelet.
- Kafka brings partitions, replication, offsets, and Exactly-Once Semantics problems.
- Cloudflare Workers + D1 + Durable Objects is essentially a distributed Raft cluster.
- LLM inference across GPUs and regions now requires distributed coordination too.
Distributed systems are no longer a choice — they are a condition. Yet it remains the most misunderstood topic in the industry. By the end of this article, you'll see why "we chose CP" is a nearly empty statement.
Part 1 — Why Distribution Is Essentially Hard
The 8 Fallacies of Distributed Computing
Peter Deutsch (Sun, 1994). Still valid 30 years later:
- The network is reliable — packets get dropped.
- Latency is zero — speed of light is finite (NY-Sydney RTT ~200ms).
- Bandwidth is infinite.
- The network is secure — MITM, DNS hijacking, BGP hijacking happen.
- Topology doesn't change.
- There is one administrator.
- Transport cost is zero — cloud egress bills prove otherwise.
- The network is homogeneous — Ethernet, WiFi, satellite, 5G under one TCP/IP skin.
99% of distributed systems papers address problems caused by these being false.
Two Processes Can't Agree on "Now"
On one machine, now() is clear. On two machines? Fundamentally uncertain.
- NTP sync has millisecond errors.
- VMs can be paused by hypervisors (Stop-the-World).
- Google Spanner, with atomic clocks and GPS, still admits a 7ms uncertainty interval.
Hence in distributed systems, event order is determined by logical time, not absolute time — Lamport's 1978 insight.
Part 2 — Rereading the CAP Theorem
Original Definition (Brewer 2000 / Gilbert & Lynch 2002)
"During a network Partition (P), a system can guarantee only one of Consistency (C) or Availability (A)."
Three common misconceptions:
Misconception 1: "Choose CP or AP"
Wrong. Without partitions, both C and A are achievable. CAP speaks only about the trade-off during a partition. Brewer clarified this in 2012's "CAP Twelve Years Later":
"The 'two of three' formulation was always misleading. In reality, the 'choice' is only relevant during partitions."
Misconception 2: "Consistency = C in ACID"
Wrong. CAP's C is Linearizability (a single object, single operation appears atomic). ACID's C means "no constraint violations" — a different concept.
Misconception 3: "Availability = service doesn't die"
Wrong. CAP's A requires every request receives a non-error response — far stricter than 99.99% uptime.
PACELC — What CAP Missed
Daniel Abadi (2010): "During Partition, A vs C; Else, Latency vs Consistency."
| System | On Partition | Normal |
|---|---|---|
| DynamoDB (default) | PA | EL (low latency, weak consistency) |
| Cassandra | PA | EL |
| MongoDB | PA/PC (config) | EC |
| HBase | PC | EC |
| PostgreSQL (single) | N/A | EC |
| Spanner | PC | EC (TrueTime keeps the latency cost low) |
PACELC matters because normal operation is 99.99% of the user experience. "We're CP" tells you nothing about the other 99.99%.
Consistency Model Spectrum
From strongest to weakest:
- Linearizable (Strong) — appears as a single machine to outside observers. Most expensive.
- Sequential — all processes see the same order; real-time order not guaranteed.
- Causal — causally related events are ordered (Vector Clock).
- Read-Your-Writes — I see my own writes immediately.
- Monotonic Reads — once seen, never see a staler value.
- Eventual — converges given enough time.
Most apps don't need Linearizable. Causal + Read-Your-Writes is usually enough.
Part 3 — Time and Order — Lamport to HLC
Lamport Clock (1978)
Leslie Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" — the most important distributed computing paper.
1. Each node maintains a monotonic local counter L.
2. Local event: L = L + 1
3. Send: L = L + 1, attach L to message
4. Receive: L = max(L_local, L_message) + 1
Limit: if a happens-before b, then L(a) < L(b) — but the converse doesn't hold. You can build a total order from the counters, yet you cannot tell causally related events apart from concurrent ones.
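A minimal sketch of these four rules in Python (the class and method names are mine, not Lamport's):

```python
class LamportClock:
    """Monotonic logical counter implementing the rules above."""
    def __init__(self):
        self.time = 0

    def tick(self):                  # rule 2: local event
        self.time += 1
        return self.time

    def send(self):                  # rule 3: increment, then attach the value to the message
        self.time += 1
        return self.time

    def receive(self, msg_time):     # rule 4: max(local, message) + 1
        self.time = max(self.time, msg_time) + 1
        return self.time
```

If A sends at L=5 and B sits at L=2, B's receive() jumps to 6 — the send is ordered before the receive. But two unrelated events can still carry comparable counters, which is exactly the limit described above.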
Vector Clock (Fidge, Mattern 1988)
Each node maintains counters for all nodes as a vector.
Node A: [A=3, B=1, C=0]
Node B: [A=2, B=4, C=0]
On send from A to B, attach A's vector.
B updates with element-wise max.
Compare V1, V2:
- V1 <= V2 element-wise: V1 happens-before V2
- Neither: concurrent
Used for conflict detection in Amazon's original Dynamo (2007) and in Riak; Cassandra deliberately uses last-write-wins timestamps instead. Downside: vector size grows with node count.
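A sketch of the update and comparison rules, assuming a small, known set of node IDs (function names are mine):

```python
def vc_increment(vc, node):
    """Local event or send on `node`: bump that node's slot."""
    vc = dict(vc)
    vc[node] = vc.get(node, 0) + 1
    return vc

def vc_merge(local, incoming, node):
    """Receive: element-wise max with the sender's vector, then bump own slot."""
    keys = set(local) | set(incoming)
    merged = {k: max(local.get(k, 0), incoming.get(k, 0)) for k in keys}
    return vc_increment(merged, node)

def vc_compare(v1, v2):
    """'before' (v1 happens-before v2), 'after', 'equal', or 'concurrent'."""
    keys = set(v1) | set(v2)
    le = all(v1.get(k, 0) <= v2.get(k, 0) for k in keys)
    ge = all(v1.get(k, 0) >= v2.get(k, 0) for k in keys)
    if le and ge:
        return "equal"
    if le:
        return "before"
    if ge:
        return "after"
    return "concurrent"

# The two vectors from the example above are concurrent — a genuine conflict:
# vc_compare({"A": 3, "B": 1, "C": 0}, {"A": 2, "B": 4, "C": 0}) -> "concurrent"
```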
HLC — Hybrid Logical Clock (2014)
Hybrid of physical and logical time. Used by CockroachDB, MongoDB 4.0+, YugabyteDB.
(wall_time, logical_counter)
Safe against NTP drift, human-readable timestamps, preserves causal order.
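A compressed sketch of the HLC update rules from the Kulkarni et al. paper, where `now_ms()` stands in for the node's NTP-synced physical clock:

```python
import time

def now_ms():
    return int(time.time() * 1000)         # physical (NTP-synced) clock

class HLC:
    """Hybrid Logical Clock: a (wall_time, logical_counter) pair."""
    def __init__(self):
        self.l, self.c = 0, 0

    def send(self):                         # also covers local events
        pt = now_ms()
        if pt > self.l:
            self.l, self.c = pt, 0          # physical clock moved forward: reset counter
        else:
            self.c += 1                     # clock stalled or stepped back: logical part advances
        return self.l, self.c

    def receive(self, l_msg, c_msg):
        pt = now_ms()
        new_l = max(self.l, l_msg, pt)
        if new_l == self.l == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0                      # pure physical time wins
        self.l = new_l
        return self.l, self.c
```

The wall-time part stays close to real time (readable timestamps), while the counter absorbs NTP jumps so causal order is never violated.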
TrueTime — Google Spanner's Secret Weapon (2012)
Atomic clocks + GPS in every datacenter. API:
TT.now() -> { earliest: t1, latest: t2 } # uncertainty interval
TT.after(t) # is t definitely in the past?
TT.before(t) # is t definitely in the future?
Spanner commit:
1. Choose commit timestamp s (after TT.now().latest)
2. Wait until s is definitely past (commit wait) -- usually under 7ms
3. Respond
Commit wait is what delivers global External Consistency. Spanner is the only production distributed DB whose guarantees rest on dedicated hardware (atomic clocks). AWS Time Sync (microsecond precision since 2023) now approaches TrueTime-style bounds without custom hardware.
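The commit-wait rule itself is tiny. Below is a conceptual sketch; `tt_now()` is a made-up stand-in for the real TrueTime API (which is internal to Google), returning an uncertainty interval of roughly +/- 3.5ms:

```python
import time

def tt_now():
    """Hypothetical TT.now(): physical time with a +/- 3.5ms uncertainty bound."""
    t, eps = time.time(), 0.0035
    return t - eps, t + eps                  # (earliest, latest)

def commit(apply_writes):
    s = tt_now()[1]                          # 1. pick a timestamp no earlier than TT.now().latest
    while tt_now()[0] < s:                   # 2. commit wait: until s is definitely in the past
        time.sleep(0.0005)
    apply_writes(s)                          # 3. make the writes visible and respond
    return s
```

The waiting loop is why a tighter uncertainty bound (better clocks) translates directly into lower commit latency.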
Part 4 — Consensus — Paxos and Raft
FLP Impossibility (1985)
Fischer, Lynch, Paterson: async network + any process may fail → no deterministic consensus algorithm exists.
How do Paxos/Raft work, then? They sidestep FLP with timeouts and randomization: not deterministic in the FLP sense, but they converge in practice.
Paxos (Lamport, submitted 1989, published 1998)
"The Part-Time Parliament." The Greek-island parliament metaphor caused reviewers to reject it; publication delayed 9 years. Infamous for being "incomprehensible."
Roles:
- Proposer: proposes values
- Acceptor: accepts/rejects (majority is key)
- Learner: learns the chosen value
Two phases:
Phase 1 (Prepare):
Proposer -> Acceptors: "Will you accept proposal n?"
Acceptor -> Proposer: "Yes if n > any prior; returns prior accepted."
Phase 2 (Accept):
Proposer -> Acceptor: "Accept proposal n with value v"
(v = most recent accepted value seen, or proposer's choice)
Acceptor -> Proposer: "Accept/reject"
Majority acceptance = value chosen.
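The acceptor's half of this protocol fits in a handful of lines. A single-decree sketch of the two handlers just described (persistence to disk and networking omitted; names are mine):

```python
class Acceptor:
    """Single-decree Paxos acceptor: remembers its highest promise and last accepted proposal."""
    def __init__(self):
        self.promised_n = -1                 # highest proposal number promised
        self.accepted_n = -1                 # number of the last accepted proposal
        self.accepted_v = None

    def on_prepare(self, n):                 # Phase 1
        if n > self.promised_n:
            self.promised_n = n
            # promise, and report any previously accepted proposal so the proposer can adopt it
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", self.promised_n, None)

    def on_accept(self, n, v):               # Phase 2
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return "accepted"
        return "rejected"
```

A proposer that gets a majority of promises must propose the value attached to the highest-numbered accepted proposal it saw (or its own value if none); once a majority of acceptors accepts the same proposal, that value is chosen.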
Why notorious:
- Too many variants (Multi-Paxos, Fast Paxos, Generalized Paxos).
- Paper style not engineer-friendly.
- Implementation edge cases not documented.
Used in Google Chubby, early ZooKeeper, Megastore, Spanner.
Raft (Ongaro & Ousterhout, 2014)
Diego Ongaro, then a Stanford PhD student, created Raft "because Paxos is too hard." Paper title: "In Search of an Understandable Consensus Algorithm."
Decomposed into three sub-problems:
1. Leader Election
- All nodes start as Follower.
- No heartbeat within a randomized election timeout -> become Candidate and request votes (see the sketch after this list).
- Majority vote -> Leader.
- term number in heartbeats prevents split-brain.
2. Log Replication
- All writes go to leader.
- Leader appends to its log -> AppendEntries RPC to followers.
- Majority replicated -> commit -> apply to state machine.
3. Safety
- Election Safety: at most one leader per term.
- Leader Append-Only: never overwrite log.
- Log Matching: same index and term -> the logs are identical up to that index.
- Leader Completeness: committed entries present in all future leaders.
- State Machine Safety: no two servers apply different entries at the same log index.
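A stripped-down sketch of the election mechanics (terms, randomized timeouts, one vote per term, majority wins). A real implementation also persists term/vote to disk and refuses to vote for candidates with stale logs; `request_vote` here is a caller-supplied RPC stub:

```python
import random, time

class RaftNode:
    def __init__(self, node_id, peers):
        self.id, self.peers = node_id, peers
        self.state, self.term, self.voted_for = "follower", 0, None
        self.reset_election_deadline()

    def reset_election_deadline(self):
        # the randomized timeout is what prevents repeated split votes
        self.deadline = time.monotonic() + random.uniform(0.15, 0.30)

    def on_heartbeat(self, leader_term):
        if leader_term >= self.term:                  # stale leaders are ignored
            self.term, self.state = leader_term, "follower"
            self.reset_election_deadline()

    def on_vote_request(self, term, candidate_id):
        if term > self.term:
            self.term, self.voted_for, self.state = term, None, "follower"
        if term == self.term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id             # at most one vote per term
            self.reset_election_deadline()
            return True
        return False

    def maybe_start_election(self, request_vote):
        if self.state != "leader" and time.monotonic() > self.deadline:
            self.state, self.term, self.voted_for = "candidate", self.term + 1, self.id
            votes = 1 + sum(request_vote(peer, self.term, self.id) for peer in self.peers)
            if votes > (len(self.peers) + 1) // 2:    # majority of the whole cluster
                self.state = "leader"                 # start sending heartbeats
            self.reset_election_deadline()
```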
Systems using Raft: etcd (Kubernetes), Consul, TiKV/TiDB, CockroachDB, NATS JetStream, Redpanda, Cloudflare Durable Objects.
Paxos = proven theory. Raft = practical engineering. 90% of 2020s-era new systems use Raft.
Multi-Paxos, EPaxos, Flexible Paxos
- Multi-Paxos: Phase 1 once per leader, Phase 2 per request.
- EPaxos (Egalitarian): leaderless, every node proposes. Good for geo-distribution.
- Flexible Paxos: separate quorum sizes for Phase 1/2.
Part 5 — Transactions — 2PC, 3PC, Saga
2PC (Two-Phase Commit)
Simplest and most dangerous transaction protocol.
Phase 1 (Prepare):
Coordinator -> Participants: "Can commit?"
Participants -> Coordinator: "Yes / No"
Phase 2 (Commit):
All Yes -> "Commit!"
Any No -> "Abort!"
Fatal flaw: if the Coordinator dies right before Phase 2, participants block indefinitely while holding locks. In practice 2PC survives mostly in legacy XA setups (DB2, Oracle RAC) — or layered on top of consensus, as Spanner does across partitions.
3PC — adds a phase to avoid blocking, but still fails under network partition. Rarely used.
Saga Pattern (Garcia-Molina & Salem, 1987)
Decompose a long transaction into local transactions + compensating transactions.
Order -> Payment -> Inventory -> Shipping
On failure, compensate in reverse:
Shipping fail -> restore inventory -> refund payment -> cancel order
Two styles:
- Choreography: each service publishes/subscribes events. No central coordinator.
- Orchestration: Saga Orchestrator calls each step centrally.
2020s mainstream:
- Temporal.io (open-sourced Uber Cadence) — Workflow-as-Code standard.
- AWS Step Functions — serverless Saga.
- Netflix Conductor — early orchestrator.
Saga's limit: no Isolation. Intermediate state is externally visible; mitigated by "semantic locks."
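A minimal orchestration-style sketch: each step pairs a local transaction with its compensation, and a failure unwinds the completed steps in reverse. The toy `charge`/`reserve` functions below only stand in for the Payment and Inventory services from the example above:

```python
def run_saga(steps, ctx):
    """steps: list of (action, compensation) callables operating on a shared context dict."""
    done = []
    for action, compensate in steps:
        try:
            ctx = action(ctx)
            done.append(compensate)
        except Exception:
            for compensate in reversed(done):   # compensate in reverse order
                compensate(ctx)                 # real systems must retry these until they succeed
            raise

# Toy local transactions standing in for Payment and Inventory
def charge(ctx):  ctx["charged"] = True; return ctx
def refund(ctx):  ctx["charged"] = False
def reserve(ctx): raise RuntimeError("out of stock")   # simulate a mid-saga failure
def restore(ctx): pass

try:
    run_saga([(charge, refund), (reserve, restore)], {})
except RuntimeError:
    pass   # charge succeeded, reserve failed, refund ran before the error propagated
```

Orchestrators like Temporal or Step Functions essentially run this loop for you, with the added guarantee that the orchestrator's own progress survives crashes, so pending compensations are never forgotten.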
TCC (Try-Confirm-Cancel)
From Alibaba's microservice practice. Three steps: reserve the resource (Try), commit the reservation (Confirm), or release it on failure (Cancel).
Part 6 — Event Sourcing + CQRS
Event Sourcing
"Store the sequence of events, not the state."
[current state: balance $100] -- traditional
[events: +$50, -$20, +$70] -- Event Sourcing
-> replay to $100
Pros: perfect audit trail, time travel, natural event integration. Cons: needs snapshots, schema evolution is hard, querying current state is expensive (requires CQRS).
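The core mechanic is just append-and-fold. A sketch using the balance example above (the snapshots mentioned in the cons would replace replaying from zero):

```python
events = []                                   # the append-only event store

def append(event):                            # write side: record facts, never mutate state
    events.append(event)

def current_balance():                        # read side: fold (replay) events into state
    return sum(e["amount"] for e in events)   # with long histories, start from a snapshot instead

append({"type": "Deposited", "amount": 50})
append({"type": "Withdrawn", "amount": -20})
append({"type": "Deposited", "amount": 70})
assert current_balance() == 100               # replay reproduces the traditional "current state"
```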
CQRS
Physically separate write and read models.
Command -> Write Model (Event Store) -> events
-> Read Model (Materialized View)
-> Read Model (ElasticSearch)
-> Read Model (Redis)
Query -> Read Models
Powerful but complexity explodes. Don't apply to simple CRUD.
Mainstream: EventStoreDB, Kafka + ksqlDB, Axon Framework (Java), Marten + PostgreSQL JSONB (.NET).
Outbox Pattern — the Exactly-Once Secret
Atomic "DB commit + message publish"?
Naive (wrong):
1. DB commit
2. Kafka publish
Step 2 failure breaks consistency. Reverse order has same problem.
Outbox Pattern:
BEGIN;
INSERT INTO orders (...);
INSERT INTO outbox (event_type, payload) VALUES (...);
COMMIT;
-- A separate process polls outbox and publishes to Kafka.
-- On success, delete or mark the outbox row.
Debezium (Kafka Connect CDC) automates this: PostgreSQL WAL -> Kafka stream.
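If you hand-roll the relay instead of using Debezium, it is a poll-publish-mark loop. A sketch assuming psycopg2, a `publish()` stand-in for the Kafka producer, and two columns (`id`, `published_at`) added to the outbox table from the SQL above:

```python
import time
import psycopg2   # assumed Postgres driver; Debezium/CDC replaces this whole loop in practice

def publish(event_type, payload):
    """Stand-in for a Kafka producer send + flush."""
    print("publish", event_type, payload)

def relay(dsn, interval=1.0):
    conn = psycopg2.connect(dsn)
    while True:
        with conn, conn.cursor() as cur:      # one DB transaction per batch
            cur.execute(
                "SELECT id, event_type, payload FROM outbox "
                "WHERE published_at IS NULL ORDER BY id LIMIT 100 FOR UPDATE SKIP LOCKED")
            for row_id, event_type, payload in cur.fetchall():
                publish(event_type, payload)  # at-least-once: a crash here can cause duplicates
                cur.execute("UPDATE outbox SET published_at = now() WHERE id = %s", (row_id,))
        time.sleep(interval)
```

Note the delivery guarantee: the relay is at-least-once, which is exactly why Part 7's idempotent-consumer advice applies downstream.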
Part 7 — Exactly-Once Semantics — Does It Exist?
Three Meanings of "Exactly Once"
- Message transmission ExO: fundamentally impossible (Two Generals Problem).
- Message processing ExO: achievable via dedup + idempotency.
- End effect ExO: consumer-side idempotency is the key.
At-Least-Once + Idempotency = Effective Exactly-Once
Kafka's official guideline. Assume producer duplicates; design idempotent consumers.
INSERT INTO processed_events (event_id, result)
VALUES ('evt-123', ...)
ON CONFLICT (event_id) DO NOTHING;
Kafka Transactional Producer (2017+)
// assumes transactional.id (and enable.idempotence) are set in the producer config
producer.initTransactions();   // fences off zombie producers sharing the same transactional.id
producer.beginTransaction();
producer.send(record1);
producer.send(record2);
// consumed offsets can be committed inside the same transaction via sendOffsetsToTransaction(...)
producer.commitTransaction();  // on error: producer.abortTransaction()
Atomic partition/offset commit + consumer isolation.level=read_committed = Exactly-Once, provided producer/consumer/broker configs align.
Part 8 — Production Distributed DBs — Spanner, CockroachDB, TiDB, YugabyteDB
Google Spanner
- Paper at OSDI 2012.
- TrueTime-based External Consistency.
- Paxos within partition + 2PC across partitions.
- SQL + distributed transactions + horizontal scaling — deemed "impossible" until mid-2010s.
- Commercialized as Cloud Spanner in 2017.
CockroachDB (2015)
"Spanner open-sourced." Ex-Google engineers.
- Raft instead of Paxos.
- HLC instead of TrueTime (no hardware dependency).
- PostgreSQL wire-protocol compatible.
- Geo-partitioning at row level.
Relicensed under the BSL in 2019, to some community backlash. A $278M Series F (late 2021) accelerated adoption.
TiDB (PingCAP, 2016)
Chinese origin.
- MySQL compatible (vs Cockroach's PostgreSQL).
- TiKV (Raft) + TiDB (SQL) + PD (Placement Driver).
- HTAP: row store (TiKV) + column store (TiFlash).
- PingCAP revenue crossed $100M in 2024.
YugabyteDB (2016)
Ex-Facebook/Nutanix founders.
- Spanner-inspired + 100% PostgreSQL compatible.
- Raft + HLC.
- True read-your-writes since 2.18 (2023).
Part 9 — Streaming Distribution — Kafka, Pulsar, Redpanda
Kafka
- Partition = unit of distribution.
- ISR (In-Sync Replicas) — synchronized replicas.
- acks=all for maximum durability.
- KRaft (2022 GA) — ZooKeeper removed, Raft-based metadata.
Pulsar (Apache, Yahoo 2016)
- Compute/Storage separated (broker vs BookKeeper).
- First-class geo-replication.
- More complex ops than Kafka, but strong multi-tenancy.
Redpanda (2019)
- Kafka-compatible in C++.
- Built-in Raft, no ZooKeeper/KRaft.
- Thread-per-core (Seastar) — dramatically lower latency.
- Claims safety without fsync (via WAL + Raft).
Part 10 — Real Failure Cases
1. Roblox 73-hour Outage (Oct 2021)
Cause: Consul's BoltDB writeback contention -> write stalls -> leader election blocked -> total outage. Lesson: background compaction in KV stores affects consensus.
2. GitLab Data Loss (Jan 2017)
Cause: engineer ran rm -rf on primary. All 5 backups failed (misconfigured/empty/stale).
Lesson: "we have backups" means nothing without verified restores.
3. Knight Capital $440M in 45 min (2012)
Cause: a deploy missed 1 of 8 servers; the server still running old code triggered unlimited buys. Lesson: deploy atomicity failure = distributed inconsistency = economic disaster.
4. Cloudflare Regex Outage (July 2019)
Cause: one WAF regex with catastrophic backtracking pinned CPUs at 100% -> the entire edge went down. Lesson: the blast radius of a single weak component can be the whole system.
5. AWS us-east-1 DynamoDB 2015
Cause: metadata service overload -> table-list failures -> cascading failures. Lesson: "everything as a service" can become a distributed SPOF.
Part 11 — Practical Checklist (12 items)
- Speak PACELC, not CAP — steady-state dominates.
- Exactly-Once is a myth — design At-Least-Once + idempotency.
- Don't trust clocks — use HLC/TrueTime-based DBs or Lamport clocks.
- Don't introduce new 2PC — use Saga or Outbox.
- For consensus, use Raft — via etcd/Consul/proven libs; don't roll your own.
- Systems that depend on leaders must handle "no leader" periods.
- Understand quorum sizes — N=3 tolerates only 1 failure.
- Retries always need exponential backoff + jitter (see the sketch after this list).
- Design cache invalidation upfront — retrofitting is hell.
- No distribution without observability — tracing/metrics/logs trio.
- Adopt Chaos Engineering — inject partitions deliberately.
- Practice data recovery regularly — GitLab can happen again.
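For the retry item above, a minimal "full jitter" backoff wrapper — sleep a random amount between zero and an exponentially growing cap — around any idempotent callable:

```python
import random
import time

def retry(op, max_attempts=5, base=0.1, cap=10.0):
    """Full-jitter exponential backoff: sleep uniform(0, min(cap, base * 2^attempt)) between tries."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise                          # out of attempts: surface the error
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter matters as much as the exponential part: without it, every client that failed together retries together, and the retry storm re-creates the original overload.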
Part 12 — 10 Anti-Patterns
- Saying "we chose CP" — CAP misunderstood.
- Single coordinator — guaranteed SPOF.
- Distributed locks for business logic — usually solvable via idempotency.
- Local clocks for ordering — NTP drift breeds bugs.
- Transactions across microservice boundaries — redesign with Saga/Outbox.
- Retries without timeouts — amplify outages.
- Distributed monolith — many services, all deployed together.
- Testing only locally — distributed bugs live in latency.
- Calling single-leader DB with replicas "horizontal scaling."
- Hiding every inconsistency behind "eventual consistency" — users don't understand "eventual."
Part 13 — Learning Resources
- Book: Designing Data-Intensive Applications (Kleppmann) — best intro.
- Book: Database Internals (Petrov) — LSM/B-Tree/Consensus internals.
- Book: Understanding Distributed Systems (Vitillo).
- Papers: Lamport "Time, Clocks" / Raft / Spanner OSDI 2012 / Bigtable / Dynamo 2007.
- Course: MIT 6.824 Distributed Systems (YouTube).
- Jepsen.io — real-world violations of consistency claims.
Closing — Distributed Systems Is Philosophy
Lamport's famous definition:
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
Distributed systems is the art of producing trustworthy results in an uncontrolled world. There is no perfection — only trade-offs. Make those trade-offs explicitly and consciously, and you can handle near-infinite scale.
In 2026, every engineer writes distributed code. Most just don't realize it.
Coming Next — "Database Internals Fully Dissected" — B-Tree, LSM, WAL, MVCC, Vacuum, Replication, Index Secrets
This series covered distribution "above." Next: distribution "below" — the heart of a single DB: B-Tree vs LSM, WAL, MVCC internals, Vacuum/Compaction, index structures (BRIN, GIN, GiST, HNSW), query planner, replication internals, partitioning strategies. The full answer to "why does a DB get slow?" — next post.