The Essential Difficulty of Distributed Systems — CAP, PACELC, Raft/Paxos, Vector Clock, Saga, Event Sourcing (2025)

Why Distributed Systems Now — In 2026, You Can't Avoid It

Until 2015, "distributed systems were for Netflix and Google." In 2026, even 3-person startups deal with them.

  • A single PostgreSQL instance gets a Read Replica after 6 months, then sharding a year later.
  • The moment you adopt Kubernetes, you are riding on consensus: the API Server, Scheduler, and kubelets all coordinate through etcd, which is a Raft cluster.
  • Kafka brings partitions, replication, offsets, and Exactly-Once Semantics problems.
  • Cloudflare Workers + D1 + Durable Objects is essentially a distributed Raft cluster.
  • LLM inference across GPUs and regions now requires distributed coordination too.

Distributed systems are no longer a choice — they are a condition. Yet they remain one of the most misunderstood topics in the industry. By the end of this article, you'll see why "we chose CP" is a nearly empty statement.

Part 1 — Why Distribution Is Essentially Hard

The 8 Fallacies of Distributed Computing

Peter Deutsch (Sun, 1994). Still valid 30 years later:

  1. The network is reliable — packets get dropped.
  2. Latency is zero — speed of light is finite (NY-Sydney RTT ~200ms).
  3. Bandwidth is infinite.
  4. The network is secure — MITM, DNS hijacking, BGP hijacking happen.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero — cloud egress bills prove otherwise.
  8. The network is homogeneous — Ethernet, WiFi, satellite, 5G under one TCP/IP skin.

99% of distributed systems papers address problems caused by these being false.

Two Processes Can't Agree on "Now"

On one machine, now() is clear. On two machines? Fundamentally uncertain.

  • NTP sync has millisecond errors.
  • VMs can be paused by hypervisors (Stop-the-World).
  • Google Spanner, with atomic clocks and GPS, still admits a 7ms uncertainty interval.

Hence in distributed systems, event order is determined by logical time, not absolute time — Lamport's 1978 insight.

Part 2 — Rereading the CAP Theorem

Original Definition (Brewer 2000 / Gilbert & Lynch 2002)

"During a network Partition (P), a system can guarantee only one of Consistency (C) or Availability (A)."

Three common misconceptions:

Misconception 1: "Choose CP or AP"

Wrong. Without partitions, both C and A are achievable. CAP speaks only about the trade-off during a partition. Brewer clarified this in 2012's "CAP Twelve Years Later":

"The 'two of three' formulation was always misleading. In reality, the 'choice' is only relevant during partitions."

Misconception 2: "Consistency = C in ACID"

Wrong. CAP's C is Linearizability: operations on a single object appear to take effect atomically, in an order consistent with real time. ACID's C means "no constraint violations" — a different concept.

Misconception 3: "Availability = service doesn't die"

Wrong. CAP's A requires that every request to a non-failing node receives a non-error response — far stricter than 99.99% uptime.

PACELC — What CAP Missed

Daniel Abadi (2010): "During Partition, A vs C; Else, Latency vs Consistency."

System               | On Partition     | Else (normal operation)
DynamoDB (default)   | PA               | EL (low latency, weak consistency)
Cassandra            | PA               | EL
MongoDB              | PA/PC (config)   | EC
HBase                | PC               | EC
PostgreSQL (single)  | N/A              | EC
Spanner              | PC               | EC (pays latency for consistency via TrueTime)

PACELC matters because normal operation is 99.99% of the user experience. "We're CP" tells you nothing about the other 99.99%.

Consistency Model Spectrum

From strongest to weakest:

  1. Linearizable (Strong) — appears as a single machine to outside observers. Most expensive.
  2. Sequential — all processes see the same order; real-time order not guaranteed.
  3. Causal — causally related events are ordered (Vector Clock).
  4. Read-Your-Writes — I see my own writes immediately.
  5. Monotonic Reads — once seen, never see a staler value.
  6. Eventual — converges given enough time.

Most apps don't need Linearizable. Causal + Read-Your-Writes is usually enough.

Part 3 — Time and Order — Lamport to HLC

Lamport Clock (1978)

Leslie Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" — the most important distributed computing paper.

1. Each node maintains a monotonic local counter L.
2. Local event: L = L + 1
3. Send: L = L + 1, attach L to message
4. Receive: L = max(L_local, L_message) + 1

Limit: if a happens-before b, then L(a) < L(b); the converse does not hold. Lamport clocks give you a total order, but they cannot tell causally related events from concurrent ones.
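
A minimal sketch of these update rules (illustrative Python, not any particular library):

class LamportClock:
    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        # rule 2: tick on every local event
        self.time += 1
        return self.time

    def send(self) -> int:
        # rule 3: tick, then attach the result to the outgoing message
        self.time += 1
        return self.time

    def receive(self, msg_time: int) -> int:
        # rule 4: take the max of local and message time, then tick
        self.time = max(self.time, msg_time) + 1
        return self.time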

Vector Clock (Fidge, Mattern 1988)

Each node maintains counters for all nodes as a vector.

Node A: [A=3, B=1, C=0]
Node B: [A=2, B=4, C=0]

On send from A to B, attach A's vector.
On receive, B takes the element-wise max, then increments its own entry.

Compare V1, V2:
- V1 <= V2 element-wise (and V1 != V2): V1 happens-before V2
- Neither dominates the other: concurrent

Used for conflict detection in Amazon's original Dynamo (2007) and in Riak; Cassandra opted for last-write-wins timestamps instead. Downside: vector size grows with the number of nodes.
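
A sketch of the receive and comparison rules (illustrative Python; node IDs are plain strings):

def receive(local: dict, incoming: dict, self_id: str) -> dict:
    # element-wise max of both vectors, then increment our own entry
    merged = {n: max(local.get(n, 0), incoming.get(n, 0))
              for n in set(local) | set(incoming)}
    merged[self_id] = merged.get(self_id, 0) + 1
    return merged

def happens_before(v1: dict, v2: dict) -> bool:
    nodes = set(v1) | set(v2)
    return (all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)
            and any(v1.get(n, 0) < v2.get(n, 0) for n in nodes))

def concurrent(v1: dict, v2: dict) -> bool:
    # neither vector dominates the other
    return not happens_before(v1, v2) and not happens_before(v2, v1)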

HLC — Hybrid Logical Clock (2014)

Hybrid of physical and logical time. Used by CockroachDB, MongoDB 4.0+, YugabyteDB.

(wall_time, logical_counter)

Safe against NTP drift, human-readable timestamps, preserves causal order.
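
A simplified sketch of the HLC update rules from the 2014 paper (illustrative Python; production implementations also bound how far the logical part may run ahead of the wall clock):

import time

class HLC:
    def __init__(self) -> None:
        self.l = 0  # highest wall-clock time seen so far (ns)
        self.c = 0  # logical counter, breaks ties within the same wall time

    def now(self):
        # local event or send
        pt = time.time_ns()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return self.l, self.c

    def update(self, l_msg: int, c_msg: int):
        # receive: merge the sender's timestamp with our own and the wall clock
        pt = time.time_ns()
        new_l = max(self.l, l_msg, pt)
        if new_l == self.l == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0
        self.l = new_l
        return self.l, self.c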

TrueTime — Google Spanner's Secret Weapon (2012)

Atomic clocks + GPS in every datacenter. API:

TT.now() -> { earliest: t1, latest: t2 }  # uncertainty interval
TT.after(t)  # is t definitely in the past?
TT.before(t) # is t definitely in the future?

Spanner commit:

1. Choose commit timestamp s (after TT.now().latest)
2. Wait until s is definitely past (commit wait) -- usually under 7ms
3. Respond

Commit wait is what delivers global External Consistency. Spanner is the only major production distributed DB that depends on specialized clock hardware (atomic clocks + GPS). AWS Time Sync (microsecond precision since 2023) now approaches TrueTime-grade bounds without custom hardware.
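
A toy illustration of commit wait, assuming a hypothetical tt_now() that returns an uncertainty interval (this is not the real Spanner API):

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now(eps: float = 0.007) -> TTInterval:
    # assume a +/- 7ms uncertainty bound around the local clock
    t = time.time()
    return TTInterval(t - eps, t + eps)

def commit(apply_writes) -> float:
    s = tt_now().latest              # 1. pick a timestamp no earlier than anyone's "now"
    apply_writes(s)
    while tt_now().earliest <= s:    # 2. commit wait: spin until s is definitely in the past
        time.sleep(0.001)
    return s                         # 3. only now acknowledge the commit to the client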

Part 4 — Consensus — Paxos and Raft

FLP Impossibility (1985)

Fischer, Lynch, Paterson: in a fully asynchronous system where even a single process may crash, no deterministic consensus algorithm can guarantee termination.

How do Paxos/Raft work then? They give up guaranteed termination, not safety: timeouts and randomized election timers make them converge in practice, even though liveness cannot be proven in a purely asynchronous model.

Paxos (Lamport, submitted 1989, published 1998)

"The Part-Time Parliament." The Greek-island parliament metaphor caused reviewers to reject it; publication delayed 9 years. Infamous for being "incomprehensible."

Roles:

  • Proposer: proposes values
  • Acceptor: accepts/rejects (majority is key)
  • Learner: learns the chosen value

Two phases:

Phase 1 (Prepare):
  Proposer -> Acceptors: "Promise to ignore proposals numbered below n?"
  Acceptor -> Proposer: "Promised, if n is higher than anything I've seen; here is the highest-numbered proposal I've already accepted (if any)."

Phase 2 (Accept):
  Proposer -> Acceptors: "Accept proposal n with value v"
  (v = value of the highest-numbered proposal reported in Phase 1, or the proposer's own choice if none was reported)
  Acceptor -> Proposer: "Accepted / rejected"

Majority acceptance = value chosen.
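
A minimal single-decree acceptor, to make the two phases concrete (illustrative Python; ignores persistence and networking, both of which are essential in a real implementation):

class Acceptor:
    def __init__(self) -> None:
        self.promised_n = -1    # highest proposal number we promised to honor
        self.accepted_n = -1    # highest proposal number we actually accepted
        self.accepted_v = None  # the value of that proposal

    def prepare(self, n: int):
        # Phase 1: promise to ignore proposals below n, report anything already accepted
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def accept(self, n: int, v) -> bool:
        # Phase 2: accept unless a higher-numbered prepare arrived in the meantime
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_v = v
            return True
        return False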

Why notorious:

  1. Too many variants (Multi-Paxos, Fast Paxos, Generalized Paxos).
  2. Paper style not engineer-friendly.
  3. Implementation edge cases not documented.

Used in Google Chubby, Megastore, and Spanner. (ZooKeeper uses ZAB, a close relative of Paxos, rather than Paxos itself.)

Raft (Ongaro & Ousterhout, 2014)

Stanford PhD Diego Ongaro created Raft "because Paxos is too hard." Paper title: "In Search of an Understandable Consensus Algorithm."

Decomposed into three sub-problems:

1. Leader Election

  • All nodes start as Follower.
  • No heartbeat within a randomized election timeout -> become Candidate and request votes.
  • Majority vote -> Leader.
  • term number in heartbeats prevents split-brain.

2. Log Replication

  • All writes go to leader.
  • Leader appends log -> AppendEntries RPC to followers.
  • Majority replicated -> commit -> apply to state machine.

3. Safety

  • Election Safety: at most one leader per term.
  • Leader Append-Only: never overwrite log.
  • Log Matching: same index/term -> identical history before.
  • Leader Completeness: committed entries present in all future leaders.
  • State Machine Safety: same index -> same value everywhere.
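
To make the leader-election rules concrete, a stripped-down sketch of the election timer (illustrative Python; real implementations such as etcd's Raft library also compare log freshness, persist state, and use real RPCs):

import random, time

class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id, self.peers = node_id, peers
        self.state, self.term = "follower", 0
        self.last_heartbeat = time.monotonic()
        self.timeout = random.uniform(0.15, 0.30)  # randomized election timeout

    def on_heartbeat(self, leader_term):
        if leader_term >= self.term:               # ignore heartbeats from stale terms
            self.term, self.state = leader_term, "follower"
            self.last_heartbeat = time.monotonic()

    def tick(self):
        if self.state != "leader" and time.monotonic() - self.last_heartbeat > self.timeout:
            self.term += 1
            self.state = "candidate"
            votes = 1 + sum(p.request_vote(self.term, self.node_id) for p in self.peers)
            if votes > (len(self.peers) + 1) // 2:  # strict majority of the cluster
                self.state = "leader"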

Systems using Raft: etcd (Kubernetes), Consul, TiKV/TiDB, CockroachDB, NATS JetStream, Redpanda, Cloudflare Durable Objects.

Paxos = proven theory. Raft = practical engineering. 90% of 2020s-era new systems use Raft.

Multi-Paxos, EPaxos, Flexible Paxos

  • Multi-Paxos: Phase 1 once per leader, Phase 2 per request.
  • EPaxos (Egalitarian): leaderless, every node proposes. Good for geo-distribution.
  • Flexible Paxos: separate quorum sizes for Phase 1/2.

Part 5 — Transactions — 2PC, 3PC, Saga

2PC (Two-Phase Commit)

Simplest and most dangerous transaction protocol.

Phase 1 (Prepare):
  Coordinator -> Participants: "Can commit?"
  Participants -> Coordinator: "Yes / No"

Phase 2 (Commit):
  All Yes -> "Commit!"
  Any No  -> "Abort!"

Fatal flaw: if the Coordinator dies after participants have voted Yes but before Phase 2 completes, participants block indefinitely while holding locks. In practice 2PC survives mostly in legacy XA setups (DB2, Oracle RAC) or layered on top of consensus groups, as Spanner does across Paxos groups.
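
A sketch of the coordinator, with the blocking window marked (illustrative Python; the participant objects are hypothetical):

def two_phase_commit(participants) -> bool:
    # Phase 1: collect votes; each participant prepares and holds its locks
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # if the coordinator crashes HERE, participants block indefinitely, locks held
    # Phase 2: broadcast the decision
    for p in participants:
        p.commit() if decision else p.abort()
    return decision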

3PC — adds a phase to avoid blocking, but still fails under network partition. Rarely used.

Saga Pattern (Garcia-Molina & Salem, 1987)

Decompose a long transaction into local transactions + compensating transactions.

Order -> Payment -> Inventory -> Shipping

On failure, compensate in reverse:
Shipping fail -> restore inventory -> refund payment -> cancel order

Two styles:

  • Choreography: each service publishes/subscribes events. No central coordinator.
  • Orchestration: Saga Orchestrator calls each step centrally.
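
In the orchestration style, the coordinator runs each local transaction and, on failure, runs the compensations already registered, in reverse. A minimal sketch (illustrative Python; the service calls are hypothetical):

def run_saga(order):
    steps = [
        (create_order,      cancel_order),       # (action, compensating action)
        (charge_payment,    refund_payment),
        (reserve_inventory, restore_inventory),
        (schedule_shipping, cancel_shipping),
    ]
    completed = []
    try:
        for action, compensate in steps:
            action(order)
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):   # undo finished steps in reverse order
            compensate(order)
        raise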

2020s mainstream:

  • Temporal.io (a fork of Uber's Cadence by its original authors) — the Workflow-as-Code standard.
  • AWS Step Functions — serverless Saga.
  • Netflix Conductor — early orchestrator.

Saga's limit: no Isolation. Intermediate state is externally visible; mitigated by "semantic locks."

TCC (Try-Confirm-Cancel)

Popularized in Alibaba's microservice stack. Try reserves resources; then either Confirm applies the reservation or Cancel releases it.

Part 6 — Event Sourcing + CQRS

Event Sourcing

"Store the sequence of events, not the state."

[current state: balance $100]  -- traditional

[events: +$50, -$20, +$70]  -- Event Sourcing
-> replay to $100

Pros: perfect audit trail, time travel, natural event integration. Cons: needs snapshots, schema evolution is hard, querying current state is expensive (requires CQRS).
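
Replaying events into current state is just a fold; with periodic snapshots you fold only the tail. A sketch (illustrative Python):

def replay(events, snapshot_balance=0):
    # start from the latest snapshot instead of event #1 to keep replay cheap
    balance = snapshot_balance
    for event in events:
        balance += event["amount"]
    return balance

print(replay([{"amount": 50}, {"amount": -20}, {"amount": 70}]))  # -> 100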

CQRS

Physically separate write and read models.

Command -> Write Model (Event Store) -> events -> Read Model (Materialized View)
                                               -> Read Model (Elasticsearch)
                                               -> Read Model (Redis)
Query   -> Read Models

Powerful but complexity explodes. Don't apply to simple CRUD.

Mainstream: EventStoreDB, Kafka + ksqlDB, Axon Framework (Java), Marten + PostgreSQL JSONB (.NET).

Outbox Pattern — the Exactly-Once Secret

Atomic "DB commit + message publish"?

Naive (wrong):

1. DB commit
2. Kafka publish

If step 2 fails, the DB and Kafka diverge; reversing the order has the same problem.

Outbox Pattern:

BEGIN;
  INSERT INTO orders (...);
  INSERT INTO outbox (event_type, payload) VALUES (...);
COMMIT;

-- A separate process polls outbox and publishes to Kafka.
-- On success, delete or mark the outbox row.

Debezium (Kafka Connect CDC) automates this: PostgreSQL WAL -> Kafka stream.
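
If you poll the outbox yourself instead of using CDC, the relay loop looks roughly like this (illustrative Python; assumes a psycopg-style connection and some publish() function; delivery is at-least-once, so consumers must deduplicate):

def relay_outbox(conn, publish):
    with conn.cursor() as cur:
        cur.execute("SELECT id, event_type, payload FROM outbox ORDER BY id LIMIT 100")
        for row_id, event_type, payload in cur.fetchall():
            publish(event_type, payload)   # may be retried, so duplicates are possible
            cur.execute("DELETE FROM outbox WHERE id = %s", (row_id,))
        conn.commit()                      # rows are removed only after publishing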

Part 7 — Exactly-Once Semantics — Does It Exist?

Three Meanings of "Exactly Once"

  1. Message transmission ExO: fundamentally impossible (Two Generals Problem).
  2. Message processing ExO: achievable via dedup + idempotency.
  3. End effect ExO: consumer-side idempotency is the key.

At-Least-Once + Idempotency = Effective Exactly-Once

Kafka's official guideline. Assume producer duplicates; design idempotent consumers.

INSERT INTO processed_events (event_id, result)
VALUES ('evt-123', ...)
ON CONFLICT (event_id) DO NOTHING;
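
Wrapped in consumer code, the point is to perform side effects only when the insert actually claimed the event (illustrative Python; assumes a psycopg-style cursor inside one DB transaction, and apply_business_logic is a hypothetical handler):

def handle(cur, event):
    cur.execute(
        "INSERT INTO processed_events (event_id) VALUES (%s) "
        "ON CONFLICT (event_id) DO NOTHING",
        (event["id"],),
    )
    if cur.rowcount == 0:
        return                        # duplicate delivery: already processed, skip
    apply_business_logic(cur, event)  # side effects commit atomically with the marker row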

Kafka Transactional Producer (2017+)

producer.initTransactions();
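// initTransactions() requires transactional.id to be set in the producer config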
producer.beginTransaction();
producer.send(record1);
producer.send(record2);
producer.commitTransaction();

Atomic partition/offset commit + consumer isolation.level=read_committed = Exactly-Once, provided producer/consumer/broker configs align.

Part 8 — Production Distributed DBs — Spanner, CockroachDB, TiDB, YugabyteDB

Google Spanner

  • Paper at OSDI 2012.
  • TrueTime-based External Consistency.
  • Paxos within partition + 2PC across partitions.
  • SQL + distributed transactions + horizontal scaling — deemed "impossible" until mid-2010s.
  • Commercialized as Cloud Spanner in 2017.

CockroachDB (2015)

"Spanner open-sourced." Ex-Google engineers.

  • Raft instead of Paxos.
  • HLC instead of TrueTime (no hardware dependency).
  • PostgreSQL wire-protocol compatible.
  • Geo-partitioning at row level.

Switched to the BSL license in 2019, drawing some backlash; a $278M Series F in 2021 accelerated adoption.

TiDB (PingCAP, 2016)

Chinese origin.

  • MySQL compatible (vs Cockroach's PostgreSQL).
  • TiKV (Raft) + TiDB (SQL) + PD (Placement Driver).
  • HTAP: row store (TiKV) + column store (TiFlash).
  • PingCAP revenue crossed $100M in 2024.

YugabyteDB (2016)

Ex-Facebook/Nutanix founders.

  • Spanner-inspired architecture with a PostgreSQL-compatible SQL layer (reuses PostgreSQL's query engine).
  • Raft + HLC.
  • True read-your-writes since 2.18 (2023).

Part 9 — Streaming Distribution — Kafka, Pulsar, Redpanda

Kafka

  • Partition = unit of distribution.
  • ISR (In-Sync Replicas) — synchronized replicas.
  • acks=all for maximum durability.
  • KRaft (2022 GA) — ZooKeeper removed, Raft-based metadata.

Pulsar (Apache, Yahoo 2016)

  • Compute/Storage separated (broker vs BookKeeper).
  • First-class geo-replication.
  • More complex ops than Kafka, but strong multi-tenancy.

Redpanda (2019)

  • Kafka-compatible in C++.
  • Built-in Raft, no ZooKeeper/KRaft.
  • Thread-per-core (Seastar) — dramatically lower latency.
  • By default, fsyncs before acknowledging writes (in contrast to Kafka, which relies on replication rather than per-write fsync).

Part 10 — Real Failure Cases

1. Roblox 73-hour Outage (Oct 2021)

Cause: Consul's BoltDB writeback contention -> write stalls -> leader election blocked -> total outage. Lesson: background compaction in KV stores affects consensus.

2. GitLab Data Loss (Jan 2017)

Cause: engineer ran rm -rf on primary. All 5 backups failed (misconfigured/empty/stale). Lesson: "we have backups" means nothing without verified restores.

3. Knight Capital $440M in 45 min (2012)

Cause: a deploy updated only 7 of 8 servers; the one still running old code triggered a flood of unintended orders. Lesson: deploy atomicity failure = distributed inconsistency = economic disaster.

4. Cloudflare 2019 Regex Outage

Cause: a single WAF regex with catastrophic backtracking pinned CPUs at 100% -> the entire edge went down. Lesson: a globally deployed change makes the weakest link a global blast radius.

5. AWS us-east-1 DynamoDB 2015

Cause: metadata service overload -> table-list failures -> cascading failures. Lesson: "everything as a service" can become a distributed SPOF.

Part 11 — Practical Checklist (12 items)

  1. Speak PACELC, not CAP — steady-state dominates.
  2. Exactly-Once is a myth — design At-Least-Once + idempotency.
  3. Don't trust clocks — use HLC/TrueTime-based DBs or Lamport clocks.
  4. Don't introduce new 2PC — use Saga or Outbox.
  5. For consensus, use Raft — via etcd/Consul/proven libs; don't roll your own.
  6. Systems that depend on leaders must handle "no leader" periods.
  7. Understand quorum sizes — N=3 tolerates only 1 failure.
  8. Retries always need exponential backoff + jitter (a sketch follows this list).
  9. Design cache invalidation upfront — retrofitting is hell.
  10. No distribution without observability — tracing/metrics/logs trio.
  11. Adopt Chaos Engineering — inject partitions deliberately.
  12. Practice data recovery regularly — GitLab can happen again.
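
For item 8, the common "full jitter" variant (illustrative Python):

import random, time

def retry(call, max_attempts=5, base=0.1, cap=5.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # sleep a random amount up to the exponentially growing (capped) backoff
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))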

Part 12 — 10 Anti-Patterns

  1. Saying "we chose CP" — CAP misunderstood.
  2. Single coordinator — guaranteed SPOF.
  3. Distributed locks for business logic — usually solvable via idempotency.
  4. Local clocks for ordering — NTP drift breeds bugs.
  5. Transactions across microservice boundaries — redesign with Saga/Outbox.
  6. Retries without timeouts — amplify outages.
  7. Distributed monolith — many services, all deployed together.
  8. Testing only locally — distributed bugs live in latency.
  9. Calling single-leader DB with replicas "horizontal scaling."
  10. Hiding every inconsistency behind "eventual consistency" — users don't understand "eventual."

Part 13 — Learning Resources

  • Book: Designing Data-Intensive Applications (Kleppmann) — best intro.
  • Book: Database Internals (Petrov) — LSM/B-Tree/Consensus internals.
  • Book: Understanding Distributed Systems (Vitillo).
  • Papers: Lamport "Time, Clocks" / Raft / Spanner OSDI 2012 / Bigtable / Dynamo 2007.
  • Course: MIT 6.824 Distributed Systems (YouTube).
  • Jepsen.io — real-world violations of consistency claims.

Closing — Distributed Systems Is Philosophy

Lamport's famous definition:

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."

Distributed systems is the art of producing trustworthy results in an uncontrolled world. There is no perfection — only trade-offs. Make those trade-offs explicitly and consciously, and you can handle near-infinite scale.

In 2026, every engineer writes distributed code. Most just don't realize it.

Coming Next — "Database Internals Fully Dissected" — B-Tree, LSM, WAL, MVCC, Vacuum, Replication, Index Secrets

This series covered distribution "above." Next: distribution "below" — the heart of a single DB: B-Tree vs LSM, WAL, MVCC internals, Vacuum/Compaction, index structures (BRIN, GIN, GiST, HNSW), query planner, replication internals, partitioning strategies. The full answer to "why does a DB get slow?" — next post.