The Essential Difficulty of Distributed Systems — CAP, PACELC, Raft/Paxos, Vector Clock, Saga, Event Sourcing (2025)

Why Distributed Systems Now — In 2026, You Can't Avoid It

Until 2015, "distributed systems were for Netflix and Google." In 2026, even 3-person startups deal with them.

  • A single PostgreSQL instance gets a Read Replica after 6 months, then sharding a year later.
  • The moment you adopt Kubernetes, you are riding on consensus: the API Server, Scheduler, and kubelets all coordinate through etcd, which is a Raft cluster.
  • Kafka brings partitions, replication, offsets, and Exactly-Once Semantics problems.
  • Cloudflare Workers + D1 + Durable Objects is essentially a distributed Raft cluster.
  • LLM inference across GPUs and regions now requires distributed coordination too.

Distributed systems are no longer a choice — they are a condition. Yet they remain one of the most misunderstood topics in the industry. By the end of this article, you'll see why "we chose CP" is a nearly empty statement.

Part 1 — Why Distribution Is Essentially Hard

The 8 Fallacies of Distributed Computing

Peter Deutsch (Sun, 1994). Still valid 30 years later:

  1. The network is reliable — packets get dropped.
  2. Latency is zero — speed of light is finite (NY-Sydney RTT ~200ms).
  3. Bandwidth is infinite.
  4. The network is secure — MITM, DNS hijacking, BGP hijacking happen.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero — cloud egress bills prove otherwise.
  8. The network is homogeneous — Ethernet, WiFi, satellite, 5G under one TCP/IP skin.

99% of distributed systems papers address problems caused by these being false.

Two Processes Can't Agree on "Now"

On one machine, now() is clear. On two machines? Fundamentally uncertain.

  • NTP sync has millisecond errors.
  • VMs can be paused by hypervisors (Stop-the-World).
  • Google Spanner, with atomic clocks and GPS, still admits a 7ms uncertainty interval.

Hence in distributed systems, event order is determined by logical time, not absolute time — Lamport's 1978 insight.

Part 2 — Rereading the CAP Theorem

Original Definition (Brewer 2000 / Gilbert & Lynch 2002)

"During a network Partition (P), a system can guarantee only one of Consistency (C) or Availability (A)."

Three common misconceptions:

Misconception 1: "Choose CP or AP"

Wrong. Without partitions, both C and A are achievable. CAP speaks only about the trade-off during a partition. Brewer clarified this in 2012's "CAP Twelve Years Later":

"The 'two of three' formulation was always misleading. In reality, the 'choice' is only relevant during partitions."

Misconception 2: "Consistency = C in ACID"

Wrong. CAP's C is Linearizability: operations on a single object appear to take effect atomically, in an order consistent with real time. ACID's C means "no constraint violations" — a different concept.

Misconception 3: "Availability = service doesn't die"

Wrong. CAP's A requires that every request to a non-failing node receives a non-error response — far stricter than 99.99% uptime.

PACELC — What CAP Missed

Daniel Abadi (2010): "During Partition, A vs C; Else, Latency vs Consistency."

System               | On Partition     | Else (normal operation)
DynamoDB (default)   | PA               | EL (low latency, weak consistency)
Cassandra            | PA               | EL
MongoDB              | PA/PC (config)   | EC
HBase                | PC               | EC
PostgreSQL (single)  | N/A              | EC
Spanner              | PC               | EC (pays latency for consistency via TrueTime)

PACELC matters because normal operation is 99.99% of the user experience. "We're CP" tells you nothing about the other 99.99%.

Consistency Model Spectrum

From strongest to weakest:

  1. Linearizable (Strong) — appears as a single machine to outside observers. Most expensive.
  2. Sequential — all processes see the same order; real-time order not guaranteed.
  3. Causal — causally related events are ordered (Vector Clock).
  4. Read-Your-Writes — I see my own writes immediately.
  5. Monotonic Reads — once seen, never see a staler value.
  6. Eventual — converges given enough time.

Most apps don't need Linearizable. Causal + Read-Your-Writes is usually enough.

Part 3 — Time and Order — Lamport to HLC

Lamport Clock (1978)

Leslie Lamport's "Time, Clocks, and the Ordering of Events in a Distributed System" — the most important distributed computing paper.

1. Each node maintains a monotonic local counter L.
2. Local event: L = L + 1
3. Send: L = L + 1, attach L to message
4. Receive: L = max(L_local, L_message) + 1

Limit: if a happens-before b, then L(a) < L(b); the converse does not hold. Lamport clocks give you a total order, but they cannot tell causally related events from concurrent ones.
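
A minimal sketch of these update rules (illustrative Python, not any particular library):

class LamportClock:
    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        # rule 2: tick on every local event
        self.time += 1
        return self.time

    def send(self) -> int:
        # rule 3: tick, then attach the result to the outgoing message
        self.time += 1
        return self.time

    def receive(self, msg_time: int) -> int:
        # rule 4: take the max of local and message time, then tick
        self.time = max(self.time, msg_time) + 1
        return self.time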

Vector Clock (Fidge, Mattern 1988)

Each node maintains counters for all nodes as a vector.

Node A: [A=3, B=1, C=0]
Node B: [A=2, B=4, C=0]

On send from A to B, attach A's vector.
On receive, B takes the element-wise max, then increments its own entry.

Compare V1, V2:
- V1 <= V2 element-wise (and V1 != V2): V1 happens-before V2
- Neither dominates the other: concurrent

Used for conflict detection in Amazon's original Dynamo (2007) and in Riak; Cassandra opted for last-write-wins timestamps instead. Downside: vector size grows with the number of nodes.
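
A sketch of the receive and comparison rules (illustrative Python; node IDs are plain strings):

def receive(local: dict, incoming: dict, self_id: str) -> dict:
    # element-wise max of both vectors, then increment our own entry
    merged = {n: max(local.get(n, 0), incoming.get(n, 0))
              for n in set(local) | set(incoming)}
    merged[self_id] = merged.get(self_id, 0) + 1
    return merged

def happens_before(v1: dict, v2: dict) -> bool:
    nodes = set(v1) | set(v2)
    return (all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)
            and any(v1.get(n, 0) < v2.get(n, 0) for n in nodes))

def concurrent(v1: dict, v2: dict) -> bool:
    # neither vector dominates the other
    return not happens_before(v1, v2) and not happens_before(v2, v1)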

HLC — Hybrid Logical Clock (2014)

Hybrid of physical and logical time. Used by CockroachDB, MongoDB 4.0+, YugabyteDB.

(wall_time, logical_counter)

Safe against NTP drift, human-readable timestamps, preserves causal order.
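
A simplified sketch of the HLC update rules from the 2014 paper (illustrative Python; production implementations also bound how far the logical part may run ahead of the wall clock):

import time

class HLC:
    def __init__(self) -> None:
        self.l = 0  # highest wall-clock time seen so far (ns)
        self.c = 0  # logical counter, breaks ties within the same wall time

    def now(self):
        # local event or send
        pt = time.time_ns()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return self.l, self.c

    def update(self, l_msg: int, c_msg: int):
        # receive: merge the sender's timestamp with our own and the wall clock
        pt = time.time_ns()
        new_l = max(self.l, l_msg, pt)
        if new_l == self.l == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0
        self.l = new_l
        return self.l, self.c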

TrueTime — Google Spanner's Secret Weapon (2012)

Atomic clocks + GPS in every datacenter. API:

TT.now() -> { earliest: t1, latest: t2 }  # uncertainty interval
TT.after(t)  # is t definitely in the past?
TT.before(t) # is t definitely in the future?

Spanner commit:

1. Choose commit timestamp s (after TT.now().latest)
2. Wait until s is definitely past (commit wait) -- usually under 7ms
3. Respond

Commit wait is what delivers global External Consistency. Spanner is the only major production distributed DB that depends on specialized clock hardware (atomic clocks + GPS). AWS Time Sync (microsecond precision since 2023) now approaches TrueTime-grade bounds without custom hardware.
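
A toy illustration of commit wait, assuming a hypothetical tt_now() that returns an uncertainty interval (this is not the real Spanner API):

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now(eps: float = 0.007) -> TTInterval:
    # assume a +/- 7ms uncertainty bound around the local clock
    t = time.time()
    return TTInterval(t - eps, t + eps)

def commit(apply_writes) -> float:
    s = tt_now().latest              # 1. pick a timestamp no earlier than anyone's "now"
    apply_writes(s)
    while tt_now().earliest <= s:    # 2. commit wait: spin until s is definitely in the past
        time.sleep(0.001)
    return s                         # 3. only now acknowledge the commit to the client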

Part 4 — Consensus — Paxos and Raft

FLP Impossibility (1985)

Fischer, Lynch, Paterson: in a fully asynchronous system where even a single process may crash, no deterministic consensus algorithm can guarantee termination.

How do Paxos/Raft work then? They give up guaranteed termination, not safety: timeouts and randomized election timers make them converge in practice, even though liveness cannot be proven in a purely asynchronous model.

Paxos (Lamport, submitted 1989, published 1998)

"The Part-Time Parliament." The Greek-island parliament metaphor caused reviewers to reject it; publication delayed 9 years. Infamous for being "incomprehensible."

Roles:

  • Proposer: proposes values
  • Acceptor: accepts/rejects (majority is key)
  • Learner: learns the chosen value

Two phases:

Phase 1 (Prepare):
  Proposer -> Acceptors: "Promise to ignore proposals numbered below n?"
  Acceptor -> Proposer: "Promised, if n is higher than anything I've seen; here is the highest-numbered proposal I've already accepted (if any)."

Phase 2 (Accept):
  Proposer -> Acceptors: "Accept proposal n with value v"
  (v = value of the highest-numbered proposal reported in Phase 1, or the proposer's own choice if none was reported)
  Acceptor -> Proposer: "Accepted / rejected"

Majority acceptance = value chosen.
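
A minimal single-decree acceptor, to make the two phases concrete (illustrative Python; ignores persistence and networking, both of which are essential in a real implementation):

class Acceptor:
    def __init__(self) -> None:
        self.promised_n = -1    # highest proposal number we promised to honor
        self.accepted_n = -1    # highest proposal number we actually accepted
        self.accepted_v = None  # the value of that proposal

    def prepare(self, n: int):
        # Phase 1: promise to ignore proposals below n, report anything already accepted
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def accept(self, n: int, v) -> bool:
        # Phase 2: accept unless a higher-numbered prepare arrived in the meantime
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_v = v
            return True
        return False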

Why notorious:

  1. Too many variants (Multi-Paxos, Fast Paxos, Generalized Paxos).
  2. Paper style not engineer-friendly.
  3. Implementation edge cases not documented.

Used in Google Chubby, Megastore, and Spanner. (ZooKeeper uses ZAB, a close relative of Paxos, rather than Paxos itself.)

Raft (Ongaro & Ousterhout, 2014)

Stanford PhD Diego Ongaro created Raft "because Paxos is too hard." Paper title: "In Search of an Understandable Consensus Algorithm."

Decomposed into three sub-problems:

1. Leader Election

  • All nodes start as Follower.
  • No heartbeat within a randomized election timeout -> become Candidate and request votes.
  • Majority vote -> Leader.
  • term number in heartbeats prevents split-brain.

2. Log Replication

  • All writes go to leader.
  • Leader appends log -> AppendEntries RPC to followers.
  • Majority replicated -> commit -> apply to state machine.

3. Safety

  • Election Safety: at most one leader per term.
  • Leader Append-Only: never overwrite log.
  • Log Matching: same index/term -> identical history before.
  • Leader Completeness: committed entries present in all future leaders.
  • State Machine Safety: same index -> same value everywhere.
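
To make the leader-election rules concrete, a stripped-down sketch of the election timer (illustrative Python; real implementations such as etcd's Raft library also compare log freshness, persist state, and use real RPCs):

import random, time

class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id, self.peers = node_id, peers
        self.state, self.term = "follower", 0
        self.last_heartbeat = time.monotonic()
        self.timeout = random.uniform(0.15, 0.30)  # randomized election timeout

    def on_heartbeat(self, leader_term):
        if leader_term >= self.term:               # ignore heartbeats from stale terms
            self.term, self.state = leader_term, "follower"
            self.last_heartbeat = time.monotonic()

    def tick(self):
        if self.state != "leader" and time.monotonic() - self.last_heartbeat > self.timeout:
            self.term += 1
            self.state = "candidate"
            votes = 1 + sum(p.request_vote(self.term, self.node_id) for p in self.peers)
            if votes > (len(self.peers) + 1) // 2:  # strict majority of the cluster
                self.state = "leader"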

Systems using Raft: etcd (Kubernetes), Consul, TiKV/TiDB, CockroachDB, NATS JetStream, Redpanda, Cloudflare Durable Objects.

Paxos = proven theory. Raft = practical engineering. 90% of 2020s-era new systems use Raft.

Multi-Paxos, EPaxos, Flexible Paxos

  • Multi-Paxos: Phase 1 once per leader, Phase 2 per request.
  • EPaxos (Egalitarian): leaderless, every node proposes. Good for geo-distribution.
  • Flexible Paxos: separate quorum sizes for Phase 1/2.

Part 5 — Transactions — 2PC, 3PC, Saga

2PC (Two-Phase Commit)

Simplest and most dangerous transaction protocol.

Phase 1 (Prepare):
  Coordinator -> Participants: "Can commit?"
  Participants -> Coordinator: "Yes / No"

Phase 2 (Commit):
  All Yes -> "Commit!"
  Any No  -> "Abort!"

Fatal flaw: if the Coordinator dies after participants have voted Yes but before Phase 2 completes, participants block indefinitely while holding locks. In practice 2PC survives mostly in legacy XA setups (DB2, Oracle RAC) or layered on top of consensus groups, as Spanner does across Paxos groups.
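
A sketch of the coordinator, with the blocking window marked (illustrative Python; the participant objects are hypothetical):

def two_phase_commit(participants) -> bool:
    # Phase 1: collect votes; each participant prepares and holds its locks
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # if the coordinator crashes HERE, participants block indefinitely, locks held
    # Phase 2: broadcast the decision
    for p in participants:
        p.commit() if decision else p.abort()
    return decision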

3PC — adds a phase to avoid blocking, but still fails under network partition. Rarely used.

Saga Pattern (Garcia-Molina & Salem, 1987)

Decompose a long transaction into local transactions + compensating transactions.

Order -> Payment -> Inventory -> Shipping

On failure, compensate in reverse:
Shipping fail -> restore inventory -> refund payment -> cancel order

Two styles:

  • Choreography: each service publishes/subscribes events. No central coordinator.
  • Orchestration: Saga Orchestrator calls each step centrally.
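
In the orchestration style, the coordinator runs each local transaction and, on failure, runs the compensations already registered, in reverse. A minimal sketch (illustrative Python; the service calls are hypothetical):

def run_saga(order):
    steps = [
        (create_order,      cancel_order),       # (action, compensating action)
        (charge_payment,    refund_payment),
        (reserve_inventory, restore_inventory),
        (schedule_shipping, cancel_shipping),
    ]
    completed = []
    try:
        for action, compensate in steps:
            action(order)
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):   # undo finished steps in reverse order
            compensate(order)
        raise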

2020s mainstream:

  • Temporal.io (a fork of Uber's Cadence by its original authors) — the Workflow-as-Code standard.
  • AWS Step Functions — serverless Saga.
  • Netflix Conductor — early orchestrator.

Saga's limit: no Isolation. Intermediate state is externally visible; mitigated by "semantic locks."

TCC (Try-Confirm-Cancel)

Popularized in Alibaba's microservice stack. Try reserves resources; then either Confirm applies the reservation or Cancel releases it.

Part 6 — Event Sourcing + CQRS

Event Sourcing

"Store the sequence of events, not the state."

[current state: balance $100]  -- traditional

[events: +$50, -$20, +$70]  -- Event Sourcing
-> replay to $100

Pros: perfect audit trail, time travel, natural event integration. Cons: needs snapshots, schema evolution is hard, querying current state is expensive (requires CQRS).
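
Replaying events into current state is just a fold; with periodic snapshots you fold only the tail. A sketch (illustrative Python):

def replay(events, snapshot_balance=0):
    # start from the latest snapshot instead of event #1 to keep replay cheap
    balance = snapshot_balance
    for event in events:
        balance += event["amount"]
    return balance

print(replay([{"amount": 50}, {"amount": -20}, {"amount": 70}]))  # -> 100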

CQRS

Physically separate write and read models.

Command -> Write Model (Event Store) -> events -> Read Model (Materialized View)
                                               -> Read Model (Elasticsearch)
                                               -> Read Model (Redis)
Query   -> Read Models

Powerful but complexity explodes. Don't apply to simple CRUD.

Mainstream: EventStoreDB, Kafka + ksqlDB, Axon Framework (Java), Marten + PostgreSQL JSONB (.NET).

Outbox Pattern — the Exactly-Once Secret

Atomic "DB commit + message publish"?

Naive (wrong):

1. DB commit
2. Kafka publish

If step 2 fails, the DB and Kafka diverge; reversing the order has the same problem.

Outbox Pattern:

BEGIN;
  INSERT INTO orders (...);
  INSERT INTO outbox (event_type, payload) VALUES (...);
COMMIT;

-- A separate process polls outbox and publishes to Kafka.
-- On success, delete or mark the outbox row.

Debezium (Kafka Connect CDC) automates this: PostgreSQL WAL -> Kafka stream.
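
If you poll the outbox yourself instead of using CDC, the relay loop looks roughly like this (illustrative Python; assumes a psycopg-style connection and some publish() function; delivery is at-least-once, so consumers must deduplicate):

def relay_outbox(conn, publish):
    with conn.cursor() as cur:
        cur.execute("SELECT id, event_type, payload FROM outbox ORDER BY id LIMIT 100")
        for row_id, event_type, payload in cur.fetchall():
            publish(event_type, payload)   # may be retried, so duplicates are possible
            cur.execute("DELETE FROM outbox WHERE id = %s", (row_id,))
        conn.commit()                      # rows are removed only after publishing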

Part 7 — Exactly-Once Semantics — Does It Exist?

Three Meanings of "Exactly Once"

  1. Message transmission ExO: fundamentally impossible (Two Generals Problem).
  2. Message processing ExO: achievable via dedup + idempotency.
  3. End effect ExO: consumer-side idempotency is the key.

At-Least-Once + Idempotency = Effective Exactly-Once

Kafka's official guideline. Assume producer duplicates; design idempotent consumers.

INSERT INTO processed_events (event_id, result)
VALUES ('evt-123', ...)
ON CONFLICT (event_id) DO NOTHING;
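
Wrapped in consumer code, the point is to perform side effects only when the insert actually claimed the event (illustrative Python; assumes a psycopg-style cursor inside one DB transaction, and apply_business_logic is a hypothetical handler):

def handle(cur, event):
    cur.execute(
        "INSERT INTO processed_events (event_id) VALUES (%s) "
        "ON CONFLICT (event_id) DO NOTHING",
        (event["id"],),
    )
    if cur.rowcount == 0:
        return                        # duplicate delivery: already processed, skip
    apply_business_logic(cur, event)  # side effects commit atomically with the marker row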

Kafka Transactional Producer (2017+)

producer.initTransactions();
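// initTransactions() requires transactional.id to be set in the producer config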
producer.beginTransaction();
producer.send(record1);
producer.send(record2);
producer.commitTransaction();

Atomic partition/offset commit + consumer isolation.level=read_committed = Exactly-Once, provided producer/consumer/broker configs align.

Part 8 — Production Distributed DBs — Spanner, CockroachDB, TiDB, YugabyteDB

Google Spanner

  • Paper at OSDI 2012.
  • TrueTime-based External Consistency.
  • Paxos within partition + 2PC across partitions.
  • SQL + distributed transactions + horizontal scaling — deemed "impossible" until mid-2010s.
  • Commercialized as Cloud Spanner in 2017.

CockroachDB (2015)

"Spanner open-sourced." Ex-Google engineers.

  • Raft instead of Paxos.
  • HLC instead of TrueTime (no hardware dependency).
  • PostgreSQL wire-protocol compatible.
  • Geo-partitioning at row level.

Switched to the BSL license in 2019, drawing some backlash; a $278M Series F in 2021 accelerated adoption.

TiDB (PingCAP, 2016)

Chinese origin.

  • MySQL compatible (vs Cockroach's PostgreSQL).
  • TiKV (Raft) + TiDB (SQL) + PD (Placement Driver).
  • HTAP: row store (TiKV) + column store (TiFlash).
  • PingCAP revenue crossed $100M in 2024.

YugabyteDB (2016)

Ex-Facebook/Nutanix founders.

  • Spanner-inspired architecture with a PostgreSQL-compatible SQL layer (reuses PostgreSQL's query engine).
  • Raft + HLC.
  • True read-your-writes since 2.18 (2023).

Part 9 — Streaming Distribution — Kafka, Pulsar, Redpanda

Kafka

  • Partition = unit of distribution.
  • ISR (In-Sync Replicas) — synchronized replicas.
  • acks=all for maximum durability.
  • KRaft (2022 GA) — ZooKeeper removed, Raft-based metadata.

Pulsar (Apache, Yahoo 2016)

  • Compute/Storage separated (broker vs BookKeeper).
  • First-class geo-replication.
  • More complex ops than Kafka, but strong multi-tenancy.

Redpanda (2019)

  • Kafka-compatible in C++.
  • Built-in Raft, no ZooKeeper/KRaft.
  • Thread-per-core (Seastar) — dramatically lower latency.
  • By default, fsyncs before acknowledging writes (in contrast to Kafka, which relies on replication rather than per-write fsync).

Part 10 — Real Failure Cases

1. Roblox 73-hour Outage (Oct 2021)

Cause: Consul's BoltDB writeback contention -> write stalls -> leader election blocked -> total outage. Lesson: background compaction in KV stores affects consensus.

2. GitLab Data Loss (Jan 2017)

Cause: engineer ran rm -rf on primary. All 5 backups failed (misconfigured/empty/stale). Lesson: "we have backups" means nothing without verified restores.

3. Knight Capital $440M in 45 min (2012)

Cause: a deploy updated only 7 of 8 servers; the one still running old code triggered a flood of unintended orders. Lesson: deploy atomicity failure = distributed inconsistency = economic disaster.

4. Cloudflare 2019 Regex Outage

Cause: a single WAF regex with catastrophic backtracking pinned CPUs at 100% -> the entire edge went down. Lesson: a globally deployed change makes the weakest link a global blast radius.

5. AWS us-east-1 DynamoDB 2015

Cause: metadata service overload -> table-list failures -> cascading failures. Lesson: "everything as a service" can become a distributed SPOF.

Part 11 — Practical Checklist (12 items)

  1. Speak PACELC, not CAP — steady-state dominates.
  2. Exactly-Once is a myth — design At-Least-Once + idempotency.
  3. Don't trust clocks — use HLC/TrueTime-based DBs or Lamport clocks.
  4. Don't introduce new 2PC — use Saga or Outbox.
  5. For consensus, use Raft — via etcd/Consul/proven libs; don't roll your own.
  6. Systems that depend on leaders must handle "no leader" periods.
  7. Understand quorum sizes — N=3 tolerates only 1 failure.
  8. Retries always need exponential backoff + jitter (a sketch follows this list).
  9. Design cache invalidation upfront — retrofitting is hell.
  10. No distribution without observability — tracing/metrics/logs trio.
  11. Adopt Chaos Engineering — inject partitions deliberately.
  12. Practice data recovery regularly — GitLab can happen again.
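
For item 8, the common "full jitter" variant (illustrative Python):

import random, time

def retry(call, max_attempts=5, base=0.1, cap=5.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # sleep a random amount up to the exponentially growing (capped) backoff
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))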

Part 12 — 10 Anti-Patterns

  1. Saying "we chose CP" — CAP misunderstood.
  2. Single coordinator — guaranteed SPOF.
  3. Distributed locks for business logic — usually solvable via idempotency.
  4. Local clocks for ordering — NTP drift breeds bugs.
  5. Transactions across microservice boundaries — redesign with Saga/Outbox.
  6. Retries without timeouts — amplify outages.
  7. Distributed monolith — many services, all deployed together.
  8. Testing only locally — distributed bugs live in latency.
  9. Calling single-leader DB with replicas "horizontal scaling."
  10. Hiding every inconsistency behind "eventual consistency" — users don't understand "eventual."

Part 13 — Learning Resources

  • Book: Designing Data-Intensive Applications (Kleppmann) — best intro.
  • Book: Database Internals (Petrov) — LSM/B-Tree/Consensus internals.
  • Book: Understanding Distributed Systems (Vitillo).
  • Papers: Lamport "Time, Clocks" / Raft / Spanner OSDI 2012 / Bigtable / Dynamo 2007.
  • Course: MIT 6.824 Distributed Systems (YouTube).
  • Jepsen.io — real-world violations of consistency claims.

Closing — Distributed Systems Is Philosophy

Lamport's famous definition:

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."

Distributed systems is the art of producing trustworthy results in an uncontrolled world. There is no perfection — only trade-offs. Make those trade-offs explicitly and consciously, and you can handle near-infinite scale.

In 2026, every engineer writes distributed code. Most just don't realize it.

Coming Next — "Database Internals Fully Dissected" — B-Tree, LSM, WAL, MVCC, Vacuum, Replication, Index Secrets

This series covered distribution "above." Next: distribution "below" — the heart of a single DB: B-Tree vs LSM, WAL, MVCC internals, Vacuum/Compaction, index structures (BRIN, GIN, GiST, HNSW), query planner, replication internals, partitioning strategies. The full answer to "why does a DB get slow?" — next post.