Distributed Systems Fundamentals: From CAP Theorem to Consensus Algorithms and Consistency Models


1. Why Distributed Systems

1.1 The Need for Distributed Systems

A single server cannot meet the demands of modern services. There are three core reasons why distributed systems are essential.

Scalability: As traffic grows, a single server's CPU, memory, and disk hit their limits. Horizontal scaling (scale-out) distributes the load across multiple servers.

Availability: Even if one server dies, the service must continue running. Services like Netflix and Google target 99.99%+ availability.

Fault Tolerance: Network partitions, disk failures, data center fires — failures are inevitable. Distributed systems are designed to withstand them.

1.2 Challenges of Distributed Systems

Single Server                  Distributed System
+-----------+               +-----+  +-----+  +-----+
|  App + DB |    vs         | N1  |--| N2  |--| N3  |
+-----------+               +-----+  +-----+  +-----+
                                  \    |    /
                             Network (unreliable)

Distributed systems have unique challenges:

  • Network latency: Communication between nodes takes ms to seconds
  • Partial failures: Some nodes crash while others remain healthy
  • Clock synchronization: Physical clocks on each node differ slightly
  • Byzantine failures: Nodes may send incorrect data

2. CAP Theorem

2.1 What Is the CAP Theorem?

The CAP theorem was proposed by Eric Brewer in 2000 and formally proven by Seth Gilbert and Nancy Lynch in 2002.

A distributed system can guarantee at most two of three properties simultaneously:

Property             Description
Consistency          Every read receives the most recent write
Availability         Every request receives a response (success or failure)
Partition Tolerance  System continues to operate despite network partitions

2.2 Why You Cannot Have All Three — Intuitive Proof

Network partition occurs!

  [Node A]  ----X----  [Node B]
  data=1                data=1

  Client sends write(data=2) to Node A

  Option 1 (CP): Cannot sync with Node B --> Reject request (sacrifice availability)
  Option 2 (AP): Update Node A only → Allow inconsistency (sacrifice consistency)

Since network partitions (P) are unavoidable in distributed systems, the real choice is CP vs AP.

2.3 CP Systems vs AP Systems

Aspect                     CP Systems                               AP Systems
Behavior during partition  Reject writes or wait                    Accept writes, merge later
Examples                   ZooKeeper, etcd, Spanner                 Cassandra, DynamoDB, Riak
Best for                   Financial transactions, leader election  Social media, shopping carts

2.4 PACELC Extension

The PACELC model addresses CAP's limitations:

if (Partition) then
    Availability vs Consistency
else
    Latency vs Consistency

System     During Partition   Normal Operation
Cassandra  PA (availability)  EL (low latency)
Spanner    PC (consistency)   EC (consistency)
MongoDB    PA (availability)  EC (consistency)

3. Consistency Models

3.1 Consistency Levels Compared

From strongest to weakest:

Strict Consistency
    |
Linearizability
    |
Sequential Consistency
    |
Causal Consistency
    |
Eventual Consistency

3.2 Strong Consistency

Every read returns the result of the most recent write.

Time -->
Client A:  write(x=1) ------>
Client B:              read(x) --> must return 1
  • Implementation: Synchronous replication to all nodes before responding
  • Pro: Simple programming model
  • Con: High latency, low availability

3.3 Sequential Consistency

All nodes see operations in the same order, but the order may differ from real-time.

Real time:       A:write(x=1)  B:write(x=2)
Sequential OK:   B:write(x=2), A:write(x=1)
                 (as long as all nodes agree on this order)

3.4 Causal Consistency

Operations with causal relationships are ordered. Unrelated operations may be seen in different orders.

A writes a post --> B writes a comment (causally related)
  --> All nodes see the post before the comment

C writes a post / D writes another post (no causal relation)
  --> Different nodes may see them in different orders

3.5 Eventual Consistency

If updates stop, eventually all nodes will converge to the same value.

Time -->
Node A:  x=1 ......... x=1 (propagation) --> x=1
Node B:  x=0 ......... x=0 --> x=1 (slight delay)
Node C:  x=0 ......... x=0 --> x=1
  • Used by DNS, CDN caches, and (until its 2020 move to strong read-after-write consistency) Amazon S3
  • Requires conflict resolution: Last-Write-Wins, CRDTs, application-level merging

3.6 Consistency Model Comparison

Model       Latency   Availability  Programming Difficulty  Use Cases
Strong      High      Low           Easy                    Finance, inventory
Sequential  Medium    Medium        Medium                  Distributed locks
Causal      Low       High          Medium                  Social feeds
Eventual    Very low  Very high     Hard                    DNS, caches

4. Replication

4.1 Leader-Follower Replication

The most common replication strategy.

           writes
Client --> [Leader] --> replication --> [Follower 1]
                                   --> [Follower 2]
                                   --> [Follower 3]
           reads
Client --> [any Follower 1, 2, or 3]

Synchronous replication: Leader waits for all followers to acknowledge. Strong consistency, high latency.

Asynchronous replication: Leader responds immediately, followers replicate later. Low latency, potential data loss.

Semi-synchronous replication: Leader waits for at least one follower. Supported by MySQL (semisynchronous replication plugin) and PostgreSQL (synchronous standbys).

# Semi-synchronous replication pseudocode
def write(data):
    leader.write(data)
    ack_count = 1  # the leader's own copy
    for follower in followers:
        if follower.replicate(data):
            ack_count += 1
            if ack_count >= 2:  # leader + at least one follower confirmed
                return SUCCESS
    return FAILURE  # no follower acknowledged; the write is not yet durable

4.2 Multi-Leader Replication

Multiple nodes accept writes simultaneously.

[Leader A] <--> [Leader B] <--> [Leader C]
    |               |               |
[Follower]     [Follower]     [Follower]
  • Use cases: Multi-datacenter, offline work (Google Docs)
  • Core challenge: Write conflict resolution
    • Last-Write-Wins (LWW): Write with largest timestamp wins
    • Merge: Use CRDTs (Conflict-free Replicated Data Types)
    • User resolution: Show conflicts to users for manual choice
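The LWW rule fits in a few lines; a minimal sketch where each replica's value carries a timestamp (the epoch values in the example are illustrative):

```python
def lww_merge(a, b):
    """Last-Write-Wins: keep the replica whose (timestamp, value) pair is
    largest; comparing the full tuple keeps ties deterministic."""
    return max(a, b)

# Two replicas diverged during a partition; the later write wins:
print(lww_merge((1700000002, "cart: 2 items"), (1700000001, "cart: 1 item")))
# (1700000002, 'cart: 2 items')
```

Note that LWW silently discards the losing write, which is why CRDTs or user-level merging are preferred when losing an update is unacceptable.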

4.3 Leaderless Replication (Quorum)

No leader; clients read/write to multiple nodes simultaneously.

         write(x=1)
Client --> [Node A] OK
       --> [Node B] OK
       --> [Node C] FAIL (down)

W=2 success --> Write succeeds!

         read(x)
Client --> [Node A] x=1
       --> [Node B] x=1
       --> [Node C] x=0 (stale)

R=2 --> return latest value (x=1)

Quorum formula: W + R > N guarantees reading the latest data

Config              N  W  R  Characteristics
Strong consistency  3  2  2  Consistency guaranteed
Fast writes         3  1  3  Fast writes, slow reads
Fast reads          3  3  1  Slow writes, fast reads
  • Used by Cassandra, DynamoDB, Riak
  • Anti-entropy and read repair fix inconsistencies
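The quorum condition and a versioned read can be sketched as follows; the `(version, value)` reply shape is an assumption for illustration, not any particular database's wire format:

```python
def quorum_ok(n, w, r):
    """W + R > N forces the write set and read set to share a node."""
    return w + r > n

def quorum_read(replies):
    """Given R replies of (version, value), return the newest value."""
    return max(replies)[1]

print(quorum_ok(3, 2, 2))   # True  -> overlap guaranteed
print(quorum_ok(3, 1, 1))   # False -> stale reads possible
print(quorum_read([(7, "x=1"), (6, "x=0")]))  # x=1 (version 7 is newest)
```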

5. Partitioning (Sharding)

5.1 Why Partition

When data exceeds a single node's capacity, distribute it across multiple nodes.

5.2 Hash Partitioning

hash(key) mod N = partition_number

key="user_123" --> hash --> 7 --> 7 mod 3 = 1 --> Partition 1
key="user_456" --> hash --> 2 --> 2 mod 3 = 2 --> Partition 2
  • Pro: Even data distribution
  • Con: Inefficient range queries, massive redistribution on node changes
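The redistribution cost of `mod N` can be measured directly. A rough sketch (MD5 is used only to get a hash that is stable across runs, since Python's built-in `hash()` is randomized per process):

```python
import hashlib

def partition(key, n):
    # Stable hash of the key, reduced mod the partition count
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user_{i}" for i in range(10_000)]
moved = sum(partition(k, 3) != partition(k, 4) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when N goes from 3 to 4")  # ~75%
```

Roughly three quarters of the keys change partition after adding one node — the mass redistribution that consistent hashing is designed to avoid.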

5.3 Range Partitioning

Partition 0: A-F
Partition 1: G-M
Partition 2: N-S
Partition 3: T-Z
  • Pro: Efficient range queries
  • Con: Hotspots possible (data concentrated in certain ranges)

5.4 Consistent Hashing

Minimizes key redistribution when nodes are added or removed.

        0
       / \
     330   30
     /       \
   300        60
   |    Ring    |
   270        90
     \       /
     240   120
       \ /
       180

Node A: 0-90
Node B: 90-180
Node C: 180-270
Node D: 270-360

Add Node E at position 135 --> only keys 90-135 move from B to E
  • Virtual nodes eliminate imbalances
  • Used by Cassandra, DynamoDB, Memcached
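A minimal ring with virtual nodes, sketched in Python (the vnode count and MD5 are illustrative choices, not what any particular system uses):

```python
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring
        self.ring = sorted((h(f"{node}#{i}"), node)
                           for node in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def lookup(self, key):
        # First virtual node clockwise from the key's position (wrapping)
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

before = Ring(["A", "B", "C"])
after = Ring(["A", "B", "C", "D"])
keys = [f"user_{i}" for i in range(1000)]
moved = sum(before.lookup(k) != after.lookup(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved after adding D")  # ~1/N, not ~75%
```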

5.5 Rebalancing Strategies

Strategy               Description                                   Examples
Fixed partition count  Set partition count upfront, assign to nodes  Elasticsearch, Riak
Dynamic splitting      Split/merge based on partition size           HBase, RethinkDB
Proportional to nodes  Partition count scales with node count        Cassandra

6. Consensus Algorithms

6.1 The Consensus Problem

All nodes in a distributed system agreeing on a single value. Essential for leader election, atomic commits, and distributed locks.

6.2 Paxos Basics

Proposed by Leslie Lamport (circulated in 1989, formally published in 1998). Notoriously difficult to understand.

Three roles:

  • Proposer: Proposes a value
  • Acceptor: Accepts or rejects values
  • Learner: Learns the agreed-upon value

Two phases:

Phase 1: Prepare
  Proposer --> Acceptor: "Please prepare for proposal number n"
  Acceptor --> Proposer: "OK" (includes previously accepted value if any)

Phase 2: Accept
  Proposer --> Acceptor: "Please accept proposal n with value v"
  Acceptor --> Proposer: "Accepted" (consensus reached if majority accepts)
  • Problem: Competing proposers can livelock; mitigated by electing a distinguished proposer, the approach Multi-Paxos takes

6.3 Raft in Detail

Proposed by Diego Ongaro and John Ousterhout in 2014. Designed to be more understandable than Paxos.

Core idea: Leader-based consensus

Node states: Leader, Follower, Candidate

6.3.1 Leader Election

Time -->
           Term 1          Term 2
Follower: [Heartbeat]... [Timeout!] --> Candidate
Candidate: Votes for self + sends RequestVote to others
           Gets majority --> Elected as Leader!

Leader:   [Heartbeat]-->[Heartbeat]-->[Heartbeat]-->...

# Raft leader election pseudocode
class RaftNode:
    def __init__(self):
        self.state = "follower"
        self.current_term = 0
        self.voted_for = None
        self.election_timeout = random(150, 300)  # ms

    def on_timeout(self):
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.id
        votes = 1  # self-vote

        for peer in self.peers:
            if peer.request_vote(self.current_term, self.log):
                votes += 1

        if votes > (len(self.peers) + 1) // 2:  # strict majority of the full cluster (self + peers)
            self.state = "leader"
            self.send_heartbeats()

6.3.2 Log Replication

Client --> Leader: write(x=5)

Leader Log: [Term1:x=3] [Term1:y=7] [Term2:x=5]
                                        | AppendEntries
Follower1: [Term1:x=3] [Term1:y=7] [Term2:x=5] OK
Follower2: [Term1:x=3] [Term1:y=7] [Term2:x=5] OK
Follower3: [Term1:x=3] [Term1:y=7] (network delay)

Majority (3/5) replicated --> Commit! --> Respond to client

6.3.3 Safety Properties

  • Election Safety: At most one leader per term
  • Leader Append-Only: Leader never overwrites or deletes log entries
  • Log Matching: If two logs have entries with the same index and term, all preceding entries are identical
  • State Machine Safety: Committed entries are applied in the same order on all nodes

6.4 Raft in Production

System       Use
etcd         Kubernetes cluster state storage
Consul       Service discovery, KV store
CockroachDB  Distributed SQL database
TiKV         Distributed KV engine for TiDB

7. Distributed Transactions

7.1 Two-Phase Commit (2PC)

Phase 1: Prepare (Vote)
  Coordinator --> Participant A: "Can you commit?"
  Coordinator --> Participant B: "Can you commit?"
  A --> Coordinator: "Yes"
  B --> Coordinator: "Yes"

Phase 2: Commit (Execute)
  Coordinator --> A: "Commit!"
  Coordinator --> B: "Commit!"

Problems:

  • Coordinator is a single point of failure (SPOF)
  • Blocking: If coordinator fails after Prepare, participants wait indefinitely
  • Performance: Synchronous protocol, so it is slow

7.2 Three-Phase Commit (3PC)

Adds a Pre-Commit phase to solve 2PC's blocking problem.

Phase 1: CanCommit
Phase 2: PreCommit (with timeout)
Phase 3: DoCommit
  • Still inconsistent under network partitions
  • Rarely used in practice

7.3 Saga Pattern

Splits a long transaction into local transactions, with compensating transactions on failure.

Order Creation Saga:

T1: Create Order --> T2: Process Payment --> T3: Deduct Inventory --> T4: Start Shipping
                          | FAIL!
C2: Refund Payment <-- C1: Cancel Order

Two implementation styles:

Style          Description                                      Trade-offs
Choreography   Event-driven, each service publishes next event  Simple, low coupling; hard to trace
Orchestration  Central orchestrator controls the sequence       Easy to trace; centralization risk

# Saga Orchestrator pseudocode
class OrderSaga:
    steps = [
        ("create_order", "cancel_order"),
        ("process_payment", "refund_payment"),
        ("update_inventory", "restore_inventory"),
        ("arrange_shipping", "cancel_shipping"),
    ]

    def execute(self):
        completed = []
        for action, compensation in self.steps:
            try:
                getattr(self, action)()
                completed.append(compensation)
            except Exception:
                # Run compensations in reverse
                for comp in reversed(completed):
                    getattr(self, comp)()
                raise SagaFailed()

8. Distributed Clocks

8.1 The Problem with Physical Clocks

  • NTP synchronization error: milliseconds to hundreds of milliseconds
  • Clock drift: Diverges over time
  • Leap seconds: Occasionally an extra second is added or removed

Conclusion: Physical clocks alone cannot determine event ordering accurately.

8.2 Lamport Clock (Logical Clock)

Proposed by Leslie Lamport in 1978. Each process maintains a counter.

Rules:
1. Increment counter on each event
2. Include counter when sending messages
3. On receive: max(own counter, received counter) + 1

Process A:  (1)--> (2)--> (3)---------> (6)
                          send          recv
Process B:        (1)--> (2)--> (4)--> (5)
                          recv          send
  • Limitation: Can determine causality but not whether two events are concurrent
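The three rules above can be sketched as a small class:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                 # Rule 1: increment on each local event
        self.time += 1
        return self.time

    def send(self):                 # Rule 2: the counter travels with the message
        return self.tick()

    def recv(self, msg_time):       # Rule 3: max(own, received) + 1
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick(); a.tick()          # A has seen two local events
stamp = a.send()            # A sends at time 3
print(b.recv(stamp))        # 4: B jumps past A's clock, preserving causality
```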

8.3 Vector Clock

Each node maintains a vector of counters for all nodes.

3 nodes: A, B, C

A: [1,0,0] --> [2,0,0] --> [2,0,0] send to B
B: [0,0,0] --> [0,1,0] --> [2,2,0] recv from A --> [2,3,0] send to C
C: [0,0,0] --> [0,0,1] --> [2,3,2] recv from B

Determining concurrency:

  • V1 <= V2: Every element of V1 is less than or equal to the corresponding element of V2, with at least one strictly less — V1 happens-before V2
  • Neither V1 <= V2 nor V2 <= V1: The vectors are incomparable — Concurrent events

def compare_vector_clocks(vc1, vc2):
    """Compare two vector clocks"""
    less = False
    greater = False
    for a, b in zip(vc1, vc2):
        if a < b:
            less = True
        elif a > b:
            greater = True

    if less and not greater:
        return "BEFORE"      # vc1 < vc2
    elif greater and not less:
        return "AFTER"       # vc1 > vc2
    elif not less and not greater:
        return "EQUAL"
    else:
        return "CONCURRENT"  # concurrent events!

8.4 Hybrid Logical Clock (HLC)

Combines physical clock + logical clock. Used by CockroachDB and MongoDB.

HLC = (physical_time, logical_counter)

When physical times are equal, the logical counter determines ordering
--> Guarantees causal ordering within NTP error bounds
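A condensed sketch of the HLC update rules (simplified from the published algorithm; the frozen physical clock in the demo exists only to show the counter taking over):

```python
import time

class HLC:
    """Hybrid Logical Clock sketch: state is (physical component l, counter c)."""
    def __init__(self, now=time.time):
        self.now = now
        self.l, self.c = 0, 0

    def send(self):
        pt = self.now()
        if pt > self.l:
            self.l, self.c = pt, 0   # physical clock advanced: reset counter
        else:
            self.c += 1              # clock stalled/behind: counter carries order
        return (self.l, self.c)

    def recv(self, ml, mc):
        pt = self.now()
        l_new = max(self.l, ml, pt)
        if l_new == self.l == ml:
            self.c = max(self.c, mc) + 1
        elif l_new == self.l:
            self.c += 1
        elif l_new == ml:
            self.c = mc + 1
        else:
            self.c = 0
        self.l = l_new
        return (self.l, self.c)

# With a frozen physical clock, causality falls back to the counter:
clk = HLC(now=lambda: 100.0)
print(clk.send())          # (100.0, 0)
print(clk.recv(100.0, 5))  # (100.0, 6) -- counter orders events at equal physical time
```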

9. Failure Detection

9.1 Heartbeat Approach

Node A --> "Are you alive?" --> Node B
Node A <-- "Yes!" <-- Node B

... timeout expires ...

Node A --> "Are you alive?" --> Node B
Node A <-- (no response) <-- Node B
--> Consider Node B failed
  • Simple but causes false positives during network delays
  • Timeout setting is critical: too short means false positives, too long means slow detection

9.2 Phi Accrual Failure Detector

Learns the distribution of heartbeat arrival times and calculates failure probability.

Phi value:
  phi = 1  --> 10% chance of failure
  phi = 2  --> 1% chance of failure
  phi = 3  --> 0.1% chance of failure

If above threshold (e.g., phi > 8) --> consider node failed
  • Used by Cassandra and Akka
  • Adapts to changing network conditions
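The phi value itself is simple to compute once a distribution is assumed; this sketch uses a normal distribution (real implementations such as Cassandra's estimate the distribution from a sliding window of recent arrival intervals):

```python
import math

def phi(elapsed_ms, mean_ms, std_ms):
    """phi = -log10(P(heartbeat arrives later than elapsed_ms)),
    assuming inter-arrival times ~ Normal(mean_ms, std_ms)."""
    z = (elapsed_ms - mean_ms) / std_ms
    p_later = 0.5 * math.erfc(z / math.sqrt(2))  # normal survival function
    return -math.log10(max(p_later, 1e-15))      # cap to avoid log10(0)

# Heartbeats usually arrive every ~1000ms with std 100ms:
print(round(phi(1000, 1000, 100), 2))  # ~0.3: right on time, no suspicion
print(round(phi(1500, 1000, 100), 2))  # ~6.5: five std devs late, well past a threshold of 8? no -- past phi=3, nearing typical thresholds
```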

9.3 SWIM Protocol

Scalable Weakly-consistent Infection-style Membership. An efficient group membership protocol.

1. Node A pings Node B
2. No response --> A asks Nodes C, D to "indirect ping B"
3. C, D ping B --> still no response, B is suspected
4. After timeout, if still no response, B is removed
5. Gossip protocol spreads membership changes
  • Propagates to all nodes in O(log N) rounds
  • Used by HashiCorp Memberlist and Consul

10. Real-World System Analysis

10.1 Apache Cassandra (AP System)

Architecture:
  - Leaderless, Consistent Hashing
  - Quorum reads/writes (W + R > N)
  - Gossip protocol for membership
  - Last-Write-Wins conflict resolution
  - SSTable + Memtable storage

PACELC: PA/EL
  - During partition: Availability first
  - Normal operation: Low latency first

10.2 Google Spanner (CP System)

Architecture:
  - Globally distributed, strong consistency
  - TrueTime API: GPS + atomic clocks provide time uncertainty intervals
  - Paxos-based replication
  - External consistency guarantee

TrueTime:
  TT.now() --> [earliest, latest]
  Uncertainty interval typically 1-7ms

  On transaction commit:
  1. Assign timestamp s
  2. commit-wait: wait until TT.after(s) is true
  3. Wait for uncertainty interval to guarantee ordering
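The commit-wait step can be sketched as a wait against the uncertainty bound (`time.time()` and the interval value stand in for the real TrueTime API, which this sketch does not have access to):

```python
import time

def commit_wait(s, uncertainty=0.007):
    """Block until the local clock is certainly past timestamp s,
    i.e., until even the earliest possible 'true' time exceeds s."""
    while time.time() - uncertainty < s:   # TT.after(s) not yet guaranteed
        time.sleep(0.001)

s = time.time()          # timestamp assigned at commit
commit_wait(s, 0.01)     # returns ~10ms later; any later commit gets a larger timestamp
```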

10.3 Amazon DynamoDB (AP System)

Architecture (largely inherited from the original Dynamo design):
  - Consistent Hashing + virtual nodes
  - Sloppy Quorum: other nodes respond during failures
  - Hinted Handoff: data delivered to original node after recovery
  - Vector Clocks for versioning
  - Merkle Trees for anti-entropy

10.4 System Comparison

Feature             Cassandra        Spanner   DynamoDB
CAP Classification  AP               CP        AP
Consistency         Tunable          Strong    Tunable
Replication         Leaderless       Paxos     Leaderless
Partitioning        Consistent Hash  Range     Consistent Hash
Clocks              NTP              TrueTime  Vector Clock
Query Language      CQL              SQL       PartiQL

11. DDIA Key Takeaways

Key insights from Martin Kleppmann's "Designing Data-Intensive Applications":

Part 1: Foundations of Data Systems

Chapter  Key Concepts
Ch. 1    Reliability, scalability, maintainability
Ch. 2    Relational vs document vs graph models
Ch. 3    Storage engines: B-Tree vs LSM-Tree
Ch. 4    Encoding: JSON, Protobuf, Avro

Part 2: Distributed Data

Chapter  Key Concepts
Ch. 5    Replication: Leader-Follower, Leaderless
Ch. 6    Partitioning: Hash, Range
Ch. 7    Transactions: ACID, isolation levels
Ch. 8    Challenges of distributed systems
Ch. 9    Consistency and consensus

Part 3: Derived Data

Chapter  Key Concepts
Ch. 10   Batch processing: MapReduce
Ch. 11   Stream processing: Event sourcing
Ch. 12   The future of data systems

12. Interview Questions (15)

Q1. Explain the CAP theorem and give real-world examples.

The CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance. Since network partitions are unavoidable, the practical choice is between CP (consistency first) and AP (availability first). ZooKeeper is a CP system suitable for leader election and distributed locks, while Cassandra is an AP system suited for high write throughput workloads.

Q2. Describe the leader election process in Raft.

In Raft, when a follower does not receive a heartbeat within its election timeout, it becomes a candidate. It increments its term, votes for itself, and sends RequestVote RPCs to other nodes. If it receives votes from a majority, it becomes the leader and immediately sends heartbeats. At most one leader can exist per term.

Q3. Explain the difference between Eventual and Strong Consistency.

Strong Consistency guarantees every read returns the most recent write. It requires synchronous replication, leading to high latency and lower availability. Eventual Consistency guarantees that if updates stop, all replicas will eventually converge to the same value. Intermediate reads may return stale data, but latency is low and availability is high. Used by DNS and cache systems.

Q4. Explain Consistent Hashing and why virtual nodes are needed.

Consistent Hashing maps the hash space onto a ring, so adding or removing nodes only redistributes a minimal number of keys. Keys and nodes are mapped using the same hash function, and each key is assigned to the nearest node clockwise. Virtual nodes give each physical node multiple hash positions, smoothing out data distribution imbalances.

Q5. Compare 2PC and the Saga pattern.

2PC (Two-Phase Commit) has a coordinator that instructs all participants to prepare/commit, ensuring atomic commits. It is synchronous, blocking, and the coordinator is a SPOF. The Saga pattern breaks a long transaction into a chain of local transactions, running compensating transactions in reverse on failure. It is asynchronous and scales well but does not guarantee isolation.

Q6. What advantage does a Vector Clock have over a Lamport Clock?

A Lamport Clock can determine happens-before ordering but cannot tell if two events are concurrent. A Vector Clock maintains a vector of counters for every node, enabling accurate detection of concurrency. When two vectors are incomparable, the events are concurrent and require conflict resolution.

Q7. Compare Leader-Follower, Multi-Leader, and Leaderless replication.

Leader-Follower has a single write point with no conflicts, but the leader is a SPOF and write scaling is limited. Multi-Leader offers low write latency across datacenters but requires complex conflict resolution. Leaderless provides high availability and write throughput but needs quorum configuration and conflict resolution strategies.

Q8. How does PACELC extend CAP?

PACELC adds a tradeoff for normal operation on top of CAP. During a partition (P), you choose between availability (A) and consistency (C). Otherwise (E: else), you choose between latency (L) and consistency (C). Cassandra is PA/EL (availability during partition, low latency normally), while Spanner is PC/EC (consistency always).

Q9. How does Google Spanner's TrueTime guarantee strong consistency?

TrueTime uses GPS and atomic clocks to provide a time uncertainty interval. On transaction commit, a timestamp is assigned and commit-wait pauses for the uncertainty interval. This ensures later commits always receive larger timestamps, guaranteeing external consistency globally.

Q10. Why does W + R > N guarantee consistency in quorum reads/writes?

With N nodes, writing to W and reading from R, when W + R > N, the write set and read set always overlap. The overlapping node holds the latest data, so reads are guaranteed to return the most recent value. For example, with N=3, W=2, R=2, at least one node is shared.

Q11. How do you resolve a split-brain problem in distributed systems?

Split-brain occurs when a network partition causes two partitions to each believe they have a leader. Solutions include quorum-based leader election (partitions without a majority cannot elect a leader), fencing tokens (rejecting requests from old leaders), and STONITH (Shoot The Other Node In The Head — forcibly shutting down suspect nodes).

Q12. Compare LSM-Tree and B-Tree from a distributed systems perspective.

LSM-Tree is write-optimized with all writes being sequential, cleaned up via compaction. Ideal for write-heavy workloads (Cassandra, HBase). B-Tree is read-optimized with in-place updates, suitable for read-heavy workloads (traditional RDBMS). In distributed environments, LSM-Trees offer lower replication lag and higher write throughput.

Q13. Why is a Phi Accrual Failure Detector better than simple timeouts?

Simple timeouts declare failure if no response arrives within a fixed time, causing many false positives when network latency varies. The Phi Accrual Failure Detector learns the statistical distribution of heartbeat arrival times and calculates a phi value representing how much the current arrival deviates from the distribution. It adapts to changing network conditions.

Q14. What is the difference between Choreography and Orchestration in the Saga pattern?

In Choreography, each service publishes events and other services subscribe to trigger the next step. No central control, low coupling, but hard to trace. In Orchestration, a central orchestrator controls the entire flow. Easier to trace and debug, but the orchestrator can become a SPOF.

Q15. Why is idempotency important in distributed systems?

Network failures can cause requests to be retried, so performing the same request multiple times must produce the same result. Idempotency keys detect duplicate requests, preventing double processing in operations with side effects like payments or inventory deductions.
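A minimal in-memory sketch of idempotency keys (a production system would persist the key-to-result map durably and check-and-set it atomically):

```python
processed = {}  # idempotency key -> result of the first successful execution

def charge(idempotency_key, amount):
    """Charge at most once per key; retries replay the stored result."""
    if idempotency_key in processed:
        return processed[idempotency_key]   # duplicate request: no double charge
    result = f"charged {amount}"            # the side effect runs only once
    processed[idempotency_key] = result
    return result

print(charge("req-42", 100))  # charged 100
print(charge("req-42", 100))  # charged 100 (replayed from the store, not re-executed)
```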


13. Quiz (5)

Quiz 1. With N=5, W=3, R=2 — is consistent reading guaranteed?

No. W + R = 3 + 2 = 5, which equals N rather than exceeding it. The guarantee requires W + R > N; here a write acknowledged by nodes 1-3 can be followed by a read served by nodes 4-5, so the read and write sets may not overlap and a stale read is possible. Raising either value (e.g., W=3, R=3) restores the overlap guarantee.

Quiz 2. In Raft, a leader at Term 5 receives a RequestVote for Term 3. What happens?

It rejects the request. In Raft, a node always rejects requests with a term lower than its current term. This prevents interference from stale leaders.

Quiz 3. What is the relationship between Vector Clock V1=[2,3,1] and V2=[3,2,1]?

They are concurrent. V1's first element (2) is less than V2's (3), but V1's second element (3) is greater than V2's (2). Since neither vector is fully less than or equal to the other, they are incomparable, indicating concurrent events.

Quiz 4. What is the difference between ONE, QUORUM, and ALL consistency levels in Cassandra?
  • ONE: One node response suffices. Fastest but may read stale data
  • QUORUM: Majority (N/2+1) nodes must respond. Balanced choice
  • ALL: All nodes must respond. Slowest but strong consistency. Fails if any node is down

Quiz 5. What happens if the coordinator crashes after Prepare in 2PC?

Participants are blocked. Participants that responded "Yes" to Prepare can neither commit nor abort, waiting indefinitely with resources locked. This is the biggest drawback of 2PC, addressed by 3PC or coordinator log recovery.


14. References

Books

  1. Martin Kleppmann, "Designing Data-Intensive Applications" (2017)
  2. Maarten van Steen, Andrew S. Tanenbaum, "Distributed Systems" (4th ed., 2023)
  3. Mikito Takada, "Distributed Systems for Fun and Profit" (free online)

Papers

  1. Brewer, E., "CAP Twelve Years Later" (IEEE Computer, 2012)
  2. Lamport, L., "The Part-Time Parliament" (1998) — Original Paxos paper
  3. Ongaro, D. & Ousterhout, J., "In Search of an Understandable Consensus Algorithm" (2014) — Raft paper
  4. Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System" (1978)
  5. DeCandia, G. et al., "Dynamo: Amazon's Highly Available Key-value Store" (2007)
  6. Corbett, J.C. et al., "Spanner: Google's Globally-Distributed Database" (2012)

Web Resources

  1. Jepsen — Distributed systems safety testing: https://jepsen.io/
  2. The Raft Consensus Algorithm visualization: https://raft.github.io/
  3. Aphyr's distributed systems class: https://github.com/aphyr/distsys-class
  4. AWS re:Invent distributed systems talk series
  5. Martin Kleppmann's blog: https://martin.kleppmann.com/
  6. Marc Brooker's blog: https://brooker.co.za/blog/
  7. Distributed Systems Reading Group: http://dsrg.pdos.csail.mit.edu/