Distributed Systems Fundamentals: From CAP Theorem to Consensus Algorithms and Consistency Models


1. Why Distributed Systems

1.1 The Need for Distributed Systems

A single server cannot meet the demands of modern services. There are three core reasons why distributed systems are essential.

Scalability: As traffic grows, a single server's CPU, memory, and disk hit their limits. Horizontal scaling (scale-out) distributes the load across multiple servers.

Availability: Even if one server dies, the service must continue running. Services like Netflix and Google target 99.99%+ availability.

Fault Tolerance: Network partitions, disk failures, data center fires — failures are inevitable. Distributed systems are designed to withstand them.

1.2 Challenges of Distributed Systems

Single Server                  Distributed System
+-----------+               +-----+  +-----+  +-----+
|  App + DB |    vs         | N1  |--| N2  |--| N3  |
+-----------+               +-----+  +-----+  +-----+
                                  \    |    /
                             Network (unreliable)

Distributed systems have unique challenges:

  • Network latency: Communication between nodes takes ms to seconds
  • Partial failures: Some nodes crash while others remain healthy
  • Clock synchronization: Physical clocks on each node differ slightly
  • Byzantine failures: Nodes may send incorrect data

2. CAP Theorem

2.1 What Is the CAP Theorem?

The CAP theorem was proposed by Eric Brewer in 2000 and formally proven by Seth Gilbert and Nancy Lynch in 2002.

A distributed system can guarantee at most two of three properties simultaneously:

Property             Description
Consistency          Every read receives the most recent write
Availability         Every request receives a response (success or failure)
Partition Tolerance  System continues to operate despite network partitions

2.2 Why You Cannot Have All Three — Intuitive Proof

Network partition occurs!

  [Node A]  ----X----  [Node B]
  data=1                data=1

  Client sends write(data=2) to Node A

  Option 1 (CP): Cannot sync with Node B --> Reject request (sacrifice availability)
  Option 2 (AP): Update Node A only → Allow inconsistency (sacrifice consistency)

Since network partitions (P) are unavoidable in distributed systems, the real choice is CP vs AP.

2.3 CP Systems vs AP Systems

Aspect                     CP Systems                               AP Systems
Behavior during partition  Reject writes or wait                    Accept writes, merge later
Examples                   ZooKeeper, etcd, Spanner                 Cassandra, DynamoDB, Riak
Best for                   Financial transactions, leader election  Social media, shopping carts

2.4 PACELC Extension

The PACELC model addresses CAP's limitations:

if (Partition) then
    Availability vs Consistency
else
    Latency vs Consistency

System     During Partition   Normal Operation
Cassandra  PA (availability)  EL (low latency)
Spanner    PC (consistency)   EC (consistency)
MongoDB    PA (availability)  EC (consistency)

3. Consistency Models

3.1 Consistency Levels Compared

From strongest to weakest:

Strict Consistency
    |
Linearizability
    |
Sequential Consistency
    |
Causal Consistency
    |
Eventual Consistency

3.2 Strong Consistency

Every read returns the result of the most recent write.

Time -->
Client A:  write(x=1) ------>
Client B:              read(x) --> must return 1
  • Implementation: Synchronous replication to all nodes before responding
  • Pro: Simple programming model
  • Con: High latency, low availability

3.3 Sequential Consistency

All nodes see operations in the same order, but the order may differ from real-time.

Real time:       A:write(x=1)  B:write(x=2)
Sequential OK:   B:write(x=2), A:write(x=1)
                 (as long as all nodes agree on this order)

3.4 Causal Consistency

Operations with causal relationships are ordered. Unrelated operations may be seen in different orders.

A writes a post --> B writes a comment (causally related)
  --> All nodes see the post before the comment

C writes a post / D writes another post (no causal relation)
  --> Different nodes may see them in different orders

3.5 Eventual Consistency

If updates stop, eventually all nodes will converge to the same value.

Time -->
Node A:  x=1 ......... x=1 (propagation) --> x=1
Node B:  x=0 ......... x=0 --> x=1 (slight delay)
Node C:  x=0 ......... x=0 --> x=1
  • Used by DNS, CDN caches, and (until its 2020 move to strong read-after-write consistency) Amazon S3
  • Requires conflict resolution: Last-Write-Wins, CRDTs, application-level merging

3.6 Consistency Model Comparison

Model       Latency   Availability  Programming Difficulty  Use Cases
Strong      High      Low           Easy                    Finance, inventory
Sequential  Medium    Medium        Medium                  Distributed locks
Causal      Low       High          Medium                  Social feeds
Eventual    Very low  Very high     Hard                    DNS, caches

4. Replication

4.1 Leader-Follower Replication

The most common replication strategy.

           writes
Client --> [Leader] --> replication --> [Follower 1]
                                   --> [Follower 2]
                                   --> [Follower 3]
           reads
Client --> [any Follower 1, 2, or 3]

Synchronous replication: Leader waits for all followers to acknowledge. Strong consistency, high latency.

Asynchronous replication: Leader responds immediately, followers replicate later. Low latency, potential data loss.

Semi-synchronous replication: Leader waits for at least one follower. Supported by MySQL (semisynchronous replication plugin) and PostgreSQL (synchronous standbys).

# Semi-synchronous replication pseudocode
def write(data):
    leader.write(data)
    ack_count = 1  # the leader's own copy
    for follower in followers:
        if follower.replicate(data):
            ack_count += 1
            if ack_count >= 2:  # leader + at least one follower confirmed
                return SUCCESS
    return FAILURE  # no follower acknowledged; the write is not yet durable

4.2 Multi-Leader Replication

Multiple nodes accept writes simultaneously.

[Leader A] <--> [Leader B] <--> [Leader C]
    |               |               |
[Follower]     [Follower]     [Follower]
  • Use cases: Multi-datacenter, offline work (Google Docs)
  • Core challenge: Write conflict resolution
    • Last-Write-Wins (LWW): Write with largest timestamp wins
    • Merge: Use CRDTs (Conflict-free Replicated Data Types)
    • User resolution: Show conflicts to users for manual choice
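The LWW rule fits in a few lines; a minimal sketch where each replica's value carries a timestamp (the epoch values in the example are illustrative):

```python
def lww_merge(a, b):
    """Last-Write-Wins: keep the replica whose (timestamp, value) pair is
    largest; comparing the full tuple keeps ties deterministic."""
    return max(a, b)

# Two replicas diverged during a partition; the later write wins:
print(lww_merge((1700000002, "cart: 2 items"), (1700000001, "cart: 1 item")))
# (1700000002, 'cart: 2 items')
```

Note that LWW silently discards the losing write, which is why CRDTs or user-level merging are preferred when losing an update is unacceptable.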

4.3 Leaderless Replication (Quorum)

No leader; clients read/write to multiple nodes simultaneously.

         write(x=1)
Client --> [Node A] OK
       --> [Node B] OK
       --> [Node C] FAIL (down)

W=2 success --> Write succeeds!

         read(x)
Client --> [Node A] x=1
       --> [Node B] x=1
       --> [Node C] x=0 (stale)

R=2 --> return latest value (x=1)

Quorum formula: W + R > N guarantees reading the latest data

Config              N  W  R  Characteristics
Strong consistency  3  2  2  Consistency guaranteed
Fast writes         3  1  3  Fast writes, slow reads
Fast reads          3  3  1  Slow writes, fast reads
  • Used by Cassandra, DynamoDB, Riak
  • Anti-entropy and read repair fix inconsistencies
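The quorum condition and a versioned read can be sketched as follows; the `(version, value)` reply shape is an assumption for illustration, not any particular database's wire format:

```python
def quorum_ok(n, w, r):
    """W + R > N forces the write set and read set to share a node."""
    return w + r > n

def quorum_read(replies):
    """Given R replies of (version, value), return the newest value."""
    return max(replies)[1]

print(quorum_ok(3, 2, 2))   # True  -> overlap guaranteed
print(quorum_ok(3, 1, 1))   # False -> stale reads possible
print(quorum_read([(7, "x=1"), (6, "x=0")]))  # x=1 (version 7 is newest)
```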

5. Partitioning (Sharding)

5.1 Why Partition

When data exceeds a single node's capacity, distribute it across multiple nodes.

5.2 Hash Partitioning

hash(key) mod N = partition_number

key="user_123" --> hash --> 7 --> 7 mod 3 = 1 --> Partition 1
key="user_456" --> hash --> 2 --> 2 mod 3 = 2 --> Partition 2
  • Pro: Even data distribution
  • Con: Inefficient range queries, massive redistribution on node changes
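The redistribution cost of `mod N` can be measured directly. A rough sketch (MD5 is used only to get a hash that is stable across runs, since Python's built-in `hash()` is randomized per process):

```python
import hashlib

def partition(key, n):
    # Stable hash of the key, reduced mod the partition count
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user_{i}" for i in range(10_000)]
moved = sum(partition(k, 3) != partition(k, 4) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when N goes from 3 to 4")  # ~75%
```

Roughly three quarters of the keys change partition after adding one node — the mass redistribution that consistent hashing is designed to avoid.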

5.3 Range Partitioning

Partition 0: A-F
Partition 1: G-M
Partition 2: N-S
Partition 3: T-Z
  • Pro: Efficient range queries
  • Con: Hotspots possible (data concentrated in certain ranges)

5.4 Consistent Hashing

Minimizes key redistribution when nodes are added or removed.

        0
       / \
     330   30
     /       \
   300        60
   |    Ring    |
   270        90
     \       /
     240   120
       \ /
       180

Node A: 0-90
Node B: 90-180
Node C: 180-270
Node D: 270-360

Add Node E at position 135 --> only keys 90-135 move from B to E
  • Virtual nodes eliminate imbalances
  • Used by Cassandra, DynamoDB, Memcached
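A minimal ring with virtual nodes, sketched in Python (the vnode count and MD5 are illustrative choices, not what any particular system uses):

```python
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring
        self.ring = sorted((h(f"{node}#{i}"), node)
                           for node in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def lookup(self, key):
        # First virtual node clockwise from the key's position (wrapping)
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

before = Ring(["A", "B", "C"])
after = Ring(["A", "B", "C", "D"])
keys = [f"user_{i}" for i in range(1000)]
moved = sum(before.lookup(k) != after.lookup(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved after adding D")  # ~1/N, not ~75%
```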

5.5 Rebalancing Strategies

Strategy               Description                                   Examples
Fixed partition count  Set partition count upfront, assign to nodes  Elasticsearch, Riak
Dynamic splitting      Split/merge based on partition size           HBase, RethinkDB
Proportional to nodes  Partition count scales with node count        Cassandra

6. Consensus Algorithms

6.1 The Consensus Problem

All nodes in a distributed system agreeing on a single value. Essential for leader election, atomic commits, and distributed locks.

6.2 Paxos Basics

Proposed by Leslie Lamport (circulated in 1989, formally published in 1998). Notoriously difficult to understand.

Three roles:

  • Proposer: Proposes a value
  • Acceptor: Accepts or rejects values
  • Learner: Learns the agreed-upon value

Two phases:

Phase 1: Prepare
  Proposer --> Acceptor: "Please prepare for proposal number n"
  Acceptor --> Proposer: "OK" (includes previously accepted value if any)

Phase 2: Accept
  Proposer --> Acceptor: "Please accept proposal n with value v"
  Acceptor --> Proposer: "Accepted" (consensus reached if majority accepts)
  • Problem: Competing proposers can livelock; mitigated by electing a distinguished proposer, the approach Multi-Paxos takes

6.3 Raft in Detail

Proposed by Diego Ongaro and John Ousterhout in 2014. Designed to be more understandable than Paxos.

Core idea: Leader-based consensus

Node states: Leader, Follower, Candidate

6.3.1 Leader Election

Time -->
           Term 1          Term 2
Follower: [Heartbeat]... [Timeout!] --> Candidate
Candidate: Votes for self + sends RequestVote to others
           Gets majority --> Elected as Leader!

Leader:   [Heartbeat]-->[Heartbeat]-->[Heartbeat]-->...

# Raft leader election pseudocode
class RaftNode:
    def __init__(self):
        self.state = "follower"
        self.current_term = 0
        self.voted_for = None
        self.election_timeout = random(150, 300)  # ms

    def on_timeout(self):
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.id
        votes = 1  # self-vote

        for peer in self.peers:
            if peer.request_vote(self.current_term, self.log):
                votes += 1

        if votes > (len(self.peers) + 1) // 2:  # strict majority of the full cluster (self + peers)
            self.state = "leader"
            self.send_heartbeats()

6.3.2 Log Replication

Client --> Leader: write(x=5)

Leader Log: [Term1:x=3] [Term1:y=7] [Term2:x=5]
                                        | AppendEntries
Follower1: [Term1:x=3] [Term1:y=7] [Term2:x=5] OK
Follower2: [Term1:x=3] [Term1:y=7] [Term2:x=5] OK
Follower3: [Term1:x=3] [Term1:y=7] (network delay)

Majority (3/5) replicated --> Commit! --> Respond to client

6.3.3 Safety Properties

  • Election Safety: At most one leader per term
  • Leader Append-Only: Leader never overwrites or deletes log entries
  • Log Matching: If two logs have entries with the same index and term, all preceding entries are identical
  • State Machine Safety: Committed entries are applied in the same order on all nodes

6.4 Raft in Production

System       Use
etcd         Kubernetes cluster state storage
Consul       Service discovery, KV store
CockroachDB  Distributed SQL database
TiKV         Distributed KV engine for TiDB

7. Distributed Transactions

7.1 Two-Phase Commit (2PC)

Phase 1: Prepare (Vote)
  Coordinator --> Participant A: "Can you commit?"
  Coordinator --> Participant B: "Can you commit?"
  A --> Coordinator: "Yes"
  B --> Coordinator: "Yes"

Phase 2: Commit (Execute)
  Coordinator --> A: "Commit!"
  Coordinator --> B: "Commit!"

Problems:

  • Coordinator is a single point of failure (SPOF)
  • Blocking: If coordinator fails after Prepare, participants wait indefinitely
  • Performance: Synchronous protocol, so it is slow

7.2 Three-Phase Commit (3PC)

Adds a Pre-Commit phase to solve 2PC's blocking problem.

Phase 1: CanCommit
Phase 2: PreCommit (with timeout)
Phase 3: DoCommit
  • Still inconsistent under network partitions
  • Rarely used in practice

7.3 Saga Pattern

Splits a long transaction into local transactions, with compensating transactions on failure.

Order Creation Saga:

T1: Create Order --> T2: Process Payment --> T3: Deduct Inventory --> T4: Start Shipping
                          | FAIL!
C2: Refund Payment <-- C1: Cancel Order

Two implementation styles:

Style          Description                                      Trade-offs
Choreography   Event-driven, each service publishes next event  Simple, low coupling; hard to trace
Orchestration  Central orchestrator controls the sequence       Easy to trace; centralization risk

# Saga Orchestrator pseudocode
class OrderSaga:
    steps = [
        ("create_order", "cancel_order"),
        ("process_payment", "refund_payment"),
        ("update_inventory", "restore_inventory"),
        ("arrange_shipping", "cancel_shipping"),
    ]

    def execute(self):
        completed = []
        for action, compensation in self.steps:
            try:
                getattr(self, action)()
                completed.append(compensation)
            except Exception:
                # Run compensations in reverse
                for comp in reversed(completed):
                    getattr(self, comp)()
                raise SagaFailed()

8. Distributed Clocks

8.1 The Problem with Physical Clocks

  • NTP synchronization error: milliseconds to hundreds of milliseconds
  • Clock drift: Diverges over time
  • Leap seconds: Occasionally an extra second is added or removed

Conclusion: Physical clocks alone cannot determine event ordering accurately.

8.2 Lamport Clock (Logical Clock)

Proposed by Leslie Lamport in 1978. Each process maintains a counter.

Rules:
1. Increment counter on each event
2. Include counter when sending messages
3. On receive: max(own counter, received counter) + 1

Process A:  (1)--> (2)--> (3)---------> (6)
                          send          recv
Process B:        (1)--> (2)--> (4)--> (5)
                          recv          send
  • Limitation: Can determine causality but not whether two events are concurrent
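The three rules above can be sketched as a small class:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                 # Rule 1: increment on each local event
        self.time += 1
        return self.time

    def send(self):                 # Rule 2: the counter travels with the message
        return self.tick()

    def recv(self, msg_time):       # Rule 3: max(own, received) + 1
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick(); a.tick()          # A has seen two local events
stamp = a.send()            # A sends at time 3
print(b.recv(stamp))        # 4: B jumps past A's clock, preserving causality
```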

8.3 Vector Clock

Each node maintains a vector of counters for all nodes.

3 nodes: A, B, C

A: [1,0,0] --> [2,0,0] --> [2,0,0] send to B
B: [0,0,0] --> [0,1,0] --> [2,2,0] recv from A --> [2,3,0] send to C
C: [0,0,0] --> [0,0,1] --> [2,3,2] recv from B

Determining concurrency:

  • V1 <= V2: Every element of V1 is less than or equal to the corresponding element of V2, with at least one strictly less — V1 happens-before V2
  • Neither V1 <= V2 nor V2 <= V1: The vectors are incomparable — Concurrent events

def compare_vector_clocks(vc1, vc2):
    """Compare two vector clocks"""
    less = False
    greater = False
    for a, b in zip(vc1, vc2):
        if a < b:
            less = True
        elif a > b:
            greater = True

    if less and not greater:
        return "BEFORE"      # vc1 < vc2
    elif greater and not less:
        return "AFTER"       # vc1 > vc2
    elif not less and not greater:
        return "EQUAL"
    else:
        return "CONCURRENT"  # concurrent events!

8.4 Hybrid Logical Clock (HLC)

Combines physical clock + logical clock. Used by CockroachDB and MongoDB.

HLC = (physical_time, logical_counter)

When physical times are equal, the logical counter determines ordering
--> Guarantees causal ordering within NTP error bounds
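A condensed sketch of the HLC update rules (simplified from the published algorithm; the frozen physical clock in the demo exists only to show the counter taking over):

```python
import time

class HLC:
    """Hybrid Logical Clock sketch: state is (physical component l, counter c)."""
    def __init__(self, now=time.time):
        self.now = now
        self.l, self.c = 0, 0

    def send(self):
        pt = self.now()
        if pt > self.l:
            self.l, self.c = pt, 0   # physical clock advanced: reset counter
        else:
            self.c += 1              # clock stalled/behind: counter carries order
        return (self.l, self.c)

    def recv(self, ml, mc):
        pt = self.now()
        l_new = max(self.l, ml, pt)
        if l_new == self.l == ml:
            self.c = max(self.c, mc) + 1
        elif l_new == self.l:
            self.c += 1
        elif l_new == ml:
            self.c = mc + 1
        else:
            self.c = 0
        self.l = l_new
        return (self.l, self.c)

# With a frozen physical clock, causality falls back to the counter:
clk = HLC(now=lambda: 100.0)
print(clk.send())          # (100.0, 0)
print(clk.recv(100.0, 5))  # (100.0, 6) -- counter orders events at equal physical time
```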

9. Failure Detection

9.1 Heartbeat Approach

Node A --> "Are you alive?" --> Node B
Node A <-- "Yes!" <-- Node B

... timeout expires ...

Node A --> "Are you alive?" --> Node B
Node A <-- (no response) <-- Node B
--> Consider Node B failed
  • Simple but causes false positives during network delays
  • Timeout setting is critical: too short means false positives, too long means slow detection

9.2 Phi Accrual Failure Detector

Learns the distribution of heartbeat arrival times and calculates failure probability.

Phi value:
  phi = 1  --> 10% chance of failure
  phi = 2  --> 1% chance of failure
  phi = 3  --> 0.1% chance of failure

If above threshold (e.g., phi > 8) --> consider node failed
  • Used by Cassandra and Akka
  • Adapts to changing network conditions
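The phi value itself is simple to compute once a distribution is assumed; this sketch uses a normal distribution (real implementations such as Cassandra's estimate the distribution from a sliding window of recent arrival intervals):

```python
import math

def phi(elapsed_ms, mean_ms, std_ms):
    """phi = -log10(P(heartbeat arrives later than elapsed_ms)),
    assuming inter-arrival times ~ Normal(mean_ms, std_ms)."""
    z = (elapsed_ms - mean_ms) / std_ms
    p_later = 0.5 * math.erfc(z / math.sqrt(2))  # normal survival function
    return -math.log10(max(p_later, 1e-15))      # cap to avoid log10(0)

# Heartbeats usually arrive every ~1000ms with std 100ms:
print(round(phi(1000, 1000, 100), 2))  # ~0.3: right on time, no suspicion
print(round(phi(1500, 1000, 100), 2))  # ~6.5: five std devs late, well past a threshold of 8? no -- past phi=3, nearing typical thresholds
```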

9.3 SWIM Protocol

Scalable Weakly-consistent Infection-style Membership. An efficient group membership protocol.

1. Node A pings Node B
2. No response --> A asks Nodes C, D to "indirect ping B"
3. C, D ping B --> still no response, B is suspected
4. After timeout, if still no response, B is removed
5. Gossip protocol spreads membership changes
  • Propagates to all nodes in O(log N) rounds
  • Used by HashiCorp Memberlist and Consul

10. Real-World System Analysis

10.1 Apache Cassandra (AP System)

Architecture:
  - Leaderless, Consistent Hashing
  - Quorum reads/writes (W + R > N)
  - Gossip protocol for membership
  - Last-Write-Wins conflict resolution
  - SSTable + Memtable storage

PACELC: PA/EL
  - During partition: Availability first
  - Normal operation: Low latency first

10.2 Google Spanner (CP System)

Architecture:
  - Globally distributed, strong consistency
  - TrueTime API: GPS + atomic clocks provide time uncertainty intervals
  - Paxos-based replication
  - External consistency guarantee

TrueTime:
  TT.now() --> [earliest, latest]
  Uncertainty interval typically 1-7ms

  On transaction commit:
  1. Assign timestamp s
  2. commit-wait: wait until TT.after(s) is true
  3. Wait for uncertainty interval to guarantee ordering
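The commit-wait step can be sketched as a wait against the uncertainty bound (`time.time()` and the interval value stand in for the real TrueTime API, which this sketch does not have access to):

```python
import time

def commit_wait(s, uncertainty=0.007):
    """Block until the local clock is certainly past timestamp s,
    i.e., until even the earliest possible 'true' time exceeds s."""
    while time.time() - uncertainty < s:   # TT.after(s) not yet guaranteed
        time.sleep(0.001)

s = time.time()          # timestamp assigned at commit
commit_wait(s, 0.01)     # returns ~10ms later; any later commit gets a larger timestamp
```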

10.3 Amazon DynamoDB (AP System)

Architecture (largely inherited from the original Dynamo design):
  - Consistent Hashing + virtual nodes
  - Sloppy Quorum: other nodes respond during failures
  - Hinted Handoff: data delivered to original node after recovery
  - Vector Clocks for versioning
  - Merkle Trees for anti-entropy

10.4 System Comparison

Feature             Cassandra        Spanner   DynamoDB
CAP Classification  AP               CP        AP
Consistency         Tunable          Strong    Tunable
Replication         Leaderless       Paxos     Leaderless
Partitioning        Consistent Hash  Range     Consistent Hash
Clocks              NTP              TrueTime  Vector Clock
Query Language      CQL              SQL       PartiQL

11. DDIA Key Takeaways

Key insights from Martin Kleppmann's "Designing Data-Intensive Applications":

Part 1: Foundations of Data Systems

Chapter  Key Concepts
Ch. 1    Reliability, scalability, maintainability
Ch. 2    Relational vs document vs graph models
Ch. 3    Storage engines: B-Tree vs LSM-Tree
Ch. 4    Encoding: JSON, Protobuf, Avro

Part 2: Distributed Data

Chapter  Key Concepts
Ch. 5    Replication: Leader-Follower, Leaderless
Ch. 6    Partitioning: Hash, Range
Ch. 7    Transactions: ACID, isolation levels
Ch. 8    Challenges of distributed systems
Ch. 9    Consistency and consensus

Part 3: Derived Data

Chapter  Key Concepts
Ch. 10   Batch processing: MapReduce
Ch. 11   Stream processing: Event sourcing
Ch. 12   The future of data systems

12. Interview Questions (15)

Q1. Explain the CAP theorem and give real-world examples.

The CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance. Since network partitions are unavoidable, the practical choice is between CP (consistency first) and AP (availability first). ZooKeeper is a CP system suitable for leader election and distributed locks, while Cassandra is an AP system suited for high write throughput workloads.

Q2. Describe the leader election process in Raft.

In Raft, when a follower does not receive a heartbeat within its election timeout, it becomes a candidate. It increments its term, votes for itself, and sends RequestVote RPCs to other nodes. If it receives votes from a majority, it becomes the leader and immediately sends heartbeats. At most one leader can exist per term.

Q3. Explain the difference between Eventual and Strong Consistency.

Strong Consistency guarantees every read returns the most recent write. It requires synchronous replication, leading to high latency and lower availability. Eventual Consistency guarantees that if updates stop, all replicas will eventually converge to the same value. Intermediate reads may return stale data, but latency is low and availability is high. Used by DNS and cache systems.

Q4. Explain Consistent Hashing and why virtual nodes are needed.

Consistent Hashing maps the hash space onto a ring, so adding or removing nodes only redistributes a minimal number of keys. Keys and nodes are mapped using the same hash function, and each key is assigned to the nearest node clockwise. Virtual nodes give each physical node multiple hash positions, smoothing out data distribution imbalances.

Q5. Compare 2PC and the Saga pattern.

2PC (Two-Phase Commit) has a coordinator that instructs all participants to prepare/commit, ensuring atomic commits. It is synchronous, blocking, and the coordinator is a SPOF. The Saga pattern breaks a long transaction into a chain of local transactions, running compensating transactions in reverse on failure. It is asynchronous and scales well but does not guarantee isolation.

Q6. What advantage does a Vector Clock have over a Lamport Clock?

A Lamport Clock can determine happens-before ordering but cannot tell if two events are concurrent. A Vector Clock maintains a vector of counters for every node, enabling accurate detection of concurrency. When two vectors are incomparable, the events are concurrent and require conflict resolution.

Q7. Compare Leader-Follower, Multi-Leader, and Leaderless replication.

Leader-Follower has a single write point with no conflicts, but the leader is a SPOF and write scaling is limited. Multi-Leader offers low write latency across datacenters but requires complex conflict resolution. Leaderless provides high availability and write throughput but needs quorum configuration and conflict resolution strategies.

Q8. How does PACELC extend CAP?

PACELC adds a tradeoff for normal operation on top of CAP. During a partition (P), you choose between availability (A) and consistency (C). Otherwise (E: else), you choose between latency (L) and consistency (C). Cassandra is PA/EL (availability during partition, low latency normally), while Spanner is PC/EC (consistency always).

Q9. How does Google Spanner's TrueTime guarantee strong consistency?

TrueTime uses GPS and atomic clocks to provide a time uncertainty interval. On transaction commit, a timestamp is assigned and commit-wait pauses for the uncertainty interval. This ensures later commits always receive larger timestamps, guaranteeing external consistency globally.

Q10. Why does W + R > N guarantee consistency in quorum reads/writes?

With N nodes, writing to W and reading from R, when W + R > N, the write set and read set always overlap. The overlapping node holds the latest data, so reads are guaranteed to return the most recent value. For example, with N=3, W=2, R=2, at least one node is shared.

Q11. How do you resolve a split-brain problem in distributed systems?

Split-brain occurs when a network partition causes two partitions to each believe they have a leader. Solutions include quorum-based leader election (partitions without a majority cannot elect a leader), fencing tokens (rejecting requests from old leaders), and STONITH (Shoot The Other Node In The Head — forcibly shutting down suspect nodes).

Q12. Compare LSM-Tree and B-Tree from a distributed systems perspective.

LSM-Tree is write-optimized with all writes being sequential, cleaned up via compaction. Ideal for write-heavy workloads (Cassandra, HBase). B-Tree is read-optimized with in-place updates, suitable for read-heavy workloads (traditional RDBMS). In distributed environments, LSM-Trees offer lower replication lag and higher write throughput.

Q13. Why is a Phi Accrual Failure Detector better than simple timeouts?

Simple timeouts declare failure if no response arrives within a fixed time, causing many false positives when network latency varies. The Phi Accrual Failure Detector learns the statistical distribution of heartbeat arrival times and calculates a phi value representing how much the current arrival deviates from the distribution. It adapts to changing network conditions.

Q14. What is the difference between Choreography and Orchestration in the Saga pattern?

In Choreography, each service publishes events and other services subscribe to trigger the next step. No central control, low coupling, but hard to trace. In Orchestration, a central orchestrator controls the entire flow. Easier to trace and debug, but the orchestrator can become a SPOF.

Q15. Why is idempotency important in distributed systems?

Network failures can cause requests to be retried, so performing the same request multiple times must produce the same result. Idempotency keys detect duplicate requests, preventing double processing in operations with side effects like payments or inventory deductions.
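A minimal in-memory sketch of idempotency keys (a production system would persist the key-to-result map durably and check-and-set it atomically):

```python
processed = {}  # idempotency key -> result of the first successful execution

def charge(idempotency_key, amount):
    """Charge at most once per key; retries replay the stored result."""
    if idempotency_key in processed:
        return processed[idempotency_key]   # duplicate request: no double charge
    result = f"charged {amount}"            # the side effect runs only once
    processed[idempotency_key] = result
    return result

print(charge("req-42", 100))  # charged 100
print(charge("req-42", 100))  # charged 100 (replayed from the store, not re-executed)
```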


13. Quiz (5)

Quiz 1. With N=5, W=3, R=2 — is consistent reading guaranteed?

No. W + R = 3 + 2 = 5, which equals N rather than exceeding it. The guarantee requires W + R > N; here a write acknowledged by nodes 1-3 can be followed by a read served by nodes 4-5, so the read and write sets may not overlap and a stale read is possible. Raising either value (e.g., W=3, R=3) restores the overlap guarantee.

Quiz 2. In Raft, a leader at Term 5 receives a RequestVote for Term 3. What happens?

It rejects the request. In Raft, a node always rejects requests with a term lower than its current term. This prevents interference from stale leaders.

Quiz 3. What is the relationship between Vector Clock V1=[2,3,1] and V2=[3,2,1]?

They are concurrent. V1's first element (2) is less than V2's (3), but V1's second element (3) is greater than V2's (2). Since neither vector is fully less than or equal to the other, they are incomparable, indicating concurrent events.

Quiz 4. What is the difference between ONE, QUORUM, and ALL consistency levels in Cassandra?
  • ONE: One node response suffices. Fastest but may read stale data
  • QUORUM: Majority (N/2+1) nodes must respond. Balanced choice
  • ALL: All nodes must respond. Slowest but strong consistency. Fails if any node is down

Quiz 5. What happens if the coordinator crashes after Prepare in 2PC?

Participants are blocked. Participants that responded "Yes" to Prepare can neither commit nor abort, waiting indefinitely with resources locked. This is the biggest drawback of 2PC, addressed by 3PC or coordinator log recovery.


14. References

Books

  1. Martin Kleppmann, "Designing Data-Intensive Applications" (2017)
  2. Maarten van Steen, Andrew S. Tanenbaum, "Distributed Systems" (4th ed., 2023)
  3. Mikito Takada, "Distributed Systems for Fun and Profit" (free online)

Papers

  1. Brewer, E., "CAP Twelve Years Later" (IEEE Computer, 2012)
  2. Lamport, L., "The Part-Time Parliament" (1998) — Original Paxos paper
  3. Ongaro, D. & Ousterhout, J., "In Search of an Understandable Consensus Algorithm" (2014) — Raft paper
  4. Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System" (1978)
  5. DeCandia, G. et al., "Dynamo: Amazon's Highly Available Key-value Store" (2007)
  6. Corbett, J.C. et al., "Spanner: Google's Globally-Distributed Database" (2012)

Web Resources

  1. Jepsen — Distributed systems safety testing: https://jepsen.io/
  2. The Raft Consensus Algorithm visualization: https://raft.github.io/
  3. Aphyr's distributed systems class: https://github.com/aphyr/distsys-class
  4. AWS re:Invent distributed systems talk series
  5. Martin Kleppmann's blog: https://martin.kleppmann.com/
  6. Marc Brooker's blog: https://brooker.co.za/blog/
  7. Distributed Systems Reading Group: http://dsrg.pdos.csail.mit.edu/