Distributed Systems Fundamentals: From CAP Theorem to Consensus Algorithms and Consistency Models
Author: Youngju Kim (@fjvbn20031)
Table of Contents
1. Why Distributed Systems
1.1 The Need for Distributed Systems
A single server cannot meet the demands of modern services. There are three core reasons why distributed systems are essential.
Scalability: As traffic grows, a single server's CPU, memory, and disk hit their limits. Horizontal scaling (scale-out) distributes the load across multiple servers.
Availability: Even if one server dies, the service must continue running. Services like Netflix and Google target 99.99%+ availability.
Fault Tolerance: Network partitions, disk failures, data center fires — failures are inevitable. Distributed systems are designed to withstand them.
1.2 Challenges of Distributed Systems
 Single Server           Distributed System
 +-----------+     +-----+   +-----+   +-----+
 | App + DB  |  vs | N1  |---| N2  |---| N3  |
 +-----------+     +-----+   +-----+   +-----+
                      \        |        /
                      Network (unreliable)
Distributed systems have unique challenges:
- Network latency: Communication between nodes takes ms to seconds
- Partial failures: Some nodes crash while others remain healthy
- Clock synchronization: Physical clocks on each node differ slightly
- Byzantine failures: Nodes may send incorrect data
2. CAP Theorem
2.1 What Is the CAP Theorem?
Proposed by Eric Brewer in 2000 and proven by Seth Gilbert and Nancy Lynch in 2002.
A distributed system can guarantee at most two of three properties simultaneously:
| Property | Description |
|---|---|
| Consistency | Every read receives the most recent write |
| Availability | Every request receives a response (success or failure) |
| Partition Tolerance | System continues to operate despite network partitions |
2.2 Why You Cannot Have All Three — Intuitive Proof
Network partition occurs!
[Node A] ----X---- [Node B]
 data=1             data=1
Client sends write(data=2) to Node A
Option 1 (CP): Cannot sync with Node B → Reject request (sacrifice availability)
Option 2 (AP): Update Node A only → Allow inconsistency (sacrifice consistency)
Since network partitions (P) are unavoidable in distributed systems, the real choice is CP vs AP.
2.3 CP Systems vs AP Systems
| Aspect | CP Systems | AP Systems |
|---|---|---|
| Behavior during partition | Reject writes or wait | Accept writes, merge later |
| Examples | ZooKeeper, etcd, Spanner | Cassandra, DynamoDB, Riak |
| Best for | Financial transactions, leader election | Social media, shopping carts |
2.4 PACELC Extension
The PACELC model addresses CAP's limitations:
if (Partition) then
Availability vs Consistency
else
Latency vs Consistency
| System | During Partition | Normal Operation |
|---|---|---|
| Cassandra | PA (availability) | EL (low latency) |
| Spanner | PC (consistency) | EC (consistency) |
| MongoDB | PA (availability) | EC (consistency) |
3. Consistency Models
3.1 Consistency Levels Compared
From strongest to weakest:
Strict Consistency
|
Linearizability
|
Sequential Consistency
|
Causal Consistency
|
Eventual Consistency
3.2 Strong Consistency
Every read returns the result of the most recent write.
Time -->
Client A: write(x=1) ------>
Client B: read(x) --> must return 1
- Implementation: Synchronous replication to all nodes before responding
- Pro: Simple programming model
- Con: High latency, low availability
3.3 Sequential Consistency
All nodes see operations in the same order, but the order may differ from real-time.
Real time: A:write(x=1) B:write(x=2)
Sequential OK: B:write(x=2), A:write(x=1)
(as long as all nodes agree on this order)
3.4 Causal Consistency
Operations with causal relationships are ordered. Unrelated operations may be seen in different orders.
A writes a post --> B writes a comment (causally related)
--> All nodes see the post before the comment
C writes a post / D writes another post (no causal relation)
--> Different nodes may see them in different orders
3.5 Eventual Consistency
If updates stop, eventually all nodes will converge to the same value.
Time -->
Node A: x=1 ......... x=1 (propagation) --> x=1
Node B: x=0 ......... x=0 --> x=1 (slight delay)
Node C: x=0 ......... x=0 --> x=1
- Used by DNS, CDN caches, Amazon S3
- Requires conflict resolution: Last-Write-Wins, CRDTs, application-level merging
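As a concrete example of the CRDT approach mentioned above, here is a minimal grow-only counter (G-Counter) sketch; the class layout is illustrative, but the core idea is standard: merging takes the element-wise maximum, so replicas converge no matter how merges are ordered.

```python
# Grow-only counter CRDT: each replica increments only its own slot;
# merge takes the element-wise max, which is commutative, associative,
# and idempotent, so all replicas converge to the same value.
class GCounter:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.counts = [0] * num_nodes

    def increment(self):
        self.counts[self.node_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

# Two replicas update independently, then exchange state
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```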
3.6 Consistency Model Comparison
| Model | Latency | Availability | Programming Difficulty | Use Cases |
|---|---|---|---|---|
| Strong | High | Low | Easy | Finance, inventory |
| Sequential | Medium | Medium | Medium | Distributed locks |
| Causal | Low | High | Medium | Social feeds |
| Eventual | Very low | Very high | Hard | DNS, caches |
4. Replication
4.1 Leader-Follower Replication
The most common replication strategy.
writes
Client --> [Leader] --> replication --> [Follower 1]
                                    --> [Follower 2]
                                    --> [Follower 3]
reads
Client --> [any of Follower 1, 2, or 3]
Synchronous replication: Leader waits for all followers to acknowledge. Strong consistency, high latency.
Asynchronous replication: Leader responds immediately, followers replicate later. Low latency, potential data loss.
Semi-synchronous replication: Leader waits for at least one follower. Supported by MySQL (semisynchronous replication plugin) and PostgreSQL (synchronous standby configuration).
# Semi-synchronous replication pseudocode
def write(data):
    leader.write(data)
    ack_count = 1  # leader itself
    for follower in followers:
        if follower.replicate(data):
            ack_count += 1
        if ack_count >= 2:  # at least 1 follower confirmed
            return SUCCESS
    return FAILURE
4.2 Multi-Leader Replication
Multiple nodes accept writes simultaneously.
[Leader A] <--> [Leader B] <--> [Leader C]
| | |
[Follower] [Follower] [Follower]
- Use cases: Multi-datacenter, offline work (Google Docs)
- Core challenge: Write conflict resolution
- Last-Write-Wins (LWW): Write with largest timestamp wins
- Merge: Use CRDTs (Conflict-free Replicated Data Types)
- User resolution: Show conflicts to users for manual choice
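A minimal Last-Write-Wins sketch; the `(value, timestamp, node_id)` tuple layout is an illustrative assumption, with ties broken by node id so every replica deterministically picks the same winner.

```python
# Each write carries (value, timestamp, node_id); the merge keeps the
# version with the larger timestamp, breaking ties by node id so every
# replica makes the same choice regardless of merge order.
def lww_merge(v1, v2):
    return max(v1, v2, key=lambda v: (v[1], v[2]))

write_a = ("blue", 1700000001, "A")    # earlier write on Leader A
write_b = ("green", 1700000002, "B")   # later write on Leader B
assert lww_merge(write_a, write_b) == write_b
assert lww_merge(write_b, write_a) == write_b   # order-independent
```

Note that LWW silently discards the losing write, which is why CRDTs or user-level resolution are preferred when lost updates are unacceptable.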
4.3 Leaderless Replication (Quorum)
No leader; clients read/write to multiple nodes simultaneously.
write(x=1)
Client --> [Node A] OK
--> [Node B] OK
--> [Node C] FAIL (down)
W=2 success --> Write succeeds!
read(x)
Client --> [Node A] x=1
--> [Node B] x=1
--> [Node C] x=0 (stale)
R=2 --> return latest value (x=1)
Quorum formula: W + R > N guarantees reading the latest data
| Config | N | W | R | Characteristics |
|---|---|---|---|---|
| Strong consistency | 3 | 2 | 2 | Consistency guaranteed |
| Fast writes | 3 | 1 | 3 | Fast writes, slow reads |
| Fast reads | 3 | 3 | 1 | Slow writes, fast reads |
- Used by Cassandra, DynamoDB, Riak
- Anti-entropy and read repair fix inconsistencies
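The quorum read plus read-repair flow above can be sketched as follows; the dict-based replica representation is purely illustrative.

```python
# Read from R replicas, return the highest-versioned value, and write
# it back to any stale replica (read repair).
def quorum_read(replicas, r):
    responses = [(node, node["version"]) for node in replicas[:r]]
    winner, latest = max(responses, key=lambda x: x[1])
    for node, version in responses:
        if version < latest:              # read repair: fix stale replica
            node["version"] = latest
            node["value"] = winner["value"]
    return winner["value"]

replicas = [{"version": 2, "value": 1},
            {"version": 1, "value": 0},   # stale
            {"version": 2, "value": 1}]
assert quorum_read(replicas, 2) == 1
assert replicas[1]["value"] == 1          # stale replica repaired
```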
5. Partitioning (Sharding)
5.1 Why Partition
When data exceeds a single node's capacity, distribute it across multiple nodes.
5.2 Hash Partitioning
hash(key) mod N = partition_number
key="user_123" --> hash --> 7 --> 7 mod 3 = 1 --> Partition 1
key="user_456" --> hash --> 2 --> 2 mod 3 = 2 --> Partition 2
- Pro: Even data distribution
- Con: Inefficient range queries, massive redistribution on node changes
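A quick sketch of the redistribution problem: counting how many keys change partition when a hash-mod-N cluster grows from 3 to 4 nodes (md5 stands in for any uniform hash).

```python
# With hash(key) mod N, growing from 3 to 4 nodes reassigns every key
# whose hash has a different residue mod 3 and mod 4 — about 3/4 of them.
import hashlib

def partition(key, n):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user_{i}" for i in range(10_000)]
moved = sum(partition(k, 3) != partition(k, 4) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # typically around 75%
```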
5.3 Range Partitioning
Partition 0: A-F
Partition 1: G-M
Partition 2: N-S
Partition 3: T-Z
- Pro: Efficient range queries
- Con: Hotspots possible (data concentrated in certain ranges)
5.4 Consistent Hashing
Minimizes key redistribution when nodes are added or removed.
            0
      330       30
   300             60
   |      Ring      |
   270             90
      240      120
           180
Node A: 0-90
Node B: 90-180
Node C: 180-270
Node D: 270-360
Add Node E at position 135 --> only keys 90-135 move from B to E
- Virtual nodes eliminate imbalances
- Used by Cassandra, DynamoDB, Memcached
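A minimal consistent-hash ring sketch with virtual nodes, using `bisect` to find the first node clockwise from a key's hash; the class name and vnode count are illustrative assumptions.

```python
# Each physical node is hashed to `vnodes` positions on the ring; a key
# goes to the first node position clockwise from its own hash.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def get_node(self, key):
        # First ring position >= the key's hash, wrapping past the top
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["A", "B", "C"])
bigger = ConsistentHashRing(["A", "B", "C", "D"])
keys = [f"user_{i}" for i in range(2000)]
moved = sum(ring.get_node(k) != bigger.get_node(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 1/4, not 3/4
```

Compare this with the mod-N example in 5.2: adding one node moves only the keys the new node takes over, rather than reshuffling almost everything.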
5.5 Rebalancing Strategies
| Strategy | Description | Examples |
|---|---|---|
| Fixed partition count | Set partition count upfront, assign to nodes | Elasticsearch, Riak |
| Dynamic splitting | Split/merge based on partition size | HBase, RethinkDB |
| Proportional to nodes | Partition count scales with node count | Cassandra |
6. Consensus Algorithms
6.1 The Consensus Problem
The consensus problem is getting all nodes in a distributed system to agree on a single value. Consensus is essential for leader election, atomic commits, and distributed locks.
6.2 Paxos Basics
Proposed by Leslie Lamport in 1989. Notoriously difficult to understand.
Three roles:
- Proposer: Proposes a value
- Acceptor: Accepts or rejects values
- Learner: Learns the agreed-upon value
Two phases:
Phase 1: Prepare
Proposer --> Acceptor: "Please prepare for proposal number n"
Acceptor --> Proposer: "OK" (includes previously accepted value if any)
Phase 2: Accept
Proposer --> Acceptor: "Please accept proposal n with value v"
Acceptor --> Proposer: "Accepted" (consensus reached if majority accepts)
- Problem: Livelock possible; solved by Multi-Paxos
6.3 Raft in Detail
Proposed by Diego Ongaro and John Ousterhout in 2014. Designed to be more understandable than Paxos.
Core idea: Leader-based consensus
Node states: Leader, Follower, Candidate
6.3.1 Leader Election
Time -->
Term 1 Term 2
Follower: [Heartbeat]... [Timeout!] --> Candidate
Candidate: Votes for self + sends RequestVote to others
Gets majority --> Elected as Leader!
Leader: [Heartbeat]-->[Heartbeat]-->[Heartbeat]-->...
# Raft leader election pseudocode
import random

class RaftNode:
    def __init__(self):
        self.state = "follower"
        self.current_term = 0
        self.voted_for = None
        self.election_timeout = random.uniform(150, 300)  # ms

    def on_timeout(self):
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.id
        votes = 1  # self-vote
        for peer in self.peers:
            if peer.request_vote(self.current_term, self.log):
                votes += 1
        if votes > (len(self.peers) + 1) // 2:  # majority of the full cluster
            self.state = "leader"
            self.send_heartbeats()
6.3.2 Log Replication
Client --> Leader: write(x=5)
Leader Log: [Term1:x=3] [Term1:y=7] [Term2:x=5]
| AppendEntries
Follower1: [Term1:x=3] [Term1:y=7] [Term2:x=5] OK
Follower2: [Term1:x=3] [Term1:y=7] [Term2:x=5] OK
Follower3: [Term1:x=3] [Term1:y=7] (network delay)
Majority (3/4: leader + Followers 1-2) replicated --> Commit! --> Respond to client
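The commit rule in the diagram can be sketched as follows: an entry commits once a majority of nodes (leader included) have stored it. This is a simplification — real Raft additionally requires the entry to come from the leader's current term.

```python
# The leader's commit index is the highest log index stored on a
# majority of nodes (counting the leader itself).
def commit_index(leader_last_index, follower_match_index):
    indexes = sorted([leader_last_index] + follower_match_index, reverse=True)
    majority = len(indexes) // 2 + 1
    return indexes[majority - 1]  # highest index present on a majority

# Leader and Followers 1-2 hold 3 entries; Follower 3 lags at 2 entries
assert commit_index(3, [3, 3, 2]) == 3
```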
6.3.3 Safety Properties
- Election Safety: At most one leader per term
- Leader Append-Only: Leader never overwrites or deletes log entries
- Log Matching: If two logs have entries with the same index and term, all preceding entries are identical
- State Machine Safety: Committed entries are applied in the same order on all nodes
6.4 Raft in Production
| System | Use |
|---|---|
| etcd | Kubernetes cluster state storage |
| Consul | Service discovery, KV store |
| CockroachDB | Distributed SQL database |
| TiKV | Distributed KV engine for TiDB |
7. Distributed Transactions
7.1 Two-Phase Commit (2PC)
Phase 1: Prepare (Vote)
Coordinator --> Participant A: "Can you commit?"
Coordinator --> Participant B: "Can you commit?"
A --> Coordinator: "Yes"
B --> Coordinator: "Yes"
Phase 2: Commit (Execute)
Coordinator --> A: "Commit!"
Coordinator --> B: "Commit!"
Problems:
- Coordinator is a single point of failure (SPOF)
- Blocking: If coordinator fails after Prepare, participants wait indefinitely
- Performance: Synchronous protocol, so it is slow
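The two phases above can be sketched as a coordinator loop; `Participant` here is a hypothetical stand-in for a resource manager.

```python
# Coordinator commits only if every participant votes yes in the
# prepare phase; any "no" vote aborts the whole transaction.
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: Prepare (vote)
    if all(p.prepare() for p in participants):
        # Phase 2: Commit
        for p in participants:
            p.commit()
        return "COMMITTED"
    # Any participant voted no: abort everywhere
    for p in participants:
        p.abort()
    return "ABORTED"

assert two_phase_commit([Participant(), Participant()]) == "COMMITTED"
assert two_phase_commit([Participant(), Participant(can_commit=False)]) == "ABORTED"
```

The blocking problem is visible in this sketch: if the coordinator crashes between the two phases, prepared participants have no way to decide on their own.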
7.2 Three-Phase Commit (3PC)
Adds a Pre-Commit phase to solve 2PC's blocking problem.
Phase 1: CanCommit
Phase 2: PreCommit (with timeout)
Phase 3: DoCommit
- Still inconsistent under network partitions
- Rarely used in practice
7.3 Saga Pattern
Splits a long transaction into local transactions, with compensating transactions on failure.
Order Creation Saga:
T1: Create Order --> T2: Process Payment --> T3: Deduct Inventory --> T4: Start Shipping
                                                  | FAIL!
                   C1: Cancel Order <-- C2: Refund Payment
Compensations run in reverse order: refund the payment first, then cancel the order.
Two implementation styles:
| Style | Description | Trade-offs |
|---|---|---|
| Choreography | Event-driven, each service publishes next event | Simple, low coupling. Hard to trace |
| Orchestration | Central orchestrator controls the sequence | Easy to trace, centralization risk |
# Saga Orchestrator pseudocode
class SagaFailed(Exception):
    pass

class OrderSaga:
    steps = [
        ("create_order", "cancel_order"),
        ("process_payment", "refund_payment"),
        ("update_inventory", "restore_inventory"),
        ("arrange_shipping", "cancel_shipping"),
    ]

    def execute(self):
        completed = []
        for action, compensation in self.steps:
            try:
                getattr(self, action)()
                completed.append(compensation)
            except Exception:
                # Run compensations in reverse order
                for comp in reversed(completed):
                    getattr(self, comp)()
                raise SagaFailed()
8. Distributed Clocks
8.1 The Problem with Physical Clocks
- NTP synchronization error: milliseconds to hundreds of milliseconds
- Clock drift: Diverges over time
- Leap seconds: Occasionally an extra second is added or removed
Conclusion: Physical clocks alone cannot determine event ordering accurately.
8.2 Lamport Clock (Logical Clock)
Proposed by Leslie Lamport in 1978. Each process maintains a counter.
Rules:
1. Increment counter on each event
2. Include counter when sending messages
3. On receive: max(own counter, received counter) + 1
Process A: (1)--> (2)--> (3)----------------> (6)
                        send                  recv
Process B:        (1)--> (2)--> (4)--> (5)
                               recv   send
- Limitation: Can determine causality but not whether two events are concurrent
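The three rules above can be sketched as a small class:

```python
# Lamport clock: a per-process counter that ticks on every event and
# jumps forward past any timestamp seen in an incoming message.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time          # timestamp attached to the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.local_event(); a.local_event()   # A at 2
ts = a.send()                      # A at 3, message carries 3
b.local_event()                    # B at 1
assert b.receive(ts) == 4          # max(1, 3) + 1
```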
8.3 Vector Clock
Each node maintains a vector of counters for all nodes.
3 nodes: A, B, C
A: [1,0,0] --> [2,0,0] --> [2,0,0] send to B
B: [0,0,0] --> [0,1,0] --> [2,2,0] recv from A --> [2,3,0] send to C
C: [0,0,0] --> [0,0,1] --> [2,3,2] recv from B
Determining concurrency:
- V1 <= V2 (every element of V1 is less than or equal to the corresponding element of V2): V1 happens-before V2
- V1 and V2 are incomparable (each has some element greater than the other's): concurrent events
def compare_vector_clocks(vc1, vc2):
    """Compare two vector clocks."""
    less = False
    greater = False
    for a, b in zip(vc1, vc2):
        if a < b:
            less = True
        elif a > b:
            greater = True
    if less and not greater:
        return "BEFORE"       # vc1 < vc2
    elif greater and not less:
        return "AFTER"        # vc1 > vc2
    elif not less and not greater:
        return "EQUAL"
    else:
        return "CONCURRENT"   # concurrent events!
8.4 Hybrid Logical Clock (HLC)
Combines physical clock + logical clock. Used by CockroachDB and MongoDB.
HLC = (physical_time, logical_counter)
When physical times are equal, the logical counter determines ordering
--> Guarantees causal ordering within NTP error bounds
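A sketch of the HLC update rules, following the standard formulation (Kulkarni et al.); the class and method names are illustrative.

```python
# HLC timestamp = (physical time, logical counter). The counter breaks
# ties so causal ordering survives clock skew between nodes.
class HybridLogicalClock:
    def __init__(self, now):
        self.pt = now   # highest physical time seen so far
        self.lc = 0     # logical counter

    def tick(self, now):
        """Advance the clock for a local or send event."""
        if now > self.pt:
            self.pt, self.lc = now, 0
        else:
            self.lc += 1
        return (self.pt, self.lc)

    def receive(self, now, msg):
        """Merge a remote (pt, lc) timestamp on message receipt."""
        msg_pt, msg_lc = msg
        old_pt = self.pt
        self.pt = max(old_pt, msg_pt, now)
        if self.pt == old_pt == msg_pt:
            self.lc = max(self.lc, msg_lc) + 1
        elif self.pt == old_pt:
            self.lc += 1
        elif self.pt == msg_pt:
            self.lc = msg_lc + 1
        else:
            self.lc = 0
        return (self.pt, self.lc)

a = HybridLogicalClock(100)
ts = a.tick(100)                 # send event at physical time 100
b = HybridLogicalClock(95)       # receiver's clock runs 5 behind
assert b.receive(95, ts) > ts    # causal order preserved despite skew
```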
9. Failure Detection
9.1 Heartbeat Approach
Node A --> "Are you alive?" --> Node B
Node A <-- "Yes!" <-- Node B
... timeout expires ...
Node A --> "Are you alive?" --> Node B
Node A <-- (no response) <-- Node B
--> Consider Node B failed
- Simple but causes false positives during network delays
- Timeout setting is critical: too short means false positives, too long means slow detection
9.2 Phi Accrual Failure Detector
Learns the distribution of heartbeat arrival times and calculates failure probability.
The phi value expresses how confident the detector is that the silence means failure (i.e., the chance the node is actually still alive):
phi = 1 --> ~10% chance the suspicion is wrong
phi = 2 --> ~1% chance
phi = 3 --> ~0.1% chance
If phi exceeds a threshold (e.g., phi > 8) --> consider the node failed
- Used by Cassandra and Akka
- Adapts to changing network conditions
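A sketch of the phi calculation, assuming (as the original Hayashibara et al. detector does) that heartbeat inter-arrival times follow a normal distribution; the function signature is illustrative.

```python
# phi = -log10(P(a heartbeat arrives this late or later)), using the
# normal survival function over observed inter-arrival times.
import math
import statistics

def phi(time_since_last, intervals):
    mean = statistics.mean(intervals)
    stdev = statistics.stdev(intervals) or 0.1   # guard against zero variance
    z = (time_since_last - mean) / stdev
    p_later = 0.5 * math.erfc(z / math.sqrt(2))  # normal survival function
    return -math.log10(max(p_later, 1e-12))

intervals = [100, 105, 98, 102, 99, 101]   # ms between past heartbeats
assert phi(101, intervals) < 2             # barely late: low suspicion
assert phi(300, intervals) > 8             # far outside the distribution
```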
9.3 SWIM Protocol
Scalable Weakly-consistent Infection-style Membership. An efficient group membership protocol.
1. Node A pings Node B
2. No response --> A asks Nodes C, D to "indirect ping B"
3. C, D ping B --> still no response, B is suspected
4. After timeout, if still no response, B is removed
5. Gossip protocol spreads membership changes
- Propagates to all nodes in O(log N) rounds
- Used by HashiCorp Memberlist and Consul
10. Real-World System Analysis
10.1 Apache Cassandra (AP System)
Architecture:
- Leaderless, Consistent Hashing
- Quorum reads/writes (W + R > N)
- Gossip protocol for membership
- Last-Write-Wins conflict resolution
- SSTable + Memtable storage
PACELC: PA/EL
- During partition: Availability first
- Normal operation: Low latency first
10.2 Google Spanner (CP System)
Architecture:
- Globally distributed, strong consistency
- TrueTime API: GPS + atomic clocks provide time uncertainty intervals
- Paxos-based replication
- External consistency guarantee
TrueTime:
TT.now() --> [earliest, latest]
Uncertainty interval typically 1-7ms
On transaction commit:
1. Assign timestamp s
2. commit-wait: wait until TT.after(s) is true
3. Wait for uncertainty interval to guarantee ordering
10.3 Amazon DynamoDB (AP System)
Architecture (largely from the original 2007 Dynamo paper, on which DynamoDB builds):
- Consistent Hashing + virtual nodes
- Sloppy Quorum: other nodes respond during failures
- Hinted Handoff: data delivered to original node after recovery
- Vector Clocks for versioning
- Merkle Trees for anti-entropy
10.4 System Comparison
| Feature | Cassandra | Spanner | DynamoDB |
|---|---|---|---|
| CAP Classification | AP | CP | AP |
| Consistency | Tunable | Strong | Tunable |
| Replication | Leaderless | Paxos | Leaderless |
| Partitioning | Consistent Hash | Range | Consistent Hash |
| Clocks | NTP | TrueTime | Vector Clock |
| Query Language | CQL | SQL | PartiQL |
11. DDIA Key Takeaways
Key insights from Martin Kleppmann's "Designing Data-Intensive Applications":
Part 1: Foundations of Data Systems
| Chapter | Key Concepts |
|---|---|
| Ch. 1 | Reliability, scalability, maintainability |
| Ch. 2 | Relational vs document vs graph models |
| Ch. 3 | Storage engines: B-Tree vs LSM-Tree |
| Ch. 4 | Encoding: JSON, Protobuf, Avro |
Part 2: Distributed Data
| Chapter | Key Concepts |
|---|---|
| Ch. 5 | Replication: Leader-Follower, Leaderless |
| Ch. 6 | Partitioning: Hash, Range |
| Ch. 7 | Transactions: ACID, isolation levels |
| Ch. 8 | Challenges of distributed systems |
| Ch. 9 | Consistency and consensus |
Part 3: Derived Data
| Chapter | Key Concepts |
|---|---|
| Ch. 10 | Batch processing: MapReduce |
| Ch. 11 | Stream processing: Event sourcing |
| Ch. 12 | The future of data systems |
12. Interview Questions (15)
Q1. Explain the CAP theorem and give real-world examples.
The CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance. Since network partitions are unavoidable, the practical choice is between CP (consistency first) and AP (availability first). ZooKeeper is a CP system suitable for leader election and distributed locks, while Cassandra is an AP system suited for high write throughput workloads.
Q2. Describe the leader election process in Raft.
In Raft, when a follower does not receive a heartbeat within its election timeout, it becomes a candidate. It increments its term, votes for itself, and sends RequestVote RPCs to other nodes. If it receives votes from a majority, it becomes the leader and immediately sends heartbeats. At most one leader can exist per term.
Q3. Explain the difference between Eventual and Strong Consistency.
Strong Consistency guarantees every read returns the most recent write. It requires synchronous replication, leading to high latency and lower availability. Eventual Consistency guarantees that if updates stop, all replicas will eventually converge to the same value. Intermediate reads may return stale data, but latency is low and availability is high. Used by DNS and cache systems.
Q4. Explain Consistent Hashing and why virtual nodes are needed.
Consistent Hashing maps the hash space onto a ring, so adding or removing nodes only redistributes a minimal number of keys. Keys and nodes are mapped using the same hash function, and each key is assigned to the nearest node clockwise. Virtual nodes give each physical node multiple hash positions, smoothing out data distribution imbalances.
Q5. Compare 2PC and the Saga pattern.
2PC (Two-Phase Commit) has a coordinator that instructs all participants to prepare/commit, ensuring atomic commits. It is synchronous, blocking, and the coordinator is a SPOF. The Saga pattern breaks a long transaction into a chain of local transactions, running compensating transactions in reverse on failure. It is asynchronous and scales well but does not guarantee isolation.
Q6. What advantage does a Vector Clock have over a Lamport Clock?
A Lamport Clock can determine happens-before ordering but cannot tell if two events are concurrent. A Vector Clock maintains a vector of counters for every node, enabling accurate detection of concurrency. When two vectors are incomparable, the events are concurrent and require conflict resolution.
Q7. Compare Leader-Follower, Multi-Leader, and Leaderless replication.
Leader-Follower has a single write point with no conflicts, but the leader is a SPOF and write scaling is limited. Multi-Leader offers low write latency across datacenters but requires complex conflict resolution. Leaderless provides high availability and write throughput but needs quorum configuration and conflict resolution strategies.
Q8. How does PACELC extend CAP?
PACELC adds a tradeoff for normal operation on top of CAP. During a partition (P), you choose between availability (A) and consistency (C). Otherwise (E: else), you choose between latency (L) and consistency (C). Cassandra is PA/EL (availability during partition, low latency normally), while Spanner is PC/EC (consistency always).
Q9. How does Google Spanner's TrueTime guarantee strong consistency?
TrueTime uses GPS and atomic clocks to provide a time uncertainty interval. On transaction commit, a timestamp is assigned and commit-wait pauses for the uncertainty interval. This ensures later commits always receive larger timestamps, guaranteeing external consistency globally.
Q10. Why does W + R > N guarantee consistency in quorum reads/writes?
With N nodes, writing to W and reading from R, when W + R > N, the write set and read set always overlap. The overlapping node holds the latest data, so reads are guaranteed to return the most recent value. For example, with N=3, W=2, R=2, at least one node is shared.
Q11. How do you resolve a split-brain problem in distributed systems?
Split-brain occurs when a network partition causes two partitions to each believe they have a leader. Solutions include quorum-based leader election (partitions without a majority cannot elect a leader), fencing tokens (rejecting requests from old leaders), and STONITH (Shoot The Other Node In The Head — forcibly shutting down suspect nodes).
Q12. Compare LSM-Tree and B-Tree from a distributed systems perspective.
LSM-Tree is write-optimized with all writes being sequential, cleaned up via compaction. Ideal for write-heavy workloads (Cassandra, HBase). B-Tree is read-optimized with in-place updates, suitable for read-heavy workloads (traditional RDBMS). In distributed environments, LSM-Trees offer lower replication lag and higher write throughput.
Q13. Why is a Phi Accrual Failure Detector better than simple timeouts?
Simple timeouts declare failure if no response arrives within a fixed time, causing many false positives when network latency varies. The Phi Accrual Failure Detector learns the statistical distribution of heartbeat arrival times and calculates a phi value representing how much the current arrival deviates from the distribution. It adapts to changing network conditions.
Q14. What is the difference between Choreography and Orchestration in the Saga pattern?
In Choreography, each service publishes events and other services subscribe to trigger the next step. No central control, low coupling, but hard to trace. In Orchestration, a central orchestrator controls the entire flow. Easier to trace and debug, but the orchestrator can become a SPOF.
Q15. Why is idempotency important in distributed systems?
Network failures can cause requests to be retried, so performing the same request multiple times must produce the same result. Idempotency keys detect duplicate requests, preventing double processing in operations with side effects like payments or inventory deductions.
13. Quiz (5)
Quiz 1. With N=5, W=3, R=2 — is consistent reading guaranteed?
No. W + R = 3 + 2 = 5, which equals N rather than exceeding it, so the write set and read set are not guaranteed to overlap — a read could hit exactly the two replicas that missed the latest write. To guarantee consistent reads, raise W or R so that W + R > N (e.g., W=3, R=3).
Quiz 2. In Raft, a leader at Term 5 receives a RequestVote for Term 3. What happens?
It rejects the request. In Raft, a node always rejects requests with a term lower than its current term. This prevents interference from stale leaders.
Quiz 3. What is the relationship between Vector Clock V1=[2,3,1] and V2=[3,2,1]?
They are concurrent. V1's first element (2) is less than V2's (3), but V1's second element (3) is greater than V2's (2). Since neither vector is fully less than or equal to the other, they are incomparable, indicating concurrent events.
Quiz 4. What is the difference between ONE, QUORUM, and ALL consistency levels in Cassandra?
- ONE: One node response suffices. Fastest but may read stale data
- QUORUM: Majority (N/2+1) nodes must respond. Balanced choice
- ALL: All nodes must respond. Slowest but strong consistency. Fails if any node is down
Quiz 5. What happens if the coordinator crashes after Prepare in 2PC?
Participants are blocked. Participants that responded "Yes" to Prepare can neither commit nor abort, waiting indefinitely with resources locked. This is the biggest drawback of 2PC, addressed by 3PC or coordinator log recovery.
14. References
Books
- Martin Kleppmann, "Designing Data-Intensive Applications" (2017)
- Maarten van Steen, Andrew S. Tanenbaum, "Distributed Systems" (4th ed., 2023)
- Mikito Takada, "Distributed Systems for Fun and Profit" (free online)
Papers
- Brewer, E., "CAP Twelve Years Later" (IEEE Computer, 2012)
- Lamport, L., "The Part-Time Parliament" (1998) — Original Paxos paper
- Ongaro, D. & Ousterhout, J., "In Search of an Understandable Consensus Algorithm" (2014) — Raft paper
- Lamport, L., "Time, Clocks, and the Ordering of Events in a Distributed System" (1978)
- DeCandia, G. et al., "Dynamo: Amazon's Highly Available Key-value Store" (2007)
- Corbett, J.C. et al., "Spanner: Google's Globally-Distributed Database" (2012)
Web Resources
- Jepsen — Distributed systems safety testing: https://jepsen.io/
- The Raft Consensus Algorithm visualization: https://raft.github.io/
- Aphyr's distributed systems class: https://github.com/aphyr/distsys-class
- AWS re:Invent distributed systems talk series
- Martin Kleppmann's blog: https://martin.kleppmann.com/
- Marc Brooker's blog: https://brooker.co.za/blog/
- Distributed Systems Reading Group: http://dsrg.pdos.csail.mit.edu/