Split View: 분산 시스템 기초 완전 가이드 2025: 합의 알고리즘, 복제, 일관성 모델, 장애 허용

분산 시스템 기초 완전 가이드 2025: 합의 알고리즘, 복제, 일관성 모델, 장애 허용

1. 왜 분산 시스템인가

1.1 분산 시스템이 필요한 이유

단일 서버로는 해결할 수 없는 세 가지 근본적인 이유가 있습니다.

확장성(Scalability): 단일 머신의 CPU, 메모리, 디스크에는 물리적 한계가 있습니다
가용성(Availability): 단일 장애점(SPOF)을 제거하여 서비스 연속성을 보장합니다
지리적 분산(Geographic Distribution): 사용자에게 가까운 위치에서 서비스하여 지연을 줄입니다

분산 시스템의 핵심 도전 과제
┌─────────────────────────────────────────────┐
│                                             │
│  네트워크는 신뢰할 수 없다                    │
│  ├── 패킷 손실                              │
│  ├── 지연 변동                              │
│  └── 네트워크 파티션                         │
│                                             │
│  노드는 실패할 수 있다                       │
│  ├── 크래시                                 │
│  ├── 슬로우다운                             │
│  └── 비잔틴 장애 (악의적 동작)               │
│                                             │
│  글로벌 시간이 존재하지 않는다                │
│  ├── 클럭 드리프트                           │
│  ├── NTP 오차                               │
│  └── 이벤트 순서 결정의 어려움               │
│                                             │
└─────────────────────────────────────────────┘

1.2 분산 컴퓨팅의 8가지 오류 (Fallacies of Distributed Computing)

Peter Deutsch가 정의한 8가지 잘못된 가정입니다.

번호	오류 (Fallacy)	현실
1	네트워크는 신뢰할 수 있다	패킷 손실, 지연, 파티션이 발생한다
2	지연은 0이다	물리적 거리에 비례하여 지연이 발생한다
3	대역폭은 무한하다	네트워크 대역폭에는 한계가 있다
4	네트워크는 안전하다	보안 위협이 항상 존재한다
5	토폴로지는 변하지 않는다	네트워크 구성은 계속 변한다
6	관리자가 한 명이다	여러 조직이 네트워크를 관리한다
7	전송 비용은 0이다	데이터 전송에는 비용이 든다
8	네트워크는 동질적이다	다양한 장비와 프로토콜이 혼재한다

2. CAP 정리와 PACELC

2.1 CAP 정리

Eric Brewer가 2000년에 제안한 CAP 정리는 분산 시스템 설계의 기본 원칙입니다.

CAP 정리
┌─────────────────────────────────────────────┐
│                                             │
│            Consistency (C)                  │
│               /\                            │
│              /  \                           │
│             /    \                          │
│            / CP   \  CA                     │
│           /  시스템 \ 시스템                  │
│          /          \                       │
│         /____________\                      │
│  Availability (A) ─── Partition             │
│                       Tolerance (P)         │
│                                             │
│  CP: HBase, MongoDB, etcd, ZooKeeper       │
│  AP: Cassandra, DynamoDB, CouchDB          │
│  CA: 단일 노드 RDBMS (이론적)               │
│                                             │
│  네트워크 파티션 시 C 또는 A 중 선택 필수    │
└─────────────────────────────────────────────┘

세 가지 속성:

C (Consistency): 모든 노드가 동일한 데이터를 보장
A (Availability): 모든 요청이 응답을 받음 (장애 없는 노드)
P (Partition Tolerance): 네트워크 파티션 발생 시에도 동작

핵심: 네트워크 파티션은 불가피하므로 P는 필수입니다. 따라서 실제 선택은 CP vs AP입니다.

2.2 PACELC 확장

Daniel Abadi가 제안한 PACELC는 CAP을 확장하여 정상 상태에서의 트레이드오프도 다룹니다.

PACELC 정리
if (Partition) then
    Availability vs Consistency
else
    Latency vs Consistency

┌─────────────────────────────────────────────┐
│ 시스템         │ P 발생 시 │ 정상 시 (E)    │
│───────────────│──────────│───────────────│
│ DynamoDB      │ PA       │ EL (낮은 지연) │
│ Cassandra     │ PA       │ EL             │
│ MongoDB       │ PC       │ EC (강한 일관성)│
│ Google Spanner│ PC       │ EC             │
│ CockroachDB   │ PC       │ EL (읽기 최적화)│
│ PNUTS (Yahoo) │ PC       │ EL             │
└─────────────────────────────────────────────┘

3. 일관성 모델

3.1 일관성 스펙트럼

강한 일관성 ◄─────────────────────────────► 약한 일관성
(Strong)                                    (Weak)

선형화 가능성  순차적    인과적    최종적
(Linearizable) (Sequential) (Causal) (Eventual)

│ 성능 낮음    │          │         │ 성능 높음  │
│ 구현 복잡    │          │         │ 구현 단순  │
│ 높은 보장    │          │         │ 낮은 보장  │

3.2 선형화 가능성 (Linearizability)

가장 강한 일관성 보장입니다. 모든 연산이 실시간 순서와 일치해야 합니다.

선형화 가능성 (Linearizable)
Client A: ──write(x=1)──────────────────────
Client B: ────────────read(x)→1─────────────
Client C: ──────────────────read(x)→1───────

모든 클라이언트가 쓰기 완료 후 항상 최신 값을 읽음

비선형화 (Not Linearizable)
Client A: ──write(x=1)──────────────────────
Client B: ────────────read(x)→1─────────────
Client C: ──────────────────read(x)→0───────  ← 오래된 값!

3.3 최종적 일관성 (Eventual Consistency)

업데이트가 중단 없이 계속되면 결국 모든 복제본이 동일한 상태에 도달합니다.

# Eventual Consistency를 가정한 읽기 패턴
class EventuallyConsistentStore:
    def __init__(self, replicas: list):
        self.replicas = replicas

    def write(self, key: str, value: str, timestamp: float):
        """모든 복제본에 비동기 쓰기"""
        for replica in self.replicas:
            replica.async_put(key, value, timestamp)

    def read(self, key: str) -> str:
        """읽기 - 어떤 복제본에서든 읽을 수 있음"""
        # 최신 타임스탬프 값 선택 (read-repair)
        values = [r.get(key) for r in self.replicas]
        return max(values, key=lambda v: v.timestamp).value

    def read_your_writes(self, key: str, session_token: str) -> str:
        """자신의 쓰기를 읽을 수 있도록 보장"""
        # 세션 토큰으로 마지막 쓰기 위치 확인
        last_write_replica = self.get_replica_for_session(session_token)
        return last_write_replica.get(key).value

3.4 인과적 일관성 (Causal Consistency)

인과적으로 관련된 연산의 순서만 보장합니다.

인과적 일관성 예시:
A: write(x=1)    ← B가 x=1을 읽은 후 y=2를 쓴 경우
B: read(x)→1, write(y=2)
C: read(y)→2, read(x)→?

인과적 일관성이 보장되면:
  C가 y=2를 볼 수 있다면, x=1도 반드시 볼 수 있어야 합니다.
  (y=2의 인과 원인인 x=1이 먼저 보여야 함)

3.5 Read-Your-Writes 일관성

자신이 쓴 데이터를 항상 읽을 수 있습니다.

Read-Your-Writes
Client A: write(x=1) ───→ read(x)→1  (항상 보장)
                           ↑
                    자신의 쓰기를 읽음

Session Consistency: 동일 세션 내에서 보장
Monotonic Reads: 한 번 새 값을 보면 이전 값으로 돌아가지 않음

4. 합의 알고리즘

4.1 Paxos

Leslie Lamport가 제안한 최초의 합의 알고리즘입니다.

Basic Paxos (단일 값 합의)
┌─────────────────────────────────────────────┐
│                                             │
│  Phase 1a: Prepare                          │
│  Proposer → Acceptor: prepare(n)            │
│  "제안 번호 n으로 준비해주세요"              │
│                                             │
│  Phase 1b: Promise                          │
│  Acceptor → Proposer: promise(n, prev_val)  │
│  "n 미만은 거부하겠습니다. 이전 수락값 전달" │
│                                             │
│  Phase 2a: Accept                           │
│  Proposer → Acceptor: accept(n, value)      │
│  "이 값을 수락해주세요"                      │
│                                             │
│  Phase 2b: Accepted                         │
│  Acceptor → Learner: accepted(n, value)     │
│  "과반수가 수락하면 합의 완료"               │
│                                             │
└─────────────────────────────────────────────┘

4.2 Multi-Paxos

실제 시스템에서는 반복적인 합의를 위해 Multi-Paxos를 사용합니다. 안정적인 리더를 선출하여 Phase 1을 생략합니다.

4.3 Raft: 이해하기 쉬운 합의

Diego Ongaro가 2014년에 제안한 Raft는 Paxos와 동등한 안전성을 제공하면서도 이해하기 쉽게 설계되었습니다.

Raft 상태 전이
┌────────────────────────────────────────────────┐
│                                                │
│  ┌──────────┐  타임아웃   ┌──────────┐         │
│  │ Follower │──────────→│ Candidate│         │
│  │          │←──────────│          │         │
│  └──────────┘  실패/발견  └────┬─────┘         │
│       ↑                      │               │
│       │    과반수 투표 획득    │               │
│       │                      ↓               │
│       │              ┌──────────┐            │
│       └──────────────│  Leader  │            │
│         새 리더 발견   │          │            │
│                      └──────────┘            │
│                                                │
│  Term (임기):                                  │
│  ┌───┬───┬───┬───┬───┐                        │
│  │ 1 │ 2 │ 3 │ 4 │ 5 │ ...                   │
│  └───┴───┴───┴───┴───┘                        │
│  각 Term마다 최대 하나의 Leader                 │
└────────────────────────────────────────────────┘

Raft Leader Election

Leader Election 과정
┌────────────────────────────────────────────────┐
│ 1. Follower가 heartbeat 타임아웃              │
│ 2. Term 증가, 자신에게 투표                    │
│ 3. RequestVote RPC를 다른 노드에 전송          │
│ 4. 과반수 투표를 받으면 Leader 됨              │
│                                                │
│ [Node A: Follower]  [Node B: Follower]         │
│ [Node C: Follower]  [Node D: Follower]         │
│ [Node E: Follower]                             │
│                                                │
│ Node A 타임아웃 → Candidate (Term 2)           │
│ A→B: RequestVote(Term 2) → B: Grant            │
│ A→C: RequestVote(Term 2) → C: Grant            │
│ A→D: RequestVote(Term 2) → D: Grant            │
│                                                │
│ A가 과반수(3/5) 획득 → Leader!                 │
│ A→all: AppendEntries(heartbeat)                │
└────────────────────────────────────────────────┘

Raft Log Replication

Log Replication
┌────────────────────────────────────────────────┐
│                                                │
│  Client → Leader: write(x=5)                   │
│                                                │
│  Leader Log:  [1:x=3] [2:y=7] [3:x=5]         │
│  Follower A:  [1:x=3] [2:y=7]                 │
│  Follower B:  [1:x=3] [2:y=7]                 │
│  Follower C:  [1:x=3]                         │
│  Follower D:  [1:x=3] [2:y=7]                 │
│                                                │
│  1. Leader가 로그에 엔트리 추가                 │
│  2. AppendEntries RPC로 복제                   │
│  3. 과반수 복제 확인 시 커밋                    │
│  4. 클라이언트에 응답                          │
│  5. 다음 heartbeat에서 커밋 알림               │
│                                                │
│  안전성: Leader는 커밋된 모든 엔트리를 포함     │
│  (Election Restriction)                        │
└────────────────────────────────────────────────┘

4.4 Raft 구현 예시 (의사코드)

from enum import Enum
from dataclasses import dataclass, field
from typing import Optional
import random
import time

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class LogEntry:
    term: int
    index: int
    command: str

@dataclass
class RaftNode:
    node_id: str
    state: NodeState = NodeState.FOLLOWER
    current_term: int = 0
    voted_for: Optional[str] = None
    log: list = field(default_factory=list)
    commit_index: int = 0
    last_applied: int = 0

    # Leader 전용
    next_index: dict = field(default_factory=dict)
    match_index: dict = field(default_factory=dict)

    def on_election_timeout(self):
        """선거 타임아웃: Candidate로 전환"""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        votes_received = 1  # 자신에게 투표

        # 모든 노드에 RequestVote 전송
        for peer in self.peers:
            response = self.send_request_vote(peer)
            if response and response.vote_granted:
                votes_received += 1

        if votes_received > len(self.peers) // 2:
            self.become_leader()

    def become_leader(self):
        """Leader로 전환"""
        self.state = NodeState.LEADER
        # next_index 초기화
        for peer in self.peers:
            self.next_index[peer] = len(self.log)
            self.match_index[peer] = 0
        self.send_heartbeats()

    def on_append_entries(self, term, leader_id, entries, leader_commit):
        """AppendEntries RPC 처리"""
        if term < self.current_term:
            return False

        self.state = NodeState.FOLLOWER
        self.current_term = term
        self.reset_election_timer()

        # 로그 일관성 확인 및 엔트리 추가
        self.log.extend(entries)

        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, len(self.log) - 1)
            self.apply_committed_entries()

        return True

    def replicate_log(self, command: str):
        """클라이언트 명령 처리 (Leader만)"""
        if self.state != NodeState.LEADER:
            return False

        entry = LogEntry(
            term=self.current_term,
            index=len(self.log),
            command=command
        )
        self.log.append(entry)

        # 모든 Follower에 복제
        ack_count = 1  # 자신
        for peer in self.peers:
            success = self.send_append_entries(peer, [entry])
            if success:
                ack_count += 1

        # 과반수 확인
        if ack_count > len(self.peers) // 2:
            self.commit_index = entry.index
            self.apply_committed_entries()
            return True
        return False

5. 복제 전략

5.1 리더-팔로워 (Leader-Follower)

리더-팔로워 복제
┌─────────────────────────────────────────────┐
│                                             │
│  Client ──write──→ ┌────────┐               │
│                    │ Leader │               │
│  Client ──read──→  │ (R/W)  │               │
│                    └───┬────┘               │
│                   복제  │  복제              │
│              ┌─────────┼─────────┐          │
│              ↓         ↓         ↓          │
│         ┌────────┐┌────────┐┌────────┐      │
│         │Follower││Follower││Follower│      │
│         │ (Read) ││ (Read) ││ (Read) │      │
│         └────────┘└────────┘└────────┘      │
│                                             │
│  동기 복제: 느리지만 데이터 손실 없음        │
│  비동기 복제: 빠르지만 데이터 손실 가능       │
│  반동기 복제: 하나 이상의 팔로워에 동기 복제  │
└─────────────────────────────────────────────┘

5.2 멀티리더 (Multi-Leader)

멀티리더 복제
┌─────────────────────────────────────────────┐
│                                             │
│  DC-US             DC-EU            DC-Asia │
│  ┌────────┐       ┌────────┐      ┌────────┐│
│  │Leader 1│◄─────►│Leader 2│◄────►│Leader 3││
│  │ (R/W)  │       │ (R/W)  │      │ (R/W)  ││
│  └───┬────┘       └───┬────┘      └───┬────┘│
│      │                │               │     │
│  ┌───┴───┐        ┌───┴───┐      ┌───┴───┐ │
│  │Follower│        │Follower│      │Follower│ │
│  └───────┘        └───────┘      └───────┘ │
│                                             │
│  장점: 쓰기 지연 감소, 오프라인 동작 가능    │
│  단점: 충돌 해결 필요 (LWW, CRDT 등)        │
└─────────────────────────────────────────────┘

충돌 해결 전략

# Last-Writer-Wins (LWW) - 가장 단순
def lww_resolve(conflict_a, conflict_b):
    """타임스탬프가 더 큰 값이 승리"""
    if conflict_a.timestamp > conflict_b.timestamp:
        return conflict_a
    return conflict_b

# Custom merge function
def custom_merge(conflict_a, conflict_b):
    """도메인 특화 머지 로직"""
    if conflict_a.type == "counter":
        return Value(conflict_a.value + conflict_b.value)
    elif conflict_a.type == "set":
        return Value(conflict_a.value.union(conflict_b.value))
    else:
        return lww_resolve(conflict_a, conflict_b)

5.3 무리더 (Leaderless) / 쿼럼

무리더 복제 (Dynamo 스타일)
┌─────────────────────────────────────────────┐
│                                             │
│  N=3 (총 복제본), W=2 (쓰기 쿼럼),          │
│  R=2 (읽기 쿼럼)                             │
│                                             │
│  Write: Client → Node1 (ack)                │
│                → Node2 (ack)    ← W=2 충족  │
│                → Node3 (timeout)             │
│                                             │
│  Read:  Client → Node1 (v=5, ts=100)        │
│                → Node2 (v=5, ts=100)  ← 최신│
│                → Node3 (v=3, ts=90)   ← 오래│
│                                             │
│  읽기 복구: 오래된 Node3에 최신값 전송       │
│                                             │
│  쿼럼 조건: W + R > N → 최소 하나의          │
│  교집합이 존재하여 최신값 읽기 보장           │
│                                             │
│  Sloppy Quorum: 장애 시 다른 노드로 대체    │
│  (Hinted Handoff)                           │
└─────────────────────────────────────────────┘

5.4 체인 복제 (Chain Replication)

체인 복제
┌─────────────────────────────────────────────┐
│                                             │
│  Write → [Head] → [Middle] → [Tail] → Read │
│                                             │
│  장점:                                      │
│  - 강한 일관성 보장                          │
│  - 읽기와 쓰기의 부하 분산                   │
│  - 구현이 비교적 단순                        │
│                                             │
│  단점:                                      │
│  - 쓰기 지연이 체인 길이에 비례               │
│  - Head나 Tail 장애 시 재구성 필요           │
│                                             │
│  사용 예: Azure Storage, HDFS               │
└─────────────────────────────────────────────┘

6. 분산 클럭

6.1 물리적 시계와 NTP

물리적 시계의 문제
┌─────────────────────────────────────────────┐
│                                             │
│  Node A 시계: 10:00:00.000                  │
│  Node B 시계: 10:00:00.150  (150ms 차이)    │
│  Node C 시계: 09:59:59.800  (200ms 차이)    │
│                                             │
│  NTP 동기화 정확도: 보통 수십ms              │
│  Google Spanner TrueTime: 약 7ms 오차       │
│                                             │
│  문제: 이벤트 A가 B보다 먼저 발생했는지      │
│  물리적 시계만으로는 판단 불가                │
└─────────────────────────────────────────────┘

6.2 Lamport 타임스탬프

Leslie Lamport가 제안한 논리적 시계입니다. 인과 관계의 순서를 보장합니다.

Lamport Timestamp
규칙:
1. 이벤트 발생 시 로컬 카운터 증가
2. 메시지 전송 시 카운터 포함
3. 메시지 수신 시 max(local, received) + 1

Node A:  [1] ──msg──→ [2] ────→ [3]
                        ↓
Node B:  [1] ────→ [3] (max(1,2)+1) → [4]
                                        ↓
Node C:  [1] ──→ [2] ───────────→ [5] (max(2,4)+1)

한계: a와 b의 Lamport 시간만으로
      인과 관계를 구별할 수 없음
      (L(a) < L(b)라도 a가 b의 원인이 아닐 수 있음)

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """로컬 이벤트 발생"""
        self.time += 1
        return self.time

    def send(self) -> int:
        """메시지 전송"""
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        """메시지 수신"""
        self.time = max(self.time, sender_time) + 1
        return self.time

6.3 벡터 클럭 (Vector Clock)

벡터 클럭
각 노드가 모든 노드의 카운터를 벡터로 유지

Node A: [A:1, B:0, C:0]
        ──msg──→
Node B: [A:0, B:1, C:0] → 수신 후 [A:1, B:2, C:0]
        ──msg──→
Node C: [A:0, B:0, C:1] → 수신 후 [A:1, B:2, C:2]

인과 관계 판단:
V1 < V2  ⟺  모든 i에 대해 V1[i] <= V2[i] 이고
              적어도 하나의 j에 대해 V1[j] < V2[j]

V1 || V2  (동시 발생) ⟺  V1 < V2도 V2 < V1도 아님

class VectorClock:
    def __init__(self, node_id: str, nodes: list):
        self.node_id = node_id
        self.clock = {n: 0 for n in nodes}

    def tick(self):
        """로컬 이벤트"""
        self.clock[self.node_id] += 1

    def send(self) -> dict:
        """메시지 전송"""
        self.clock[self.node_id] += 1
        return self.clock.copy()

    def receive(self, sender_clock: dict):
        """메시지 수신"""
        for node in self.clock:
            self.clock[node] = max(self.clock[node], sender_clock.get(node, 0))
        self.clock[self.node_id] += 1

    def is_before(self, other_clock: dict) -> bool:
        """self가 other보다 먼저 발생했는지"""
        all_leq = all(self.clock[n] <= other_clock.get(n, 0) for n in self.clock)
        any_lt = any(self.clock[n] < other_clock.get(n, 0) for n in self.clock)
        return all_leq and any_lt

    def is_concurrent(self, other_clock: dict) -> bool:
        """동시 발생 여부"""
        return not self.is_before(other_clock) and not self._is_after(other_clock)

6.4 하이브리드 논리 클럭 (HLC)

HLC (Hybrid Logical Clock)
= 물리적 시계 + 논리적 시계의 장점 결합

구성: (physical_time, logical_counter)

장점:
- 물리적 시간에 가까운 값 (사람이 이해 가능)
- 인과 관계 추적 가능
- NTP 오차를 보정

사용 예: CockroachDB, YugabyteDB

7. 장애 모델

7.1 장애 유형

장애 모델 스펙트럼 (허용 난이도 순)
┌─────────────────────────────────────────────┐
│                                             │
│  Crash-Stop (가장 단순)                      │
│  ├── 노드가 멈추고 복구되지 않음             │
│  ├── 감지가 비교적 쉬움                      │
│  └── 대부분의 합의 알고리즘이 가정            │
│                                             │
│  Crash-Recovery                              │
│  ├── 노드가 멈추고 나중에 복구               │
│  ├── 영구 저장소에서 상태 복원               │
│  └── WAL(Write-Ahead Log) 필요              │
│                                             │
│  Omission                                    │
│  ├── 메시지를 보내거나 받지 못함             │
│  ├── 네트워크 파티션 포함                    │
│  └── 타임아웃으로 감지                       │
│                                             │
│  Byzantine (가장 복잡)                       │
│  ├── 노드가 임의의/악의적 동작              │
│  ├── 잘못된 데이터 전송 가능                 │
│  ├── 합의에 3f+1 노드 필요 (f: 장애 수)     │
│  └── 블록체인에서 주로 사용                   │
│                                             │
└─────────────────────────────────────────────┘

7.2 장애 감지

class PhiAccrualFailureDetector:
    """Phi Accrual 장애 감지기 (Akka에서 사용)"""

    def __init__(self, threshold: float = 8.0):
        self.threshold = threshold
        self.heartbeat_intervals = []
        self.last_heartbeat = None

    def heartbeat(self, timestamp: float):
        """하트비트 수신"""
        if self.last_heartbeat:
            interval = timestamp - self.last_heartbeat
            self.heartbeat_intervals.append(interval)
        self.last_heartbeat = timestamp

    def phi(self, current_time: float) -> float:
        """phi 값 계산 - 장애 의심 수준"""
        if not self.heartbeat_intervals:
            return 0.0

        time_since_last = current_time - self.last_heartbeat
        mean = sum(self.heartbeat_intervals) / len(self.heartbeat_intervals)
        variance = sum((x - mean) ** 2 for x in self.heartbeat_intervals) / len(self.heartbeat_intervals)
        std_dev = variance ** 0.5

        if std_dev == 0:
            return float('inf') if time_since_last > mean else 0.0

        # 정규 분포 기반 phi 계산
        y = (time_since_last - mean) / std_dev
        # phi = -log10(1 - CDF(y))
        return max(0.0, y * 1.5)  # 단순화된 근사

    def is_alive(self, current_time: float) -> bool:
        """노드가 살아있는지 판단"""
        return self.phi(current_time) < self.threshold

8. 분산 트랜잭션

8.1 2PC (Two-Phase Commit)

2PC (Two-Phase Commit)
┌────────────────────────────────────────────────┐
│                                                │
│  Phase 1: Prepare (투표)                       │
│  Coordinator → Participant A: prepare?         │
│  Coordinator → Participant B: prepare?         │
│  Coordinator → Participant C: prepare?         │
│                                                │
│  A → Coordinator: YES (준비 완료)              │
│  B → Coordinator: YES                          │
│  C → Coordinator: YES                          │
│                                                │
│  Phase 2: Commit (실행)                        │
│  Coordinator → A: commit                       │
│  Coordinator → B: commit                       │
│  Coordinator → C: commit                       │
│                                                │
│  문제점:                                       │
│  - 코디네이터 장애 시 참가자 블로킹            │
│  - 단일 장애점 (SPOF)                          │
│  - 동기적이라 성능 저하                        │
│                                                │
│  하나라도 NO → Abort                           │
│  코디네이터 장애 → 불확실 상태 (in-doubt)      │
└────────────────────────────────────────────────┘

8.2 3PC (Three-Phase Commit)

3PC: 2PC의 블로킹 문제 해결 시도
Phase 1: CanCommit (투표)
Phase 2: PreCommit (사전 커밋)
Phase 3: DoCommit (최종 커밋)

추가된 PreCommit 단계로:
- 타임아웃 기반 복구 가능
- 하지만 네트워크 파티션에서는 여전히 문제

실무에서는 3PC보다 Saga 패턴을 더 많이 사용

8.3 Saga 패턴

Saga 패턴: 보상 트랜잭션 기반
┌────────────────────────────────────────────────┐
│                                                │
│  정상 흐름:                                    │
│  T1 → T2 → T3 → T4 → 완료                     │
│  (주문) (결제) (재고) (배송)                    │
│                                                │
│  T3 실패 시 보상:                              │
│  T1 → T2 → T3(실패) → C2 → C1                 │
│  (주문) (결제) (재고실패) (결제취소) (주문취소)  │
│                                                │
│  Choreography (이벤트 기반):                   │
│  각 서비스가 이벤트를 발행하고 구독              │
│                                                │
│  Orchestration (중앙 조정):                    │
│  Saga Orchestrator가 각 단계를 조정             │
│                                                │
└────────────────────────────────────────────────┘

# Saga Orchestrator 예시
from dataclasses import dataclass
from typing import Callable
from enum import Enum

class SagaStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    COMPENSATING = "compensating"
    FAILED = "failed"

@dataclass
class SagaStep:
    name: str
    action: Callable
    compensation: Callable

class SagaOrchestrator:
    def __init__(self, steps: list):
        self.steps = steps
        self.completed_steps = []
        self.status = SagaStatus.PENDING

    async def execute(self):
        """Saga 실행"""
        self.status = SagaStatus.RUNNING

        for step in self.steps:
            try:
                await step.action()
                self.completed_steps.append(step)
            except Exception as e:
                print(f"Step '{step.name}' failed: {e}")
                await self.compensate()
                return False

        self.status = SagaStatus.COMPLETED
        return True

    async def compensate(self):
        """보상 트랜잭션 실행 (역순)"""
        self.status = SagaStatus.COMPENSATING

        for step in reversed(self.completed_steps):
            try:
                await step.compensation()
                print(f"Compensated: {step.name}")
            except Exception as e:
                print(f"Compensation failed for '{step.name}': {e}")
                self.status = SagaStatus.FAILED
                return

        self.status = SagaStatus.FAILED

# 사용 예
saga = SagaOrchestrator([
    SagaStep("create_order", create_order, cancel_order),
    SagaStep("process_payment", charge_payment, refund_payment),
    SagaStep("reserve_inventory", reserve_stock, release_stock),
    SagaStep("arrange_shipping", book_delivery, cancel_delivery),
])

9. 파티셔닝과 샤딩

9.1 파티셔닝 전략

범위 파티셔닝 (Range Partitioning)
┌─────────────────────────────────────────────┐
│ Key: A-F → Shard 1                          │
│ Key: G-N → Shard 2                          │
│ Key: O-Z → Shard 3                          │
│                                             │
│ 장점: 범위 쿼리 효율적                       │
│ 단점: 핫스팟 발생 가능 (특정 범위에 쏠림)    │
└─────────────────────────────────────────────┘

해시 파티셔닝 (Hash Partitioning)
┌─────────────────────────────────────────────┐
│ hash(key) % 3 == 0 → Shard 1               │
│ hash(key) % 3 == 1 → Shard 2               │
│ hash(key) % 3 == 2 → Shard 3               │
│                                             │
│ 장점: 균등 분배                              │
│ 단점: 범위 쿼리 불가, 리샤딩 비용 높음       │
└─────────────────────────────────────────────┘

9.2 일관성 해싱 (Consistent Hashing)

일관성 해싱
┌─────────────────────────────────────────────┐
│                                             │
│           Node A (0)                        │
│          ╱        ╲                         │
│        ╱            ╲                       │
│   Node D              Node B                │
│   (270)                (90)                 │
│        ╲            ╱                       │
│          ╲        ╱                         │
│           Node C (180)                      │
│                                             │
│  key의 해시 → 링 위의 위치                   │
│  시계 방향으로 가장 가까운 노드에 할당        │
│                                             │
│  노드 추가/제거 시:                          │
│  - 전체 리밸런싱이 아닌 인접 노드만 영향     │
│  - 평균 K/N 키만 이동 (K: 총 키, N: 노드)   │
│                                             │
│  가상 노드 (Virtual Node):                   │
│  - 각 물리 노드가 링 위에 여러 위치 차지     │
│  - 부하 균등 분배                            │
│                                             │
└─────────────────────────────────────────────┘

import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    def __init__(self, virtual_nodes: int = 150):
        self.virtual_nodes = virtual_nodes
        self.ring = {}         # hash -> node
        self.sorted_keys = []  # sorted hash values
        self.nodes = set()

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        """노드 추가"""
        self.nodes.add(node)
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:vn{i}"
            h = self._hash(virtual_key)
            self.ring[h] = node
            self.sorted_keys.append(h)
        self.sorted_keys.sort()

    def remove_node(self, node: str):
        """노드 제거"""
        self.nodes.discard(node)
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:vn{i}"
            h = self._hash(virtual_key)
            del self.ring[h]
            self.sorted_keys.remove(h)

    def get_node(self, key: str) -> str:
        """키에 대한 노드 조회"""
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect_right(self.sorted_keys, h)
        if idx >= len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

# 사용
ring = ConsistentHashRing(virtual_nodes=150)
ring.add_node("node-1")
ring.add_node("node-2")
ring.add_node("node-3")

print(ring.get_node("user:123"))  # → node-2
print(ring.get_node("order:456"))  # → node-1

10. Gossip 프로토콜과 멤버십

10.1 Gossip 프로토콜

Gossip (Epidemic) 프로토콜
┌─────────────────────────────────────────────┐
│                                             │
│  주기적으로 랜덤 노드에 상태 정보 전파       │
│                                             │
│  Round 1: A → B (A의 정보)                  │
│  Round 2: A → C, B → D (A,B의 정보)        │
│  Round 3: C → E, D → F (확산 계속)          │
│  ...                                        │
│  O(log N) 라운드 후 전체 전파                │
│                                             │
│  Push: 자신의 정보를 보냄                    │
│  Pull: 상대의 정보를 요청                    │
│  Push-Pull: 양방향 교환 (가장 효율적)        │
│                                             │
│  사용 예:                                    │
│  - 멤버십 관리 (노드 추가/제거 감지)         │
│  - 장애 감지                                │
│  - 분산 집계 (카운터, 평균 등)               │
│  - Amazon DynamoDB, Apache Cassandra        │
│                                             │
└─────────────────────────────────────────────┘

10.2 SWIM 프로토콜

SWIM (Scalable Weakly-consistent Infection-style Membership)
┌────────────────────────────────────────────────┐
│                                                │
│  1. Ping: 랜덤 노드에 직접 핑                  │
│     A ──ping──→ B                              │
│                                                │
│  2. Ack: 응답                                  │
│     A ←──ack── B                               │
│                                                │
│  3. 무응답 시: 간접 핑 (Ping-Req)              │
│     A ──ping-req──→ C ──ping──→ B              │
│     A ←──ack────── C ←──ack── B               │
│                                                │
│  4. 여전히 무응답: B를 의심(Suspect)            │
│  5. 일정 시간 후: B를 죽은 것으로 판단          │
│                                                │
│  사용: HashiCorp Serf, Consul                  │
└────────────────────────────────────────────────┘

11. 실전 시스템 분석

11.1 Google Spanner

Google Spanner 핵심 기술
┌─────────────────────────────────────────────┐
│                                             │
│  TrueTime API                               │
│  ├── GPS + 원자 시계 기반                    │
│  ├── 시간 불확실성 구간 반환: [earliest, latest]│
│  ├── 오차: 약 7ms 이내                       │
│  └── 외부 일관성 (External Consistency) 보장  │
│                                             │
│  Paxos 기반 복제                             │
│  ├── 각 스패너 서버 그룹이 Paxos 실행         │
│  ├── 동기 복제로 강한 일관성                  │
│  └── 전 세계 분산 가능                       │
│                                             │
│  읽기-쓰기 트랜잭션                           │
│  ├── 2PL (Two-Phase Locking) + 2PC           │
│  ├── TrueTime으로 커밋 타임스탬프 할당        │
│  └── 읽기 전용 트랜잭션: 잠금 없이 스냅샷 읽기│
│                                             │
└─────────────────────────────────────────────┘

11.2 Amazon DynamoDB (Dynamo 논문)

DynamoDB 핵심 설계
┌─────────────────────────────────────────────┐
│                                             │
│  일관성 해싱 + 가상 노드                     │
│  ├── 파티션 키 기반 데이터 분배              │
│  └── 자동 리밸런싱                           │
│                                             │
│  Sloppy Quorum + Hinted Handoff             │
│  ├── N=3, W=2, R=2                          │
│  ├── 장애 시 다른 노드가 대신 저장           │
│  └── 복구 후 원래 노드로 전달                │
│                                             │
│  Vector Clock으로 충돌 감지                   │
│  ├── 인과 관계 추적                          │
│  └── 동시 쓰기 감지 시 클라이언트가 해결      │
│                                             │
│  Anti-Entropy (Merkle Tree)                  │
│  ├── 복제본 간 불일치 탐지                   │
│  └── 효율적인 동기화                         │
│                                             │
│  Gossip 기반 멤버십                           │
│  └── 노드 추가/제거 감지                     │
│                                             │
└─────────────────────────────────────────────┘

11.3 Apache Kafka 내부 구조

Kafka 복제 모델
┌─────────────────────────────────────────────┐
│                                             │
│  Topic: orders (Partition 3, RF=3)          │
│                                             │
│  Partition 0:                               │
│    Leader: Broker 1                         │
│    ISR: [Broker 1, Broker 2, Broker 3]     │
│                                             │
│  Partition 1:                               │
│    Leader: Broker 2                         │
│    ISR: [Broker 2, Broker 3, Broker 1]     │
│                                             │
│  Partition 2:                               │
│    Leader: Broker 3                         │
│    ISR: [Broker 3, Broker 1, Broker 2]     │
│                                             │
│  ISR (In-Sync Replicas):                    │
│  ├── Leader의 로그를 따라잡은 복제본         │
│  ├── 뒤처지면 ISR에서 제거                   │
│  └── acks=all → ISR 전체가 복제 확인         │
│                                             │
│  Controller:                                │
│  ├── ZooKeeper/KRaft로 선출                  │
│  ├── 파티션 리더 할당                        │
│  └── 브로커 장애 시 리더 재선출              │
│                                             │
└─────────────────────────────────────────────┘

12. 설계 패턴 모음

12.1 Circuit Breaker

import time
from enum import Enum
from threading import Lock

class CircuitState(Enum):
    CLOSED = "closed"       # 정상
    OPEN = "open"           # 차단
    HALF_OPEN = "half_open" # 시험

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5,
                 recovery_timeout: float = 30.0,
                 half_open_max: int = 3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max = half_open_max

        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0
        self.half_open_count = 0
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        """보호된 함수 호출"""
        with self.lock:
            if self.state == CircuitState.OPEN:
                if time.time() - self.last_failure_time > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_count = 0
                else:
                    raise CircuitBreakerOpenError("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise e

    def _on_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.half_open_count += 1
                if self.half_open_count >= self.half_open_max:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            elif self.state == CircuitState.CLOSED:
                self.failure_count = 0

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN

12.2 Bulkhead 패턴

Bulkhead 패턴 (격벽)
┌─────────────────────────────────────────────┐
│                                             │
│  서비스 A용 스레드 풀: [████░░░░░░]          │
│  서비스 B용 스레드 풀: [██████░░░░]          │
│  서비스 C용 스레드 풀: [███░░░░░░░]          │
│                                             │
│  서비스 B가 느려져도 A, C에 영향 없음        │
│  각 서비스의 장애가 격리됨                   │
│                                             │
└─────────────────────────────────────────────┘

12.3 Retry with Exponential Backoff

import random
import time
from functools import wraps

def retry_with_backoff(max_retries: int = 3,
                       base_delay: float = 1.0,
                       max_delay: float = 60.0,
                       jitter: bool = True):
    """지수 백오프와 지터를 적용한 재시도"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries:
                        raise e

                    delay = min(base_delay * (2 ** attempt), max_delay)
                    if jitter:
                        delay = delay * random.uniform(0.5, 1.5)

                    print(f"Attempt {attempt + 1} failed, "
                          f"retrying in {delay:.1f}s...")
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_with_backoff(max_retries=5, base_delay=0.5)
def call_remote_service():
    """원격 서비스 호출"""
    pass

13. 퀴즈

Q1. CAP 정리에서 실제로 선택해야 하는 것은?

네트워크 파티션(P)은 분산 시스템에서 불가피하므로, 실제 선택은 CP vs AP입니다.

CP 선택: 파티션 발생 시 일관성을 우선 (일부 요청 거부). 예: etcd, ZooKeeper, HBase
AP 선택: 파티션 발생 시 가용성을 우선 (오래된 데이터 반환 가능). 예: Cassandra, DynamoDB

PACELC 확장에서는 정상 상태에서의 지연(L) vs 일관성(C) 트레이드오프도 고려합니다.

Q2. Raft에서 Leader Election이 필요한 상황은?

클러스터 초기 시작: 아직 리더가 없을 때
리더 장애: Follower가 리더의 heartbeat를 일정 시간(election timeout) 내에 받지 못할 때
네트워크 파티션: 리더와 과반수 노드가 분리될 때

선거 과정: Follower가 Candidate로 전환하고 Term을 증가시킨 후 다른 노드에 투표를 요청합니다. 과반수 투표를 받으면 새 Leader가 됩니다. 같은 Term에서 두 Candidate가 동시에 선거를 시작하면 split vote가 발생할 수 있으며, 랜덤 타임아웃으로 해결합니다.

Q3. 벡터 클럭이 Lamport 타임스탬프보다 나은 점은?

Lamport 타임스탬프는 L(a) less than L(b)일 때 a가 b의 원인인지, 단지 우연히 더 작은지 구별할 수 없습니다. 벡터 클럭은 인과 관계를 정확히 판단할 수 있습니다.

벡터 클럭으로는 두 이벤트가 인과적으로 관련되는지, 또는 동시 발생(concurrent)인지 정확히 판별할 수 있습니다. 이를 통해 동시 쓰기 충돌을 감지하고 적절한 해결 전략을 적용할 수 있습니다. 단점은 벡터 크기가 노드 수에 비례하여 증가한다는 것입니다.

Q4. Saga 패턴이 2PC 대신 사용되는 이유는?

2PC의 문제점:

코디네이터가 단일 장애점(SPOF)
참가자가 prepare 후 블로킹될 수 있음
동기적이라 성능 저하
마이크로서비스 환경에서 강한 결합 발생

Saga의 장점:

각 서비스가 독립적으로 로컬 트랜잭션 실행
비동기적이라 높은 가용성
장애 시 보상 트랜잭션으로 복구
느슨한 결합 유지

단, Saga는 최종적 일관성만 보장하며, 중간 상태가 노출될 수 있습니다.

Q5. 일관성 해싱에서 가상 노드가 필요한 이유는?

물리 노드만 사용하면 해시 링 위에서 노드가 불균등하게 분포되어 부하 편중이 발생할 수 있습니다.

가상 노드(Virtual Node)를 사용하면:

각 물리 노드가 링 위에 여러 위치를 차지하여 균등한 부하 분배
노드 추가/제거 시 점진적인 리밸런싱 가능
이기종 하드웨어에 대해 가상 노드 수를 다르게 설정하여 성능에 맞는 부하 할당

예를 들어, 성능이 2배인 서버에 가상 노드를 2배 할당하여 더 많은 키를 처리하게 할 수 있습니다.

14. 참고 자료

"Designing Data-Intensive Applications" - Martin Kleppmann (필독서)
Raft 논문 - "In Search of an Understandable Consensus Algorithm" (Diego Ongaro, 2014)
Paxos 논문 - "The Part-Time Parliament" (Leslie Lamport, 1998)
Dynamo 논문 - "Dynamo: Amazon's Highly Available Key-value Store" (2007)
Google Spanner 논문 - "Spanner: Google's Globally-Distributed Database" (2012)
CAP 정리 증명 - "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" (2002)
PACELC - Daniel Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design"
Lamport Clocks - "Time, Clocks, and the Ordering of Events in a Distributed System" (1978)
Vector Clocks - "Timestamps in Message-Passing Systems That Preserve the Partial Ordering" (1988)
Kafka 문서 - Apache Kafka Official Documentation
etcd 문서 - etcd.io Documentation
"Distributed Systems for Fun and Profit" - Mikito Takada (온라인 무료)
Jepsen - Kyle Kingsbury의 분산 시스템 검증 프로젝트

Distributed Systems Fundamentals Complete Guide 2025: Consensus Algorithms, Replication, Consistency Models, Fault Tolerance

1. Why Distributed Systems

1.1 Why We Need Distributed Systems

There are three fundamental reasons that cannot be solved by a single server.

Scalability: A single machine has physical limits on CPU, memory, and disk
Availability: Eliminate single points of failure (SPOF) to ensure service continuity
Geographic Distribution: Serve users from nearby locations to reduce latency

Core Challenges of Distributed Systems
┌─────────────────────────────────────────────┐
│                                             │
│  The network is unreliable                  │
│  ├── Packet loss                            │
│  ├── Latency variability                    │
│  └── Network partitions                     │
│                                             │
│  Nodes can fail                             │
│  ├── Crashes                                │
│  ├── Slowdowns                              │
│  └── Byzantine faults (malicious behavior)  │
│                                             │
│  No global time exists                      │
│  ├── Clock drift                            │
│  ├── NTP errors                             │
│  └── Difficulty determining event ordering  │
│                                             │
└─────────────────────────────────────────────┘

1.2 The 8 Fallacies of Distributed Computing

Eight false assumptions defined by Peter Deutsch.

No.	Fallacy	Reality
1	The network is reliable	Packet loss, delays, and partitions occur
2	Latency is zero	Latency is proportional to physical distance
3	Bandwidth is infinite	Network bandwidth has limits
4	The network is secure	Security threats always exist
5	Topology never changes	Network configuration changes constantly
6	There is one administrator	Multiple organizations manage the network
7	Transport cost is zero	Data transfer has costs
8	The network is homogeneous	Various equipment and protocols coexist

2. CAP Theorem and PACELC

2.1 CAP Theorem

The CAP theorem, proposed by Eric Brewer in 2000, is a fundamental principle of distributed system design.

CAP Theorem
┌─────────────────────────────────────────────┐
│                                             │
│            Consistency (C)                  │
│               /\                            │
│              /  \                           │
│             /    \                          │
│            / CP   \  CA                     │
│           / systems\ systems                │
│          /          \                       │
│         /____________\                      │
│  Availability (A) ─── Partition             │
│                       Tolerance (P)         │
│                                             │
│  CP: HBase, MongoDB, etcd, ZooKeeper       │
│  AP: Cassandra, DynamoDB, CouchDB          │
│  CA: Single-node RDBMS (theoretical)       │
│                                             │
│  During partition: must choose C or A       │
└─────────────────────────────────────────────┘

Three properties:

C (Consistency): All nodes see the same data
A (Availability): Every request receives a response (from non-failing nodes)
P (Partition Tolerance): System operates despite network partitions

Key insight: Network partitions are inevitable, so P is mandatory. The real choice is CP vs AP.

2.2 PACELC Extension

PACELC, proposed by Daniel Abadi, extends CAP to address trade-offs in normal operation.

PACELC Theorem
if (Partition) then
    Availability vs Consistency
else
    Latency vs Consistency

┌─────────────────────────────────────────────┐
│ System         │ On P     │ Normal (E)      │
│───────────────│─────────│────────────────│
│ DynamoDB      │ PA       │ EL (low latency)│
│ Cassandra     │ PA       │ EL              │
│ MongoDB       │ PC       │ EC (strong)     │
│ Google Spanner│ PC       │ EC              │
│ CockroachDB   │ PC       │ EL (read opt.)  │
│ PNUTS (Yahoo) │ PC       │ EL              │
└─────────────────────────────────────────────┘

3. Consistency Models

3.1 The Consistency Spectrum

Strong Consistency ◄─────────────────────────► Weak Consistency

Linearizable  Sequential  Causal    Eventual

│ Low perf     │          │         │ High perf  │
│ Complex impl │          │         │ Simple impl│
│ Strong guar. │          │         │ Weak guar. │

3.2 Linearizability

The strongest consistency guarantee. All operations must match real-time ordering.

Linearizable
Client A: ──write(x=1)──────────────────────
Client B: ────────────read(x)→1─────────────
Client C: ──────────────────read(x)→1───────

All clients always read the latest value after a write completes

NOT Linearizable
Client A: ──write(x=1)──────────────────────
Client B: ────────────read(x)→1─────────────
Client C: ──────────────────read(x)→0───────  ← stale!

3.3 Eventual Consistency

If updates continue without interruption, eventually all replicas will converge to the same state.

# Read pattern assuming Eventual Consistency
class EventuallyConsistentStore:
    def __init__(self, replicas: list):
        self.replicas = replicas

    def write(self, key: str, value: str, timestamp: float):
        """Async write to all replicas"""
        for replica in self.replicas:
            replica.async_put(key, value, timestamp)

    def read(self, key: str) -> str:
        """Read - can read from any replica"""
        # Select the value with the latest timestamp (read-repair)
        values = [r.get(key) for r in self.replicas]
        return max(values, key=lambda v: v.timestamp).value

    def read_your_writes(self, key: str, session_token: str) -> str:
        """Guarantee reading your own writes"""
        last_write_replica = self.get_replica_for_session(session_token)
        return last_write_replica.get(key).value

3.4 Causal Consistency

Only guarantees the ordering of causally related operations.

Causal Consistency Example:
A: write(x=1)    <- B reads x=1 then writes y=2
B: read(x)→1, write(y=2)
C: read(y)→2, read(x)→?

With causal consistency guaranteed:
  If C can see y=2, it must also see x=1.
  (x=1, the causal predecessor of y=2, must be visible first)

3.5 Read-Your-Writes Consistency

You can always read data that you yourself wrote.

Read-Your-Writes
Client A: write(x=1) ───→ read(x)→1  (always guaranteed)
                           |
                    reads own write

Session Consistency: guaranteed within the same session
Monotonic Reads: once you see a new value, you never revert to an old one

4. Consensus Algorithms

4.1 Paxos

The first consensus algorithm, proposed by Leslie Lamport.

Basic Paxos (single value consensus)
┌─────────────────────────────────────────────┐
│                                             │
│  Phase 1a: Prepare                          │
│  Proposer → Acceptor: prepare(n)            │
│  "Please prepare for proposal number n"     │
│                                             │
│  Phase 1b: Promise                          │
│  Acceptor → Proposer: promise(n, prev_val)  │
│  "I will reject anything below n"           │
│                                             │
│  Phase 2a: Accept                           │
│  Proposer → Acceptor: accept(n, value)      │
│  "Please accept this value"                 │
│                                             │
│  Phase 2b: Accepted                         │
│  Acceptor → Learner: accepted(n, value)     │
│  "If majority accepts, consensus reached"   │
│                                             │
└─────────────────────────────────────────────┘

4.2 Multi-Paxos

In real systems, Multi-Paxos is used for repeated consensus. It elects a stable leader to skip Phase 1.

4.3 Raft: An Understandable Consensus Algorithm

Raft, proposed by Diego Ongaro in 2014, provides safety equivalent to Paxos while being designed for understandability.

Raft State Transitions
┌────────────────────────────────────────────────┐
│                                                │
│  ┌──────────┐  timeout   ┌──────────┐         │
│  │ Follower │──────────→│ Candidate│         │
│  │          │←──────────│          │         │
│  └──────────┘ fail/discover└────┬─────┘        │
│       ↑                        │               │
│       │   majority votes won   │               │
│       │                        ↓               │
│       │              ┌──────────┐              │
│       └──────────────│  Leader  │              │
│        new leader    │          │              │
│                      └──────────┘              │
│                                                │
│  Term:                                         │
│  ┌───┬───┬───┬───┬───┐                        │
│  │ 1 │ 2 │ 3 │ 4 │ 5 │ ...                   │
│  └───┴───┴───┴───┴───┘                        │
│  At most one Leader per Term                   │
└────────────────────────────────────────────────┘

Raft Leader Election

Leader Election Process
┌────────────────────────────────────────────────┐
│ 1. Follower heartbeat timeout                  │
│ 2. Increment Term, vote for self               │
│ 3. Send RequestVote RPC to other nodes         │
│ 4. Become Leader if majority votes received    │
│                                                │
│ [Node A: Follower]  [Node B: Follower]         │
│ [Node C: Follower]  [Node D: Follower]         │
│ [Node E: Follower]                             │
│                                                │
│ Node A timeout → Candidate (Term 2)            │
│ A→B: RequestVote(Term 2) → B: Grant            │
│ A→C: RequestVote(Term 2) → C: Grant            │
│ A→D: RequestVote(Term 2) → D: Grant            │
│                                                │
│ A wins majority (3/5) → Leader!                │
│ A→all: AppendEntries(heartbeat)                │
└────────────────────────────────────────────────┘

Raft Log Replication

Log Replication
┌────────────────────────────────────────────────┐
│                                                │
│  Client → Leader: write(x=5)                   │
│                                                │
│  Leader Log:  [1:x=3] [2:y=7] [3:x=5]         │
│  Follower A:  [1:x=3] [2:y=7]                 │
│  Follower B:  [1:x=3] [2:y=7]                 │
│  Follower C:  [1:x=3]                         │
│  Follower D:  [1:x=3] [2:y=7]                 │
│                                                │
│  1. Leader appends entry to log                │
│  2. Replicates via AppendEntries RPC           │
│  3. Commits when majority confirms replication │
│  4. Responds to client                         │
│  5. Notifies commit in next heartbeat          │
│                                                │
│  Safety: Leader contains all committed entries  │
│  (Election Restriction)                        │
└────────────────────────────────────────────────┘

4.4 Raft Implementation Example (Pseudocode)

from enum import Enum
from dataclasses import dataclass, field
from typing import Optional

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class LogEntry:
    term: int
    index: int
    command: str

@dataclass
class RaftNode:
    node_id: str
    state: NodeState = NodeState.FOLLOWER
    current_term: int = 0
    voted_for: Optional[str] = None
    log: list = field(default_factory=list)
    commit_index: int = 0
    last_applied: int = 0

    # Leader-only state
    next_index: dict = field(default_factory=dict)
    match_index: dict = field(default_factory=dict)

    def on_election_timeout(self):
        """Election timeout: transition to Candidate"""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        votes_received = 1  # Vote for self

        # Send RequestVote to all peers
        for peer in self.peers:
            response = self.send_request_vote(peer)
            if response and response.vote_granted:
                votes_received += 1

        if votes_received > len(self.peers) // 2:
            self.become_leader()

    def become_leader(self):
        """Transition to Leader"""
        self.state = NodeState.LEADER
        for peer in self.peers:
            self.next_index[peer] = len(self.log)
            self.match_index[peer] = 0
        self.send_heartbeats()

    def on_append_entries(self, term, leader_id, entries, leader_commit):
        """Handle AppendEntries RPC"""
        if term < self.current_term:
            return False

        self.state = NodeState.FOLLOWER
        self.current_term = term
        self.reset_election_timer()

        # Check log consistency and append entries
        self.log.extend(entries)

        if leader_commit > self.commit_index:
            self.commit_index = min(leader_commit, len(self.log) - 1)
            self.apply_committed_entries()

        return True

    def replicate_log(self, command: str):
        """Handle client command (Leader only)"""
        if self.state != NodeState.LEADER:
            return False

        entry = LogEntry(
            term=self.current_term,
            index=len(self.log),
            command=command
        )
        self.log.append(entry)

        # Replicate to all Followers
        ack_count = 1  # Self
        for peer in self.peers:
            success = self.send_append_entries(peer, [entry])
            if success:
                ack_count += 1

        # Check majority
        if ack_count > len(self.peers) // 2:
            self.commit_index = entry.index
            self.apply_committed_entries()
            return True
        return False

5. Replication Strategies

5.1 Leader-Follower

Leader-Follower Replication
┌─────────────────────────────────────────────┐
│                                             │
│  Client ──write──→ ┌────────┐               │
│                    │ Leader │               │
│  Client ──read──→  │ (R/W)  │               │
│                    └───┬────┘               │
│              replicate │  replicate          │
│              ┌─────────┼─────────┐          │
│              ↓         ↓         ↓          │
│         ┌────────┐┌────────┐┌────────┐      │
│         │Follower││Follower││Follower│      │
│         │ (Read) ││ (Read) ││ (Read) │      │
│         └────────┘└────────┘└────────┘      │
│                                             │
│  Synchronous: slow but no data loss         │
│  Asynchronous: fast but potential data loss  │
│  Semi-synchronous: sync to at least 1       │
└─────────────────────────────────────────────┘

5.2 Multi-Leader

Multi-Leader Replication
┌─────────────────────────────────────────────┐
│                                             │
│  DC-US             DC-EU            DC-Asia │
│  ┌────────┐       ┌────────┐      ┌────────┐│
│  │Leader 1│◄─────►│Leader 2│◄────►│Leader 3││
│  │ (R/W)  │       │ (R/W)  │      │ (R/W)  ││
│  └───┬────┘       └───┬────┘      └───┬────┘│
│      │                │               │     │
│  ┌───┴───┐        ┌───┴───┐      ┌───┴───┐ │
│  │Follower│        │Follower│      │Follower│ │
│  └───────┘        └───────┘      └───────┘ │
│                                             │
│  Pros: reduced write latency, offline ops   │
│  Cons: conflict resolution needed (LWW, CRDT)│
└─────────────────────────────────────────────┘

Conflict Resolution Strategies

# Last-Writer-Wins (LWW) - simplest
def lww_resolve(conflict_a, conflict_b):
    """Larger timestamp wins"""
    if conflict_a.timestamp > conflict_b.timestamp:
        return conflict_a
    return conflict_b

# Custom merge function
def custom_merge(conflict_a, conflict_b):
    """Domain-specific merge logic"""
    if conflict_a.type == "counter":
        return Value(conflict_a.value + conflict_b.value)
    elif conflict_a.type == "set":
        return Value(conflict_a.value.union(conflict_b.value))
    else:
        return lww_resolve(conflict_a, conflict_b)

5.3 Leaderless / Quorum

Leaderless Replication (Dynamo Style)
┌─────────────────────────────────────────────┐
│                                             │
│  N=3 (total replicas), W=2 (write quorum),  │
│  R=2 (read quorum)                          │
│                                             │
│  Write: Client → Node1 (ack)                │
│                → Node2 (ack)    ← W=2 met   │
│                → Node3 (timeout)             │
│                                             │
│  Read:  Client → Node1 (v=5, ts=100)        │
│                → Node2 (v=5, ts=100)  ← cur │
│                → Node3 (v=3, ts=90)   ← old │
│                                             │
│  Read repair: send latest to stale Node3    │
│                                             │
│  Quorum condition: W + R > N → at least     │
│  one intersection guarantees latest value   │
│                                             │
│  Sloppy Quorum: substitute nodes on failure │
│  (Hinted Handoff)                           │
└─────────────────────────────────────────────┘

5.4 Chain Replication

Chain Replication
┌─────────────────────────────────────────────┐
│                                             │
│  Write → [Head] → [Middle] → [Tail] → Read │
│                                             │
│  Pros:                                      │
│  - Strong consistency guarantee             │
│  - Load distribution for reads and writes   │
│  - Relatively simple implementation         │
│                                             │
│  Cons:                                      │
│  - Write latency proportional to chain len  │
│  - Head/Tail failure requires reconfiguration│
│                                             │
│  Used in: Azure Storage, HDFS              │
└─────────────────────────────────────────────┘

6. Distributed Clocks

6.1 Physical Clocks and NTP

Physical Clock Problems
┌─────────────────────────────────────────────┐
│                                             │
│  Node A clock: 10:00:00.000                 │
│  Node B clock: 10:00:00.150  (150ms off)    │
│  Node C clock: 09:59:59.800  (200ms off)    │
│                                             │
│  NTP sync accuracy: typically tens of ms    │
│  Google Spanner TrueTime: ~7ms error        │
│                                             │
│  Problem: cannot determine if event A       │
│  happened before B using physical clocks    │
└─────────────────────────────────────────────┘

6.2 Lamport Timestamps

A logical clock proposed by Leslie Lamport. Guarantees the ordering of causal relationships.

Lamport Timestamp
Rules:
1. Increment local counter on event
2. Include counter when sending message
3. On receive: max(local, received) + 1

Node A:  [1] ──msg──→ [2] ────→ [3]
                        |
Node B:  [1] ────→ [3] (max(1,2)+1) → [4]
                                        |
Node C:  [1] ──→ [2] ───────────→ [5] (max(2,4)+1)

Limitation: Lamport time alone cannot
distinguish causal relationships
(L(a) < L(b) doesn't mean a caused b)

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event"""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Send message"""
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        """Receive message"""
        self.time = max(self.time, sender_time) + 1
        return self.time

6.3 Vector Clocks

Vector Clocks
Each node maintains a vector of counters for all nodes

Node A: [A:1, B:0, C:0]
        ──msg──→
Node B: [A:0, B:1, C:0] → after receive [A:1, B:2, C:0]
        ──msg──→
Node C: [A:0, B:0, C:1] → after receive [A:1, B:2, C:2]

Causality determination:
V1 < V2  iff  for all i: V1[i] <= V2[i] and
              for at least one j: V1[j] < V2[j]

V1 || V2  (concurrent) iff neither V1 < V2 nor V2 < V1

class VectorClock:
    def __init__(self, node_id: str, nodes: list):
        self.node_id = node_id
        self.clock = {n: 0 for n in nodes}

    def tick(self):
        """Local event"""
        self.clock[self.node_id] += 1

    def send(self) -> dict:
        """Send message"""
        self.clock[self.node_id] += 1
        return self.clock.copy()

    def receive(self, sender_clock: dict):
        """Receive message"""
        for node in self.clock:
            self.clock[node] = max(self.clock[node], sender_clock.get(node, 0))
        self.clock[self.node_id] += 1

    def is_before(self, other_clock: dict) -> bool:
        """Was self before other?"""
        all_leq = all(self.clock[n] <= other_clock.get(n, 0) for n in self.clock)
        any_lt = any(self.clock[n] < other_clock.get(n, 0) for n in self.clock)
        return all_leq and any_lt

    def is_concurrent(self, other_clock: dict) -> bool:
        """Are events concurrent?"""
        return not self.is_before(other_clock) and not self._is_after(other_clock)

6.4 Hybrid Logical Clocks (HLC)

HLC (Hybrid Logical Clock)
= Combines strengths of physical + logical clocks

Structure: (physical_time, logical_counter)

Advantages:
- Values close to physical time (human-readable)
- Can track causal relationships
- Compensates for NTP errors

Used in: CockroachDB, YugabyteDB

7. Fault Models

7.1 Fault Types

Fault Model Spectrum (by tolerance difficulty)
┌─────────────────────────────────────────────┐
│                                             │
│  Crash-Stop (simplest)                      │
│  ├── Node stops and never recovers          │
│  ├── Relatively easy to detect              │
│  └── Assumed by most consensus algorithms   │
│                                             │
│  Crash-Recovery                              │
│  ├── Node stops and recovers later          │
│  ├── State restored from persistent storage │
│  └── Requires WAL (Write-Ahead Log)         │
│                                             │
│  Omission                                    │
│  ├── Fails to send or receive messages      │
│  ├── Includes network partitions            │
│  └── Detected by timeouts                   │
│                                             │
│  Byzantine (most complex)                   │
│  ├── Node exhibits arbitrary/malicious behavior│
│  ├── Can send incorrect data                │
│  ├── Requires 3f+1 nodes (f: faulty nodes)  │
│  └── Primarily used in blockchain           │
│                                             │
└─────────────────────────────────────────────┘

7.2 Failure Detection

class PhiAccrualFailureDetector:
    """Phi Accrual Failure Detector (used in Akka)"""

    def __init__(self, threshold: float = 8.0):
        self.threshold = threshold
        self.heartbeat_intervals = []
        self.last_heartbeat = None

    def heartbeat(self, timestamp: float):
        """Record heartbeat"""
        if self.last_heartbeat:
            interval = timestamp - self.last_heartbeat
            self.heartbeat_intervals.append(interval)
        self.last_heartbeat = timestamp

    def phi(self, current_time: float) -> float:
        """Calculate phi value - suspicion level"""
        if not self.heartbeat_intervals:
            return 0.0

        time_since_last = current_time - self.last_heartbeat
        mean = sum(self.heartbeat_intervals) / len(self.heartbeat_intervals)
        variance = sum((x - mean) ** 2 for x in self.heartbeat_intervals) / len(self.heartbeat_intervals)
        std_dev = variance ** 0.5

        if std_dev == 0:
            return float('inf') if time_since_last > mean else 0.0

        y = (time_since_last - mean) / std_dev
        return max(0.0, y * 1.5)  # Simplified approximation

    def is_alive(self, current_time: float) -> bool:
        """Determine if node is alive"""
        return self.phi(current_time) < self.threshold

8. Distributed Transactions

8.1 2PC (Two-Phase Commit)

2PC (Two-Phase Commit)
┌────────────────────────────────────────────────┐
│                                                │
│  Phase 1: Prepare (Vote)                       │
│  Coordinator → Participant A: prepare?         │
│  Coordinator → Participant B: prepare?         │
│  Coordinator → Participant C: prepare?         │
│                                                │
│  A → Coordinator: YES (ready)                  │
│  B → Coordinator: YES                          │
│  C → Coordinator: YES                          │
│                                                │
│  Phase 2: Commit (Execute)                     │
│  Coordinator → A: commit                       │
│  Coordinator → B: commit                       │
│  Coordinator → C: commit                       │
│                                                │
│  Problems:                                     │
│  - Coordinator failure blocks participants     │
│  - Single point of failure (SPOF)              │
│  - Synchronous, performance degradation        │
│                                                │
│  Any NO → Abort                                │
│  Coordinator failure → uncertain (in-doubt)    │
└────────────────────────────────────────────────┘

8.2 3PC (Three-Phase Commit)

3PC: Attempts to solve 2PC blocking problem
Phase 1: CanCommit (Vote)
Phase 2: PreCommit (Pre-commit)
Phase 3: DoCommit (Final commit)

The additional PreCommit phase enables:
- Timeout-based recovery
- But still problematic with network partitions

In practice, Saga pattern is used more often than 3PC

8.3 Saga Pattern

Saga Pattern: Based on compensating transactions
┌────────────────────────────────────────────────┐
│                                                │
│  Normal flow:                                  │
│  T1 → T2 → T3 → T4 → Done                    │
│  (Order)(Pay)(Stock)(Ship)                     │
│                                                │
│  T3 failure compensation:                      │
│  T1 → T2 → T3(fail) → C2 → C1                 │
│  (Order)(Pay)(StockFail)(Refund)(CancelOrder)  │
│                                                │
│  Choreography (event-driven):                  │
│  Each service publishes and subscribes events  │
│                                                │
│  Orchestration (centralized):                  │
│  Saga Orchestrator coordinates each step       │
│                                                │
└────────────────────────────────────────────────┘

# Saga Orchestrator Example
from dataclasses import dataclass
from typing import Callable
from enum import Enum

class SagaStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    COMPENSATING = "compensating"
    FAILED = "failed"

@dataclass
class SagaStep:
    name: str
    action: Callable
    compensation: Callable

class SagaOrchestrator:
    def __init__(self, steps: list):
        self.steps = steps
        self.completed_steps = []
        self.status = SagaStatus.PENDING

    async def execute(self):
        """Execute Saga"""
        self.status = SagaStatus.RUNNING

        for step in self.steps:
            try:
                await step.action()
                self.completed_steps.append(step)
            except Exception as e:
                print(f"Step '{step.name}' failed: {e}")
                await self.compensate()
                return False

        self.status = SagaStatus.COMPLETED
        return True

    async def compensate(self):
        """Execute compensating transactions (reverse order)"""
        self.status = SagaStatus.COMPENSATING

        for step in reversed(self.completed_steps):
            try:
                await step.compensation()
                print(f"Compensated: {step.name}")
            except Exception as e:
                print(f"Compensation failed for '{step.name}': {e}")
                self.status = SagaStatus.FAILED
                return

        self.status = SagaStatus.FAILED

# Usage
saga = SagaOrchestrator([
    SagaStep("create_order", create_order, cancel_order),
    SagaStep("process_payment", charge_payment, refund_payment),
    SagaStep("reserve_inventory", reserve_stock, release_stock),
    SagaStep("arrange_shipping", book_delivery, cancel_delivery),
])

9. Partitioning and Sharding

9.1 Partitioning Strategies

Range Partitioning
┌─────────────────────────────────────────────┐
│ Key: A-F → Shard 1                          │
│ Key: G-N → Shard 2                          │
│ Key: O-Z → Shard 3                          │
│                                             │
│ Pros: efficient range queries               │
│ Cons: potential hotspots                    │
└─────────────────────────────────────────────┘

Hash Partitioning
┌─────────────────────────────────────────────┐
│ hash(key) % 3 == 0 → Shard 1               │
│ hash(key) % 3 == 1 → Shard 2               │
│ hash(key) % 3 == 2 → Shard 3               │
│                                             │
│ Pros: even distribution                     │
│ Cons: no range queries, costly resharding   │
└─────────────────────────────────────────────┘

9.2 Consistent Hashing

Consistent Hashing
┌─────────────────────────────────────────────┐
│                                             │
│           Node A (0)                        │
│          /        \                         │
│        /            \                       │
│   Node D              Node B                │
│   (270)                (90)                 │
│        \            /                       │
│          \        /                         │
│           Node C (180)                      │
│                                             │
│  Key hash → position on ring                │
│  Assigned to nearest node clockwise         │
│                                             │
│  On node add/remove:                        │
│  - Only adjacent nodes affected             │
│  - Average K/N keys move (K: total, N: nodes)│
│                                             │
│  Virtual Nodes:                             │
│  - Each physical node occupies multiple     │
│    positions on the ring                    │
│  - Even load distribution                   │
│                                             │
└─────────────────────────────────────────────┘

import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    def __init__(self, virtual_nodes: int = 150):
        self.virtual_nodes = virtual_nodes
        self.ring = {}         # hash -> node
        self.sorted_keys = []  # sorted hash values
        self.nodes = set()

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str):
        """Add a node"""
        self.nodes.add(node)
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:vn{i}"
            h = self._hash(virtual_key)
            self.ring[h] = node
            self.sorted_keys.append(h)
        self.sorted_keys.sort()

    def remove_node(self, node: str):
        """Remove a node"""
        self.nodes.discard(node)
        for i in range(self.virtual_nodes):
            virtual_key = f"{node}:vn{i}"
            h = self._hash(virtual_key)
            del self.ring[h]
            self.sorted_keys.remove(h)

    def get_node(self, key: str) -> str:
        """Look up node for key"""
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect_right(self.sorted_keys, h)
        if idx >= len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

# Usage
ring = ConsistentHashRing(virtual_nodes=150)
ring.add_node("node-1")
ring.add_node("node-2")
ring.add_node("node-3")

print(ring.get_node("user:123"))  # → node-2
print(ring.get_node("order:456"))  # → node-1

10. Gossip Protocol and Membership

10.1 Gossip Protocol

Gossip (Epidemic) Protocol
┌─────────────────────────────────────────────┐
│                                             │
│  Periodically propagate state to random node│
│                                             │
│  Round 1: A → B (A's info)                  │
│  Round 2: A → C, B → D (A,B info)          │
│  Round 3: C → E, D → F (spreading)         │
│  ...                                        │
│  Full propagation in O(log N) rounds        │
│                                             │
│  Push: send own info                        │
│  Pull: request peer's info                  │
│  Push-Pull: bidirectional (most efficient)  │
│                                             │
│  Use cases:                                 │
│  - Membership management                    │
│  - Failure detection                        │
│  - Distributed aggregation                  │
│  - Amazon DynamoDB, Apache Cassandra        │
│                                             │
└─────────────────────────────────────────────┘

10.2 SWIM Protocol

SWIM (Scalable Weakly-consistent Infection-style Membership)
┌────────────────────────────────────────────────┐
│                                                │
│  1. Ping: direct ping to random node           │
│     A ──ping──→ B                              │
│                                                │
│  2. Ack: response                              │
│     A ←──ack── B                               │
│                                                │
│  3. No response: indirect ping (Ping-Req)      │
│     A ──ping-req──→ C ──ping──→ B              │
│     A ←──ack────── C ←──ack── B               │
│                                                │
│  4. Still no response: suspect B               │
│  5. After timeout: declare B dead              │
│                                                │
│  Used in: HashiCorp Serf, Consul               │
└────────────────────────────────────────────────┘

11. Real-World System Analysis

11.1 Google Spanner

Google Spanner Key Technologies
┌─────────────────────────────────────────────┐
│                                             │
│  TrueTime API                               │
│  ├── GPS + atomic clock based               │
│  ├── Returns uncertainty interval           │
│  ├── Error: within ~7ms                     │
│  └── Guarantees external consistency        │
│                                             │
│  Paxos-based Replication                    │
│  ├── Each Spanner server group runs Paxos   │
│  ├── Synchronous replication for strong consistency│
│  └── Globally distributable                 │
│                                             │
│  Read-Write Transactions                    │
│  ├── 2PL (Two-Phase Locking) + 2PC          │
│  ├── TrueTime assigns commit timestamps     │
│  └── Read-only: lock-free snapshot reads    │
│                                             │
└─────────────────────────────────────────────┘

11.2 Amazon DynamoDB (Dynamo Paper)

DynamoDB Key Design Decisions
┌─────────────────────────────────────────────┐
│                                             │
│  Consistent Hashing + Virtual Nodes         │
│  ├── Partition key-based data distribution  │
│  └── Automatic rebalancing                  │
│                                             │
│  Sloppy Quorum + Hinted Handoff             │
│  ├── N=3, W=2, R=2                          │
│  ├── Substitute nodes store on failure      │
│  └── Forward to original after recovery     │
│                                             │
│  Vector Clocks for conflict detection       │
│  ├── Causality tracking                     │
│  └── Client resolves concurrent writes      │
│                                             │
│  Anti-Entropy (Merkle Tree)                 │
│  ├── Detect replica inconsistencies         │
│  └── Efficient synchronization              │
│                                             │
│  Gossip-based Membership                    │
│  └── Detect node additions/removals         │
│                                             │
└─────────────────────────────────────────────┘

11.3 Apache Kafka Internals

Kafka Replication Model
┌─────────────────────────────────────────────┐
│                                             │
│  Topic: orders (Partition 3, RF=3)          │
│                                             │
│  Partition 0:                               │
│    Leader: Broker 1                         │
│    ISR: [Broker 1, Broker 2, Broker 3]     │
│                                             │
│  Partition 1:                               │
│    Leader: Broker 2                         │
│    ISR: [Broker 2, Broker 3, Broker 1]     │
│                                             │
│  Partition 2:                               │
│    Leader: Broker 3                         │
│    ISR: [Broker 3, Broker 1, Broker 2]     │
│                                             │
│  ISR (In-Sync Replicas):                    │
│  ├── Replicas caught up with leader's log   │
│  ├── Removed from ISR if falling behind     │
│  └── acks=all → all ISR must confirm        │
│                                             │
│  Controller:                                │
│  ├── Elected via ZooKeeper/KRaft            │
│  ├── Assigns partition leaders              │
│  └── Re-elects leaders on broker failure    │
│                                             │
└─────────────────────────────────────────────┘

12. Design Pattern Collection

12.1 Circuit Breaker

import time
from enum import Enum
from threading import Lock

class CircuitState(Enum):
    CLOSED = "closed"       # Normal
    OPEN = "open"           # Blocking
    HALF_OPEN = "half_open" # Testing

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5,
                 recovery_timeout: float = 30.0,
                 half_open_max: int = 3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max = half_open_max

        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0
        self.half_open_count = 0
        self.lock = Lock()

    def call(self, func, *args, **kwargs):
        """Protected function call"""
        with self.lock:
            if self.state == CircuitState.OPEN:
                if time.time() - self.last_failure_time > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    self.half_open_count = 0
                else:
                    raise CircuitBreakerOpenError("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise e

    def _on_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.half_open_count += 1
                if self.half_open_count >= self.half_open_max:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            elif self.state == CircuitState.CLOSED:
                self.failure_count = 0

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN

12.2 Retry with Exponential Backoff

import random
import time
from functools import wraps

def retry_with_backoff(max_retries: int = 3,
                       base_delay: float = 1.0,
                       max_delay: float = 60.0,
                       jitter: bool = True):
    """Retry with exponential backoff and jitter"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries:
                        raise e

                    delay = min(base_delay * (2 ** attempt), max_delay)
                    if jitter:
                        delay = delay * random.uniform(0.5, 1.5)

                    print(f"Attempt {attempt + 1} failed, "
                          f"retrying in {delay:.1f}s...")
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_with_backoff(max_retries=5, base_delay=0.5)
def call_remote_service():
    """Remote service call"""
    pass

13. Quiz

Q1. In the CAP theorem, what is the actual choice you must make?

Since network partitions (P) are inevitable in distributed systems, the real choice is CP vs AP.

CP choice: Prioritize consistency during partition (reject some requests). Examples: etcd, ZooKeeper, HBase
AP choice: Prioritize availability during partition (may return stale data). Examples: Cassandra, DynamoDB

The PACELC extension also considers the Latency (L) vs Consistency (C) trade-off during normal operation.

Q2. When is Leader Election needed in Raft?

Initial cluster start: When no leader exists yet
Leader failure: When a Follower does not receive the leader's heartbeat within the election timeout
Network partition: When the leader is separated from a majority of nodes

Process: A Follower transitions to Candidate, increments its Term, and requests votes from other nodes. It becomes the new Leader if it wins a majority. If two Candidates start elections simultaneously (split vote), randomized timeouts resolve it.

Q3. What advantage do Vector Clocks have over Lamport Timestamps?

Lamport timestamps cannot distinguish whether L(a) being less than L(b) means a caused b, or it is merely coincidental. Vector clocks can precisely determine causal relationships.

With vector clocks, you can accurately determine whether two events are causally related or concurrent. This enables detection of concurrent write conflicts and application of appropriate resolution strategies. The downside is that vector size grows proportionally with the number of nodes.

Q4. Why is the Saga pattern used instead of 2PC?

2PC problems:

Coordinator is a single point of failure (SPOF)
Participants can be blocked after prepare
Synchronous, causing performance degradation
Creates tight coupling in microservice environments

Saga advantages:

Each service independently executes local transactions
Asynchronous for high availability
Compensating transactions for recovery on failure
Maintains loose coupling

However, Saga only guarantees eventual consistency, and intermediate states may be visible.

Q5. Why are virtual nodes needed in consistent hashing?

Using only physical nodes leads to uneven distribution on the hash ring, causing load imbalance.

Virtual nodes provide:

Each physical node occupies multiple positions on the ring for even load distribution
Gradual rebalancing when adding/removing nodes
Different virtual node counts for heterogeneous hardware to allocate load based on capacity

For example, a server with 2x performance can be assigned 2x virtual nodes to handle more keys.

14. References

"Designing Data-Intensive Applications" - Martin Kleppmann (essential reading)
Raft Paper - "In Search of an Understandable Consensus Algorithm" (Diego Ongaro, 2014)
Paxos Paper - "The Part-Time Parliament" (Leslie Lamport, 1998)
Dynamo Paper - "Dynamo: Amazon's Highly Available Key-value Store" (2007)
Google Spanner Paper - "Spanner: Google's Globally-Distributed Database" (2012)
CAP Theorem Proof - "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" (2002)
PACELC - Daniel Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design"
Lamport Clocks - "Time, Clocks, and the Ordering of Events in a Distributed System" (1978)
Vector Clocks - "Timestamps in Message-Passing Systems That Preserve the Partial Ordering" (1988)
Kafka Documentation - Apache Kafka Official Documentation
etcd Documentation - etcd.io
"Distributed Systems for Fun and Profit" - Mikito Takada (free online)
Jepsen - Kyle Kingsbury's distributed systems verification project

분산 시스템 기초 완전 가이드 2025: 합의 알고리즘, 복제, 일관성 모델, 장애 허용

목차

1. 왜 분산 시스템인가

1.1 분산 시스템이 필요한 이유

1.2 분산 컴퓨팅의 8가지 오류 (Fallacies of Distributed Computing)

2. CAP 정리와 PACELC

2.1 CAP 정리

2.2 PACELC 확장

3. 일관성 모델

3.1 일관성 스펙트럼

3.2 선형화 가능성 (Linearizability)

3.3 최종적 일관성 (Eventual Consistency)

3.4 인과적 일관성 (Causal Consistency)

3.5 Read-Your-Writes 일관성

4. 합의 알고리즘

4.1 Paxos

4.2 Multi-Paxos

4.3 Raft: 이해하기 쉬운 합의

Raft Leader Election

Raft Log Replication

4.4 Raft 구현 예시 (의사코드)

5. 복제 전략

5.1 리더-팔로워 (Leader-Follower)

5.2 멀티리더 (Multi-Leader)

충돌 해결 전략

5.3 무리더 (Leaderless) / 쿼럼

5.4 체인 복제 (Chain Replication)

6. 분산 클럭

6.1 물리적 시계와 NTP

6.2 Lamport 타임스탬프

6.3 벡터 클럭 (Vector Clock)

6.4 하이브리드 논리 클럭 (HLC)

7. 장애 모델

7.1 장애 유형

7.2 장애 감지

8. 분산 트랜잭션

8.1 2PC (Two-Phase Commit)

8.2 3PC (Three-Phase Commit)

8.3 Saga 패턴

9. 파티셔닝과 샤딩

9.1 파티셔닝 전략

9.2 일관성 해싱 (Consistent Hashing)

10. Gossip 프로토콜과 멤버십

10.1 Gossip 프로토콜

10.2 SWIM 프로토콜

11. 실전 시스템 분석

11.1 Google Spanner

11.2 Amazon DynamoDB (Dynamo 논문)

11.3 Apache Kafka 내부 구조

12. 설계 패턴 모음

12.1 Circuit Breaker

12.2 Bulkhead 패턴

12.3 Retry with Exponential Backoff

13. 퀴즈

14. 참고 자료

Distributed Systems Fundamentals Complete Guide 2025: Consensus Algorithms, Replication, Consistency Models, Fault Tolerance

Table of Contents

1. Why Distributed Systems

1.1 Why We Need Distributed Systems

1.2 The 8 Fallacies of Distributed Computing

2. CAP Theorem and PACELC

2.1 CAP Theorem

2.2 PACELC Extension

3. Consistency Models

3.1 The Consistency Spectrum

3.2 Linearizability

3.3 Eventual Consistency

3.4 Causal Consistency

3.5 Read-Your-Writes Consistency

4. Consensus Algorithms

4.1 Paxos

4.2 Multi-Paxos

4.3 Raft: An Understandable Consensus Algorithm

Raft Leader Election

Raft Log Replication

4.4 Raft Implementation Example (Pseudocode)

5. Replication Strategies

5.1 Leader-Follower

5.2 Multi-Leader

Conflict Resolution Strategies