The Complete Guide to Redundancy — The Secret Behind 99.999% Availability
- Introduction
- What Availability Numbers Mean
- Redundancy Patterns
- Triple Redundancy
- Redundancy Strategy by Layer
- Failure Recovery Metrics
- Cost vs. Availability
- Quiz

Introduction
"If a single server dies, the service dies too" — that is what a Single Point of Failure (SPOF) is.
Redundancy is the technology that eliminates SPOFs. Banks, stock exchanges, and air traffic control systems employ triple redundancy, while the cloud services we use every day run on at least dual redundancy.
What Availability Numbers Mean
Availability = Uptime / Total Time x 100%
| Availability | Grade | Annual Downtime | Monthly Downtime |
|---|---|---|---|
| 99% | Two 9s | 3.65 days | 7.3 hours |
| 99.9% | Three 9s | 8.77 hours | 43.8 minutes |
| 99.99% | Four 9s | 52.6 minutes | 4.38 minutes |
| 99.999% | Five 9s | 5.26 minutes | 26.3 seconds |
| 99.9999% | Six 9s | 31.5 seconds | 2.6 seconds |
Key takeaway: Going from 99.9% to 99.99% means reducing downtime from 8 hours to 52 minutes — but costs increase more than tenfold!
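The downtime figures in the table follow directly from the availability formula; a minimal sketch (the helper name is ours, not a standard API):

```python
def downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Downtime over a period, given availability as a percentage."""
    return (1 - availability_pct / 100) * period_hours * 60

YEAR_HOURS = 365.25 * 24  # one year, averaging in leap years

for pct in (99.0, 99.9, 99.99, 99.999):
    yearly = downtime_minutes(pct, YEAR_HOURS)
    print(f"{pct}% -> {yearly:.1f} min/year ({yearly / 12:.2f} min/month)")
```

Running this reproduces the table above, e.g. 99.99% comes out to about 52.6 minutes per year.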
Redundancy Patterns
1. Active-Standby
Normal state:
[Active Server] <── All traffic
[Standby Server] (idle, sync only)
Failure occurs:
[Active Server] x Failure detected!
[Standby Server] <── Traffic switchover (Failover)
^
Promoted to new Active
```yaml
# Active-Standby in Kubernetes: StatefulSet + Redis Sentinel leader election
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-ha
spec:
  serviceName: redis-ha
  replicas: 2                 # 1 Active + 1 Standby
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis            # must match spec.selector
    spec:
      containers:
        - name: redis
          image: redis:7
          ports:
            - containerPort: 6379
        - name: sentinel      # Failure detection + automatic failover
          image: redis:7
          command: ['redis-sentinel', '/etc/sentinel.conf']
```
Pros: Resource efficient (the Standby needs only minimal resources)
Cons: Failover delay (seconds to tens of seconds of downtime)
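The failover logic sketched above can be modeled as a heartbeat monitor that promotes the standby after a run of consecutive failed health checks. This is a toy model under illustrative names and thresholds, not Sentinel's actual implementation:

```python
class FailoverMonitor:
    """Promote the standby after `threshold` consecutive failed health checks."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.active = "server-a"
        self.standby = "server-b"

    def record_check(self, healthy: bool) -> str:
        if healthy:
            self.failures = 0          # any success resets the counter
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                # Failover: promote the standby to the new active
                self.active, self.standby = self.standby, self.active
                self.failures = 0
        return self.active

monitor = FailoverMonitor()
for check in (True, False, False, False):   # 3 consecutive failures
    active = monitor.record_check(check)
print(active)  # server-b: the standby was promoted
```

Requiring several consecutive failures before failing over is the standard guard against flapping on a single dropped heartbeat.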
2. Active-Active
Normal state:
[Server A] <── 50% traffic
[Server B] <── 50% traffic
^
Load Balancer distributes traffic
Failure occurs:
[Server A] x Failure!
[Server B] <── 100% traffic (automatic)
```yaml
# Kubernetes Service = automatic Active-Active
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 2                   # Both Active!
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api             # must match spec.selector
    spec:
      containers:
        - name: api
          image: my-api:latest
          readinessProbe:       # Failure detection
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 5
            failureThreshold: 3 # 3 failures -> removed from Service endpoints
```
Pros: Near-zero downtime (traffic is already distributed)
Cons: Complex data synchronization (conflict resolution required)
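The readinessProbe behavior above — pulling a backend out of rotation once it fails its checks — can be sketched as a simple round-robin balancer. This is a toy model of the idea, not how kube-proxy actually routes traffic:

```python
class RoundRobinBalancer:
    """Round-robin over backends, skipping any marked unhealthy."""

    def __init__(self, backends):
        self.order = list(backends)
        self.healthy = {b: True for b in backends}
        self.i = 0

    def mark_down(self, backend):
        self.healthy[backend] = False   # e.g. readinessProbe failed 3 times

    def next_backend(self):
        for _ in range(len(self.order)):
            b = self.order[self.i % len(self.order)]
            self.i += 1
            if self.healthy[b]:
                return b
        raise RuntimeError("no healthy backends")

lb = RoundRobinBalancer(["pod-a", "pod-b"])
print(lb.next_backend())  # pod-a
lb.mark_down("pod-a")     # pod-a fails its probe
print(lb.next_backend())  # pod-b: traffic shifts automatically
print(lb.next_backend())  # pod-b
```

Because every backend is already serving, removing one changes only the distribution, not availability — which is exactly the Active-Active advantage.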
3. N+1 / N+2 Redundancy
N+1: Need N servers plus 1 spare (minimum headroom)
[Server1] [Server2] [Server3] [+Spare1]
-> 1 server down is OK
N+2: Need N servers plus 2 spares
[Server1] [Server2] [Server3] [+Spare1] [+Spare2]
-> 2 simultaneous failures OK + 1 maintenance slot
2N: Double the entire fleet (most expensive)
[Server1] [Server2] [Server3] [Server4] [Server5] [Server6]
-> Half can die and still OK (mission critical)
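The trade-off between the three schemes can be stated numerically: the spare count determines both the cost overhead and how many simultaneous failures are survivable. A small sketch (function name is illustrative):

```python
def redundancy_profile(n: int, scheme: str) -> tuple[int, int]:
    """Return (total servers, tolerated simultaneous failures) for a scheme,
    where n is the number of servers actually needed to carry the load."""
    spares = {"N+1": 1, "N+2": 2, "2N": n}[scheme]
    return n + spares, spares

for scheme in ("N+1", "N+2", "2N"):
    total, tolerated = redundancy_profile(3, scheme)
    print(f"{scheme}: {total} servers, tolerates {tolerated} failure(s)")
```

For N=3 this gives 4, 5, and 6 servers respectively — matching the diagrams above.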
Triple Redundancy
Why Triple?
The fatal weakness of dual redundancy: Split Brain
Network partition with dual redundancy:
[Server A] --x-- [Server B]
"B is dead? I'm Active!" "A is dead? I'm Active!"
-> Both become Active -> Data inconsistency -> Disaster!
Solved with triple redundancy + quorum:
Triple redundancy (3-node cluster):
[Node A]--[Node B]--[Node C]
Network partition:
[Node A] [Node B]--[Node C]
1 vote (minority) 2 votes (majority = quorum!)
-> A voluntarily steps down
-> B and C continue serving
-> Split brain prevented!
Quorum formula: Majority = floor(N/2) + 1
3 nodes: 2 votes needed (tolerates 1 node failure)
5 nodes: 3 votes needed (tolerates 2 node failures)
7 nodes: 4 votes needed (tolerates 3 node failures)
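The quorum arithmetic above is worth writing down as code: an N-node cluster tolerates floor((N-1)/2) failures, which is also why even node counts add cost without adding fault tolerance:

```python
def quorum(n: int) -> int:
    """Votes needed for a majority in an n-node cluster."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a quorum remains reachable."""
    return (n - 1) // 2

for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)}")
```

Note that 4 nodes tolerate only 1 failure — the same as 3 nodes — which is the standard argument for odd-sized clusters.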
etcd (The Brain of Kubernetes) — Triple Redundancy Required!
```yaml
# etcd 3-node cluster (Raft consensus algorithm)
# 1 node down: quorum (2/3) maintained -> service continues
# 2 nodes down: quorum lost -> writes rejected (no linearizable reads either)
# kubeadm configuration (v1beta4 expresses extraArgs as a name/value list)
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
etcd:
  local:
    extraArgs:
      - name: initial-cluster
        value: etcd-0=https://10.0.0.1:2380,etcd-1=https://10.0.0.2:2380,etcd-2=https://10.0.0.3:2380
```
```python
# Raft consensus algorithm (simplified pseudocode)
class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers          # the other nodes in the cluster
        self.state = "follower"     # follower -> candidate -> leader
        self.term = 0
        self.voted_for = None
        self.log = []

    def quorum(self):
        """Majority of the full cluster (peers + self)."""
        return (len(self.peers) + 1) // 2 + 1

    def request_vote(self):
        """Leader election — requires a majority of votes"""
        self.state = "candidate"
        self.term += 1
        self.voted_for = self.node_id
        votes = 1                   # vote for self
        for peer in self.peers:
            if peer.grant_vote(self.term, self.log):
                votes += 1
        if votes >= self.quorum():  # Quorum!
            self.state = "leader"
            return True
        return False

    def replicate(self, entry):
        """Log replication — commit only after majority confirmation"""
        self.log.append(entry)
        acks = 1                    # the leader's own copy counts
        for peer in self.peers:
            if peer.append_entry(entry):
                acks += 1
        if acks >= self.quorum():   # Quorum!
            self.commit(entry)      # Majority confirmed -> safe to commit
```
Redundancy Strategy by Layer
Full Architecture
[Users]
|
v
[DNS] <- Route 53 health checks (multi-region)
|
v
[CDN] <- CloudFront (global edge)
|
v
[L4 LB] <- NLB x2 (Active-Active, AZ distributed)
|
v
[L7 LB] <- ALB x2 (Active-Active)
|
v
[Web Servers] <- Deployment replicas: 3+ (Active-Active)
|
v
[App Servers] <- Deployment replicas: 3+ (Active-Active)
|
+--> [DB Primary] --sync replication--> [DB Standby] (Active-Standby)
| +-- async replication --> [DB Read Replica x2]
|
+--> [Redis Primary] -- [Redis Replica x2] (Sentinel)
|
+--> [MQ] <- RabbitMQ 3-node (quorum queues)
Database Redundancy
[Synchronous Replication] (Strong Consistency)
Primary --COMMIT--> Standby
"Wait until both sides have written"
Pros: Zero data loss
Cons: Slower (adds network RTT)
[Asynchronous Replication] (Eventual Consistency)
Primary --COMMIT--> (later) Standby
"Write to Primary only and respond"
Pros: Fast
Cons: Recent data may be lost on failure (RPO greater than 0)
[Semi-Synchronous Replication] (Semi-Sync)
Primary --COMMIT--> Standby (waits for at least 1 ack)
Available in MySQL as a plugin (the default is asynchronous); a common compromise in financial systems
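The three commit policies differ only in how many replica acknowledgements the primary waits for before confirming a write. A schematic comparison, not tied to any specific database:

```python
def acks_required(mode: str, replica_count: int) -> int:
    """Replica acks the primary waits for before confirming COMMIT."""
    return {
        "sync": replica_count,   # wait for every standby  (RPO = 0, slowest)
        "semi-sync": 1,          # wait for one standby    (bounded loss)
        "async": 0,              # wait for none           (fastest, RPO > 0)
    }[mode]

for mode in ("sync", "semi-sync", "async"):
    print(f"{mode}: waits for {acks_required(mode, 2)} of 2 replicas")
```

The latency cost grows with each ack waited for (one network RTT plus the standby's write), which is why semi-sync is a popular middle ground.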
```sql
-- PostgreSQL Streaming Replication configuration
-- Primary (postgresql.conf):
--   wal_level = replica
--   max_wal_senders = 3
--   synchronous_standby_names = 'standby1'

-- Check standby status on the primary
SELECT client_addr, state, sync_state
FROM pg_stat_replication;
--  client_addr |   state   | sync_state
--  10.0.0.2    | streaming | sync
--  10.0.0.3    | streaming | async
```
Failure Recovery Metrics
RPO (Recovery Point Objective):
"How much data can we afford to lose?"
Synchronous replication: RPO = 0 (zero data loss)
Asynchronous replication: RPO = seconds to minutes
RTO (Recovery Time Objective):
"How quickly must we recover?"
Active-Active: RTO approx. 0
Active-Standby: RTO = seconds to minutes
Backup restore: RTO = hours
MTBF (Mean Time Between Failures):
Average time between failures (longer is better)
MTTR (Mean Time To Repair):
Average time to recover (shorter is better)
Availability = MTBF / (MTBF + MTTR)
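The last formula ties the metrics together: improving availability means either making failures rarer (raising MTBF) or recovering faster (lowering MTTR). A quick numeric check:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Same MTBF, 10x faster repair -> roughly one extra "nine"
print(f"{availability(1000, 1.0):.5f}")   # ~0.99900
print(f"{availability(1000, 0.1):.6f}")   # ~0.999900
```

This is why automation that cuts MTTR (health checks, automatic failover) is often cheaper than hardware that raises MTBF.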
Cost vs. Availability
| Availability | Configuration | Relative Cost |
|---|---|---|
| 99% | Single server | 1x |
| 99.9% | Dual redundancy (1 region) | 2-3x |
| 99.99% | Triple redundancy + auto recovery | 5-10x |
| 99.999% | Multi-region Active | 10-50x |
| 99.9999% | Multi-region + DR | 50-100x |
Quiz — Redundancy and High Availability (click to reveal!)
Q1. What is the biggest difference between Active-Standby and Active-Active? ||Active-Standby: Only 1 server handles traffic normally; there is a switchover delay during failover. Active-Active: All servers share traffic normally; if one dies, the others absorb it immediately (near-zero downtime).||
Q2. What is split brain and how do you prevent it? ||A condition where a network partition causes both sides to believe they are Active, leading to data inconsistency. Prevented with triple redundancy + quorum (majority voting) — the side that cannot achieve a majority voluntarily steps down.||
Q3. Why does etcd use 3 nodes? Why not 4? ||3 nodes: tolerates 1 node failure (quorum 2/3). 4 nodes: also tolerates only 1 node failure (quorum 3/4), same as 3 nodes but with higher cost. Odd numbers are more efficient.||
Q4. What is the difference between RPO and RTO? ||RPO: The acceptable amount of data loss (synchronous replication = 0, asynchronous = seconds). RTO: The acceptable time from failure to service recovery. Both should be as low as possible, but cost increases accordingly.||
Q5. What is the annual downtime difference between 99.99% and 99.999% availability? ||99.99% = 52.6 minutes/year, 99.999% = 5.26 minutes/year. About a 10x difference, and costs increase 5-10x or more.||
Q6. What is the difference between N+1 and 2N redundancy? ||N+1: Required count (N) + 1 spare — cost-effective, tolerates 1 failure. 2N: Double the entire fleet — expensive but service survives even if half goes down, used for mission-critical systems.||
Q7. When should synchronous vs. asynchronous replication be used? ||Synchronous: Financial transactions, payments — where zero data loss is required (RPO=0). Asynchronous: Read load distribution, analytics replicas — where slight delay is acceptable (performance priority).||
Q8. Describe the leader election process in Raft consensus. ||When a Follower stops receiving leader heartbeats, it transitions to Candidate, increments the term, requests votes from other nodes, and if it receives a majority, it is promoted to Leader and begins processing client requests.||