Introduction

"If a single server dies, the service dies too" — that is what a Single Point of Failure (SPOF) is.

Redundancy is the technology that eliminates SPOFs. Banks, stock exchanges, and air traffic control systems employ **triple redundancy**, while the cloud services we use every day run on at least **dual redundancy**.

What Availability Numbers Mean

Availability = Uptime / Total Time x 100%

**Key takeaway**: Going from 99.9% to 99.99% cuts annual downtime from about 8.8 hours to about 53 minutes, but the relative cost climbs from roughly 2-3x to 5-10x that of a single server (see Cost vs. Availability).
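The downtime figures follow directly from the availability formula. The sketch below is a minimal illustration that assumes a 365-day year; the function name `annual_downtime_minutes` is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year (in minutes) allowed at a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = annual_downtime_minutes(target)
    print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h/year)")
```

Each extra "nine" divides the allowed downtime by ten, which is why every step up the availability ladder is harder to achieve than the previous one.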

Redundancy Patterns

1. Active-Standby

Normal state:

[Active Server] <── All traffic

[Standby Server] (idle, sync only)

Failure occurs:

[Active Server] x Failure detected!

[Standby Server] <── Traffic switchover (Failover)

Promoted to new Active

Active-Standby in Kubernetes: StatefulSet + Leader Election

apiVersion: apps/v1
kind: StatefulSet
metadata: # name not shown in the source
spec:
  # serviceName (required for a StatefulSet) is not shown in the source
  replicas: 2 # 1 Active + 1 Standby
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis # must match spec.selector
    spec:
      containers:
        - name: redis
          image: redis:7
          ports:
            - containerPort: 6379
        - name: sentinel # Failure detection + automatic failover
          image: redis:7
          command: ['redis-sentinel', '/etc/sentinel.conf']

**Pros**: Resource efficient (Standby uses minimal resources)

**Cons**: Failover time (seconds to tens of seconds of downtime)
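The failover delay comes from the detection step: the standby is promoted only after several health checks in a row have failed. The following is a toy sketch of that logic, not how Redis Sentinel is implemented; `Server`, `FailoverMonitor`, and the threshold of 3 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    role: str  # "active" or "standby"
    healthy: bool = True


class FailoverMonitor:
    """Promotes the standby after several consecutive failed health checks."""

    def __init__(self, active: Server, standby: Server, failure_threshold: int = 3):
        self.active = active
        self.standby = standby
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self) -> Server:
        """Run one health check and return the server that receives traffic."""
        if self.active.healthy:
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and self.standby.healthy:
            # Failover: swap roles so the old standby becomes the new active.
            self.active.role, self.standby.role = "standby", "active"
            self.active, self.standby = self.standby, self.active
            self.consecutive_failures = 0
        return self.active


a, b = Server("server-a", "active"), Server("server-b", "standby")
monitor = FailoverMonitor(a, b)
a.healthy = False  # simulate a crash of the active server
targets = [monitor.check().name for _ in range(3)]
print(targets)
```

The first two checks still point at the failed server, which is the downtime window listed under Cons. Lowering the threshold shortens that window but raises the risk of a false failover.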

2. Active-Active

Normal state:

[Server A] <── 50% traffic

[Server B] <── 50% traffic

Load Balancer distributes traffic

Failure occurs:

[Server A] x Failure!

[Server B] <── 100% traffic (automatic)

Kubernetes Service = automatic Active-Active

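apiVersion: v1
kind: Service
metadata: # name not shown in the source
spec:
  selector:
    app: my-api
  ports:
    - port: 80
      targetPort: 8080 # assumes the API listens on the probed port
---
apiVersion: apps/v1
kind: Deployment
metadata: # name not shown in the source
spec:
  replicas: 2 # Both Active!
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api # matched by the Service selector
    spec:
      containers:
        - name: api
          image: my-api:latest
          readinessProbe: # Failure detection
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 5
            failureThreshold: 3 # 3 failures then removed from traffic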

**Pros**: Near-zero downtime (already distributing traffic)

**Cons**: Complex data synchronization (conflict resolution required)
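In Active-Active, a failure only shrinks the pool of targets, so no promotion step is needed. A minimal round-robin sketch of that idea follows; the `RoundRobinBalancer` class and the server names are hypothetical, and a real load balancer also handles connection draining and health probing.

```python
from itertools import count


class RoundRobinBalancer:
    """Spreads requests over healthy backends; unhealthy ones are skipped."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._counter = count()

    def mark_down(self, backend):
        """Remove a backend from rotation (for example after failed probes)."""
        self.healthy.discard(backend)

    def mark_up(self, backend):
        """Put a known backend back into rotation."""
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend in round-robin order."""
        candidates = [b for b in self.backends if b in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy backends")
        return candidates[next(self._counter) % len(candidates)]


lb = RoundRobinBalancer(["server-a", "server-b"])
before = [lb.pick() for _ in range(4)]  # traffic alternates between both servers
lb.mark_down("server-a")
after = [lb.pick() for _ in range(4)]   # the survivor absorbs everything
print(before, after)
```

The surviving server must be able to carry the full load, so each server in a two-node Active-Active pair should normally run below 50% utilization.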

3. N+1 / N+2 Redundancy

N+1: Need N servers plus 1 spare (minimum headroom)

[Server1] [Server2] [Server3] [+Spare1]

-> 1 server down is OK

N+2: Need N servers plus 2 spares

[Server1] [Server2] [Server3] [+Spare1] [+Spare2]

-> 2 simultaneous failures OK, or 1 failure while 1 server is under maintenance

2N: Double the entire fleet (most expensive)

[Server1] [Server2] [Server3] [Server4] [Server5] [Server6]

-> Half can die and still OK (mission critical)
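The three schemes can be compared with a few lines of arithmetic. This is a sketch under assumed numbers: the peak load and per-server capacity are hypothetical, and the function names are made up for this example.

```python
import math


def servers_needed(peak_load: float, per_server_capacity: float) -> int:
    """N: the minimum number of servers that can carry the peak load."""
    return math.ceil(peak_load / per_server_capacity)


def fleet_size(n: int, scheme: str) -> int:
    """Total servers to provision for a redundancy scheme."""
    if scheme == "N+1":
        return n + 1
    if scheme == "N+2":
        return n + 2
    if scheme == "2N":
        return 2 * n
    raise ValueError(f"unknown scheme: {scheme}")


def tolerated_failures(n: int, scheme: str) -> int:
    """How many servers can be lost while N servers still remain."""
    return fleet_size(n, scheme) - n


# Hypothetical workload: 2500 requests/s at peak, 1000 requests/s per server.
n = servers_needed(peak_load=2500, per_server_capacity=1000)
for scheme in ("N+1", "N+2", "2N"):
    print(scheme, fleet_size(n, scheme), tolerated_failures(n, scheme))
```

With N = 3 this reproduces the fleet sizes in the diagrams above: 4, 5, and 6 servers.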

Triple Redundancy

Why Triple?

The fatal weakness of dual redundancy: **Split Brain**

Network partition with dual redundancy:

[Server A] --x-- [Server B]

"B is dead? I'm Active!" "A is dead? I'm Active!"

-> Both become Active -> Data inconsistency -> Disaster!

Solved with triple redundancy + quorum:

Triple redundancy (3-node cluster):

[Node A]--[Node B]--[Node C]

Network partition:

[Node A] [Node B]--[Node C]

1 vote (minority) 2 votes (majority = quorum!)

-> A voluntarily steps down

-> B and C continue serving

-> Split brain prevented!

Quorum formula: Majority = floor(N/2) + 1

3 nodes: 2 votes needed (tolerates 1 node failure)

5 nodes: 3 votes needed (tolerates 2 node failures)

7 nodes: 4 votes needed (tolerates 3 node failures)
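The quorum arithmetic is easy to check in code. The helper names below (`quorum`, `fault_tolerance`, `has_quorum`) are assumptions for this sketch.

```python
def quorum(n: int) -> int:
    """Smallest number of votes that is a strict majority of n nodes."""
    return n // 2 + 1


def fault_tolerance(n: int) -> int:
    """Nodes that can fail while a quorum is still reachable."""
    return n - quorum(n)


def has_quorum(partition_size: int, cluster_size: int) -> bool:
    """True if one side of a network partition may keep serving."""
    return partition_size >= quorum(cluster_size)


for n in (2, 3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates={fault_tolerance(n)} failure(s)")
```

Two results stand out. A 2-node cluster tolerates zero failures under majority voting, because after a partition neither side holds 2 votes. A 4-node cluster tolerates only 1 failure, the same as 3 nodes, which is why odd cluster sizes are preferred.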

etcd (The Brain of Kubernetes) — Triple Redundancy Required!

etcd 3-node cluster (Raft consensus algorithm)

1 node down: quorum (2/3) maintained -> service continues

2 nodes down: quorum lost -> no writes accepted (only stale, serializable reads may still be served)

kubeadm configuration (etcd excerpt)

apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
etcd:
  local:
    extraArgs: # v1beta4 takes a list of name/value pairs
      - name: initial-cluster
        value: etcd-0=https://10.0.0.1:2380,etcd-1=https://10.0.0.2:2380,etcd-2=https://10.0.0.3:2380

Raft consensus algorithm (pseudocode)

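class RaftNode:
    def __init__(self, node_id, peers=None):
        self.node_id = node_id
        self.peers = peers if peers is not None else []  # other nodes, excluding self
        self.state = "follower"  # follower -> candidate -> leader
        self.term = 0
        self.voted_for = None
        self.log = []
        self.committed = []

    def majority(self):
        """Votes needed: more than half of the whole cluster (peers + self)."""
        return (len(self.peers) + 1) // 2 + 1

    def grant_vote(self, term, candidate_id, candidate_log):
        """Grant at most one vote per term, and only to an up-to-date candidate."""
        if term < self.term:
            return False
        if term > self.term:  # newer term: forget the old vote
            self.term, self.voted_for, self.state = term, None, "follower"
        if self.voted_for not in (None, candidate_id):
            return False
        if len(candidate_log) < len(self.log):  # simplified up-to-date check
            return False
        self.voted_for = candidate_id
        return True

    def request_vote(self):
        """Leader election: requires a majority vote."""
        self.state = "candidate"
        self.term += 1
        self.voted_for = self.node_id
        votes = 1  # Self-vote
        for peer in self.peers:
            if peer.grant_vote(self.term, self.node_id, self.log):
                votes += 1
        if votes >= self.majority():  # Quorum!
            self.state = "leader"
            return True
        self.state = "follower"
        return False

    def append_entry(self, entry):
        """Follower side: accept the entry (log consistency checks omitted)."""
        self.log.append(entry)
        return True

    def replicate(self, entry):
        """Data replication: commit only after a majority has the entry."""
        self.log.append(entry)
        acks = 1  # the leader's own copy
        for peer in self.peers:
            if peer.append_entry(entry):
                acks += 1
        if acks >= self.majority():  # Quorum!
            self.committed.append(entry)  # Majority confirmed -> safe to commit
            return True
        return False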

Redundancy Strategy by Layer

Full Architecture

[Users]

[DNS] <- Route 53 health checks (multi-region)

[CDN] <- CloudFront (global edge)

[L4 LB] <- NLB x2 (Active-Active, AZ distributed)

[L7 LB] <- ALB x2 (Active-Active)

[Web Servers] <- Deployment replicas: 3+ (Active-Active)

[App Servers] <- Deployment replicas: 3+ (Active-Active)

+--> [DB Primary] --sync replication--> [DB Standby] (Active-Standby)

| +-- async replication --> [DB Read Replica x2]

+--> [Redis Primary] -- [Redis Replica x2] (Sentinel)

+--> [MQ] <- RabbitMQ 3-node (quorum queues)

Database Redundancy

[Synchronous Replication] (Strong Consistency)

Primary --COMMIT--> Standby

"Wait until both sides have written"

Pros: Zero data loss

Cons: Slower (adds network RTT)

[Asynchronous Replication] (Eventual Consistency)

Primary --COMMIT--> (later) Standby

"Write to Primary only and respond"

Pros: Fast

Cons: Recent data may be lost on failure (RPO greater than 0)

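[Semi-Synchronous Replication] (Semi-Sync)

Primary --COMMIT--> Standby (waits for an acknowledgment from at least 1 replica)

Available in MySQL as a plugin (MySQL's default replication is asynchronous); often preferred by financial institutions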
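The three modes differ only in how many replicas must hold a write before the primary acknowledges it. The toy model below makes the data-loss consequence visible. It is a simplification: real semi-synchronous replication waits for a receipt acknowledgment, not for the replica to apply the change, and every class and function name here is hypothetical.

```python
class Replica:
    def __init__(self):
        self.log = []


class Primary:
    """Toy primary: copies to some replicas before acknowledging a write."""

    def __init__(self, replicas, mode):
        self.replicas = replicas
        self.mode = mode   # "sync", "semi-sync", or "async"
        self.log = []
        self.pending = []  # (record, replicas that have not received it yet)

    def write(self, record):
        self.log.append(record)
        if self.mode == "sync":
            targets = self.replicas      # wait for every replica
        elif self.mode == "semi-sync":
            targets = self.replicas[:1]  # wait for one replica only
        else:
            targets = []                 # async: wait for nobody
        for replica in targets:
            replica.log.append(record)
        lagging = [r for r in self.replicas if r not in targets]
        self.pending.append((record, lagging))
        return "ack"

    def flush(self):
        """Background shipping of writes that were not copied synchronously."""
        for record, lagging in self.pending:
            for replica in lagging:
                replica.log.append(record)
        self.pending.clear()


def lost_on_failover(primary):
    """Acknowledged writes missing from the most up-to-date replica."""
    best = max(primary.replicas, key=lambda r: len(r.log))
    return [rec for rec in primary.log if rec not in best.log]


results = {}
for mode in ("sync", "semi-sync", "async"):
    primary = Primary([Replica(), Replica()], mode)
    for record in ("tx1", "tx2", "tx3"):
        primary.write(record)
    # The primary crashes here, before flush() ever runs.
    results[mode] = lost_on_failover(primary)
print(results)
```

In this model only the asynchronous primary loses acknowledged writes, which corresponds to an RPO greater than 0. The other two modes pay for their safety with extra latency on every commit.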

-- PostgreSQL Streaming Replication configuration
-- Primary (postgresql.conf)
--   wal_level = replica
--   max_wal_senders = 3
--   synchronous_standby_names = 'standby1'

-- Check Standby status
SELECT client_addr, state, sync_state
FROM pg_stat_replication;

-- client_addr | state     | sync_state
-- 10.0.0.2    | streaming | sync
-- 10.0.0.3    | streaming | async

Failure Recovery Metrics

RPO (Recovery Point Objective):

"How much data can we afford to lose?"

Synchronous replication: RPO = 0 (zero data loss)

Asynchronous replication: RPO = seconds to minutes

RTO (Recovery Time Objective):

"How quickly must we recover?"

Active-Active: RTO approx. 0

Active-Standby: RTO = seconds to minutes

Backup restore: RTO = hours

MTBF (Mean Time Between Failures):

Average time between failures (longer is better)

MTTR (Mean Time To Repair):

Average time to recover (shorter is better)

Availability = MTBF / (MTBF + MTTR)
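The formula shows why redundancy focuses on shortening recovery: availability improves as MTTR shrinks even when failures are just as frequent. A small sketch with hypothetical numbers (one failure per 1000 hours on average):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers: the same failure rate, two different recovery times.
slow_repair = availability(mtbf_hours=1000, mttr_hours=1)     # manual recovery, 1 hour
fast_repair = availability(mtbf_hours=1000, mttr_hours=0.01)  # automatic failover, 36 seconds
print(f"manual recovery:    {slow_repair:.4%}")
print(f"automatic failover: {fast_repair:.4%}")
```

Cutting MTTR from one hour to 36 seconds moves the same hardware from about three nines to about five nines.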

Cost vs. Availability

Availability | Configuration               | Relative Cost
-------------+-----------------------------+--------------
99%          | Single server               | 1x
99.9%        | Dual redundancy (1 region)  | 2-3x
99.99%       | Triple + auto recovery      | 5-10x
99.999%      | Multi-region Active         | 10-50x
99.9999%     | Multi-region + DR           | 50-100x

**Q1.** What is the biggest difference between Active-Standby and Active-Active?

||Active-Standby: Only 1 server handles traffic normally; there is a switchover delay during failover. Active-Active: All servers share traffic normally; if one dies, the others absorb it immediately (near-zero downtime).||

**Q2.** What is split brain and how do you prevent it?

||A condition where a network partition causes both sides to believe they are Active, leading to data inconsistency. Prevented with triple redundancy + quorum (majority voting) — the side that cannot achieve a majority voluntarily steps down.||

**Q3.** Why does etcd use 3 nodes? Why not 4?

||3 nodes: tolerates 1 node failure (quorum 2/3). 4 nodes: also tolerates only 1 node failure (quorum 3/4), same as 3 nodes but with higher cost. Odd numbers are more efficient.||

**Q4.** What is the difference between RPO and RTO?

||RPO: The acceptable amount of data loss (synchronous replication = 0, asynchronous = seconds). RTO: The acceptable time from failure to service recovery. Both should be as low as possible, but cost increases accordingly.||

**Q5.** What is the annual downtime difference between 99.99% and 99.999% availability?

||99.99% = 52.6 minutes/year, 99.999% = 5.26 minutes/year. About a 10x difference, and costs increase 5-10x or more.||

**Q6.** What is the difference between N+1 and 2N redundancy?

||N+1: Required count (N) + 1 spare — cost-effective, tolerates 1 failure. 2N: Double the entire fleet — expensive but service survives even if half goes down, used for mission-critical systems.||

**Q7.** When should synchronous vs. asynchronous replication be used?

||Synchronous: Financial transactions, payments — where zero data loss is required (RPO=0). Asynchronous: Read load distribution, analytics replicas — where slight delay is acceptable (performance priority).||

**Q8.** Describe the leader election process in Raft consensus.

||When a Follower stops receiving leader heartbeats, it transitions to Candidate, increments the term, requests votes from other nodes, and if it receives a majority, it is promoted to Leader and begins processing client requests.||

Quiz

Q1: What is the main topic covered in "The Complete Guide to Redundancy — The Secret Behind

99.999% Availability"?

From dual redundancy to triple redundancy, Active-Standby, Active-Active, N+1, split brain, and

quorum. We explain why the difference between 99.9% and 99.999% translates to 8 hours vs. 5

minutes of annual downtime, with real-world architecture examples.
