Introduction
"If a single server dies, the service dies too" — that is what a Single Point of Failure (SPOF) is.
Redundancy is the practice of duplicating components so that no SPOF remains. Systems where an outage is unacceptable, such as banking, stock exchange, and air traffic control platforms, commonly use **triple redundancy**, while the cloud services we use every day typically run on at least **dual redundancy**.
What Availability Numbers Mean
Availability = Uptime / Total Time x 100%
| Availability | Grade | Annual Downtime | Monthly Downtime |
| ------------ | ----------- | ---------------- | ---------------- |
| 99% | Two 9s | 3.65 days | 7.3 hours |
| 99.9% | Three 9s | 8.77 hours | 43.8 minutes |
| 99.99% | Four 9s | 52.6 minutes | 4.38 minutes |
| 99.999% | **Five 9s** | **5.26 minutes** | **26.3 seconds** |
| 99.9999% | Six 9s | 31.5 seconds | 2.6 seconds |
**Key takeaway**: Going from 99.9% to 99.99% cuts annual downtime from about 8.8 hours to about 53 minutes, but each additional nine typically multiplies the cost several times over (see the cost table later in this guide).
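The downtime columns in the table follow directly from the availability formula. A minimal sketch of that arithmetic, assuming an average year of 365.25 days (the function name `allowed_downtime_seconds` is illustrative, not from any library):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: average year, including leap days


def allowed_downtime_seconds(availability_percent, period_seconds=SECONDS_PER_YEAR):
    """Return the downtime budget, in seconds, for a given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_seconds * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999, 99.9999):
    minutes = allowed_downtime_seconds(target) / 60
    print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

Passing `period_seconds=SECONDS_PER_YEAR / 12` gives the monthly column instead.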
Redundancy Patterns
1. Active-Standby
```text
Normal state:
  [Active Server]  <── All traffic
  [Standby Server]     (idle, sync only)

Failure occurs:
  [Active Server]  x   Failure detected!
  [Standby Server] <── Traffic switchover (Failover)
         ^
         Promoted to new Active
```
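Active-Standby in Kubernetes: StatefulSet + Leader Election
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-ha
spec:
  serviceName: redis-ha # headless Service (assumed to exist, not shown)
  replicas: 2 # 1 Active + 1 Standby
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis # must match spec.selector
    spec:
      containers:
        - name: redis
          image: redis:7
          ports:
            - containerPort: 6379
        - name: sentinel # Failure detection + automatic failover
          image: redis:7
          # sentinel.conf must be provided at this path on a writable volume (not shown).
          # Sentinel needs a majority of Sentinels to agree on a failover,
          # so production setups run at least 3 Sentinel instances.
          command: ['redis-sentinel', '/etc/sentinel.conf']
```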
**Pros**: Resource efficient (Standby uses minimal resources)
**Cons**: Failover time (seconds to tens of seconds of downtime)
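The failover delay comes mostly from failure detection: the standby has to miss several heartbeats before it can safely assume the active server is gone. A minimal sketch of that logic, assuming a hypothetical `StandbyMonitor` class and explicit timestamps instead of a real clock or network:

```python
class StandbyMonitor:
    """Standby-side failure detector: promote after several missed heartbeats."""

    def __init__(self, heartbeat_interval=1.0, missed_limit=3):
        self.heartbeat_interval = heartbeat_interval  # seconds between heartbeats
        self.missed_limit = missed_limit              # misses tolerated before failover
        self.last_heartbeat = 0.0
        self.role = "standby"

    def on_heartbeat(self, now):
        """Record a heartbeat received from the active server at time `now`."""
        self.last_heartbeat = now

    def check(self, now):
        """Promote this node if the active server has been silent for too long."""
        silence_limit = self.heartbeat_interval * self.missed_limit
        if self.role == "standby" and now - self.last_heartbeat >= silence_limit:
            self.role = "active"
        return self.role


monitor = StandbyMonitor(heartbeat_interval=1.0, missed_limit=3)
monitor.on_heartbeat(now=10.0)
print(monitor.check(now=11.0))  # still within the tolerated silence
print(monitor.check(now=13.5))  # 3.5 s of silence exceeds the 3 s limit
```

In this sketch the detection time is `heartbeat_interval * missed_limit`. A lower limit shortens the outage but makes false failovers more likely. A detector like this cannot tell a dead server from an unreachable one, which is the split-brain problem covered under Triple Redundancy.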
2. Active-Active
```text
Normal state:
  [Server A] <── 50% traffic
  [Server B] <── 50% traffic
       ^
       Load Balancer distributes traffic

Failure occurs:
  [Server A] x   Failure!
  [Server B] <── 100% traffic (automatic)
```
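Kubernetes Service = automatic Active-Active
```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api
  ports:
    - port: 80
      targetPort: 8080 # forward to the container port the probe checks
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 2 # Both Active!
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api # matched by the Service and Deployment selectors
    spec:
      containers:
        - name: api
          image: my-api:latest
          ports:
            - containerPort: 8080
          readinessProbe: # Failure detection
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 5
            failureThreshold: 3 # 3 failures then removed from traffic
```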
**Pros**: Near-zero downtime (already distributing traffic)
**Cons**: Complex data synchronization (conflict resolution required)
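The traffic behavior described above can be sketched as a round-robin balancer that skips backends whose health probe has failed several times in a row. This is an illustrative model only, not how kube-proxy is implemented, and the `Backend` and `RoundRobinBalancer` names are hypothetical:

```python
import itertools


class Backend:
    def __init__(self, name, failure_threshold=3):
        self.name = name
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def ready(self):
        """A backend stays in rotation until it fails `failure_threshold` probes in a row."""
        return self.consecutive_failures < self.failure_threshold

    def record_probe(self, success):
        """A successful probe resets the counter; a failed one increments it."""
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1


class RoundRobinBalancer:
    def __init__(self, backends):
        self.backends = backends
        self._counter = itertools.count()

    def pick(self):
        """Return the next ready backend, or raise if none is ready."""
        ready = [b for b in self.backends if b.ready]
        if not ready:
            raise RuntimeError("no ready backends")
        return ready[next(self._counter) % len(ready)]


a, b = Backend("server-a"), Backend("server-b")
lb = RoundRobinBalancer([a, b])
print([lb.pick().name for _ in range(4)])  # traffic alternates between both servers
for _ in range(3):
    a.record_probe(success=False)          # server-a fails three probes in a row
print([lb.pick().name for _ in range(4)])  # all traffic now goes to server-b
```

Detection is not instantaneous. With a 5-second probe period and a threshold of 3, a failed backend can stay in rotation for roughly 15 seconds, so "near-zero downtime" still assumes clients retry failed requests.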
3. N+1 / N+2 Redundancy
```text
N+1: Need N servers plus 1 spare (minimum headroom)
  [Server1] [Server2] [Server3] [+Spare1]
  -> 1 server down is OK

N+2: Need N servers plus 2 spares
  [Server1] [Server2] [Server3] [+Spare1] [+Spare2]
  -> 2 simultaneous failures OK, or 1 failure while 1 server is in maintenance

2N: Double the entire fleet (most expensive)
  [Server1] [Server2] [Server3] [Server4] [Server5] [Server6]
  -> Half can die and still OK (mission critical)
```
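The three schemes differ only in how many servers can be unavailable before capacity drops below the N servers the load requires. A small sketch of that check, assuming identical servers and hypothetical helper names:

```python
def spare_count(scheme, n):
    """Number of spare servers for 'N+1', 'N+2', or '2N' with N required servers."""
    spares = {"N+1": 1, "N+2": 2, "2N": n}
    return spares[scheme]


def survives(total_servers, required_servers, failed=0, in_maintenance=0):
    """True if the servers still running can carry the full load."""
    return total_servers - failed - in_maintenance >= required_servers


n = 3  # servers needed to carry the load
for scheme in ("N+1", "N+2", "2N"):
    total = n + spare_count(scheme, n)
    tolerated = total - n
    print(f"{scheme}: {total} servers, tolerates {tolerated} unavailable")
```

Failures and maintenance draw from the same pool of spares, which is why N+2 covers either two failures or one failure during a maintenance window, but not both at once.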
Triple Redundancy
Why Triple?
The fatal weakness of dual redundancy: **Split Brain**
```text
Network partition with dual redundancy:
  [Server A]          --x--         [Server B]
  "B is dead? I'm Active!"    "A is dead? I'm Active!"
  -> Both become Active -> Data inconsistency -> Disaster!
```
Solved with triple redundancy + quorum:
```text
Triple redundancy (3-node cluster):
  [Node A]--[Node B]--[Node C]

Network partition:
  [Node A]            [Node B]--[Node C]
  1 vote (minority)   2 votes (majority = quorum!)
  -> A voluntarily steps down
  -> B and C continue serving
  -> Split brain prevented!
```
```text
Quorum formula: Majority = floor(N/2) + 1
  3 nodes: 2 votes needed (tolerates 1 node failure)
  5 nodes: 3 votes needed (tolerates 2 node failures)
  7 nodes: 4 votes needed (tolerates 3 node failures)
```
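The quorum rule is easy to check numerically. The sketch below uses integer division for the floor and hypothetical helper names. It also shows why an even cluster size adds cost without adding fault tolerance:

```python
def quorum(cluster_size):
    """Smallest number of votes that is a strict majority of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """Nodes that can fail while the survivors still form a quorum."""
    return cluster_size - quorum(cluster_size)


def has_quorum(reachable_nodes, cluster_size):
    """True if one side of a partition may keep serving."""
    return reachable_nodes >= quorum(cluster_size)


for size in (2, 3, 4, 5, 7):
    print(f"{size} nodes: quorum {quorum(size)}, tolerates {tolerated_failures(size)} failure(s)")

# 3-node partition from the diagram: A alone vs. B and C together
print(has_quorum(1, 3), has_quorum(2, 3))
```

With two nodes the quorum is 2, so neither side of a partition can continue. Majority voting trades split brain for unavailability, and a third node is what lets one side keep serving.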
etcd (The Brain of Kubernetes) — Triple Redundancy Required!
```text
etcd 3-node cluster (Raft consensus algorithm)
  1 node down:  quorum (2/3) maintained -> service continues
  2 nodes down: quorum lost -> writes are rejected until quorum is restored
```
kubeadm configuration example (3-member etcd cluster)
```yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
etcd:
  local:
    extraArgs: # v1beta4 takes a list of name/value pairs
      - name: initial-cluster
        value: etcd-0=https://10.0.0.1:2380,etcd-1=https://10.0.0.2:2380,etcd-2=https://10.0.0.3:2380
```
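Raft consensus algorithm (simplified sketch)
```python
class RaftNode:
    """Simplified in-process sketch: no RPCs, election timers, or log matching."""

    def __init__(self, node_id, peers=()):
        self.node_id = node_id
        self.peers = list(peers)  # the other nodes in the cluster
        self.state = "follower"   # follower -> candidate -> leader
        self.term = 0
        self.voted_for = None
        self.log = []
        self.committed = []

    def grant_vote(self, term, candidate_log):
        """Grant one vote per term, only to a log at least as long as ours."""
        if term <= self.term or len(candidate_log) < len(self.log):
            return False
        self.term, self.state = term, "follower"
        return True

    def request_vote(self):
        """Leader election: requires a majority of the whole cluster."""
        self.state = "candidate"
        self.term += 1
        self.voted_for = self.node_id
        votes = 1  # self-vote
        for peer in self.peers:
            if peer.grant_vote(self.term, self.log):
                votes += 1
        # Cluster size is len(peers) + 1, so a majority is more than half of that.
        if votes > (len(self.peers) + 1) // 2:  # quorum
            self.state = "leader"
            return True
        return False

    def append_entry(self, entry):
        """Follower side: store the entry and acknowledge it."""
        self.log.append(entry)
        return True

    def replicate(self, entry):
        """Data replication: commit only after a majority has stored the entry."""
        self.log.append(entry)
        acks = 1  # the leader's own copy
        for peer in self.peers:
            if peer.append_entry(entry):
                acks += 1
        if acks > (len(self.peers) + 1) // 2:  # quorum
            self.committed.append(entry)  # majority confirmed -> safe to commit
```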
Redundancy Strategy by Layer
Full Architecture
```text
[Users]
   |
   v
[DNS]         <- Route 53 health checks (multi-region)
   |
   v
[CDN]         <- CloudFront (global edge)
   |
   v
[L4 LB]       <- NLB x2 (Active-Active, AZ distributed)
   |
   v
[L7 LB]       <- ALB x2 (Active-Active)
   |
   v
[Web Servers] <- Deployment replicas: 3+ (Active-Active)
   |
   v
[App Servers] <- Deployment replicas: 3+ (Active-Active)
   |
   +--> [DB Primary] --sync replication--> [DB Standby] (Active-Standby)
   |         +-- async replication --> [DB Read Replica x2]
   |
   +--> [Redis Primary] -- [Redis Replica x2] (Sentinel)
   |
   +--> [MQ] <- RabbitMQ 3-node (quorum queues)
```
Database Redundancy
```text
[Synchronous Replication] (Strong Consistency)
  Primary --COMMIT--> Standby
  "Wait until both sides have written"
  Pros: Zero data loss
  Cons: Slower (adds network RTT)

[Asynchronous Replication] (Eventual Consistency)
  Primary --COMMIT--> (later) Standby
  "Write to Primary only and respond"
  Pros: Fast
  Cons: Recent data may be lost on failure (RPO greater than 0)

[Semi-Synchronous Replication] (Semi-Sync)
  Primary --COMMIT--> Standby (wait for at least 1 replica to acknowledge receipt)
  Available in MySQL as a plugin (MySQL's default is asynchronous);
  often chosen when data loss must be minimized
```
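The practical difference between the modes is what happens to acknowledged writes when the primary is lost. The toy model below is not a real database client, and `ReplicatedStore` is a hypothetical class that tracks which writes reached the standby before failover:

```python
class ReplicatedStore:
    """Toy primary/standby pair used to compare acknowledgement policies."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []   # writes applied on the primary
        self.standby = []   # writes applied on the standby
        self.pending = []   # acknowledged writes not yet shipped to the standby

    def write(self, value):
        """Apply a write; returning from this method means the client got an ack."""
        self.primary.append(value)
        if self.synchronous:
            self.standby.append(value)  # wait for the standby before acknowledging
        else:
            self.pending.append(value)  # acknowledge now, ship later

    def ship_pending(self):
        """Background replication catching up (asynchronous mode only)."""
        self.standby.extend(self.pending)
        self.pending.clear()

    def fail_over(self):
        """Primary is lost: promote the standby and return the writes that are gone."""
        lost = list(self.pending)
        self.pending.clear()
        self.primary = list(self.standby)
        return lost


sync_pair = ReplicatedStore(synchronous=True)
async_pair = ReplicatedStore(synchronous=False)
for pair in (sync_pair, async_pair):
    pair.write("order-1")
async_pair.ship_pending()      # the standby catches up on order-1
for pair in (sync_pair, async_pair):
    pair.write("order-2")      # the primary dies right after this write
print("sync lost:", sync_pair.fail_over())
print("async lost:", async_pair.fail_over())
```

The model ignores latency. In a real system the synchronous path pays at least one network round trip per commit, which is the cost side of the trade-off.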
```sql
-- PostgreSQL Streaming Replication configuration
-- Primary (postgresql.conf)
--   wal_level = replica
--   max_wal_senders = 3
--   synchronous_standby_names = 'standby1'

-- Check Standby status
SELECT client_addr, state, sync_state
FROM pg_stat_replication;

-- client_addr | state     | sync_state
-- 10.0.0.2    | streaming | sync
-- 10.0.0.3    | streaming | async
```
Failure Recovery Metrics
```text
RPO (Recovery Point Objective):
  "How much data can we afford to lose?"
  Synchronous replication: RPO = 0 (zero data loss)
  Asynchronous replication: RPO = seconds to minutes

RTO (Recovery Time Objective):
  "How quickly must we recover?"
  Active-Active: RTO approx. 0
  Active-Standby: RTO = seconds to minutes
  Backup restore: RTO = hours

MTBF (Mean Time Between Failures):
  Average time between failures (longer is better)

MTTR (Mean Time To Repair):
  Average time to recover (shorter is better)

Availability = MTBF / (MTBF + MTTR)
```
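The MTBF/MTTR formula also shows why redundancy raises availability. A rough sketch, assuming component failures are statistically independent (real failures are often correlated, so treat the results as an upper bound); the `parallel` and `series` helpers are illustrative names:

```python
import math


def availability(mtbf_hours, mttr_hours):
    """Fraction of time a component is up, from its failure and repair times."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def parallel(*availabilities):
    """Redundant components: the group is down only if every member is down."""
    return 1 - math.prod(1 - a for a in availabilities)


def series(*availabilities):
    """Chained layers: the system is up only if every layer is up."""
    return math.prod(availabilities)


single = availability(mtbf_hours=999, mttr_hours=1)  # one failure per ~1000 h, 1 h repair
print(f"single server:  {single:.6f}")
print(f"dual redundant: {parallel(single, single):.6f}")
print(f"two layers:     {series(single, single):.6f}")
```

Under the independence assumption, duplicating a three-nines component yields roughly six nines, while chaining two layers lowers overall availability. This is why each layer of an architecture needs its own redundancy.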
Cost vs. Availability
| Availability | Configuration | Relative Cost |
| ------------ | ------------- | ------------- |
| 99% | Single server | 1x |
| 99.9% | Dual redundancy (1 region) | 2-3x |
| 99.99% | Triple + auto recovery | 5-10x |
| 99.999% | Multi-region Active | 10-50x |
| 99.9999% | Multi-region + DR | 50-100x |
**Q1.** What is the biggest difference between Active-Standby and Active-Active?
||Active-Standby: Only 1 server handles traffic normally; there is a switchover delay during failover. Active-Active: All servers share traffic normally; if one dies, the others absorb it immediately (near-zero downtime).||
**Q2.** What is split brain and how do you prevent it?
||A condition where a network partition causes both sides to believe they are Active, leading to data inconsistency. Prevented with triple redundancy + quorum (majority voting) — the side that cannot achieve a majority voluntarily steps down.||
**Q3.** Why does etcd use 3 nodes? Why not 4?
||3 nodes: tolerates 1 node failure (quorum 2/3). 4 nodes: also tolerates only 1 node failure (quorum 3/4), same as 3 nodes but with higher cost. Odd numbers are more efficient.||
**Q4.** What is the difference between RPO and RTO?
||RPO: The acceptable amount of data loss (synchronous replication = 0, asynchronous = seconds to minutes). RTO: The acceptable time from failure to service recovery. Both should be as low as possible, but cost increases accordingly.||
**Q5.** What is the annual downtime difference between 99.99% and 99.999% availability?
||99.99% = 52.6 minutes/year, 99.999% = 5.26 minutes/year. That is a 10x difference in downtime, and the cost table above puts the relative cost at 5-10x versus 10-50x.||
**Q6.** What is the difference between N+1 and 2N redundancy?
||N+1: Required count (N) + 1 spare — cost-effective, tolerates 1 failure. 2N: Double the entire fleet — expensive but service survives even if half goes down, used for mission-critical systems.||
**Q7.** When should synchronous vs. asynchronous replication be used?
||Synchronous: Financial transactions, payments — where zero data loss is required (RPO=0). Asynchronous: Read load distribution, analytics replicas — where slight delay is acceptable (performance priority).||
**Q8.** Describe the leader election process in Raft consensus.
||When a Follower stops receiving leader heartbeats, it transitions to Candidate, increments the term, requests votes from other nodes, and if it receives a majority, it is promoted to Leader and begins processing client requests.||