- Introduction
- What Availability Numbers Mean
- Redundancy Patterns
- Triple Redundancy
- Redundancy Strategy by Layer
- Failure Recovery Metrics
- Cost vs. Availability

Introduction
"If a single server dies, the service dies too" — that is what a Single Point of Failure (SPOF) is.
Redundancy is the technology that eliminates SPOFs. Banks, stock exchanges, and air traffic control systems employ triple redundancy, while the cloud services we use every day run on at least dual redundancy.
What Availability Numbers Mean
Availability = Uptime / Total Time x 100%
| Availability | Grade | Annual Downtime | Monthly Downtime |
|---|---|---|---|
| 99% | Two 9s | 3.65 days | 7.3 hours |
| 99.9% | Three 9s | 8.77 hours | 43.8 minutes |
| 99.99% | Four 9s | 52.6 minutes | 4.38 minutes |
| 99.999% | Five 9s | 5.26 minutes | 26.3 seconds |
| 99.9999% | Six 9s | 31.5 seconds | 2.6 seconds |
Key takeaway: Going from 99.9% to 99.99% means cutting annual downtime from roughly 8.8 hours to 52.6 minutes, but each additional nine costs disproportionately more.
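As a quick sanity check of the table, annual downtime follows directly from the availability percentage. A throwaway helper (not any standard library function):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def annual_downtime_minutes(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

# 99.9%  -> ~525.6 minutes (~8.8 hours)
# 99.99% -> ~52.6 minutes
```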
Redundancy Patterns
1. Active-Standby
```
Normal state:
[Active Server]  <── all traffic
[Standby Server] (idle, sync only)

Failure occurs:
[Active Server]  x failure detected!
[Standby Server] <── traffic switched over (failover)
       ^
       promoted to the new Active
```
```yaml
# Active-Standby in Kubernetes: StatefulSet + Redis Sentinel
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-ha
spec:
  serviceName: redis-ha   # headless Service required by StatefulSets
  replicas: 2             # 1 Active + 1 Standby
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7
          ports:
            - containerPort: 6379
        - name: sentinel   # failure detection + automatic failover
          image: redis:7
          command: ["redis-sentinel", "/etc/sentinel.conf"]
```
Pros: Resource-efficient (the Standby consumes minimal resources)
Cons: Failover delay (seconds to tens of seconds of downtime)
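The failover flow above can be sketched as a tiny monitor loop. This is an illustration only: `FailoverMonitor`, the threshold, and the node names are all made up, and a real system (e.g. Redis Sentinel) also needs fencing and quorum before promoting the standby.

```python
FAILURE_THRESHOLD = 3  # consecutive missed heartbeats before failover

class FailoverMonitor:
    """Toy Active-Standby monitor: promote the standby after repeated misses."""

    def __init__(self, active: str, standby: str):
        self.active = active
        self.standby = standby
        self.missed = 0

    def check(self, heartbeat_ok: bool) -> str:
        """Process one heartbeat result; return the node currently serving traffic."""
        if heartbeat_ok:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= FAILURE_THRESHOLD:
                # Swap roles; the old active must be fenced before it rejoins
                self.active, self.standby = self.standby, self.active
                self.missed = 0
        return self.active
```

The threshold mirrors the `failureThreshold: 3` idea used by Kubernetes probes later in this section: one missed heartbeat is noise, three in a row is a failure.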
2. Active-Active
```
Normal state:
[Server A] <── 50% traffic
[Server B] <── 50% traffic
     ^
     load balancer distributes traffic

Failure occurs:
[Server A] x failure!
[Server B] <── 100% traffic (automatic)
```
```yaml
# Kubernetes Service = automatic Active-Active
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api
  ports:
    - port: 80
      targetPort: 8080   # the container listens on 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 2            # both Active!
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: api
          image: my-api:latest
          readinessProbe:        # failure detection
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 5
            failureThreshold: 3  # after 3 failures, removed from traffic
```
Pros: Near-zero downtime (traffic is already distributed)
Cons: Complex data synchronization (conflict resolution required)
3. N+1 / N+2 Redundancy
```
N+1: N servers needed, plus 1 spare (minimum headroom)
[Server1] [Server2] [Server3] [+Spare1]
-> survives 1 server failure

N+2: N servers needed, plus 2 spares
[Server1] [Server2] [Server3] [+Spare1] [+Spare2]
-> survives 2 simultaneous failures, or 1 failure + 1 maintenance slot

2N: double the entire fleet (most expensive)
[Server1] [Server2] [Server3] [Server4] [Server5] [Server6]
-> half the fleet can die and service survives (mission critical)
```
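The arithmetic behind these schemes fits in one line. A throwaway helper (my own naming, not any standard API):

```python
def survivable_failures(required: int, deployed: int) -> int:
    """How many servers can fail while still meeting required capacity."""
    return deployed - required

# N+1: 3 required, 4 deployed -> tolerates 1 failure
# N+2: 3 required, 5 deployed -> tolerates 2 failures
# 2N:  3 required, 6 deployed -> tolerates 3 failures (half the fleet)
```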
Triple Redundancy
Why Triple?
The fatal weakness of dual redundancy: Split Brain
```
Network partition with dual redundancy:
[Server A] --x-- [Server B]
"B is dead? I'm Active!"   "A is dead? I'm Active!"
-> both become Active -> data inconsistency -> disaster!
```

Solved with triple redundancy + quorum:

```
Triple redundancy (3-node cluster):
[Node A]--[Node B]--[Node C]

Network partition:
[Node A]      [Node B]--[Node C]
1 vote        2 votes (majority = quorum!)
(minority)
-> A voluntarily steps down
-> B and C continue serving
-> split brain prevented!
```
Quorum formula: majority = ⌊N/2⌋ + 1
3 nodes: 2 votes needed (tolerates 1 node failure)
5 nodes: 3 votes needed (tolerates 2 node failures)
7 nodes: 4 votes needed (tolerates 3 node failures)
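The quorum arithmetic above as code (helper names are my own):

```python
def quorum(n: int) -> int:
    """Votes needed for a majority in an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """Node failures the cluster survives while still reaching quorum."""
    return n - quorum(n)

# quorum(3) == 2, quorum(5) == 3, quorum(7) == 4
```

Note that `fault_tolerance(4) == fault_tolerance(3) == 1`: an even-sized cluster pays for an extra node without gaining any extra fault tolerance, which is why odd cluster sizes are preferred.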
etcd (The Brain of Kubernetes) — Triple Redundancy Required!
```yaml
# etcd 3-node cluster (Raft consensus algorithm)
# 1 node down: quorum (2/3) maintained -> service continues
# 2 nodes down: quorum lost -> writes rejected; only stale reads possible
# kubeadm configuration (v1beta4 expects extraArgs as name/value pairs)
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
etcd:
  local:
    extraArgs:
      - name: initial-cluster
        value: etcd-0=https://10.0.0.1:2380,etcd-1=https://10.0.0.2:2380,etcd-2=https://10.0.0.3:2380
```
```python
# Raft consensus algorithm (simplified pseudocode)
class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers            # the other nodes in the cluster
        self.state = "follower"       # follower -> candidate -> leader
        self.term = 0
        self.voted_for = None
        self.log = []

    def majority(self):
        # Majority of the whole cluster (peers + self)
        return (len(self.peers) + 1) // 2 + 1

    def request_vote(self):
        """Leader election -- requires a majority of votes"""
        self.state = "candidate"
        self.term += 1
        self.voted_for = self.node_id
        votes = 1                     # vote for self
        for peer in self.peers:
            if peer.grant_vote(self.term, self.log):
                votes += 1
        if votes >= self.majority():  # quorum!
            self.state = "leader"
            return True
        return False

    def replicate(self, entry):
        """Log replication -- commit only after majority acknowledgement"""
        self.log.append(entry)
        acks = 1                      # counts the leader's own copy
        for peer in self.peers:
            if peer.append_entry(entry):
                acks += 1
        if acks >= self.majority():   # quorum!
            self.commit(entry)        # majority confirmed -> safe to commit
```
Redundancy Strategy by Layer
Full Architecture
```
[Users]
   |
   v
[DNS]   <- Route 53 health checks (multi-region)
   |
   v
[CDN]   <- CloudFront (global edge)
   |
   v
[L4 LB] <- NLB x2 (Active-Active, AZ-distributed)
   |
   v
[L7 LB] <- ALB x2 (Active-Active)
   |
   v
[Web Servers] <- Deployment replicas: 3+ (Active-Active)
   |
   v
[App Servers] <- Deployment replicas: 3+ (Active-Active)
   |
   +--> [DB Primary] --sync replication--> [DB Standby] (Active-Standby)
   |        +-- async replication --> [DB Read Replica x2]
   |
   +--> [Redis Primary] -- [Redis Replica x2] (Sentinel)
   |
   +--> [MQ] <- RabbitMQ 3-node (quorum queues)
```
Database Redundancy
[Synchronous Replication] (strong consistency)
Primary --COMMIT--> Standby
"Wait until both sides have written"
Pros: Zero data loss
Cons: Slower (adds one network round trip per commit)

[Asynchronous Replication] (eventual consistency)
Primary --COMMIT--> (later) Standby
"Write to the Primary only, then respond"
Pros: Fast
Cons: Recent data may be lost on failure (RPO greater than 0)

[Semi-Synchronous Replication] (semi-sync)
Primary --COMMIT--> Standby (confirmation from just 1 replica)
Available in MySQL as a plugin (the default is asynchronous); a common compromise in financial systems
```sql
-- PostgreSQL streaming replication configuration
-- Primary (postgresql.conf):
--   wal_level = replica
--   max_wal_senders = 3
--   synchronous_standby_names = 'standby1'

-- Check standby status
SELECT client_addr, state, sync_state
FROM pg_stat_replication;
--  client_addr |   state   | sync_state
--  10.0.0.2    | streaming | sync
--  10.0.0.3    | streaming | async
```
Failure Recovery Metrics
RPO (Recovery Point Objective):
"How much data can we afford to lose?"
Synchronous replication: RPO = 0 (zero data loss)
Asynchronous replication: RPO = seconds to minutes
RTO (Recovery Time Objective):
"How quickly must we recover?"
Active-Active: RTO approx. 0
Active-Standby: RTO = seconds to minutes
Backup restore: RTO = hours
MTBF (Mean Time Between Failures):
Average time between failures (longer is better)
MTTR (Mean Time To Repair):
Average time to recover (shorter is better)
Availability = MTBF / (MTBF + MTTR)
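A quick numeric check of the availability formula above (the MTBF/MTTR values are illustrative):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from MTBF and MTTR, as a percentage."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

# e.g. a failure every 1000 h, 1 h to recover -> about three nines
```

This makes the two levers explicit: you raise availability either by failing less often (higher MTBF) or by recovering faster (lower MTTR), and redundancy patterns like Active-Active work mostly on the MTTR side.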
Cost vs. Availability
| Availability | Configuration | Relative Cost |
|---|---|---|
| 99% | Single server | 1x |
| 99.9% | Dual redundancy (1 region) | 2-3x |
| 99.99% | Triple redundancy + auto recovery | 5-10x |
| 99.999% | Multi-region Active | 10-50x |
| 99.9999% | Multi-region + DR | 50-100x |
Quiz — Redundancy and High Availability (click to reveal!)
Q1. What is the biggest difference between Active-Standby and Active-Active? ||Active-Standby: Only 1 server handles traffic normally; there is a switchover delay during failover. Active-Active: All servers share traffic normally; if one dies, the others absorb it immediately (near-zero downtime).||
Q2. What is split brain and how do you prevent it? ||A condition where a network partition causes both sides to believe they are Active, leading to data inconsistency. Prevented with triple redundancy + quorum (majority voting) — the side that cannot achieve a majority voluntarily steps down.||
Q3. Why does etcd use 3 nodes? Why not 4? ||3 nodes: tolerates 1 node failure (quorum 2/3). 4 nodes: also tolerates only 1 node failure (quorum 3/4), same as 3 nodes but with higher cost. Odd numbers are more efficient.||
Q4. What is the difference between RPO and RTO? ||RPO: The acceptable amount of data loss (synchronous replication = 0, asynchronous = seconds). RTO: The acceptable time from failure to service recovery. Both should be as low as possible, but cost increases accordingly.||
Q5. What is the annual downtime difference between 99.99% and 99.999% availability? ||99.99% = 52.6 minutes/year, 99.999% = 5.26 minutes/year. About a 10x difference, and costs increase 5-10x or more.||
Q6. What is the difference between N+1 and 2N redundancy? ||N+1: Required count (N) + 1 spare — cost-effective, tolerates 1 failure. 2N: Double the entire fleet — expensive but service survives even if half goes down, used for mission-critical systems.||
Q7. When should synchronous vs. asynchronous replication be used? ||Synchronous: Financial transactions, payments — where zero data loss is required (RPO=0). Asynchronous: Read load distribution, analytics replicas — where slight delay is acceptable (performance priority).||
Q8. Describe the leader election process in Raft consensus. ||When a Follower stops receiving leader heartbeats, it transitions to Candidate, increments the term, requests votes from other nodes, and if it receives a majority, it is promoted to Leader and begins processing client requests.||