The Complete Guide to Redundancy — The Secret Behind 99.999% Availability

Redundancy & HA

Introduction

"If a single server dies, the service dies too" — that is what a Single Point of Failure (SPOF) is.

Redundancy is the technology that eliminates SPOFs. Banks, stock exchanges, and air traffic control systems employ triple redundancy, while the cloud services we use every day run on at least dual redundancy.

What Availability Numbers Mean

Availability = Uptime / Total Time x 100%
Availability | Grade    | Annual Downtime | Monthly Downtime
-------------+----------+-----------------+-----------------
99%          | Two 9s   | 3.65 days       | 7.3 hours
99.9%        | Three 9s | 8.77 hours      | 43.8 minutes
99.99%       | Four 9s  | 52.6 minutes    | 4.38 minutes
99.999%      | Five 9s  | 5.26 minutes    | 26.3 seconds
99.9999%     | Six 9s   | 31.5 seconds    | 2.6 seconds

Key takeaway: Going from 99.9% to 99.99% means cutting annual downtime from 8.77 hours to 52.6 minutes, but the cost of the configuration typically rises several-fold (see the cost table later in this article).
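The downtime figures in the table follow directly from the definition of availability. A minimal Python sketch (the percentages are the usual targets, not measured values):

```python
# Maximum allowed downtime per year implied by an availability target.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 (ignoring leap years)

def annual_downtime_minutes(availability_pct: float) -> float:
    """Downtime budget (minutes/year) for a given availability percentage."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100) / 60

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {annual_downtime_minutes(pct):.1f} min/year")
```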

Redundancy Patterns

1. Active-Standby

Normal state:
  [Active Server] <── All traffic
  [Standby Server]    (idle, sync only)

Failure occurs:
  [Active Server] x   Failure detected!
  [Standby Server] <── Traffic switchover (Failover)
        ^
    Promoted to new Active
# Active-Standby in Kubernetes: StatefulSet + Redis Sentinel for failover
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-ha
spec:
  serviceName: redis-ha # Headless Service required by StatefulSets
  replicas: 2 # 1 Active + 1 Standby
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis # Must match spec.selector
    spec:
      containers:
        - name: redis
          image: redis:7
          ports:
            - containerPort: 6379
        - name: sentinel # Failure detection + automatic failover
          image: redis:7
          command: ['redis-sentinel', '/etc/sentinel.conf']

Pros: Resource efficient (Standby uses minimal resources)
Cons: Failover time (seconds to tens of seconds of downtime)

2. Active-Active

Normal state:
  [Server A] <── 50% traffic
  [Server B] <── 50% traffic
      ^
  Load Balancer distributes traffic

Failure occurs:
  [Server A] x   Failure!
  [Server B] <── 100% traffic (automatic)
# Kubernetes Service + Deployment = automatic Active-Active
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api
  ports:
    - port: 80
      targetPort: 8080 # The port the app container listens on
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 2 # Both Active!
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api # Must match spec.selector and the Service selector
    spec:
      containers:
        - name: api
          image: my-api:latest
          ports:
            - containerPort: 8080
          readinessProbe: # Failure detection
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 5
            failureThreshold: 3 # 3 consecutive failures -> removed from traffic

Pros: Near-zero downtime (traffic is already distributed)
Cons: Complex data synchronization (conflict resolution required)

3. N+1 / N+2 Redundancy

N+1: Need N servers plus 1 spare (minimum headroom)
  [Server1] [Server2] [Server3] [+Spare1]
  -> 1 server down is OK

N+2: Need N servers plus 2 spares
  [Server1] [Server2] [Server3] [+Spare1] [+Spare2]
  -> 2 simultaneous failures OK + 1 maintenance slot

2N: Double the entire fleet (most expensive)
  [Server1] [Server2] [Server3] [Server4] [Server5] [Server6]
  -> Half can die and still OK (mission critical)
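The trade-off between the three schemes can be sanity-checked with a small Python sketch (the required count of 3 servers and the per-server cost unit are illustrative assumptions):

```python
def fleet(scheme: str, n: int) -> dict:
    """Fleet size and failure tolerance when N servers are required for capacity."""
    spares = {"N+1": 1, "N+2": 2, "2N": n}[scheme]
    return {"total": n + spares, "tolerated_failures": spares}

# With N = 3 required servers:
for scheme in ("N+1", "N+2", "2N"):
    print(scheme, fleet(scheme, 3))
```

2N doubles the bill but is the only scheme here that survives losing half the fleet.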

Triple Redundancy

Why Triple?

The fatal weakness of dual redundancy: Split Brain

Network partition with dual redundancy:
  [Server A] --x-- [Server B]
  "B is dead? I'm Active!"  "A is dead? I'm Active!"
  -> Both become Active -> Data inconsistency -> Disaster!

Solved with triple redundancy + quorum:

Triple redundancy (3-node cluster):
  [Node A]--[Node B]--[Node C]

Network partition:
  [Node A]    [Node B]--[Node C]
  1 vote (minority)     2 votes (majority = quorum!)
  -> A voluntarily steps down
  -> B and C continue serving
  -> Split brain prevented!

Quorum formula: Majority = floor(N/2) + 1
  3 nodes: 2 votes needed (tolerates 1 node failure)
  5 nodes: 3 votes needed (tolerates 2 node failures)
  7 nodes: 4 votes needed (tolerates 3 node failures)
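The quorum arithmetic above can be checked mechanically, which also shows why even cluster sizes buy nothing (a generic sketch, not tied to any particular system):

```python
def quorum(n: int) -> int:
    """Votes needed for a majority in an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return n - quorum(n)

for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# Note: 4 nodes tolerate only 1 failure, same as 3 -> odd sizes are preferred
```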

etcd (The Brain of Kubernetes) — Triple Redundancy Required!

# etcd 3-node cluster (Raft consensus algorithm)
# 1 node down: quorum (2/3) maintained -> service continues
# 2 nodes down: quorum lost -> writes rejected (cluster is effectively read-only)

# kubeadm configuration (the v1beta4 API expects extraArgs as a name/value list)
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
etcd:
  local:
    extraArgs:
      - name: initial-cluster
        value: etcd-0=https://10.0.0.1:2380,etcd-1=https://10.0.0.2:2380,etcd-2=https://10.0.0.3:2380
# Raft consensus algorithm (simplified pseudocode)
class RaftNode:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers           # The other nodes in the cluster
        self.state = "follower"      # follower -> candidate -> leader
        self.term = 0
        self.voted_for = None
        self.log = []

    def majority(self):
        """Quorum over the full cluster (peers + self): floor(N/2) + 1."""
        return (len(self.peers) + 1) // 2 + 1

    def request_vote(self):
        """Leader election — requires majority vote"""
        self.state = "candidate"
        self.term += 1
        self.voted_for = self.node_id
        votes = 1  # Self-vote

        for peer in self.peers:
            if peer.grant_vote(self.term, self.log):
                votes += 1

        if votes >= self.majority():  # Quorum!
            self.state = "leader"
            return True
        return False

    def replicate(self, entry):
        """Data replication — commit after majority confirmation"""
        self.log.append(entry)
        acks = 1  # Own copy counts

        for peer in self.peers:
            if peer.append_entry(entry):
                acks += 1

        if acks >= self.majority():  # Quorum!
            self.commit(entry)  # Majority confirmed -> safe to commit

    def commit(self, entry):
        """Apply the entry to the state machine (omitted in this sketch)."""
        pass

Redundancy Strategy by Layer

Full Architecture

[Users]
   |
   v
[DNS] <- Route 53 health checks (multi-region)
   |
   v
[CDN] <- CloudFront (global edge)
   |
   v
[L4 LB] <- NLB x2 (Active-Active, AZ distributed)
   |
   v
[L7 LB] <- ALB x2 (Active-Active)
   |
   v
[Web Servers] <- Deployment replicas: 3+ (Active-Active)
   |
   v
[App Servers] <- Deployment replicas: 3+ (Active-Active)
   |
   +-->  [DB Primary] --sync replication--> [DB Standby]  (Active-Standby)
   |     +-- async replication --> [DB Read Replica x2]
   |
   +-->  [Redis Primary] -- [Redis Replica x2]  (Sentinel)
   |
   +-->  [MQ] <- RabbitMQ 3-node (quorum queues)

Database Redundancy

[Synchronous Replication] (Strong Consistency)
  Primary --COMMIT--> Standby
  "Wait until both sides have written"
  Pros: Zero data loss
  Cons: Slower (adds network RTT)

[Asynchronous Replication] (Eventual Consistency)
  Primary --COMMIT--> (later) Standby
  "Write to Primary only and respond"
  Pros: Fast
  Cons: Recent data may be lost on failure (RPO greater than 0)

[Semi-Synchronous Replication] (Semi-Sync)
  Primary --COMMIT--> Standby (confirm just 1 replica)
  Available in MySQL via the semisync plugin (not the default); popular with financial institutions
-- PostgreSQL Streaming Replication configuration
-- Primary (postgresql.conf)
-- wal_level = replica
-- max_wal_senders = 3
-- synchronous_standby_names = 'standby1'

-- Check Standby status
SELECT client_addr, state, sync_state
FROM pg_stat_replication;
-- client_addr | state     | sync_state
-- 10.0.0.2   | streaming | sync
-- 10.0.0.3   | streaming | async

Failure Recovery Metrics

RPO (Recovery Point Objective):
  "How much data can we afford to lose?"
  Synchronous replication: RPO = 0 (zero data loss)
  Asynchronous replication: RPO = seconds to minutes

RTO (Recovery Time Objective):
  "How quickly must we recover?"
  Active-Active: RTO approx. 0
  Active-Standby: RTO = seconds to minutes
  Backup restore: RTO = hours

MTBF (Mean Time Between Failures):
  Average time between failures (longer is better)

MTTR (Mean Time To Repair):
  Average time to recover (shorter is better)

Availability = MTBF / (MTBF + MTTR)
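Plugging numbers into the formula makes the relationship concrete (the failure rate and repair time below are invented for illustration):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical: one failure every 30 days (720 h), 15 minutes to recover
print(f"{availability(720, 0.25) * 100:.3f}%")
# Halving MTTR improves availability about as much as doubling MTBF here
print(f"{availability(720, 0.125) * 100:.3f}%")
```

This is why MTTR reduction (fast failover, automation) is often a cheaper path to more 9s than making hardware fail less.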

Cost vs. Availability

Availability  | Configuration              | Relative Cost
--------------+----------------------------+--------------
99%           | Single server              | 1x
99.9%         | Dual redundancy (1 region) | 2-3x
99.99%        | Triple + auto recovery     | 5-10x
99.999%       | Multi-region Active        | 10-50x
99.9999%      | Multi-region + DR          | 50-100x

Quiz — Redundancy and High Availability (click to reveal!)

Q1. What is the biggest difference between Active-Standby and Active-Active? ||Active-Standby: Only 1 server handles traffic normally; there is a switchover delay during failover. Active-Active: All servers share traffic normally; if one dies, the others absorb it immediately (near-zero downtime).||

Q2. What is split brain and how do you prevent it? ||A condition where a network partition causes both sides to believe they are Active, leading to data inconsistency. Prevented with triple redundancy + quorum (majority voting) — the side that cannot achieve a majority voluntarily steps down.||

Q3. Why does etcd use 3 nodes? Why not 4? ||3 nodes: tolerates 1 node failure (quorum 2/3). 4 nodes: also tolerates only 1 node failure (quorum 3/4), same as 3 nodes but with higher cost. Odd numbers are more efficient.||

Q4. What is the difference between RPO and RTO? ||RPO: The acceptable amount of data loss (synchronous replication = 0, asynchronous = seconds). RTO: The acceptable time from failure to service recovery. Both should be as low as possible, but cost increases accordingly.||

Q5. What is the annual downtime difference between 99.99% and 99.999% availability? ||99.99% = 52.6 minutes/year, 99.999% = 5.26 minutes/year. About a 10x difference, and costs increase 5-10x or more.||

Q6. What is the difference between N+1 and 2N redundancy? ||N+1: Required count (N) + 1 spare — cost-effective, tolerates 1 failure. 2N: Double the entire fleet — expensive but service survives even if half goes down, used for mission-critical systems.||

Q7. When should synchronous vs. asynchronous replication be used? ||Synchronous: Financial transactions, payments — where zero data loss is required (RPO=0). Asynchronous: Read load distribution, analytics replicas — where slight delay is acceptable (performance priority).||

Q8. Describe the leader election process in Raft consensus. ||When a Follower stops receiving leader heartbeats, it transitions to Candidate, increments the term, requests votes from other nodes, and if it receives a majority, it is promoted to Leader and begins processing client requests.||