
etcd Cluster Operations: Disaster Recovery and Performance Tuning


This post covers essential knowledge for reliable etcd cluster operations, including member management, snapshot backup/restore, disaster recovery, and performance tuning.


1. Member Management

1.1 Cluster Size Recommendations

etcd clusters should use an odd number of nodes:

Cluster Size   Majority   Tolerated Failures
1              1          0
3              2          1
5              3          2
7              4          3

3-node and 5-node configurations are most common. More nodes increase fault tolerance but also increase write latency, since every write must be replicated to a majority of members before it commits.
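The table above follows directly from the quorum rule: a cluster of N members needs floor(N/2) + 1 members for a majority, and tolerates the rest failing. A quick sketch that reproduces the table:

```shell
# Majority (quorum) for N members is floor(N/2) + 1;
# the cluster tolerates N - majority failures.
for n in 1 3 5 7; do
  echo "size=$n majority=$(( n / 2 + 1 )) tolerated=$(( n - (n / 2 + 1) ))"
done
```

This also shows why even sizes add no resilience: a 4-node cluster needs 3 for quorum and still tolerates only 1 failure, the same as a 3-node cluster.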

1.2 Adding Members

# 1. Register new member with existing cluster
etcdctl member add new-node \
  --peer-urls=https://10.0.0.4:2380

# 2. Start etcd on the new node (initial-cluster-state=existing)
etcd --name new-node \
  --initial-cluster 'node1=https://10.0.0.1:2380,...,new-node=https://10.0.0.4:2380' \
  --initial-cluster-state existing

1.3 Learner Members

In etcd 3.4+, adding the new node as a non-voting learner first is recommended, so it can catch up on the log without affecting quorum:

# Add as Learner
etcdctl member add new-node \
  --peer-urls=https://10.0.0.4:2380 \
  --learner

# Promote to voting member after catching up
etcdctl member promote MEMBER_ID
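Before promoting, it is worth confirming that the learner has actually caught up. One way to check (endpoint addresses are illustrative; a TLS cluster also needs the usual --cacert/--cert/--key flags):

```shell
# Confirm the new node shows up with IS LEARNER=true (etcd 3.4+)
etcdctl member list -w table

# Compare RAFT INDEX across endpoints; the learner has caught up
# when its index is close to the leader's
etcdctl endpoint status -w table \
  --endpoints=https://10.0.0.1:2379,https://10.0.0.4:2379
```

Promoting a learner that is still far behind is rejected by the cluster, so a failed promote is itself a signal to wait and retry.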

1.4 Removing Members

etcdctl member list
etcdctl member remove MEMBER_ID

2. Snapshots and Backup

2.1 Saving Snapshots

ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/server.crt \
  --key=/etc/etcd/server.key
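A saved snapshot can be sanity-checked immediately, rather than discovering problems during an emergency restore. `etcdutl snapshot status` prints the snapshot's hash, revision, key count, and size:

```shell
# Inspect snapshot metadata (hash, revision, total keys, size)
etcdutl snapshot status /backup/etcd-snapshot.db -w table
```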

2.2 Data Directory Structure

/var/lib/etcd/
  member/
    snap/       # Snapshot files
      db        # BoltDB data file
    wal/        # Write-Ahead Log files

2.3 Snapshot Restore

etcdutl snapshot restore /backup/etcd-snapshot.db \
  --name node1 \
  --data-dir /var/lib/etcd-restored \
  --initial-cluster 'node1=https://10.0.0.1:2380,...' \
  --initial-advertise-peer-urls https://10.0.0.1:2380

2.4 Backup Strategy

  • Regular snapshots (e.g., every 30 minutes)
  • Copy snapshots to remote storage (S3, GCS)
  • Backup verification: periodically test restores
  • Separate WAL and data on different disks
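The points above can be combined into a periodic backup job. The sketch below is a minimal example, not a production script: the bucket name, retention count, and certificate paths are assumptions, and the upload step assumes the AWS CLI is installed.

```shell
#!/bin/sh
# Hypothetical 30-minute backup job: snapshot, verify, upload, prune.
set -eu

TS=$(date +%Y%m%d-%H%M%S)
SNAP="/backup/etcd-${TS}.db"

ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/server.crt \
  --key=/etc/etcd/server.key

# Fail the job if the snapshot is unreadable
etcdutl snapshot status "$SNAP" > /dev/null

# Copy off-host (bucket name is an example)
aws s3 cp "$SNAP" "s3://example-etcd-backups/$(hostname)/"

# Keep only the 48 most recent local snapshots (~24h at 30-minute intervals)
ls -1t /backup/etcd-*.db | tail -n +49 | xargs -r rm -f
```

Verifying the snapshot before upload catches truncated or corrupt files; the periodic test restore from the checklist above remains the only real proof that backups work.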

3. Disaster Recovery

3.1 Single Node Failure

In a 3-node cluster with 1 node failure, quorum (2/3) is maintained and the cluster operates normally. Recover or replace the failed node.

3.2 Quorum Loss Recovery

When a majority of nodes fail:

# Save snapshot from surviving node
etcdctl snapshot save /backup/emergency.db

# Restore as single-node cluster
etcdutl snapshot restore /backup/emergency.db \
  --name node1 \
  --data-dir /var/lib/etcd-new \
  --initial-cluster 'node1=https://10.0.0.1:2380'

# Add remaining members one by one

3.3 Data Corruption

Signs of data corruption: startup failures, consistency check failures, CORRUPT alarms.

Recovery: remove the node, delete data directory, re-add as new member (automatic data replication), or restore from snapshot.
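The remove-and-re-add path reuses the member commands from section 1. Sketched end to end (member ID, node name, and URLs are illustrative, and the example assumes etcd runs under systemd):

```shell
# On a healthy member: drop the corrupted node from the cluster
etcdctl member remove CORRUPT_MEMBER_ID

# On the corrupted node: stop etcd and clear its local state
systemctl stop etcd
rm -rf /var/lib/etcd/member

# Re-add it (as a learner first, per section 1.3); data is
# replicated automatically from the leader once it rejoins
etcdctl member add node2 --peer-urls=https://10.0.0.2:2380 --learner
systemctl start etcd
```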

3.4 Network Partition

During network partitions, only the partition containing the majority can perform reads/writes. After recovery, minority partition nodes automatically sync to the latest state.


4. Performance Tuning

4.1 Disk Performance

Disk is the most critical factor for etcd performance:

  • SSD required: HDDs are not suitable
  • Dedicated WAL disk: separate WAL and data on different disks
  • fio benchmark: ensure 99th-percentile fsync latency is below 10ms

fio --rw=write --ioengine=sync --fdatasync=1 \
  --directory=/var/lib/etcd --size=22m \
  --bs=2300 --name=etcd-fsync-test

4.2 Network Configuration

  • Lower RTT between members is better
  • Deploy within the same data center
  • Adjust heartbeat-interval and election-timeout based on network latency
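As a sketch of the usual tuning rule of thumb: set the heartbeat interval to roughly the maximum RTT between members, and the election timeout to about 10x the heartbeat. The values below are etcd's defaults, which assume low single-digit-millisecond RTT within one data center:

```shell
# Example flags for a same-datacenter cluster (values illustrative;
# scale both up together for higher-latency links)
etcd --heartbeat-interval=100 \
     --election-timeout=1000
```

Setting these too low causes spurious leader elections on transient latency spikes; setting them too high delays failure detection.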

4.3 Snapshot Count

--snapshot-count sets how many applied Raft entries are retained before a new snapshot is triggered (default: 100,000). Higher values mean fewer snapshots but more memory held for the in-flight Raft log; lower values reduce memory pressure at the cost of more frequent snapshot I/O.

4.4 Resource Recommendations

  • CPU: 2-4 dedicated cores
  • Memory: 8GB minimum
  • Disk: SSD, 50GB minimum
  • Network: 1Gbps minimum

5. Monitoring and Alerting

5.1 Key Metrics

Metric                                      Description              Threshold
etcd_server_has_leader                      Leader existence         0 is critical
etcd_server_leader_changes_seen_total       Leader changes           Spike needs attention
etcd_disk_wal_fsync_duration_seconds        WAL fsync latency        p99 > 10ms warning
etcd_disk_backend_commit_duration_seconds   Backend commit latency   p99 > 25ms warning
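These metrics are exposed in Prometheus format on etcd's /metrics endpoint. A quick spot check without a monitoring stack (the endpoint is illustrative; a TLS-only cluster needs client certificates, or a plain-HTTP --listen-metrics-urls address):

```shell
# Pull the leader-related metrics from the table above directly from etcd
curl -s http://127.0.0.1:2379/metrics | \
  grep -E '^etcd_server_has_leader|^etcd_server_leader_changes_seen_total'
```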

5.2 Alert Rule Examples

- alert: EtcdNoLeader
  expr: etcd_server_has_leader == 0
  for: 1m
  labels:
    severity: critical

- alert: EtcdHighFsyncDuration
  expr: histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le)) > 0.01
  for: 5m
  labels:
    severity: warning

6. Summary

The keys to etcd cluster operations are regular backups, proper monitoring, and disk performance optimization. Understanding safe member management with Learner nodes and quorum loss recovery procedures is essential. The next post analyzes etcd's Watch and Lease mechanisms.