Author: Youngju Kim (@fjvbn20031)
etcd Cluster Operations: Disaster Recovery and Performance Tuning
This post covers essential knowledge for reliable etcd cluster operations, including member management, snapshot backup/restore, disaster recovery, and performance tuning.
1. Member Management
1.1 Cluster Size Recommendations
etcd clusters should use an odd number of nodes:
| Cluster Size | Majority | Tolerated Failures |
|---|---|---|
| 1 | 1 | 0 |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
3-node or 5-node configurations are most common. More nodes increase fault tolerance but also increase write latency.
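The table follows directly from Raft's majority rule: a cluster of n voting members needs floor(n/2) + 1 members for quorum and tolerates n minus that many failures. A minimal Python sketch of the arithmetic (function names are mine):

```python
def quorum(n: int) -> int:
    """Majority needed in a Raft cluster of n voting members."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Members that can fail while the cluster still reaches quorum."""
    return n - quorum(n)

for size in (1, 3, 4, 5, 7):
    print(size, quorum(size), tolerated_failures(size))
```

Note that an even-sized cluster tolerates no more failures than the next smaller odd size (4 nodes tolerate 1 failure, same as 3), which is one reason odd sizes are recommended.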
1.2 Adding Members
```bash
# 1. Register the new member with the existing cluster
etcdctl member add new-node \
  --peer-urls=https://10.0.0.4:2380

# 2. Start etcd on the new node (initial-cluster-state=existing)
etcd --name new-node \
  --initial-cluster 'node1=https://10.0.0.1:2380,...,new-node=https://10.0.0.4:2380' \
  --initial-cluster-state existing
```
1.3 Learner Members
In etcd 3.4+, adding as a Learner first is recommended:
```bash
# Add as a learner
etcdctl member add new-node \
  --peer-urls=https://10.0.0.4:2380 \
  --learner

# Promote to a voting member after it catches up
etcdctl member promote MEMBER_ID
```
1.4 Removing Members
```bash
etcdctl member list
etcdctl member remove MEMBER_ID
```
2. Snapshots and Backup
2.1 Saving Snapshots
```bash
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/server.crt \
  --key=/etc/etcd/server.key
```
2.2 Data Directory Structure
```text
/var/lib/etcd/
└── member/
    ├── snap/      # snapshot files
    │   └── db     # BoltDB data file
    └── wal/       # write-ahead log files
```
2.3 Snapshot Restore
```bash
etcdutl snapshot restore /backup/etcd-snapshot.db \
  --name node1 \
  --data-dir /var/lib/etcd-restored \
  --initial-cluster 'node1=https://10.0.0.1:2380,...' \
  --initial-advertise-peer-urls https://10.0.0.1:2380
```
2.4 Backup Strategy
- Regular snapshots (e.g., every 30 minutes)
- Copy snapshots to remote storage (S3, GCS)
- Backup verification: periodically test restores
- Separate WAL and data on different disks
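The first three points can be automated with a small script. The sketch below is illustrative only: the backup directory, keep count, and the snapshot_path/prune helper names are my assumptions, and the actual etcdctl call (which also needs the TLS flags from 2.1) is left commented out.

```python
import subprocess
import time
from pathlib import Path

def snapshot_path(backup_dir: str, now: float) -> Path:
    """Timestamped file name (hypothetical scheme), e.g. etcd-20240101-120000.db."""
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    return Path(backup_dir) / f"etcd-{stamp}.db"

def prune(snapshots: list[str], keep: int) -> list[str]:
    """Snapshots to delete, keeping the `keep` newest.
    With the timestamped naming above, lexicographic order is chronological."""
    return sorted(snapshots)[:-keep] if len(snapshots) > keep else []

def take_snapshot(endpoint: str, path: Path) -> None:
    """Shell out to etcdctl; requires a reachable cluster plus TLS flags."""
    subprocess.run(
        ["etcdctl", "snapshot", "save", str(path), f"--endpoints={endpoint}"],
        check=True,
    )

if __name__ == "__main__":
    target = snapshot_path("/backup", time.time())
    # take_snapshot("https://127.0.0.1:2379", target)  # uncomment on a real cluster
    print(f"would save snapshot to {target}")
```

Run it from cron (e.g. every 30 minutes) and upload the resulting file to S3/GCS in the same job.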
3. Disaster Recovery
3.1 Single Node Failure
In a 3-node cluster with 1 node failure, quorum (2/3) is maintained and the cluster operates normally. Recover or replace the failed node.
3.2 Quorum Loss Recovery
When majority nodes fail:
```bash
# Save a snapshot from a surviving node
etcdctl snapshot save /backup/emergency.db

# Restore as a single-node cluster
etcdutl snapshot restore /backup/emergency.db \
  --name node1 \
  --data-dir /var/lib/etcd-new \
  --initial-cluster 'node1=https://10.0.0.1:2380'

# Then add the remaining members back one by one
```
3.3 Data Corruption
Signs of data corruption: startup failures, consistency check failures, CORRUPT alarms.
Recovery: remove the node, delete data directory, re-add as new member (automatic data replication), or restore from snapshot.
3.4 Network Partition
During network partitions, only the partition containing the majority can perform reads/writes. After recovery, minority partition nodes automatically sync to the latest state.
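Whether a partition can keep serving writes is the same majority arithmetic as in section 1.1; a small sketch (the function name is mine):

```python
def partition_has_quorum(cluster_size: int, partition_size: int) -> bool:
    """True if this side of the partition holds a majority and can elect a leader."""
    return partition_size >= cluster_size // 2 + 1

# 5-node cluster split 3/2: only the 3-node side keeps accepting writes.
print(partition_has_quorum(5, 3), partition_has_quorum(5, 2))
# 4-node cluster split 2/2: neither side has a majority, so the whole cluster stalls.
print(partition_has_quorum(4, 2))
```

The even-split case, where no side can elect a leader, is another argument for odd cluster sizes.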
4. Performance Tuning
4.1 Disk Performance
Disk is the most critical factor for etcd performance:
- SSD required: HDDs are not suitable
- Dedicated WAL disk: Separate WAL and data on different disks
- fio benchmark: Ensure 99th percentile fsync latency is below 10ms
```bash
fio --rw=write --ioengine=sync --fdatasync=1 \
  --directory=/var/lib/etcd --size=22m \
  --bs=2300 --name=etcd-fsync-test
```
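fio reports fsync latency percentiles directly; the pass criterion amounts to checking that the 99th percentile is under 10 ms. A sketch of that check using a simple nearest-rank percentile (fio's own percentile estimation may differ slightly, and the function names are mine):

```python
import math

def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of latency samples in milliseconds."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-indexed nearest rank
    return ordered[rank - 1]

def disk_ok_for_etcd(samples_ms: list[float]) -> bool:
    """Apply the guideline above: p99 fsync latency below 10 ms."""
    return p99(samples_ms) < 10.0

print(disk_ok_for_etcd([2.0] * 99 + [50.0]))  # one outlier beyond the p99 cut
print(disk_ok_for_etcd([12.0] * 100))         # consistently slow disk
```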
4.2 Network Configuration
- Lower RTT between members is better
- Deploy within the same data center
- Adjust heartbeat-interval and election-timeout based on network latency
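etcd's tuning guide suggests setting heartbeat-interval to roughly the maximum RTT between members (the default is 100 ms) and election-timeout to at least 10x the heartbeat interval (default 1000 ms). A sketch of that rule of thumb; the clamping to the defaults is my own simplification:

```python
def suggest_timeouts(max_rtt_ms: float) -> tuple[int, int]:
    """Rule-of-thumb (heartbeat-interval, election-timeout) in ms.
    Never goes below etcd's defaults of 100 ms / 1000 ms."""
    heartbeat = max(100, round(max_rtt_ms))
    election = heartbeat * 10
    return heartbeat, election

print(suggest_timeouts(1))    # same-DC latency: keep the defaults
print(suggest_timeouts(250))  # cross-region latency: stretch both timeouts
```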
4.3 Snapshot Count
--snapshot-count sets how many applied Raft entries accumulate before etcd takes a new internal snapshot and compacts the Raft log (default: 100,000). Higher values retain more entries in memory between snapshots; lower values snapshot more often at the cost of extra disk I/O.
4.4 Resource Recommendations
- CPU: 2-4 dedicated cores
- Memory: 8 GB minimum
- Disk: SSD, 50 GB minimum
- Network: 1 Gbps minimum
5. Monitoring and Alerting
5.1 Key Metrics
| Metric | Description | Threshold |
|---|---|---|
| etcd_server_has_leader | Leader existence | 0 is critical |
| etcd_server_leader_changes_seen_total | Leader changes | Spike needs attention |
| etcd_disk_wal_fsync_duration_seconds | WAL fsync latency | p99 > 10ms warning |
| etcd_disk_backend_commit_duration_seconds | Backend commit latency | p99 > 25ms warning |
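The p99 thresholds above come from Prometheus histograms, where histogram_quantile interpolates inside cumulative buckets. A simplified sketch of that interpolation (bucket data is made up; the real PromQL function additionally works on rates across series):

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets,
    sorted by bound and ending with +Inf; assumes a nonzero total count."""
    total = buckets[-1][1]
    target = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= target:
            if math.isinf(le):
                return prev_le  # quantile falls in the +Inf bucket
            # linear interpolation within the bucket, as PromQL does
            return prev_le + (le - prev_le) * (target - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le

# Example: 99 of 100 fsyncs finish under 10 ms, so p99 lands at the 0.01 s bound.
fsync_buckets = [(0.001, 50.0), (0.01, 99.0), (0.1, 100.0), (math.inf, 100.0)]
print(histogram_quantile(0.99, fsync_buckets))
```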
5.2 Alert Rule Examples
```yaml
- alert: EtcdNoLeader
  expr: etcd_server_has_leader == 0
  for: 1m
  labels:
    severity: critical

- alert: EtcdHighFsyncDuration
  expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
  for: 5m
  labels:
    severity: warning
```

Note that histogram_quantile must be fed a rate of the cumulative bucket counters, not the raw counters, or the result is meaningless.
6. Summary
The keys to etcd cluster operations are regular backups, proper monitoring, and disk performance optimization. Understanding safe member management with Learner nodes and quorum loss recovery procedures is essential. The next post analyzes etcd's Watch and Lease mechanisms.