etcd Backup & Restore Complete Guide — CKA Exam Essential Hands-on


1. What is etcd?

etcd is a distributed Key-Value store that holds all state data for a Kubernetes cluster. Information about every resource in the cluster — Pods, Services, ConfigMaps, Secrets, RBAC policies, and more — is stored in etcd.

Put simply, everything you see with a kubectl get command ultimately comes from data stored in etcd.
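As an illustration, each API object maps to one key under etcd's /registry prefix. The sketch below builds such a key for a hypothetical Deployment named "web" in the "default" namespace (the kubeadm default key layout); on a live cluster the value stored at that key is binary protobuf, readable with etcdctl get:

```shell
# Hypothetical object: a Deployment "web" in namespace "default".
# kubeadm-based clusters store it under this key layout:
NAMESPACE=default
NAME=web
KEY="/registry/deployments/apps/${NAMESPACE}/${NAME}"
echo "$KEY"
# On a control-plane node you could then read it (binary protobuf):
#   etcdctl get "$KEY"
```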

┌─────────────────────────────────────────────┐
│  Kubernetes Control Plane                   │
│                                             │
│  kube-apiserver ──── etcd (all state stored)│
│  kube-scheduler    kube-controller-manager  │
└─────────────────────────────────────────────┘

Why is etcd Backup Important?

  • If etcd is lost, the entire cluster is gone (Pods, Services, Secrets — everything)
  • It is the only recovery method when a cluster is broken by a botched upgrade or misconfiguration
  • It is a near-guaranteed topic on the CKA exam

2. Preparation: Checking etcd Information

2.1 Check the etcd Pod

On clusters installed with kubeadm, etcd runs as a Static Pod:

# Check etcd Pod
kubectl get pods -n kube-system | grep etcd

# Example output
# etcd-controlplane   1/1   Running   0   45m

2.2 Locate etcd Certificates (Most Important!)

etcd communicates over TLS, so you must know the exact certificate paths:

# Find certificate paths from etcd Pod arguments
kubectl describe pod etcd-controlplane -n kube-system | grep -E "cert|key|cacert|data-dir"

Typical paths:

Item                 Path
CA Certificate       /etc/kubernetes/pki/etcd/ca.crt
Server Certificate   /etc/kubernetes/pki/etcd/server.crt
Server Key           /etc/kubernetes/pki/etcd/server.key
Data Directory       /var/lib/etcd
Endpoint             https://127.0.0.1:2379
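Rather than memorizing these paths, you can pull the flag values straight out of the static-pod manifest. The heredoc below is a self-contained stand-in for the relevant lines of /etc/kubernetes/manifests/etcd.yaml; on a real control-plane node, run the same grep against the manifest itself:

```shell
# Stand-in for /etc/kubernetes/manifests/etcd.yaml (on a node, grep that file)
MANIFEST=$(mktemp)
cat > "$MANIFEST" <<'EOF'
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --data-dir=/var/lib/etcd
EOF
# Extract just the flag=value pairs etcdctl needs
FLAGS=$(grep -oE '(trusted-ca-file|cert-file|key-file|data-dir)=[^ ]+' "$MANIFEST")
echo "$FLAGS"
rm -f "$MANIFEST"
```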

2.3 Verify etcdctl Installation

# Check etcdctl version
ETCDCTL_API=3 etcdctl version

# Install if missing (Ubuntu)
sudo apt-get install -y etcd-client

# Or run inside the etcd Pod
kubectl exec -it etcd-controlplane -n kube-system -- etcdctl version

CKA Exam Tip: You must use the v3 API — API v2 does not support the snapshot command. etcdctl 3.4+ defaults to v3, but exporting ETCDCTL_API=3 explicitly works on every version and costs nothing.


3. Checking etcd Health

Before backing up, verify that the current etcd cluster is healthy:

# Set environment variables (to avoid typing them every time)
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379

# Cluster health check
etcdctl endpoint health
# Output: https://127.0.0.1:2379 is healthy: took = 2.345ms

# List members
etcdctl member list --write-out=table

# Check the number of stored keys
etcdctl get / --prefix --keys-only | wc -l
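To see what actually fills the keyspace, the same key listing can be grouped by resource type. The pipeline below is demoed against three sample keys so it runs anywhere; on a real cluster, feed it the output of `etcdctl get / --prefix --keys-only` instead of `printf`:

```shell
# Sample keys stand in for real `etcdctl get / --prefix --keys-only` output
COUNTS=$(printf '%s\n' \
  /registry/pods/default/web-1 \
  /registry/pods/kube-system/etcd-controlplane \
  /registry/services/specs/default/web \
  | awk -F/ '$2 == "registry" {print $3}' | sort | uniq -c | sort -rn)
echo "$COUNTS"
# Most clusters are dominated by pods, events, and secrets
```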

4. etcd Backup (Snapshot)

4.1 Create a Snapshot

ETCDCTL_API=3 etcdctl snapshot save /opt/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

4.2 Verify the Snapshot

ETCDCTL_API=3 etcdctl snapshot status /opt/etcd-backup.db --write-out=table

Example output:

+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 3e9a843c |    12450 |       1287 |     5.8 MB |
+----------+----------+------------+------------+

Caution: There have been reports that the snapshot verify command can corrupt the DB, so use snapshot status for verification only.

4.3 Automated Backup CronJob

In production, configure periodic backups with a CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: '0 */6 * * *'
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
            - name: backup
              image: bitnami/etcd:3.5
              command:
                - /bin/sh
                - -c
                - |
                  etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key
                  find /backup -name "etcd-*.db" -mtime +7 -delete
              env:
                - name: ETCDCTL_API
                  value: '3'
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ''
          tolerations:
            - effect: NoSchedule
              operator: Exists
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              hostPath:
                path: /opt/etcd-backups
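The CronJob's retention step (`find ... -mtime +7 -delete`) can be sanity-checked locally before trusting it with real snapshots. This sketch backdates one fake snapshot by ten days (GNU `touch -d`) and shows that only it gets deleted:

```shell
BACKUP_DIR=$(mktemp -d)
touch "$BACKUP_DIR/etcd-new.db"
touch -d '10 days ago' "$BACKUP_DIR/etcd-old.db"   # backdate (GNU coreutils touch)
# Same retention command as in the CronJob: drop snapshots older than 7 days
find "$BACKUP_DIR" -name "etcd-*.db" -mtime +7 -delete
REMAINING=$(ls "$BACKUP_DIR")
echo "$REMAINING"   # only etcd-new.db survives
rm -rf "$BACKUP_DIR"
```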

5. etcd Restore

5.1 Restore Procedure (Step by Step)

Step 1: Restore the snapshot to a new data directory

ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-backup.db \
  --data-dir=/var/lib/etcd-from-backup \
  --initial-cluster=controlplane=https://127.0.0.1:2380 \
  --initial-cluster-token=etcd-cluster-1 \
  --initial-advertise-peer-urls=https://127.0.0.1:2380

Key Point: The --data-dir must be a new path, not the existing /var/lib/etcd!

Step 2: Update the etcd Pod's data directory

vi /etc/kubernetes/manifests/etcd.yaml

Section to change:

# Before
volumes:
- hostPath:
    path: /var/lib/etcd
    type: DirectoryOrCreate
  name: etcd-data

# After — change to the new path!
volumes:
- hostPath:
    path: /var/lib/etcd-from-backup
    type: DirectoryOrCreate
  name: etcd-data

Step 3: Wait for etcd Pod to restart

# Since it's a Static Pod, kubelet will restart it automatically
watch "kubectl get pods -n kube-system | grep etcd"

# Or check directly with crictl
crictl ps | grep etcd

Step 4: Verify the restore

kubectl get nodes
kubectl get pods --all-namespaces

ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

6. Troubleshooting

6.1 etcdctl: command not found

# Common in CKA exams!
# Solution 1: Use the full path
/usr/local/bin/etcdctl version

# Solution 2: Run inside the etcd Pod
kubectl exec -it etcd-controlplane -n kube-system -- sh

# Solution 3: SSH and sudo
sudo -i
export PATH=$PATH:/usr/local/bin

6.2 etcd Pod Won't Start After Restore

# Check logs
crictl logs $(crictl ps -a | grep etcd | awk '{print $1}')

# Common cause 1: Permission issues
chown -R etcd:etcd /var/lib/etcd-from-backup

# Common cause 2: data-dir path mismatch
# Verify that etcd.yaml's volumes.hostPath.path matches the actual restore path

# Common cause 3: initial-cluster-token mismatch

6.3 "context deadline exceeded" Error

# Cause: etcd hasn't fully started yet
# Solution: Wait 30-60 seconds and retry
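A wait-and-retry loop saves you from hammering Enter by hand. The `health_check` function below is a stand-in that always succeeds so the sketch runs anywhere; on the node, replace its body with the real `etcdctl endpoint health ...` invocation:

```shell
# Stand-in for: etcdctl endpoint health --endpoints=... --cacert=... --cert=... --key=...
health_check() { true; }

STATUS=down
for attempt in 1 2 3 4 5 6 7 8 9 10 11 12; do   # 12 tries x 5s = up to 60s
  if health_check; then
    STATUS=healthy
    echo "etcd healthy on attempt $attempt"
    break
  fi
  sleep 5
done
echo "$STATUS"
```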

7. CKA Exam Cheat Sheet

# 1. Backup
ETCDCTL_API=3 etcdctl snapshot save /opt/backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 2. Restore
ETCDCTL_API=3 etcdctl snapshot restore /opt/backup.db \
  --data-dir=/var/lib/etcd-from-backup

# 3. Change data-dir in etcd.yaml (one-liner)
sed -i 's|/var/lib/etcd|/var/lib/etcd-from-backup|g' \
  /etc/kubernetes/manifests/etcd.yaml

# 4. Verify status
ETCDCTL_API=3 etcdctl snapshot status /opt/backup.db -w table

Exam Tip: Don't memorize certificate paths — just copy them from kubectl describe pod etcd-controlplane -n kube-system!


8. External etcd Cluster Scenario

# Find the etcd endpoint from kube-apiserver
grep etcd-servers /etc/kubernetes/manifests/kube-apiserver.yaml

# SSH to the etcd node and perform backup/restore
# Certificate paths may differ (/etc/etcd/pki/)
# After restore: systemctl restart etcd
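Before you can SSH anywhere, you need the endpoint list itself. The sketch below extracts it from a sample `--etcd-servers` line (the IPs are made up for illustration); on a real control plane, the line comes from kube-apiserver.yaml as shown above:

```shell
# Sample flag line; real source: /etc/kubernetes/manifests/kube-apiserver.yaml
LINE='- --etcd-servers=https://10.0.0.11:2379,https://10.0.0.12:2379'
ENDPOINTS=$(printf '%s\n' "$LINE" | sed 's/.*--etcd-servers=//' | tr ',' '\n')
echo "$ENDPOINTS"   # one endpoint per line — these are the nodes to SSH into
```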

9. Key Summary

Operation   Command                    Notes
Backup      etcdctl snapshot save      ETCDCTL_API=3 required, 3 certificates
Verify      etcdctl snapshot status    Use status, not verify
Restore     etcdctl snapshot restore   Must specify a new data-dir
Apply       Edit etcd.yaml             Change hostPath path
Confirm     etcdctl endpoint health    Wait 30-60 seconds after restart

CKA Flow: describe -> copy certificates -> save/restore -> edit yaml. Done!
