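SSL Certificate Operations Playbook: Zero-Downtime Renewal and Expiry Prevention

Introduction
1. Certificate Lifecycle Management
2. Zero-Downtime Renewal Strategies
3. Let's Encrypt Auto-Renewal Operations
4. Certificate Expiry Monitoring
5. Incident Response Playbook
- 5.1 Response Flow for Certificate Expiry Incidents
- 5.2 Detailed Step-by-Step Response
6. Multi-Environment Certificate Management
7. Operational Checklists
8. Conclusion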

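Introduction

The basics of SSL/TLS certificates (concepts, issuance methods, and Nginx configuration) are covered in the SSL/TLS Certificate Complete Guide. This post builds on that and focuses on operations. Issuing a certificate once is easy. The hard part is running dozens of domains without a single expiry incident.

Certificate expiry incidents happen even at major services. In 2020, Microsoft Teams was down for several hours because of an expired certificate, and Spotify and LinkedIn have hit the same problem. What these incidents had in common was not a lack of automation but a lack of operational process.

This playbook answers the following questions:

How do you renew certificates without interrupting service?
What do you need to build to get automatic alerts 30 days before expiry?
If a certificate expires at 3 AM, in what order should you respond?
How do you manage certificates separately for dev/staging/prod?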

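1. Certificate Lifecycle Management

Certificate operations is more than "issue and renew." It requires systematic lifecycle management.

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Issuance │ → │  Deploy  │ → │ Monitor  │ → │ Renewal  │ → │  Revoke  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
      ↑                                              │
      └──────────────────────────────────────────────┘
                    (auto-renewal cycle)

1.1 Issuance

These are the decisions to make at the issuance stage.

Decision	Options	Recommended
CA selection	Let's Encrypt / DigiCert / ACM	Depends on environment (see below)
Key algorithm	RSA 2048 / RSA 4096 / ECDSA P-256	ECDSA P-256 (performance + security)
Certificate scope	Single domain / Wildcard / SAN	Wildcard + apex SAN
Validation method	HTTP-01 / DNS-01	DNS-01 (required for wildcards)

ECDSA is recommended for three reasons: its keys are much smaller than RSA 2048 (256-bit vs. 2048-bit), the server-side TLS handshake is roughly 2-5x faster (the exact gain depends on hardware and TLS library), and CPU load is lower at comparable or better security strength (P-256 is roughly 128-bit, RSA 2048 roughly 112-bit).

# Issue a Let's Encrypt certificate with an ECDSA key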
sudo certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  --key-type ecdsa \
  --elliptic-curve secp256r1 \
  -d "*.example.com" \
  -d "example.com"

1.2 Deployment

This is the step where an issued certificate is applied to the running service. It is simple on a single server, but a deployment strategy is needed once the certificate spans multiple servers.

#!/bin/bash
# /usr/local/bin/deploy-cert.sh
# Certificate deployment script (multi-server)

CERT_DIR="/etc/letsencrypt/live/example.com"
SERVERS=("web01" "web02" "web03")
REMOTE_CERT_DIR="/etc/nginx/ssl"
DEPLOY_LOG="/var/log/cert-deploy.log"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" >> "$DEPLOY_LOG"
}

deploy_cert() {
    local server=$1
    log "Deploying to $server"

    # Copy both files under temporary names, swap them into place
    # (mv is atomic on the same filesystem), then reload only if the
    # configuration test passes. Any failing step aborts the chain.
    if scp -q "$CERT_DIR/fullchain.pem" "$server:$REMOTE_CERT_DIR/fullchain.pem.new" &&
       scp -q "$CERT_DIR/privkey.pem" "$server:$REMOTE_CERT_DIR/privkey.pem.new" &&
       ssh "$server" "
           mv $REMOTE_CERT_DIR/fullchain.pem.new $REMOTE_CERT_DIR/fullchain.pem &&
           mv $REMOTE_CERT_DIR/privkey.pem.new $REMOTE_CERT_DIR/privkey.pem &&
           nginx -t && systemctl reload nginx
       "; then
        log "$server: OK"
    else
        log "$server: FAILED"
        return 1
    fi
}

# Deploy to every server and report failure if any deployment failed
failed=0
for server in "${SERVERS[@]}"; do
    deploy_cert "$server" || failed=1
done
exit "$failed"

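1.3 Monitoring

Section 4 covers this in detail. The core principles are:

Warning alerts starting 30 days before expiry
Critical alerts starting 7 days before expiry
Escalation (PagerDuty/phone call) 1 day before expiry
Always log renewal success/failure events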
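
The tiers above can be sketched as a small shell function. This is a minimal sketch rather than part of the original tooling: the function name classify_expiry and the tier labels are assumptions, and the thresholds simply mirror the list above.

```bash
#!/usr/bin/env bash
# classify_expiry DAYS_LEFT
# Maps the number of whole days until expiry to an alert tier.
# Tier labels (ok/warning/critical/page/expired) are illustrative names.
classify_expiry() {
    local days_left=$1
    if   (( days_left < 0 ));   then echo "expired"
    elif (( days_left <= 1 ));  then echo "page"      # escalate: PagerDuty / phone
    elif (( days_left <= 7 ));  then echo "critical"
    elif (( days_left <= 30 )); then echo "warning"
    else                             echo "ok"
    fi
}
```

For example, classify_expiry 12 prints warning, and a monitoring job could route on that label instead of repeating the thresholds in several places.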

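1.4 Renewal

Section 2 covers zero-downtime renewal strategies in detail.

1.5 Revocation

A certificate needs to be revoked in situations such as:

Suspected private key compromise
Loss of domain ownership
Changes in organization information (relevant to OV/EV certificates; Let's Encrypt issues DV only)

# Revoke a Let's Encrypt certificate
sudo certbot revoke --cert-path /etc/letsencrypt/live/example.com/cert.pem \
  --reason keycompromise

# Delete the certificate files after revocation
sudo certbot delete --cert-name example.com

# Issue a new certificate immediately
sudo certbot certonly --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "*.example.com" -d "example.com"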

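2. Zero-Downtime Renewal Strategies

Service disruption during certificate renewal has three main causes:

Restarting (rather than reloading) the web server during renewal
A time gap between deploying the new certificate and the load balancer picking it up
Clients holding TLS session caches that reference the old certificate

2.1 Nginx Reload (Single Server)

This is the most basic zero-downtime renewal. On reload, Nginx lets existing worker processes finish their in-flight requests before exiting, while new worker processes start with the new configuration (and the new certificate).

# restart vs reload
# restart: stops the process, then starts it again → requests may be lost
# reload:  spawns new workers → gracefully shuts down old workers → no downtime

# Automate the reload with a certbot deploy hook
sudo certbot renew --deploy-hook "systemctl reload nginx"

Caution: do not use systemctl restart nginx to pick up a renewed certificate. Existing connections are dropped immediately.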

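2.2 Rolling Renewal (Multiple Servers)

When several servers sit behind a load balancer, renew them one at a time.

#!/bin/bash
# /usr/local/bin/rolling-cert-renewal.sh

SERVERS=("web01" "web02" "web03")
LB_API="http://lb-admin.internal:8080/api"
HEALTH_CHECK_URL="https://example.com/healthz"
WAIT_SECONDS=30

for server in "${SERVERS[@]}"; do
    echo "=== Processing $server ==="

    # 1. Remove the server from the load balancer
    curl -s -X POST "$LB_API/drain" -d "server=$server"
    echo "Draining $server from load balancer..."
    sleep $WAIT_SECONDS  # wait for existing connections to finish

    # 2. Deploy and apply the certificate; stop the rollout if any step fails
    if ! { scp /etc/letsencrypt/live/example.com/fullchain.pem "$server:/etc/nginx/ssl/" &&
           scp /etc/letsencrypt/live/example.com/privkey.pem "$server:/etc/nginx/ssl/" &&
           ssh "$server" "nginx -t && systemctl reload nginx"; }; then
        echo "ERROR: deploy to $server failed; leaving it drained" >&2
        exit 1
    fi

    # 3. Health-check this server directly. --resolve pins example.com to the
    #    server's IP so the SNI and Host header match the certificate.
    server_ip=$(dig +short "$server" | head -n1)
    healthy=0
    for i in $(seq 1 10); do
        status=$(curl -s -o /dev/null -w "%{http_code}" \
            --resolve "example.com:443:$server_ip" "$HEALTH_CHECK_URL")
        if [ "$status" = "200" ]; then
            echo "$server health check passed"
            healthy=1
            break
        fi
        sleep 2
    done
    if [ "$healthy" -ne 1 ]; then
        echo "ERROR: $server failed its health check; leaving it drained" >&2
        exit 1
    fi

    # 4. Return the server to the load balancer
    curl -s -X POST "$LB_API/enable" -d "server=$server"
    echo "$server re-enabled in load balancer"
    sleep 5
done

echo "=== Rolling renewal complete ==="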

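2.3 Blue-Green Certificate Swap

This approach keeps two sets of certificates and switches between them at cutover time. It is used mainly in large infrastructures.

# /etc/nginx/conf.d/ssl-blue-green.conf
# Blue-green certificate switching with a symbolic link

# Currently active certificate (symbolic link)
# /etc/nginx/ssl/active/ -> /etc/nginx/ssl/blue/ or /etc/nginx/ssl/green/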

server {
    listen 443 ssl http2;
    server_name example.com;

    ssl_certificate     /etc/nginx/ssl/active/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/active/privkey.pem;

    # ... other settings
}

#!/bin/bash
# /usr/local/bin/blue-green-cert-switch.sh

ACTIVE_LINK="/etc/nginx/ssl/active"
BLUE_DIR="/etc/nginx/ssl/blue"
GREEN_DIR="/etc/nginx/ssl/green"

# Determine the currently active slot
current=$(readlink "$ACTIVE_LINK")
if [ "$current" = "$BLUE_DIR" ]; then
    target="$GREEN_DIR"
    target_name="green"
else
    target="$BLUE_DIR"
    target_name="blue"
fi

echo "Current: $current"
echo "Deploying new cert to: $target ($target_name)"

# Deploy the new certificate to the inactive slot
cp /etc/letsencrypt/live/example.com/fullchain.pem "$target/fullchain.pem"
cp /etc/letsencrypt/live/example.com/privkey.pem "$target/privkey.pem"

# Validate the certificate (it must remain valid for at least 24 hours)
openssl x509 -in "$target/fullchain.pem" -noout -checkend 86400
if [ $? -ne 0 ]; then
    echo "ERROR: New certificate expires within 24 hours. Aborting."
    exit 1
fi

# Verify that certificate and key match by comparing public keys
# (a modulus comparison only works for RSA; this also covers ECDSA keys)
CERT_PUB=$(openssl x509 -noout -pubkey -in "$target/fullchain.pem" | openssl sha256)
KEY_PUB=$(openssl pkey -pubout -in "$target/privkey.pem" 2>/dev/null | openssl sha256)
if [ "$CERT_PUB" != "$KEY_PUB" ]; then
    echo "ERROR: Certificate and key do not match. Aborting."
    exit 1
fi

# Switch the symbolic link atomically
ln -sfn "$target" "${ACTIVE_LINK}.new"
mv -T "${ACTIVE_LINK}.new" "$ACTIVE_LINK"

# Nginx reload
nginx -t && systemctl reload nginx
echo "Switched to $target_name slot. Reload complete."
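
The slot-selection and link-switch steps can also be isolated into two small functions, which makes them easy to try in a scratch directory and to reuse for a rollback. This is a sketch under assumptions: the function names switch_slot and inactive_slot are hypothetical, and mv -T requires GNU coreutils.

```bash
#!/usr/bin/env bash
# switch_slot LINK TARGET
# Repoints the symlink LINK at directory TARGET. The new link is created
# under a temporary name and renamed over the old one, so readers never
# observe a missing link. Requires GNU mv for -T.
switch_slot() {
    local link=$1 target=$2
    [ -d "$target" ] || { echo "no such slot: $target" >&2; return 1; }
    ln -sfn "$target" "${link}.new" && mv -T "${link}.new" "$link"
}

# inactive_slot LINK BLUE GREEN
# Prints whichever of BLUE/GREEN the link does not currently point to.
inactive_slot() {
    local link=$1 blue=$2 green=$3
    if [ "$(readlink "$link")" = "$blue" ]; then
        echo "$green"
    else
        echo "$blue"
    fi
}
```

A rollback is then the same call with the previous slot, for example: switch_slot /etc/nginx/ssl/active /etc/nginx/ssl/blue followed by a reload once nginx -t passes.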

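2.4 Dual Certificates

Nginx 1.11.0 and later can load an RSA certificate and an ECDSA certificate side by side. The main benefit is compatibility: modern clients negotiate ECDSA while legacy clients fall back to RSA. As a side effect, if the two certificates are renewed on independent schedules, a failed renewal of one still leaves clients that support the other type able to connect. This is partial protection, not full redundancy.

server {
    listen 443 ssl http2;
    server_name example.com;

    # RSA certificate
    ssl_certificate     /etc/nginx/ssl/rsa/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/rsa/privkey.pem;

    # ECDSA certificate
    ssl_certificate     /etc/nginx/ssl/ecdsa/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/ecdsa/privkey.pem;

    # Nginx selects automatically based on what the client supports:
    # ECDSA first, RSA as the fallback for clients without ECDSA support
}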

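3. Let's Encrypt Auto-Renewal Operations

3.1 systemd Timer-Based Renewal (Recommended)

A systemd timer is recommended over cron for these reasons:

RandomizedDelaySec spreads load on the CA servers
Persistent=true catches up on runs that were missed while the machine was off
systemctl list-timers shows the next scheduled run
journalctl keeps the logs in one place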

# /etc/systemd/system/certbot-renewal.service
[Unit]
Description=Certbot SSL Certificate Renewal
After=network-online.target
Wants=network-online.target

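[Service]
Type=oneshot
# --quiet is omitted on purpose: cert-renewal-notify.sh greps certbot's normal output in the journal
ExecStart=/usr/bin/certbot renew \
  --pre-hook "/usr/local/bin/cert-pre-hook.sh" \
  --deploy-hook "/usr/local/bin/cert-deploy-hook.sh"
# ExecStopPost runs whether ExecStart succeeded or failed
# (ExecStartPost runs only after a successful ExecStart, so failures would never be reported)
ExecStopPost=/usr/local/bin/cert-renewal-notify.sh
TimeoutStartSec=300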

# /etc/systemd/system/certbot-renewal.timer
[Unit]
Description=Run certbot renewal twice daily

[Timer]
OnCalendar=*-*-* 02,14:00:00
RandomizedDelaySec=3600
Persistent=true
AccuracySec=1s

[Install]
WantedBy=timers.target

# Enable the timer and check its status
sudo systemctl daemon-reload
sudo systemctl enable --now certbot-renewal.timer
sudo systemctl list-timers certbot-renewal.timer

# Manual test (dry run)
sudo certbot renew --dry-run

# Manual trigger
sudo systemctl start certbot-renewal.service

3.2 Using pre-hook / deploy-hook

Hooks automate the work needed before and after renewal. certbot runs the pre-hook only when at least one certificate is due and a renewal is attempted, and it runs the deploy-hook once for each certificate that is renewed successfully.

#!/bin/bash
# /usr/local/bin/cert-pre-hook.sh
# Hook that runs before renewal

LOG="/var/log/cert-hooks.log"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] PRE-HOOK: Starting renewal process" >> "$LOG"

# Record the current expiry date of every certificate
for domain_dir in /etc/letsencrypt/live/*/; do
    domain=$(basename "$domain_dir")
    expiry=$(openssl x509 -in "${domain_dir}fullchain.pem" -noout -enddate 2>/dev/null | cut -d= -f2)
    echo "[PRE] $domain expires: $expiry" >> "$LOG"
done

#!/bin/bash
# /usr/local/bin/cert-deploy-hook.sh
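# Hook that runs after each successful renewal
# The $RENEWED_DOMAINS and $RENEWED_LINEAGE environment variables are available

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
LOG="/var/log/cert-hooks.log"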
echo "[$(date '+%Y-%m-%d %H:%M:%S')] DEPLOY-HOOK: Certificate renewed" >> "$LOG"
echo "  Domains: $RENEWED_DOMAINS" >> "$LOG"
echo "  Lineage: $RENEWED_LINEAGE" >> "$LOG"

# 1. Validate the Nginx configuration, then reload
if nginx -t 2>/dev/null; then
    systemctl reload nginx
    echo "  Nginx reloaded successfully" >> "$LOG"
else
    echo "  ERROR: Nginx config test failed!" >> "$LOG"
    # Send an urgent alert when the configuration is broken
    curl -s -X POST "$SLACK_WEBHOOK" \
      -H 'Content-type: application/json' \
      -d '{"text":"CRITICAL: Nginx config test failed after cert renewal!"}'
    exit 1
fi

# 2. If HAProxy is running, build the combined PEM file and reload
if systemctl is-active haproxy > /dev/null 2>&1; then
    cat "$RENEWED_LINEAGE/fullchain.pem" "$RENEWED_LINEAGE/privkey.pem" \
      > /etc/haproxy/certs/$(basename "$RENEWED_LINEAGE").pem
    systemctl reload haproxy
    echo "  HAProxy reloaded" >> "$LOG"
fi

# 3. Sync the certificate to other servers (if needed)
# /usr/local/bin/sync-certs-to-peers.sh "$RENEWED_LINEAGE"

#!/bin/bash
# /usr/local/bin/cert-renewal-notify.sh
# Renewal result notification

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
LOG="/var/log/cert-hooks.log"

# Read the output of the most recent renewal run from the journal
last_renewal=$(journalctl -u certbot-renewal.service --since "5 minutes ago" --no-pager 2>/dev/null)

if echo "$last_renewal" | grep -q "Congratulations"; then
    # Renewal succeeded (join the lines so the JSON payload stays on one line)
    renewed_domains=$(echo "$last_renewal" | grep "renewed" | head -5 | tr '\n' ' ')
    curl -s -X POST "$SLACK_WEBHOOK" \
      -H 'Content-type: application/json' \
      -d "{\"text\":\"SSL certificate renewal succeeded\\n${renewed_domains}\"}"
elif echo "$last_renewal" | grep -q "No renewals were attempted"; then
    # Nothing was due for renewal (normal)
    echo "[$(date)] No certificates due for renewal" >> "$LOG"
else
    # Renewal failed, or the output was not recognized
    curl -s -X POST "$SLACK_WEBHOOK" \
      -H 'Content-type: application/json' \
      -d '{"text":"WARNING: check the result of the certbot renewal run!"}'
fi
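
Building the Slack payload by string interpolation breaks as soon as the message contains a double quote, a backslash, or a line break. One way to guard against that is to escape the text first. The helper names json_escape and slack_payload below are hypothetical, and the sketch only covers the characters most likely to appear in certbot output, not every control character.

```bash
#!/usr/bin/env bash
# json_escape STRING
# Escapes backslashes, double quotes, newlines, tabs and carriage returns
# so STRING can be placed inside a JSON string literal.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}       # backslash must be handled first
    s=${s//\"/\\\"}       # double quote
    s=${s//$'\n'/\\n}     # newline
    s=${s//$'\t'/\\t}     # tab
    s=${s//$'\r'/\\r}     # carriage return
    printf '%s' "$s"
}

# slack_payload TEXT
# Prints a {"text": ...} JSON object suitable for an incoming webhook.
slack_payload() {
    printf '{"text":"%s"}' "$(json_escape "$1")"
}
```

Usage would look like: curl -s -X POST "$SLACK_WEBHOOK" -H 'Content-type: application/json' -d "$(slack_payload "$message")".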

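3.3 Automatic Retry on Renewal Failure

certbot renew does not retry a failed renewal on its own before the next scheduled run, so retries are delegated to systemd. Restart=on-failure on a Type=oneshot service requires systemd 244 or later.

# Add to /etc/systemd/system/certbot-renewal.service
[Unit]
# Give up after 3 starts within one hour (start-limit settings belong in [Unit])
StartLimitBurst=3
StartLimitIntervalSec=3600

[Service]
# Retry 5 minutes after a failure
Restart=on-failure
RestartSec=300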
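
Where systemd-level restarts are not available (for example under cron), the same idea can be sketched as a shell wrapper. This is an illustrative alternative, not what the unit file above does: the function name retry is an assumption, and the delay doubles after each failure. Let's Encrypt rate-limits failed validations, so the attempt count should stay small.

```bash
#!/usr/bin/env bash
# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay between attempts (exponential backoff).
retry() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    while true; do
        "$@" && return 0
        if (( attempt >= max_attempts )); then
            echo "retry: giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "retry: attempt $attempt failed; sleeping ${delay}s" >&2
        sleep "$delay"
        attempt=$(( attempt + 1 ))
        delay=$(( delay * 2 ))
    done
}
```

For example, retry 3 300 certbot renew would try at most three times, waiting 5 and then 10 minutes between attempts.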

4. Certificate Expiry Monitoring

4.1 Prometheus + ssl_exporter

ssl_exporter exposes TLS certificate expiry times as Prometheus metrics.

# Install ssl_exporter
wget https://github.com/ribbybibby/ssl_exporter/releases/download/v2.4.3/ssl_exporter-2.4.3.linux-amd64.tar.gz
tar xzf ssl_exporter-2.4.3.linux-amd64.tar.gz
sudo mv ssl_exporter-2.4.3.linux-amd64/ssl_exporter /usr/local/bin/

# systemd service
sudo tee /etc/systemd/system/ssl-exporter.service << 'EOF'
[Unit]
Description=SSL Certificate Exporter
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ssl_exporter
Restart=on-failure
User=ssl-exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now ssl-exporter

# Prometheus scrape config
# /etc/prometheus/prometheus.yml

scrape_configs:
  - job_name: 'ssl'
    metrics_path: /probe
    static_configs:
      - targets:
          - example.com:443
          - api.example.com:443
          - admin.example.com:443
          - staging.example.com:443
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: ssl-exporter:9219 # ssl_exporter address

The key metrics are:

ssl_cert_not_after: certificate expiry time (Unix timestamp)
ssl_cert_not_before: start of the certificate validity period (Unix timestamp)
ssl_tls_version_info: TLS version information
ssl_ocsp_response_status: OCSP response status

# Days remaining until expiry
(ssl_cert_not_after - time()) / 86400

# Certificates expiring within 30 days
(ssl_cert_not_after - time()) / 86400 < 30

# Certificates expiring within 7 days
(ssl_cert_not_after - time()) / 86400 < 7
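
The same arithmetic is useful outside Prometheus, with one catch: PromQL divides floating-point numbers, while shell arithmetic is integer-only and rounds toward zero. A certificate that expired three hours ago would therefore come out as 0 days instead of a negative value. The sketch below rounds down instead; the function name days_left is an assumption.

```bash
#!/usr/bin/env bash
# days_left NOT_AFTER_EPOCH [NOW_EPOCH]
# Prints the whole days between now and expiry, rounded down (floor),
# so any certificate past its expiry time yields a negative number.
days_left() {
    local not_after=$1 now=${2:-$(date +%s)}
    local diff=$(( not_after - now ))
    if (( diff < 0 )); then
        # Integer division truncates toward zero; shift negatives to get a floor.
        echo $(( (diff - 86399) / 86400 ))
    else
        echo $(( diff / 86400 ))
    fi
}
```

The second argument exists so the calculation can be checked with fixed timestamps; in normal use it is omitted and the current time is used.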

4.2 Alertmanager Alert Rules

# /etc/prometheus/rules/ssl-alerts.yml
groups:
  - name: ssl_certificate_alerts
    rules:
      # Expires within 30 days - Warning
      - alert: SSLCertExpiringSoon
        expr: (ssl_cert_not_after - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: 'SSL certificate expiring soon ({{ $labels.instance }})'
          description: 'The certificate for {{ $labels.instance }} expires in {{ $value | printf "%.0f" }} days.'
          runbook_url: 'https://wiki.internal/runbooks/ssl-renewal'

      # Expires within 7 days - Critical
      - alert: SSLCertExpiryCritical
        expr: (ssl_cert_not_after - time()) / 86400 < 7
        for: 10m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: 'SSL certificate expiry critical ({{ $labels.instance }})'
          description: 'The certificate for {{ $labels.instance }} expires in {{ $value | printf "%.0f" }} days. Immediate action is required.'
          runbook_url: 'https://wiki.internal/runbooks/ssl-emergency-renewal'

      # Certificate already expired
      - alert: SSLCertExpired
        expr: (ssl_cert_not_after - time()) < 0
        for: 0m
        labels:
          severity: critical
          escalation: pagerduty
        annotations:
          summary: 'SSL certificate expired ({{ $labels.instance }})'
          description: 'The certificate for {{ $labels.instance }} has expired! Service outages are likely.'

      # Certificate probe failure (cannot connect)
      - alert: SSLProbeFailure
        expr: ssl_probe_success == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'SSL probe failed ({{ $labels.instance }})'
          description: 'Cannot establish a TLS connection to {{ $labels.instance }}.'

# Alertmanager routing configuration
# /etc/alertmanager/alertmanager.yml

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-warning'
  routes:
    - match:
        severity: critical
        escalation: pagerduty
      receiver: 'pagerduty-critical'
      repeat_interval: 30m
    - match:
        severity: critical
      receiver: 'slack-critical'
      repeat_interval: 1h

receivers:
  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#ssl-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#incident'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        severity: critical

4.3 Grafana Dashboard

{
  "title": "SSL Certificate Dashboard",
  "panels": [
    {
      "title": "Days until certificate expiry",
      "type": "table",
      "targets": [
        {
          "expr": "sort_desc((ssl_cert_not_after - time()) / 86400)",
          "legendFormat": "{{ instance }}"
        }
      ]
    },
    {
      "title": "Certificates expiring soon (within 30 days)",
      "type": "stat",
      "targets": [
        {
          "expr": "count((ssl_cert_not_after - time()) / 86400 < 30)"
        }
      ],
      "thresholds": [
        { "value": 0, "color": "green" },
        { "value": 1, "color": "orange" },
        { "value": 3, "color": "red" }
      ]
    }
  ]
}

4.4 Standalone Monitoring Script (Without Prometheus)

In environments without Prometheus, a shell script can fill the gap.

#!/bin/bash
# /usr/local/bin/ssl-expiry-check.sh
# Certificate expiry monitoring + Slack/email alerts

set -euo pipefail

DOMAINS=(
    "example.com"
    "api.example.com"
    "admin.example.com"
    "staging.example.com"
)

WARNING_DAYS=30
CRITICAL_DAYS=7
SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"
ALERT_EMAIL="ops-team@example.com"

check_cert() {
    local domain=$1
    local port=${2:-443}

    # Fetch the certificate expiry date
    local expiry_date
    expiry_date=$(echo | timeout 10 openssl s_client \
        -servername "$domain" \
        -connect "${domain}:${port}" 2>/dev/null \
        | openssl x509 -noout -enddate 2>/dev/null \
        | cut -d= -f2)

    if [ -z "$expiry_date" ]; then
        echo "UNKNOWN|${domain}|Connection failed"
        return
    fi

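    # Compute the days remaining (works with both GNU and BSD date)
    local expiry_epoch now days_left
    if date --version >/dev/null 2>&1; then
        # GNU date (Linux)
        expiry_epoch=$(date -d "$expiry_date" +%s)
    else
        # BSD date (macOS)
        expiry_epoch=$(date -j -f "%b %d %T %Y %Z" "$expiry_date" +%s)
    fi
    now=$(date +%s)
    days_left=$(( (expiry_epoch - now) / 86400 ))

    # Compare timestamps directly: integer division rounds toward zero,
    # so a certificate that expired a few hours ago would still show 0 days
    if [ "$expiry_epoch" -le "$now" ]; then
        echo "EXPIRED|${domain}|${days_left}|${expiry_date}"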
    elif [ "$days_left" -lt "$CRITICAL_DAYS" ]; then
        echo "CRITICAL|${domain}|${days_left}|${expiry_date}"
    elif [ "$days_left" -lt "$WARNING_DAYS" ]; then
        echo "WARNING|${domain}|${days_left}|${expiry_date}"
    else
        echo "OK|${domain}|${days_left}|${expiry_date}"
    fi
}

# Check every domain
alerts=""
for domain in "${DOMAINS[@]}"; do
    result=$(check_cert "$domain")
    status=$(echo "$result" | cut -d'|' -f1)
    days=$(echo "$result" | cut -d'|' -f3)

    case $status in
        OK)
            printf "%-30s %-10s %s days\n" "$domain" "[OK]" "$days"
            ;;
        WARNING)
            printf "%-30s %-10s %s days\n" "$domain" "[WARNING]" "$days"
            alerts="${alerts}WARNING: ${domain} - ${days} days left\n"
            ;;
        CRITICAL|EXPIRED)
            printf "%-30s %-10s %s days\n" "$domain" "[$status]" "$days"
            alerts="${alerts}${status}: ${domain} - ${days} days left\n"
            ;;
        UNKNOWN)
            printf "%-30s %-10s\n" "$domain" "[UNKNOWN]"
            alerts="${alerts}UNKNOWN: ${domain} - connection failed\n"
            ;;
    esac
done

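# Send alerts
if [ -n "$alerts" ]; then
    if [ -n "$SLACK_WEBHOOK" ]; then
        curl -s -X POST "$SLACK_WEBHOOK" \
          -H 'Content-type: application/json' \
          -d "{\"text\":\"SSL certificate check results:\\n${alerts}\"}"
    fi

    # Email alert (requires mailutils)
    echo -e "$alerts" | mail -s "[SSL Alert] Certificate expiry warning" "$ALERT_EMAIL" 2>/dev/null || true
fi

# Register with cron (daily at 09:00). Append to root's existing crontab:
# piping a single line into "crontab -" would replace every existing entry.
( sudo crontab -l 2>/dev/null; echo "0 9 * * * /usr/local/bin/ssl-expiry-check.sh >> /var/log/ssl-check.log 2>&1" ) | sudo crontab -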

5. Incident Response Playbook

5.1 Response Flow for Certificate Expiry Incidents

┌─────────────────────────────────────────────────────────────┐
│  Certificate Expiry Incident Response Flow                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Detect and confirm (0-5 min)                            │
│     └→ Assess scope, list affected domains                  │
│                                                             │
│  2. Immediate mitigation (5-15 min)                         │
│     └→ Apply a temporary certificate or reroute traffic     │
│                                                             │
│  3. Proper certificate renewal (15-30 min)                  │
│     └→ certbot renewal or emergency issuance                │
│                                                             │
│  4. Service verification (30-45 min)                        │
│     └→ Verify TLS connections on all endpoints              │
│                                                             │
│  5. Postmortem (within 48 hours)                            │
│     └→ Root cause analysis, preventive actions              │
│                                                             │
└─────────────────────────────────────────────────────────────┘

5.2 Detailed Step-by-Step Response

Step 1: Detect and Confirm (0-5 min)

# 1-1. Check immediately whether the certificate has expired
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -dates

# 1-2. Check several domains at once
for domain in example.com api.example.com admin.example.com; do
    echo -n "$domain: "
    echo | openssl s_client -servername "$domain" -connect "${domain}:443" 2>/dev/null \
      | openssl x509 -noout -enddate 2>/dev/null || echo "CONNECTION FAILED"
done

# 1-3. Check the local certificate files
for cert in /etc/letsencrypt/live/*/fullchain.pem; do
    domain=$(basename "$(dirname "$cert")")
    expiry=$(openssl x509 -in "$cert" -noout -enddate | cut -d= -f2)
    echo "$domain: $expiry"
done

# 1-4. Share the situation in the incident channel
# "SSL certificate expiry confirmed. Affected: example.com, api.example.com. Response started."

Step 2: Immediate Mitigation (5-15 min)

# 2-1. Force-renew the Let's Encrypt certificate
sudo certbot renew --force-renewal --cert-name example.com
sudo systemctl reload nginx

# 2-2. If renewal fails, issue an emergency certificate in standalone mode
#      (standalone uses HTTP-01, which cannot issue wildcards, so list hostnames explicitly)
sudo systemctl stop nginx
sudo certbot certonly --standalone -d example.com -d www.example.com
sudo systemctl start nginx

# 2-3. If you hit a Let's Encrypt rate limit, apply a temporary self-signed certificate
openssl req -x509 -nodes -days 1 -newkey rsa:2048 \
  -keyout /tmp/emergency.key \
  -out /tmp/emergency.crt \
  -subj "/CN=example.com"

# Note: a self-signed certificate triggers browser warnings, but it can be
# a stopgap for internal traffic such as API servers

# 2-4. If you use ACM certificates on AWS
# ACM-issued certificates renew automatically (imported certificates do not),
# so the cause is usually on the ALB/CloudFront side
aws elbv2 describe-listeners --load-balancer-arn $ALB_ARN \
  --query 'Listeners[].Certificates[].CertificateArn'

# Check the ACM certificate status
aws acm describe-certificate --certificate-arn $CERT_ARN \
  --query 'Certificate.{Status:Status,NotAfter:NotAfter}'

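Step 3: Proper Certificate Renewal (15-30 min)

# 3-1. Verify the renewed certificate chain
#      (openssl verify checks only the first certificate in a file,
#       so the intermediates are passed separately via -untrusted)
openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt \
  -untrusted /etc/letsencrypt/live/example.com/chain.pem \
  /etc/letsencrypt/live/example.com/cert.pem

# 3-2. Confirm the certificate and key match
#      (comparing public keys works for both RSA and ECDSA; no output means they match)
diff <(openssl x509 -noout -pubkey -in /etc/letsencrypt/live/example.com/fullchain.pem | openssl sha256) \
     <(openssl pkey -pubout -in /etc/letsencrypt/live/example.com/privkey.pem | openssl sha256)

# 3-3. Deploy to all servers
/usr/local/bin/deploy-cert.sh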

Step 4: Service Verification (30-45 min)

# 4-1. TLS connection test
curl -vI https://example.com 2>&1 | grep -E "SSL|expire|subject"

# 4-2. Check every endpoint
for url in https://example.com https://api.example.com/healthz https://admin.example.com; do
    status=$(curl -s -o /dev/null -w "%{http_code}" "$url")
    echo "$url: HTTP $status"
done

# 4-3. Verify from outside (SSL Labs)
echo "Check: https://www.ssllabs.com/ssltest/analyze.html?d=example.com"

Step 5: Postmortem (Within 48 Hours)

The postmortem should include the following items.

## Incident Postmortem: SSL Certificate Expiry

### Timeline

- HH:MM - First alert received (source: Prometheus/user report)
- HH:MM - Incident confirmed, response started
- HH:MM - Certificate renewal completed
- HH:MM - Service confirmed healthy

### Impact

- Affected domains: example.com, api.example.com
- Duration: XX minutes
- Affected users: approximately N

### Root Cause

- (Example) The certbot timer had been disabled, so auto-renewal was not running
- (Example) Renewal kept failing on DNS validation, but no alert was configured

### Preventive Actions

- [ ] Add monitoring for the auto-renewal timer state
- [ ] Configure immediate alerts on renewal failure
- [ ] Confirm the 30-day warning alert for certificate expiry is configured

6. Multi-Environment Certificate Management

6.1 Strategy per Environment

Environment	Certificate type	CA	Renewal cycle	Notes
dev	Self-signed or mkcert	Local	N/A	Browser warnings acceptable
staging	Let's Encrypt (staging)	Let's Encrypt Staging	90 days	Much higher rate limits; not browser-trusted
prod	Let's Encrypt or ACM	Public CA	90 days / automatic	Zero downtime required

6.2 Development: mkcert

For local development, use mkcert to generate locally trusted certificates.

# Install mkcert (macOS)
brew install mkcert
mkcert -install  # adds the local CA to the system trust store

# Generate a development certificate
mkcert "*.dev.example.com" localhost 127.0.0.1 ::1

# Resulting files
# _wildcard.dev.example.com+3.pem (certificate)
# _wildcard.dev.example.com+3-key.pem (key)

6.3 Staging: Let's Encrypt Staging

In staging, use the Let's Encrypt staging server to stay clear of the production rate limits. The staging server has its own, much higher limits.

# Issue from the staging server (much higher rate limits; not trusted by browsers)
sudo certbot certonly --staging \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "*.staging.example.com"

# Use the -k (insecure) flag when testing APIs
curl -k https://staging.example.com/api/healthz

6.4 Wildcard Strategy

example.com                  → apex + wildcard certificate
├── www.example.com          → covered by *.example.com
├── api.example.com          → covered by *.example.com
├── admin.example.com        → covered by *.example.com
├── staging.example.com      → separate certificate (staging environment)
│   ├── api.staging.example.com → covered by *.staging.example.com
│   └── admin.staging.example.com → covered by *.staging.example.com
└── internal.example.com     → internal only (consider mTLS)
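
The reason the tree needs a separate *.staging.example.com certificate is that a wildcard matches exactly one label: it covers neither deeper subdomains nor the apex. The rule can be sketched as a shell function for checking an inventory against planned hostnames. The function name wildcard_covers is an assumption, and the sketch compares names case-sensitively, so lowercase the inputs first.

```bash
#!/usr/bin/env bash
# wildcard_covers PATTERN HOST
# Succeeds if certificate name PATTERN covers HOST.
# "*.example.com" matches exactly one extra label on the left:
# api.example.com yes; api.staging.example.com and example.com no.
wildcard_covers() {
    local pattern=$1 host=$2
    if [[ $pattern == \*.* ]]; then
        local suffix=${pattern#\*}          # e.g. ".example.com"
        [[ $host == *"$suffix" ]] || return 1
        local label=${host%"$suffix"}       # what the "*" has to match
        [[ -n $label && $label != *.* ]]    # exactly one non-empty label
    else
        [[ $host == "$pattern" ]]
    fi
}
```

For example, wildcard_covers '*.example.com' api.staging.example.com fails, which is why the staging hosts in the tree get their own certificate.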

# Production wildcard certificate (apex + wildcard)
sudo certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "example.com" \
  -d "*.example.com" \
  --cert-name prod-wildcard

# Staging wildcard certificate (separate)
sudo certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "staging.example.com" \
  -d "*.staging.example.com" \
  --cert-name staging-wildcard

6.5 Kubernetes: cert-manager

In Kubernetes, cert-manager manages certificates declaratively.

# cert-manager ClusterIssuer (Let's Encrypt)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          cloudflare:
            email: admin@example.com
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token

# Certificate resource
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com-tls
  namespace: istio-system
spec:
  secretName: example-com-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - example.com
    - '*.example.com'
  # Auto-renewal: 30 days before expiry
  renewBefore: 720h # 30 days

# Automatic certificate issuance from an Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    cert-manager.io/cluster-issuer: 'letsencrypt-prod'
spec:
  tls:
    - hosts:
        - example.com
        - api.example.com
      secretName: example-com-tls
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80

It also helps to know the commands for checking cert-manager's state.

# Check certificate status
kubectl get certificates -A
kubectl describe certificate example-com-tls -n istio-system

# Check certificate events
kubectl get events --field-selector reason=IssueError -A

# cert-manager logs
kubectl logs -n cert-manager deploy/cert-manager -f

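7. Operational Checklists

Certificate issuance checklist

Choose the key algorithm (ECDSA P-256 recommended)
Decide the certificate scope (single / wildcard / SAN)
Prepare DNS API credentials for DNS-01 validation
Set certificate file permissions (chmod 600 privkey.pem)
Confirm fullchain.pem is in use (cert.pem alone gives an incomplete chain)

Auto-renewal checklist

Confirm the systemd timer is active (systemctl is-active certbot-renewal.timer)
Confirm certbot renew --dry-run succeeds
Configure the web server reload in the deploy-hook
Configure alerts on renewal failure
Implement retry logic for renewal failures

Monitoring checklist

Monitor expiry dates with ssl_exporter or a custom script
Configure warning alerts at 30 days and critical alerts at 7 days
Connect Slack/PagerDuty alert channels
Add a certificate status panel to the Grafana dashboard
Automate a weekly certificate report

Incident readiness checklist

Runbook for certificate expiry written
Emergency contact channel (on-call) assigned
Previous certificates backed up
Alternative issuance tool besides certbot available (acme.sh, etc.)
Fallback for rate-limit exhaustion (staging CA, another CA)

Multi-environment checklist

dev: use mkcert local certificates
staging: use the Let's Encrypt staging CA
prod: public CA + auto-renewal + monitoring
Kubernetes: install cert-manager and configure a ClusterIssuer
Document the renewal schedule per certificate (certificate inventory)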
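
A certificate inventory does not need special tooling to be useful. As a minimal sketch, assume a CSV file with a header row and the columns name, location, not_after_epoch. Both this file format and the function name expiring_within are assumptions made for illustration. The function lists the entries that expire within a given number of days, soonest first.

```bash
#!/usr/bin/env bash
# expiring_within INVENTORY_CSV DAYS [NOW_EPOCH]
# Inventory columns (assumed format): name,location,not_after_epoch
# Prints "name,location,days_left" for every certificate expiring
# in fewer than DAYS days, sorted by days_left ascending.
expiring_within() {
    local inventory=$1 days=$2 now=${3:-$(date +%s)}
    awk -F, -v now="$now" -v days="$days" '
        NR > 1 && ($3 - now) < days * 86400 {
            printf "%s,%s,%d\n", $1, $2, int(($3 - now) / 86400)
        }' "$inventory" | sort -t, -k3,3n
}
```

Running expiring_within certs.csv 30 from a weekly cron job would cover the "weekly certificate report" item in the monitoring checklist.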

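8. Conclusion

To sum up, these are the core principles of certificate operations.

1. Manual renewal will eventually fail. Let's Encrypt chose a 90-day lifetime in part to push operators toward automation (the other stated reason is limiting the damage from key compromise and mis-issuance). Fully automate renewal with certbot + a systemd timer, cert-manager, or similar.

2. Automation alone is not enough. Auto-renewal can fail for many reasons: an expired DNS API token, a full disk, a CA outage. Always build monitoring alongside it.

3. Incidents will happen. To stay calm when an expiry incident hits, write the runbook in advance and rehearse it regularly. Recovery time depends on how well you prepared.

4. Manage certificates as an inventory. As domains multiply, it gets harder to know which certificate lives where and when it expires. Document the list of certificates and register every one of them as a monitoring target.

Use the checklists and scripts in this playbook as a starting point for a certificate operations process that fits your own environment. Built well once, it can free you from certificate expiry incidents.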

SSL Certificate Operations Playbook: Zero-Downtime Renewal and Expiry Prevention

Introduction
1. Certificate Lifecycle Management
2. Zero-Downtime Renewal Strategies
3. Let's Encrypt Auto-Renewal Operations
4. Certificate Expiry Monitoring
5. Incident Response Playbook
- 5.1 Response Flow for Certificate Expiry Incidents
- 5.2 Detailed Step-by-Step Response
6. Multi-Environment Certificate Management
7. Operational Checklists
8. Conclusion
Quiz

Introduction

The basics of SSL/TLS certificates -- concepts, issuance methods, and Nginx configuration -- are covered in the SSL/TLS Certificate Complete Guide. This post extends that foundation by focusing exclusively on operations. Issuing a certificate once is easy. The real challenge is managing dozens of domains without a single expiry incident.

Certificate expiry incidents happen even to major services. In 2020, Microsoft Teams went down for hours due to an expired certificate. Spotify and LinkedIn have experienced the same. The common thread was not the absence of automation, but the absence of operational processes.

This playbook answers the following questions:

How do you renew certificates without any service downtime?
What do you need to build to receive automatic alerts 30 days before expiry?
If a certificate expires at 3 AM, what is the step-by-step response procedure?
How do you separately manage certificates across dev/staging/prod environments?

1. Certificate Lifecycle Management

Certificate operations is not simply "issue and renew." It requires systematic lifecycle management.

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Issuance │ → │  Deploy  │ → │ Monitor  │ → │ Renewal  │ → │  Revoke  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
      ↑                                              │
      └──────────────────────────────────────────────┘
                    (auto-renewal cycle)

1.1 Issuance

Key decisions at the issuance stage:

Decision	Options	Recommended
CA selection	Let's Encrypt / DigiCert / ACM	Depends on environment (see below)
Key algorithm	RSA 2048 / RSA 4096 / ECDSA P-256	ECDSA P-256 (performance + security)
Certificate scope	Single domain / Wildcard / SAN	Wildcard + apex SAN
Validation method	HTTP-01 / DNS-01	DNS-01 (required for wildcards)

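ECDSA is recommended for three reasons: its keys are much smaller than RSA 2048 (256-bit vs. 2048-bit), the server-side TLS handshake is roughly 2-5x faster (the exact gain depends on hardware and TLS library), and CPU load is lower at comparable or better security strength (P-256 is roughly 128-bit, RSA 2048 roughly 112-bit).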

# Issue Let's Encrypt certificate with ECDSA key
sudo certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  --key-type ecdsa \
  --elliptic-curve secp256r1 \
  -d "*.example.com" \
  -d "example.com"

1.2 Deployment

After issuance, the certificate must be applied to the actual service. This is trivial for a single server but requires a deployment strategy when spanning multiple servers.

#!/bin/bash
# /usr/local/bin/deploy-cert.sh
# Certificate deployment script (multi-server)

CERT_DIR="/etc/letsencrypt/live/example.com"
SERVERS=("web01" "web02" "web03")
REMOTE_CERT_DIR="/etc/nginx/ssl"
DEPLOY_LOG="/var/log/cert-deploy.log"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" >> "$DEPLOY_LOG"
}

deploy_cert() {
    local server=$1
    log "Deploying to $server"

    # Copy both files under temporary names, swap them into place
    # (mv is atomic on the same filesystem), then reload only if the
    # configuration test passes. Any failing step aborts the chain.
    if scp -q "$CERT_DIR/fullchain.pem" "$server:$REMOTE_CERT_DIR/fullchain.pem.new" &&
       scp -q "$CERT_DIR/privkey.pem" "$server:$REMOTE_CERT_DIR/privkey.pem.new" &&
       ssh "$server" "
           mv $REMOTE_CERT_DIR/fullchain.pem.new $REMOTE_CERT_DIR/fullchain.pem &&
           mv $REMOTE_CERT_DIR/privkey.pem.new $REMOTE_CERT_DIR/privkey.pem &&
           nginx -t && systemctl reload nginx
       "; then
        log "$server: OK"
    else
        log "$server: FAILED"
        return 1
    fi
}

# Deploy to every server and report failure if any deployment failed
failed=0
for server in "${SERVERS[@]}"; do
    deploy_cert "$server" || failed=1
done
exit "$failed"

1.3 Monitoring

Covered in detail in Section 4. The core principles are:

Warning alerts starting 30 days before expiry
Critical alerts starting 7 days before expiry
Escalation (PagerDuty/phone call) 1 day before expiry
Always log renewal success/failure events
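The tiers above can be sketched as a small function that maps "days until expiry" to an alert level. This is only an illustration: the function name and tier labels are assumptions, and the boundaries follow the `< 30` and `< 7` comparisons used later in this playbook, with the 1-day escalation treated as "one day or less".

```shell
#!/usr/bin/env bash
# Map "days until expiry" to an alert tier.
# Tier names are illustrative, not part of any tool.

classify_expiry() {
    local days_left=$1
    if   [ "$days_left" -lt 0 ];  then echo "EXPIRED"
    elif [ "$days_left" -le 1 ];  then echo "ESCALATE"   # page the on-call
    elif [ "$days_left" -lt 7 ];  then echo "CRITICAL"
    elif [ "$days_left" -lt 30 ]; then echo "WARNING"
    else                               echo "OK"
    fi
}

for d in 45 29 6 1 -2; do
    printf '%3d days -> %s\n' "$d" "$(classify_expiry "$d")"
done
```

Checking the most severe tier first keeps the branches mutually exclusive, so a certificate with one day left escalates instead of stopping at CRITICAL.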

1.4 Renewal

Zero-downtime renewal strategies are covered in detail in Section 2.

1.5 Revocation

Situations requiring certificate revocation:

Suspected private key compromise
Loss of domain ownership
Changes in organization information

# Revoke a Let's Encrypt certificate
sudo certbot revoke --cert-path /etc/letsencrypt/live/example.com/cert.pem \
  --reason keycompromise

# Delete certificate files after revocation
sudo certbot delete --cert-name example.com

# Immediately issue a new certificate
sudo certbot certonly --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "*.example.com" -d "example.com"

2. Zero-Downtime Renewal Strategies

The main causes of service disruption during certificate renewal are:

Restarting (not reloading) the web server during renewal
Time gap between deploying the new certificate and load balancer propagation
Clients holding TLS session caches referencing the old certificate

2.1 Nginx Reload (Single Server)

The most basic zero-downtime approach. When Nginx performs a reload, existing worker processes finish handling their current requests before shutting down, while new worker processes start with the new configuration (and new certificate).

# restart vs reload difference
# restart: stops and restarts the process → potential request loss
# reload:  spawns new workers → graceful shutdown of old workers → zero downtime

# Automate reload with certbot deploy hook
sudo certbot renew --deploy-hook "systemctl reload nginx"

Warning: Never use systemctl restart nginx. It immediately terminates existing connections.

2.2 Rolling Renewal (Multi-Server)

When multiple servers sit behind a load balancer, renew one server at a time sequentially.

#!/bin/bash
# /usr/local/bin/rolling-cert-renewal.sh

set -uo pipefail

SERVERS=("web01" "web02" "web03")
LB_API="http://lb-admin.internal:8080/api"
HEALTH_CHECK_URL="https://example.com/healthz"
WAIT_SECONDS=30

for server in "${SERVERS[@]}"; do
    echo "=== Processing $server ==="

    # 1. Remove server from load balancer
    curl -fsS -X POST "$LB_API/drain" -d "server=$server" || exit 1
    echo "Draining $server from load balancer..."
    sleep "$WAIT_SECONDS"  # Wait for existing connections to complete

    # 2. Deploy certificate and apply
    scp /etc/letsencrypt/live/example.com/fullchain.pem "$server:/etc/nginx/ssl/" &&
    scp /etc/letsencrypt/live/example.com/privkey.pem "$server:/etc/nginx/ssl/" &&
    ssh "$server" "nginx -t && systemctl reload nginx" || {
        echo "ERROR: deployment to $server failed; leaving it drained" >&2
        exit 1
    }

    # 3. Health check: request the public hostname, pinned to this server's IP,
    #    so the certificate is validated against the name clients actually use
    server_ip=$(dig +short "$server" | head -n 1)
    healthy=false
    for i in $(seq 1 10); do
        status=$(curl -s -o /dev/null -w "%{http_code}" \
            --resolve "example.com:443:${server_ip}" "$HEALTH_CHECK_URL")
        if [ "$status" = "200" ]; then
            echo "$server health check passed"
            healthy=true
            break
        fi
        sleep 2
    done

    # Do not put an unhealthy server back into rotation
    if [ "$healthy" != true ]; then
        echo "ERROR: $server failed health check; leaving it drained" >&2
        exit 1
    fi

    # 4. Re-add server to load balancer
    curl -fsS -X POST "$LB_API/enable" -d "server=$server" || exit 1
    echo "$server re-enabled in load balancer"
    sleep 5
done

echo "=== Rolling renewal complete ==="

2.3 Blue-Green Certificate Swap

Operate two sets of certificates and instantly switch at the transition point. Primarily used in large-scale infrastructure.

# /etc/nginx/conf.d/ssl-blue-green.conf
# Blue-Green certificate switching via symlinks

# Active certificate (symlink)
# /etc/nginx/ssl/active/ -> /etc/nginx/ssl/blue/ or /etc/nginx/ssl/green/

server {
    listen 443 ssl http2;
    server_name example.com;

    ssl_certificate     /etc/nginx/ssl/active/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/active/privkey.pem;

    # ... other settings
}

#!/bin/bash
# /usr/local/bin/blue-green-cert-switch.sh

ACTIVE_LINK="/etc/nginx/ssl/active"
BLUE_DIR="/etc/nginx/ssl/blue"
GREEN_DIR="/etc/nginx/ssl/green"

# Determine current active slot
current=$(readlink "$ACTIVE_LINK")
if [ "$current" = "$BLUE_DIR" ]; then
    target="$GREEN_DIR"
    target_name="green"
else
    target="$BLUE_DIR"
    target_name="blue"
fi

echo "Current: $current"
echo "Deploying new cert to: $target ($target_name)"

# Deploy new certificate to inactive slot
cp /etc/letsencrypt/live/example.com/fullchain.pem "$target/fullchain.pem"
cp /etc/letsencrypt/live/example.com/privkey.pem "$target/privkey.pem"

# Validate certificate
openssl x509 -in "$target/fullchain.pem" -noout -checkend 86400
if [ $? -ne 0 ]; then
    echo "ERROR: New certificate expires within 24 hours. Aborting."
    exit 1
fi

# Verify key match by comparing public keys (works for both RSA and ECDSA;
# a modulus comparison only works for RSA keys)
CERT_PUB=$(openssl x509 -noout -pubkey -in "$target/fullchain.pem" | openssl sha256)
KEY_PUB=$(openssl pkey -pubout -in "$target/privkey.pem" 2>/dev/null | openssl sha256)
if [ "$CERT_PUB" != "$KEY_PUB" ]; then
    echo "ERROR: Certificate and key do not match. Aborting."
    exit 1
fi

# Atomic symlink switch
ln -sfn "$target" "${ACTIVE_LINK}.new"
mv -T "${ACTIVE_LINK}.new" "$ACTIVE_LINK"

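# Nginx reload (report success only if the config test and the reload both pass)
if nginx -t && systemctl reload nginx; then
    echo "Switched to $target_name slot. Reload complete."
else
    echo "ERROR: Nginx config test or reload failed after switching to $target_name." >&2
    exit 1
fi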
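The symlink flip at the heart of the script above can be tried safely in a scratch directory. The following is a minimal sketch, assuming GNU coreutils (`mv -T`); the `switch_slot` function name and the temporary directory layout are illustrative and not part of the original script.

```shell
#!/usr/bin/env bash
# Flip an "active" symlink between two slots without a window in which
# the link is missing. Runs entirely inside a temporary directory.

switch_slot() {
    local active_link=$1 target_dir=$2
    # Build the new link under a temporary name, then rename it over the
    # old one. The rename replaces the destination in a single step.
    ln -sfn "$target_dir" "${active_link}.new"
    mv -T "${active_link}.new" "$active_link"
}

workdir=$(mktemp -d)
mkdir "$workdir/blue" "$workdir/green"
ln -s "$workdir/blue" "$workdir/active"

switch_slot "$workdir/active" "$workdir/green"
readlink "$workdir/active"    # prints the path of the green slot
```

Running `ln -sfn` directly on the live link would unlink and recreate it, which is why the script goes through a temporary name and a rename instead.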

2.4 Dual-Certificate

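Nginx 1.11.0+ can load an RSA and an ECDSA certificate for the same server block (separate chains per certificate need OpenSSL 1.0.2 or later). Each client is served the certificate type it negotiates, so RSA-only clients keep working while modern clients get ECDSA. Treat this as a compatibility mechanism rather than a failover: both certificates must still be renewed before they expire, because a client that negotiates an expired certificate will reject the connection.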

server {
    listen 443 ssl http2;
    server_name example.com;

    # RSA certificate
    ssl_certificate     /etc/nginx/ssl/rsa/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/rsa/privkey.pem;

    # ECDSA certificate
    ssl_certificate     /etc/nginx/ssl/ecdsa/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/ecdsa/privkey.pem;

    # Nginx automatically selects based on client support
    # ECDSA preferred, RSA fallback for unsupported clients
}

3. Let's Encrypt Auto-Renewal Operations

3.1 systemd Timer-Based Renewal (Recommended)

Reasons to prefer systemd timer over cron:

RandomizedDelaySec distributes load on the CA server
Persistent=true compensates for missed runs after boot
systemctl list-timers shows next scheduled execution
Logs are integrated via journalctl

# /etc/systemd/system/certbot-renewal.service
[Unit]
Description=Certbot SSL Certificate Renewal
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
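# No --quiet here: cert-renewal-notify.sh greps this unit's journal output
ExecStart=/usr/bin/certbot renew \
  --pre-hook "/usr/local/bin/cert-pre-hook.sh" \
  --deploy-hook "/usr/local/bin/cert-deploy-hook.sh"
# ExecStartPost runs only if ExecStart exited successfully
ExecStartPost=/usr/local/bin/cert-renewal-notify.sh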
TimeoutStartSec=300

# /etc/systemd/system/certbot-renewal.timer
[Unit]
Description=Run certbot renewal twice daily

[Timer]
OnCalendar=*-*-* 02,14:00:00
RandomizedDelaySec=3600
Persistent=true
AccuracySec=1s

[Install]
WantedBy=timers.target

# Enable timer and check status
sudo systemctl daemon-reload
sudo systemctl enable --now certbot-renewal.timer
sudo systemctl list-timers certbot-renewal.timer

# Manual test (dry-run)
sudo certbot renew --dry-run

# Manual trigger
sudo systemctl start certbot-renewal.service

3.2 Using pre-hook / deploy-hook

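Hooks automate tasks before and after renewal. Certbot runs the pre-hook only when at least one certificate is due for renewal, before the renewal is attempted, and runs the deploy-hook once for each certificate that was successfully renewed.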

#!/bin/bash
# /usr/local/bin/cert-pre-hook.sh
# Executed before renewal

LOG="/var/log/cert-hooks.log"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] PRE-HOOK: Starting renewal process" >> "$LOG"

# Back up current certificate information
for domain_dir in /etc/letsencrypt/live/*/; do
    domain=$(basename "$domain_dir")
    expiry=$(openssl x509 -in "${domain_dir}fullchain.pem" -noout -enddate 2>/dev/null | cut -d= -f2)
    echo "[PRE] $domain expires: $expiry" >> "$LOG"
done

#!/bin/bash
# /usr/local/bin/cert-deploy-hook.sh
# Executed after successful renewal
# Environment variables $RENEWED_DOMAINS and $RENEWED_LINEAGE are available

LOG="/var/log/cert-hooks.log"
# Webhook URL must be provided via the environment; empty disables Slack alerts
SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] DEPLOY-HOOK: Certificate renewed" >> "$LOG"
echo "  Domains: $RENEWED_DOMAINS" >> "$LOG"
echo "  Lineage: $RENEWED_LINEAGE" >> "$LOG"

# 1. Validate Nginx config then reload
if nginx -t 2>/dev/null; then
    systemctl reload nginx
    echo "  Nginx reloaded successfully" >> "$LOG"
else
    echo "  ERROR: Nginx config test failed!" >> "$LOG"
    if [ -n "$SLACK_WEBHOOK" ]; then
        curl -s -X POST "$SLACK_WEBHOOK" \
          -H 'Content-type: application/json' \
          -d '{"text":"CRITICAL: Nginx config test failed after cert renewal!"}'
    fi
    exit 1
fi

# 2. If HAProxy is running, combine cert and reload
if systemctl is-active haproxy > /dev/null 2>&1; then
    # The combined file contains the private key: create it owner-only
    (
        umask 077
        cat "$RENEWED_LINEAGE/fullchain.pem" "$RENEWED_LINEAGE/privkey.pem" \
          > "/etc/haproxy/certs/$(basename "$RENEWED_LINEAGE").pem"
    )
    systemctl reload haproxy
    echo "  HAProxy reloaded" >> "$LOG"
fi

# 3. Sync certificates to other servers (if needed)
# /usr/local/bin/sync-certs-to-peers.sh "$RENEWED_LINEAGE"
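When one host renews several certificates, a deploy hook often needs to know whether the renewed lineage covers a particular domain before touching a service. The sketch below shows one way to test that using `RENEWED_DOMAINS`, which certbot exports as a space-separated list. The `renewed_includes` function is a hypothetical helper, and the variable is set by hand here only so the example runs outside certbot.

```shell
#!/usr/bin/env bash
# Check whether a given name is among the domains certbot just renewed.

renewed_includes() {
    local wanted=$1 domain
    local -a domains
    # read -a splits on whitespace without glob-expanding "*.example.com"
    read -r -a domains <<< "${RENEWED_DOMAINS:-}"
    for domain in "${domains[@]}"; do
        [ "$domain" = "$wanted" ] && return 0
    done
    return 1
}

# Example value, shaped like what certbot exports for a wildcard lineage
RENEWED_DOMAINS="example.com *.example.com"

if renewed_includes "example.com"; then
    echo "reload services that terminate TLS for example.com"
fi
```

The comparison is an exact string match, so a wildcard entry matches only the literal `*.example.com`, not the hostnames it covers.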

#!/bin/bash
# /usr/local/bin/cert-renewal-notify.sh
# Renewal result notification

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
LOG="/var/log/cert-hooks.log"

# Check last renewal result
last_renewal=$(journalctl -u certbot-renewal.service --since "5 minutes ago" --no-pager 2>/dev/null)

if echo "$last_renewal" | grep -q "Congratulations"; then
    # Renewal succeeded (flatten to one line so the JSON payload stays valid)
    renewed_domains=$(echo "$last_renewal" | grep "renewed" | head -5 | tr '\n' ' ')
    curl -s -X POST "$SLACK_WEBHOOK" \
      -H 'Content-type: application/json' \
      -d "{\"text\":\"SSL certificate renewal succeeded: ${renewed_domains}\"}"
elif echo "$last_renewal" | grep -q "No renewals were attempted"; then
    # No certificates due for renewal (normal)
    echo "[$(date)] No certificates due for renewal" >> "$LOG"
else
    # Unexpected output. When this script runs from ExecStartPost it is skipped
    # entirely if certbot exits non-zero, so hard failures need a separate
    # alert path (for example an OnFailure= unit).
    curl -s -X POST "$SLACK_WEBHOOK" \
      -H 'Content-type: application/json' \
      -d '{"text":"WARNING: Please check certbot renewal execution results!"}'
fi

3.3 Automatic Retry on Renewal Failure

Certbot does not have built-in retry logic. Use systemd features to implement it.

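# Add to /etc/systemd/system/certbot-renewal.service
# (older systemd versions reject Restart= on Type=oneshot units)
[Unit]
# At most 3 starts per hour (the initial run plus 2 retries).
# The start-limit settings belong in [Unit], not [Service].
StartLimitBurst=3
StartLimitIntervalSec=3600

[Service]
# Retry 5 minutes after a failure
Restart=on-failure
RestartSec=300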
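Where the renewal is driven by a wrapper script instead of a systemd unit, the same retry policy can be sketched in plain bash. The `retry` function below is an illustrative helper, not a certbot or systemd feature, and the attempt count and delay are parameters chosen by the caller.

```shell
#!/usr/bin/env bash
# Retry a command a fixed number of times with a pause between attempts.

retry() {
    local max_attempts=$1 delay_seconds=$2
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay_seconds}s" >&2
        sleep "$delay_seconds"
        attempt=$((attempt + 1))
    done
}

# Usage (not executed here): three attempts, five minutes apart.
# retry 3 300 certbot renew --deploy-hook /usr/local/bin/cert-deploy-hook.sh
```

A fixed delay is enough for the transient DNS or network failures this is meant to absorb. Persistent failures should still surface through alerts instead of being retried indefinitely.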

4. Certificate Expiry Monitoring

4.1 Prometheus + ssl_exporter

ssl_exporter exposes TLS certificate expiry times as Prometheus metrics.

# Install ssl_exporter
wget https://github.com/ribbybibby/ssl_exporter/releases/download/v2.4.3/ssl_exporter-2.4.3.linux-amd64.tar.gz
tar xzf ssl_exporter-2.4.3.linux-amd64.tar.gz
sudo mv ssl_exporter-2.4.3.linux-amd64/ssl_exporter /usr/local/bin/

# systemd service (runs as a dedicated ssl-exporter user, which must already exist)
sudo tee /etc/systemd/system/ssl-exporter.service << 'EOF'
[Unit]
Description=SSL Certificate Exporter
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ssl_exporter
Restart=on-failure
User=ssl-exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable --now ssl-exporter

# Prometheus scrape config
# /etc/prometheus/prometheus.yml

scrape_configs:
  - job_name: 'ssl'
    metrics_path: /probe
    static_configs:
      - targets:
          - example.com:443
          - api.example.com:443
          - admin.example.com:443
          - staging.example.com:443
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: ssl-exporter:9219 # ssl_exporter address

Key metrics:

ssl_cert_not_after: Certificate expiry time (Unix timestamp)
ssl_cert_not_before: Start of the certificate's validity period (Unix timestamp)
ssl_tls_version_info: TLS version information
ssl_ocsp_response_status: OCSP response status

# Calculate days until certificate expiry
(ssl_cert_not_after - time()) / 86400

# Find certificates expiring within 30 days
(ssl_cert_not_after - time()) / 86400 < 30

# Find certificates expiring within 7 days
(ssl_cert_not_after - time()) / 86400 < 7
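The same arithmetic can be checked by hand against raw exporter output, which helps when debugging a threshold. The sketch below applies it to Prometheus text-format lines with awk. The sample lines and their single `cn` label are made up for illustration (real ssl_exporter output carries more labels), and `days_until_expiry` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Compute days until expiry from ssl_cert_not_after samples on stdin.

days_until_expiry() {
    local now_epoch=$1
    # Prints "<metric{labels}> <days>" for each ssl_cert_not_after sample
    awk -v now="$now_epoch" '
        /^ssl_cert_not_after/ {
            value = $NF + 0                  # awk parses 1.89e+09 style floats
            days = int((value - now) / 86400)
            print $1, days
        }'
}

sample='# HELP ssl_cert_not_after NotAfter expressed as a Unix Epoch Time
ssl_cert_not_after{cn="example.com"} 1.8934560e+09
ssl_cert_not_after{cn="api.example.com"} 1.8908640e+09'

# Pretend "now" is 2029-12-02 00:00:00 UTC (epoch 1890864000)
printf '%s\n' "$sample" | days_until_expiry 1890864000
```

With that fixed clock the first sample (2030-01-01 00:00:00 UTC) is exactly 30 days out, so it sits right on the boundary of the `< 30` expression and would not yet match it.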

4.2 Alertmanager Alert Rules

# /etc/prometheus/rules/ssl-alerts.yml
groups:
  - name: ssl_certificate_alerts
    rules:
      # Expiring within 30 days - Warning
      - alert: SSLCertExpiringSoon
        expr: (ssl_cert_not_after - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: 'SSL certificate expiring soon ({{ $labels.instance }})'
          description: '{{ $labels.instance }} certificate expires in {{ $value | printf "%.0f" }} days.'
          runbook_url: 'https://wiki.internal/runbooks/ssl-renewal'

      # Expiring within 7 days - Critical
      - alert: SSLCertExpiryCritical
        expr: (ssl_cert_not_after - time()) / 86400 < 7
        for: 10m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: 'SSL certificate expiry urgent ({{ $labels.instance }})'
          description: '{{ $labels.instance }} certificate expires in {{ $value | printf "%.0f" }} days. Immediate action required.'
          runbook_url: 'https://wiki.internal/runbooks/ssl-emergency-renewal'

      # Already expired certificate
      - alert: SSLCertExpired
        expr: (ssl_cert_not_after - time()) < 0
        for: 0m
        labels:
          severity: critical
          escalation: pagerduty
        annotations:
          summary: 'SSL certificate expired ({{ $labels.instance }})'
          description: '{{ $labels.instance }} certificate has expired! Service outage may occur.'

      # Probe failure (connection failed)
      - alert: SSLProbeFailure
        expr: ssl_probe_success == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'SSL probe failed ({{ $labels.instance }})'
          description: 'Cannot establish TLS connection to {{ $labels.instance }}.'

# Alertmanager routing config
# /etc/alertmanager/alertmanager.yml

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-warning'
  routes:
    - match:
        severity: critical
        escalation: pagerduty
      receiver: 'pagerduty-critical'
      repeat_interval: 30m
    - match:
        severity: critical
      receiver: 'slack-critical'
      repeat_interval: 1h

receivers:
  - name: 'slack-warning'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#ssl-alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#incident'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        severity: critical
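The order of the child routes in the routing config matters: Alertmanager tries them top to bottom and, unless `continue` is set, stops at the first match. The sketch below models only that decision for the three receivers defined above; it is an illustration of the logic, not how Alertmanager is implemented, and `pick_receiver` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Model of the route tree: first matching child route wins, and anything
# unmatched falls back to the top-level receiver.

pick_receiver() {
    local severity=$1 escalation=${2:-}
    if [ "$severity" = "critical" ] && [ "$escalation" = "pagerduty" ]; then
        echo "pagerduty-critical"
    elif [ "$severity" = "critical" ]; then
        echo "slack-critical"
    else
        echo "slack-warning"
    fi
}

pick_receiver warning              # SSLCertExpiringSoon, SSLProbeFailure
pick_receiver critical             # SSLCertExpiryCritical
pick_receiver critical pagerduty   # SSLCertExpired
```

If the two child routes were swapped, the broader `severity: critical` match would come first and the PagerDuty route would never be reached.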

4.3 Grafana Dashboard

{
  "title": "SSL Certificate Dashboard",
  "panels": [
    {
      "title": "Days Until Certificate Expiry",
      "type": "table",
      "targets": [
        {
          "expr": "sort_desc((ssl_cert_not_after - time()) / 86400)",
          "legendFormat": "{{ instance }}"
        }
      ]
    },
    {
      "title": "Certificates Expiring Soon (within 30 days)",
      "type": "stat",
      "targets": [
        {
          "expr": "count((ssl_cert_not_after - time()) / 86400 < 30) or vector(0)"
        }
      ],
      "thresholds": [
        { "value": 0, "color": "green" },
        { "value": 1, "color": "orange" },
        { "value": 3, "color": "red" }
      ]
    }
  ]
}

4.4 Standalone Monitoring Script (Without Prometheus)

For environments without Prometheus infrastructure, a shell script can serve as an alternative.

#!/bin/bash
# /usr/local/bin/ssl-expiry-check.sh
# Certificate expiry monitoring + Slack/Email alerts

set -euo pipefail

DOMAINS=(
    "example.com"
    "api.example.com"
    "admin.example.com"
    "staging.example.com"
)

WARNING_DAYS=30
CRITICAL_DAYS=7
SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"
ALERT_EMAIL="ops-team@example.com"

check_cert() {
    local domain=$1
    local port=${2:-443}

    # Get certificate expiry date
    local expiry_date
    expiry_date=$(echo | timeout 10 openssl s_client \
        -servername "$domain" \
        -connect "${domain}:${port}" 2>/dev/null \
        | openssl x509 -noout -enddate 2>/dev/null \
        | cut -d= -f2)

    if [ -z "$expiry_date" ]; then
        echo "UNKNOWN|${domain}|Connection failed"
        return
    fi

    # Calculate remaining days (Linux/macOS compatible)
    local expiry_epoch days_left
    if date --version >/dev/null 2>&1; then
        # GNU date (Linux)
        expiry_epoch=$(date -d "$expiry_date" +%s)
    else
        # BSD date (macOS)
        expiry_epoch=$(date -j -f "%b %d %T %Y %Z" "$expiry_date" +%s)
    fi
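    local now_epoch
    now_epoch=$(date +%s)
    days_left=$(( (expiry_epoch - now_epoch) / 86400 ))

    # Compare epochs directly: integer division rounds toward zero, so a
    # certificate that expired a few hours ago would otherwise report 0 days
    # left and be classified as CRITICAL instead of EXPIRED
    if [ "$expiry_epoch" -le "$now_epoch" ]; then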
        echo "EXPIRED|${domain}|${days_left}|${expiry_date}"
    elif [ "$days_left" -lt "$CRITICAL_DAYS" ]; then
        echo "CRITICAL|${domain}|${days_left}|${expiry_date}"
    elif [ "$days_left" -lt "$WARNING_DAYS" ]; then
        echo "WARNING|${domain}|${days_left}|${expiry_date}"
    else
        echo "OK|${domain}|${days_left}|${expiry_date}"
    fi
}

# Check all domains
alerts=""
for domain in "${DOMAINS[@]}"; do
    result=$(check_cert "$domain")
    status=$(echo "$result" | cut -d'|' -f1)
    days=$(echo "$result" | cut -d'|' -f3)

    case $status in
        OK)
            printf "%-30s %-10s %s days\n" "$domain" "[OK]" "$days"
            ;;
        WARNING)
            printf "%-30s %-10s %s days\n" "$domain" "[WARNING]" "$days"
            alerts="${alerts}WARNING: ${domain} - ${days} days remaining\n"
            ;;
        CRITICAL|EXPIRED)
            printf "%-30s %-10s %s days\n" "$domain" "[$status]" "$days"
            alerts="${alerts}${status}: ${domain} - ${days} days remaining\n"
            ;;
        UNKNOWN)
            printf "%-30s %-10s\n" "$domain" "[UNKNOWN]"
            alerts="${alerts}UNKNOWN: ${domain} - Connection failed\n"
            ;;
    esac
done

# Send alerts
if [ -n "$alerts" ]; then
    if [ -n "$SLACK_WEBHOOK" ]; then
        curl -s -X POST "$SLACK_WEBHOOK" \
          -H 'Content-type: application/json' \
          -d "{\"text\":\"SSL Certificate Check Results:\\n${alerts}\"}"
    fi

    # Email alert (requires mailutils)
    echo -e "$alerts" | mail -s "[SSL Alert] Certificate Expiry Warning" "$ALERT_EMAIL" 2>/dev/null || true
fi

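# Register in cron (daily at 9 AM). Append to root's existing crontab:
# piping a single line into "crontab -" would replace every existing entry.
(sudo crontab -l 2>/dev/null; echo "0 9 * * * /usr/local/bin/ssl-expiry-check.sh >> /var/log/ssl-check.log 2>&1") | sudo crontab -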

5. Incident Response Playbook

5.1 Response Flow for Certificate Expiry Incidents

┌─────────────────────────────────────────────────────────────┐
│  Certificate Expiry Incident Response Flow                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Detection & Confirmation (0-5 min)                      │
│     └→ Assess blast radius, list affected domains           │
│                                                             │
│  2. Immediate Mitigation (5-15 min)                         │
│     └→ Apply temporary certificate or reroute traffic       │
│                                                             │
│  3. Formal Certificate Renewal (15-30 min)                  │
│     └→ certbot renewal or emergency issuance                │
│                                                             │
│  4. Service Verification (30-45 min)                        │
│     └→ Confirm TLS connectivity on all endpoints            │
│                                                             │
│  5. Postmortem (within 48 hours)                            │
│     └→ Root cause analysis, preventive measures             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

5.2 Detailed Step-by-Step Response

Step 1: Detection and Confirmation (0-5 min)

# 1-1. Immediately check expiry status
echo | openssl s_client -servername example.com -connect example.com:443 2>/dev/null \
  | openssl x509 -noout -dates

# 1-2. Check multiple domains at once
for domain in example.com api.example.com admin.example.com; do
    echo -n "$domain: "
    echo | openssl s_client -servername "$domain" -connect "${domain}:443" 2>/dev/null \
      | openssl x509 -noout -enddate 2>/dev/null || echo "CONNECTION FAILED"
done

# 1-3. Check local certificate files
for cert in /etc/letsencrypt/live/*/fullchain.pem; do
    domain=$(basename "$(dirname "$cert")")
    expiry=$(openssl x509 -in "$cert" -noout -enddate | cut -d= -f2)
    echo "$domain: $expiry"
done

# 1-4. Share situation in incident channel
# "SSL certificate expiry confirmed. Blast radius: example.com, api.example.com. Response initiated."

Step 2: Immediate Mitigation (5-15 min)

# 2-1. Force renew Let's Encrypt certificate
sudo certbot renew --force-renewal --cert-name example.com
sudo systemctl reload nginx

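# 2-2. If renewal fails - emergency issuance via standalone
# (standalone uses HTTP-01, which cannot issue wildcards: list hostnames explicitly)
sudo systemctl stop nginx
sudo certbot certonly --standalone -d example.com -d www.example.com -d api.example.com
sudo systemctl start nginx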

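# 2-3. If hitting Let's Encrypt rate limit - temporary self-signed cert
# (-addext needs OpenSSL 1.1.1+; modern clients match on subjectAltName, not CN)
openssl req -x509 -nodes -days 1 -newkey rsa:2048 \
  -keyout /tmp/emergency.key \
  -out /tmp/emergency.crt \
  -subj "/CN=example.com" \
  -addext "subjectAltName=DNS:example.com"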

# Note: Self-signed certs show browser warnings but can serve
# as a temporary measure for API servers or internal communication

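# 2-4. For AWS environments using ACM certificates
# ACM auto-renews only certificates it issued (imported certificates are never
# auto-renewed), so first confirm which certificate the ALB/CloudFront listener
# actually references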
aws elbv2 describe-listeners --load-balancer-arn $ALB_ARN \
  --query 'Listeners[].Certificates[].CertificateArn'

# Check ACM certificate status
aws acm describe-certificate --certificate-arn $CERT_ARN \
  --query 'Certificate.{Status:Status,NotAfter:NotAfter}'

Step 3: Formal Certificate Renewal (15-30 min)

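# 3-1. Validate renewed certificate chain
# (verify the leaf in cert.pem and supply the intermediates from chain.pem;
#  passing fullchain.pem alone checks only its first certificate)
openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt \
  -untrusted /etc/letsencrypt/live/example.com/chain.pem \
  /etc/letsencrypt/live/example.com/cert.pem

# 3-2. Verify certificate and key match (public-key comparison works for RSA and ECDSA)
diff <(openssl x509 -noout -pubkey -in /etc/letsencrypt/live/example.com/fullchain.pem | openssl sha256) \
     <(openssl pkey -pubout -in /etc/letsencrypt/live/example.com/privkey.pem | openssl sha256)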

# 3-3. Deploy to all servers
/usr/local/bin/deploy-cert.sh

Step 4: Service Verification (30-45 min)

# 4-1. TLS connection test
curl -vI https://example.com 2>&1 | grep -E "SSL|expire|subject"

# 4-2. Full endpoint check
for url in https://example.com https://api.example.com/healthz https://admin.example.com; do
    status=$(curl -s -o /dev/null -w "%{http_code}" "$url")
    echo "$url: HTTP $status"
done

# 4-3. External verification (SSL Labs)
echo "Check: https://www.ssllabs.com/ssltest/analyze.html?d=example.com"

Step 5: Postmortem (within 48 hours)

Items to include in the postmortem:

## Incident Postmortem: SSL Certificate Expiry

### Timeline

- HH:MM - Initial alert received (source: Prometheus/user report)
- HH:MM - Incident confirmed, response started
- HH:MM - Certificate renewal completed
- HH:MM - Service confirmed normal

### Blast Radius

- Affected domains: example.com, api.example.com
- Duration: XX minutes
- Affected users: approximately N

### Root Cause

- (e.g.) certbot timer was disabled, so auto-renewal was not running
- (e.g.) DNS validation kept failing but no failure alerts were configured

### Preventive Measures

- [ ] Add monitoring for auto-renewal timer status
- [ ] Set up immediate alerts on renewal failure
- [ ] Confirm 30-day warning alerts are configured

6. Multi-Environment Certificate Management

6.1 Strategy by Environment

Environment	Certificate Type	CA	Renewal Cycle	Notes
dev	Self-signed or mkcert	Self	N/A	Browser warnings acceptable
staging	Let's Encrypt (staging)	LE Staging	90 days	Much higher rate limits; not browser-trusted
prod	Let's Encrypt or ACM	Public CA	90 days / auto	Zero-downtime required
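One way to keep the table above enforceable in scripts is to resolve the ACME endpoint from the environment name in a single place. The sketch below does that with the two public Let's Encrypt directory URLs; the function name and the dev/staging/prod labels are assumptions taken from the table.

```shell
#!/usr/bin/env bash
# Resolve the ACME directory URL for an environment.

acme_directory_for() {
    case $1 in
        prod)    echo "https://acme-v02.api.letsencrypt.org/directory" ;;
        staging) echo "https://acme-staging-v02.api.letsencrypt.org/directory" ;;
        dev)     : ;;   # dev uses mkcert or self-signed certificates, no ACME
        *)       echo "unknown environment: $1" >&2; return 1 ;;
    esac
}

acme_directory_for staging
acme_directory_for prod
```

The result can be passed to certbot's `--server` option, which keeps a staging host from accidentally consuming production rate limits.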

6.2 Development: Using mkcert

For local development environments, use mkcert to create locally-trusted certificates.

# Install mkcert (macOS)
brew install mkcert
mkcert -install  # Add local CA to system certificate store

# Create development certificates
mkcert "*.dev.example.com" localhost 127.0.0.1 ::1

# Output files
# _wildcard.dev.example.com+3.pem (certificate)
# _wildcard.dev.example.com+3-key.pem (key)

6.3 Staging: Using Let's Encrypt Staging

Use Let's Encrypt's staging server in staging to avoid rate limit issues.

# Issue certificate from staging server (much higher rate limits, not browser-trusted)
sudo certbot certonly --staging \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "*.staging.example.com"

# Use -k (insecure) flag for API testing
curl -k https://staging.example.com/api/healthz

6.4 Wildcard Strategy

example.com                  → Single domain + wildcard
├── www.example.com          → Covered by *.example.com
├── api.example.com          → Covered by *.example.com
├── admin.example.com        → Covered by *.example.com
├── staging.example.com      → Separate certificate (staging env)
│   ├── api.staging.example.com → Covered by *.staging.example.com
│   └── admin.staging.example.com → Covered by *.staging.example.com
└── internal.example.com     → Internal only (consider mTLS)

# Production wildcard certificate (apex + wildcard)
sudo certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "example.com" \
  -d "*.example.com" \
  --cert-name prod-wildcard

# Staging wildcard certificate (separate)
sudo certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d "staging.example.com" \
  -d "*.staging.example.com" \
  --cert-name staging-wildcard
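The reason staging needs its own certificate is that a wildcard matches exactly one label: `*.example.com` covers `api.example.com` but neither `example.com` nor `api.staging.example.com`. The sketch below encodes that rule so an inventory can be checked mechanically. `covered_by` is a hypothetical helper that handles only a leading `*.` wildcard.

```shell
#!/usr/bin/env bash
# Test whether a certificate name (exact or "*.domain") covers a hostname.

covered_by() {
    local pattern=$1 host=$2
    case $pattern in
        \*.*)
            local base=${pattern#\*.}   # "*.example.com" -> "example.com"
            local rest=${host#*.}       # host minus its first label
            # Exactly one non-empty label in front of the base domain
            [ -n "${host%%.*}" ] && [ "$host" != "$rest" ] && [ "$rest" = "$base" ]
            ;;
        *)
            [ "$pattern" = "$host" ]
            ;;
    esac
}

for host in api.example.com example.com api.staging.example.com; do
    if covered_by "*.example.com" "$host"; then
        echo "$host: covered by *.example.com"
    else
        echo "$host: NOT covered by *.example.com"
    fi
done
```

This is why both production and staging certificates above list the bare domain alongside the wildcard.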

6.5 Kubernetes: cert-manager

In Kubernetes environments, manage certificates declaratively with cert-manager.

# cert-manager ClusterIssuer (Let's Encrypt)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          cloudflare:
            email: admin@example.com
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token

# Certificate resource
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com-tls
  namespace: istio-system
spec:
  secretName: example-com-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - example.com
    - '*.example.com'
  # Auto-renewal: 30 days before expiry
  renewBefore: 720h # 30 days
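As a quick sanity check on the `renewBefore` value in the Certificate above, the arithmetic can be written out: with a 90-day certificate and `renewBefore: 720h`, renewal is attempted around day 60 of the certificate's life. The helper name below is illustrative, and the 90-day validity is the Let's Encrypt figure used throughout this playbook.

```shell
#!/usr/bin/env bash
# Day of the certificate's life on which renewal starts.

renewal_day() {
    local validity_days=$1 renew_before_hours=$2
    echo $(( validity_days - renew_before_hours / 24 ))
}

renewal_day 90 720   # 90-day certificate, renewBefore: 720h
```

That leaves 30 days of retries before expiry, which lines up with the 30-day warning threshold used for monitoring.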

# Ingress with automatic certificate issuance
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    cert-manager.io/cluster-issuer: 'letsencrypt-prod'
spec:
  tls:
    - hosts:
        - example.com
        - api.example.com
      secretName: example-com-tls
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80

Useful cert-manager monitoring commands:

# Check certificate status
kubectl get certificates -A
kubectl describe certificate example-com-tls -n istio-system

# Check certificate events
kubectl get events --field-selector reason=IssueError -A

# cert-manager logs
kubectl logs -n cert-manager deploy/cert-manager -f

7. Operational Checklists

Certificate Issuance Checklist

Key algorithm selected (ECDSA P-256 recommended)
Certificate scope decided (single / wildcard / SAN)
DNS API credentials prepared for DNS-01 validation
Certificate file permissions set (chmod 600 privkey.pem)
Confirmed using fullchain.pem (using cert.pem alone causes incomplete chain)

Auto-Renewal Checklist

systemd timer active (systemctl is-active certbot-renewal.timer)
certbot renew --dry-run succeeds
deploy-hook configured for web server reload
Alerts configured for renewal failures
Retry logic implemented for renewal failures

Monitoring Checklist

ssl_exporter or custom script monitoring expiry dates
Warning at 30 days, critical at 7 days before expiry
Slack/PagerDuty alert channels connected
Certificate status panel added to Grafana dashboard
Weekly certificate report automated

Incident Preparedness Checklist

Runbook written for certificate expiry response
Emergency contact channel (on-call) designated
Previous certificate backups retained
Alternative issuance method available (acme.sh, etc.)
Contingency plan for rate limit exhaustion (staging CA, alternative CA)

Multi-Environment Checklist

dev: mkcert local certificates in use
staging: Let's Encrypt staging CA in use
prod: Public CA + auto-renewal + monitoring
Kubernetes: cert-manager installed with ClusterIssuer configured
Renewal schedule documented per certificate (certificate inventory)

8. Conclusion

Here are the core principles of certificate operations:

1. Manual renewal will inevitably fail. Let's Encrypt's 90-day validity period is designed in part to encourage automation. Fully automate renewal with certbot + systemd timer, cert-manager, or equivalent.

2. Automation alone is not enough. Auto-renewal can fail, and the causes are diverse: DNS API token expiry, disk space exhaustion, CA server outages. You must build monitoring alongside automation.

3. Incidents will happen. To avoid panic when an expiry incident occurs, write runbooks in advance and practice regularly. Recovery time is proportional to preparedness.

4. Manage certificates as an inventory. As domains grow, it becomes harder to track "what certificate is where and when does it expire." Document your certificate list and register every one in your monitoring targets without exception.
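For the inventory in point 4, even a flat file is enough to start with. The sketch below assumes a pipe-separated format of `cert_name|location|not_after_epoch`; that format, the `soonest_expiring` helper, and the sample rows are all illustrative (the first two names reuse the `--cert-name` values from Section 6.4).

```shell
#!/usr/bin/env bash
# Sort a flat-file certificate inventory by expiry, soonest first.

soonest_expiring() {
    # stdin: "cert_name|location|not_after_epoch" lines; '#' comments allowed
    grep -v -e '^#' -e '^$' | sort -t'|' -k3,3n
}

inventory='# cert_name|location|not_after_epoch
prod-wildcard|web01,web02,web03|1893456000
staging-wildcard|staging01|1890864000
internal-api|k8s/istio-system|1896048000'

printf '%s\n' "$inventory" | soonest_expiring | head -n 1
```

Keeping the location in the same record answers both halves of the question in one lookup: which certificate expires next, and where it is deployed.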

Build your own certificate operations system based on the checklists and scripts in this playbook. Once it is properly established and regularly exercised, certificate expiry incidents become rare and quick to recover from.
