Kubernetes Autoscaling Complete Guide: Production Workload Auto-Scaling Strategies with HPA, VPA, and KEDA

Introduction
HPA v2 In-Depth Analysis
VPA Architecture and Operational Strategy
KEDA Event-Driven Scaling
HPA vs VPA vs KEDA Comparison
- Recommended Scenarios
Composite Scaling Patterns
- HPA + VPA Combination
- HPA + KEDA Combination
Operational Considerations
Failure Cases and Recovery Procedures
Production Checklist
References
Conclusion

Introduction

In production Kubernetes clusters, autoscaling is not optional but essential. Too few Pods during a traffic spike leads to service outages, while over-provisioning wastes thousands of dollars in cloud costs each month. The Kubernetes ecosystem provides three core autoscalers to address these challenges.

HPA (Horizontal Pod Autoscaler): Horizontally scales Pod replica count based on CPU, memory, and custom metrics
VPA (Vertical Pod Autoscaler): Automatically adjusts Pod resource requests/limits based on historical usage analysis
KEDA (Kubernetes Event-Driven Autoscaler): Scales workloads from 0 to N based on external event sources such as message queues, HTTP request counts, and cron schedules

This article covers the architecture of each autoscaler, in-depth configuration strategies for production environments, real-world failure cases with recovery procedures, and a comprehensive checklist that must be verified before production deployment.

HPA v2 In-Depth Analysis

Architecture and Metrics Collection Flow

HPA v2 (autoscaling/v2) is the default horizontal scaling mechanism in Kubernetes. The HPA controller collects metrics at a default interval of 15 seconds (--horizontal-pod-autoscaler-sync-period) and determines the replica count by calculating the ratio between target utilization and current utilization.

The metrics collection flow is as follows.

Metrics Server: Collects CPU/memory metrics from kubelet's cAdvisor and exposes them via the metrics.k8s.io API
Custom Metrics Adapter: Exposes custom metrics from Prometheus, Datadog, etc. via the custom.metrics.k8s.io API
External Metrics Adapter: Exposes metrics from systems outside the cluster via the external.metrics.k8s.io API

Scaling Algorithm

The core formula of HPA is as follows.

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

For example, if 3 Pods are currently using 80% CPU and the target is 50%, the calculation yields ceil(3 * (80/50)) = ceil(4.8) = 5 Pods. When multiple metrics are specified, HPA independently calculates for each metric and adopts the largest value.
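
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the ratio formula only; the function names are hypothetical, and the real controller also applies tolerance, readiness handling, and the behavior policies.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Apply the HPA ratio formula for a single metric."""
    return math.ceil(current_replicas * (current_value / target_value))


def desired_for_all_metrics(current_replicas: int, metrics: list[tuple[float, float]]) -> int:
    """Evaluate each (current, target) pair independently and keep the largest result."""
    return max(desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics)


# Worked example from the text: 3 Pods at 80% CPU with a 50% target.
print(desired_replicas(3, 80, 50))  # 5

# Two metrics: CPU 80% vs. a 50% target, memory 60% vs. a 70% target.
# CPU asks for 5 replicas, memory for 3, so the larger value wins.
print(desired_for_all_metrics(3, [(80, 50), (60, 70)]))  # 5
```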

Production HPA v2 Manifest

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: '1000'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 5
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
      selectPolicy: Min

The key point is the behavior field.

scaleUp: Sets a stabilization window of 60 seconds, scaling up by the larger of 50% or 5 Pods per minute.
scaleDown: Sets a stabilization window of 300 seconds (5 minutes), scaling down by at most 10% every 2 minutes. This conservative scale-down is necessary to handle traffic re-surges.
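
To see what these policies allow in a single period, the following sketch computes the replica bounds for the manifest above. It is a simplification under stated assumptions: the function names are hypothetical, replicas already added or removed earlier in the same period are ignored, and the rounding (up for scale-up, down for scale-down) reflects my understanding of the controller rather than a guaranteed contract.

```python
def scale_up_limit(start_replicas: int, percent: int, pods: int) -> int:
    """Largest replica count reachable in one period with selectPolicy: Max."""
    by_percent = -(-start_replicas * (100 + percent) // 100)  # integer ceiling division
    by_pods = start_replicas + pods
    return max(by_percent, by_pods)


def scale_down_limit(start_replicas: int, percent: int) -> int:
    """Smallest replica count reachable in one period with a single Percent policy."""
    return start_replicas * (100 - percent) // 100  # rounds down


# With few replicas the "5 Pods" policy dominates; with many, "50%" does.
print(scale_up_limit(4, percent=50, pods=5))   # 9
print(scale_up_limit(20, percent=50, pods=5))  # 30

# Scale-down: at most 10% of the replicas can be removed per 120-second period.
print(scale_down_limit(50, percent=10))        # 45
```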

Metrics Server Installation Verification

# Verify Metrics Server is working
kubectl top nodes
kubectl top pods -n production

# Check HPA status
kubectl get hpa api-server-hpa -n production -o yaml

# Check HPA events
kubectl describe hpa api-server-hpa -n production | grep -A 20 "Events"

If kubectl top nodes fails, either Metrics Server is not installed or it is not functioning properly. In this case, HPA cannot read CPU/memory metrics, and resource-based autoscaling will not work at all.

VPA Architecture and Operational Strategy

VPA Components

VPA consists of three main components.

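Recommender: Analyzes historical resource usage and OOM events to calculate optimal CPU/memory request values. Uses a histogram-based algorithm over usage samples; with default settings the target recommendation tracks roughly the 90th percentile of observed usage, and the 95th percentile is used for the upper bound.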
Updater: Evicts Pods when current resource settings deviate significantly from recommended values, allowing new recommendations to be applied.
Admission Controller: A webhook that automatically applies recommended resource values to newly created or restarted Pods.

Operation Modes

VPA supports four updateMode options.

Off: Only provides recommendations without making actual changes. Recommended during initial production adoption.
Initial: Applies recommended values only at Pod creation time. Does not modify already running Pods.
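Recreate: Recreates Pods when the difference from recommended values is significant. Use it together with a PodDisruptionBudget to limit simultaneous evictions.
Auto: Currently behaves like Recreate, applying recommendations by evicting and recreating Pods. Restart-free updates depend on the in-place Pod resize feature (alpha since Kubernetes 1.27), which newer VPA releases expose through a separate InPlaceOrRecreate mode.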

Production VPA Manifest

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: 'Off'
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        minAllowed:
          cpu: '100m'
          memory: '128Mi'
        maxAllowed:
          cpu: '4'
          memory: '8Gi'
        controlledResources:
          - cpu
          - memory
        controlledValues: RequestsAndLimits

In production environments, you must start with Off mode and observe the stability of recommendations for at least 1-2 weeks before switching to Auto or Recreate mode. Always configure minAllowed and maxAllowed to prevent abnormal recommendations from being applied.
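
The effect of minAllowed and maxAllowed can be pictured as a simple clamp on the raw recommendation. The sketch below assumes CPU values expressed in millicores and uses a hypothetical function name; it illustrates the bounds from the manifest above, not the Recommender's internals.

```python
def clamp_recommendation(recommended: int, min_allowed: int, max_allowed: int) -> int:
    """Keep a raw recommendation (millicores) inside the configured bounds."""
    return max(min_allowed, min(recommended, max_allowed))


# Bounds from the manifest: minAllowed cpu 100m, maxAllowed cpu 4 (4000m).
print(clamp_recommendation(50, 100, 4000))    # 100  (too low, raised to minAllowed)
print(clamp_recommendation(750, 100, 4000))   # 750  (within bounds, unchanged)
print(clamp_recommendation(6000, 100, 4000))  # 4000 (abnormally high, capped)
```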

Checking VPA Recommendations

# Check VPA recommendations
kubectl describe vpa api-server-vpa -n production

# Print the first container's recommendation to compare against its current resource requests
kubectl get vpa api-server-vpa -n production -o jsonpath='{.status.recommendation.containerRecommendations[0]}'

KEDA Event-Driven Scaling

KEDA Architecture

KEDA is a CNCF Graduated project that extends Kubernetes HPA to enable scaling based on external event sources. The core components of KEDA are as follows.

KEDA Operator: Watches ScaledObject/ScaledJob CRDs, automatically creates and manages HPAs. Scales Pods to 0 when there are no events, activates to 1 when events occur, then hands control to HPA.
Metrics Server (KEDA): Exposes external event source metrics via the Kubernetes External Metrics API.
Scalers: Adapters that connect to 65+ event sources (Kafka, RabbitMQ, AWS SQS, Prometheus, PostgreSQL, Cron, etc.).

Scaling Flow

KEDA scaling operates in two phases.

Activation Phase: The KEDA Operator monitors event sources and activates the Deployment from 0 to 1 replica when trigger conditions are met.
Scaling Phase: After activation, the HPA created by KEDA handles scaling from 1 to N based on metrics.

KEDA ScaledObject Manifest

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaledobject
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 300
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 100
  fallback:
    failureThreshold: 3
    replicas: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-broker:9092
        consumerGroup: order-processor-group
        topic: orders
        lagThreshold: '50'
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_total
        query: sum(rate(http_requests_total[2m]))
        threshold: '100'

Here is a breakdown of the key configuration elements.

pollingInterval: Interval in seconds for checking event sources. Default is 30 seconds; reduce to 15 seconds when responsiveness is critical.
cooldownPeriod: Wait time in seconds after the last trigger activation before scaling down to 0.
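idleReplicaCount: Number of replicas to maintain when there are no events. Setting it to 0 (the only value KEDA currently supports for this field) enables Scale-to-Zero, while minReplicaCount still applies once the workload is active.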
fallback: Safety mechanism for metric collection failures. Maintains the number of replicas specified in replicas after failureThreshold consecutive failures.
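
As a rough illustration of how lagThreshold, idleReplicaCount, minReplicaCount, and maxReplicaCount interact for the Kafka trigger above, consider the sketch below. It is a simplified model with a hypothetical function name: it treats lagThreshold as an average-value target per replica and ignores cooldownPeriod, activation thresholds, and any scaler-specific caps such as partition count.

```python
import math


def keda_replicas(total_lag: int, lag_threshold: int = 50, idle: int = 0,
                  min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Approximate replica count for a given total consumer lag."""
    if total_lag <= 0:
        return idle  # no pending events: fall back to idleReplicaCount
    wanted = math.ceil(total_lag / lag_threshold)
    return max(min_replicas, min(wanted, max_replicas))


print(keda_replicas(0))      # 0   (scaled to zero)
print(keda_replicas(30))     # 1   (activated, held at minReplicaCount)
print(keda_replicas(1200))   # 24  (1200 / 50)
print(keda_replicas(10000))  # 100 (capped at maxReplicaCount)
```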

KEDA Installation and Verification

# Install KEDA with Helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

# Check KEDA status
kubectl get pods -n keda
kubectl get scaledobjects -n production
kubectl get hpa -n production

# Check ScaledObject details
kubectl describe scaledobject order-processor-scaledobject -n production

HPA vs VPA vs KEDA Comparison

Category	HPA	VPA	KEDA
Scaling Direction	Horizontal (Pod count)	Vertical (resource requests/limits)	Horizontal (event-driven Pod count)
Default Metrics	CPU, Memory	CPU, Memory usage history	External event sources (65+ scalers)
Custom Metrics	Supported (Adapter required)	Not supported	Built-in support
Scale-to-Zero	Not supported by default (minReplicas >= 1)	Not applicable	Supported
Pod Restart	Not required	Required (Recreate mode)	Not required
Suitable Workloads	Stateless web services, APIs	All workloads (resource optimization)	Event-driven, batch jobs, queue processing
Built into Kubernetes	Yes	Separate installation required	Separate installation required
Relationship with HPA	-	May conflict on CPU/memory metrics	Internally creates/manages HPA
Learning Curve	Low	Medium	Medium to High
Cost Optimization	Medium (possible over-scaling)	High (right-sizing)	High (Scale-to-Zero)

Recommended Scenarios

General web services: HPA (CPU-based) + VPA (Off mode for monitoring)
Event-driven microservices: KEDA (Kafka/SQS triggers)
Batch jobs: KEDA (ScaledJob for automatic termination upon job completion)
API gateways: HPA (RPS-based custom metrics)
Legacy applications: VPA (when vertical scaling is the only option)

Composite Scaling Patterns

HPA + VPA Combination

The most important principle when using HPA and VPA simultaneously is to ensure they do not conflict on the same metrics.

# HPA: Horizontal scaling based on custom metrics (RPS) only
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: '500'
---
# VPA: Vertical adjustment of CPU/memory only (no conflict since HPA uses only custom metrics)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: 'Auto'
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        controlledResources:
          - cpu
          - memory
        minAllowed:
          cpu: '200m'
          memory: '256Mi'
        maxAllowed:
          cpu: '2'
          memory: '4Gi'

In this pattern, HPA adjusts the Pod count based solely on RPS (requests per second), while VPA optimizes CPU/memory resources. If both act on the same metrics (CPU/memory), they feed back into each other: HPA measures utilization against the Pod's requests, so when HPA adds Pods to lower utilization and VPA then shrinks the requests, utilization rises again and HPA scales once more. The result is a thrashing loop.
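
The feedback effect is easy to reproduce numerically. The sketch below uses hypothetical names and made-up usage figures; it only shows that lowering a CPU request changes the utilization HPA sees even though actual usage is unchanged.

```python
import math


def utilization_percent(usage_millicores: int, request_millicores: int) -> float:
    """Utilization as HPA sees it: usage relative to the request."""
    return 100 * usage_millicores / request_millicores


def hpa_desired(current_replicas: int, utilization: float, target: float) -> int:
    """HPA ratio formula applied to a utilization target."""
    return math.ceil(current_replicas * utilization / target)


# Each Pod uses 300m CPU. With a 500m request, utilization sits at the 60% target.
before = utilization_percent(300, 500)
print(before, hpa_desired(4, before, 60))  # 60.0 4

# VPA lowers the request to 350m. Usage is unchanged, but utilization jumps,
# so HPA now wants more replicas.
after = utilization_percent(300, 350)
print(round(after, 1), hpa_desired(4, after, 60))  # 85.7 6
```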

HPA + KEDA Combination

Since KEDA internally creates and manages an HPA for each ScaledObject, do not create a separate HPA for the same target. Instead, combine multiple event sources as triggers in a single ScaledObject.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: multi-trigger-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: nginx_connections_active
        query: sum(nginx_ingress_controller_nginx_process_connections)
        threshold: '100'
    - type: cron
      metadata:
        timezone: Asia/Seoul
        start: 30 8 * * 1-5
        end: 30 18 * * 1-5
        desiredReplicas: '10'

This example combines Prometheus metric-based dynamic scaling with cron-based scheduled scaling. From 8:30 AM to 6:30 PM on weekdays, a minimum of 10 Pods is maintained while allowing additional scaling based on active connection count.
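
The interaction between the two triggers can be sketched as "take the larger of the two demands, then clamp". The code below is an illustrative model with hypothetical function names; it assumes the timestamp passed in is already local Asia/Seoul time and that the Prometheus threshold acts as a per-replica target.

```python
import math
from datetime import datetime, time


def in_business_window(now: datetime) -> bool:
    """True on Mon-Fri between 08:30 and 18:30 (local time assumed to be Asia/Seoul)."""
    return now.weekday() < 5 and time(8, 30) <= now.time() < time(18, 30)


def combined_replicas(now: datetime, active_connections: int, threshold: int = 100,
                      cron_replicas: int = 10, min_replicas: int = 2,
                      max_replicas: int = 50) -> int:
    """Pick the larger of the metric-driven and cron-driven demand, then clamp."""
    from_metric = math.ceil(active_connections / threshold)
    from_cron = cron_replicas if in_business_window(now) else 0
    return max(min_replicas, min(max(from_metric, from_cron), max_replicas))


monday_morning = datetime(2024, 1, 8, 9, 0)    # a Monday
saturday_morning = datetime(2024, 1, 6, 9, 0)  # a Saturday

print(combined_replicas(monday_morning, 300))    # 10 (cron floor wins)
print(combined_replicas(monday_morning, 2000))   # 20 (metric wins)
print(combined_replicas(saturday_morning, 300))  # 3  (no cron floor on weekends)
```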

Operational Considerations

1. Preventing Flapping

Flapping is when HPA scales up and down in rapid succession. To prevent it, check the following.

Set behavior.scaleDown.stabilizationWindowSeconds to 300 seconds or more
If metrics have high variability, set averageUtilization target with 10-15% headroom
Understand and utilize the tolerance value (default 0.1, i.e., 10%). Scaling does not occur when current metrics are within 90-110% of the target
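
The tolerance rule can be expressed as a small guard around the ratio formula. This is a sketch with a hypothetical function name, assuming the default tolerance of 0.1 described above.

```python
import math


def desired_with_tolerance(current_replicas: int, current_value: float,
                           target_value: float, tolerance: float = 0.1) -> int:
    """Skip scaling when the metric ratio is within the tolerance band around 1.0."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Target 60% utilization with 10 replicas.
print(desired_with_tolerance(10, 64, 60))  # 10 (ratio ~1.07, inside the band)
print(desired_with_tolerance(10, 70, 60))  # 12 (ratio ~1.17, scale up)
print(desired_with_tolerance(10, 50, 60))  # 9  (ratio ~0.83, scale down)
```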

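2. Missing Resource Requests

If CPU/memory requests are not set on Pods, HPA's percentage-based (Utilization) scaling will not work, because utilization is calculated relative to the requested amount, not the limit. You must configure resources.requests.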

resources:
  requests:
    cpu: '500m'
    memory: '512Mi'
  limits:
    cpu: '1000m'
    memory: '1Gi'
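
To make the dependency on requests concrete, the sketch below computes average utilization the way it is described above: total usage divided by total requests. The function name and the sample numbers are hypothetical, and the error case simply models the fact that no percentage can be derived without a request.

```python
def average_utilization(pods: list[dict]) -> float:
    """Utilization (%) across Pods; each dict holds 'usage' and 'request' in millicores."""
    if any(p.get("request") is None for p in pods):
        raise ValueError("a CPU request is required to compute Utilization")
    total_usage = sum(p["usage"] for p in pods)
    total_request = sum(p["request"] for p in pods)
    return 100 * total_usage / total_request


# Three Pods, each requesting 500m as in the snippet above. Limits play no role here.
pods = [
    {"usage": 300, "request": 500},
    {"usage": 400, "request": 500},
    {"usage": 500, "request": 500},
]
print(average_utilization(pods))  # 80.0
```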

3. Metrics Lag

Metrics Server has approximately 15 seconds of lag, and Prometheus-based custom metrics can have 30-60 seconds of delay when combining scrape interval and adapter processing time. During traffic spikes, existing Pods may become overloaded during this delay, so consider the following.

Set minReplicas slightly above the average traffic baseline
Configure Readiness Probes appropriately to minimize the time before new Pods are ready to receive traffic
Use KEDA's Cron trigger for pre-scaling when surges are predictable
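
A quick way to reason about this lag is to add up each stage between a traffic spike and the moment new Pods serve requests. The figures below are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def worst_case_reaction_seconds(scrape_interval: int, adapter_delay: int,
                                hpa_sync_period: int, pod_startup: int) -> int:
    """Rough upper bound on the time between a spike and new Pods becoming ready."""
    return scrape_interval + adapter_delay + hpa_sync_period + pod_startup


# Assumed values: 30 s Prometheus scrape, 15 s adapter delay,
# 15 s HPA sync period, 45 s until a new Pod passes its readiness probe.
print(worst_case_reaction_seconds(30, 15, 15, 45))  # 105
```

If the resulting number exceeds what existing Pods can absorb, that is a signal to raise minReplicas or pre-scale ahead of known peaks.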

4. Conflicts When Using VPA and HPA Simultaneously

When VPA is in Auto or Recreate mode while HPA uses CPU/memory metrics, both will make conflicting decisions. You must choose one of the following approaches.

HPA uses only custom metrics, VPA adjusts CPU/memory
Set VPA to Off mode and only reference recommendations

5. Integration with Cluster Autoscaler

Even if HPA increases the Pod count, Pods will remain in Pending state if no available nodes exist in the cluster. Cluster Autoscaler (or Karpenter) must be integrated for node-level scaling.

Failure Cases and Recovery Procedures

Case 1: Metrics Server Failure Disabling HPA

Symptom: All HPAs show <unknown> in the TARGETS column, and the Pod count remains fixed.

# Diagnosis
kubectl get hpa -A
kubectl top nodes  # If this fails, it is a Metrics Server issue

# Check Metrics Server status
kubectl get pods -n kube-system | grep metrics-server
kubectl logs -n kube-system deployment/metrics-server --tail=50

Recovery Procedure:

Restart Metrics Server Pod: kubectl rollout restart deployment/metrics-server -n kube-system
Verify APIService registration: kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
Reinstall Metrics Server if the issue persists
Manually adjust replica count until recovery: kubectl scale deployment/api-server --replicas=10 -n production

Case 2: Cascading OOM Kills

Symptom: VPA recommendations are set lower than actual peak usage, causing Pods to be repeatedly OOMKilled.

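# Check for OOMKilled containers (last termination reason per Pod)
kubectl get pods -n production -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

# Check Pod restart counts (quoted so the shell does not expand the brackets)
kubectl get pods -n production -o custom-columns='NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount'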

Recovery Procedure:

Immediately change VPA updateMode to Off to prevent further Pod modifications
Manually increase memory requests/limits for the affected Deployment
Verify that VPA maxAllowed values sufficiently cover actual peak usage
Re-observe recommendation stability in Off mode for at least 2 weeks

Case 3: Scaling Storm

Symptom: HPA rapidly scales up then immediately scales down repeatedly, continuously creating and deleting Pods.

# Check for frequent SuccessfulRescale events in HPA
kubectl describe hpa api-server-hpa -n production | grep SuccessfulRescale

Recovery Procedure:

Increase behavior.scaleDown.stabilizationWindowSeconds to 600 seconds or more
Limit scale-down ratio in scaleDown.policies to 5% or less
Analyze metric source variability and apply appropriate smoothing
Temporarily disable HPA and switch to manual management if necessary
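
The reason a longer stabilization window damps a scaling storm is that, for scale-down, HPA acts on the highest recommendation seen within the window rather than the latest one. The sketch below models that idea with a hypothetical function name and made-up recommendation history; it assumes at least one recommendation falls inside the window.

```python
def stabilized_scale_down(recommendations: list[tuple[int, int]], now: int,
                          window_seconds: int) -> int:
    """Return the highest desired replica count recorded within the window.

    recommendations: (timestamp_seconds, desired_replicas) pairs.
    """
    recent = [replicas for ts, replicas in recommendations if now - ts <= window_seconds]
    return max(recent)


history = [(0, 10), (120, 6), (240, 8), (300, 5)]

# With a 300-second window the earlier recommendation of 10 still holds replicas up.
print(stabilized_scale_down(history, now=300, window_seconds=300))  # 10

# With a 60-second window only the last two recommendations count.
print(stabilized_scale_down(history, now=300, window_seconds=60))   # 8
```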

Case 4: KEDA Metrics Collection Failure

Symptom: ScaledObject status shows Unknown, and metrics for the associated HPA are not being collected.

# Check ScaledObject status
kubectl get scaledobject -n production
kubectl describe scaledobject order-processor-scaledobject -n production

# Check KEDA Operator logs
kubectl logs -n keda deployment/keda-operator --tail=100 | grep -i error

Recovery Procedure:

Verify connection status of event sources (Kafka, Prometheus, etc.)
Check if the ScaledObject has a fallback configuration. If not, add one to maintain safe replica count during metric failures
Restart KEDA Operator: kubectl rollout restart deployment/keda-operator -n keda
Verify that authentication credentials (TriggerAuthentication) have not expired
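
The fallback behavior referenced in the recovery steps can be modeled as a counter of consecutive failed polls. This is a simplified sketch with hypothetical names, using failureThreshold 3 and replicas 5 from the earlier ScaledObject; a poll result of None stands for a failed metric read.

```python
from typing import Optional


def replicas_after_polls(polls: list[Optional[int]], failure_threshold: int = 3,
                         fallback_replicas: int = 5, initial: int = 1) -> int:
    """Replay poll results; None means the scaler failed to read the metric."""
    current = initial
    failures = 0
    for result in polls:
        if result is None:
            failures += 1
            if failures >= failure_threshold:
                current = fallback_replicas  # hold the safe replica count
        else:
            failures = 0
            current = result
    return current


print(replicas_after_polls([8, None, None]))            # 8  (below the threshold)
print(replicas_after_polls([8, None, None, None]))      # 5  (fallback engaged)
print(replicas_after_polls([8, None, None, None, 12]))  # 12 (metrics recovered)
```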

Production Checklist

Verify the following items before deployment.

HPA Related:

Is Metrics Server functioning properly, and does kubectl top pods succeed?
Is minReplicas set to the minimum value capable of handling normal traffic?
Is maxReplicas within cluster capacity and cost limits?
Is behavior.scaleDown.stabilizationWindowSeconds set to 300 seconds or more?
Are resources.requests configured on the target Deployment?

VPA Related:

Does the initial production deployment start with updateMode: "Off"?
Are minAllowed and maxAllowed appropriately configured?
Is a PodDisruptionBudget configured?
Is the VPA free of conflicts with HPA on the same metrics (CPU/memory)?

KEDA Related:

Does the fallback configuration guarantee a safe replica count during metric failures?
Is cooldownPeriod configured appropriately for the workload characteristics?
Are TriggerAuthentication credentials valid?
Is Cold Start time within SLA when using Scale-to-Zero?

Common:

Is Cluster Autoscaler or Karpenter handling node-level scaling?
Are alerts configured for autoscaling events?
Can replica count, metric values, and scaling events be monitored via Grafana dashboards?
Are runbooks documented and understood by team members?

References

Conclusion

Kubernetes autoscaling does not end with simply deploying a single HPA. In production environments, you must comprehensively design for metrics collection stability, understanding of scaling algorithms, cooldown strategies, and recovery procedures during failures.

The key principles are summarized as follows.

Do not rely on a single autoscaler: HPA, VPA, and KEDA each have their strengths. Combine them according to your workload characteristics.
Scale down conservatively: Scale up quickly, scale down slowly. You must be prepared for traffic re-surges.
Prepare for metric failures: Metrics Server, Prometheus, and external event sources can all experience failures. Always configure fallback strategies.
Observe first, automate later: Start VPA in Off mode, configure HPA behavior conservatively, then gradually increase the level of automation.
Monitoring and alerting are fundamental: Visualize autoscaling decision processes in Grafana dashboards and set up alerts for abnormal scaling patterns.

When autoscaling is properly implemented, you can provide stable service even during traffic spikes while cutting cloud costs, potentially by 30-50% depending on the workload. We hope the patterns and checklists covered in this article provide practical help in establishing autoscaling strategies for production environments.