Kubernetes Autoscaling Complete Guide: Production Workload Auto-Scaling Strategies with HPA, VPA, and KEDA


Introduction

In production Kubernetes clusters, autoscaling is essential, not optional. Too few Pods during a traffic spike means a service outage, while over-provisioning burns unnecessary cloud spend — at large scale, easily hundreds of thousands of dollars a month. The Kubernetes ecosystem provides three core autoscalers to address these challenges.

  • HPA (Horizontal Pod Autoscaler): Horizontally scales Pod replica count based on CPU, memory, and custom metrics
  • VPA (Vertical Pod Autoscaler): Automatically adjusts Pod resource requests/limits based on historical usage analysis
  • KEDA (Kubernetes Event-Driven Autoscaler): Scales workloads from 0 to N based on external event sources such as message queues, HTTP request counts, and cron schedules

This article covers the architecture of each autoscaler, in-depth configuration strategies for production environments, real-world failure cases with recovery procedures, and a comprehensive checklist that must be verified before production deployment.

HPA v2 In-Depth Analysis

Architecture and Metrics Collection Flow

HPA v2 (autoscaling/v2) is the default horizontal scaling mechanism in Kubernetes. The HPA controller collects metrics at a default interval of 15 seconds (--horizontal-pod-autoscaler-sync-period) and determines the replica count by calculating the ratio between target utilization and current utilization.

The metrics collection flow is as follows.

  1. Metrics Server: Collects CPU/memory metrics from kubelet's cAdvisor and exposes them via the metrics.k8s.io API
  2. Custom Metrics Adapter: Exposes custom metrics from Prometheus, Datadog, etc. via the custom.metrics.k8s.io API
  3. External Metrics Adapter: Exposes metrics from systems outside the cluster via the external.metrics.k8s.io API

Scaling Algorithm

The core formula of HPA is as follows.

desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))

For example, if 3 Pods are currently using 80% CPU and the target is 50%, the calculation yields ceil(3 * (80/50)) = ceil(4.8) = 5 Pods. When multiple metrics are specified, HPA independently calculates for each metric and adopts the largest value.
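As a quick sanity check, the formula can be sketched in Python (illustrative only — the real controller additionally applies a tolerance band and excludes not-yet-ready Pods from the calculation):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """HPA core formula: scale by the ratio of current to target metric."""
    return math.ceil(current_replicas * (current_metric / target_metric))

# 3 Pods at 80% CPU against a 50% target -> ceil(4.8) = 5 Pods
print(desired_replicas(3, 80, 50))  # 5

# With multiple metrics, HPA computes each independently and takes the largest
metrics = [(80, 50), (60, 70)]  # (current, target) pairs
print(max(desired_replicas(3, cur, tgt) for cur, tgt in metrics))  # 5
```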

Production HPA v2 Manifest

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: '1000'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 5
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
      selectPolicy: Min

The key point is the behavior field.

  • scaleUp: Sets a stabilization window of 60 seconds, scaling up by the larger of 50% or 5 Pods per minute.
  • scaleDown: Sets a stabilization window of 300 seconds (5 minutes), scaling down by at most 10% every 2 minutes. This conservative scale-down is necessary to handle traffic re-surges.
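A rough Python sketch of how the scaleUp policies above bound a single scaling step (illustrative; the real controller also enforces maxReplicas and the stabilization window):

```python
import math

def scale_up_limit(current: int, percent: int, pods: int) -> int:
    """Upper bound on replicas after one scale-up period,
    mirroring behavior.scaleUp with selectPolicy: Max."""
    by_percent = current + math.ceil(current * percent / 100)
    by_pods = current + pods
    return max(by_percent, by_pods)

# With the manifest's policies (50% or 5 Pods per 60s period):
print(scale_up_limit(4, 50, 5))   # 9  -> the Pods policy dominates at low counts
print(scale_up_limit(20, 50, 5))  # 30 -> the Percent policy dominates at high counts
```

With selectPolicy: Min (as in the scaleDown section), the smaller of the two bounds would apply instead, which is why scale-down is the conservative direction.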

Metrics Server Installation Verification

# Verify Metrics Server is working
kubectl top nodes
kubectl top pods -n production

# Check HPA status
kubectl get hpa api-server-hpa -n production -o yaml

# Check HPA events
kubectl describe hpa api-server-hpa -n production | grep -A 20 "Events"

If kubectl top nodes fails, either Metrics Server is not installed or is not functioning properly. In this case, HPA cannot collect metrics and autoscaling will not work at all.

VPA Architecture and Operational Strategy

VPA Components

VPA consists of three main components.

  1. Recommender: Analyzes historical resource usage and OOM events to calculate optimal CPU/memory request values, using a histogram-based algorithm that derives recommendations from high percentiles of observed usage plus a safety margin.
  2. Updater: Evicts Pods when current resource settings deviate significantly from recommended values, allowing new recommendations to be applied.
  3. Admission Controller: A webhook that automatically applies recommended resource values to newly created or restarted Pods.
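The Recommender's real histogram logic is more elaborate (decaying sample weights, separate lower/upper bounds), but the percentile idea can be sketched as follows. The 95th percentile and 15% safety margin here are illustrative assumptions, not VPA's exact defaults:

```python
def recommend_request(samples_millicores: list,
                      percentile: float = 0.95,
                      safety_margin: float = 0.15) -> float:
    """Illustrative percentile-based recommendation: take a high
    percentile of observed usage and add headroom on top."""
    s = sorted(samples_millicores)
    idx = min(int(len(s) * percentile), len(s) - 1)
    return s[idx] * (1 + safety_margin)

# Observed CPU usage samples in millicores, including one spike
usage = [120, 150, 180, 200, 210, 250, 300, 320, 400, 950]
print(round(recommend_request(usage)))  # recommendation covers the spike
```

Note how a single outlier dominates the recommendation — this is why bounding recommendations with minAllowed/maxAllowed (shown in the manifest below) matters.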

Operation Modes

VPA supports four updateMode options.

  • Off: Only provides recommendations without making actual changes. Recommended during initial production adoption.
  • Initial: Applies recommended values only at Pod creation time. Does not modify already running Pods.
  • Recreate: Recreates Pods when the difference from recommended values is significant. Must be used with PodDisruptionBudget.
  • Auto: Currently equivalent to Recreate (Pods are evicted to apply new values); as In-Place Resource Resize (alpha since Kubernetes 1.27) matures, VPA aims to apply recommendations without restarting Pods.

Production VPA Manifest

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: 'Off'
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        minAllowed:
          cpu: '100m'
          memory: '128Mi'
        maxAllowed:
          cpu: '4'
          memory: '8Gi'
        controlledResources:
          - cpu
          - memory
        controlledValues: RequestsAndLimits

In production environments, you must start with Off mode and observe the stability of recommendations for at least 1-2 weeks before switching to Auto or Recreate mode. Always configure minAllowed and maxAllowed to prevent abnormal recommendations from being applied.

Checking VPA Recommendations

# Check VPA recommendations
kubectl describe vpa api-server-vpa -n production

# Compare recommendations with current resource requests
kubectl get vpa api-server-vpa -n production -o jsonpath='{.status.recommendation.containerRecommendations[0]}'

KEDA Event-Driven Scaling

KEDA Architecture

KEDA is a CNCF Graduated project that extends Kubernetes HPA to enable scaling based on external event sources. The core components of KEDA are as follows.

  1. KEDA Operator: Watches ScaledObject/ScaledJob CRDs, automatically creates and manages HPAs. Scales Pods to 0 when there are no events, activates to 1 when events occur, then hands control to HPA.
  2. Metrics Server (KEDA): Exposes external event source metrics via the Kubernetes External Metrics API.
  3. Scalers: Adapters that connect to 65+ event sources (Kafka, RabbitMQ, AWS SQS, Prometheus, PostgreSQL, Cron, etc.).

Scaling Flow

KEDA scaling operates in two phases.

  1. Activation Phase: The KEDA Operator monitors event sources and activates the Deployment from 0 to 1 replica when trigger conditions are met.
  2. Scaling Phase: After activation, the HPA created by KEDA handles scaling from 1 to N based on metrics.
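This two-phase handoff can be sketched as follows (a simplification that assumes a single trigger and ignores cooldownPeriod and idleReplicaCount):

```python
import math

def keda_phase(replicas: int, metric: float,
               activation_threshold: float, hpa_target: float) -> int:
    """Illustrative two-phase decision: KEDA handles 0 <-> 1 (activation),
    the HPA that KEDA generates handles 1..N (scaling)."""
    if replicas == 0:
        # Activation phase: KEDA wakes the workload when the trigger fires
        return 1 if metric > activation_threshold else 0
    if metric <= activation_threshold:
        # No events: KEDA scales back to zero (after cooldownPeriod)
        return 0
    # Scaling phase: standard HPA ratio applied to active replicas
    return max(1, math.ceil(replicas * metric / hpa_target))

print(keda_phase(0, 120, activation_threshold=0, hpa_target=50))  # 1 (activate)
print(keda_phase(1, 120, activation_threshold=0, hpa_target=50))  # 3 (HPA scales)
```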

KEDA ScaledObject Manifest

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaledobject
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 300
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 100
  fallback:
    failureThreshold: 3
    replicas: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-broker:9092
        consumerGroup: order-processor-group
        topic: orders
        lagThreshold: '50'
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_total
        query: sum(rate(http_requests_total[2m]))
        threshold: '100'

Here is a breakdown of the key configuration elements.

  • pollingInterval: Interval in seconds for checking event sources. Default is 30 seconds; reduce to 15 seconds when responsiveness is critical.
  • cooldownPeriod: Wait time in seconds after the last trigger activation before scaling down to 0.
  • idleReplicaCount: Number of replicas to maintain when there are no events. Setting to 0 enables Scale-to-Zero.
  • fallback: Safety mechanism for metric collection failures. Maintains the number of replicas specified in replicas after failureThreshold consecutive failures.
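The fallback behavior can be sketched like this (illustrative; real KEDA tracks failures per trigger inside the Operator):

```python
def effective_replicas(metric_reads: list, current: int,
                       failure_threshold: int = 3,
                       fallback_replicas: int = 5) -> int:
    """Illustrative fallback: if the last failure_threshold metric reads
    all failed (None), hold the configured fallback replica count."""
    recent = metric_reads[-failure_threshold:]
    if len(recent) == failure_threshold and all(r is None for r in recent):
        return fallback_replicas
    return current

print(effective_replicas([10, None, None], current=2))    # 2 (not enough failures)
print(effective_replicas([None, None, None], current=2))  # 5 (fallback engaged)
```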

KEDA Installation and Verification

# Install KEDA with Helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

# Check KEDA status
kubectl get pods -n keda
kubectl get scaledobjects -n production
kubectl get hpa -n production

# Check ScaledObject details
kubectl describe scaledobject order-processor-scaledobject -n production

HPA vs VPA vs KEDA Comparison

| Category | HPA | VPA | KEDA |
| --- | --- | --- | --- |
| Scaling Direction | Horizontal (Pod count) | Vertical (resource requests/limits) | Horizontal (event-driven Pod count) |
| Default Metrics | CPU, Memory | CPU, Memory usage history | External event sources (65+ scalers) |
| Custom Metrics | Supported (Adapter required) | Not supported | Built-in support |
| Scale-to-Zero | Not supported (minReplicas >= 1) | Not applicable | Supported |
| Pod Restart | Not required | Required (Recreate mode) | Not required |
| Suitable Workloads | Stateless web services, APIs | All workloads (resource optimization) | Event-driven, batch jobs, queue processing |
| Built into Kubernetes | Yes | Separate installation required | Separate installation required |
| Relationship with HPA | - | May conflict on CPU/memory metrics | Internally creates/manages HPA |
| Learning Curve | Low | Medium | Medium to High |
| Cost Optimization | Medium (possible over-scaling) | High (right-sizing) | High (Scale-to-Zero) |

Recommended combinations by workload type:
  • General web services: HPA (CPU-based) + VPA (Off mode for monitoring)
  • Event-driven microservices: KEDA (Kafka/SQS triggers)
  • Batch jobs: KEDA (ScaledJob for automatic termination upon job completion)
  • API gateways: HPA (RPS-based custom metrics)
  • Legacy applications: VPA (when vertical scaling is the only option)

Composite Scaling Patterns

HPA + VPA Combination

The most important principle when using HPA and VPA simultaneously is to ensure they do not conflict on the same metrics.

# HPA: Horizontal scaling based on custom metrics (RPS) only
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: '500'
---
# VPA: Vertical adjustment of CPU/memory only (no conflict since HPA uses only custom metrics)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: 'Auto'
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        controlledResources:
          - cpu
          - memory
        minAllowed:
          cpu: '200m'
          memory: '256Mi'
        maxAllowed:
          cpu: '2'
          memory: '4Gi'

In this pattern, HPA adjusts the Pod count based solely on RPS (requests per second), while VPA optimizes CPU/memory resources. If both use the same metrics (CPU/memory), HPA increases Pods to lower utilization, and VPA reduces resources to raise utilization again, creating an infinite loop (thrashing).

HPA + KEDA Combination

Since KEDA creates and manages an HPA internally, you should not create a separate HPA targeting the same workload; the two would fight over the replica count. Instead, combine multiple event sources in a single ScaledObject.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: multi-trigger-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: nginx_connections_active
        query: sum(nginx_ingress_controller_nginx_process_connections)
        threshold: '100'
    - type: cron
      metadata:
        timezone: Asia/Seoul
        start: 30 8 * * 1-5
        end: 30 18 * * 1-5
        desiredReplicas: '10'

This example combines Prometheus metric-based dynamic scaling with cron-based scheduled scaling. From 8:30 AM to 6:30 PM on weekdays, a minimum of 10 Pods is maintained while allowing additional scaling based on active connection count.

Operational Considerations

1. Preventing Flapping

Flapping is the phenomenon where HPA rapidly alternates between scaling up and scaling down. To prevent it, verify the following.

  • Set behavior.scaleDown.stabilizationWindowSeconds to 300 seconds or more
  • If metrics have high variability, set averageUtilization target with 10-15% headroom
  • Understand and utilize the tolerance value (default 0.1, i.e., 10%). Scaling does not occur when current metrics are within 90-110% of the target
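The tolerance check can be sketched as:

```python
def should_scale(current_metric: float, target_metric: float,
                 tolerance: float = 0.1) -> bool:
    """HPA skips scaling when the current/target ratio is within
    the tolerance band (default 0.1, i.e. 90-110% of target)."""
    ratio = current_metric / target_metric
    return abs(ratio - 1.0) > tolerance

print(should_scale(54, 60))  # False: 90% of target, inside the band
print(should_scale(68, 60))  # True: ~113% of target, scaling proceeds
```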

2. Missing Resource Requests

If CPU/memory requests are not set on Pods, HPA cannot compute Utilization-based targets and percentage scaling will not work. You must configure resources.requests (limits are also recommended).

resources:
  requests:
    cpu: '500m'
    memory: '512Mi'
  limits:
    cpu: '1000m'
    memory: '1Gi'

3. Metrics Lag

Metrics Server has approximately 15 seconds of lag, and Prometheus-based custom metrics can have 30-60 seconds of delay when combining scrape interval and adapter processing time. During traffic spikes, existing Pods may become overloaded during this delay, so consider the following.

  • Set minReplicas slightly above the average traffic baseline
  • Configure Readiness Probes appropriately to minimize the time before new Pods are ready to receive traffic
  • Use KEDA's Cron trigger for pre-scaling when surges are predictable
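As a back-of-the-envelope way to pick minReplicas with this lag in mind (the per-Pod capacity and growth-rate figures are assumptions you must measure for your own traffic):

```python
import math

def min_replicas_for_lag(baseline_rps: float, per_pod_rps: float,
                         max_growth_per_sec: float, lag_seconds: float) -> int:
    """Illustrative sizing: keep enough replicas that traffic growing at
    max_growth_per_sec for lag_seconds still fits before HPA reacts."""
    peak_during_lag = baseline_rps + max_growth_per_sec * lag_seconds
    return math.ceil(peak_during_lag / per_pod_rps)

# 2000 RPS baseline, 500 RPS per Pod, growth up to 20 RPS/s, ~60s total lag
print(min_replicas_for_lag(2000, 500, 20, 60))  # 7
```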

4. Conflicts When Using VPA and HPA Simultaneously

When VPA is in Auto or Recreate mode while HPA uses CPU/memory metrics, both will make conflicting decisions. You must choose one of the following approaches.

  • HPA uses only custom metrics, VPA adjusts CPU/memory
  • Set VPA to Off mode and only reference recommendations

5. Integration with Cluster Autoscaler

Even if HPA increases the Pod count, Pods will remain in Pending state if no available nodes exist in the cluster. Cluster Autoscaler (or Karpenter) must be integrated for node-level scaling.

Failure Cases and Recovery Procedures

Case 1: Metrics Server Failure Disabling HPA

Symptom: All HPAs show unknown in the TARGETS column, and Pod count remains fixed.

# Diagnosis
kubectl get hpa -A
kubectl top nodes  # If this fails, it is a Metrics Server issue

# Check Metrics Server status
kubectl get pods -n kube-system | grep metrics-server
kubectl logs -n kube-system deployment/metrics-server --tail=50

Recovery Procedure:

  1. Restart Metrics Server Pod: kubectl rollout restart deployment/metrics-server -n kube-system
  2. Verify APIService registration: kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
  3. Reinstall Metrics Server if the issue persists
  4. Manually adjust replica count until recovery: kubectl scale deployment/api-server --replicas=10 -n production

Case 2: Cascading OOM Kills

Symptom: VPA recommendations are set lower than actual peak usage, causing Pods to be repeatedly OOMKilled.

# Check OOM events
kubectl get events -n production --field-selector reason=OOMKilling --sort-by='.lastTimestamp'

# Check Pod restart counts
kubectl get pods -n production -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount

Recovery Procedure:

  1. Immediately change VPA updateMode to Off to prevent further Pod modifications
  2. Manually increase memory requests/limits for the affected Deployment
  3. Verify that VPA maxAllowed values sufficiently cover actual peak usage
  4. Re-observe recommendation stability in Off mode for at least 2 weeks

Case 3: Scaling Storm

Symptom: HPA rapidly scales up then immediately scales down repeatedly, continuously creating and deleting Pods.

# Check for frequent SuccessfulRescale events in HPA
kubectl describe hpa api-server-hpa -n production | grep SuccessfulRescale

Recovery Procedure:

  1. Increase behavior.scaleDown.stabilizationWindowSeconds to 600 seconds or more
  2. Limit scale-down ratio in scaleDown.policies to 5% or less
  3. Analyze metric source variability and apply appropriate smoothing
  4. Temporarily disable HPA and switch to manual management if necessary

Case 4: KEDA Metrics Collection Failure

Symptom: ScaledObject status shows Unknown, and metrics for the associated HPA are not being collected.

# Check ScaledObject status
kubectl get scaledobject -n production
kubectl describe scaledobject order-processor-scaledobject -n production

# Check KEDA Operator logs
kubectl logs -n keda deployment/keda-operator --tail=100 | grep -i error

Recovery Procedure:

  1. Verify connection status of event sources (Kafka, Prometheus, etc.)
  2. Check if the ScaledObject has a fallback configuration. If not, add one to maintain safe replica count during metric failures
  3. Restart KEDA Operator: kubectl rollout restart deployment/keda-operator -n keda
  4. Verify that authentication credentials (TriggerAuthentication) have not expired

Production Checklist

Verify the following items before deployment.

HPA Related:

  • Is Metrics Server functioning properly and does kubectl top pods succeed
  • Is minReplicas set to the minimum value capable of handling normal traffic
  • Is maxReplicas within cluster capacity and cost limits
  • Is behavior.scaleDown.stabilizationWindowSeconds set to 300 seconds or more
  • Are resources.requests configured on the target Deployment

VPA Related:

  • Does initial production deployment start with updateMode: "Off"
  • Are minAllowed and maxAllowed appropriately configured
  • Is PodDisruptionBudget configured
  • Is there no conflict with HPA on the same metrics (CPU/memory)

KEDA Related:

  • Does the fallback configuration guarantee safe replica count during metric failures
  • Is cooldownPeriod configured appropriately for the workload characteristics
  • Are TriggerAuthentication credentials valid
  • Is Cold Start time within SLA when using Scale-to-Zero

Common:

  • Is Cluster Autoscaler or Karpenter handling node-level scaling
  • Are alerts configured for autoscaling events
  • Can replica count, metric values, and scaling events be monitored via Grafana dashboards
  • Are runbooks documented and understood by team members

Conclusion

Kubernetes autoscaling does not end with simply deploying a single HPA. In production environments, you must comprehensively design for metrics collection stability, understanding of scaling algorithms, cooldown strategies, and recovery procedures during failures.

The key principles are summarized as follows.

  1. Do not rely on a single autoscaler: HPA, VPA, and KEDA each have their strengths. Combine them according to your workload characteristics.
  2. Scale down conservatively: Scale up quickly, scale down slowly. You must be prepared for traffic re-surges.
  3. Prepare for metric failures: Metrics Server, Prometheus, and external event sources can all experience failures. Always configure fallback strategies.
  4. Observe first, automate later: Start VPA in Off mode, configure HPA behavior conservatively, then gradually increase the level of automation.
  5. Monitoring and alerting are fundamental: Visualize autoscaling decision processes in Grafana dashboards and set up alerts for abnormal scaling patterns.

When autoscaling is properly implemented, you can keep your service stable through traffic spikes while meaningfully reducing cloud costs, with savings of 30-50% commonly reported. We hope the patterns and checklists covered in this article provide practical help in establishing autoscaling strategies for production environments.