Kubernetes Autoscaling Complete Guide: Production Workload Auto-Scaling Strategies with HPA, VPA, and KEDA
- Introduction
- HPA v2 In-Depth Analysis
- VPA Architecture and Operational Strategy
- KEDA Event-Driven Scaling
- HPA vs VPA vs KEDA Comparison
- Composite Scaling Patterns
- Operational Considerations
- Failure Cases and Recovery Procedures
- Production Checklist
- References
- Conclusion

Introduction
In production Kubernetes clusters, autoscaling is not optional but essential. When there are too few Pods during a traffic spike, service outages occur; over-provisioning, on the other hand, can waste a substantial share of the monthly cloud bill. The Kubernetes ecosystem provides three core autoscalers to address these challenges.
- HPA (Horizontal Pod Autoscaler): Horizontally scales Pod replica count based on CPU, memory, and custom metrics
- VPA (Vertical Pod Autoscaler): Automatically adjusts Pod resource requests/limits based on historical usage analysis
- KEDA (Kubernetes Event-Driven Autoscaler): Scales workloads from 0 to N based on external event sources such as message queues, HTTP request counts, and cron schedules
This article covers the architecture of each autoscaler, in-depth configuration strategies for production environments, real-world failure cases with recovery procedures, and a comprehensive checklist that must be verified before production deployment.
HPA v2 In-Depth Analysis
Architecture and Metrics Collection Flow
HPA v2 (`autoscaling/v2`) is the default horizontal scaling mechanism in Kubernetes. The HPA controller collects metrics at a default interval of 15 seconds (`--horizontal-pod-autoscaler-sync-period`) and determines the replica count by calculating the ratio between target utilization and current utilization.
The metrics collection flow is as follows.
- Metrics Server: Collects CPU/memory metrics from kubelet's cAdvisor and exposes them via the `metrics.k8s.io` API
- Custom Metrics Adapter: Exposes custom metrics from Prometheus, Datadog, etc. via the `custom.metrics.k8s.io` API
- External Metrics Adapter: Exposes metrics from systems outside the cluster via the `external.metrics.k8s.io` API
Scaling Algorithm
The core formula of HPA is as follows.
```
desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue))
```
For example, if 3 Pods are currently using 80% CPU and the target is 50%, the calculation yields ceil(3 * (80/50)) = ceil(4.8) = 5 Pods. When multiple metrics are specified, HPA independently calculates for each metric and adopts the largest value.
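To make the algorithm concrete, here is a minimal Python sketch of the calculation. This is a deliberate simplification for illustration: it ignores Pod readiness, missing metrics, and min/max clamping, and models only the controller's default 10% tolerance band.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Simplified sketch of the HPA replica calculation."""
    ratio = current_metric / target_metric
    # Within the tolerance band (default 10%), HPA skips scaling entirely.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# The example from the text: 3 Pods at 80% CPU with a 50% target.
print(desired_replicas(3, 80, 50))  # -> 5

# With multiple metrics, HPA computes a proposal per metric and takes the max.
proposals = [desired_replicas(3, m, t) for m, t in [(80, 50), (60, 70)]]
print(max(proposals))  # -> 5
```

Note how the tolerance band suppresses small corrections: at 52% CPU against a 50% target, the ratio is within 10% of 1.0 and the replica count stays put.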
Production HPA v2 Manifest
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: '1000'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 5
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 120
      selectPolicy: Min
```
The key point is the behavior field.
- scaleUp: Sets a stabilization window of 60 seconds, scaling up by the larger of 50% or 5 Pods per minute.
- scaleDown: Sets a stabilization window of 300 seconds (5 minutes), scaling down by at most 10% every 2 minutes. This conservative scale-down is necessary to handle traffic re-surges.
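As a sanity check on the scale-up policy above, this small sketch (illustrative only, single period) computes the upper bound the HPA would allow in one 60-second window under `selectPolicy: Max`:

```python
import math

def scale_up_limit(current: int, percent: int = 50, pods: int = 5) -> int:
    """Upper bound on replicas after one scale-up period, given a
    50%-Percent policy and a 5-Pods policy with selectPolicy: Max."""
    by_percent = math.ceil(current * (1 + percent / 100))
    by_pods = current + pods
    # selectPolicy: Max picks the more permissive of the two limits.
    return max(by_percent, by_pods)

for n in (3, 10, 40):
    print(n, "->", scale_up_limit(n))
# 3 -> 8 (the +5 Pods policy wins), 10 -> 15 (tie), 40 -> 60 (the 50% policy wins)
```

At small replica counts the absolute Pods policy dominates, and at large counts the percentage policy does, which is exactly why combining both gives responsive yet bounded growth.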
Metrics Server Installation Verification
```bash
# Verify Metrics Server is working
kubectl top nodes
kubectl top pods -n production

# Check HPA status
kubectl get hpa api-server-hpa -n production -o yaml

# Check HPA events
kubectl describe hpa api-server-hpa -n production | grep -A 20 "Events"
```
If `kubectl top nodes` fails, either Metrics Server is not installed or it is not functioning properly. In this case, HPA cannot collect metrics and autoscaling will not work at all.
VPA Architecture and Operational Strategy
VPA Components
VPA consists of three main components.
- Recommender: Analyzes historical resource usage and OOM events to calculate optimal CPU/memory request values. Uses a histogram-based algorithm to derive recommendations based on P95 usage.
- Updater: Evicts Pods when current resource settings deviate significantly from recommended values, allowing new recommendations to be applied.
- Admission Controller: A webhook that automatically applies recommended resource values to newly created or restarted Pods.
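The Recommender's percentile idea can be sketched roughly as follows. This is a toy version under stated assumptions: the real Recommender uses exponentially decaying histograms per container with its own safety margins, while here a plain sample list and a 15% margin are used purely for illustration.

```python
def p95_recommendation(samples: list[float], safety_margin: float = 0.15) -> float:
    """Toy percentile-based request recommendation from raw usage samples."""
    ordered = sorted(samples)
    idx = int(0.95 * (len(ordered) - 1))  # index of the ~P95 sample
    return ordered[idx] * (1 + safety_margin)

# Hypothetical per-minute CPU samples in millicores, with one 900m spike.
cpu_millicores = [120, 150, 180, 900, 200, 210, 170, 160, 140, 190]
print(f"{p95_recommendation(cpu_millicores):.0f}m")
```

The point of a percentile (rather than the max) is visible in the output: the single 900m spike is largely ignored, so one outlier does not inflate the recommendation.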
Operation Modes
VPA supports four updateMode options.
- Off: Only provides recommendations without making actual changes. Recommended during initial production adoption.
- Initial: Applies recommended values only at Pod creation time. Does not modify already running Pods.
- Recreate: Recreates Pods when the difference from recommended values is significant. Must be used with PodDisruptionBudget.
- Auto: In most clusters today this behaves like Recreate. The intent is to apply recommendations via in-place resource resize without restarting Pods, once that feature (available behind a feature gate since Kubernetes 1.27) is supported in your cluster and VPA version.
Production VPA Manifest
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: 'Off'
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        minAllowed:
          cpu: '100m'
          memory: '128Mi'
        maxAllowed:
          cpu: '4'
          memory: '8Gi'
        controlledResources:
          - cpu
          - memory
        controlledValues: RequestsAndLimits
```
In production environments, you must start with `Off` mode and observe the stability of recommendations for at least 1-2 weeks before switching to `Auto` or `Recreate` mode. Always configure `minAllowed` and `maxAllowed` to prevent abnormal recommendations from being applied.
Checking VPA Recommendations
```bash
# Check VPA recommendations
kubectl describe vpa api-server-vpa -n production

# Compare recommendations with current resource requests
kubectl get vpa api-server-vpa -n production -o jsonpath='{.status.recommendation.containerRecommendations[0]}'
```
KEDA Event-Driven Scaling
KEDA Architecture
KEDA is a CNCF Graduated project that extends Kubernetes HPA to enable scaling based on external event sources. The core components of KEDA are as follows.
- KEDA Operator: Watches ScaledObject/ScaledJob CRDs, automatically creates and manages HPAs. Scales Pods to 0 when there are no events, activates to 1 when events occur, then hands control to HPA.
- Metrics Server (KEDA): Exposes external event source metrics via the Kubernetes External Metrics API.
- Scalers: Adapters that connect to 65+ event sources (Kafka, RabbitMQ, AWS SQS, Prometheus, PostgreSQL, Cron, etc.).
Scaling Flow
KEDA scaling operates in two phases.
- Activation Phase: The KEDA Operator monitors event sources and activates the Deployment from 0 to 1 replica when trigger conditions are met.
- Scaling Phase: After activation, the HPA created by KEDA handles scaling from 1 to N based on metrics.
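The two phases can be made concrete with a numeric sketch. Assuming a Kafka trigger with `lagThreshold: 50` (as in the manifest below), the HPA that KEDA creates roughly sizes the consumer from total lag; the function below is an illustration only and ignores the real per-partition cap and activation thresholds.

```python
import math

def keda_desired_replicas(total_lag: int, lag_threshold: int,
                          min_replicas: int, max_replicas: int) -> int:
    """Rough sketch of lag-based sizing for a KEDA-managed consumer."""
    if total_lag == 0:
        return 0  # Scale-to-Zero is handled by the KEDA Operator, not the HPA
    desired = math.ceil(total_lag / lag_threshold)
    return max(min_replicas, min(max_replicas, desired))

print(keda_desired_replicas(0, 50, 1, 100))     # -> 0 (idle, deactivated)
print(keda_desired_replicas(1200, 50, 1, 100))  # -> 24
print(keda_desired_replicas(9000, 50, 1, 100))  # -> 100 (capped at max)
```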
KEDA ScaledObject Manifest
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaledobject
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 300
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 100
  fallback:
    failureThreshold: 3
    replicas: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka-broker:9092
        consumerGroup: order-processor-group
        topic: orders
        lagThreshold: '50'
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: http_requests_total
        query: sum(rate(http_requests_total[2m]))
        threshold: '100'
```
Here is a breakdown of the key configuration elements.
- `pollingInterval`: Interval in seconds for checking event sources. Default is 30 seconds; reduce to 15 seconds when responsiveness is critical.
- `cooldownPeriod`: Wait time in seconds after the last trigger activation before scaling down to 0.
- `idleReplicaCount`: Number of replicas to maintain when there are no events. Setting this to 0 enables Scale-to-Zero.
- `fallback`: Safety mechanism for metric collection failures. Maintains the number of replicas specified in `replicas` after `failureThreshold` consecutive failures.
KEDA Installation and Verification
```bash
# Install KEDA with Helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

# Check KEDA status
kubectl get pods -n keda
kubectl get scaledobjects -n production
kubectl get hpa -n production

# Check ScaledObject details
kubectl describe scaledobject order-processor-scaledobject -n production
```
HPA vs VPA vs KEDA Comparison
| Category | HPA | VPA | KEDA |
|---|---|---|---|
| Scaling Direction | Horizontal (Pod count) | Vertical (resource requests/limits) | Horizontal (event-driven Pod count) |
| Default Metrics | CPU, Memory | CPU, Memory usage history | External event sources (65+ scalers) |
| Custom Metrics | Supported (Adapter required) | Not supported | Built-in support |
| Scale-to-Zero | Not supported (minReplicas >= 1) | Not applicable | Supported |
| Pod Restart | Not required | Required (Recreate mode) | Not required |
| Suitable Workloads | Stateless web services, APIs | All workloads (resource optimization) | Event-driven, batch jobs, queue processing |
| Built into Kubernetes | Yes | Separate installation required | Separate installation required |
| Relationship with HPA | - | May conflict on CPU/memory metrics | Internally creates/manages HPA |
| Learning Curve | Low | Medium | Medium to High |
| Cost Optimization | Medium (possible over-scaling) | High (right-sizing) | High (Scale-to-Zero) |
Recommended Scenarios
- General web services: HPA (CPU-based) + VPA (Off mode for monitoring)
- Event-driven microservices: KEDA (Kafka/SQS triggers)
- Batch jobs: KEDA (ScaledJob for automatic termination upon job completion)
- API gateways: HPA (RPS-based custom metrics)
- Legacy applications: VPA (when vertical scaling is the only option)
Composite Scaling Patterns
HPA + VPA Combination
The most important principle when using HPA and VPA simultaneously is to ensure they do not conflict on the same metrics.
```yaml
# HPA: horizontal scaling based on a custom metric (RPS) only
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: '500'
---
# VPA: vertical adjustment of CPU/memory only (no conflict since HPA uses only custom metrics)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: 'Auto'
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        controlledResources:
          - cpu
          - memory
        minAllowed:
          cpu: '200m'
          memory: '256Mi'
        maxAllowed:
          cpu: '2'
          memory: '4Gi'
```
In this pattern, HPA adjusts the Pod count based solely on RPS (requests per second), while VPA optimizes CPU/memory resources. If both use the same metrics (CPU/memory), HPA increases Pods to lower utilization, and VPA reduces resources to raise utilization again, creating an infinite loop (thrashing).
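The feedback loop is easy to see in a toy simulation (all numbers hypothetical): HPA reacts to high CPU utilization by adding Pods, VPA then right-sizes requests downward to match per-Pod usage, and utilization immediately climbs back above target, so neither controller converges.

```python
import math

# Hypothetical workload: total CPU demand of 2400m, 3 Pods at 1000m requests.
replicas, request_m, demand_m = 3, 1000, 2400

for step in range(4):
    utilization = demand_m / (replicas * request_m) * 100
    print(f"step {step}: {replicas} pods x {request_m}m -> {utilization:.0f}%")
    if utilization > 60:                          # HPA: scale out toward 60%
        replicas = math.ceil(replicas * utilization / 60)
    request_m = int(demand_m / replicas * 1.1)    # VPA: shrink requests to ~usage
```

In this toy run utilization never settles near the 60% target and the replica count keeps climbing, which is the thrashing behavior the same-metric combination produces.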
HPA + KEDA Combination
Since KEDA internally creates an HPA, there is no need to create a separate HPA. However, multiple event sources can be combined into a single ScaledObject.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: multi-trigger-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: nginx_connections_active
        query: sum(nginx_ingress_controller_nginx_process_connections)
        threshold: '100'
    - type: cron
      metadata:
        timezone: Asia/Seoul
        start: '30 8 * * 1-5'
        end: '30 18 * * 1-5'
        desiredReplicas: '10'
```
This example combines Prometheus metric-based dynamic scaling with cron-based scheduled scaling. From 8:30 AM to 6:30 PM on weekdays, a minimum of 10 Pods is maintained while allowing additional scaling based on active connection count.
Operational Considerations
1. Preventing Flapping
Flapping refers to the phenomenon where HPA repeatedly scales up and down frequently. To prevent this, verify the following.
- Set `behavior.scaleDown.stabilizationWindowSeconds` to 300 seconds or more
- If metrics have high variability, set the `averageUtilization` target with 10-15% headroom
- Understand and utilize the `tolerance` value (default 0.1, i.e., 10%): scaling does not occur when the current metric is within 90-110% of the target
2. Missing Resource Requests
If CPU/memory requests are not set on Pods, HPA's utilization-based scaling cannot work, because utilization is calculated relative to `resources.requests`. You must configure `resources.requests` on every container of the target workload.
```yaml
resources:
  requests:
    cpu: '500m'
    memory: '512Mi'
  limits:
    cpu: '1000m'
    memory: '1Gi'
```
3. Metrics Lag
Metrics Server has approximately 15 seconds of lag, and Prometheus-based custom metrics can have 30-60 seconds of delay when combining scrape interval and adapter processing time. During traffic spikes, existing Pods may become overloaded during this delay, so consider the following.
- Set `minReplicas` slightly above the average traffic baseline
- Configure Readiness Probes appropriately to minimize the time before new Pods are ready to receive traffic
- Use KEDA's Cron trigger for pre-scaling when surges are predictable
4. Conflicts When Using VPA and HPA Simultaneously
When VPA is in Auto or Recreate mode while HPA uses CPU/memory metrics, both will make conflicting decisions. You must choose one of the following approaches.
- HPA uses only custom metrics, VPA adjusts CPU/memory
- Set VPA to `Off` mode and only reference its recommendations
5. Integration with Cluster Autoscaler
Even if HPA increases the Pod count, Pods will remain in Pending state if no available nodes exist in the cluster. Cluster Autoscaler (or Karpenter) must be integrated for node-level scaling.
Failure Cases and Recovery Procedures
Case 1: Metrics Server Failure Disabling HPA
Symptom: All HPAs show unknown in the TARGETS column, and Pod count remains fixed.
```bash
# Diagnosis
kubectl get hpa -A
kubectl top nodes  # If this fails, it is a Metrics Server issue

# Check Metrics Server status
kubectl get pods -n kube-system | grep metrics-server
kubectl logs -n kube-system deployment/metrics-server --tail=50
```
Recovery Procedure:
- Restart the Metrics Server Pod: `kubectl rollout restart deployment/metrics-server -n kube-system`
- Verify APIService registration: `kubectl get apiservice v1beta1.metrics.k8s.io -o yaml`
- Reinstall Metrics Server if the issue persists
- Manually adjust the replica count until recovery: `kubectl scale deployment/api-server --replicas=10 -n production`
Case 2: Cascading OOM Kills
Symptom: VPA recommendations are set lower than actual peak usage, causing Pods to be repeatedly OOMKilled.
```bash
# Check OOM events
kubectl get events -n production --field-selector reason=OOMKilling --sort-by='.lastTimestamp'

# Check Pod restart counts
kubectl get pods -n production -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
```
Recovery Procedure:
- Immediately change the VPA `updateMode` to `Off` to prevent further Pod modifications
- Manually increase memory requests/limits for the affected Deployment
- Verify that the VPA `maxAllowed` values sufficiently cover actual peak usage
- Re-observe recommendation stability in `Off` mode for at least 2 weeks
Case 3: Scaling Storm
Symptom: HPA rapidly scales up then immediately scales down repeatedly, continuously creating and deleting Pods.
```bash
# Check for frequent SuccessfulRescale events in HPA
kubectl describe hpa api-server-hpa -n production | grep SuccessfulRescale
```
Recovery Procedure:
- Increase `behavior.scaleDown.stabilizationWindowSeconds` to 600 seconds or more
- Limit the scale-down ratio in `scaleDown.policies` to 5% or less
- Analyze metric source variability and apply appropriate smoothing
- Temporarily disable HPA and switch to manual management if necessary
Case 4: KEDA Metrics Collection Failure
Symptom: ScaledObject status shows Unknown, and metrics for the associated HPA are not being collected.
```bash
# Check ScaledObject status
kubectl get scaledobject -n production
kubectl describe scaledobject order-processor-scaledobject -n production

# Check KEDA Operator logs
kubectl logs -n keda deployment/keda-operator --tail=100 | grep -i error
```
Recovery Procedure:
- Verify connection status of event sources (Kafka, Prometheus, etc.)
- Check whether the ScaledObject has a `fallback` configuration; if not, add one to maintain a safe replica count during metric failures
- Restart the KEDA Operator: `kubectl rollout restart deployment/keda-operator -n keda`
- Verify that authentication credentials (TriggerAuthentication) have not expired
Production Checklist
Verify the following items before deployment.
HPA Related:
- Is Metrics Server functioning properly, and does `kubectl top pods` succeed?
- Is `minReplicas` set to the minimum value capable of handling normal traffic?
- Is `maxReplicas` within cluster capacity and cost limits?
- Is `behavior.scaleDown.stabilizationWindowSeconds` set to 300 seconds or more?
- Are `resources.requests` configured on the target Deployment?
VPA Related:
- Does the initial production deployment start with `updateMode: "Off"`?
- Are `minAllowed` and `maxAllowed` appropriately configured?
- Is a PodDisruptionBudget configured?
- Is there no conflict with HPA on the same metrics (CPU/memory)?
KEDA Related:
- Does the `fallback` configuration guarantee a safe replica count during metric failures?
- Is `cooldownPeriod` configured appropriately for the workload characteristics?
- Are TriggerAuthentication credentials valid?
- Is cold start time within the SLA when using Scale-to-Zero?
Common:
- Is Cluster Autoscaler or Karpenter handling node-level scaling
- Are alerts configured for autoscaling events
- Can replica count, metric values, and scaling events be monitored via Grafana dashboards
- Are runbooks documented and understood by team members
References
- Kubernetes Horizontal Pod Autoscaling Official Documentation
- Kubernetes Vertical Pod Autoscaling Official Documentation
- KEDA Official Website
- Kubernetes Autoscaling Patterns: HPA, VPA and KEDA - Spectro Cloud
- Scaling Kubernetes the Right Way: HPA, VPA, CA, Karpenter, and KEDA - CloudPilot AI
- HPA vs VPA vs KEDA Performance and Cost Trade-offs - Kubeify
- Kubernetes HPA Troubleshooting - OneUptime
- VPA GitHub Repository - kubernetes/autoscaler
Conclusion
Kubernetes autoscaling does not end with simply deploying a single HPA. In production environments, you must comprehensively design for metrics collection stability, understanding of scaling algorithms, cooldown strategies, and recovery procedures during failures.
The key principles are summarized as follows.
- Do not rely on a single autoscaler: HPA, VPA, and KEDA each have their strengths. Combine them according to your workload characteristics.
- Scale down conservatively: Scale up quickly, scale down slowly. You must be prepared for traffic re-surges.
- Prepare for metric failures: Metrics Server, Prometheus, and external event sources can all experience failures. Always configure fallback strategies.
- Observe first, automate later: Start VPA in Off mode, configure HPA behavior conservatively, then gradually increase the level of automation.
- Monitoring and alerting are fundamental: Visualize autoscaling decision processes in Grafana dashboards and set up alerts for abnormal scaling patterns.
When autoscaling is properly implemented, you can provide stable service even during traffic spikes while meaningfully reducing cloud costs; right-sizing and Scale-to-Zero alone often recover a large share of over-provisioned spend. We hope that the patterns and checklists covered in this article provide practical help in establishing autoscaling strategies for production environments.