Kubernetes Advanced Operations Guide 2025: Autoscaling, Scheduling, Resource Management, Multi-Cluster

Author: Youngju Kim (@fjvbn20031)
1. Introduction: Why Advanced Kubernetes Operations Matter
Running Kubernetes in production reveals challenges that basic deployments cannot address. Pods may not scale fast enough during traffic spikes, workloads may cluster on specific nodes causing cascading failures, or costs may explode without proper resource management.
This guide covers four core areas of advanced Kubernetes operations:
- Autoscaling - Scale workloads and infrastructure automatically with HPA, VPA, KEDA, and Karpenter
- Scheduling - Optimize Pod placement with Affinity, Taints, Priority, and Topology Spread
- Resource Management - Ensure stability with QoS, LimitRange, ResourceQuota, and PDB
- Multi-Cluster - Manage multiple clusters with Cluster API and Fleet
2. Autoscaling Strategies
2.1 HPA (Horizontal Pod Autoscaler) Deep Dive
HPA is the most fundamental autoscaler: it adjusts the number of Pod replicas. The autoscaling/v2 API adds support for custom and external metrics on top of CPU and memory.
Basic HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # CPU-based scaling
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Memory-based scaling
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 10
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
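Under the hood, the controller applies one formula per metric and takes the largest result: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). A minimal shell sketch with illustrative numbers (not taken from a live cluster):

```shell
#!/bin/sh
# HPA formula: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
# Example: 10 replicas averaging 85% CPU against the 70% target above.
current_replicas=10
current_utilization=85
target_utilization=70

desired=$(awk -v r="$current_replicas" -v c="$current_utilization" -v t="$target_utilization" \
  'BEGIN { d = r * c / t; if (d > int(d)) d = int(d) + 1; print d }')
echo "desired replicas: $desired"   # 13, then clamped to [minReplicas, maxReplicas]
```

The behavior.scaleUp/scaleDown policies then rate-limit how fast the controller may actually move toward that number.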
Custom Metrics HPA
Using Prometheus Adapter, you can scale based on application-specific custom metrics.
# Prometheus Adapter configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "^(.*)$"
        as: "requests_per_second"
      metricsQuery: 'sum(rate(http_requests_total{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
# Custom metrics HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
External Metrics HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 30
  metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue: "order-processing"
      target:
        type: AverageValue
        averageValue: "5"
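For AverageValue targets like this one, the math is simpler: the total external metric is divided by the per-Pod target. A sketch with made-up queue numbers:

```shell
#!/bin/sh
# AverageValue targets: desiredReplicas = ceil(totalMetric / targetAverageValue)
# Example: 120 messages in the queue, target of 5 per worker Pod.
queue_depth=120
target_per_pod=5

desired=$(( (queue_depth + target_per_pod - 1) / target_per_pod ))  # integer ceiling
echo "desired replicas: $desired"   # 24, clamped to [1, 30] by this HPA
```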
2.2 VPA (Vertical Pod Autoscaler)
VPA automatically adjusts CPU/memory requests for Pods. It is especially useful in early stages when optimal resource requests are unknown.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Recreate, Auto
  resourcePolicy:
    containerPolicies:
    - containerName: api-server
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
VPA Operating Mode Comparison:
| Mode | Behavior | When to Use |
|---|---|---|
| Off | Provides recommendations only, no application | Initial analysis phase |
| Initial | Applied only at creation | Stable workloads |
| Recreate | Applied by recreating Pods | General operations |
| Auto | In-place if possible, otherwise recreate | Latest K8s environments |
Caution: Using HPA and VPA on the same metrics (CPU/memory) simultaneously causes conflicts. The recommended pattern is to use VPA in Off mode for recommendations only while HPA handles scaling.
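The recommendation-only pattern described above can be sketched like this, reusing the api-server Deployment from the earlier examples:

```yaml
# VPA in Off mode: writes recommendations to .status but never evicts Pods,
# so it cannot fight with an HPA scaling the same Deployment
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa-recommender
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"
```

Read the suggestions with kubectl get vpa api-server-vpa-recommender -o jsonpath='{.status.recommendation}' and fold them into the Deployment manifest manually.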
2.3 KEDA (Kubernetes Event-Driven Autoscaling)
KEDA scales workloads based on external event sources, supporting over 60 scalers.
# Install KEDA
# helm repo add kedacore https://kedacore.github.io/charts
# helm install keda kedacore/keda --namespace keda-system --create-namespace

# ScaledObject example: Kafka-based scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: kafka-consumer
  pollingInterval: 15
  cooldownPeriod: 300
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 50
  fallback:
    failureThreshold: 3
    replicas: 5
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.production.svc:9092
      consumerGroup: order-processor
      topic: orders
      lagThreshold: "100"
      offsetResetPolicy: latest
---
# ScaledObject example: AWS SQS-based scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaler
spec:
  scaleTargetRef:
    name: sqs-worker
  pollingInterval: 10
  cooldownPeriod: 60
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.ap-northeast-2.amazonaws.com/123456789012/order-queue
      queueLength: "5"
      awsRegion: ap-northeast-2
    authenticationRef:
      name: aws-credentials
---
# ScaledJob example: Batch job scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-processor
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: processor
          image: myapp/image-processor:latest
        restartPolicy: Never
  pollingInterval: 10
  maxReplicaCount: 20
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 5
  triggers:
  - type: redis-lists
    metadata:
      address: redis.production.svc:6379
      listName: image-processing-queue
      listLength: "3"
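The authenticationRef: aws-credentials used by the SQS scaler above needs a matching TriggerAuthentication object; a hedged sketch (the Secret name and key names are assumptions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-credentials
spec:
  secretTargetRef:
  - parameter: awsAccessKeyID
    name: aws-sqs-secret        # assumed Secret holding the credentials
    key: AWS_ACCESS_KEY_ID
  - parameter: awsSecretAccessKey
    name: aws-sqs-secret
    key: AWS_SECRET_ACCESS_KEY
```

On EKS, pod identity (IRSA) via spec.podIdentity is generally preferred over static keys.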
2.4 Karpenter - Next-Generation Node Autoscaler
Karpenter is a node provisioning engine that overcomes the limitations of Cluster Autoscaler by launching right-sized instances directly from pending Pod requirements instead of scaling fixed node groups.
# NodePool definition
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: general-purpose
spec:
template:
metadata:
labels:
team: platform
tier: general
spec:
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand", "spot"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["5"]
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
expireAfter: 720h
limits:
cpu: "1000"
memory: 2000Gi
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
---
# EC2NodeClass definition
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
amiSelectorTerms:
- alias: al2023@latest
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
role: KarpenterNodeRole-my-cluster
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 100Gi
volumeType: gp3
iops: 3000
throughput: 125
Cluster Autoscaler vs Karpenter Comparison:
| Aspect | Cluster Autoscaler | Karpenter |
|---|---|---|
| Node Selection | Node Group based | Workload requirements based |
| Provisioning Speed | Minutes | Seconds |
| Instance Variety | Fixed per group | Automatic optimal selection |
| Spot Handling | Manual configuration | Automatic price/availability optimization |
| Consolidation | Not supported | Automatic node consolidation |
| Cloud Support | All clouds | AWS (Azure preview) |
3. Advanced Scheduling
3.1 nodeSelector
The simplest node selection method.
# Nodes must carry matching labels, e.g.:
# kubectl label nodes gpu-node-1 accelerator=nvidia-tesla-v100
apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker
spec:
  nodeSelector:
    accelerator: nvidia-tesla-v100
    topology.kubernetes.io/zone: ap-northeast-2a
  containers:
  - name: gpu-worker
    image: myapp/gpu-worker:latest
    resources:
      limits:
        nvidia.com/gpu: 1
3.2 Node Affinity and Pod Affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        # Node Affinity: place on specific nodes
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - compute-optimized
                - general-purpose
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - ap-northeast-2a
          - weight: 20
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - ap-northeast-2c
        # Pod Affinity: place in same zone as specific Pods
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - redis-cache
            topologyKey: topology.kubernetes.io/zone
        # Pod Anti-Affinity: spread Pods of the same app across nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-frontend
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web-frontend
        image: myapp/web-frontend:latest
3.3 Taints and Tolerations
# Add Taints to nodes
# kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# kubectl taint nodes spot-node-1 spot=true:PreferNoSchedule

# GPU workload: tolerate the gpu Taint
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      tolerations:
      - key: "gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
      - name: trainer
        image: myapp/ml-trainer:latest
        resources:
          limits:
            nvidia.com/gpu: 4
---
# Spot instance workloads
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: "PreferNoSchedule"
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 60
      containers:
      - name: processor
        image: myapp/batch-processor:latest
3.4 Priority and Preemption
# PriorityClass definitions
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For critical production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: standard-production
value: 100000
globalDefault: true
preemptionPolicy: PreemptLowerPriority
description: "For standard production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
globalDefault: false
preemptionPolicy: Never
description: "For batch jobs. No preemption"
---
# Using a PriorityClass
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      priorityClassName: critical-production
      containers:
      - name: payment
        image: myapp/payment:latest
3.5 Topology Spread Constraints
A powerful feature that distributes Pods evenly across topology domains.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 12
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      topologySpreadConstraints:
      # Spread across AZs
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api-gateway
      # Spread across nodes
      - maxSkew: 2
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: api-gateway
        nodeAffinityPolicy: Honor
        nodeTaintsPolicy: Honor
      containers:
      - name: api-gateway
        image: myapp/api-gateway:latest
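Skew is simply the difference between the most and least populated topology domains; a quick sketch checking a hypothetical 5/4/3 zone layout against maxSkew: 1:

```shell
#!/bin/sh
# skew = (max Pods in any zone) - (min Pods in any zone); must be <= maxSkew
max_skew=1
skew=$(printf '5\n4\n3\n' | \
  awk 'NR==1 { max = min = $1 } { if ($1 > max) max = $1; if ($1 < min) min = $1 } END { print max - min }')
echo "skew: $skew"
if [ "$skew" -le "$max_skew" ]; then echo "satisfies maxSkew"; else echo "violates maxSkew"; fi
```

Here the skew is 2, so with whenUnsatisfiable: DoNotSchedule the scheduler would refuse this placement and force something like 4/4/4 for the 12 replicas.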
4. Resource Management
4.1 Understanding Requests vs Limits
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: myapp/demo:latest
    resources:
      # Used for scheduling. This amount of resources is guaranteed
      requests:
        cpu: 500m
        memory: 512Mi
        ephemeral-storage: 1Gi
      # Upper bound. CPU is throttled when exceeded, memory triggers OOMKill
      limits:
        cpu: "2"
        memory: 1Gi
        ephemeral-storage: 2Gi
4.2 QoS Classes
| QoS Class | Condition | OOM Kill Priority |
|---|---|---|
| Guaranteed | Every container has requests = limits for both CPU and memory | Lowest (killed last) |
| Burstable | At least one container sets requests or limits, but the Guaranteed condition is not met | Medium |
| BestEffort | No container sets any requests or limits | Highest (killed first) |
# Guaranteed QoS
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: myapp/critical:latest
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
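For contrast, a Pod whose requests are lower than its limits (or that sets only some values) is classified as Burstable:

```yaml
# Burstable QoS: requests set but lower than limits
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: myapp/demo:latest
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 1Gi
```

kubectl get pod burstable-pod -o jsonpath='{.status.qosClass}' shows the class the kubelet assigned.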
4.3 LimitRange and ResourceQuota
# LimitRange: per-Pod/Container resource limits within a namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-backend
spec:
  limits:
  - type: Container
    default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    max:
      cpu: "4"
      memory: 8Gi
    min:
      cpu: 50m
      memory: 64Mi
  - type: Pod
    max:
      cpu: "8"
      memory: 16Gi
  - type: PersistentVolumeClaim
    max:
      storage: 100Gi
    min:
      storage: 1Gi
---
# ResourceQuota: total resource cap for the entire namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services: "20"
    persistentvolumeclaims: "30"
    requests.storage: 500Gi
    count/deployments.apps: "30"
    count/configmaps: "50"
    count/secrets: "50"
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values:
      - standard-production
      - critical-production
4.4 PodDisruptionBudget (PDB)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: "60%"
  selector:
    matchLabels:
      app: api-server
  unhealthyPodEvictionPolicy: IfHealthyBudget
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: redis-cluster
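How a percentage budget turns into evictable Pods, sketched with the 60% figure above applied to a hypothetical 5-replica Deployment:

```shell
#!/bin/sh
# minAvailable: 60% -> required healthy = ceil(replicas * 60 / 100)
# allowed disruptions = healthy replicas - required healthy
replicas=5
required=$(( (replicas * 60 + 99) / 100 ))  # integer ceiling of 60%
allowed=$(( replicas - required ))
echo "required healthy: $required, allowed disruptions: $allowed"   # 3 and 2
```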
5. Multi-Cluster Operations
5.1 Cluster API
Cluster API is a project for declaratively creating and managing Kubernetes clusters.
# Cluster definition
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-cluster
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: production-cluster
---
# Control Plane definition
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: production-control-plane
  namespace: clusters
spec:
  replicas: 3
  version: v1.30.2
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
      kind: AWSMachineTemplate
      name: production-control-plane
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          audit-log-maxage: "30"
          audit-log-maxbackup: "10"
          enable-admission-plugins: "NodeRestriction,PodSecurity"
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
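The Cluster and control plane above still need worker machines; a hedged MachineDeployment sketch (names, replica count, and the KubeadmConfigTemplate are assumptions, and the selector is left for Cluster API to default):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: production-workers
  namespace: clusters
spec:
  clusterName: production-cluster
  replicas: 3
  selector:
    matchLabels: {}
  template:
    spec:
      clusterName: production-cluster
      version: v1.30.2
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: production-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachineTemplate
        name: production-workers
```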
5.2 Fleet/Rancher Multi-Cluster Management
# Fleet GitRepo: deploy across multiple clusters
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: platform-apps
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/platform-apps
  branch: main
  paths:
  - monitoring/
  - logging/
  - ingress/
  targets:
  - name: production
    clusterSelector:
      matchLabels:
        env: production
  - name: staging
    clusterSelector:
      matchLabels:
        env: staging
---
# Fleet Bundle customization
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: app-deployments
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/app-deployments
  branch: main
  targets:
  - name: us-east
    clusterSelector:
      matchLabels:
        region: us-east
    helm:
      values:
        replicaCount: 5
        ingress:
          host: api-us.mycompany.com
  - name: ap-northeast
    clusterSelector:
      matchLabels:
        region: ap-northeast
    helm:
      values:
        replicaCount: 3
        ingress:
          host: api-ap.mycompany.com
5.3 Multi-Cluster Architecture Patterns
| Pattern | Description | When to Use |
|---|---|---|
| Hub-Spoke | Central management cluster controls worker clusters | Basic multi-cluster |
| Federation | KubeFed syncs resources across clusters | Same app multi-region |
| Service Mesh | Istio Multi-cluster for inter-cluster communication | Distributed microservices |
| Virtual Kubelet | Admiralty, Liqo connect virtual nodes | Burst workloads |
6. Cluster Upgrade Strategies
6.1 In-place Upgrade
#!/bin/bash
# Control Plane upgrade
echo "=== Starting Control Plane Upgrade ==="

# 1. Check current version
kubectl get nodes
kubectl version

# 2. Upgrade kubeadm
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.2-1.1
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.30.2

# 3. Upgrade kubelet and kubectl
sudo apt-get install -y kubelet=1.30.2-1.1 kubectl=1.30.2-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet

echo "=== Sequential Worker Node Upgrade ==="
NODES=$(kubectl get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[*].metadata.name}')
for NODE in $NODES; do
  echo "--- Starting upgrade for $NODE ---"
  # Cordon: prevent new Pod scheduling
  kubectl cordon "$NODE"
  # Drain: evict existing Pods (respects PDBs)
  kubectl drain "$NODE" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --grace-period=120 \
    --timeout=300s
  # Run the kubeadm/kubelet upgrade on the node itself (via SSH or an automation tool)
  echo "Running kubeadm and kubelet upgrade on node $NODE"
  # Uncordon: resume scheduling
  kubectl uncordon "$NODE"
  # Verify the node returns to Ready
  kubectl wait --for=condition=Ready "node/$NODE" --timeout=300s
  echo "--- Upgrade complete for $NODE ---"
done
6.2 Blue-Green Cluster Upgrade
# Create a new cluster with Cluster API
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-v130
  namespace: clusters
  labels:
    upgrade-group: production
    version: v1.30
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-v130-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: production-v130
7. Troubleshooting
7.1 Using kubectl debug
# Add debug container to running Pod
kubectl debug -it pod/api-server-abc123 \
--image=nicolaka/netshoot \
--target=api-server \
-- /bin/bash
# Node debugging
kubectl debug node/worker-1 \
-it --image=ubuntu:22.04 \
-- /bin/bash
# Debug with Pod copy (image change)
kubectl debug pod/api-server-abc123 \
-it --copy-to=debug-pod \
--container=api-server \
--image=myapp/api-server:debug \
-- /bin/sh
7.2 Common Issues and Solutions
Pending Pod Issues:
# Check why Pod is Pending
kubectl describe pod pending-pod-name
# Common causes:
# 1. Insufficient resources -> Add nodes or adjust resources
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_ALLOC:.status.allocatable.cpu,\
MEM_ALLOC:.status.allocatable.memory,\
CPU_CAP:.status.capacity.cpu
# 2. nodeSelector/affinity mismatch -> Check labels
kubectl get nodes --show-labels
# 3. Taints blocking -> Add tolerations
kubectl describe nodes | grep -A5 Taints
CrashLoopBackOff Resolution:
# Check logs
kubectl logs pod/crashing-pod --previous
kubectl logs pod/crashing-pod -c init-container-name
# Check events
kubectl get events --sort-by=.lastTimestamp \
--field-selector involvedObject.name=crashing-pod
# Check for OOM Kill
kubectl describe pod crashing-pod | grep -A5 "Last State"
# If OOMKilled appears, increase memory limits
# Run in debug mode
kubectl debug pod/crashing-pod \
-it --copy-to=debug-pod \
--container=app \
--image=busybox \
-- /bin/sh
Network Issue Diagnosis:
# DNS check
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never \
-- nslookup kubernetes.default.svc.cluster.local
# Service connectivity test
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never \
-- curl -v http://api-server.production.svc:8080/health
# Check network policies
kubectl get networkpolicy -A
kubectl describe networkpolicy -n production
8. Cost Optimization
8.1 Leveraging Spot Nodes
# Karpenter Spot NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workloads
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "500"
    memory: 1000Gi
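Workloads opt into this pool by selecting the capacity-type label Karpenter stamps on its nodes (the Deployment name is illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-batch
spec:
  replicas: 4
  selector:
    matchLabels:
      app: spot-batch
  template:
    metadata:
      labels:
        app: spot-batch
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot
      containers:
      - name: worker
        image: myapp/batch:latest
```

Pairing this with a toleration for any spot-specific taint keeps on-demand nodes free for latency-sensitive services.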
8.2 Right-sizing Automation
#!/bin/bash
# Right-sizing report: current usage (kubectl top) plus VPA recommendations
echo "=== Current Resource Usage by Namespace ==="
for NS in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  CPU_USE=$(kubectl top pods -n "$NS" --no-headers 2>/dev/null | \
    awk '{sum += $2} END {print sum}')
  MEM_USE=$(kubectl top pods -n "$NS" --no-headers 2>/dev/null | \
    awk '{sum += $3} END {print sum}')
  if [ -n "$CPU_USE" ] && [ "$CPU_USE" != "0" ]; then
    echo "Namespace: $NS | CPU: ${CPU_USE}m | Memory: ${MEM_USE}Mi"
  fi
done

echo ""
echo "=== VPA Recommendations ==="
for VPA in $(kubectl get vpa -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
  NS=$(echo "$VPA" | cut -d'/' -f1)
  NAME=$(echo "$VPA" | cut -d'/' -f2)
  echo "--- $VPA ---"
  kubectl get vpa "$NAME" -n "$NS" -o jsonpath='{.status.recommendation.containerRecommendations[*]}'
  echo ""
done
8.3 Namespace Cost Allocation
# Using Kubecost or OpenCost
# helm install kubecost kubecost/cost-analyzer \
#   --namespace kubecost --create-namespace \
#   --set prometheus.enabled=false \
#   --set prometheus.fqdn=http://prometheus-server.monitoring:80

# Cost allocation via namespace labels
apiVersion: v1
kind: Namespace
metadata:
  name: team-backend
  labels:
    cost-center: "backend-team"
    department: "engineering"
    project: "api-platform"
    environment: "production"
9. Practice Quiz
Q1: What component is needed for HPA to scale on custom metrics?
Answer: A custom metrics API server like Prometheus Adapter (or Datadog Cluster Agent, etc.) is required.
- Prometheus Adapter exposes Prometheus metrics through the Kubernetes Custom Metrics API (custom.metrics.k8s.io)
- HPA queries this API to retrieve custom metric values and make scaling decisions
- Flow: Prometheus collects -> Adapter transforms -> HPA queries -> Scaling executes
Q2: What conditions must be met for Guaranteed QoS Class?
Answer: All containers in the Pod must have CPU and memory requests and limits set, and each pair must be equal.
- requests.cpu = limits.cpu
- requests.memory = limits.memory
- Applies to all containers (including init containers)
- Guaranteed Pods are the last to be OOM Killed under node memory pressure
Q3: What is the difference between requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution in podAntiAffinity?
Answer: Required means the condition must be satisfied for the Pod to be scheduled. If no node satisfies the condition, the Pod stays Pending. Preferred means the scheduler tries to place the Pod where conditions are met, but will place it elsewhere if necessary.
- required: Hard constraint (mandatory)
- preferred: Soft constraint, priority adjustable via weight
- IgnoredDuringExecution: Already running Pods are not evicted even if conditions change
Q4: How does Karpenter consolidation save costs?
Answer: Karpenter consolidation moves Pods from idle or underutilized nodes to other nodes, then removes empty nodes or replaces them with smaller (cheaper) instances.
- WhenEmpty: Only removes nodes with no Pods
- WhenEmptyOrUnderutilized: Also relocates Pods from underutilized nodes before removing
- Can replace nodes with cheaper instance types (e.g., swapping an underutilized c5.2xlarge for a c5.xlarge)
- consolidateAfter sets the stabilization wait time
Q5: How does PodDisruptionBudget affect cluster upgrades?
Answer: PDB limits the number of Pods that can be simultaneously disrupted during voluntary disruptions.
- During kubectl drain, PDB is respected to evict Pods sequentially
- minAvailable: Guarantees minimum available Pod count/percentage
- maxUnavailable: Limits maximum disrupted Pod count/percentage
- Overly strict PDBs can cause drain timeouts
- unhealthyPodEvictionPolicy: AlwaysAllow lets not-Ready Pods be evicted even when the budget is exhausted; the default IfHealthyBudget only allows this while the budget is satisfied
10. References
- Kubernetes Official Docs - HPA
- Kubernetes Official Docs - VPA
- KEDA Official Documentation
- Karpenter Official Documentation
- Kubernetes Scheduling
- Topology Spread Constraints
- Resource Management
- Cluster API
- Fleet Manager
- Kubecost
- Kubernetes Best Practices - Google
- EKS Best Practices Guide
- Pod Priority and Preemption