Kubernetes Advanced Operations Guide 2025: Autoscaling, Scheduling, Resource Management, Multi-Cluster
Table of Contents
1. Introduction: Why Advanced Kubernetes Operations Matter
Running Kubernetes in production reveals challenges that basic deployments cannot address. Pods may not scale fast enough during traffic spikes, workloads may pile onto specific nodes and turn one failure into a cascading one, or costs may explode without proper resource management.
This guide covers four core areas of advanced Kubernetes operations:
- Autoscaling - Scale workloads and infrastructure automatically with HPA, VPA, KEDA, and Karpenter
- Scheduling - Optimize Pod placement with Affinity, Taints, Priority, and Topology Spread
- Resource Management - Ensure stability with QoS, LimitRange, ResourceQuota, and PDB
- Multi-Cluster - Manage multiple clusters with Cluster API and Fleet
2. Autoscaling Strategies
2.1 HPA (Horizontal Pod Autoscaler) Deep Dive
HPA is the most fundamental autoscaler, adjusting the number of Pods automatically. The v2 API adds support for custom and external metrics.
Basic HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # CPU-based scaling
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Memory-based scaling
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
      - type: Pods
        value: 10
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
Custom Metrics HPA
With Prometheus Adapter, you can scale on application-specific custom metrics.
# Prometheus Adapter configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "^(.*)$"
        as: "requests_per_second"
      metricsQuery: 'sum(rate(http_requests_total{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
# Custom metrics HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
External Metrics HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 30
  metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue: "order-processing"
      target:
        type: AverageValue
        averageValue: "5"
2.2 VPA (Vertical Pod Autoscaler)
VPA automatically adjusts a Pod's CPU/memory requests. It is especially useful early on, when you do not yet know the right resource requests.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Recreate, Auto
  resourcePolicy:
    containerPolicies:
    - containerName: api-server
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits
VPA operating modes compared:
| Mode | Behavior | When to Use |
|---|---|---|
| Off | Recommendations only, never applied | Initial analysis phase |
| Initial | Applied only at Pod creation | Stable workloads |
| Recreate | Applied by recreating Pods | General operations |
| Auto | Currently recreates Pods; in-place updates as newer VPA/K8s releases support them | Latest K8s environments |
Caution: Running HPA and VPA on the same metrics (CPU/memory) causes conflicts. The recommended pattern is to run VPA in Off mode for recommendations only, while HPA handles the actual scaling.
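That recommendation-only pattern is a one-line change: a minimal sketch of a VPA that observes the same Deployment an HPA scales but never applies changes (the target name api-server follows the earlier examples; the VPA name is illustrative):

```yaml
# VPA in recommendation-only mode; safe to run next to an HPA
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa-recommender
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # quoted, or YAML parses it as boolean false
```

Read the resulting numbers from status.recommendation.containerRecommendations (e.g. via kubectl describe vpa) and fold them into the Deployment's requests manually or through CI.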
2.3 KEDA (Kubernetes Event-Driven Autoscaling)
KEDA scales workloads based on external event sources and supports more than 60 scalers.
# Install KEDA
# helm repo add kedacore https://kedacore.github.io/charts
# helm install keda kedacore/keda --namespace keda-system --create-namespace

# ScaledObject example: Kafka-based scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: kafka-consumer
  pollingInterval: 15
  cooldownPeriod: 300
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 50
  fallback:
    failureThreshold: 3
    replicas: 5
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka.production.svc:9092
      consumerGroup: order-processor
      topic: orders
      lagThreshold: "100"
      offsetResetPolicy: latest
---
# ScaledObject example: AWS SQS-based scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaler
spec:
  scaleTargetRef:
    name: sqs-worker
  pollingInterval: 10
  cooldownPeriod: 60
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.ap-northeast-2.amazonaws.com/123456789012/order-queue
      queueLength: "5"
      awsRegion: ap-northeast-2
    authenticationRef:
      name: aws-credentials
---
# ScaledJob example: batch job scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-processor
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: processor
          image: myapp/image-processor:latest
        restartPolicy: Never
  pollingInterval: 10
  maxReplicaCount: 20
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 5
  triggers:
  - type: redis-lists
    metadata:
      address: redis.production.svc:6379
      listName: image-processing-queue
      listLength: "3"
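The SQS ScaledObject above points at authenticationRef: aws-credentials without defining it. A hedged sketch of what that TriggerAuthentication could look like, assuming the credentials live in a Secret named aws-secret (the Secret name and keys are illustrative; the parameter names follow KEDA's aws-sqs-queue scaler):

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-credentials
  namespace: production
spec:
  secretTargetRef:
  - parameter: awsAccessKeyID        # parameter the aws-sqs-queue scaler expects
    name: aws-secret                 # Secret holding the credentials (illustrative)
    key: AWS_ACCESS_KEY_ID
  - parameter: awsSecretAccessKey
    name: aws-secret
    key: AWS_SECRET_ACCESS_KEY
```

In production, prefer pod identity (e.g. IRSA via spec.podIdentity) over long-lived access keys.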
2.4 Karpenter - Next-Generation Node Autoscaler
Karpenter is a node provisioning engine that overcomes the limitations of Cluster Autoscaler.
# NodePool definition
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    metadata:
      labels:
        team: platform
        tier: general
    spec:
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
  limits:
    cpu: "1000"
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
# EC2NodeClass definition
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
  - alias: al2023@latest
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: my-cluster
  role: KarpenterNodeRole-my-cluster
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 100Gi
      volumeType: gp3
      iops: 3000
      throughput: 125
Cluster Autoscaler vs Karpenter:
| Aspect | Cluster Autoscaler | Karpenter |
|---|---|---|
| Node selection | Node Group based | Workload requirements based |
| Provisioning speed | Minutes | Seconds |
| Instance variety | Fixed per group | Automatic optimal selection |
| Spot handling | Manual configuration | Automatic price/availability optimization |
| Consolidation | Not supported | Automatic node consolidation |
| Cloud support | All clouds | AWS (Azure preview) |
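One Karpenter knob worth knowing alongside consolidation: disruption budgets cap how many nodes Karpenter may disrupt at once, which keeps consolidation from churning the cluster. A sketch of the relevant NodePool spec fragment (the values are illustrative; when multiple budgets are listed, the most restrictive one applies):

```yaml
# Fragment of a NodePool spec: allow at most 10% of nodes,
# and never more than 2 nodes, to be disrupted at the same time
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 30s
  budgets:
  - nodes: "10%"
  - nodes: "2"
```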
3. Advanced Scheduling
3.1 nodeSelector
The simplest node selection mechanism.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker
spec:
  nodeSelector:
    accelerator: nvidia-tesla-v100
    topology.kubernetes.io/zone: ap-northeast-2a
  containers:
  - name: gpu-worker
    image: myapp/gpu-worker:latest
    resources:
      limits:
        nvidia.com/gpu: 1
3.2 Node Affinity and Pod Affinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        # Node Affinity: place on specific nodes
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - compute-optimized
                - general-purpose
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - ap-northeast-2a
          - weight: 20
            preference:
              matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - ap-northeast-2c
        # Pod Affinity: co-locate with specific Pods in the same node/zone
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - redis-cache
            topologyKey: topology.kubernetes.io/zone
        # Pod Anti-Affinity: spread Pods of the same app
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-frontend
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web-frontend
        image: myapp/web-frontend:latest
3.3 Taints and Tolerations
# Add Taints to nodes
# kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# kubectl taint nodes spot-node-1 spot=true:PreferNoSchedule

# GPU workload: tolerate the gpu Taint
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      tolerations:
      - key: "gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
      - name: trainer
        image: myapp/ml-trainer:latest
        resources:
          limits:
            nvidia.com/gpu: 4
---
# Spot instance workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: "PreferNoSchedule"
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 60
      containers:
      - name: processor
        image: myapp/batch-processor:latest
3.4 Priority and Preemption
# PriorityClass definitions
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For critical production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: standard-production
value: 100000
globalDefault: true
preemptionPolicy: PreemptLowerPriority
description: "For standard production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
globalDefault: false
preemptionPolicy: Never
description: "For batch jobs. Does not preempt"
---
# Using a PriorityClass
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      priorityClassName: critical-production
      containers:
      - name: payment
        image: myapp/payment:latest
3.5 Topology Spread Constraints
A powerful feature that distributes Pods evenly across topology domains.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 12
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      topologySpreadConstraints:
      # Spread across availability zones
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api-gateway
      # Spread across nodes
      - maxSkew: 2
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: api-gateway
        nodeAffinityPolicy: Honor
        nodeTaintsPolicy: Honor
      containers:
      - name: api-gateway
        image: myapp/api-gateway:latest
4. Resource Management
4.1 Understanding Requests vs Limits
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: myapp/demo:latest
    resources:
      # Used for scheduling; this much is guaranteed to the container
      requests:
        cpu: 500m
        memory: 512Mi
        ephemeral-storage: 1Gi
      # Upper bound; exceeding it throttles CPU, OOM-kills on memory
      limits:
        cpu: "2"
        memory: 1Gi
        ephemeral-storage: 2Gi
4.2 QoS Classes
| QoS Class | Condition | OOM Kill Priority |
|---|---|---|
| Guaranteed | Every container has requests = limits for both CPU and memory | Lowest (killed last) |
| Burstable | At least one container sets requests or limits, but not all equal | Middle |
| BestEffort | No requests or limits set on any container | Highest (killed first) |
# Guaranteed QoS
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
  - name: app
    image: myapp/critical:latest
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
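For contrast, a minimal Burstable sketch (the Pod name and image are illustrative): requests below limits let the container burst beyond its reservation on a loaded node, at the cost of being OOM-kill eligible earlier than a Guaranteed Pod:

```yaml
# Burstable QoS: requests set lower than limits
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
  - name: app
    image: myapp/worker:latest  # illustrative image
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 1Gi
```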
4.3 LimitRange and ResourceQuota
# LimitRange: per-Pod/per-container resource bounds within a namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-backend
spec:
  limits:
  - type: Container
    default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    max:
      cpu: "4"
      memory: 8Gi
    min:
      cpu: 50m
      memory: 64Mi
  - type: Pod
    max:
      cpu: "8"
      memory: 16Gi
  - type: PersistentVolumeClaim
    max:
      storage: 100Gi
    min:
      storage: 1Gi
---
# ResourceQuota: aggregate resource caps for the whole namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services: "20"
    persistentvolumeclaims: "30"
    requests.storage: 500Gi
    count/deployments.apps: "30"
    count/configmaps: "50"
    count/secrets: "50"
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values:
      - standard-production
      - critical-production
4.4 PodDisruptionBudget (PDB)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: "60%"
  selector:
    matchLabels:
      app: api-server
  unhealthyPodEvictionPolicy: IfHealthyBudget
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: redis-cluster
5. Multi-Cluster Operations
5.1 Cluster API
Cluster API is a project for creating and managing Kubernetes clusters declaratively.
# Cluster definition
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-cluster
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: production-cluster
---
# Control plane definition
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: production-control-plane
  namespace: clusters
spec:
  replicas: 3
  version: v1.30.2
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
      kind: AWSMachineTemplate
      name: production-control-plane
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          audit-log-maxage: "30"
          audit-log-maxbackup: "10"
          enable-admission-plugins: "NodeRestriction,PodSecurity"
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external
5.2 Fleet/Rancher Multi-Cluster Management
# Fleet GitRepo: deploy to multiple clusters
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: platform-apps
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/platform-apps
  branch: main
  paths:
  - monitoring/
  - logging/
  - ingress/
  targets:
  - name: production
    clusterSelector:
      matchLabels:
        env: production
  - name: staging
    clusterSelector:
      matchLabels:
        env: staging
---
# Fleet Bundle customization per target
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: app-deployments
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/app-deployments
  branch: main
  targets:
  - name: us-east
    clusterSelector:
      matchLabels:
        region: us-east
    helm:
      values:
        replicaCount: 5
        ingress:
          host: api-us.mycompany.com
  - name: ap-northeast
    clusterSelector:
      matchLabels:
        region: ap-northeast
    helm:
      values:
        replicaCount: 3
        ingress:
          host: api-ap.mycompany.com
5.3 Multi-Cluster Architecture Patterns
| Pattern | Description | When to Use |
|---|---|---|
| Hub-Spoke | A central management cluster controls worker clusters | Basic multi-cluster |
| Federation | Sync resources across clusters with KubeFed | Same app in multiple regions |
| Service Mesh | Cross-cluster communication with Istio multi-cluster | Distributed microservices |
| Virtual Kubelet | Attach virtual nodes with Admiralty or Liqo | Burst workloads |
6. Cluster Upgrade Strategies
6.1 In-place Upgrade
#!/bin/bash
# Control plane upgrade
echo "=== Starting control plane upgrade ==="

# 1. Check current versions
kubectl get nodes
kubectl version

# 2. Upgrade kubeadm
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.2-1.1
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.30.2

# 3. Upgrade kubelet and kubectl
sudo apt-get install -y kubelet=1.30.2-1.1 kubectl=1.30.2-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet

echo "=== Rolling worker node upgrade ==="
NODES=$(kubectl get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[*].metadata.name}')
for NODE in $NODES; do
  echo "--- Upgrading $NODE ---"

  # Cordon: block new Pod scheduling
  kubectl cordon "$NODE"

  # Drain: evict existing Pods
  kubectl drain "$NODE" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --grace-period=120 \
    --timeout=300s

  # Run the upgrade on the node itself (via SSH or an automation tool)
  echo "Upgrade kubeadm and kubelet on node $NODE"

  # Uncordon: resume scheduling
  kubectl uncordon "$NODE"

  # Wait for the node to become Ready
  kubectl wait --for=condition=Ready "node/$NODE" --timeout=300s
  echo "--- $NODE upgrade complete ---"
done
6.2 Blue-Green Cluster Upgrade
# Create the new cluster with Cluster API
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-v130
  namespace: clusters
  labels:
    upgrade-group: production
    version: v1.30
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-v130-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: production-v130
7. Troubleshooting
7.1 Using kubectl debug
# Attach a debug container to a running Pod
kubectl debug -it pod/api-server-abc123 \
  --image=nicolaka/netshoot \
  --target=api-server \
  -- /bin/bash

# Debug a node
kubectl debug node/worker-1 \
  -it --image=ubuntu:22.04 \
  -- /bin/bash

# Debug a copy of a problem Pod (with a different image)
kubectl debug pod/api-server-abc123 \
  -it --copy-to=debug-pod \
  --container=api-server \
  --image=myapp/api-server:debug \
  -- /bin/sh
7.2 Common Problems and Fixes
Pending Pods:
# Why is the Pod Pending?
kubectl describe pod pending-pod-name

# Common causes:
# 1. Insufficient resources -> add nodes or adjust requests
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_ALLOC:.status.allocatable.cpu,\
MEM_ALLOC:.status.allocatable.memory,\
CPU_CAP:.status.capacity.cpu

# 2. nodeSelector/affinity mismatch -> check node labels
kubectl get nodes --show-labels

# 3. Taints -> add Tolerations
kubectl describe nodes | grep -A5 Taints
Fixing CrashLoopBackOff:
# Check logs
kubectl logs pod/crashing-pod --previous
kubectl logs pod/crashing-pod -c init-container-name

# Check events
kubectl get events --sort-by=.lastTimestamp \
  --field-selector involvedObject.name=crashing-pod

# Check for OOM kills
kubectl describe pod crashing-pod | grep -A5 "Last State"
# If you see OOMKilled, increase the memory limits

# Run a debug copy
kubectl debug pod/crashing-pod \
  -it --copy-to=debug-pod \
  --container=app \
  --image=busybox \
  -- /bin/sh
Diagnosing network issues:
# Check DNS
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local

# Test service connectivity
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never \
  -- curl -v http://api-server.production.svc:8080/health

# Check network policies
kubectl get networkpolicy -A
kubectl describe networkpolicy -n production
8. Cost Optimization
8.1 Using Spot Nodes
# Karpenter Spot NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workloads
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ["c", "m", "r"]
      - key: karpenter.k8s.aws/instance-generation
        operator: Gt
        values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "500"
    memory: 1000Gi
8.2 Automating Right-sizing
#!/bin/bash
# Right-sizing report from current usage and VPA recommendations
echo "=== Resource usage by namespace ==="
for NS in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  CPU_USE=$(kubectl top pods -n "$NS" --no-headers 2>/dev/null | \
    awk '{sum += $2} END {print sum}')
  MEM_USE=$(kubectl top pods -n "$NS" --no-headers 2>/dev/null | \
    awk '{sum += $3} END {print sum}')
  if [ -n "$CPU_USE" ] && [ "$CPU_USE" != "0" ]; then
    echo "Namespace: $NS | CPU: ${CPU_USE}m | Memory: ${MEM_USE}Mi"
  fi
done

echo ""
echo "=== VPA recommendations ==="
for VPA in $(kubectl get vpa -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
  NS=$(echo "$VPA" | cut -d'/' -f1)
  NAME=$(echo "$VPA" | cut -d'/' -f2)
  echo "--- $VPA ---"
  kubectl get vpa "$NAME" -n "$NS" -o jsonpath='{.status.recommendation.containerRecommendations[*]}'
  echo ""
done
8.3 Cost Allocation by Namespace
# Use Kubecost or OpenCost
# helm install kubecost kubecost/cost-analyzer \
#   --namespace kubecost --create-namespace \
#   --set prometheus.enabled=false \
#   --set prometheus.fqdn=http://prometheus-server.monitoring:80

# Allocate costs via namespace labels
apiVersion: v1
kind: Namespace
metadata:
  name: team-backend
  labels:
    cost-center: "backend-team"
    department: "engineering"
    project: "api-platform"
    environment: "production"
9. Hands-on Quiz
Q1: Which component does HPA need in order to scale on custom metrics?
Answer: A custom metrics API server, such as Prometheus Adapter (or the Datadog Cluster Agent, etc.).
- Prometheus Adapter exposes Prometheus metrics through the Kubernetes Custom Metrics API (custom.metrics.k8s.io)
- HPA queries this API for custom metric values and bases its scaling decisions on them
- Flow: Prometheus scrapes -> Adapter translates -> HPA queries -> scaling runs
Q2: What conditions must be met for the Guaranteed QoS class?
Answer: Every container in the Pod must set CPU and memory requests and limits, and each pair must be equal.
- requests.cpu = limits.cpu
- requests.memory = limits.memory
- Applies to all containers (including init containers)
- Guaranteed Pods are OOM-killed last under node memory pressure
Q3: In podAntiAffinity, what is the difference between requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution?
Answer: With required, the Pod is scheduled only where the condition holds; if no node qualifies, it stays Pending. With preferred, the scheduler favors nodes that satisfy the condition but falls back to other nodes when necessary.
- required: hard constraint
- preferred: soft constraint, prioritized by weight
- IgnoredDuringExecution: Pods already running are not evicted when conditions later change
Q4: How does Karpenter's consolidation feature save cost?
Answer: Consolidation moves Pods off idle or underutilized nodes, then removes the empty nodes or replaces them with smaller (cheaper) instances.
- WhenEmpty: removes only nodes with no Pods
- WhenEmptyOrUnderutilized: also reschedules Pods off underutilized nodes before removing them
- Can swap in cheaper instance types (e.g., two c5.2xlarge replaced by one c5.4xlarge)
- consolidateAfter sets the stabilization delay
Q5: How does a PodDisruptionBudget affect cluster upgrades?
Answer: A PDB limits how many Pods may be disrupted at once during voluntary disruptions.
- kubectl drain honors PDBs and evicts Pods gradually
- minAvailable: guarantees a minimum number/percentage of available Pods
- maxUnavailable: caps the number/percentage of disrupted Pods
- An overly strict PDB can make drain time out
- unhealthyPodEvictionPolicy: AlwaysAllow lets unhealthy Pods be evicted regardless of the budget
10. References
Kubernetes Advanced Operations Guide 2025: Autoscaling, Scheduling, Resource Management, Multi-Cluster
Table of Contents
1. Introduction: Why Advanced Kubernetes Operations Matter
Running Kubernetes in production reveals challenges that basic deployments cannot address. Pods may not scale fast enough during traffic spikes, workloads may cluster on specific nodes causing cascading failures, or costs may explode without proper resource management.
This guide covers four core areas of advanced Kubernetes operations:
- Autoscaling - Scale workloads and infrastructure automatically with HPA, VPA, KEDA, and Karpenter
- Scheduling - Optimize Pod placement with Affinity, Taints, Priority, and Topology Spread
- Resource Management - Ensure stability with QoS, LimitRange, ResourceQuota, and PDB
- Multi-Cluster - Manage multiple clusters with Cluster API and Fleet
2. Autoscaling Strategies
2.1 HPA (Horizontal Pod Autoscaler) Deep Dive
HPA is the most fundamental autoscaler that adjusts the number of Pods. The v2 API supports custom and external metrics.
Basic HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-server-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
minReplicas: 3
maxReplicas: 50
metrics:
# CPU-based scaling
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Memory-based scaling
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleUp:
stabilizationWindowSeconds: 30
policies:
- type: Percent
value: 100
periodSeconds: 60
- type: Pods
value: 10
periodSeconds: 60
selectPolicy: Max
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
Custom Metrics HPA
Using Prometheus Adapter, you can scale based on application-specific custom metrics.
# Prometheus Adapter configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-adapter-config
namespace: monitoring
data:
config.yaml: |
rules:
- seriesQuery: 'http_requests_per_second{namespace!="",pod!=""}'
resources:
overrides:
namespace:
resource: namespace
pod:
resource: pod
name:
matches: "^(.*)$"
as: "requests_per_second"
metricsQuery: 'sum(rate(http_requests_total{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
# Custom metrics HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-server-custom-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
minReplicas: 3
maxReplicas: 100
metrics:
- type: Pods
pods:
metric:
name: requests_per_second
target:
type: AverageValue
averageValue: "1000"
External Metrics HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: queue-worker-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: queue-worker
minReplicas: 1
maxReplicas: 30
metrics:
- type: External
external:
metric:
name: sqs_queue_depth
selector:
matchLabels:
queue: "order-processing"
target:
type: AverageValue
averageValue: "5"
2.2 VPA (Vertical Pod Autoscaler)
VPA automatically adjusts CPU/memory requests for Pods. It is especially useful in early stages when optimal resource requests are unknown.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-server-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
updatePolicy:
updateMode: "Auto" # Off, Initial, Recreate, Auto
resourcePolicy:
containerPolicies:
- containerName: api-server
minAllowed:
cpu: 100m
memory: 128Mi
maxAllowed:
cpu: 4
memory: 8Gi
controlledResources: ["cpu", "memory"]
controlledValues: RequestsAndLimits
VPA Operating Mode Comparison:
| Mode | Behavior | When to Use |
|---|---|---|
| Off | Provides recommendations only, no application | Initial analysis phase |
| Initial | Applied only at creation | Stable workloads |
| Recreate | Applied by recreating Pods | General operations |
| Auto | In-place if possible, otherwise recreate | Latest K8s environments |
Caution: Using HPA and VPA on the same metrics (CPU/memory) simultaneously causes conflicts. The recommended pattern is to use VPA in Off mode for recommendations only while HPA handles scaling.
2.3 KEDA (Kubernetes Event-Driven Autoscaling)
KEDA scales workloads based on external event sources, supporting over 60 scalers.
# Install KEDA
# helm repo add kedacore https://kedacore.github.io/charts
# helm install keda kedacore/keda --namespace keda-system --create-namespace
# ScaledObject example: Kafka-based scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: kafka-consumer-scaler
namespace: production
spec:
scaleTargetRef:
name: kafka-consumer
pollingInterval: 15
cooldownPeriod: 300
idleReplicaCount: 0
minReplicaCount: 1
maxReplicaCount: 50
fallback:
failureThreshold: 3
replicas: 5
triggers:
- type: kafka
metadata:
bootstrapServers: kafka.production.svc:9092
consumerGroup: order-processor
topic: orders
lagThreshold: "100"
offsetResetPolicy: latest
---
# ScaledObject example: AWS SQS-based scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: sqs-worker-scaler
spec:
scaleTargetRef:
name: sqs-worker
pollingInterval: 10
cooldownPeriod: 60
minReplicaCount: 0
maxReplicaCount: 100
triggers:
- type: aws-sqs-queue
metadata:
queueURL: https://sqs.ap-northeast-2.amazonaws.com/123456789012/order-queue
queueLength: "5"
awsRegion: ap-northeast-2
authenticationRef:
name: aws-credentials
---
# ScaledJob example: Batch job scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
name: image-processor
spec:
jobTargetRef:
template:
spec:
containers:
- name: processor
image: myapp/image-processor:latest
restartPolicy: Never
pollingInterval: 10
maxReplicaCount: 20
successfulJobsHistoryLimit: 10
failedJobsHistoryLimit: 5
triggers:
- type: redis-lists
metadata:
address: redis.production.svc:6379
listName: image-processing-queue
listLength: "3"
2.4 Karpenter - Next-Generation Node Autoscaler
Karpenter is a node provisioning engine that overcomes the limitations of Cluster Autoscaler.
# NodePool definition
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: general-purpose
spec:
template:
metadata:
labels:
team: platform
tier: general
spec:
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand", "spot"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["5"]
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
expireAfter: 720h
limits:
cpu: "1000"
memory: 2000Gi
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
---
# EC2NodeClass definition
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
amiSelectorTerms:
- alias: al2023@latest
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
role: KarpenterNodeRole-my-cluster
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 100Gi
volumeType: gp3
iops: 3000
throughput: 125
Cluster Autoscaler vs Karpenter Comparison:
| Aspect | Cluster Autoscaler | Karpenter |
|---|---|---|
| Node Selection | Node Group based | Workload requirements based |
| Provisioning Speed | Minutes | Seconds |
| Instance Variety | Fixed per group | Automatic optimal selection |
| Spot Handling | Manual configuration | Automatic price/availability optimization |
| Consolidation | Not supported | Automatic node consolidation |
| Cloud Support | All clouds | AWS (Azure preview) |
3. Advanced Scheduling
3.1 nodeSelector
The simplest node selection method.
apiVersion: v1
kind: Pod
metadata:
name: gpu-worker
spec:
nodeSelector:
accelerator: nvidia-tesla-v100
topology.kubernetes.io/zone: ap-northeast-2a
containers:
- name: gpu-worker
image: myapp/gpu-worker:latest
resources:
limits:
nvidia.com/gpu: 1
3.2 Node Affinity and Pod Affinity
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-frontend
spec:
replicas: 6
selector:
matchLabels:
app: web-frontend
template:
metadata:
labels:
app: web-frontend
spec:
affinity:
# Node Affinity: place on specific nodes
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node-type
operator: In
values:
- compute-optimized
- general-purpose
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80
preference:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- ap-northeast-2a
- weight: 20
preference:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- ap-northeast-2c
# Pod Affinity: place on same node/zone as specific Pods
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- redis-cache
topologyKey: topology.kubernetes.io/zone
# Pod Anti-Affinity: spread Pods of same app
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- web-frontend
topologyKey: kubernetes.io/hostname
containers:
- name: web-frontend
image: myapp/web-frontend:latest
3.3 Taints and Tolerations
# Add Taint to nodes
# kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# kubectl taint nodes spot-node-1 spot=true:PreferNoSchedule
# GPU workload: Tolerate gpu Taint
apiVersion: apps/v1
kind: Deployment
metadata:
name: ml-training
spec:
replicas: 2
selector:
matchLabels:
app: ml-training
template:
metadata:
labels:
app: ml-training
spec:
tolerations:
- key: "gpu"
operator: "Equal"
value: "true"
effect: "NoSchedule"
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
nodeSelector:
accelerator: nvidia-tesla-v100
containers:
- name: trainer
image: myapp/ml-trainer:latest
resources:
limits:
nvidia.com/gpu: 4
---
# Spot instance workloads
apiVersion: apps/v1
kind: Deployment
metadata:
name: batch-processor
spec:
replicas: 10
selector:
matchLabels:
app: batch-processor
template:
metadata:
labels:
app: batch-processor
spec:
tolerations:
- key: "spot"
operator: "Equal"
value: "true"
effect: "PreferNoSchedule"
- key: "node.kubernetes.io/not-ready"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 60
containers:
- name: processor
image: myapp/batch-processor:latest
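Taints are easy to add and forget; a quick way to audit them, plus the removal syntax (the node names below come from the example above and are hypothetical):

```shell
# List taints per node (requires a live cluster)
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Remove a taint by repeating the same key=value:effect with a trailing "-"
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule-
```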
3.4 Priority and Preemption
# PriorityClass definitions
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: critical-production
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For critical production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: standard-production
value: 100000
globalDefault: true
preemptionPolicy: PreemptLowerPriority
description: "For standard production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: batch-low
value: 1000
globalDefault: false
preemptionPolicy: Never
description: "For batch jobs. No preemption"
---
# Using Priority
apiVersion: apps/v1
kind: Deployment
metadata:
name: payment-service
spec:
replicas: 3
selector:
matchLabels:
app: payment-service
template:
metadata:
labels:
app: payment-service
spec:
priorityClassName: critical-production
containers:
- name: payment
image: myapp/payment:latest
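To see priorities and preemption in action, you can inspect the PriorityClasses and look for preemption events on evicted Pods (the event reason string can vary by Kubernetes version, so treat the filter below as a starting point):

```shell
# List all PriorityClasses and their values
kubectl get priorityclasses
# Preemption victims typically surface as events with reason "Preempted"
kubectl get events -A --field-selector reason=Preempted
```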
3.5 Topology Spread Constraints
Topology Spread Constraints distribute Pods evenly across topology domains such as zones and nodes; maxSkew caps the allowed difference in matching Pod counts between any two domains.
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-gateway
spec:
replicas: 12
selector:
matchLabels:
app: api-gateway
template:
metadata:
labels:
app: api-gateway
spec:
topologySpreadConstraints:
# Spread across AZs
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: api-gateway
# Spread across nodes
- maxSkew: 2
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app: api-gateway
nodeAffinityPolicy: Honor
nodeTaintsPolicy: Honor
containers:
- name: api-gateway
image: myapp/api-gateway:latest
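maxSkew is simply the largest difference in matching Pod counts between any two topology domains. A sketch of the arithmetic with hypothetical per-zone counts (on a live cluster you would derive the counts from `kubectl get pods -l app=api-gateway -o wide`):

```shell
# skew = max(count per domain) - min(count per domain)
# maxSkew: 1 requires this value to stay <= 1 for DoNotSchedule constraints
printf "zone-a 4\nzone-b 5\nzone-c 3\n" |
  awk '{if ($2 > max) max = $2; if (min == "" || $2 < min) min = $2}
       END {print "skew:", max - min}'   # prints "skew: 2" -> violates maxSkew: 1
```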
4. Resource Management
4.1 Understanding Requests vs Limits
apiVersion: v1
kind: Pod
metadata:
name: resource-demo
spec:
containers:
- name: app
image: myapp/demo:latest
resources:
# Used for scheduling. This amount of resources is guaranteed
requests:
cpu: 500m
memory: 512Mi
ephemeral-storage: 1Gi
# Upper bound. CPU is throttled when exceeded, memory triggers OOMKill
limits:
cpu: "2"
memory: 1Gi
ephemeral-storage: 2Gi
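It is the requests, not limits or live usage, that the scheduler sums when deciding whether a Pod fits on a node. To see what is already committed on a node (node name is hypothetical; requires a live cluster):

```shell
# "Allocated resources" shows the sum of requests already scheduled onto the node
kubectl describe node worker-1 | grep -A 8 "Allocated resources"
```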
4.2 QoS Classes
| QoS Class | Condition | OOM Kill Priority |
|---|---|---|
| Guaranteed | Every container sets CPU and memory requests equal to limits | Lowest (killed last) |
| Burstable | At least one container sets requests or limits, but the Guaranteed condition is not met | Medium |
| BestEffort | No container sets any requests or limits | Highest (killed first) |
# Guaranteed QoS
apiVersion: v1
kind: Pod
metadata:
name: guaranteed-pod
spec:
containers:
- name: app
image: myapp/critical:latest
resources:
requests:
cpu: "1"
memory: 1Gi
limits:
cpu: "1"
memory: 1Gi
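The kubelet records the computed class on the Pod status, so you can verify it directly (requires a live cluster; for the manifest above, where requests equal limits, the expected class is Guaranteed):

```shell
# Print the QoS class assigned to the Pod
kubectl get pod guaranteed-pod -o jsonpath='{.status.qosClass}{"\n"}'
```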
4.3 LimitRange and ResourceQuota
# LimitRange: Per-Pod/Container resource limits within a namespace
apiVersion: v1
kind: LimitRange
metadata:
name: default-limits
namespace: team-backend
spec:
limits:
- type: Container
default:
cpu: 500m
memory: 512Mi
defaultRequest:
cpu: 100m
memory: 128Mi
max:
cpu: "4"
memory: 8Gi
min:
cpu: 50m
memory: 64Mi
- type: Pod
max:
cpu: "8"
memory: 16Gi
- type: PersistentVolumeClaim
max:
storage: 100Gi
min:
storage: 1Gi
---
# ResourceQuota: Total resource cap for entire namespace
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-backend-quota
namespace: team-backend
spec:
hard:
requests.cpu: "20"
requests.memory: 40Gi
limits.cpu: "40"
limits.memory: 80Gi
pods: "100"
services: "20"
persistentvolumeclaims: "30"
requests.storage: 500Gi
count/deployments.apps: "30"
count/configmaps: "50"
count/secrets: "50"
scopeSelector:
matchExpressions:
- scopeName: PriorityClass
operator: In
values:
- standard-production
- critical-production
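Once the quota is in place, the API server tracks consumption against each hard limit; a quick way to compare used versus hard for the example namespace (requires a live cluster):

```shell
# Shows each resource with its Used and Hard columns
kubectl describe resourcequota team-backend-quota -n team-backend
```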
4.4 PodDisruptionBudget (PDB)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-server-pdb
namespace: production
spec:
minAvailable: "60%"
selector:
matchLabels:
app: api-server
unhealthyPodEvictionPolicy: IfHealthyBudget
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: redis-pdb
namespace: production
spec:
maxUnavailable: 1
selector:
matchLabels:
app: redis-cluster
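The `minAvailable: "60%"` in the first PDB translates into a concrete `disruptionsAllowed` number: Kubernetes rounds the percentage up against the current healthy Pod count. A sketch of the arithmetic for 5 ready Pods:

```shell
replicas=5
min_pct=60
# ceil(5 * 0.60) = 3 Pods must stay available
min_available=$(( (replicas * min_pct + 99) / 100 ))
echo "disruptionsAllowed: $(( replicas - min_available ))"   # prints "disruptionsAllowed: 2"
```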
5. Multi-Cluster Operations
5.1 Cluster API
Cluster API is a Kubernetes sub-project for declaratively provisioning, upgrading, and operating Kubernetes clusters.
# Cluster definition
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
name: production-cluster
namespace: clusters
spec:
clusterNetwork:
pods:
cidrBlocks:
- 192.168.0.0/16
services:
cidrBlocks:
- 10.96.0.0/12
controlPlaneRef:
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
name: production-control-plane
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
name: production-cluster
---
# Control Plane definition
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
name: production-control-plane
namespace: clusters
spec:
replicas: 3
version: v1.30.2
machineTemplate:
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSMachineTemplate
name: production-control-plane
kubeadmConfigSpec:
clusterConfiguration:
apiServer:
extraArgs:
audit-log-maxage: "30"
audit-log-maxbackup: "10"
enable-admission-plugins: "NodeRestriction,PodSecurity"
initConfiguration:
nodeRegistration:
kubeletExtraArgs:
cloud-provider: external
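After applying these manifests to the management cluster, `clusterctl` can show provisioning progress and fetch credentials for the new workload cluster (cluster and namespace names follow the example above):

```shell
# Render the object tree and health of the workload cluster
clusterctl describe cluster production-cluster -n clusters
# Retrieve the workload cluster kubeconfig once provisioning completes
clusterctl get kubeconfig production-cluster -n clusters > production.kubeconfig
```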
5.2 Fleet/Rancher Multi-Cluster Management
# Fleet GitRepo: Deploy across multiple clusters
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
name: platform-apps
namespace: fleet-default
spec:
repo: https://github.com/myorg/platform-apps
branch: main
paths:
- monitoring/
- logging/
- ingress/
targets:
- name: production
clusterSelector:
matchLabels:
env: production
- name: staging
clusterSelector:
matchLabels:
env: staging
---
# Fleet Bundle customization
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
name: app-deployments
namespace: fleet-default
spec:
repo: https://github.com/myorg/app-deployments
branch: main
targets:
- name: us-east
clusterSelector:
matchLabels:
region: us-east
helm:
values:
replicaCount: 5
ingress:
host: api-us.mycompany.com
- name: ap-northeast
clusterSelector:
matchLabels:
region: ap-northeast
helm:
values:
replicaCount: 3
ingress:
host: api-ap.mycompany.com
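Fleet reconciles each GitRepo into per-cluster Bundles, and their status on the management cluster tells you whether every target cluster is in sync:

```shell
# READY/DESIRED counts per GitRepo and per Bundle across target clusters
kubectl get gitrepos -n fleet-default
kubectl get bundles -n fleet-default
```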
5.3 Multi-Cluster Architecture Patterns
| Pattern | Description | When to Use |
|---|---|---|
| Hub-Spoke | Central management cluster controls worker clusters | Basic multi-cluster |
| Federation | KubeFed syncs resources across clusters | Same app multi-region |
| Service Mesh | Istio Multi-cluster for inter-cluster communication | Distributed microservices |
| Virtual Kubelet | Admiralty, Liqo connect virtual nodes | Burst workloads |
6. Cluster Upgrade Strategies
6.1 In-place Upgrade
#!/bin/bash
# Control Plane upgrade
echo "=== Starting Control Plane Upgrade ==="
# 1. Check current version
kubectl get nodes
kubectl version  # the --short flag was removed in kubectl 1.28
# 2. Upgrade kubeadm
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.2-1.1
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.30.2
# 3. Upgrade kubelet and kubectl
sudo apt-get install -y kubelet=1.30.2-1.1 kubectl=1.30.2-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
echo "=== Sequential Worker Node Upgrade ==="
NODES=$(kubectl get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[*].metadata.name}')
for NODE in $NODES; do
echo "--- Starting upgrade for $NODE ---"
# Cordon: prevent new Pod scheduling
kubectl cordon "$NODE"
# Drain: evict existing Pods
kubectl drain "$NODE" \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=120 \
--timeout=300s
# Run upgrade on the node (via SSH or automation tool)
echo "Running kubeadm and kubelet upgrade on node $NODE"
# Uncordon: resume scheduling
kubectl uncordon "$NODE"
# Verify node Ready status
kubectl wait --for=condition=Ready "node/$NODE" --timeout=300s
echo "--- Upgrade complete for $NODE ---"
done
6.2 Blue-Green Cluster Upgrade
# Create new cluster with Cluster API
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
name: production-v130
namespace: clusters
labels:
upgrade-group: production
version: v1.30
spec:
clusterNetwork:
pods:
cidrBlocks:
- 192.168.0.0/16
controlPlaneRef:
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
name: production-v130-cp
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSCluster
name: production-v130
7. Troubleshooting
7.1 Using kubectl debug
# Add debug container to running Pod
kubectl debug -it pod/api-server-abc123 \
--image=nicolaka/netshoot \
--target=api-server \
-- /bin/bash
# Node debugging
kubectl debug node/worker-1 \
-it --image=ubuntu:22.04 \
-- /bin/bash
# Debug with Pod copy (image change)
kubectl debug pod/api-server-abc123 \
-it --copy-to=debug-pod \
--container=api-server \
--image=myapp/api-server:debug \
-- /bin/sh
7.2 Common Issues and Solutions
Pending Pod Issues:
# Check why Pod is Pending
kubectl describe pod pending-pod-name
# Common causes:
# 1. Insufficient resources -> Add nodes or adjust resources
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_ALLOC:.status.allocatable.cpu,\
MEM_ALLOC:.status.allocatable.memory,\
CPU_CAP:.status.capacity.cpu
# 2. nodeSelector/affinity mismatch -> Check labels
kubectl get nodes --show-labels
# 3. Taints blocking -> Add tolerations
kubectl describe nodes | grep -A5 Taints
CrashLoopBackOff Resolution:
# Check logs
kubectl logs pod/crashing-pod --previous
kubectl logs pod/crashing-pod -c init-container-name
# Check events
kubectl get events --sort-by=.lastTimestamp \
--field-selector involvedObject.name=crashing-pod
# Check for OOM Kill
kubectl describe pod crashing-pod | grep -A5 "Last State"
# If OOMKilled appears, increase memory limits
# Run in debug mode
kubectl debug pod/crashing-pod \
-it --copy-to=debug-pod \
--container=app \
--image=busybox \
-- /bin/sh
Network Issue Diagnosis:
# DNS check
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never \
-- nslookup kubernetes.default.svc.cluster.local
# Service connectivity test
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never \
-- curl -v http://api-server.production.svc:8080/health
# Check network policies
kubectl get networkpolicy -A
kubectl describe networkpolicy -n production
8. Cost Optimization
8.1 Leveraging Spot Nodes
# Karpenter Spot NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: spot-workloads
spec:
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["5"]
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
limits:
cpu: "500"
memory: 1000Gi
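To confirm which nodes Karpenter actually launched from this NodePool and whether they landed on Spot capacity (the NodeClaim resource assumes Karpenter v1):

```shell
# Capacity type and instance type are exposed as node labels
kubectl get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type
# NodeClaims track the lifecycle of Karpenter-provisioned capacity
kubectl get nodeclaims
```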
8.2 Right-sizing Automation
#!/bin/bash
# Right-sizing report based on VPA recommendations
echo "=== Resource Utilization by Namespace ==="
for NS in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
CPU_USE=$(kubectl top pods -n "$NS" --no-headers 2>/dev/null | \
awk '{sum += $2} END {print sum}')
MEM_USE=$(kubectl top pods -n "$NS" --no-headers 2>/dev/null | \
awk '{sum += $3} END {print sum}')
if [ -n "$CPU_USE" ] && [ "$CPU_USE" != "0" ]; then
echo "Namespace: $NS | CPU usage: ${CPU_USE}m | Memory usage: ${MEM_USE}Mi"
fi
done
echo ""
echo "=== VPA Recommendations ==="
for VPA in $(kubectl get vpa -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
NS=$(echo "$VPA" | cut -d'/' -f1)
NAME=$(echo "$VPA" | cut -d'/' -f2)
echo "--- $VPA ---"
kubectl get vpa "$NAME" -n "$NS" -o jsonpath='{.status.recommendation.containerRecommendations[*]}'
echo ""
done
8.3 Namespace Cost Allocation
# Using Kubecost or OpenCost
# helm install kubecost kubecost/cost-analyzer \
# --namespace kubecost --create-namespace \
# --set prometheus.enabled=false \
# --set prometheus.fqdn=http://prometheus-server.monitoring:80
# Cost allocation via namespace labels
apiVersion: v1
kind: Namespace
metadata:
name: team-backend
labels:
cost-center: "backend-team"
department: "engineering"
project: "api-platform"
environment: "production"
9. Practice Quiz
Q1: What component is needed for HPA to scale on custom metrics?
Answer: A custom metrics API server like Prometheus Adapter (or Datadog Cluster Agent, etc.) is required.
- Prometheus Adapter exposes Prometheus metrics through the Kubernetes Custom Metrics API (custom.metrics.k8s.io)
- HPA queries this API to retrieve custom metric values and make scaling decisions
- Flow: Prometheus collects -> Adapter transforms -> HPA queries -> Scaling executes
Q2: What conditions must be met for Guaranteed QoS Class?
Answer: All containers in the Pod must have CPU and memory requests and limits set, and each pair must be equal.
- requests.cpu = limits.cpu
- requests.memory = limits.memory
- Applies to all containers (including init containers)
- Guaranteed Pods are the last to be OOM Killed under node memory pressure
Q3: What is the difference between requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution in podAntiAffinity?
Answer: Required means the condition must be satisfied for the Pod to be scheduled. If no node satisfies the condition, the Pod stays Pending. Preferred means the scheduler tries to place the Pod where conditions are met, but will place it elsewhere if necessary.
- required: Hard constraint (mandatory)
- preferred: Soft constraint, priority adjustable via weight
- IgnoredDuringExecution: Already running Pods are not evicted even if conditions change
Q4: How does Karpenter consolidation save costs?
Answer: Karpenter consolidation moves Pods from idle or underutilized nodes to other nodes, then removes empty nodes or replaces them with smaller (cheaper) instances.
- WhenEmpty: Only removes nodes with no Pods
- WhenEmptyOrUnderutilized: Also relocates Pods from underutilized nodes before removing
- Can replace nodes with cheaper instance types (e.g., swap an underutilized c5.2xlarge for a c5.xlarge)
- consolidateAfter sets the stabilization wait time
Q5: How does PodDisruptionBudget affect cluster upgrades?
Answer: PDB limits the number of Pods that can be simultaneously disrupted during voluntary disruptions.
- During kubectl drain, PDB is respected to evict Pods sequentially
- minAvailable: Guarantees minimum available Pod count/percentage
- maxUnavailable: Limits maximum disrupted Pod count/percentage
- Overly strict PDBs can cause drain timeouts
- unhealthyPodEvictionPolicy: AlwaysAllow permits evicting unhealthy (not Ready) Pods even when the budget is not satisfied; the default IfHealthyBudget allows it only while the budget holds
10. References
- Kubernetes Official Docs - HPA
- Kubernetes Official Docs - VPA
- KEDA Official Documentation
- Karpenter Official Documentation
- Kubernetes Scheduling
- Topology Spread Constraints
- Resource Management
- Cluster API
- Fleet Manager
- Kubecost
- Kubernetes Best Practices - Google
- EKS Best Practices Guide
- Pod Priority and Preemption