Kubernetes Advanced Operations Guide 2025: Autoscaling, Scheduling, Resource Management, Multi-Cluster

Table of Contents

1. Introduction: Why Advanced Kubernetes Operations Matter

Running Kubernetes in production reveals problems that basic deployments cannot solve: Pods that do not scale up fast enough during traffic spikes, workloads piling onto specific nodes and spreading failures, or costs exploding because resources are not managed properly.

This guide covers four core areas of advanced Kubernetes operations:

  1. Autoscaling - scale workloads and infrastructure automatically with HPA, VPA, KEDA, and Karpenter
  2. Scheduling - optimize Pod placement with affinity, taints, priority, and topology spread
  3. Resource management - ensure stability with QoS, LimitRange, ResourceQuota, and PDB
  4. Multi-cluster - manage multiple clusters together with Cluster API and Fleet

2. Autoscaling Strategies

2.1 HPA (Horizontal Pod Autoscaler) Deep Dive

HPA is the most basic autoscaler: it adjusts the number of Pods. The autoscaling/v2 API adds support for custom and external metrics.

Basic HPA Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    # CPU-based scaling
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory-based scaling
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 10
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

Custom Metrics HPA

With Prometheus Adapter, you can scale on application-specific custom metrics.

# Prometheus Adapter configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_per_second{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace:
              resource: namespace
            pod:
              resource: pod
        name:
          matches: "^(.*)$"
          as: "requests_per_second"
        metricsQuery: 'sum(rate(http_requests_total{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
# Custom-metrics-based HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"

External Metrics HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_depth
          selector:
            matchLabels:
              queue: "order-processing"
        target:
          type: AverageValue
          averageValue: "5"

2.2 VPA (Vertical Pod Autoscaler)

VPA automatically adjusts a Pod's CPU and memory requests. It is especially useful early on, when the right resource requests are not yet known.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Recreate, Auto
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits

VPA update modes compared:

Mode     | Behavior                                   | When to use
Off      | Recommendations only; nothing is applied   | Initial analysis
Initial  | Applied only when the Pod is created       | Stable workloads
Recreate | Applied by recreating the Pod              | Typical operations
Auto     | In-place when possible, otherwise recreate | Recent K8s versions

Caution: running HPA and VPA on the same metrics (CPU/memory) at the same time causes conflicts. The recommended pattern is to run VPA in Off mode for recommendations only, while HPA handles the scaling.
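A minimal sketch of that recommendation-only pattern, reusing the api-server Deployment from the examples above (the VPA object name is hypothetical):

```yaml
# VPA in recommendation-only mode: safe to run alongside an HPA.
# With updateMode "Off", the recommender still fills in
# status.recommendation but never restarts or resizes Pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa-recommender   # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # recommendations only; HPA keeps control of replicas
```

The suggested values can then be read with `kubectl describe vpa api-server-vpa-recommender` and folded into the Deployment's requests manually.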

2.3 KEDA (Kubernetes Event-Driven Autoscaling)

KEDA scales workloads based on external event sources and supports more than 60 scalers.

# Install KEDA
# helm repo add kedacore https://kedacore.github.io/charts
# helm install keda kedacore/keda --namespace keda-system --create-namespace

# ScaledObject example: Kafka-based scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: kafka-consumer
  pollingInterval: 15
  cooldownPeriod: 300
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 50
  fallback:
    failureThreshold: 3
    replicas: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.production.svc:9092
        consumerGroup: order-processor
        topic: orders
        lagThreshold: "100"
        offsetResetPolicy: latest
---
# ScaledObject example: AWS SQS-based scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaler
spec:
  scaleTargetRef:
    name: sqs-worker
  pollingInterval: 10
  cooldownPeriod: 60
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.ap-northeast-2.amazonaws.com/123456789012/order-queue
        queueLength: "5"
        awsRegion: ap-northeast-2
      authenticationRef:
        name: aws-credentials
---
# ScaledJob example: batch job scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-processor
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: processor
            image: myapp/image-processor:latest
        restartPolicy: Never
  pollingInterval: 10
  maxReplicaCount: 20
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 5
  triggers:
    - type: redis-lists
      metadata:
        address: redis.production.svc:6379
        listName: image-processing-queue
        listLength: "3"

2.4 Karpenter: A Next-Generation Node Autoscaler

Karpenter is a node provisioning engine designed to overcome the limitations of Cluster Autoscaler.

# NodePool definition
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    metadata:
      labels:
        team: platform
        tier: general
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
  limits:
    cpu: "1000"
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
# EC2NodeClass definition
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  role: KarpenterNodeRole-my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125

Cluster Autoscaler vs Karpenter:

Aspect             | Cluster Autoscaler | Karpenter
Node selection     | Node-group based   | Driven by workload requirements
Provisioning speed | Minutes            | Seconds
Instance diversity | Fixed per group    | Automatically picks the best fit
Spot handling      | Manual setup       | Automatic price/availability optimization
Consolidation      | Not supported      | Automatic node consolidation
Cloud support      | All major clouds   | AWS (Azure in preview)

3. Scheduling Deep Dive

3.1 nodeSelector

nodeSelector is the simplest way to constrain a Pod to particular nodes.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker
spec:
  nodeSelector:
    accelerator: nvidia-tesla-v100
    topology.kubernetes.io/zone: ap-northeast-2a
  containers:
    - name: gpu-worker
      image: myapp/gpu-worker:latest
      resources:
        limits:
          nvidia.com/gpu: 1
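The selector above can only match nodes that actually carry those labels. In the same commented-command style used elsewhere in this guide, the labeling step might look like this (the node name gpu-node-1 is an assumption):

```shell
# Label the GPU node so the nodeSelector can match it
# kubectl label nodes gpu-node-1 accelerator=nvidia-tesla-v100

# Verify which nodes carry the label
# kubectl get nodes -L accelerator
```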

3.2 Node Affinity and Pod Affinity

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        # Node affinity: place on specific kinds of nodes
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values:
                      - compute-optimized
                      - general-purpose
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - ap-northeast-2a
            - weight: 20
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - ap-northeast-2c
        # Pod affinity: co-locate with specific Pods on the same node/zone
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - redis-cache
              topologyKey: topology.kubernetes.io/zone
        # Pod anti-affinity: spread Pods of the same app apart
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - web-frontend
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web-frontend
          image: myapp/web-frontend:latest

3.3 Taints and Tolerations

# Add taints to nodes
# kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# kubectl taint nodes spot-node-1 spot=true:PreferNoSchedule

# GPU workload: tolerates the gpu taint
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      tolerations:
        - key: "gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
        - name: trainer
          image: myapp/ml-trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 4
---
# Spot instance workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "PreferNoSchedule"
        - key: "node.kubernetes.io/not-ready"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 60
      containers:
        - name: processor
          image: myapp/batch-processor:latest

3.4 Priority and Preemption

# PriorityClass definition
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For critical production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: standard-production
value: 100000
globalDefault: true
preemptionPolicy: PreemptLowerPriority
description: "For standard production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
globalDefault: false
preemptionPolicy: Never
description: "For batch jobs; never preempts other Pods"
---
# Using a PriorityClass
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      priorityClassName: critical-production
      containers:
        - name: payment
          image: myapp/payment:latest
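As a counterpart to the critical-production Deployment above, here is a sketch of a batch workload on the non-preempting batch-low class; the Job name and image are hypothetical:

```yaml
# Batch Job on the low, non-preempting PriorityClass defined above.
# With preemptionPolicy: Never on the class, this Job waits in the
# scheduling queue instead of evicting lower-priority Pods.
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report   # hypothetical name
spec:
  backoffLimit: 3
  template:
    spec:
      priorityClassName: batch-low
      restartPolicy: Never
      containers:
        - name: report
          image: myapp/report-generator:latest   # placeholder image
```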

3.5 Topology Spread Constraints

Topology spread constraints are a powerful way to distribute Pods evenly across topology domains such as zones and nodes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 12
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      topologySpreadConstraints:
        # Spread across availability zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-gateway
        # Spread across nodes
        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api-gateway
          nodeAffinityPolicy: Honor
          nodeTaintsPolicy: Honor
      containers:
        - name: api-gateway
          image: myapp/api-gateway:latest

4. Resource Management

4.1 Understanding Requests vs Limits

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
    - name: app
      image: myapp/demo:latest
      resources:
        # Used for scheduling; the container is guaranteed this much
        requests:
          cpu: 500m
          memory: 512Mi
          ephemeral-storage: 1Gi
        # Hard ceiling; exceeding it throttles CPU and OOM-kills on memory
        limits:
          cpu: "2"
          memory: 1Gi
          ephemeral-storage: 2Gi

4.2 QoS Classes

QoS class  | Condition                                     | OOM kill priority
Guaranteed | requests = limits for every container         | Lowest (killed last)
Burstable  | requests/limits set, but not equal everywhere | Medium
BestEffort | neither requests nor limits set               | Highest (killed first)

# Guaranteed QoS
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
    - name: app
      image: myapp/critical:latest
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 1Gi
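For contrast, a minimal Burstable Pod: requests are set below limits, so it can burst above its reservation but is a likelier OOM-kill target than the Guaranteed Pod above (the image is the placeholder used throughout this guide):

```yaml
# Burstable QoS: requests are set, but lower than limits
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
    - name: app
      image: myapp/demo:latest
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: "1"
          memory: 512Mi
```

The assigned class is visible via `kubectl get pod burstable-pod -o jsonpath='{.status.qosClass}'`.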

4.3 LimitRange and ResourceQuota

# LimitRange: caps individual Pod/container resources within a namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-backend
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 50m
        memory: 64Mi
    - type: Pod
      max:
        cpu: "8"
        memory: 16Gi
    - type: PersistentVolumeClaim
      max:
        storage: 100Gi
      min:
        storage: 1Gi
---
# ResourceQuota: caps total resource consumption across the namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services: "20"
    persistentvolumeclaims: "30"
    requests.storage: 500Gi
    count/deployments.apps: "30"
    count/configmaps: "50"
    count/secrets: "50"
  scopeSelector:
    matchExpressions:
      - scopeName: PriorityClass
        operator: In
        values:
          - standard-production
          - critical-production

4.4 PodDisruptionBudget (PDB)

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: "60%"
  selector:
    matchLabels:
      app: api-server
  unhealthyPodEvictionPolicy: AlwaysAllow
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: redis-cluster

5. Multi-Cluster Operations

5.1 Cluster API

Cluster API is a project for creating and managing Kubernetes clusters declaratively.

# Cluster definition
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-cluster
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: production-cluster
---
# Control plane definition
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: production-control-plane
  namespace: clusters
spec:
  replicas: 3
  version: v1.30.2
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
      kind: AWSMachineTemplate
      name: production-control-plane
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          audit-log-maxage: "30"
          audit-log-maxbackup: "10"
          enable-admission-plugins: "NodeRestriction,PodSecurityAdmission"
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external

5.2 Multi-Cluster Management with Fleet/Rancher

# Fleet GitRepo: deploy to multiple clusters
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: platform-apps
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/platform-apps
  branch: main
  paths:
    - monitoring/
    - logging/
    - ingress/
  targets:
    - name: production
      clusterSelector:
        matchLabels:
          env: production
    - name: staging
      clusterSelector:
        matchLabels:
          env: staging
---
# Customizing Fleet bundles per target
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: app-deployments
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/app-deployments
  branch: main
  targets:
    - name: us-east
      clusterSelector:
        matchLabels:
          region: us-east
      helm:
        values:
          replicaCount: 5
          ingress:
            host: api-us.mycompany.com
    - name: ap-northeast
      clusterSelector:
        matchLabels:
          region: ap-northeast
      helm:
        values:
          replicaCount: 3
          ingress:
            host: api-ap.mycompany.com

5.3 Multi-Cluster Architecture Patterns

Pattern         | Description                                           | When to use
Hub-Spoke       | A central management cluster controls worker clusters | Default multi-cluster setup
Federation      | KubeFed syncs resources across clusters               | Same app across regions
Service Mesh    | Istio multi-cluster for cross-cluster traffic         | Distributed microservices
Virtual Kubelet | Admiralty or Liqo attach clusters as virtual nodes    | Burst workloads

6. Cluster Upgrade Strategies

6.1 In-Place Upgrade

#!/bin/bash
# Upgrade the control plane
echo "=== Starting control plane upgrade ==="

# 1. Check current versions
kubectl get nodes
kubectl version

# 2. Upgrade kubeadm
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.2-1.1
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.30.2

# 3. Upgrade kubelet and kubectl
sudo apt-get install -y kubelet=1.30.2-1.1 kubectl=1.30.2-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet

echo "=== Rolling upgrade of worker nodes ==="
NODES=$(kubectl get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[*].metadata.name}')

for NODE in $NODES; do
  echo "--- Starting upgrade of $NODE ---"

  # Cordon: block new Pods from being scheduled
  kubectl cordon "$NODE"

  # Drain: evict existing Pods
  kubectl drain "$NODE" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --grace-period=120 \
    --timeout=300s

  # Run the upgrade on the node itself (via SSH or an automation tool)
  echo "Upgrading kubeadm and kubelet on node $NODE"

  # Uncordon: resume scheduling
  kubectl uncordon "$NODE"

  # Wait for the node to become Ready
  kubectl wait --for=condition=Ready "node/$NODE" --timeout=300s

  echo "--- Finished upgrading $NODE ---"
done

6.2 Blue-Green Cluster Upgrade

# Create the new cluster with Cluster API
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-v130
  namespace: clusters
  labels:
    upgrade-group: production
    version: v1.30
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-v130-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: production-v130

7. Troubleshooting

7.1 Using kubectl debug

# Attach an ephemeral debug container to a running Pod
kubectl debug -it pod/api-server-abc123 \
  --image=nicolaka/netshoot \
  --target=api-server \
  -- /bin/bash

# Debug a node
kubectl debug node/worker-1 \
  -it --image=ubuntu:22.04 \
  -- /bin/bash

# Debug a copy of the failing Pod (with a different image)
kubectl debug pod/api-server-abc123 \
  -it --copy-to=debug-pod \
  --container=api-server \
  --image=myapp/api-server:debug \
  -- /bin/sh

7.2 Common Problems and Fixes

Pending Pods:

# Find out why a Pod is Pending
kubectl describe pod pending-pod-name

# Common causes:
# 1. Insufficient resources -> add nodes or adjust requests
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_ALLOC:.status.allocatable.cpu,\
MEM_ALLOC:.status.allocatable.memory,\
CPU_CAP:.status.capacity.cpu

# 2. nodeSelector/affinity mismatch -> check node labels
kubectl get nodes --show-labels

# 3. Blocked by a taint -> add a matching toleration
kubectl describe nodes | grep -A5 Taints

Fixing CrashLoopBackOff:

# Check logs
kubectl logs pod/crashing-pod --previous
kubectl logs pod/crashing-pod -c init-container-name

# Check events
kubectl get events --sort-by=.lastTimestamp \
  --field-selector involvedObject.name=crashing-pod

# Check for OOM kills
kubectl describe pod crashing-pod | grep -A5 "Last State"
# If the last state shows OOMKilled, raise the memory limit

# Run a debuggable copy
kubectl debug pod/crashing-pod \
  -it --copy-to=debug-pod \
  --container=app \
  --image=busybox \
  -- /bin/sh

Diagnosing network issues:

# Verify DNS
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local

# Test service connectivity
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never \
  -- curl -v http://api-server.production.svc:8080/health

# Inspect network policies
kubectl get networkpolicy -A
kubectl describe networkpolicy -n production
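If these checks reveal a deny-by-default policy, the curl test above needs an explicit allow rule. A hedged sketch (the selectors and port are assumptions tied to the earlier examples; `kubectl run` labels the test Pod with run=curl-test):

```yaml
# Allow ingress to api-server Pods on TCP 8080 from the curl test Pod
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-ingress   # hypothetical name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              run: curl-test
      ports:
        - protocol: TCP
          port: 8080
```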

8. Cost Optimization

8.1 Using Spot Nodes

# Karpenter Spot NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workloads
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "500"
    memory: 1000Gi

8.2 Automating Right-Sizing

#!/bin/bash
# Right-sizing report based on VPA recommendations

echo "=== Per-namespace resource usage ==="
for NS in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  CPU_USE=$(kubectl top pods -n "$NS" --no-headers 2>/dev/null | \
    awk '{sum += $2} END {print sum}')
  MEM_USE=$(kubectl top pods -n "$NS" --no-headers 2>/dev/null | \
    awk '{sum += $3} END {print sum}')
  if [ -n "$CPU_USE" ] && [ "$CPU_USE" != "0" ]; then
    echo "Namespace: $NS | CPU usage: ${CPU_USE}m | Memory usage: ${MEM_USE}Mi"
  fi
done

echo ""
echo "=== VPA recommendations ==="
for VPA in $(kubectl get vpa -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
  NS=$(echo "$VPA" | cut -d'/' -f1)
  NAME=$(echo "$VPA" | cut -d'/' -f2)
  echo "--- $VPA ---"
  kubectl get vpa "$NAME" -n "$NS" -o jsonpath='{.status.recommendation.containerRecommendations[*]}'
  echo ""
done

8.3 Per-Namespace Cost Allocation

# Use Kubecost or OpenCost
# helm install kubecost kubecost/cost-analyzer \
#   --namespace kubecost --create-namespace \
#   --set prometheus.enabled=false \
#   --set prometheus.fqdn=http://prometheus-server.monitoring:80

# Allocate costs via namespace labels
apiVersion: v1
kind: Namespace
metadata:
  name: team-backend
  labels:
    cost-center: "backend-team"
    department: "engineering"
    project: "api-platform"
    environment: "production"

9. Hands-On Quiz

Q1: Which component does HPA need in order to scale on custom metrics?

Answer: A custom metrics API server, such as Prometheus Adapter (or the Datadog Cluster Agent, etc.).

  • Prometheus Adapter exposes Prometheus metrics through the Kubernetes Custom Metrics API (custom.metrics.k8s.io)
  • HPA queries this API for the current metric value and makes its scaling decision
  • Flow: Prometheus collects -> Adapter translates -> HPA queries -> scaling happens
Q2: What must be true for a Pod to get the Guaranteed QoS class?

Answer: Every container in the Pod must have CPU and memory requests and limits set, and each request must equal its limit.

  • requests.cpu = limits.cpu
  • requests.memory = limits.memory
  • Applies to every container (including init containers)
  • Under node memory pressure, Guaranteed Pods are OOM-killed last
Q3: In podAntiAffinity, what is the difference between requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution?

Answer: With required, the condition must hold for the Pod to be scheduled; if no node qualifies, the Pod stays Pending. With preferred, the scheduler favors nodes that satisfy the condition but falls back to other nodes if necessary.

  • required: hard constraint
  • preferred: soft constraint, prioritized via weight
  • IgnoredDuringExecution: Pods already running are not evicted when conditions later change
Q4: How does Karpenter's consolidation feature save cost?

Answer: Consolidation moves Pods off idle or underutilized nodes, then removes the emptied nodes or replaces them with smaller (cheaper) instances.

  • WhenEmpty: removes only nodes that have no Pods
  • WhenEmptyOrUnderutilized: also reschedules Pods off underutilized nodes before removing them
  • Can swap in cheaper instance types (e.g., two c5.2xlarge nodes replaced by one c5.4xlarge)
  • consolidateAfter sets a stabilization delay before consolidation kicks in
Q5: How does a PodDisruptionBudget affect cluster upgrades?

Answer: A PDB limits how many Pods can be taken down at once during voluntary disruptions.

  • kubectl drain honors PDBs and evicts Pods gradually
  • minAvailable: guarantees a minimum number/percentage of available Pods
  • maxUnavailable: caps the number/percentage of Pods disrupted at once
  • An overly strict PDB can make drain time out
  • unhealthyPodEvictionPolicy: AlwaysAllow lets unhealthy Pods be evicted without counting against the budget

10. References

  1. Kubernetes documentation - HPA
  2. Kubernetes documentation - VPA
  3. KEDA documentation
  4. Karpenter documentation
  5. Kubernetes Scheduling
  6. Topology Spread Constraints
  7. Resource Management
  8. Cluster API
  9. Fleet Manager
  10. Kubecost
  11. Kubernetes Best Practices - Google
  12. EKS Best Practices Guide
  13. Pod Priority and Preemption

Kubernetes Advanced Operations Guide 2025: Autoscaling, Scheduling, Resource Management, Multi-Cluster

Table of Contents

1. Introduction: Why Advanced Kubernetes Operations Matter

Running Kubernetes in production reveals challenges that basic deployments cannot address. Pods may not scale fast enough during traffic spikes, workloads may cluster on specific nodes causing cascading failures, or costs may explode without proper resource management.

This guide covers four core areas of advanced Kubernetes operations:

  1. Autoscaling - Scale workloads and infrastructure automatically with HPA, VPA, KEDA, and Karpenter
  2. Scheduling - Optimize Pod placement with Affinity, Taints, Priority, and Topology Spread
  3. Resource Management - Ensure stability with QoS, LimitRange, ResourceQuota, and PDB
  4. Multi-Cluster - Manage multiple clusters with Cluster API and Fleet

2. Autoscaling Strategies

2.1 HPA (Horizontal Pod Autoscaler) Deep Dive

HPA is the most fundamental autoscaler that adjusts the number of Pods. The v2 API supports custom and external metrics.

Basic HPA Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    # CPU-based scaling
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    # Memory-based scaling
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
        - type: Pods
          value: 10
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

Custom Metrics HPA

Using Prometheus Adapter, you can scale based on application-specific custom metrics.

# Prometheus Adapter configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'http_requests_per_second{namespace!="",pod!=""}'
        resources:
          overrides:
            namespace:
              resource: namespace
            pod:
              resource: pod
        name:
          matches: "^(.*)$"
          as: "requests_per_second"
        metricsQuery: 'sum(rate(http_requests_total{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
# Custom metrics HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"

External Metrics HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 1
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: sqs_queue_depth
          selector:
            matchLabels:
              queue: "order-processing"
        target:
          type: AverageValue
          averageValue: "5"

2.2 VPA (Vertical Pod Autoscaler)

VPA automatically adjusts CPU/memory requests for Pods. It is especially useful in early stages when optimal resource requests are unknown.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Recreate, Auto
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits

VPA Operating Mode Comparison:

ModeBehaviorWhen to Use
OffProvides recommendations only, no applicationInitial analysis phase
InitialApplied only at creationStable workloads
RecreateApplied by recreating PodsGeneral operations
AutoIn-place if possible, otherwise recreateLatest K8s environments

Caution: Using HPA and VPA on the same metrics (CPU/memory) simultaneously causes conflicts. The recommended pattern is to use VPA in Off mode for recommendations only while HPA handles scaling.

2.3 KEDA (Kubernetes Event-Driven Autoscaling)

KEDA scales workloads based on external event sources, supporting over 60 scalers.

# Install KEDA
# helm repo add kedacore https://kedacore.github.io/charts
# helm install keda kedacore/keda --namespace keda-system --create-namespace

# ScaledObject example: Kafka-based scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: kafka-consumer
  pollingInterval: 15
  cooldownPeriod: 300
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 50
  fallback:
    failureThreshold: 3
    replicas: 5
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.production.svc:9092
        consumerGroup: order-processor
        topic: orders
        lagThreshold: "100"
        offsetResetPolicy: latest
---
# ScaledObject example: AWS SQS-based scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaler
spec:
  scaleTargetRef:
    name: sqs-worker
  pollingInterval: 10
  cooldownPeriod: 60
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.ap-northeast-2.amazonaws.com/123456789012/order-queue
        queueLength: "5"
        awsRegion: ap-northeast-2
      authenticationRef:
        name: aws-credentials
---
# ScaledJob example: Batch job scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-processor
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: processor
            image: myapp/image-processor:latest
        restartPolicy: Never
  pollingInterval: 10
  maxReplicaCount: 20
  successfulJobsHistoryLimit: 10
  failedJobsHistoryLimit: 5
  triggers:
    - type: redis-lists
      metadata:
        address: redis.production.svc:6379
        listName: image-processing-queue
        listLength: "3"
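The SQS ScaledObject above references `authenticationRef: aws-credentials` without defining it. A minimal TriggerAuthentication sketch using KEDA's AWS Pod identity provider might look like this (assumes the workload's ServiceAccount is bound to an IAM role via IRSA; adjust to your actual auth method):

```yaml
# Hypothetical TriggerAuthentication backing the aws-credentials reference.
# With podIdentity, KEDA assumes the IAM role attached to the Pod's
# ServiceAccount instead of reading static access keys from a Secret.
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-credentials
  namespace: production
spec:
  podIdentity:
    provider: aws
```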

2.4 Karpenter - Next-Generation Node Autoscaler

Karpenter is a node provisioning engine that overcomes the limitations of Cluster Autoscaler.

# NodePool definition
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    metadata:
      labels:
        team: platform
        tier: general
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h
  limits:
    cpu: "1000"
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
# EC2NodeClass definition
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  role: KarpenterNodeRole-my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125

Cluster Autoscaler vs Karpenter Comparison:

| Aspect | Cluster Autoscaler | Karpenter |
| --- | --- | --- |
| Node selection | Node Group based | Workload requirements based |
| Provisioning speed | Minutes | Seconds |
| Instance variety | Fixed per group | Automatic optimal selection |
| Spot handling | Manual configuration | Automatic price/availability optimization |
| Consolidation | Not supported | Automatic node consolidation |
| Cloud support | All clouds | AWS (Azure preview) |

3. Advanced Scheduling

3.1 nodeSelector

The simplest node selection method.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker
spec:
  nodeSelector:
    accelerator: nvidia-tesla-v100
    topology.kubernetes.io/zone: ap-northeast-2a
  containers:
    - name: gpu-worker
      image: myapp/gpu-worker:latest
      resources:
        limits:
          nvidia.com/gpu: 1

3.2 Node Affinity and Pod Affinity

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        # Node Affinity: place on specific nodes
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values:
                      - compute-optimized
                      - general-purpose
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - ap-northeast-2a
            - weight: 20
              preference:
                matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - ap-northeast-2c
        # Pod Affinity: place on same node/zone as specific Pods
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - redis-cache
              topologyKey: topology.kubernetes.io/zone
        # Pod Anti-Affinity: spread Pods of same app
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - web-frontend
              topologyKey: kubernetes.io/hostname
      containers:
        - name: web-frontend
          image: myapp/web-frontend:latest

3.3 Taints and Tolerations

# Add Taint to nodes
# kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# kubectl taint nodes spot-node-1 spot=true:PreferNoSchedule

# GPU workload: Tolerate gpu Taint
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-training
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-training
  template:
    metadata:
      labels:
        app: ml-training
    spec:
      tolerations:
        - key: "gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
        - name: trainer
          image: myapp/ml-trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 4
---
# Spot instance workloads
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "PreferNoSchedule"
        - key: "node.kubernetes.io/not-ready"
          operator: "Exists"
          effect: "NoExecute"
          tolerationSeconds: 60
      containers:
        - name: processor
          image: myapp/batch-processor:latest

3.4 Priority and Preemption

# PriorityClass definitions
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-production
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For critical production services"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: standard-production
value: 100000
globalDefault: true
preemptionPolicy: PreemptLowerPriority
description: "For standard production workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
globalDefault: false
preemptionPolicy: Never
description: "For batch jobs. No preemption"
---
# Using Priority
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      priorityClassName: critical-production
      containers:
        - name: payment
          image: myapp/payment:latest

3.5 Topology Spread Constraints

A powerful feature that distributes Pods evenly across topology domains.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
spec:
  replicas: 12
  selector:
    matchLabels:
      app: api-gateway
  template:
    metadata:
      labels:
        app: api-gateway
    spec:
      topologySpreadConstraints:
        # Spread across AZs
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-gateway
        # Spread across nodes
        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api-gateway
          nodeAffinityPolicy: Honor
          nodeTaintsPolicy: Honor
      containers:
        - name: api-gateway
          image: myapp/api-gateway:latest
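Skew here is simply the difference between the most and least loaded topology domain; a quick arithmetic check with made-up zone counts:

```shell
# maxSkew constrains |most loaded domain - least loaded domain|.
# With 12 replicas in 3 zones and maxSkew: 1, a 5/4/3 split would be
# rejected (skew 2), while 4/4/4 satisfies the constraint (skew 0).
counts="4 4 4"
max=0; min=999999
for n in $counts; do
  [ "$n" -gt "$max" ] && max=$n
  [ "$n" -lt "$min" ] && min=$n
done
skew=$((max - min))
echo "skew=$skew"
```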

4. Resource Management

4.1 Understanding Requests vs Limits

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
    - name: app
      image: myapp/demo:latest
      resources:
        # Used for scheduling. This amount of resources is guaranteed
        requests:
          cpu: 500m
          memory: 512Mi
          ephemeral-storage: 1Gi
        # Upper bound. CPU is throttled when exceeded, memory triggers OOMKill
        limits:
          cpu: "2"
          memory: 1Gi
          ephemeral-storage: 2Gi

4.2 QoS Classes

| QoS Class | Condition | OOM Kill Priority |
| --- | --- | --- |
| Guaranteed | Every container sets CPU and memory requests equal to limits | Lowest (killed last) |
| Burstable | At least one container sets requests or limits, without meeting the Guaranteed criteria | Medium |
| BestEffort | No container sets any requests or limits | Highest (killed first) |

# Guaranteed QoS
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
    - name: app
      image: myapp/critical:latest
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
        limits:
          cpu: "1"
          memory: 1Gi
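For contrast, a Burstable Pod sets requests below limits, trading the Guaranteed class's eviction protection for burst headroom (image name is illustrative):

```yaml
# Burstable QoS: requests < limits, so the Pod can burst above its
# request but is a likelier OOM-kill target than a Guaranteed Pod.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
    - name: app
      image: myapp/web:latest
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: "1"
          memory: 512Mi
```

`kubectl get pod burstable-pod -o jsonpath='{.status.qosClass}'` prints the class the kubelet assigned.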

4.3 LimitRange and ResourceQuota

# LimitRange: Per-Pod/Container resource limits within a namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-backend
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 50m
        memory: 64Mi
    - type: Pod
      max:
        cpu: "8"
        memory: 16Gi
    - type: PersistentVolumeClaim
      max:
        storage: 100Gi
      min:
        storage: 1Gi
---
# ResourceQuota: Total resource cap for entire namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-backend-quota
  namespace: team-backend
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    services: "20"
    persistentvolumeclaims: "30"
    requests.storage: 500Gi
    count/deployments.apps: "30"
    count/configmaps: "50"
    count/secrets: "50"
  scopeSelector:
    matchExpressions:
      - scopeName: PriorityClass
        operator: In
        values:
          - standard-production
          - critical-production

4.4 PodDisruptionBudget (PDB)

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  minAvailable: "60%"
  selector:
    matchLabels:
      app: api-server
  unhealthyPodEvictionPolicy: IfHealthyBudget  # valid values: IfHealthyBudget (default), AlwaysAllow
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: redis-cluster
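How the budget math works for the api-server PDB above, with a made-up Pod count:

```shell
# minAvailable: "60%" is rounded UP against the current number of Pods.
# With 10 Pods, ceil(0.6 * 10) = 6 must stay available, so the eviction
# API allows at most 4 voluntary disruptions at once.
total=10
percent=60
min_available=$(( (percent * total + 99) / 100 ))  # integer ceiling
disruptions_allowed=$(( total - min_available ))
echo "disruptionsAllowed=$disruptions_allowed"
```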

5. Multi-Cluster Operations

5.1 Cluster API

Cluster API is a project for declaratively creating and managing Kubernetes clusters.

# Cluster definition
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-cluster
  namespace: clusters
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: production-cluster
---
# Control Plane definition
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: production-control-plane
  namespace: clusters
spec:
  replicas: 3
  version: v1.30.2
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
      kind: AWSMachineTemplate
      name: production-control-plane
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        extraArgs:
          audit-log-maxage: "30"
          audit-log-maxbackup: "10"
          enable-admission-plugins: "NodeRestriction,PodSecurityAdmission"
    initConfiguration:
      nodeRegistration:
        kubeletExtraArgs:
          cloud-provider: external

5.2 Fleet/Rancher Multi-Cluster Management

# Fleet GitRepo: Deploy across multiple clusters
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: platform-apps
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/platform-apps
  branch: main
  paths:
    - monitoring/
    - logging/
    - ingress/
  targets:
    - name: production
      clusterSelector:
        matchLabels:
          env: production
    - name: staging
      clusterSelector:
        matchLabels:
          env: staging
---
# Fleet Bundle customization
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: app-deployments
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/app-deployments
  branch: main
  targets:
    - name: us-east
      clusterSelector:
        matchLabels:
          region: us-east
      helm:
        values:
          replicaCount: 5
          ingress:
            host: api-us.mycompany.com
    - name: ap-northeast
      clusterSelector:
        matchLabels:
          region: ap-northeast
      helm:
        values:
          replicaCount: 3
          ingress:
            host: api-ap.mycompany.com

5.3 Multi-Cluster Architecture Patterns

| Pattern | Description | When to Use |
| --- | --- | --- |
| Hub-Spoke | Central management cluster controls worker clusters | Basic multi-cluster |
| Federation | KubeFed syncs resources across clusters | Same app across regions |
| Service Mesh | Istio multi-cluster for inter-cluster communication | Distributed microservices |
| Virtual Kubelet | Admiralty, Liqo connect clusters as virtual nodes | Burst workloads |

6. Cluster Upgrade Strategies

6.1 In-place Upgrade

#!/bin/bash
# Control Plane upgrade
echo "=== Starting Control Plane Upgrade ==="

# 1. Check current version
kubectl get nodes
kubectl version   # the --short flag was removed in recent kubectl releases

# 2. Upgrade kubeadm
sudo apt-get update
sudo apt-get install -y kubeadm=1.30.2-1.1
sudo kubeadm upgrade plan
sudo kubeadm upgrade apply v1.30.2

# 3. Upgrade kubelet and kubectl
sudo apt-get install -y kubelet=1.30.2-1.1 kubectl=1.30.2-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet

echo "=== Sequential Worker Node Upgrade ==="
NODES=$(kubectl get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[*].metadata.name}')

for NODE in $NODES; do
  echo "--- Starting upgrade for $NODE ---"

  # Cordon: prevent new Pod scheduling
  kubectl cordon "$NODE"

  # Drain: evict existing Pods
  kubectl drain "$NODE" \
    --ignore-daemonsets \
    --delete-emptydir-data \
    --grace-period=120 \
    --timeout=300s

  # Run upgrade on the node (via SSH or automation tool)
  echo "Running kubeadm and kubelet upgrade on node $NODE"

  # Uncordon: resume scheduling
  kubectl uncordon "$NODE"

  # Verify node Ready status
  kubectl wait --for=condition=Ready "node/$NODE" --timeout=300s

  echo "--- Upgrade complete for $NODE ---"
done

6.2 Blue-Green Cluster Upgrade

# Create new cluster with Cluster API
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-v130
  namespace: clusters
  labels:
    upgrade-group: production
    version: v1.30
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: production-v130-cp
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: production-v130

7. Troubleshooting

7.1 Using kubectl debug

# Add debug container to running Pod
kubectl debug -it pod/api-server-abc123 \
  --image=nicolaka/netshoot \
  --target=api-server \
  -- /bin/bash

# Node debugging
kubectl debug node/worker-1 \
  -it --image=ubuntu:22.04 \
  -- /bin/bash

# Debug with Pod copy (image change)
kubectl debug pod/api-server-abc123 \
  -it --copy-to=debug-pod \
  --container=api-server \
  --image=myapp/api-server:debug \
  -- /bin/sh

7.2 Common Issues and Solutions

Pending Pod Issues:

# Check why Pod is Pending
kubectl describe pod pending-pod-name

# Common causes:
# 1. Insufficient resources -> Add nodes or adjust resources
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_ALLOC:.status.allocatable.cpu,\
MEM_ALLOC:.status.allocatable.memory,\
CPU_CAP:.status.capacity.cpu

# 2. nodeSelector/affinity mismatch -> Check labels
kubectl get nodes --show-labels

# 3. Taints blocking -> Add tolerations
kubectl describe nodes | grep -A5 Taints

CrashLoopBackOff Resolution:

# Check logs
kubectl logs pod/crashing-pod --previous
kubectl logs pod/crashing-pod -c init-container-name

# Check events
kubectl get events --sort-by=.lastTimestamp \
  --field-selector involvedObject.name=crashing-pod

# Check for OOM Kill
kubectl describe pod crashing-pod | grep -A5 "Last State"
# If OOMKilled appears, increase memory limits

# Run in debug mode
kubectl debug pod/crashing-pod \
  -it --copy-to=debug-pod \
  --container=app \
  --image=busybox \
  -- /bin/sh

Network Issue Diagnosis:

# DNS check
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local

# Service connectivity test
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never \
  -- curl -v http://api-server.production.svc:8080/health

# Check network policies
kubectl get networkpolicy -A
kubectl describe networkpolicy -n production

8. Cost Optimization

8.1 Leveraging Spot Nodes

# Karpenter Spot NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workloads
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "500"
    memory: 1000Gi

8.2 Right-sizing Automation

#!/bin/bash
# Right-sizing report based on VPA recommendations

echo "=== Resource Utilization by Namespace ==="
for NS in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  # kubectl top reports current usage, not requests
  CPU_USE=$(kubectl top pods -n "$NS" --no-headers 2>/dev/null | \
    awk '{sum += $2} END {print sum}')
  MEM_USE=$(kubectl top pods -n "$NS" --no-headers 2>/dev/null | \
    awk '{sum += $3} END {print sum}')
  if [ -n "$CPU_USE" ] && [ "$CPU_USE" != "0" ]; then
    echo "Namespace: $NS | CPU: ${CPU_USE}m | Memory: ${MEM_USE}Mi"
  fi
done

echo ""
echo "=== VPA Recommendations ==="
for VPA in $(kubectl get vpa -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'); do
  NS=$(echo "$VPA" | cut -d'/' -f1)
  NAME=$(echo "$VPA" | cut -d'/' -f2)
  echo "--- $VPA ---"
  kubectl get vpa "$NAME" -n "$NS" -o jsonpath='{.status.recommendation.containerRecommendations[*]}'
  echo ""
done

8.3 Namespace Cost Allocation

# Using Kubecost or OpenCost
# helm install kubecost kubecost/cost-analyzer \
#   --namespace kubecost --create-namespace \
#   --set prometheus.enabled=false \
#   --set prometheus.fqdn=http://prometheus-server.monitoring:80

# Cost allocation via namespace labels
apiVersion: v1
kind: Namespace
metadata:
  name: team-backend
  labels:
    cost-center: "backend-team"
    department: "engineering"
    project: "api-platform"
    environment: "production"

9. Practice Quiz

Q1: What component is needed for HPA to scale on custom metrics?

Answer: A custom metrics API server like Prometheus Adapter (or Datadog Cluster Agent, etc.) is required.

  • Prometheus Adapter exposes Prometheus metrics through the Kubernetes Custom Metrics API (custom.metrics.k8s.io)
  • HPA queries this API to retrieve custom metric values and make scaling decisions
  • Flow: Prometheus collects -> Adapter transforms -> HPA queries -> Scaling executes
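The scaling decision in that flow uses the formula from the HPA documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue); a quick check with invented numbers:

```shell
# 4 replicas averaging 900 on a metric targeted at 500:
# ceil(4 * 900 / 500) = ceil(7.2) = 8 replicas.
current_replicas=4
current_value=900
target_value=500
desired=$(( (current_replicas * current_value + target_value - 1) / target_value ))
echo "desiredReplicas=$desired"
```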
Q2: What conditions must be met for Guaranteed QoS Class?

Answer: All containers in the Pod must have CPU and memory requests and limits set, and each pair must be equal.

  • requests.cpu = limits.cpu
  • requests.memory = limits.memory
  • Applies to all containers (including init containers)
  • Guaranteed Pods are the last to be OOM Killed under node memory pressure
Q3: What is the difference between requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution in podAntiAffinity?

Answer: Required means the condition must be satisfied for the Pod to be scheduled. If no node satisfies the condition, the Pod stays Pending. Preferred means the scheduler tries to place the Pod where conditions are met, but will place it elsewhere if necessary.

  • required: Hard constraint (mandatory)
  • preferred: Soft constraint, priority adjustable via weight
  • IgnoredDuringExecution: Already running Pods are not evicted even if conditions change
Q4: How does Karpenter consolidation save costs?

Answer: Karpenter consolidation moves Pods from idle or underutilized nodes to other nodes, then removes empty nodes or replaces them with smaller (cheaper) instances.

  • WhenEmpty: Only removes nodes with no Pods
  • WhenEmptyOrUnderutilized: Also relocates Pods from underutilized nodes before removing
  • Can replace nodes with smaller, cheaper instance types (e.g., a half-empty c5.2xlarge with a c5.xlarge)
  • consolidateAfter sets the stabilization wait time
Q5: How does PodDisruptionBudget affect cluster upgrades?

Answer: PDB limits the number of Pods that can be simultaneously disrupted during voluntary disruptions.

  • During kubectl drain, PDB is respected to evict Pods sequentially
  • minAvailable: Guarantees minimum available Pod count/percentage
  • maxUnavailable: Limits maximum disrupted Pod count/percentage
  • Overly strict PDBs can cause drain timeouts
  • unhealthyPodEvictionPolicy: AlwaysAllow lets unhealthy Pods be evicted even when the budget is not met (the default IfHealthyBudget does not)

10. References

  1. Kubernetes Official Docs - HPA
  2. Kubernetes Official Docs - VPA
  3. KEDA Official Documentation
  4. Karpenter Official Documentation
  5. Kubernetes Scheduling
  6. Topology Spread Constraints
  7. Resource Management
  8. Cluster API
  9. Fleet Manager
  10. Kubecost
  11. Kubernetes Best Practices - Google
  12. EKS Best Practices Guide
  13. Pod Priority and Preemption