[AWS] Managing GPU Nodes with Karpenter: AI/ML Workload Optimization
Table of Contents

1. GPU Node Provisioning with Karpenter

The Unique Nature of GPU Workloads

AI/ML workloads have distinct requirements compared to general computing:

+---------------------------------------------------------------+
|               GPU Workload Characteristics                    |
+---------------------------------------------------------------+
| - Expensive GPU instances (dollars to tens of dollars/hour)   |
| - Long-running training jobs (hours to days)                  |
| - Low-latency requirements for inference                      |
| - GPU memory (VRAM) as the primary resource constraint        |
| - Significant performance differences across instance types   |
| - Risk of losing training progress on Spot interruption       |
+---------------------------------------------------------------+

Why Karpenter Excels for GPU Management

+------------------------------------------+
|    Traditional (Cluster Autoscaler)      |
|                                          |
|  GPU Node Group A: p3.2xlarge            |
|  GPU Node Group B: g5.xlarge             |
|  GPU Node Group C: g5.2xlarge            |
|  GPU Node Group D: p4d.24xlarge          |
|  ...                                     |
|  Manage each Node Group separately       |
|  (inefficient)                           |
+------------------------------------------+

+------------------------------------------+
|         Karpenter Approach               |
|                                          |
|  Single GPU NodePool:                    |
|  - Analyze pod requirements              |
|  - Auto-select optimal GPU instance      |
|  - Automatic Spot/On-Demand switching    |
|  - Cost-based instance optimization      |
+------------------------------------------+

2. GPU NodePool Configuration

General-Purpose GPU NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-general
spec:
  template:
    metadata:
      labels:
        node-type: gpu
        workload: ai-ml
    spec:
      requirements:
        # Select only GPU instances
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

        # GPU instance families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g', 'p']

        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

        # Availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

        # x86 architecture only
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']

      # GPU-only taint
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

      # Give GPU nodes a longer expiry
      expireAfter: 336h # 14 days

  limits:
    cpu: '500'
    memory: 2000Gi
    nvidia.com/gpu: '100'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
    budgets:
      - nodes: '1'

  weight: 80

AWS GPU Instance Type Guide

+------------------+------+--------+------------------+------------------------------+
| Instance Type    | GPU  | # GPUs | GPU Memory       | Primary Use                  |
+------------------+------+--------+------------------+------------------------------+
| g4dn.xlarge      | T4   | 1      | 16 GB            | Inference, light training    |
| g4dn.12xlarge    | T4   | 4      | 64 GB            | Multi-model inference        |
| g5.xlarge        | A10G | 1      | 24 GB            | Inference, fine-tuning       |
| g5.12xlarge      | A10G | 4      | 96 GB            | Mid-size training            |
| g5.48xlarge      | A10G | 8      | 192 GB           | Large training               |
| g6.xlarge        | L4   | 1      | 24 GB            | Inference-optimized          |
| g6.12xlarge      | L4   | 4      | 96 GB            | Multimodal inference         |
| p3.2xlarge       | V100 | 1      | 16 GB            | General training             |
| p3.8xlarge       | V100 | 4      | 64 GB            | Large-scale training         |
| p4d.24xlarge     | A100 | 8      | 320 GB (40GB x8) | Very large-scale training    |
| p5.48xlarge      | H100 | 8      | 640 GB (80GB x8) | Maximum-performance training |
+------------------+------+--------+------------------+------------------------------+
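As a quick sizing aid, the table can be encoded and queried in a few lines. A minimal sketch (instance data transcribed from the table above; the helper and its selection rule are illustrative, not an AWS API — verify specs against current EC2 documentation):

```python
# Illustrative: pick GPU instance types from the table above that satisfy
# a per-GPU VRAM floor and a minimum GPU count. Data transcribed from the
# table; prices and regional availability are intentionally omitted.
GPU_INSTANCES = {
    "g4dn.xlarge":  {"gpu": "T4",   "count": 1, "vram_per_gpu_gb": 16},
    "g5.xlarge":    {"gpu": "A10G", "count": 1, "vram_per_gpu_gb": 24},
    "g5.12xlarge":  {"gpu": "A10G", "count": 4, "vram_per_gpu_gb": 24},
    "g6.xlarge":    {"gpu": "L4",   "count": 1, "vram_per_gpu_gb": 24},
    "p3.2xlarge":   {"gpu": "V100", "count": 1, "vram_per_gpu_gb": 16},
    "p4d.24xlarge": {"gpu": "A100", "count": 8, "vram_per_gpu_gb": 40},
    "p5.48xlarge":  {"gpu": "H100", "count": 8, "vram_per_gpu_gb": 80},
}

def candidates(min_vram_gb: int, min_gpus: int = 1) -> list[str]:
    """Return instance types meeting a per-GPU VRAM and GPU-count floor."""
    return sorted(
        name for name, spec in GPU_INSTANCES.items()
        if spec["vram_per_gpu_gb"] >= min_vram_gb and spec["count"] >= min_gpus
    )

print(candidates(24))               # single-GPU fine-tuning needing 24 GB VRAM
print(candidates(40, min_gpus=8))   # multi-GPU large-model training
```

The same idea scales to richer filters (architecture, price caps) if you feed it live data from the EC2 API.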

Inference-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: inference
    spec:
      requirements:
        # Instances well-suited to inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Prefer Spot instances (inference is stateless)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']

        # Constrain instance sizes
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '50'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m

  weight: 60

Training-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training
spec:
  template:
    metadata:
      labels:
        node-type: gpu-training
        workload: training
    spec:
      requirements:
        # High-performance GPUs suited to training
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a100', 'h100', 'a10g']

        # On-Demand only (interrupting training is costly)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

        # Any GPU-bearing instance
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

      # Give training nodes a long expiry
      expireAfter: 720h # 30 days

  limits:
    nvidia.com/gpu: '32'

  disruption:
    # Protect running training jobs: consolidate empty nodes only
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      - nodes: '0'

  weight: 90

3. GPU-Specific EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-optimized
spec:
  # AMI alias (Karpenter resolves the GPU-enabled AL2023 variant for GPU instance types)
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        network-type: private

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large disk for GPU workloads
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 6000
        throughput: 250
        encrypted: true
        deleteOnTermination: true

  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required

  tags:
    Environment: production
    NodeType: gpu
    ManagedBy: karpenter

  # User data for GPU node bootstrap
  userData: |
    #!/bin/bash
    echo "GPU node bootstrap"
    # NVIDIA drivers are handled by the GPU Operator

Training EC2NodeClass (Large Storage)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-training
spec:
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large, high-performance storage for training data
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
        deleteOnTermination: true

  tags:
    Environment: production
    NodeType: gpu-training
    ManagedBy: karpenter

4. Spot GPU Instance Strategy

Cost Savings from Spot GPUs

+------------------+-------------------+-------------------+----------+
| Instance Type    | On-Demand ($/hr)  | Est. Spot ($/hr)  | Savings  |
+------------------+-------------------+-------------------+----------+
| g4dn.xlarge      | ~0.526            | ~0.158            | ~70%     |
| g5.xlarge        | ~1.006            | ~0.302            | ~70%     |
| g5.2xlarge       | ~1.212            | ~0.364            | ~70%     |
| g5.12xlarge      | ~5.672            | ~1.702            | ~70%     |
| p3.2xlarge       | ~3.060            | ~0.918            | ~70%     |
+------------------+-------------------+-------------------+----------+
 (Prices vary by region and over time)
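For a sense of scale, the g5.xlarge row works out as follows — a back-of-the-envelope sketch; the ~730 hours/month figure is a convention and the prices are the illustrative samples from the table:

```python
# Rough monthly Spot savings for a single g5.xlarge, using the sample
# prices from the table above (illustrative; real prices vary).
HOURS_PER_MONTH = 730  # ~ 24 * 365 / 12

def monthly_cost(hourly_usd: float) -> float:
    """Hourly price extrapolated to a full month of continuous use."""
    return hourly_usd * HOURS_PER_MONTH

on_demand = monthly_cost(1.006)
spot = monthly_cost(0.302)
savings = 1 - spot / on_demand  # fraction saved by running on Spot

print(f"On-Demand ${on_demand:.0f}/mo vs Spot ${spot:.0f}/mo ({savings:.0%} saved)")
```

Multiply by replica count and the gap becomes the main argument for Spot-first inference fleets.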

Spot GPU NodePool for Inference

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-spot
        workload: inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']

        # A range of GPU types suited to inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Multiple sizes to improve Spot availability
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge']

        # Spread across multiple AZs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '40'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

  weight: 70

Preparing for Spot Interruptions

# Apply the do-not-disrupt annotation to long-running training Pods
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  annotations:
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: training
      image: my-training-image:latest
      resources:
        requests:
          nvidia.com/gpu: '1'
          cpu: '4'
          memory: 16Gi
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  terminationGracePeriodSeconds: 120

5. Integrating the NVIDIA GPU Operator

GPU Operator Overview

+----------------------------------------------------------------+
|                    NVIDIA GPU Operator                         |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | NVIDIA Driver    |  | Container Toolkit |  | Device Plugin| |
|  | (auto-install)   |  | (auto-configure)  |  | (auto-deploy)| |
|  +------------------+  +-------------------+  +--------------+ |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | GPU Feature      |  | DCGM Exporter     |  | MIG Manager  | |
|  | Discovery        |  | (metrics)         |  | (MIG mgmt)   | |
|  +------------------+  +-------------------+  +--------------+ |
+----------------------------------------------------------------+

Installing the GPU Operator

# Add the NVIDIA GPU Operator Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false \
  --set gfd.enabled=true

Verifying the GPU Operator and Karpenter Integration

# Inspect NVIDIA labels on GPU nodes
kubectl get nodes -l node-type=gpu -o json | \
  jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia")))'

# Check GPU resources on a node
kubectl describe node gpu-node-name | grep -A 5 "nvidia.com/gpu"

# Check the DCGM Exporter Pods
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

GPU Workload Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference-server
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc

6. Multi-Architecture Support (x86 + ARM/Graviton)

Multi-Architecture NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    spec:
      requirements:
        # Allow both x86 and ARM
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64', 'arm64']

        # Include Graviton instances
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']

        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['5']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default

  limits:
    cpu: '1000'
    memory: 2000Gi

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

Beyond GPUs: AWS Inferentia/Trainium

# NodePool dedicated to AWS Inferentia inference
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inferentia
spec:
  template:
    metadata:
      labels:
        accelerator: inferentia
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

      taints:
        - key: aws.amazon.com/neuron
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: inferentia-nodes

  limits:
    aws.amazon.com/neuron: '32'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m

7. Cost Optimization Strategies

Mixing Spot and On-Demand

+-------------------------------------------------------------+
|              Cost Optimization Decision Tree                |
+-------------------------------------------------------------+
|                                                             |
|  Identify the workload type                                 |
|      |                                                      |
|      +-- Inference (stateless) --> Spot + On-Demand fallback|
|      |                                                      |
|      +-- Fine-tuning (short) --> Spot + checkpointing       |
|      |                                                      |
|      +-- Large training (long) --> On-Demand + RIs          |
|      |                                                      |
|      +-- Batch processing --> Spot only                     |
|                                                             |
+-------------------------------------------------------------+
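The decision tree above can be captured as a small lookup, which is handy when tooling (admission webhooks, CI checks) needs to validate workload placement. A sketch of the mapping only; the names and tactics strings are illustrative:

```python
# Encode the decision tree above as a lookup from workload type to a
# (capacity strategy, protection tactic) pair. Purely illustrative;
# tune to your own SLAs and budgets.
STRATEGY = {
    "inference":      ("spot-first, on-demand fallback", "multiple GPU types/sizes"),
    "fine-tuning":    ("spot", "frequent checkpointing"),
    "large-training": ("on-demand / reserved", "do-not-disrupt + disruption budgets"),
    "batch":          ("spot only", "retry on interruption"),
}

def pick_strategy(workload: str) -> tuple[str, str]:
    """Return (capacity strategy, protection tactic) for a workload type."""
    try:
        return STRATEGY[workload]
    except KeyError:
        raise ValueError(f"unknown workload type: {workload}") from None

capacity, tactic = pick_strategy("inference")
print(capacity, "|", tactic)
```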

Instance Family Strategy with Weighted Priorities

# Priority 1: G5 Spot (most cost-effective for inference)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier1-g5-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 100
  limits:
    nvidia.com/gpu: '20'
---
# Priority 2: G4dn Spot (fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier2-g4dn-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 50
  limits:
    nvidia.com/gpu: '20'
---
# Priority 3: G5 On-Demand (last-resort fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier3-g5-ondemand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 10
  limits:
    nvidia.com/gpu: '10'

Optimizing the Consolidation Policy

# Consolidation settings for GPU nodes
disruption:
  # Use WhenEmpty only for GPU nodes (protects running GPU jobs)
  consolidationPolicy: WhenEmpty
  # Wait 5 minutes after a node becomes empty (tolerates brief idle gaps)
  consolidateAfter: 5m
  budgets:
    # Disrupt at most 1 node at a time
    - nodes: '1'
    # Block disruptions during business hours
    - nodes: '0'
      schedule: '0 9 * * MON-FRI'
      duration: 10h

8. Node Disruption Budgets

Disruption Budgets for GPU Workloads

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training-protected
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      # Fully block disruptions during training hours
      - nodes: '0'
        schedule: '0 0 * * *'
        duration: 23h

      # Maintenance window (1 hour daily)
      - nodes: '1'
        schedule: '0 23 * * *'
        duration: 1h

      # Handle drift-driven disruptions separately
      - nodes: '1'
        reasons:
          - 'Drifted'

Pod-Level Protection

# Long-running training Pod: prevent voluntary disruption by Karpenter
apiVersion: v1
kind: Pod
metadata:
  name: long-training-job
  annotations:
    # This annotation blocks Karpenter's voluntary disruptions
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: trainer
      image: my-training-image:v1
      resources:
        requests:
          nvidia.com/gpu: '4'
          cpu: '16'
          memory: 64Gi
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # Generous termination grace period (time to save a checkpoint)
  terminationGracePeriodSeconds: 300

Pod Disruption Budget (PDB) Configuration

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-server-pdb
  namespace: ml-serving
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-server

9. Monitoring with Prometheus/Grafana

Configuring Karpenter Metric Collection

# Karpenter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics

Key Karpenter Metrics

+-----------------------------------------------+-----------------------------------------+
| Metric                                        | Description                             |
+-----------------------------------------------+-----------------------------------------+
| karpenter_nodeclaims_launched_total           | Total NodeClaims launched               |
| karpenter_nodeclaims_registered_total         | Total NodeClaims registered             |
| karpenter_nodeclaims_terminated_total         | Total NodeClaims terminated             |
| karpenter_pods_state                          | Pod state (node, namespace, etc.)       |
| karpenter_nodepool_usage                      | Resource usage per NodePool             |
| karpenter_nodepool_limit                      | Resource limits per NodePool            |
| karpenter_voluntary_disruption_eligible_nodes | Nodes eligible for voluntary disruption |
| karpenter_disruption_actions_performed_total  | Disruption actions performed            |
| karpenter_nodes_allocatable                   | Allocatable resources per node          |
| karpenter_nodes_total_daemon_requests         | Total DaemonSet resource requests       |
+-----------------------------------------------+-----------------------------------------+
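These metrics are also reachable programmatically over the Prometheus HTTP API, which is useful for cost reports or capacity scripts. A sketch that builds the instant-query URL and extracts per-NodePool values from a response; the base URL is an assumption — point it at your own Prometheus service:

```python
# Sketch: query Karpenter metrics over the Prometheus HTTP API.
from urllib.parse import urlencode

def query_url(base: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

def gpu_usage_by_nodepool(response: dict) -> dict[str, float]:
    """Map nodepool label -> sample value from an instant-query response."""
    return {
        r["metric"].get("nodepool", "<none>"): float(r["value"][1])
        for r in response["data"]["result"]
    }

url = query_url(
    "http://prometheus.monitoring:9090",  # assumed in-cluster address
    'karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}',
)
# Inside the cluster: json.load(urllib.request.urlopen(url))
```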

GPU-Focused Grafana Dashboard Queries

# Track the number of GPU nodes
count(karpenter_nodes_allocatable{resource_type="nvidia.com/gpu"} > 0)

# GPU utilization (requires the DCGM Exporter)
DCGM_FI_DEV_GPU_UTIL

# GPU usage vs. limit per NodePool
karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}
  /
karpenter_nodepool_limit{resource_type="nvidia.com/gpu"}

# Provisioning latency
histogram_quantile(0.99,
  rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
)

DCGM Exporter Metrics

# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

Example Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: karpenter-gpu
      rules:
        # GPU NodePool at 90% of its limit
        - alert: GPUNodePoolNearLimit
          expr: |
            karpenter_nodepool_usage{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            /
            karpenter_nodepool_limit{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU NodePool approaching resource limit'

        # Detect nodes with low GPU utilization
        - alert: LowGPUUtilization
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 1h
          labels:
            severity: info
          annotations:
            summary: 'GPU utilization below 10 percent for 1 hour'

        # Karpenter provisioning failures
        - alert: KarpenterProvisioningFailed
          expr: |
            increase(karpenter_nodeclaims_terminated_total{reason="ProvisioningFailed"}[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: 'Karpenter failed to provision GPU node'

10. Hands-On Example: Training Cluster

Distributed Training Cluster Setup

# PyTorch distributed training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  namespace: ml-training
spec:
  completionMode: Indexed # required for the batch.kubernetes.io/job-completion-index annotation
  parallelism: 4
  completions: 4
  template:
    metadata:
      labels:
        app: distributed-training
      annotations:
        karpenter.sh/do-not-disrupt: 'true'
    spec:
      containers:
        - name: pytorch-trainer
          image: my-pytorch-training:v1
          command: ['torchrun']
          args:
            - '--nproc_per_node=1'
            - '--nnodes=4'
            - '--node_rank=$(JOB_COMPLETION_INDEX)'
            - '--master_addr=training-master'
            - '--master_port=29500'
            - 'train.py'
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: '8'
              memory: 32Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-training
      restartPolicy: OnFailure
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-pvc

11. Hands-On Example: Inference Cluster

Autoscaling Inference Service

# Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Llama-3-8B'
            - '--tensor-parallel-size'
            - '1'
            - '--gpu-memory-utilization'
            - '0.9'
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
---
# HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '70'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

12. Troubleshooting Guide

Common GPU Node Problems

# 1. GPU resources not showing on a node
kubectl describe node gpu-node | grep -A 10 "Allocatable"
# If nvidia.com/gpu is missing, check the GPU Operator

# 2. Check GPU Operator Pod status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. Check Karpenter provisioning logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter \
  | grep -i "gpu\|nvidia\|instance-type"

# 4. Check NodeClaim status
kubectl get nodeclaims -o wide

# 5. Analyze why a Pod is Pending
kubectl describe pod gpu-pod-name | grep -A 20 "Events"

Frequent Problems and Fixes

+---------------------------------------------+------------------------------------------+
| Problem                                     | Fix                                      |
+---------------------------------------------+------------------------------------------+
| GPU resources not shown on the node         | Reinstall GPU Operator or check drivers  |
| No Spot GPU capacity found                  | Add more GPU instance types and AZs      |
| GPU node provisioning times out             | Check EC2NodeClass subnet/SG tags        |
| Node disrupted mid-training                 | Add the do-not-disrupt annotation        |
| GPU out of memory (OOM)                     | Allow larger GPU instance types          |
| Idle GPU nodes linger                       | Check consolidation policy and           |
|                                             | consolidateAfter value                   |
| Only one GPU type gets provisioned          | Broaden NodePool requirements            |
+---------------------------------------------+------------------------------------------+

GPU Status Check Command

# Check GPU state directly on a node (via a debug Pod)
kubectl run gpu-debug --rm -it \
  --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
  --overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"nodeSelector":{"node-type":"gpu"}}}' \
  --restart=Never \
  -- nvidia-smi

13. Best Practices Summary

GPU Node Management Checklist

+----+------------------------------------------------------------+
| #  | Best Practice                                              |
+----+------------------------------------------------------------+
| 1  | Separate NodePools for inference and training workloads    |
| 2  | Taint GPU nodes to keep non-GPU workloads off them         |
| 3  | Use Spot for inference, On-Demand for training             |
| 4  | Add do-not-disrupt to long-running training Pods           |
| 5  | Protect training progress with checkpointing               |
| 6  | Automate driver management with the GPU Operator           |
| 7  | Collect GPU metrics with the DCGM Exporter                 |
| 8  | Cap GPU spend with NodePool limits                         |
| 9  | Allow multiple GPU instance types for availability         |
| 10 | Guarantee minimum inference availability with PDBs         |
| 11 | Block disruptions during training with disruption budgets  |
| 12 | Pair HPA with Karpenter for end-to-end autoscaling         |
+----+------------------------------------------------------------+

Cost Optimization Strategy Summary

Strategy 1: Tiered NodePools
  - Spot GPU (high weight) -> On-Demand GPU (low weight)
  - Best suited to inference workloads

Strategy 2: Instance diversification
  - Allow multiple GPU families (g4dn, g5, g6)
  - Allow multiple instance sizes
  - Maximize Spot availability

Strategy 3: Automatic scale-down
  - Remove empty GPU nodes promptly with WhenEmpty consolidation
  - Keep consolidateAfter short for inference
  - Use longer waits for training nodes

Strategy 4: Sensible resource limits
  - Cap the maximum GPU count with NodePool limits
  - Prevent unexpected cost blowouts
  - Manage quotas per team/project

Final Karpenter + GPU Architecture Diagram

+---------------------------------------------------------------------+
|                        EKS Cluster                                  |
|                                                                     |
|  +-------------------+  +-------------------+  +-----------------+  |
|  | NodePool:         |  | NodePool:         |  | NodePool:       |  |
|  | gpu-inference     |  | gpu-training      |  | multi-arch      |  |
|  | (Spot, weight:60) |  | (OD, weight:90)   |  | (Mixed, w:50)   |  |
|  +--------+----------+  +--------+----------+  +--------+--------+  |
|           |                      |                      |           |
|  +--------v----------+  +--------v----------+  +--------v--------+  |
|  | EC2NodeClass:     |  | EC2NodeClass:     |  | EC2NodeClass:   |  |
|  | gpu-optimized     |  | gpu-training      |  | default         |  |
|  | (200GB, gp3)      |  | (500GB, gp3)      |  | (100GB, gp3)    |  |
|  +-------------------+  +-------------------+  +-----------------+  |
|                                                                     |
|  +-------------------+  +-------------------+                       |
|  | GPU Operator      |  | Prometheus +      |                       |
|  | (NVIDIA Driver,   |  | Grafana           |                       |
|  |  Device Plugin,   |  | (Karpenter +      |                       |
|  |  DCGM Exporter)   |  |  DCGM Metrics)    |                       |
|  +-------------------+  +-------------------+                       |
+---------------------------------------------------------------------+

[AWS] Managing GPU Nodes with Karpenter: AI/ML Workload Optimization

Table of Contents

1. GPU Node Provisioning with Karpenter

The Unique Nature of GPU Workloads

AI/ML workloads have distinct requirements compared to general computing:

+---------------------------------------------------------------+
|               GPU Workload Characteristics                    |
+---------------------------------------------------------------+
| - Expensive GPU instances (dollars to tens of dollars/hour)   |
| - Long-running training jobs (hours to days)                  |
| - Low-latency requirements for inference                      |
| - GPU memory (VRAM) as the primary resource constraint        |
| - Significant performance differences across instance types   |
| - Risk of losing training progress on Spot interruption       |
+---------------------------------------------------------------+
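Because VRAM is the binding constraint, a pod can require a minimum amount of GPU memory through Karpenter's well-known `karpenter.k8s.aws/instance-gpu-memory` label (value in MiB), and Karpenter will only launch instance types that satisfy it. A minimal sketch; the 20000 MiB threshold and image name are illustrative:

```yaml
# Pod that needs a GPU with roughly 20 GB+ of VRAM.
# karpenter.k8s.aws/instance-gpu-memory is a Karpenter well-known label (MiB);
# the threshold and image below are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: vram-sensitive
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: karpenter.k8s.aws/instance-gpu-memory
                operator: Gt
                values: ['20000']
  containers:
    - name: app
      image: my-inference-image:latest # placeholder
      resources:
        requests:
          nvidia.com/gpu: '1'
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```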

Why Karpenter Excels for GPU Management

+-------------------------------------------+
|     Traditional (Cluster Autoscaler)      |
|                                           |
|  GPU Node Group A: p3.2xlarge             |
|  GPU Node Group B: g5.xlarge              |
|  GPU Node Group C: g5.2xlarge             |
|  GPU Node Group D: p4d.24xlarge           |
|  ...                                      |
|  Manage each Node Group separately        |
|  (inefficient)                            |
+-------------------------------------------+

+-------------------------------------------+
|            Karpenter Approach             |
|                                           |
|  Single GPU NodePool:                     |
|  - Analyze pod requirements               |
|  - Auto-select optimal GPU instance       |
|  - Automatic Spot/On-Demand switching     |
|  - Cost-based instance optimization       |
+-------------------------------------------+

2. GPU NodePool Configuration

General-Purpose GPU NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-general
spec:
  template:
    metadata:
      labels:
        node-type: gpu
        workload: ai-ml
    spec:
      requirements:
        # Select only GPU instances
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

        # GPU instance families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g', 'p']

        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

        # Availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

        # x86 architecture only
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']

      # GPU-dedicated taint
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

      # Longer expiration for GPU nodes
      expireAfter: 336h # 14 days

  limits:
    cpu: '500'
    memory: 2000Gi
    nvidia.com/gpu: '100'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
    budgets:
      - nodes: '1'

  weight: 80

AWS GPU Instance Type Guide

+------------------+----------+----------+------------------+---------------------+
| Instance Type    | GPU      | Count    | GPU Memory       | Primary Use Case    |
+------------------+----------+----------+------------------+---------------------+
| g4dn.xlarge      | T4       | 1        | 16 GB            | Inference, light ML |
| g4dn.12xlarge    | T4       | 4        | 64 GB            | Multi-inference     |
| g5.xlarge        | A10G     | 1        | 24 GB            | Inference, fine-tune|
| g5.12xlarge      | A10G     | 4        | 96 GB            | Medium training     |
| g5.48xlarge      | A10G     | 8        | 192 GB           | Large training      |
| g6.xlarge        | L4       | 1        | 24 GB            | Inference optimized |
| g6.12xlarge      | L4       | 4        | 96 GB            | Multimodal inference|
| p3.2xlarge       | V100     | 1        | 16 GB            | General training    |
| p3.8xlarge       | V100     | 4        | 64 GB            | Large training      |
| p4d.24xlarge     | A100     | 8        | 320 GB (40GB x8) | Ultra-large training|
| p5.48xlarge      | H100     | 8        | 640 GB (80GB x8) | Maximum performance |
+------------------+----------+----------+------------------+---------------------+
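The table above maps onto Karpenter's well-known labels: rather than hard-coding instance types, a pod can pin a specific GPU model with `karpenter.k8s.aws/instance-gpu-name` and let Karpenter pick the cheapest matching size. A sketch; the image name is a placeholder:

```yaml
# Pod that steers Karpenter to A10G-backed instances (g5 family).
apiVersion: v1
kind: Pod
metadata:
  name: a10g-only
spec:
  nodeSelector:
    karpenter.k8s.aws/instance-gpu-name: a10g
  containers:
    - name: app
      image: my-ml-image:latest # placeholder
      resources:
        requests:
          nvidia.com/gpu: '1'
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```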

Inference-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: inference
    spec:
      requirements:
        # Inference-suitable instances
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Spot instances preferred (inference is stateless)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']

        # Instance size constraint
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '50'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m

  weight: 60

Training-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training
spec:
  template:
    metadata:
      labels:
        node-type: gpu-training
        workload: training
    spec:
      requirements:
        # High-performance GPUs for training
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a100', 'h100', 'a10g']

        # On-Demand only (training interruption is costly)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

        # Require at least one GPU
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

      # Long expiration for training nodes (30 days)
      expireAfter: 720h # 30 days

  limits:
    nvidia.com/gpu: '32'

  disruption:
    # Disable consolidation during training
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      - nodes: '0'

  weight: 90

3. GPU-Optimized EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-optimized
spec:
  # EKS-optimized AL2023 alias; Karpenter resolves the accelerated
  # (NVIDIA) AMI variant for GPU instance types
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        network-type: private

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large disk for GPU workloads
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 6000
        throughput: 250
        encrypted: true
        deleteOnTermination: true

  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required

  tags:
    Environment: production
    NodeType: gpu
    ManagedBy: karpenter

  # Bootstrap script for GPU nodes
  userData: |
    #!/bin/bash
    echo "GPU node bootstrap"
    # NVIDIA drivers are handled by GPU Operator

Training-Specific EC2NodeClass (Large Storage)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-training
spec:
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large, high-performance storage for training data
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
        deleteOnTermination: true

  tags:
    Environment: production
    NodeType: gpu-training
    ManagedBy: karpenter

4. Spot GPU Instance Strategy

Cost Savings with Spot GPU

+------------------+-------------------+-------------------+---------+
| Instance Type    | On-Demand (hr)    | Spot Est. (hr)    | Savings |
+------------------+-------------------+-------------------+---------+
| g4dn.xlarge      | ~0.526            | ~0.158            | ~70%    |
| g5.xlarge        | ~1.006            | ~0.302            | ~70%    |
| g5.2xlarge       | ~1.212            | ~0.364            | ~70%    |
| g5.12xlarge      | ~5.672            | ~1.702            | ~70%    |
| p3.2xlarge       | ~3.060            | ~0.918            | ~70%    |
+------------------+-------------------+-------------------+---------+
 (Prices vary by region and time)
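Capturing these savings safely depends on Karpenter's native interruption handling, which cordons and drains a node when EC2 sends a Spot interruption or rebalance notice. It is enabled by pointing Karpenter at an SQS interruption queue (usually created by the eksctl or Terraform module that installs Karpenter); the cluster and queue names below are placeholders:

```yaml
# Karpenter Helm chart values fragment (names are placeholders)
settings:
  clusterName: my-cluster
  interruptionQueue: my-cluster-karpenter-interruption-queue
```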

Spot GPU NodePool for Inference

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-spot
        workload: inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']

        # Diverse GPU types for inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Various sizes for Spot availability
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge']

        # Multiple AZs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '40'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

  weight: 70

Spot Interruption Mitigation

# do-not-disrupt blocks Karpenter's voluntary disruption (consolidation,
# drift); it cannot stop an actual Spot reclaim, so also keep a grace
# period long enough to save a checkpoint
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  annotations:
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: training
      image: my-training-image:latest
      resources:
        requests:
          nvidia.com/gpu: '1'
          cpu: '4'
          memory: 16Gi
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  terminationGracePeriodSeconds: 120

5. NVIDIA GPU Operator Integration

GPU Operator Overview

+----------------------------------------------------------------+
|                      NVIDIA GPU Operator                       |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | NVIDIA Driver    |  | Container Toolkit |  | Device Plugin| |
|  | (Auto Install)   |  | (Auto Config)     |  | (Auto Deploy)| |
|  +------------------+  +-------------------+  +--------------+ |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | GPU Feature      |  | DCGM Exporter     |  | MIG Manager  | |
|  | Discovery        |  | (Metrics)         |  | (MIG Mgmt)   | |
|  +------------------+  +-------------------+  +--------------+ |
+----------------------------------------------------------------+

Installing GPU Operator

# Add NVIDIA GPU Operator Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false \
  --set gfd.enabled=true

Verifying GPU Operator with Karpenter

# Check NVIDIA labels on GPU nodes
kubectl get nodes -l node-type=gpu -o json | \
  jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia")))'

# Verify GPU resources
kubectl describe node gpu-node-name | grep -A 5 "nvidia.com/gpu"

# Check DCGM Exporter pods
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

GPU Workload Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference-server
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc

6. Multi-Architecture Support (x86 + ARM/Graviton)

Multi-Architecture NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    spec:
      requirements:
        # Allow both x86 and ARM
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64', 'arm64']

        # Include Graviton instances
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']

        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['5']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default

  limits:
    cpu: '1000'
    memory: 2000Gi

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
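With both architectures allowed, a workload built as a multi-arch image can express a soft preference for Graviton (typically cheaper per vCPU) while still falling back to x86 when ARM capacity is unavailable. A sketch; the image is assumed to be published as a multi-arch manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-arch-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: multi-arch-app
  template:
    metadata:
      labels:
        app: multi-arch-app
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values: ['arm64'] # prefer Graviton when available
      containers:
        - name: app
          image: my-multi-arch-image:latest # placeholder multi-arch image
          resources:
            requests:
              cpu: '1'
              memory: 2Gi
```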

Graviton GPU Alternative: Inferentia/Trainium

# AWS Inferentia inference-only NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inferentia
spec:
  template:
    metadata:
      labels:
        accelerator: inferentia
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

      taints:
        - key: aws.amazon.com/neuron
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: inferentia-nodes

  limits:
    aws.amazon.com/neuron: '32'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m

7. Cost Optimization Strategies

Mixed Spot and On-Demand Strategy

+-------------------------------------------------------------+
|           Cost Optimization Decision Tree                    |
+-------------------------------------------------------------+
|                                                             |
|  Identify workload type                                     |
|      |                                                      |
|      +-- Inference (Stateless) --> Spot first + OD fallback |
|      |                                                      |
|      +-- Fine-tuning (Short) --> Spot + checkpoint strategy |
|      |                                                      |
|      +-- Large Training (Long) --> On-Demand + Reserved     |
|      |                                                      |
|      +-- Batch Processing --> Spot only                     |
|                                                             |
+-------------------------------------------------------------+

Weighted Priority Instance Family Strategy

# Tier 1: G5 Spot (most cost-effective for inference)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier1-g5-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 100
  limits:
    nvidia.com/gpu: '20'
---
# Tier 2: G4dn Spot (fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier2-g4dn-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 50
  limits:
    nvidia.com/gpu: '20'
---
# Tier 3: G5 On-Demand (last resort)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier3-g5-ondemand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 10
  limits:
    nvidia.com/gpu: '10'

Consolidation Policy Optimization

# GPU node consolidation settings
disruption:
  # Use WhenEmpty only for GPU (protect running GPU jobs)
  consolidationPolicy: WhenEmpty
  # Wait 5 minutes after detecting empty node (handle temporary inactivity)
  consolidateAfter: 5m
  budgets:
    # Disrupt at most 1 node at a time
    - nodes: '1'
    # Block disruption during business hours
    - nodes: '0'
      schedule: '0 9 * * MON-FRI'
      duration: 10h

8. Node Disruption Budgets for GPU Workloads

Disruption Budget for GPU Workloads

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training-protected
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      # Block disruption completely during training hours
      - nodes: '0'
        schedule: '0 0 * * *'
        duration: 23h

      # Maintenance window (1 hour daily)
      - nodes: '1'
        schedule: '0 23 * * *'
        duration: 1h

      # Manage drift-related disruption separately
      - nodes: '1'
        reasons:
          - 'Drifted'

Pod-Level Protection

# Long-running training pod: prevent Karpenter disruption
apiVersion: v1
kind: Pod
metadata:
  name: long-training-job
  annotations:
    # This annotation prevents Karpenter voluntary disruption
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: trainer
      image: my-training-image:v1
      resources:
        requests:
          nvidia.com/gpu: '4'
          cpu: '16'
          memory: 64Gi
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # Sufficient grace period for checkpoint saving
  terminationGracePeriodSeconds: 300

PDB (Pod Disruption Budget) Configuration

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-server-pdb
  namespace: ml-serving
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-server

9. Monitoring with Prometheus and Grafana

Karpenter Metrics Collection

# Karpenter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics

Key Karpenter Metrics

+-----------------------------------------------+------------------------------------------+
| Metric                                        | Description                              |
+-----------------------------------------------+------------------------------------------+
| karpenter_nodeclaims_launched_total           | Total NodeClaims launched                |
| karpenter_nodeclaims_registered_total         | Total NodeClaims registered              |
| karpenter_nodeclaims_terminated_total         | Total NodeClaims terminated              |
| karpenter_pods_state                          | Pod state (node, namespace, etc.)        |
| karpenter_nodepool_usage                      | Resource usage per NodePool              |
| karpenter_nodepool_limit                      | Resource limits per NodePool             |
| karpenter_voluntary_disruption_eligible_nodes | Nodes eligible for voluntary disruption  |
| karpenter_disruption_actions_performed_total  | Total disruption actions performed       |
| karpenter_nodes_allocatable                   | Allocatable resources per node           |
| karpenter_nodes_total_daemon_requests         | Total daemon set resource requests       |
+-----------------------------------------------+------------------------------------------+

GPU-Specific Grafana Dashboard Queries

# Track GPU node count
count(karpenter_nodes_allocatable{resource_type="nvidia.com/gpu"} > 0)

# GPU utilization (requires DCGM Exporter)
DCGM_FI_DEV_GPU_UTIL

# GPU usage vs limits per NodePool
karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}
  /
karpenter_nodepool_limit{resource_type="nvidia.com/gpu"}

# Provisioning latency
histogram_quantile(0.99,
  rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
)

DCGM Exporter Metrics

# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

Alert Rules Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: karpenter-gpu
      rules:
        # GPU NodePool reaching 90% of limit
        - alert: GPUNodePoolNearLimit
          expr: |
            karpenter_nodepool_usage{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            /
            karpenter_nodepool_limit{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU NodePool approaching resource limit'

        # Low GPU utilization detected
        - alert: LowGPUUtilization
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 1h
          labels:
            severity: info
          annotations:
            summary: 'GPU utilization below 10 percent for 1 hour'

        # Karpenter provisioning failure
        - alert: KarpenterProvisioningFailed
          expr: |
            increase(karpenter_nodeclaims_terminated_total{reason="ProvisioningFailed"}[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: 'Karpenter failed to provision GPU node'

10. Real-World Example: Training Cluster

Distributed Training Cluster Configuration

# PyTorch distributed training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  namespace: ml-training
spec:
  parallelism: 4
  completions: 4
  completionMode: Indexed # required for the job-completion-index annotation
  template:
    metadata:
      labels:
        app: distributed-training
      annotations:
        karpenter.sh/do-not-disrupt: 'true'
    spec:
      containers:
        - name: pytorch-trainer
          image: my-pytorch-training:v1
          command: ['torchrun']
          args:
            - '--nproc_per_node=1'
            - '--nnodes=4'
            - '--node_rank=$(JOB_COMPLETION_INDEX)'
            - '--master_addr=training-master'
            - '--master_port=29500'
            - 'train.py'
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: '8'
              memory: 32Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-training
      restartPolicy: OnFailure
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-pvc

11. Real-World Example: Inference Cluster

Auto-Scaling Inference Service

# Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Llama-3-8B'
            - '--tensor-parallel-size'
            - '1'
            - '--gpu-memory-utilization'
            - '0.9'
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
---
# HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '70'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

12. Troubleshooting Guide

Common GPU Node Issues

# 1. GPU resources not showing on node
kubectl describe node gpu-node | grep -A 10 "Allocatable"
# If nvidia.com/gpu is missing, check GPU Operator

# 2. Check GPU Operator pod status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. Check Karpenter provisioning logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter \
  | grep -i "gpu\|nvidia\|instance-type"

# 4. Check NodeClaim status
kubectl get nodeclaims -o wide

# 5. Analyze pending pod causes
kubectl describe pod gpu-pod-name | grep -A 20 "Events"

Common Issues and Solutions

+---------------------------------------------+------------------------------------------+
| Issue                                       | Solution                                 |
+---------------------------------------------+------------------------------------------+
| GPU resources not showing on node           | Reinstall GPU Operator or verify drivers |
| Cannot find Spot GPU instances              | Add more GPU instance types and AZs      |
| GPU node provisioning timeout               | Check EC2NodeClass subnet/SG tags        |
| Node disrupted during training              | Add do-not-disrupt annotation            |
| GPU out of memory (OOM)                     | Allow larger GPU instance types          |
| Idle GPU nodes persisting                   | Review consolidation policy and          |
|                                             | consolidateAfter value                   |
| Only specific GPU type being provisioned    | Expand NodePool requirements range       |
+---------------------------------------------+------------------------------------------+

GPU Memory Debugging

# Check GPU status directly on node (using debug pod)
kubectl run gpu-debug --rm -it \
  --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
  --overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"nodeSelector":{"node-type":"gpu"}}}' \
  --restart=Never \
  -- nvidia-smi

13. Best Practices Summary

GPU Node Management Checklist

+---+------------------------------------------------------------+
| # | Best Practice                                              |
+---+------------------------------------------------------------+
| 1 | Separate NodePools for inference and training workloads    |
| 2 | Set GPU taints to prevent non-GPU workload scheduling      |
| 3 | Use Spot for inference, On-Demand for training             |
| 4 | Apply do-not-disrupt annotation for long training jobs     |
| 5 | Implement checkpoint strategy to protect training progress |
| 6 | Automate driver management with GPU Operator               |
| 7 | Collect GPU metrics with DCGM Exporter                     |
| 8 | Set GPU cost caps with NodePool limits                     |
| 9 | Allow multiple GPU instance types for availability         |
| 10| Guarantee minimum availability with PDB for inference      |
| 11| Block disruption during training with Disruption Budgets   |
| 12| Combine HPA with Karpenter for auto-scaling                |
+---+------------------------------------------------------------+

Cost Optimization Strategy Summary

Strategy 1: Tiered NodePools
  - Spot GPU (high weight) -> On-Demand GPU (low weight)
  - Optimal for inference workloads

Strategy 2: Instance Diversification
  - Allow multiple GPU families (g4dn, g5, g6)
  - Allow multiple instance sizes
  - Maximize Spot availability

Strategy 3: Auto Scale-Down
  - WhenEmpty consolidation to remove idle GPU nodes immediately
  - Short consolidateAfter for inference
  - Longer wait time for training nodes

Strategy 4: Appropriate Resource Limits
  - Limit max GPUs with NodePool limits
  - Prevent unexpected cost spikes
  - Manage quotas per team/project

Final Architecture Diagram: Karpenter + GPU

+---------------------------------------------------------------------+
|                        EKS Cluster                                  |
|                                                                     |
|  +-------------------+  +-------------------+  +-----------------+  |
|  | NodePool:         |  | NodePool:         |  | NodePool:       |  |
|  | gpu-inference     |  | gpu-training      |  | multi-arch      |  |
|  | (Spot, weight:60) |  | (OD, weight:90)   |  | (Mixed, w:50)   |  |
|  +--------+----------+  +--------+----------+  +--------+--------+  |
|           |                      |                      |           |
|  +--------v----------+  +--------v----------+  +--------v--------+  |
|  | EC2NodeClass:     |  | EC2NodeClass:     |  | EC2NodeClass:   |  |
|  | gpu-optimized     |  | gpu-training      |  | default         |  |
|  | (200GB, gp3)      |  | (500GB, gp3)      |  | (100GB, gp3)    |  |
|  +-------------------+  +-------------------+  +-----------------+  |
|                                                                     |
|  +-------------------+  +-------------------+                       |
|  | GPU Operator      |  | Prometheus +      |                       |
|  | (NVIDIA Driver,   |  | Grafana           |                       |
|  |  Device Plugin,   |  | (Karpenter +      |                       |
|  |  DCGM Exporter)   |  |  DCGM Metrics)    |                       |
|  +-------------------+  +-------------------+                       |
+---------------------------------------------------------------------+