[AWS] Managing GPU Nodes with Karpenter: AI/ML Workload Optimization
Table of Contents

1. GPU Node Provisioning with Karpenter

The Unique Nature of GPU Workloads

AI/ML workloads have distinct requirements compared to general computing:

+---------------------------------------------------------------+
|               GPU Workload Characteristics                    |
+---------------------------------------------------------------+
| - Expensive GPU instances (dollars to tens of dollars/hour)   |
| - Long-running training jobs (hours to days)                  |
| - Low-latency requirements for inference                      |
| - GPU memory (VRAM) as the primary resource constraint        |
| - Significant performance differences across instance types   |
| - Risk of losing training progress on Spot interruption       |
+---------------------------------------------------------------+

Why Karpenter Excels for GPU Management

+------------------------------------------+
|    Traditional (Cluster Autoscaler)      |
|                                          |
|  GPU Node Group A: p3.2xlarge            |
|  GPU Node Group B: g5.xlarge             |
|  GPU Node Group C: g5.2xlarge            |
|  GPU Node Group D: p4d.24xlarge          |
|  ...                                     |
|  Manage each Node Group separately       |
|  (inefficient)                           |
+------------------------------------------+

+------------------------------------------+
|         Karpenter Approach               |
|                                          |
|  Single GPU NodePool:                    |
|  - Analyze pod requirements              |
|  - Auto-select optimal GPU instance      |
|  - Automatic Spot/On-Demand switching    |
|  - Cost-based instance optimization      |
+------------------------------------------+

2. GPU NodePool Configuration

General-Purpose GPU NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-general
spec:
  template:
    metadata:
      labels:
        node-type: gpu
        workload: ai-ml
    spec:
      requirements:
        # Select only GPU instances
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

        # GPU instance families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g', 'p']

        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

        # Availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

        # x86 architecture only
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']

      # GPU-only taint
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

      # Give GPU nodes a longer expiry
      expireAfter: 336h # 14 days

  limits:
    cpu: '500'
    memory: 2000Gi
    nvidia.com/gpu: '100'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
    budgets:
      - nodes: '1'

  weight: 80

AWS GPU Instance Type Guide

+------------------+------+--------+------------------+------------------------------+
| Instance Type    | GPU  | # GPUs | GPU Memory       | Primary Use                  |
+------------------+------+--------+------------------+------------------------------+
| g4dn.xlarge      | T4   | 1      | 16 GB            | Inference, light training    |
| g4dn.12xlarge    | T4   | 4      | 64 GB            | Multi-model inference        |
| g5.xlarge        | A10G | 1      | 24 GB            | Inference, fine-tuning       |
| g5.12xlarge      | A10G | 4      | 96 GB            | Mid-size training            |
| g5.48xlarge      | A10G | 8      | 192 GB           | Large training               |
| g6.xlarge        | L4   | 1      | 24 GB            | Inference-optimized          |
| g6.12xlarge      | L4   | 4      | 96 GB            | Multimodal inference         |
| p3.2xlarge       | V100 | 1      | 16 GB            | General training             |
| p3.8xlarge       | V100 | 4      | 64 GB            | Large-scale training         |
| p4d.24xlarge     | A100 | 8      | 320 GB (40GB x8) | Very large-scale training    |
| p5.48xlarge      | H100 | 8      | 640 GB (80GB x8) | Maximum-performance training |
+------------------+------+--------+------------------+------------------------------+
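As a quick sizing aid, the table can be encoded and queried in a few lines. A minimal sketch (instance data transcribed from the table above; the helper and its selection rule are illustrative, not an AWS API — verify specs against current EC2 documentation):

```python
# Illustrative: pick GPU instance types from the table above that satisfy
# a per-GPU VRAM floor and a minimum GPU count. Data transcribed from the
# table; prices and regional availability are intentionally omitted.
GPU_INSTANCES = {
    "g4dn.xlarge":  {"gpu": "T4",   "count": 1, "vram_per_gpu_gb": 16},
    "g5.xlarge":    {"gpu": "A10G", "count": 1, "vram_per_gpu_gb": 24},
    "g5.12xlarge":  {"gpu": "A10G", "count": 4, "vram_per_gpu_gb": 24},
    "g6.xlarge":    {"gpu": "L4",   "count": 1, "vram_per_gpu_gb": 24},
    "p3.2xlarge":   {"gpu": "V100", "count": 1, "vram_per_gpu_gb": 16},
    "p4d.24xlarge": {"gpu": "A100", "count": 8, "vram_per_gpu_gb": 40},
    "p5.48xlarge":  {"gpu": "H100", "count": 8, "vram_per_gpu_gb": 80},
}

def candidates(min_vram_gb: int, min_gpus: int = 1) -> list[str]:
    """Return instance types meeting a per-GPU VRAM and GPU-count floor."""
    return sorted(
        name for name, spec in GPU_INSTANCES.items()
        if spec["vram_per_gpu_gb"] >= min_vram_gb and spec["count"] >= min_gpus
    )

print(candidates(24))               # single-GPU fine-tuning needing 24 GB VRAM
print(candidates(40, min_gpus=8))   # multi-GPU large-model training
```

The same idea scales to richer filters (architecture, price caps) if you feed it live data from the EC2 API.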

Inference-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: inference
    spec:
      requirements:
        # Instances well-suited to inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Prefer Spot instances (inference is stateless)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']

        # Constrain instance sizes
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '50'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m

  weight: 60

Training-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training
spec:
  template:
    metadata:
      labels:
        node-type: gpu-training
        workload: training
    spec:
      requirements:
        # High-performance GPUs suited to training
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a100', 'h100', 'a10g']

        # On-Demand only (interrupting training is costly)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

        # Any GPU-bearing instance
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

      # Give training nodes a long expiry
      expireAfter: 720h # 30 days

  limits:
    nvidia.com/gpu: '32'

  disruption:
    # Protect running training jobs: consolidate empty nodes only
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      - nodes: '0'

  weight: 90

3. GPU-Specific EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-optimized
spec:
  # AMI alias (Karpenter resolves the GPU-enabled AL2023 variant for GPU instance types)
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        network-type: private

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large disk for GPU workloads
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 6000
        throughput: 250
        encrypted: true
        deleteOnTermination: true

  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required

  tags:
    Environment: production
    NodeType: gpu
    ManagedBy: karpenter

  # User data for GPU node bootstrap
  userData: |
    #!/bin/bash
    echo "GPU node bootstrap"
    # NVIDIA drivers are handled by the GPU Operator

Training EC2NodeClass (Large Storage)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-training
spec:
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large, high-performance storage for training data
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
        deleteOnTermination: true

  tags:
    Environment: production
    NodeType: gpu-training
    ManagedBy: karpenter

4. Spot GPU Instance Strategy

Cost Savings from Spot GPUs

+------------------+-------------------+-------------------+----------+
| Instance Type    | On-Demand ($/hr)  | Est. Spot ($/hr)  | Savings  |
+------------------+-------------------+-------------------+----------+
| g4dn.xlarge      | ~0.526            | ~0.158            | ~70%     |
| g5.xlarge        | ~1.006            | ~0.302            | ~70%     |
| g5.2xlarge       | ~1.212            | ~0.364            | ~70%     |
| g5.12xlarge      | ~5.672            | ~1.702            | ~70%     |
| p3.2xlarge       | ~3.060            | ~0.918            | ~70%     |
+------------------+-------------------+-------------------+----------+
 (Prices vary by region and over time)
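For a sense of scale, the g5.xlarge row works out as follows — a back-of-the-envelope sketch; the ~730 hours/month figure is a convention and the prices are the illustrative samples from the table:

```python
# Rough monthly Spot savings for a single g5.xlarge, using the sample
# prices from the table above (illustrative; real prices vary).
HOURS_PER_MONTH = 730  # ~ 24 * 365 / 12

def monthly_cost(hourly_usd: float) -> float:
    """Hourly price extrapolated to a full month of continuous use."""
    return hourly_usd * HOURS_PER_MONTH

on_demand = monthly_cost(1.006)
spot = monthly_cost(0.302)
savings = 1 - spot / on_demand  # fraction saved by running on Spot

print(f"On-Demand ${on_demand:.0f}/mo vs Spot ${spot:.0f}/mo ({savings:.0%} saved)")
```

Multiply by replica count and the gap becomes the main argument for Spot-first inference fleets.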

Spot GPU NodePool for Inference

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-spot
        workload: inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']

        # A range of GPU types suited to inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Multiple sizes to improve Spot availability
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge']

        # Spread across multiple AZs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '40'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

  weight: 70

Preparing for Spot Interruptions

# Apply the do-not-disrupt annotation to long-running training Pods
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  annotations:
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: training
      image: my-training-image:latest
      resources:
        requests:
          nvidia.com/gpu: '1'
          cpu: '4'
          memory: 16Gi
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  terminationGracePeriodSeconds: 120

5. Integrating the NVIDIA GPU Operator

GPU Operator Overview

+----------------------------------------------------------------+
|                    NVIDIA GPU Operator                         |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | NVIDIA Driver    |  | Container Toolkit |  | Device Plugin| |
|  | (auto-install)   |  | (auto-configure)  |  | (auto-deploy)| |
|  +------------------+  +-------------------+  +--------------+ |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | GPU Feature      |  | DCGM Exporter     |  | MIG Manager  | |
|  | Discovery        |  | (metrics)         |  | (MIG mgmt)   | |
|  +------------------+  +-------------------+  +--------------+ |
+----------------------------------------------------------------+

Installing the GPU Operator

# Add the NVIDIA GPU Operator Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false \
  --set gfd.enabled=true

Verifying the GPU Operator and Karpenter Integration

# Inspect NVIDIA labels on GPU nodes
kubectl get nodes -l node-type=gpu -o json | \
  jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia")))'

# Check GPU resources on a node
kubectl describe node gpu-node-name | grep -A 5 "nvidia.com/gpu"

# Check the DCGM Exporter Pods
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

GPU Workload Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference-server
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc

6. Multi-Architecture Support (x86 + ARM/Graviton)

Multi-Architecture NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    spec:
      requirements:
        # Allow both x86 and ARM
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64', 'arm64']

        # Include Graviton instances
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']

        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['5']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default

  limits:
    cpu: '1000'
    memory: 2000Gi

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

Beyond GPUs: AWS Inferentia/Trainium

# NodePool dedicated to AWS Inferentia inference
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inferentia
spec:
  template:
    metadata:
      labels:
        accelerator: inferentia
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

      taints:
        - key: aws.amazon.com/neuron
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: inferentia-nodes

  limits:
    aws.amazon.com/neuron: '32'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m

7. Cost Optimization Strategies

Mixing Spot and On-Demand

+-------------------------------------------------------------+
|              Cost Optimization Decision Tree                |
+-------------------------------------------------------------+
|                                                             |
|  Identify the workload type                                 |
|      |                                                      |
|      +-- Inference (stateless) --> Spot + On-Demand fallback|
|      |                                                      |
|      +-- Fine-tuning (short) --> Spot + checkpointing       |
|      |                                                      |
|      +-- Large training (long) --> On-Demand + RIs          |
|      |                                                      |
|      +-- Batch processing --> Spot only                     |
|                                                             |
+-------------------------------------------------------------+
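The decision tree above can be captured as a small lookup, which is handy when tooling (admission webhooks, CI checks) needs to validate workload placement. A sketch of the mapping only; the names and tactics strings are illustrative:

```python
# Encode the decision tree above as a lookup from workload type to a
# (capacity strategy, protection tactic) pair. Purely illustrative;
# tune to your own SLAs and budgets.
STRATEGY = {
    "inference":      ("spot-first, on-demand fallback", "multiple GPU types/sizes"),
    "fine-tuning":    ("spot", "frequent checkpointing"),
    "large-training": ("on-demand / reserved", "do-not-disrupt + disruption budgets"),
    "batch":          ("spot only", "retry on interruption"),
}

def pick_strategy(workload: str) -> tuple[str, str]:
    """Return (capacity strategy, protection tactic) for a workload type."""
    try:
        return STRATEGY[workload]
    except KeyError:
        raise ValueError(f"unknown workload type: {workload}") from None

capacity, tactic = pick_strategy("inference")
print(capacity, "|", tactic)
```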

Instance Family Strategy with Weighted Priorities

# Priority 1: G5 Spot (most cost-effective for inference)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier1-g5-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 100
  limits:
    nvidia.com/gpu: '20'
---
# Priority 2: G4dn Spot (fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier2-g4dn-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 50
  limits:
    nvidia.com/gpu: '20'
---
# Priority 3: G5 On-Demand (last-resort fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier3-g5-ondemand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 10
  limits:
    nvidia.com/gpu: '10'

Optimizing the Consolidation Policy

# Consolidation settings for GPU nodes
disruption:
  # Use WhenEmpty only for GPU nodes (protects running GPU jobs)
  consolidationPolicy: WhenEmpty
  # Wait 5 minutes after a node becomes empty (tolerates brief idle gaps)
  consolidateAfter: 5m
  budgets:
    # Disrupt at most 1 node at a time
    - nodes: '1'
    # Block disruptions during business hours
    - nodes: '0'
      schedule: '0 9 * * MON-FRI'
      duration: 10h

8. Node Disruption Budgets

Disruption Budgets for GPU Workloads

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training-protected
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      # Fully block disruptions during training hours
      - nodes: '0'
        schedule: '0 0 * * *'
        duration: 23h

      # Maintenance window (1 hour daily)
      - nodes: '1'
        schedule: '0 23 * * *'
        duration: 1h

      # Handle drift-driven disruptions separately
      - nodes: '1'
        reasons:
          - 'Drifted'

Pod-Level Protection

# Long-running training Pod: prevent voluntary disruption by Karpenter
apiVersion: v1
kind: Pod
metadata:
  name: long-training-job
  annotations:
    # This annotation blocks Karpenter's voluntary disruptions
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: trainer
      image: my-training-image:v1
      resources:
        requests:
          nvidia.com/gpu: '4'
          cpu: '16'
          memory: 64Gi
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # Generous termination grace period (time to save a checkpoint)
  terminationGracePeriodSeconds: 300

Pod Disruption Budget (PDB) Configuration

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-server-pdb
  namespace: ml-serving
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-server

9. Monitoring with Prometheus/Grafana

Configuring Karpenter Metric Collection

# Karpenter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics

Key Karpenter Metrics

+-----------------------------------------------+-----------------------------------------+
| Metric                                        | Description                             |
+-----------------------------------------------+-----------------------------------------+
| karpenter_nodeclaims_launched_total           | Total NodeClaims launched               |
| karpenter_nodeclaims_registered_total         | Total NodeClaims registered             |
| karpenter_nodeclaims_terminated_total         | Total NodeClaims terminated             |
| karpenter_pods_state                          | Pod state (node, namespace, etc.)       |
| karpenter_nodepool_usage                      | Resource usage per NodePool             |
| karpenter_nodepool_limit                      | Resource limits per NodePool            |
| karpenter_voluntary_disruption_eligible_nodes | Nodes eligible for voluntary disruption |
| karpenter_disruption_actions_performed_total  | Disruption actions performed            |
| karpenter_nodes_allocatable                   | Allocatable resources per node          |
| karpenter_nodes_total_daemon_requests         | Total DaemonSet resource requests       |
+-----------------------------------------------+-----------------------------------------+
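These metrics are also reachable programmatically over the Prometheus HTTP API, which is useful for cost reports or capacity scripts. A sketch that builds the instant-query URL and extracts per-NodePool values from a response; the base URL is an assumption — point it at your own Prometheus service:

```python
# Sketch: query Karpenter metrics over the Prometheus HTTP API.
from urllib.parse import urlencode

def query_url(base: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

def gpu_usage_by_nodepool(response: dict) -> dict[str, float]:
    """Map nodepool label -> sample value from an instant-query response."""
    return {
        r["metric"].get("nodepool", "<none>"): float(r["value"][1])
        for r in response["data"]["result"]
    }

url = query_url(
    "http://prometheus.monitoring:9090",  # assumed in-cluster address
    'karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}',
)
# Inside the cluster: json.load(urllib.request.urlopen(url))
```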

GPU-Focused Grafana Dashboard Queries

# Track the number of GPU nodes
count(karpenter_nodes_allocatable{resource_type="nvidia.com/gpu"} > 0)

# GPU utilization (requires the DCGM Exporter)
DCGM_FI_DEV_GPU_UTIL

# GPU usage vs. limit per NodePool
karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}
  /
karpenter_nodepool_limit{resource_type="nvidia.com/gpu"}

# Provisioning latency
histogram_quantile(0.99,
  rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
)

DCGM Exporter Metrics

# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

Example Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: karpenter-gpu
      rules:
        # GPU NodePool at 90% of its limit
        - alert: GPUNodePoolNearLimit
          expr: |
            karpenter_nodepool_usage{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            /
            karpenter_nodepool_limit{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU NodePool approaching resource limit'

        # Detect nodes with low GPU utilization
        - alert: LowGPUUtilization
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 1h
          labels:
            severity: info
          annotations:
            summary: 'GPU utilization below 10 percent for 1 hour'

        # Karpenter provisioning failures
        - alert: KarpenterProvisioningFailed
          expr: |
            increase(karpenter_nodeclaims_terminated_total{reason="ProvisioningFailed"}[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: 'Karpenter failed to provision GPU node'

10. Hands-On Example: Training Cluster

Distributed Training Cluster Setup

# PyTorch distributed training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  namespace: ml-training
spec:
  completionMode: Indexed # required for the batch.kubernetes.io/job-completion-index annotation
  parallelism: 4
  completions: 4
  template:
    metadata:
      labels:
        app: distributed-training
      annotations:
        karpenter.sh/do-not-disrupt: 'true'
    spec:
      containers:
        - name: pytorch-trainer
          image: my-pytorch-training:v1
          command: ['torchrun']
          args:
            - '--nproc_per_node=1'
            - '--nnodes=4'
            - '--node_rank=$(JOB_COMPLETION_INDEX)'
            - '--master_addr=training-master'
            - '--master_port=29500'
            - 'train.py'
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: '8'
              memory: 32Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-training
      restartPolicy: OnFailure
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-pvc

11. Hands-On Example: Inference Cluster

Autoscaling Inference Service

# Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Llama-3-8B'
            - '--tensor-parallel-size'
            - '1'
            - '--gpu-memory-utilization'
            - '0.9'
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
---
# HPA configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '70'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

12. Troubleshooting Guide

Common GPU Node Problems

# 1. GPU resources not showing on a node
kubectl describe node gpu-node | grep -A 10 "Allocatable"
# If nvidia.com/gpu is missing, check the GPU Operator

# 2. Check GPU Operator Pod status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. Check Karpenter provisioning logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter \
  | grep -i "gpu\|nvidia\|instance-type"

# 4. Check NodeClaim status
kubectl get nodeclaims -o wide

# 5. Analyze why a Pod is Pending
kubectl describe pod gpu-pod-name | grep -A 20 "Events"

Frequent Problems and Fixes

+---------------------------------------------+------------------------------------------+
| Problem                                     | Fix                                      |
+---------------------------------------------+------------------------------------------+
| GPU resources not shown on the node         | Reinstall GPU Operator or check drivers  |
| No Spot GPU capacity found                  | Add more GPU instance types and AZs      |
| GPU node provisioning times out             | Check EC2NodeClass subnet/SG tags        |
| Node disrupted mid-training                 | Add the do-not-disrupt annotation        |
| GPU out of memory (OOM)                     | Allow larger GPU instance types          |
| Idle GPU nodes linger                       | Check consolidation policy and           |
|                                             | consolidateAfter value                   |
| Only one GPU type gets provisioned          | Broaden NodePool requirements            |
+---------------------------------------------+------------------------------------------+

GPU Status Check Command

# Check GPU state directly on a node (via a debug Pod)
kubectl run gpu-debug --rm -it \
  --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
  --overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"nodeSelector":{"node-type":"gpu"}}}' \
  --restart=Never \
  -- nvidia-smi

13. Best Practices Summary

GPU Node Management Checklist

+----+------------------------------------------------------------+
| #  | Best Practice                                              |
+----+------------------------------------------------------------+
| 1  | Separate NodePools for inference and training workloads    |
| 2  | Taint GPU nodes to keep non-GPU workloads off them         |
| 3  | Use Spot for inference, On-Demand for training             |
| 4  | Add do-not-disrupt to long-running training Pods           |
| 5  | Protect training progress with checkpointing               |
| 6  | Automate driver management with the GPU Operator           |
| 7  | Collect GPU metrics with the DCGM Exporter                 |
| 8  | Cap GPU spend with NodePool limits                         |
| 9  | Allow multiple GPU instance types for availability         |
| 10 | Guarantee minimum inference availability with PDBs         |
| 11 | Block disruptions during training with disruption budgets  |
| 12 | Pair HPA with Karpenter for end-to-end autoscaling         |
+----+------------------------------------------------------------+

Cost Optimization Strategy Summary

Strategy 1: Tiered NodePools
  - Spot GPU (high weight) -> On-Demand GPU (low weight)
  - Best suited to inference workloads

Strategy 2: Instance diversification
  - Allow multiple GPU families (g4dn, g5, g6)
  - Allow multiple instance sizes
  - Maximize Spot availability

Strategy 3: Automatic scale-down
  - Remove empty GPU nodes promptly with WhenEmpty consolidation
  - Keep consolidateAfter short for inference
  - Use longer waits for training nodes

Strategy 4: Sensible resource limits
  - Cap the maximum GPU count with NodePool limits
  - Prevent unexpected cost blowouts
  - Manage quotas per team/project

Final Karpenter + GPU Architecture Diagram

+---------------------------------------------------------------------+
|                        EKS Cluster                                  |
|                                                                     |
|  +-------------------+  +-------------------+  +-----------------+  |
|  | NodePool:         |  | NodePool:         |  | NodePool:       |  |
|  | gpu-inference     |  | gpu-training      |  | multi-arch      |  |
|  | (Spot, weight:60) |  | (OD, weight:90)   |  | (Mixed, w:50)   |  |
|  +--------+----------+  +--------+----------+  +--------+--------+  |
|           |                      |                      |           |
|  +--------v----------+  +--------v----------+  +--------v--------+  |
|  | EC2NodeClass:     |  | EC2NodeClass:     |  | EC2NodeClass:   |  |
|  | gpu-optimized     |  | gpu-training      |  | default         |  |
|  | (200GB, gp3)      |  | (500GB, gp3)      |  | (100GB, gp3)    |  |
|  +-------------------+  +-------------------+  +-----------------+  |
|                                                                     |
|  +-------------------+  +-------------------+                       |
|  | GPU Operator      |  | Prometheus +      |                       |
|  | (NVIDIA Driver,   |  | Grafana           |                       |
|  |  Device Plugin,   |  | (Karpenter +      |                       |
|  |  DCGM Exporter)   |  |  DCGM Metrics)    |                       |
|  +-------------------+  +-------------------+                       |
+---------------------------------------------------------------------+

[AWS] Managing GPU Nodes with Karpenter: AI/ML Workload Optimization

Table of Contents

1. GPU Node Provisioning with Karpenter

The Unique Nature of GPU Workloads

AI/ML workloads have distinct requirements compared to general computing:

+---------------------------------------------------------------+
|               GPU Workload Characteristics                    |
+---------------------------------------------------------------+
| - Expensive GPU instances (dollars to tens of dollars/hour)   |
| - Long-running training jobs (hours to days)                  |
| - Low-latency requirements for inference                      |
| - GPU memory (VRAM) as the primary resource constraint        |
| - Significant performance differences across instance types   |
| - Risk of losing training progress on Spot interruption       |
+---------------------------------------------------------------+
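Because VRAM is the binding constraint, a pod can require a minimum amount of GPU memory through Karpenter's well-known `karpenter.k8s.aws/instance-gpu-memory` label (value in MiB), and Karpenter will only launch instance types that satisfy it. A minimal sketch; the 20000 MiB threshold and image name are illustrative:

```yaml
# Pod that needs a GPU with roughly 20 GB+ of VRAM.
# karpenter.k8s.aws/instance-gpu-memory is a Karpenter well-known label (MiB);
# the threshold and image below are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: vram-sensitive
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: karpenter.k8s.aws/instance-gpu-memory
                operator: Gt
                values: ['20000']
  containers:
    - name: app
      image: my-inference-image:latest # placeholder
      resources:
        requests:
          nvidia.com/gpu: '1'
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```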

Why Karpenter Excels for GPU Management

+-------------------------------------------+
|     Traditional (Cluster Autoscaler)      |
|                                           |
|  GPU Node Group A: p3.2xlarge             |
|  GPU Node Group B: g5.xlarge              |
|  GPU Node Group C: g5.2xlarge             |
|  GPU Node Group D: p4d.24xlarge           |
|  ...                                      |
|  Manage each Node Group separately        |
|  (inefficient)                            |
+-------------------------------------------+

+-------------------------------------------+
|            Karpenter Approach             |
|                                           |
|  Single GPU NodePool:                     |
|  - Analyze pod requirements               |
|  - Auto-select optimal GPU instance       |
|  - Automatic Spot/On-Demand switching     |
|  - Cost-based instance optimization       |
+-------------------------------------------+

2. GPU NodePool Configuration

General-Purpose GPU NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-general
spec:
  template:
    metadata:
      labels:
        node-type: gpu
        workload: ai-ml
    spec:
      requirements:
        # Select only GPU instances
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

        # GPU instance families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g', 'p']

        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

        # Availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

        # x86 architecture only
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']

      # GPU-dedicated taint
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

      # Longer expiration for GPU nodes
      expireAfter: 336h # 14 days

  limits:
    cpu: '500'
    memory: 2000Gi
    nvidia.com/gpu: '100'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
    budgets:
      - nodes: '1'

  weight: 80

AWS GPU Instance Type Guide

+------------------+----------+----------+------------------+---------------------+
| Instance Type    | GPU      | Count    | GPU Memory       | Primary Use Case    |
+------------------+----------+----------+------------------+---------------------+
| g4dn.xlarge      | T4       | 1        | 16 GB            | Inference, light ML |
| g4dn.12xlarge    | T4       | 4        | 64 GB            | Multi-inference     |
| g5.xlarge        | A10G     | 1        | 24 GB            | Inference, fine-tune|
| g5.12xlarge      | A10G     | 4        | 96 GB            | Medium training     |
| g5.48xlarge      | A10G     | 8        | 192 GB           | Large training      |
| g6.xlarge        | L4       | 1        | 24 GB            | Inference optimized |
| g6.12xlarge      | L4       | 4        | 96 GB            | Multimodal inference|
| p3.2xlarge       | V100     | 1        | 16 GB            | General training    |
| p3.8xlarge       | V100     | 4        | 64 GB            | Large training      |
| p4d.24xlarge     | A100     | 8        | 320 GB (40GB x8) | Ultra-large training|
| p5.48xlarge      | H100     | 8        | 640 GB (80GB x8) | Maximum performance |
+------------------+----------+----------+------------------+---------------------+
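The table above maps onto Karpenter's well-known labels: rather than hard-coding instance types, a pod can pin a specific GPU model with `karpenter.k8s.aws/instance-gpu-name` and let Karpenter pick the cheapest matching size. A sketch; the image name is a placeholder:

```yaml
# Pod that steers Karpenter to A10G-backed instances (g5 family).
apiVersion: v1
kind: Pod
metadata:
  name: a10g-only
spec:
  nodeSelector:
    karpenter.k8s.aws/instance-gpu-name: a10g
  containers:
    - name: app
      image: my-ml-image:latest # placeholder
      resources:
        requests:
          nvidia.com/gpu: '1'
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```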

Inference-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: inference
    spec:
      requirements:
        # Inference-suitable instances
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Spot instances preferred (inference is stateless)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']

        # Instance size constraint
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '50'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m

  weight: 60

Training-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training
spec:
  template:
    metadata:
      labels:
        node-type: gpu-training
        workload: training
    spec:
      requirements:
        # High-performance GPUs for training
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a100', 'h100', 'a10g']

        # On-Demand only (training interruption is costly)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

        # Require at least one GPU
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

      # Long expiration for training nodes (30 days)
      expireAfter: 720h # 30 days

  limits:
    nvidia.com/gpu: '32'

  disruption:
    # Disable consolidation during training
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      - nodes: '0'

  weight: 90

3. GPU-Optimized EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-optimized
spec:
  # EKS-optimized AL2023 alias; Karpenter resolves the accelerated
  # (NVIDIA) AMI variant for GPU instance types
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        network-type: private

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large disk for GPU workloads
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 6000
        throughput: 250
        encrypted: true
        deleteOnTermination: true

  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required

  tags:
    Environment: production
    NodeType: gpu
    ManagedBy: karpenter

  # Bootstrap script for GPU nodes
  userData: |
    #!/bin/bash
    echo "GPU node bootstrap"
    # NVIDIA drivers are handled by GPU Operator

Training-Specific EC2NodeClass (Large Storage)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-training
spec:
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large, high-performance storage for training data
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
        deleteOnTermination: true

  tags:
    Environment: production
    NodeType: gpu-training
    ManagedBy: karpenter

4. Spot GPU Instance Strategy

Cost Savings with Spot GPU

+------------------+-------------------+-------------------+---------+
| Instance Type    | On-Demand (hr)    | Spot Est. (hr)    | Savings |
+------------------+-------------------+-------------------+---------+
| g4dn.xlarge      | ~0.526            | ~0.158            | ~70%    |
| g5.xlarge        | ~1.006            | ~0.302            | ~70%    |
| g5.2xlarge       | ~1.212            | ~0.364            | ~70%    |
| g5.12xlarge      | ~5.672            | ~1.702            | ~70%    |
| p3.2xlarge       | ~3.060            | ~0.918            | ~70%    |
+------------------+-------------------+-------------------+---------+
 (Prices vary by region and time)
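Capturing these savings safely depends on Karpenter's native interruption handling, which cordons and drains a node when EC2 sends a Spot interruption or rebalance notice. It is enabled by pointing Karpenter at an SQS interruption queue (usually created by the eksctl or Terraform module that installs Karpenter); the cluster and queue names below are placeholders:

```yaml
# Karpenter Helm chart values fragment (names are placeholders)
settings:
  clusterName: my-cluster
  interruptionQueue: my-cluster-karpenter-interruption-queue
```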

Spot GPU NodePool for Inference

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-spot
        workload: inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']

        # Diverse GPU types for inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Various sizes for Spot availability
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge']

        # Multiple AZs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '40'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

  weight: 70

Spot Interruption Mitigation

# do-not-disrupt blocks Karpenter's voluntary disruption (consolidation,
# drift); it cannot stop an actual Spot reclaim, so also keep a grace
# period long enough to save a checkpoint
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  annotations:
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: training
      image: my-training-image:latest
      resources:
        requests:
          nvidia.com/gpu: '1'
          cpu: '4'
          memory: 16Gi
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  terminationGracePeriodSeconds: 120

5. NVIDIA GPU Operator Integration

GPU Operator Overview

+----------------------------------------------------------------+
|                      NVIDIA GPU Operator                       |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | NVIDIA Driver    |  | Container Toolkit |  | Device Plugin| |
|  | (Auto Install)   |  | (Auto Config)     |  | (Auto Deploy)| |
|  +------------------+  +-------------------+  +--------------+ |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | GPU Feature      |  | DCGM Exporter     |  | MIG Manager  | |
|  | Discovery        |  | (Metrics)         |  | (MIG Mgmt)   | |
|  +------------------+  +-------------------+  +--------------+ |
+----------------------------------------------------------------+

Installing GPU Operator

# Add NVIDIA GPU Operator Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false \
  --set gfd.enabled=true

Verifying GPU Operator with Karpenter

# Check NVIDIA labels on GPU nodes
kubectl get nodes -l node-type=gpu -o json | \
  jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia")))'

# Verify GPU resources
kubectl describe node gpu-node-name | grep -A 5 "nvidia.com/gpu"

# Check DCGM Exporter pods
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

GPU Workload Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference-server
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc

6. Multi-Architecture Support (x86 + ARM/Graviton)

Multi-Architecture NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    spec:
      requirements:
        # Allow both x86 and ARM
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64', 'arm64']

        # Include Graviton instances
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']

        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['5']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default

  limits:
    cpu: '1000'
    memory: 2000Gi

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
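With both architectures allowed, a workload built as a multi-arch image can express a soft preference for Graviton (typically cheaper per vCPU) while still falling back to x86 when ARM capacity is unavailable. A sketch; the image is assumed to be published as a multi-arch manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-arch-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: multi-arch-app
  template:
    metadata:
      labels:
        app: multi-arch-app
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values: ['arm64'] # prefer Graviton when available
      containers:
        - name: app
          image: my-multi-arch-image:latest # placeholder multi-arch image
          resources:
            requests:
              cpu: '1'
              memory: 2Gi
```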

Graviton GPU Alternative: Inferentia/Trainium

# AWS Inferentia inference-only NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inferentia
spec:
  template:
    metadata:
      labels:
        accelerator: inferentia
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

      taints:
        - key: aws.amazon.com/neuron
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: inferentia-nodes

  limits:
    aws.amazon.com/neuron: '32'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m

7. Cost Optimization Strategies

Mixed Spot and On-Demand Strategy

+-------------------------------------------------------------+
|           Cost Optimization Decision Tree                    |
+-------------------------------------------------------------+
|                                                             |
|  Identify workload type                                     |
|      |                                                      |
|      +-- Inference (Stateless) --> Spot first + OD fallback |
|      |                                                      |
|      +-- Fine-tuning (Short) --> Spot + checkpoint strategy |
|      |                                                      |
|      +-- Large Training (Long) --> On-Demand + Reserved     |
|      |                                                      |
|      +-- Batch Processing --> Spot only                     |
|                                                             |
+-------------------------------------------------------------+

Weighted Priority Instance Family Strategy

# Tier 1: G5 Spot (most cost-effective for inference)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier1-g5-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 100
  limits:
    nvidia.com/gpu: '20'
---
# Tier 2: G4dn Spot (fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier2-g4dn-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 50
  limits:
    nvidia.com/gpu: '20'
---
# Tier 3: G5 On-Demand (last resort)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier3-g5-ondemand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 10
  limits:
    nvidia.com/gpu: '10'

Consolidation Policy Optimization

# GPU node consolidation settings
disruption:
  # Use WhenEmpty only for GPU (protect running GPU jobs)
  consolidationPolicy: WhenEmpty
  # Wait 5 minutes after detecting empty node (handle temporary inactivity)
  consolidateAfter: 5m
  budgets:
    # Disrupt at most 1 node at a time
    - nodes: '1'
    # Block disruption during business hours
    - nodes: '0'
      schedule: '0 9 * * MON-FRI'
      duration: 10h

8. Node Disruption Budgets for GPU Workloads

Disruption Budget for GPU Workloads

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training-protected
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      # Block disruption completely during training hours
      - nodes: '0'
        schedule: '0 0 * * *'
        duration: 23h

      # Maintenance window (1 hour daily)
      - nodes: '1'
        schedule: '0 23 * * *'
        duration: 1h

      # Manage drift-related disruption separately
      - nodes: '1'
        reasons:
          - 'Drifted'

Pod-Level Protection

# Long-running training pod: prevent Karpenter disruption
apiVersion: v1
kind: Pod
metadata:
  name: long-training-job
  annotations:
    # This annotation prevents Karpenter voluntary disruption
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: trainer
      image: my-training-image:v1
      resources:
        requests:
          nvidia.com/gpu: '4'
          cpu: '16'
          memory: 64Gi
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # Sufficient grace period for checkpoint saving
  terminationGracePeriodSeconds: 300

PDB (Pod Disruption Budget) Configuration

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-server-pdb
  namespace: ml-serving
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-server

9. Monitoring with Prometheus and Grafana

Karpenter Metrics Collection

# Karpenter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics

Key Karpenter Metrics

+-----------------------------------------------+------------------------------------------+
| Metric                                        | Description                              |
+-----------------------------------------------+------------------------------------------+
| karpenter_nodeclaims_launched_total           | Total NodeClaims launched                |
| karpenter_nodeclaims_registered_total         | Total NodeClaims registered              |
| karpenter_nodeclaims_terminated_total         | Total NodeClaims terminated              |
| karpenter_pods_state                          | Pod state (node, namespace, etc.)        |
| karpenter_nodepool_usage                      | Resource usage per NodePool              |
| karpenter_nodepool_limit                      | Resource limits per NodePool             |
| karpenter_voluntary_disruption_eligible_nodes | Nodes eligible for voluntary disruption  |
| karpenter_disruption_actions_performed_total  | Total disruption actions performed       |
| karpenter_nodes_allocatable                   | Allocatable resources per node           |
| karpenter_nodes_total_daemon_requests         | Total daemon set resource requests       |
+-----------------------------------------------+------------------------------------------+

GPU-Specific Grafana Dashboard Queries

# Track GPU node count
count(karpenter_nodes_allocatable{resource_type="nvidia.com/gpu"} > 0)

# GPU utilization (requires DCGM Exporter)
DCGM_FI_DEV_GPU_UTIL

# GPU usage vs limits per NodePool
karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}
  /
karpenter_nodepool_limit{resource_type="nvidia.com/gpu"}

# Provisioning latency
histogram_quantile(0.99,
  rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
)

DCGM Exporter Metrics

# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

Alert Rules Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: karpenter-gpu
      rules:
        # GPU NodePool reaching 90% of limit
        - alert: GPUNodePoolNearLimit
          expr: |
            karpenter_nodepool_usage{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            /
            karpenter_nodepool_limit{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU NodePool approaching resource limit'

        # Low GPU utilization detected
        - alert: LowGPUUtilization
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 1h
          labels:
            severity: info
          annotations:
            summary: 'GPU utilization below 10 percent for 1 hour'

        # Karpenter provisioning failure
        - alert: KarpenterProvisioningFailed
          expr: |
            increase(karpenter_nodeclaims_terminated_total{reason="ProvisioningFailed"}[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: 'Karpenter failed to provision GPU node'

10. Real-World Example: Training Cluster

Distributed Training Cluster Configuration

# PyTorch distributed training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  namespace: ml-training
spec:
  parallelism: 4
  completions: 4
  completionMode: Indexed # required for the job-completion-index annotation
  template:
    metadata:
      labels:
        app: distributed-training
      annotations:
        karpenter.sh/do-not-disrupt: 'true'
    spec:
      containers:
        - name: pytorch-trainer
          image: my-pytorch-training:v1
          command: ['torchrun']
          args:
            - '--nproc_per_node=1'
            - '--nnodes=4'
            - '--node_rank=$(JOB_COMPLETION_INDEX)'
            - '--master_addr=training-master'
            - '--master_port=29500'
            - 'train.py'
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: '8'
              memory: 32Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-training
      restartPolicy: OnFailure
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-pvc

11. Real-World Example: Inference Cluster

Auto-Scaling Inference Service

# Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Llama-3-8B'
            - '--tensor-parallel-size'
            - '1'
            - '--gpu-memory-utilization'
            - '0.9'
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
---
# HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '70'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

12. Troubleshooting Guide

Common GPU Node Issues

# 1. GPU resources not showing on node
kubectl describe node gpu-node | grep -A 10 "Allocatable"
# If nvidia.com/gpu is missing, check GPU Operator

# 2. Check GPU Operator pod status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. Check Karpenter provisioning logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter \
  | grep -i "gpu\|nvidia\|instance-type"

# 4. Check NodeClaim status
kubectl get nodeclaims -o wide

# 5. Analyze pending pod causes
kubectl describe pod gpu-pod-name | grep -A 20 "Events"

Common Issues and Solutions

+---------------------------------------------+------------------------------------------+
| Issue                                       | Solution                                 |
+---------------------------------------------+------------------------------------------+
| GPU resources not showing on node           | Reinstall GPU Operator or verify drivers |
| Cannot find Spot GPU instances              | Add more GPU instance types and AZs      |
| GPU node provisioning timeout               | Check EC2NodeClass subnet/SG tags        |
| Node disrupted during training              | Add do-not-disrupt annotation            |
| GPU out of memory (OOM)                     | Allow larger GPU instance types          |
| Idle GPU nodes persisting                   | Review consolidation policy and          |
|                                             | consolidateAfter value                   |
| Only specific GPU type being provisioned    | Expand NodePool requirements range       |
+---------------------------------------------+------------------------------------------+

GPU Memory Debugging

# Check GPU status directly on node (using debug pod)
kubectl run gpu-debug --rm -it \
  --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
  --overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"nodeSelector":{"node-type":"gpu"}}}' \
  --restart=Never \
  -- nvidia-smi

13. Best Practices Summary

GPU Node Management Checklist

+---+------------------------------------------------------------+
| # | Best Practice                                              |
+---+------------------------------------------------------------+
| 1 | Separate NodePools for inference and training workloads    |
| 2 | Set GPU taints to prevent non-GPU workload scheduling      |
| 3 | Use Spot for inference, On-Demand for training             |
| 4 | Apply do-not-disrupt annotation for long training jobs     |
| 5 | Implement checkpoint strategy to protect training progress |
| 6 | Automate driver management with GPU Operator               |
| 7 | Collect GPU metrics with DCGM Exporter                     |
| 8 | Set GPU cost caps with NodePool limits                     |
| 9 | Allow multiple GPU instance types for availability         |
| 10| Guarantee minimum availability with PDB for inference      |
| 11| Block disruption during training with Disruption Budgets   |
| 12| Combine HPA with Karpenter for auto-scaling                |
+---+------------------------------------------------------------+

Cost Optimization Strategy Summary

Strategy 1: Tiered NodePools
  - Spot GPU (high weight) -> On-Demand GPU (low weight)
  - Optimal for inference workloads

Strategy 2: Instance Diversification
  - Allow multiple GPU families (g4dn, g5, g6)
  - Allow multiple instance sizes
  - Maximize Spot availability

Strategy 3: Auto Scale-Down
  - WhenEmpty consolidation to remove idle GPU nodes immediately
  - Short consolidateAfter for inference
  - Longer wait time for training nodes

Strategy 4: Appropriate Resource Limits
  - Limit max GPUs with NodePool limits
  - Prevent unexpected cost spikes
  - Manage quotas per team/project

Final Architecture Diagram: Karpenter + GPU

+---------------------------------------------------------------------+
|                        EKS Cluster                                  |
|                                                                     |
|  +-------------------+  +-------------------+  +-----------------+  |
|  | NodePool:         |  | NodePool:         |  | NodePool:       |  |
|  | gpu-inference     |  | gpu-training      |  | multi-arch      |  |
|  | (Spot, weight:60) |  | (OD, weight:90)   |  | (Mixed, w:50)   |  |
|  +--------+----------+  +--------+----------+  +--------+--------+  |
|           |                      |                      |           |
|  +--------v----------+  +--------v----------+  +--------v--------+  |
|  | EC2NodeClass:     |  | EC2NodeClass:     |  | EC2NodeClass:   |  |
|  | gpu-optimized     |  | gpu-training      |  | default         |  |
|  | (200GB, gp3)      |  | (500GB, gp3)      |  | (100GB, gp3)    |  |
|  +-------------------+  +-------------------+  +-----------------+  |
|                                                                     |
|  +-------------------+  +-------------------+                       |
|  | GPU Operator      |  | Prometheus +      |                       |
|  | (NVIDIA Driver,   |  | Grafana           |                       |
|  |  Device Plugin,   |  | (Karpenter +      |                       |
|  |  DCGM Exporter)   |  |  DCGM Metrics)    |                       |
|  +-------------------+  +-------------------+                       |
+---------------------------------------------------------------------+