[AWS] Managing GPU Nodes with Karpenter: Optimizing AI/ML Workloads

1. GPU Node Provisioning with Karpenter

What Makes GPU Workloads Different

AI/ML workloads have requirements that general-purpose compute does not:

+------------------------------------------------------------+
|                GPU workload characteristics                |
+------------------------------------------------------------+
| - Expensive GPU instances (a few to tens of USD per hour)  |
| - Long-running training jobs (hours to days)               |
| - Low-latency requirements for inference                   |
| - Resource constraints driven by GPU memory (VRAM)         |
| - Large GPU performance gaps between instance types        |
| - Risk of losing training progress on Spot interruption    |
+------------------------------------------------------------+

Why Karpenter Suits GPU Management

+----------------------------------------------------+
| Traditional approach (Cluster Autoscaler)          |
|                                                    |
|  GPU Node Group A: p3.2xlarge                      |
|  GPU Node Group B: g5.xlarge                       |
|  GPU Node Group C: g5.2xlarge                      |
|  GPU Node Group D: p4d.24xlarge                    |
|  ...                                               |
|  Each node group managed separately (inefficient)  |
+----------------------------------------------------+

+----------------------------------------------------+
| Karpenter approach                                 |
|                                                    |
|  Single GPU NodePool:                              |
|  - Analyzes Pod requirements                       |
|  - Selects a suitable GPU instance automatically   |
|  - Switches between Spot and On-Demand             |
|  - Optimizes instance choice by cost               |
+----------------------------------------------------+
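
As a rough illustration of the Karpenter approach, a Pod can state which GPU it needs and leave the instance choice to Karpenter. The sketch below assumes a NodePool that allows A10G instances and taints its nodes with nvidia.com/gpu; the Pod name is a placeholder.

```yaml
# Hypothetical Pod: the name is a placeholder, not part of a real deployment.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  nodeSelector:
    # Well-known Karpenter label; restricts provisioning to A10G instances
    karpenter.k8s.aws/instance-gpu-name: a10g
  tolerations:
    # Matches the GPU taint assumed on the NodePool
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.0.0-base-ubuntu22.04
      command: ['nvidia-smi']
      resources:
        limits:
          nvidia.com/gpu: '1'
```

No node group has to exist for this GPU type ahead of time; if no matching node is available, Karpenter should launch one that satisfies both the Pod and the NodePool requirements.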

2. GPU NodePool Configuration

General-Purpose GPU NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-general
spec:
  template:
    metadata:
      labels:
        node-type: gpu
        workload: ai-ml
    spec:
      requirements:
        # Select GPU instances only
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

        # GPU instance categories
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g', 'p']

        # Capacity types
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

        # Availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

        # x86 architecture only
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']

      # Taint reserved for GPU workloads
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

      # Expire GPU nodes after 14 days
      expireAfter: 336h # 14 days

  limits:
    cpu: '500'
    memory: 2000Gi
    nvidia.com/gpu: '100'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
    budgets:
      - nodes: '1'

  weight: 80

AWS GPU Instance Type Guide

+---------------+------+------+--------------------+------------------------------+
| Instance type | GPU  | GPUs | GPU memory (total) | Primary use                  |
+---------------+------+------+--------------------+------------------------------+
| g4dn.xlarge   | T4   | 1    | 16 GB              | Inference, light training    |
| g4dn.12xlarge | T4   | 4    | 64 GB              | Multi-GPU inference          |
| g5.xlarge     | A10G | 1    | 24 GB              | Inference, fine-tuning       |
| g5.12xlarge   | A10G | 4    | 96 GB              | Mid-size training            |
| g5.48xlarge   | A10G | 8    | 192 GB             | Large training               |
| g6.xlarge     | L4   | 1    | 24 GB              | Inference-optimized          |
| g6.12xlarge   | L4   | 4    | 96 GB              | Multimodal inference         |
| p3.2xlarge    | V100 | 1    | 16 GB              | General-purpose training     |
| p3.8xlarge    | V100 | 4    | 64 GB              | Large-scale training         |
| p4d.24xlarge  | A100 | 8    | 320 GB (8 x 40 GB) | Very large-scale training    |
| p5.48xlarge   | H100 | 8    | 640 GB (8 x 80 GB) | Maximum-performance training |
+---------------+------+------+--------------------+------------------------------+
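
The memory column can also drive scheduling. The fragment below is a sketch of a Pod-level node affinity that asks for GPUs with more than 20 GiB of memory each. It assumes the Karpenter label karpenter.k8s.aws/instance-gpu-memory is reported in MiB per GPU, which is worth confirming with kubectl get nodes --show-labels on an existing GPU node.

```yaml
# Pod spec fragment (sketch): require more than 20 GiB of memory per GPU.
# Assumption: the label value is in MiB per GPU, so 20480 = 20 GiB.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.k8s.aws/instance-gpu-memory
              operator: Gt
              values: ['20480']
```

Under that assumption, the 16 GB T4 and V100 types in the table would be ruled out, while the A10G, L4, A100, and H100 types should remain eligible.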

Inference-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: inference
    spec:
      requirements:
        # GPU types suited to inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Prefer Spot (inference is stateless), with On-Demand as fallback
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']

        # Limit instance sizes
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '50'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m

  weight: 60

Training-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training
spec:
  template:
    metadata:
      labels:
        node-type: gpu-training
        workload: training
    spec:
      requirements:
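        # High-performance GPUs suited to training
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a100', 'h100', 'a10g']

        # On-Demand only (interrupting training is costly)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

        # GPU instances only (raise this threshold to require multi-GPU instances)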
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

      # Training nodes expire after 30 days
      expireAfter: 720h # 30 days

  limits:
    nvidia.com/gpu: '32'

  disruption:
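    # Disable consolidation during training: a budget of 0 nodes blocks all
    # voluntary disruption, including removal of empty nodes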
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      - nodes: '0'

  weight: 90

3. GPU-Specific EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-optimized
spec:
  # AMI selection (check whether the resolved AMI already includes the GPU driver)
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        network-type: private

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large root volume for GPU workloads
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 6000
        throughput: 250
        encrypted: true
        deleteOnTermination: true

  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required

  tags:
    Environment: production
    NodeType: gpu
    ManagedBy: karpenter

  # User data for node bootstrap
  userData: |
    #!/bin/bash
    echo "GPU node bootstrap"
    # NVIDIA driver installation is left to the GPU Operator

Training-Only EC2NodeClass (Large Storage)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-training
spec:
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large, high-performance storage for training data
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
        deleteOnTermination: true

  tags:
    Environment: production
    NodeType: gpu-training
    ManagedBy: karpenter

4. Spot GPU Instance Strategy

Cost Savings from Spot GPUs

+---------------+--------------------+--------------------+---------+
| Instance type | On-Demand (USD/hr) | Spot est. (USD/hr) | Savings |
+---------------+--------------------+--------------------+---------+
| g4dn.xlarge   | ~0.526             | ~0.158             | ~70%    |
| g5.xlarge     | ~1.006             | ~0.302             | ~70%    |
| g5.2xlarge    | ~1.212             | ~0.364             | ~70%    |
| g5.12xlarge   | ~5.672             | ~1.702             | ~70%    |
| p3.2xlarge    | ~3.060             | ~0.918             | ~70%    |
+---------------+--------------------+--------------------+---------+
 (Prices vary by region and over time.)
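
The savings figure is simply one minus the ratio of the two hourly prices. As a worked example with the g4dn.xlarge row (illustrative prices; the 730-hour month is an assumption used only for scale):

```latex
\[
\text{savings} = 1 - \frac{P_{\text{spot}}}{P_{\text{on-demand}}}
               = 1 - \frac{0.158}{0.526} \approx 0.70
\]
\[
\text{monthly difference per node} \approx 730\,\text{h} \times (0.526 - 0.158)\,\text{USD/h}
                                   \approx 268.64\ \text{USD}
\]
```

The same calculation scales linearly with node count, which is why even a modest Spot share tends to matter for always-on inference fleets.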

Spot GPU NodePool for Inference

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-spot
        workload: inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']

        # A range of GPU types suited to inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # A range of sizes to improve Spot availability
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge']

        # Use multiple AZs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '40'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

  weight: 70

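Preparing for Spot Interruptions

# Apply the do-not-disrupt annotation to the Pod (long-running training jobs).
# It blocks Karpenter's voluntary disruption only; it cannot stop an EC2 Spot
# reclaim, so pair it with checkpointing.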
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  annotations:
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: training
      image: my-training-image:latest
      resources:
        requests:
          nvidia.com/gpu: '1'
          cpu: '4'
          memory: 16Gi
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  terminationGracePeriodSeconds: 120
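
Because the annotation cannot hold back a Spot reclaim, it helps if Karpenter itself reacts to the interruption notice. The Helm values fragment below is a sketch of enabling Karpenter's interruption handling. It assumes an SQS queue named my-cluster already exists and receives EC2 Spot interruption events through EventBridge rules; the queue name is a placeholder.

```yaml
# values.yaml fragment for the Karpenter Helm chart (sketch).
# Assumption: an SQS queue named "my-cluster" exists and is fed by
# EventBridge rules for Spot interruption warnings.
settings:
  clusterName: my-cluster
  interruptionQueue: my-cluster
```

With the queue configured, Karpenter can start draining the affected node during the two-minute Spot warning, which gives the Pod's terminationGracePeriodSeconds a chance to be used for a final checkpoint.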

5. NVIDIA GPU Operator Integration

GPU Operator Overview

+--------------------------------------------------------------------------+
|                           NVIDIA GPU Operator                            |
|                                                                          |
|  +--------------------+  +--------------------+  +--------------------+  |
|  | NVIDIA Driver      |  | Container Toolkit  |  | Device Plugin      |  |
|  | (auto-installed)   |  | (auto-configured)  |  | (auto-deployed)    |  |
|  +--------------------+  +--------------------+  +--------------------+  |
|                                                                          |
|  +--------------------+  +--------------------+  +--------------------+  |
|  | GPU Feature        |  | DCGM Exporter      |  | MIG Manager        |  |
|  | Discovery          |  | (metrics)          |  | (MIG management)   |  |
|  +--------------------+  +--------------------+  +--------------------+  |
+--------------------------------------------------------------------------+

Installing the GPU Operator

# Add the NVIDIA GPU Operator Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false \
  --set gfd.enabled=true
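
Some node AMIs, such as the EKS accelerated variants, ship with the NVIDIA driver and container toolkit preinstalled, and letting the operator install them again can conflict. The values file below is a sketch for that case; whether it applies depends on which AMI the EC2NodeClass actually resolves to, so treat the two disabled components as an assumption to verify.

```yaml
# values.yaml fragment for the gpu-operator chart (sketch).
# Assumption: the node AMI already ships the NVIDIA driver and container toolkit.
driver:
  enabled: false
toolkit:
  enabled: false
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
gfd:
  enabled: true
migManager:
  enabled: false
```

It can be passed to the install command with -f values.yaml in place of the individual --set flags.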

Verifying GPU Operator and Karpenter Integration

# Check the labels on GPU nodes
kubectl get nodes -l node-type=gpu -o json | \
  jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia")))'

# Check GPU resources
kubectl describe node gpu-node-name | grep -A 5 "nvidia.com/gpu"

# Check the DCGM Exporter Pods
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

GPU Workload Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference-server
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc

6. Multi-Architecture Support (x86 + ARM/Graviton)

Multi-Architecture NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    spec:
      requirements:
        # Allow both x86 and ARM
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64', 'arm64']

        # General-purpose instance categories (these include Graviton types)
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']

        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['5']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default

  limits:
    cpu: '1000'
    memory: 2000Gi

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
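
A NodePool that allows both architectures can place a Pod on either one, so images built for a single architecture need to say so. The fragment below is a minimal sketch for an amd64-only image.

```yaml
# Pod spec fragment (sketch): keep an amd64-only image off Graviton nodes.
nodeSelector:
  kubernetes.io/arch: amd64
```

Multi-arch images need no selector at all, which leaves Karpenter free to pick whichever architecture is cheaper at the time.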

GPU Alternatives: Inferentia/Trainium

# Inference-only NodePool for AWS Inferentia
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inferentia
spec:
  template:
    metadata:
      labels:
        accelerator: inferentia
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

      taints:
        - key: aws.amazon.com/neuron
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: inferentia-nodes

  limits:
    aws.amazon.com/neuron: '32'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
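
A workload lands on this pool by requesting a Neuron device and tolerating the taint. The Pod below is a sketch: it assumes the AWS Neuron device plugin is installed so that nodes advertise aws.amazon.com/neuron, and the Pod name and image are placeholders.

```yaml
# Hypothetical Pod: name and image are placeholders.
# Assumption: the Neuron device plugin exposes aws.amazon.com/neuron on the node.
apiVersion: v1
kind: Pod
metadata:
  name: neuron-inference
spec:
  nodeSelector:
    accelerator: inferentia
  tolerations:
    - key: aws.amazon.com/neuron
      operator: Exists
      effect: NoSchedule
  containers:
    - name: model-server
      image: my-neuron-inference-image:v1
      resources:
        limits:
          aws.amazon.com/neuron: '1'
```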

7. Cost Optimization Strategies

Mixing Spot and On-Demand

+--------------------------------------------------------------------------+
|                     Cost optimization decision tree                      |
+--------------------------------------------------------------------------+
|                                                                          |
|  Identify the workload type                                              |
|      |                                                                   |
|      +-- Inference (stateless) --> Spot first, On-Demand fallback        |
|      |                                                                   |
|      +-- Fine-tuning (short) --> Spot + checkpointing                    |
|      |                                                                   |
|      +-- Large-scale training (long) --> On-Demand + Reserved Instances  |
|      |                                                                   |
|      +-- Batch processing --> Spot only                                  |
|                                                                          |
+--------------------------------------------------------------------------+

Instance Family Strategy with Weighted Priorities

# Tier 1: G5 Spot (most cost-efficient for inference)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier1-g5-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 100
  limits:
    nvidia.com/gpu: '20'
---
# Tier 2: G4dn Spot (fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier2-g4dn-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 50
  limits:
    nvidia.com/gpu: '20'
---
# Tier 3: G5 On-Demand (last resort)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier3-g5-ondemand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 10
  limits:
    nvidia.com/gpu: '10'

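Tuning the Consolidation Policy

# Consolidation settings for GPU nodes
disruption:
  # Use WhenEmpty only on GPU nodes (protects running GPU jobs)
  consolidationPolicy: WhenEmpty
  # Wait 5 minutes after a node becomes empty (allows for brief idle gaps)
  consolidateAfter: 5m
  budgets:
    # Disrupt at most one node at a time
    - nodes: '1'
    # Block disruption during business hours (schedules are evaluated in UTC)
    - nodes: '0'
      schedule: '0 9 * * MON-FRI'
      duration: 10h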

8. Node Disruption Budgets

Disruption Budgets for GPU Workloads

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training-protected
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
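      # Block all disruption during training hours
      - nodes: '0'
        schedule: '0 0 * * *'
        duration: 23h

      # Maintenance window (1 hour per day)
      - nodes: '1'
        schedule: '0 23 * * *'
        duration: 1h

      # Cap drift-driven disruption at one node; the 0-node budget above still
      # applies to drift while it is active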
      - nodes: '1'
        reasons:
          - 'Drifted'

Pod-Level Protection

# Long-running training Pod: prevent Karpenter disruption
apiVersion: v1
kind: Pod
metadata:
  name: long-training-job
  annotations:
    # This annotation prevents Karpenter's voluntary disruption
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: trainer
      image: my-training-image:v1
      resources:
        requests:
          nvidia.com/gpu: '4'
          cpu: '16'
          memory: 64Gi
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # Generous termination grace period (time to save a checkpoint)
  terminationGracePeriodSeconds: 300

PDB (Pod Disruption Budget) Configuration

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-server-pdb
  namespace: ml-serving
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-server

9. Monitoring with Prometheus/Grafana

Collecting Karpenter Metrics

# Karpenter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics

Key Karpenter Metrics

+-----------------------------------------------+------------------------------------------+
| Metric                                        | Description                              |
+-----------------------------------------------+------------------------------------------+
| karpenter_nodeclaims_launched_total           | Total NodeClaims launched                |
| karpenter_nodeclaims_registered_total         | Total NodeClaims registered              |
| karpenter_nodeclaims_terminated_total         | Total NodeClaims terminated              |
| karpenter_pods_state                          | Pod state (node, namespace, etc.)        |
| karpenter_nodepool_usage                      | Resource usage per NodePool              |
| karpenter_nodepool_limit                      | Resource limits per NodePool             |
| karpenter_voluntary_disruption_eligible_nodes | Nodes eligible for voluntary disruption  |
| karpenter_disruption_actions_performed_total  | Disruption actions performed             |
| karpenter_nodes_allocatable                   | Allocatable resources per node           |
| karpenter_nodes_total_daemon_requests         | Total DaemonSet resource requests        |
+-----------------------------------------------+------------------------------------------+
 (Metric names differ between Karpenter releases; confirm them against the /metrics endpoint of the installed version.)

GPU-Focused Grafana Dashboard Queries

# Number of GPU nodes
count(karpenter_nodes_allocatable{resource_type="nvidia.com/gpu"} > 0)

# GPU utilization (requires DCGM Exporter)
DCGM_FI_DEV_GPU_UTIL

# GPU usage vs. limit per NodePool
karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}
  /
karpenter_nodepool_limit{resource_type="nvidia.com/gpu"}

# Provisioning (scheduling) latency, p99
histogram_quantile(0.99,
  rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
)

DCGM Exporter Metrics

# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

Example Alert Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: karpenter-gpu
      rules:
        # GPU NodePool has reached 90% of its limit
        - alert: GPUNodePoolNearLimit
          expr: |
            karpenter_nodepool_usage{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            /
            karpenter_nodepool_limit{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU NodePool approaching resource limit'

        # Detect nodes with low GPU utilization
        - alert: LowGPUUtilization
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 1h
          labels:
            severity: info
          annotations:
            summary: 'GPU utilization below 10 percent for 1 hour'

        # Karpenter provisioning failure
        - alert: KarpenterProvisioningFailed
          expr: |
            increase(karpenter_nodeclaims_terminated_total{reason="ProvisioningFailed"}[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: 'Karpenter failed to provision GPU node'

10. Hands-On Example: Training Cluster

Distributed Training Cluster Setup

# PyTorch distributed training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  namespace: ml-training
spec:
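  parallelism: 4
  completions: 4
  # Indexed mode is required for the job-completion-index annotation used below
  completionMode: Indexed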
  template:
    metadata:
      labels:
        app: distributed-training
      annotations:
        karpenter.sh/do-not-disrupt: 'true'
    spec:
      containers:
        - name: pytorch-trainer
          image: my-pytorch-training:v1
          command: ['torchrun']
          args:
            - '--nproc_per_node=1'
            - '--nnodes=4'
            - '--node_rank=$(JOB_COMPLETION_INDEX)'
            - '--master_addr=training-master'
            - '--master_port=29500'
            - 'train.py'
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: '8'
              memory: 32Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-training
      restartPolicy: OnFailure
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-pvc
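
The Job points torchrun at --master_addr=training-master, so something has to resolve that name to the rank-0 Pod. One way to do that is a headless Service, sketched below. It assumes the Job runs in Indexed completion mode and that the cluster sets the batch.kubernetes.io/job-completion-index label on Pods (available in newer Kubernetes releases); the port name is a placeholder.

```yaml
# Sketch: headless Service that resolves "training-master" to the rank-0 Pod.
# Assumptions: Indexed Job, and the job-completion-index Pod label is present.
apiVersion: v1
kind: Service
metadata:
  name: training-master
  namespace: ml-training
spec:
  clusterIP: None
  selector:
    app: distributed-training
    batch.kubernetes.io/job-completion-index: '0'
  ports:
    - name: rendezvous
      port: 29500
      targetPort: 29500
```

On clusters without that label, a dedicated rank-0 Pod with its own label, or a training operator that manages rendezvous, would be the alternatives.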

11. Hands-On Example: Inference Cluster

Autoscaling Inference Service

# Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Meta-Llama-3-8B'
            - '--tensor-parallel-size'
            - '1'
            - '--gpu-memory-utilization'
            - '0.9'
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
---
# HPA (the gpu_utilization Pods metric must be served by a custom metrics adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '70'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
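
The HPA can only read gpu_utilization if something publishes it through the custom metrics API. The rule below is a sketch for Prometheus Adapter that maps DCGM_FI_DEV_GPU_UTIL to that name. It assumes the DCGM series carry the namespace and pod labels of the Pod using the GPU; depending on scrape settings those labels may appear as exported_namespace and exported_pod instead.

```yaml
# Prometheus Adapter rule (sketch).
# Assumption: DCGM series are labeled with the GPU-consuming Pod's
# namespace and pod.
rules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: namespace}
        pod: {resource: pod}
    name:
      matches: '^DCGM_FI_DEV_GPU_UTIL$'
      as: gpu_utilization
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

When the HPA adds replicas that do not fit on existing nodes, the resulting Pending Pods are what prompt Karpenter to provision more GPU capacity.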

12. Troubleshooting Guide

Common GPU Node Problems

# 1. GPU resources are not showing up
kubectl describe node gpu-node | grep -A 10 "Allocatable"
# If nvidia.com/gpu is missing, check the GPU Operator

# 2. Check GPU Operator Pod status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. Check Karpenter provisioning logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter \
  | grep -i "gpu\|nvidia\|instance-type"

# 4. Check NodeClaim status
kubectl get nodeclaims -o wide

# 5. Analyze why a Pod is Pending
kubectl describe pod gpu-pod-name | grep -A 20 "Events"

Frequent Problems and Fixes

+-----------------------------------------+-----------------------------------------------------+
| Problem                                 | Fix                                                 |
+-----------------------------------------+-----------------------------------------------------+
| GPU resources do not appear on the node | Reinstall GPU Operator or check the driver          |
| No Spot GPU instances can be found      | Add more GPU instance types and AZs                 |
| GPU node provisioning times out         | Check EC2NodeClass subnet and security group tags   |
| Node is disrupted during training       | Add the do-not-disrupt annotation                   |
| GPU out of memory (OOM)                 | Allow instance types with larger GPUs               |
| Unneeded GPU nodes are kept running     | Check the consolidation policy and consolidateAfter |
| Only one GPU type gets provisioned      | Widen the NodePool requirements                     |
+-----------------------------------------+-----------------------------------------------------+

Checking GPU Memory

# Check GPU state directly on a node (using a debug Pod)
kubectl run gpu-debug --rm -it \
  --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
  --overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"nodeSelector":{"node-type":"gpu"}}}' \
  --restart=Never \
  -- nvidia-smi

13. Best Practices Summary

GPU Node Management Checklist

+----+-----------------------------------------------------------+
| #  | Best practice                                             |
+----+-----------------------------------------------------------+
| 1  | Separate NodePools for inference and training workloads   |
| 2  | Set a GPU taint to keep non-GPU workloads off GPU nodes   |
| 3  | Use Spot for inference and On-Demand for training         |
| 4  | Apply the do-not-disrupt annotation to long training jobs |
| 5  | Protect training progress with a checkpoint strategy      |
| 6  | Automate driver management with the GPU Operator          |
| 7  | Collect GPU metrics with DCGM Exporter                    |
| 8  | Cap GPU spend with NodePool limits                        |
| 9  | Allow multiple GPU instance types for availability        |
| 10 | Guarantee minimum inference availability with PDBs        |
| 11 | Block disruption during training hours with budgets       |
| 12 | Combine HPA with Karpenter for automatic scaling          |
+----+-----------------------------------------------------------+

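Cost Optimization Strategy Summary

Strategy 1: Tiered NodePools
  - Spot GPU (higher weight) -> On-Demand GPU (lower weight)
  - Best suited to inference workloads

Strategy 2: Instance diversification
  - Allow multiple GPU families (g4dn, g5, g6)
  - Allow multiple instance sizes
  - Maximize Spot availability

Strategy 3: Automatic scale-down
  - Use WhenEmpty consolidation to remove empty GPU nodes once consolidateAfter elapses
  - Keep consolidateAfter short for inference
  - Give training nodes a longer wait

Strategy 4: Sensible resource limits
  - Cap the maximum GPU count with NodePool limits
  - Prevent unexpected cost overruns
  - Manage quotas per team or project

Karpenter + GPU Architecture: Final Diagram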

+---------------------------------------------------------------------+
|                        EKS Cluster                                  |
|                                                                     |
|  +-------------------+  +-------------------+  +-----------------+  |
|  | NodePool:         |  | NodePool:         |  | NodePool:       |  |
|  | gpu-inference     |  | gpu-training      |  | multi-arch      |  |
|  | (Spot, weight:60) |  | (OD, weight:90)   |  | (Mixed)         |  |
|  +--------+----------+  +--------+----------+  +--------+--------+  |
|           |                      |                       |           |
|  +--------v----------+  +--------v----------+  +--------v--------+  |
|  | EC2NodeClass:     |  | EC2NodeClass:     |  | EC2NodeClass:   |  |
|  | gpu-optimized     |  | gpu-training      |  | default         |  |
|  | (200GB, gp3)      |  | (500GB, gp3)      |  | (100GB, gp3)    |  |
|  +-------------------+  +-------------------+  +-----------------+  |
|                                                                     |
|  +-------------------+  +-------------------+                       |
|  | GPU Operator      |  | Prometheus +      |                       |
|  | (NVIDIA Driver,   |  | Grafana           |                       |
|  |  Device Plugin,   |  | (Karpenter +      |                       |
|  |  DCGM Exporter)   |  |  DCGM Metrics)    |                       |
|  +-------------------+  +-------------------+                       |
+---------------------------------------------------------------------+
