[AWS] Managing GPU Nodes with Karpenter: AI/ML Workload Optimization

1. GPU Node Provisioning with Karpenter

The Unique Nature of GPU Workloads

AI/ML workloads have distinct requirements compared to general computing:

+---------------------------------------------------------------+

| GPU Workload Characteristics |

+---------------------------------------------------------------+

| - Expensive GPU instances (dollars to tens of dollars/hour) |

| - Long-running training jobs (hours to days) |

| - Low-latency requirements for inference |

| - GPU memory (VRAM) as the primary resource constraint |

| - Significant performance differences across instance types |

| - Risk of losing training progress on Spot interruption |

+---------------------------------------------------------------+

Why Karpenter Excels for GPU Management

+------------------------------------------+

| Traditional (Cluster Autoscaler) |

| |

| GPU Node Group A: p3.2xlarge |

| GPU Node Group B: g5.xlarge |

| GPU Node Group C: g5.2xlarge |

| GPU Node Group D: p4d.24xlarge |

| ... |

| Manage each Node Group separately |

| (inefficient) |

+------------------------------------------+

+------------------------------------------+

| Karpenter Approach |

| |

| Single GPU NodePool: |

| - Analyze pod requirements |

| - Auto-select optimal GPU instance |

| - Automatic Spot/On-Demand switching |

| - Cost-based instance optimization |

+------------------------------------------+

2. GPU NodePool Configuration

General-Purpose GPU NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-general
spec:
  template:
    metadata:
      labels:
        node-type: gpu
        workload: ai-ml
    spec:
      requirements:
        # Select only GPU instances
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        # GPU instance families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g', 'p']
        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
        # Availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']
        # x86 architecture only
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']
      # GPU-dedicated taint
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
      # Longer expiration for GPU nodes
      expireAfter: 336h # 14 days
  limits:
    cpu: '500'
    memory: 2000Gi
    nvidia.com/gpu: '100'
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
    budgets:
      - nodes: '1'
  weight: 80

AWS GPU Instance Type Guide

+------------------+----------+----------+------------------+---------------------+

| Instance Type | GPU | Count | GPU Memory | Primary Use Case |

+------------------+----------+----------+------------------+---------------------+

| g4dn.xlarge | T4 | 1 | 16 GB | Inference, light ML |

| g4dn.12xlarge | T4 | 4 | 64 GB | Multi-inference |

| g5.xlarge | A10G | 1 | 24 GB | Inference, fine-tune|

| g5.12xlarge | A10G | 4 | 96 GB | Medium training |

| g5.48xlarge | A10G | 8 | 192 GB | Large training |

| g6.xlarge | L4 | 1 | 24 GB | Inference optimized |

| g6.12xlarge | L4 | 4 | 96 GB | Multimodal inference|

| p3.2xlarge | V100 | 1 | 16 GB | General training |

| p3.8xlarge | V100 | 4 | 64 GB | Large training |

| p4d.24xlarge | A100 | 8 | 320 GB (40GB x8) | Ultra-large training|

| p5.48xlarge | H100 | 8 | 640 GB (80GB x8) | Maximum performance |

+------------------+----------+----------+------------------+---------------------+
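Since GPU memory is usually the binding constraint, the table can be turned into a quick lookup. The following is a minimal sketch, not part of Karpenter: the per-GPU memory figures are copied from the table above, while the `candidates` helper and its `max_gpus` parameter are hypothetical names introduced for illustration. It assumes a model can be sharded across all GPUs on an instance, which is not true of every workload.

```python
# Per-GPU memory figures are taken from the instance table above.
# The helper below is an illustrative sketch, not a Karpenter API.
INSTANCES = [
    # (instance type, GPU model, GPU count, memory per GPU in GB)
    ("g4dn.xlarge", "T4", 1, 16),
    ("g4dn.12xlarge", "T4", 4, 16),
    ("g5.xlarge", "A10G", 1, 24),
    ("g5.12xlarge", "A10G", 4, 24),
    ("g5.48xlarge", "A10G", 8, 24),
    ("g6.xlarge", "L4", 1, 24),
    ("g6.12xlarge", "L4", 4, 24),
    ("p3.2xlarge", "V100", 1, 16),
    ("p3.8xlarge", "V100", 4, 16),
    ("p4d.24xlarge", "A100", 8, 40),
    ("p5.48xlarge", "H100", 8, 80),
]


def candidates(required_gb, max_gpus=8):
    """Return instance types whose total GPU memory covers required_gb.

    Assumes the workload can be sharded across every GPU on the instance.
    Results are ordered by total GPU memory, smallest first.
    """
    fits = []
    for name, _gpu, count, per_gpu_gb in INSTANCES:
        total_gb = count * per_gpu_gb
        if count <= max_gpus and total_gb >= required_gb:
            fits.append((total_gb, name))
    return [name for _total, name in sorted(fits)]


if __name__ == "__main__":
    # A model needing about 20 GB on a single GPU
    print(candidates(20, max_gpus=1))
    # A model needing about 500 GB in total
    print(candidates(500))
```

The resulting names could then be fed into a `node.kubernetes.io/instance-type` requirement, or used to decide which `instance-gpu-name` values a NodePool should allow.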

Inference-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: inference
    spec:
      requirements:
        # Inference-suitable instances
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']
        # Spot instances preferred (inference is stateless)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']
        # Instance size constraint
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  limits:
    nvidia.com/gpu: '50'
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m
  weight: 60

Training-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training
spec:
  template:
    metadata:
      labels:
        node-type: gpu-training
        workload: training
    spec:
      requirements:
        # High-performance GPUs for training
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a100', 'h100', 'a10g']
        # On-Demand only (training interruption is costly)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        # GPU instances only (raise the threshold to require multi-GPU instances)
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training
      # Long expiration for training nodes (set Never to disable expiration entirely)
      expireAfter: 720h # 30 days
  limits:
    nvidia.com/gpu: '32'
  disruption:
    # Only empty nodes are consolidation candidates
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      # nodes: '0' with no schedule blocks all voluntary disruption for this pool
      - nodes: '0'
  weight: 90

3. GPU-Optimized EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-optimized
spec:
  # AMI with GPU driver support
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        network-type: private
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  # Large disk for GPU workloads
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 6000
        throughput: 250
        encrypted: true
        deleteOnTermination: true
  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  tags:
    Environment: production
    NodeType: gpu
    ManagedBy: karpenter
  # Bootstrap script for GPU nodes
  userData: |
    #!/bin/bash
    echo "GPU node bootstrap"
    # NVIDIA drivers are handled by GPU Operator

Training-Specific EC2NodeClass (Large Storage)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-training
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  # Large, high-performance storage for training data
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
        deleteOnTermination: true
  tags:
    Environment: production
    NodeType: gpu-training
    ManagedBy: karpenter

4. Spot GPU Instance Strategy

Cost Savings with Spot GPU

+------------------+-------------------+-------------------+---------+

| Instance Type | On-Demand ($/hr) | Spot Est. ($/hr) | Savings |

+------------------+-------------------+-------------------+---------+

| g4dn.xlarge | ~0.526 | ~0.158 | ~70% |

| g5.xlarge | ~1.006 | ~0.302 | ~70% |

| g5.2xlarge | ~1.212 | ~0.364 | ~70% |

| g5.12xlarge | ~5.672 | ~1.702 | ~70% |

| p3.2xlarge | ~3.060 | ~0.918 | ~70% |

+------------------+-------------------+-------------------+---------+

(Prices vary by region and time)
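The savings column follows directly from the two price columns. As a rough sketch, using only the approximate figures from the table above (the function names and the 730 hours-per-month figure are assumptions added for illustration), the arithmetic looks like this:

```python
# Approximate hourly prices copied from the table above: (on-demand, spot estimate).
# Real prices vary by region and time, so treat these as placeholders.
PRICES = {
    "g4dn.xlarge": (0.526, 0.158),
    "g5.xlarge": (1.006, 0.302),
    "g5.2xlarge": (1.212, 0.364),
    "g5.12xlarge": (5.672, 1.702),
    "p3.2xlarge": (3.060, 0.918),
}


def savings_pct(on_demand, spot):
    """Percentage saved by paying the Spot price instead of On-Demand."""
    return round((1 - spot / on_demand) * 100, 1)


def monthly_savings(instance, nodes=1, hours=730):
    """Money saved per month for `nodes` instances running `hours` each.

    730 is an assumed average number of hours in a month.
    """
    on_demand, spot = PRICES[instance]
    return round((on_demand - spot) * hours * nodes, 2)


if __name__ == "__main__":
    for name, (on_demand, spot) in PRICES.items():
        print(name, savings_pct(on_demand, spot), monthly_savings(name))
```

Multiplying the hourly difference out over a month makes it easier to judge whether the engineering effort of handling Spot interruptions is worth it for a given pool size.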

Spot GPU NodePool for Inference

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-spot
        workload: inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        # Diverse GPU types for inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']
        # Various sizes for Spot availability
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge']
        # Multiple AZs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  limits:
    nvidia.com/gpu: '40'
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  weight: 70

Spot Interruption Mitigation

# The do-not-disrupt annotation blocks Karpenter's voluntary disruption
# (consolidation, drift) for long-running training jobs. It cannot stop an
# EC2 Spot interruption, so pair it with checkpointing and a grace period.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  annotations:
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: training
      image: my-training-image:latest
      resources:
        requests:
          nvidia.com/gpu: '1'
          cpu: '4'
          memory: 16Gi
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  terminationGracePeriodSeconds: 120
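The grace period only helps if the training process reacts to SIGTERM. Below is a minimal sketch of that pattern using only the Python standard library; it is not the author's training code. `CHECKPOINT_PATH`, `train`, and the JSON state file are hypothetical stand-ins for a real framework's checkpoint format, and the loop body is a placeholder for an actual training step.

```python
import json
import os
import signal
import tempfile

# Hypothetical location; in the Pod above this would point at a mounted volume.
CHECKPOINT_PATH = os.environ.get(
    "CHECKPOINT_PATH", os.path.join(tempfile.gettempdir(), "train_state.json")
)

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: ask the training loop to stop at the next step boundary."""
    global stop_requested
    stop_requested = True


def save_checkpoint(step, path=CHECKPOINT_PATH):
    """Write state to a temp file, then rename, so a kill mid-write leaves the old file intact."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the last saved step, or 0 when no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)["step"]
    except FileNotFoundError:
        return 0


def train(total_steps, checkpoint_every=100, path=CHECKPOINT_PATH):
    """Resume from the last checkpoint and stop early once SIGTERM was received."""
    step = load_checkpoint(path)
    while step < total_steps and not stop_requested:
        step += 1  # placeholder for one real training step
        if step % checkpoint_every == 0:
            save_checkpoint(step, path)
    save_checkpoint(step, path)  # final save, also reached on early stop
    return step
```

A real entrypoint would call `signal.signal(signal.SIGTERM, request_stop)` before `train(...)`, and the final save has to finish within `terminationGracePeriodSeconds`, so the grace period should be sized to the time a checkpoint write actually takes.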

5. NVIDIA GPU Operator Integration

GPU Operator Overview

+----------------------------------------------------------------+

| NVIDIA GPU Operator |

| |

| +------------------+ +-------------------+ +--------------+ |

| | NVIDIA Driver | | Container Toolkit | | Device Plugin| |

| | (Auto Install) | | (Auto Config) | | (Auto Deploy)| |

| +------------------+ +-------------------+ +--------------+ |

| |

| +------------------+ +-------------------+ +--------------+ |

| | GPU Feature | | DCGM Exporter | | MIG Manager | |

| | Discovery | | (Metrics) | | (MIG Mgmt) | |

| +------------------+ +-------------------+ +--------------+ |

+----------------------------------------------------------------+

Installing GPU Operator

# Add NVIDIA GPU Operator Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
# (if the node AMI already ships the NVIDIA driver and container toolkit,
# set driver.enabled and toolkit.enabled to false to avoid a conflicting install)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false \
  --set gfd.enabled=true

Verifying GPU Operator with Karpenter

# Check NVIDIA labels on GPU nodes
kubectl get nodes -l node-type=gpu -o json | \
  jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia")))'

# Verify GPU resources
kubectl describe node gpu-node-name | grep -A 5 "nvidia.com/gpu"

# Check DCGM Exporter pods
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

GPU Workload Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference-server
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc

6. Multi-Architecture Support (x86 + ARM/Graviton)

Multi-Architecture NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    spec:
      requirements:
        # Allow both x86 and ARM
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64', 'arm64']
        # Include Graviton instances
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['5']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: '1000'
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

GPU Alternative: AWS Inferentia/Trainium

# AWS Inferentia inference-only NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inferentia
spec:
  template:
    metadata:
      labels:
        accelerator: inferentia
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: aws.amazon.com/neuron
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: inferentia-nodes
  limits:
    aws.amazon.com/neuron: '32'
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m

7. Cost Optimization Strategies

Mixed Spot and On-Demand Strategy

+-------------------------------------------------------------+

| Cost Optimization Decision Tree |

+-------------------------------------------------------------+

| |

| Identify workload type |

| | |

| +-- Inference (Stateless) --> Spot first + OD fallback |

| | |

| +-- Fine-tuning (Short) --> Spot + checkpoint strategy |

| | |

| +-- Large Training (Long) --> On-Demand + Reserved |

| | |

| +-- Batch Processing --> Spot only |

| |

+-------------------------------------------------------------+

Weighted Priority Instance Family Strategy

# Tier 1: G5 Spot (most cost-effective for inference)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier1-g5-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 100
  limits:
    nvidia.com/gpu: '20'
---
# Tier 2: G4dn Spot (fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier2-g4dn-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 50
  limits:
    nvidia.com/gpu: '20'
---
# Tier 3: G5 On-Demand (last resort)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier3-g5-ondemand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 10
  limits:
    nvidia.com/gpu: '10'
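The effect of the three weights can be illustrated with a toy model. This is a deliberately simplified sketch and not Karpenter's scheduler: it only captures "highest weight first, skip pools that are unavailable or would exceed their GPU limit". The `Pool` class, `pick_pool`, and the `unavailable` parameter (standing in for a Spot capacity shortage) are hypothetical names; the pool names, weights, and limits are taken from the manifests above.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    weight: int
    capacity_type: str
    gpu_limit: int
    gpu_in_use: int = 0


# Names, weights and limits mirror the three tiered NodePools above.
POOLS = [
    Pool("gpu-tier1-g5-spot", 100, "spot", 20),
    Pool("gpu-tier2-g4dn-spot", 50, "spot", 20),
    Pool("gpu-tier3-g5-ondemand", 10, "on-demand", 10),
]


def pick_pool(pools, gpus_needed, unavailable=()):
    """Return the name of the highest-weight pool that can take the request.

    `unavailable` models pools with no obtainable capacity (an assumption
    of this sketch, for example a Spot shortage). Returns None when no
    pool fits.
    """
    for pool in sorted(pools, key=lambda p: p.weight, reverse=True):
        if pool.name in unavailable:
            continue
        if pool.gpu_in_use + gpus_needed <= pool.gpu_limit:
            return pool.name
    return None


if __name__ == "__main__":
    print(pick_pool(POOLS, 1))
    print(pick_pool(POOLS, 1, unavailable={"gpu-tier1-g5-spot"}))
```

Note that the On-Demand tier has the smallest limit, so in this model a large request can end up unschedulable when both Spot tiers are exhausted; that is a cost guardrail, but it is worth alerting on.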

Consolidation Policy Optimization

# GPU node consolidation settings
disruption:
  # Use WhenEmpty only for GPU (protect running GPU jobs)
  consolidationPolicy: WhenEmpty
  # Wait 5 minutes after detecting empty node (handle temporary inactivity)
  consolidateAfter: 5m
  budgets:
    # Disrupt at most 1 node at a time
    - nodes: '1'
    # Block disruption during business hours
    - nodes: '0'
      schedule: '0 9 * * MON-FRI'
      duration: 10h

8. Node Disruption Budgets for GPU Workloads

Disruption Budget for GPU Workloads

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training-protected
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      # Block disruption completely during training hours
      - nodes: '0'
        schedule: '0 0 * * *'
        duration: 23h
      # Maintenance window (1 hour daily)
      - nodes: '1'
        schedule: '0 23 * * *'
        duration: 1h
      # Manage drift-related disruption separately
      - nodes: '1'
        reasons:
          - 'Drifted'
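When several budgets are active at once, Karpenter applies the most restrictive one, and budget schedules are evaluated in UTC. The sketch below models just that rule for the three budgets above so the interaction is easier to see. It is not Karpenter code: the `Budget` class and `allowed` function are hypothetical, schedules are reduced to a start hour plus a duration, and the case where no budget applies is left unmodeled (returns `None`).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Budget:
    nodes: int
    start_hour: Optional[int] = None  # None means the budget is always active
    duration_hours: int = 0
    reasons: Tuple[str, ...] = ()     # empty means the budget covers all reasons


# The three budgets from the gpu-training-protected NodePool above.
BUDGETS = [
    Budget(0, start_hour=0, duration_hours=23),
    Budget(1, start_hour=23, duration_hours=1),
    Budget(1, reasons=("Drifted",)),
]


def is_active(budget, hour_utc):
    """True when hour_utc falls inside the budget's daily window."""
    if budget.start_hour is None:
        return True
    return (hour_utc - budget.start_hour) % 24 < budget.duration_hours


def allowed(budgets, hour_utc, reason):
    """Nodes that may be disrupted for `reason` at `hour_utc`.

    Takes the minimum over all active budgets that cover the reason.
    Returns None when no budget applies (not modeled in this sketch).
    """
    limits = [
        b.nodes
        for b in budgets
        if is_active(b, hour_utc) and (not b.reasons or reason in b.reasons)
    ]
    return min(limits) if limits else None


if __name__ == "__main__":
    for hour in (10, 23):
        for reason in ("Empty", "Drifted"):
            print(hour, reason, allowed(BUDGETS, hour, reason))
```

One consequence worth noticing: because the `nodes: '0'` budget covers every reason, the separate `Drifted` budget does not open an exception during the 23-hour block. Drift replacement can only proceed inside the one-hour maintenance window.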

Pod-Level Protection

# Long-running training pod: prevent Karpenter disruption
apiVersion: v1
kind: Pod
metadata:
  name: long-training-job
  annotations:
    # This annotation prevents Karpenter voluntary disruption
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: trainer
      image: my-training-image:v1
      resources:
        requests:
          nvidia.com/gpu: '4'
          cpu: '16'
          memory: 64Gi
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # Sufficient grace period for checkpoint saving
  terminationGracePeriodSeconds: 300

PDB (Pod Disruption Budget) Configuration

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-server-pdb
  namespace: ml-serving
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-server

9. Monitoring with Prometheus and Grafana

Karpenter Metrics Collection

# Karpenter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics

Key Karpenter Metrics

+-----------------------------------------------+------------------------------------------+

| Metric | Description |

+-----------------------------------------------+------------------------------------------+

| karpenter_nodeclaims_launched_total | Total NodeClaims launched |

| karpenter_nodeclaims_registered_total | Total NodeClaims registered |

| karpenter_nodeclaims_terminated_total | Total NodeClaims terminated |

| karpenter_pods_state | Pod state (node, namespace, etc.) |

| karpenter_nodepool_usage | Resource usage per NodePool |

| karpenter_nodepool_limit | Resource limits per NodePool |

| karpenter_voluntary_disruption_eligible_nodes | Nodes eligible for voluntary disruption |

| karpenter_disruption_actions_performed_total | Total disruption actions performed |

| karpenter_nodes_allocatable | Allocatable resources per node |

| karpenter_nodes_total_daemon_requests | Total daemon set resource requests |

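+-----------------------------------------------+------------------------------------------+

(Metric names vary between Karpenter versions; confirm them against your controller's /metrics endpoint)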

GPU-Specific Grafana Dashboard Queries

# Track GPU node count
count(karpenter_nodes_allocatable{resource_type="nvidia.com/gpu"} > 0)

# GPU utilization (requires DCGM Exporter)
DCGM_FI_DEV_GPU_UTIL

# GPU usage vs limits per NodePool
karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}
/
karpenter_nodepool_limit{resource_type="nvidia.com/gpu"}

# Provisioning latency
histogram_quantile(0.99,
  rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
)

DCGM Exporter Metrics

# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

Alert Rules Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: karpenter-gpu
      rules:
        # GPU NodePool reaching 90% of limit
        - alert: GPUNodePoolNearLimit
          expr: |
            karpenter_nodepool_usage{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            /
            karpenter_nodepool_limit{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU NodePool approaching resource limit'
        # Low GPU utilization detected
        - alert: LowGPUUtilization
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 1h
          labels:
            severity: info
          annotations:
            summary: 'GPU utilization below 10 percent for 1 hour'
        # Karpenter provisioning failure
        - alert: KarpenterProvisioningFailed
          expr: |
            increase(karpenter_nodeclaims_terminated_total{reason="ProvisioningFailed"}[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: 'Karpenter failed to provision GPU node'

10. Real-World Example: Training Cluster

Distributed Training Cluster Configuration

# PyTorch distributed training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  namespace: ml-training
spec:
  parallelism: 4
  completions: 4
  # Indexed mode is required: the job-completion-index annotation used below
  # is only set on pods of Indexed Jobs
  completionMode: Indexed
  template:
    metadata:
      labels:
        app: distributed-training
      annotations:
        karpenter.sh/do-not-disrupt: 'true'
    spec:
      containers:
        - name: pytorch-trainer
          image: my-pytorch-training:v1
          command: ['torchrun']
          args:
            - '--nproc_per_node=1'
            - '--nnodes=4'
            - '--node_rank=$(JOB_COMPLETION_INDEX)'
            # training-master must resolve to the rank-0 pod
            # (for example through a headless Service, not shown here)
            - '--master_addr=training-master'
            - '--master_port=29500'
            - 'train.py'
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: '8'
              memory: 32Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-training
      restartPolicy: OnFailure
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-pvc

11. Real-World Example: Inference Cluster

Auto-Scaling Inference Service

# Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Llama-3-8B'
            - '--tensor-parallel-size'
            - '1'
            - '--gpu-memory-utilization'
            - '0.9'
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference

HPA Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '70'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
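To get a feel for how this HPA reacts, the core replica calculation from the Kubernetes HPA documentation, desired = ceil(current x metric / target), can be sketched in a few lines. This is an approximation under stated assumptions: it uses the default 10% tolerance and the min/max from the manifest above, it ignores the `behavior` rate limits and stabilization windows, and it assumes a metrics adapter already exposes `gpu_utilization` as a per-pod custom metric. The function name is hypothetical.

```python
import math


def desired_replicas(current, metric_value, target=70.0,
                     min_replicas=2, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for an AverageValue target.

    desired = ceil(current * metric_value / target), skipped when the ratio
    is within the tolerance band, then clamped to [min_replicas, max_replicas].
    Scale-up/scale-down policies from `behavior` are not modeled here.
    """
    ratio = metric_value / target
    if abs(ratio - 1.0) <= tolerance:
        return current
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 2 replicas averaging 95% GPU utilization against a target of 70
    print(desired_replicas(2, 95))
    # 8 replicas averaging 140 would exceed maxReplicas and get clamped
    print(desired_replicas(8, 140))
```

Each extra replica requests one GPU, so once the existing nodes are full the new pods stay Pending until Karpenter provisions another GPU node. The `initialDelaySeconds` and scale-up window should therefore be read together with node startup time.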

12. Troubleshooting Guide

Common GPU Node Issues

# 1. GPU resources not showing on node
kubectl describe node gpu-node | grep -A 10 "Allocatable"
# If nvidia.com/gpu is missing, check GPU Operator

# 2. Check GPU Operator pod status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. Check Karpenter provisioning logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter \
  | grep -i "gpu\|nvidia\|instance-type"

# 4. Check NodeClaim status
kubectl get nodeclaims -o wide

# 5. Analyze pending pod causes
kubectl describe pod gpu-pod-name | grep -A 20 "Events"

Common Issues and Solutions

+---------------------------------------------+------------------------------------------+

| Issue | Solution |

+---------------------------------------------+------------------------------------------+

| GPU resources not showing on node | Reinstall GPU Operator or verify drivers |

| Cannot find Spot GPU instances | Add more GPU instance types and AZs |

| GPU node provisioning timeout | Check EC2NodeClass subnet/SG tags |

| Node disrupted during training | Add do-not-disrupt annotation |

| GPU out of memory (OOM) | Allow larger GPU instance types |

| Idle GPU nodes persisting | Review consolidation policy and |

| | consolidateAfter value |

| Only specific GPU type being provisioned | Expand NodePool requirements range |

+---------------------------------------------+------------------------------------------+

GPU Memory Debugging

# Check GPU status directly on node (using debug pod)
kubectl run gpu-debug --rm -it \
  --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
  --overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"nodeSelector":{"node-type":"gpu"}}}' \
  --restart=Never \
  -- nvidia-smi

13. Best Practices Summary

GPU Node Management Checklist

+---+------------------------------------------------------------+

| # | Best Practice |

+---+------------------------------------------------------------+

| 1 | Separate NodePools for inference and training workloads |

| 2 | Set GPU taints to prevent non-GPU workload scheduling |

| 3 | Use Spot for inference, On-Demand for training |

| 4 | Apply do-not-disrupt annotation for long training jobs |

| 5 | Implement checkpoint strategy to protect training progress |

| 6 | Automate driver management with GPU Operator |

| 7 | Collect GPU metrics with DCGM Exporter |

| 8 | Set GPU cost caps with NodePool limits |

| 9 | Allow multiple GPU instance types for availability |

| 10| Guarantee minimum availability with PDB for inference |

| 11| Block disruption during training with Disruption Budgets |

| 12| Combine HPA with Karpenter for auto-scaling |

+---+------------------------------------------------------------+

Cost Optimization Strategy Summary

Strategy 1: Tiered NodePools

- Spot GPU (high weight) -> On-Demand GPU (low weight)

- Optimal for inference workloads

Strategy 2: Instance Diversification

- Allow multiple GPU families (g4dn, g5, g6)

- Allow multiple instance sizes

- Maximize Spot availability

Strategy 3: Auto Scale-Down

- WhenEmpty consolidation to remove idle GPU nodes once consolidateAfter elapses

- Short consolidateAfter for inference

- Longer wait time for training nodes

Strategy 4: Appropriate Resource Limits

- Limit max GPUs with NodePool limits

- Prevent unexpected cost spikes

- Manage quotas per team/project

Final Architecture Diagram: Karpenter + GPU

+---------------------------------------------------------------------+

| EKS Cluster |

| |

| +-------------------+ +-------------------+ +-----------------+ |

| | NodePool: | | NodePool: | | NodePool: | |

| | gpu-inference | | gpu-training | | multi-arch | |

| | (Spot, weight:60) | | (OD, weight:90) | | (Mixed, w:50) | |

| +--------+----------+ +--------+----------+ +--------+--------+ |

| | | | |

| +--------v----------+ +--------v----------+ +--------v--------+ |

| | EC2NodeClass: | | EC2NodeClass: | | EC2NodeClass: | |

| | gpu-optimized | | gpu-training | | default | |

| | (200GB, gp3) | | (500GB, gp3) | | (100GB, gp3) | |

| +-------------------+ +-------------------+ +-----------------+ |

| |

| +-------------------+ +-------------------+ |

| | GPU Operator | | Prometheus + | |

| | (NVIDIA Driver, | | Grafana | |

| | Device Plugin, | | (Karpenter + | |

| | DCGM Exporter) | | DCGM Metrics) | |

| +-------------------+ +-------------------+ |

+---------------------------------------------------------------------+
