1. GPU Node Provisioning with Karpenter

The Unique Nature of GPU Workloads

AI/ML workloads have distinct requirements compared to general computing:

+---------------------------------------------------------------+

| GPU Workload Characteristics |

+---------------------------------------------------------------+

| - Expensive GPU instances (dollars to tens of dollars/hour) |

| - Long-running training jobs (hours to days) |

| - Low-latency requirements for inference |

| - GPU memory (VRAM) as the primary resource constraint |

| - Significant performance differences across instance types |

| - Risk of losing training progress on Spot interruption |

+---------------------------------------------------------------+

Why Karpenter Excels for GPU Management

+------------------------------------------+

| Traditional (Cluster Autoscaler) |

| |

| GPU Node Group A: p3.2xlarge |

| GPU Node Group B: g5.xlarge |

| GPU Node Group C: g5.2xlarge |

| GPU Node Group D: p4d.24xlarge |

| ... |

| Manage each Node Group separately |

| (inefficient) |

+------------------------------------------+

| Karpenter Approach |

| |

| Single GPU NodePool: |

| - Analyze pod requirements |

| - Auto-select optimal GPU instance |

| - Automatic Spot/On-Demand switching |

| - Cost-based instance optimization |

+------------------------------------------+

2. GPU NodePool Configuration

General-Purpose GPU NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
spec:
  template:
    metadata:
      labels:
        node-type: gpu
        workload: ai-ml
    spec:
      requirements:
        # Select only GPU instances
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        # GPU instance families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g', 'p']
        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
        # Availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']
        # x86 architecture only
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']
      # GPU-dedicated taint
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
      # Longer expiration for GPU nodes
      expireAfter: 336h # 14 days
  limits:
    cpu: '500'
    memory: 2000Gi
    nvidia.com/gpu: '100'
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
    budgets:
      - nodes: '1'
  weight: 80

AWS GPU Instance Type Guide

+------------------+----------+----------+------------------+---------------------+
| Instance Type    | GPU      | GPUs     | Total GPU Memory | Use Case            |
+------------------+----------+----------+------------------+---------------------+
| g4dn.xlarge      | T4       | 1        | 16 GB            | Inference, light ML |
| g4dn.12xlarge    | T4       | 4        | 64 GB            | Multi-inference     |
| g5.xlarge        | A10G     | 1        | 24 GB            | Inference, fine-tune|
| g5.12xlarge      | A10G     | 4        | 96 GB            | Medium training     |
| g5.48xlarge      | A10G     | 8        | 192 GB           | Large training      |
| g6.xlarge        | L4       | 1        | 24 GB            | Inference optimized |
| g6.12xlarge      | L4       | 4        | 96 GB            | Multimodal inference|
| p3.2xlarge       | V100     | 1        | 16 GB            | General training    |
| p3.8xlarge       | V100     | 4        | 64 GB            | Large training      |
+------------------+----------+----------+------------------+---------------------+
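Karpenter does its own matching of pod requests against NodePool requirements, so the following is only a reading aid for the table: a minimal Python sketch that filters the rows by GPU count and per-GPU memory. The `GpuInstance` type and the `candidates` helper are illustrative names, not part of Karpenter.

```python
from typing import NamedTuple


class GpuInstance(NamedTuple):
    name: str
    gpu: str
    gpu_count: int
    total_vram_gb: int


# Rows copied from the instance table above.
CATALOG = [
    GpuInstance("g4dn.xlarge", "T4", 1, 16),
    GpuInstance("g4dn.12xlarge", "T4", 4, 64),
    GpuInstance("g5.xlarge", "A10G", 1, 24),
    GpuInstance("g5.12xlarge", "A10G", 4, 96),
    GpuInstance("g5.48xlarge", "A10G", 8, 192),
    GpuInstance("g6.xlarge", "L4", 1, 24),
    GpuInstance("g6.12xlarge", "L4", 4, 96),
    GpuInstance("p3.2xlarge", "V100", 1, 16),
    GpuInstance("p3.8xlarge", "V100", 4, 64),
]


def candidates(min_gpus: int, min_vram_per_gpu_gb: int) -> list:
    """Return catalog entries that satisfy both constraints, smallest first."""
    fits = [
        i for i in CATALOG
        if i.gpu_count >= min_gpus
        and i.total_vram_gb // i.gpu_count >= min_vram_per_gpu_gb
    ]
    # Stable sort: ties keep the table's original order.
    return sorted(fits, key=lambda i: (i.gpu_count, i.total_vram_gb))


if __name__ == "__main__":
    for inst in candidates(min_gpus=1, min_vram_per_gpu_gb=24):
        print(inst.name, inst.gpu, inst.gpu_count, inst.total_vram_gb)
```

A workload that needs one GPU with at least 24 GB narrows the table to the A10G and L4 rows, which is the same narrowing an `instance-gpu-name` requirement expresses in a NodePool.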

Inference-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: inference
    spec:
      requirements:
        # Inference-suitable instances
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']
        # Spot instances preferred (inference is stateless)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']
        # Instance size constraint
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
  limits:
    nvidia.com/gpu: '50'
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m
  weight: 60

Training-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
spec:
  template:
    metadata:
      labels:
        node-type: gpu-training
        workload: training
    spec:
      requirements:
        # High-performance GPUs for training
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a100', 'h100', 'a10g']
        # On-Demand only (training interruption is costly)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        # GPU instances only (Gt '0' matches any GPU count; raise it to require multi-GPU nodes)
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
      # Long expiration for training nodes
      expireAfter: 720h # 30 days
  limits:
    nvidia.com/gpu: '32'
  disruption:
    # Consolidate only empty nodes; the zero-node budget blocks voluntary disruption
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      - nodes: '0'
  weight: 90

3. GPU-Optimized EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
spec:
  # AMI with GPU driver support
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        network-type: private
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  # Large disk for GPU workloads
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 6000
        throughput: 250
        encrypted: true
        deleteOnTermination: true
  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  tags:
    Environment: production
    NodeType: gpu
    ManagedBy: karpenter
  # Bootstrap script for GPU nodes
  userData: |
    #!/bin/bash
    echo "GPU node bootstrap"
    # NVIDIA drivers are handled by GPU Operator

Training-Specific EC2NodeClass (Large Storage)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  # Large, high-performance storage for training data
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
        deleteOnTermination: true
  tags:
    Environment: production
    NodeType: gpu-training
    ManagedBy: karpenter

4. Spot GPU Instance Strategy

Cost Savings with Spot GPU

+------------------+-------------------+-------------------+---------+
| Instance Type    | On-Demand ($/hr)  | Spot ($/hr)       | Savings |
+------------------+-------------------+-------------------+---------+
| g4dn.xlarge      | ~0.526            | ~0.158            | ~70%    |
| g5.xlarge        | ~1.006            | ~0.302            | ~70%    |
| g5.2xlarge       | ~1.212            | ~0.364            | ~70%    |
| g5.12xlarge      | ~5.672            | ~1.702            | ~70%    |
| p3.2xlarge       | ~3.060            | ~0.918            | ~70%    |
+------------------+-------------------+-------------------+---------+

(Prices vary by region and time)
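The savings column follows directly from the two price columns. As a rough check, the Python sketch below recomputes the discount and what it adds up to over a month; the 730-hour month and the helper names are assumptions made for illustration, and the prices are the approximate figures from the table, not live quotes.

```python
# On-Demand and Spot hourly prices (USD), copied from the table above.
PRICES = {
    "g4dn.xlarge": (0.526, 0.158),
    "g5.xlarge": (1.006, 0.302),
    "g5.2xlarge": (1.212, 0.364),
    "g5.12xlarge": (5.672, 1.702),
    "p3.2xlarge": (3.060, 0.918),
}


def spot_savings_percent(on_demand: float, spot: float) -> float:
    """Return the Spot discount relative to On-Demand, in percent."""
    return (1 - spot / on_demand) * 100


def monthly_saving(on_demand: float, spot: float, hours: float = 730) -> float:
    """Return the dollars saved by running one instance on Spot for `hours`."""
    return (on_demand - spot) * hours


if __name__ == "__main__":
    for name, (od, spot) in PRICES.items():
        print(f"{name}: {spot_savings_percent(od, spot):.0f}% cheaper, "
              f"${monthly_saving(od, spot):,.0f} saved per 730-hour month")
```

Because the discount is roughly constant across the listed types, the absolute saving scales with the On-Demand price, so larger instances benefit most from Spot when the workload tolerates interruption.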

Spot GPU NodePool for Inference

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
spec:
  template:
    metadata:
      labels:
        node-type: gpu-spot
        workload: inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        # Diverse GPU types for inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']
        # Various sizes for Spot availability
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge']
        # Multiple AZs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
  limits:
    nvidia.com/gpu: '40'
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  weight: 70

Spot Interruption Mitigation

# Apply the do-not-disrupt annotation to long-running training jobs.
# It blocks Karpenter's voluntary disruption (consolidation, drift) but cannot
# stop a Spot reclaim, so pair it with checkpointing and a grace period long
# enough to save state.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: training
      image: my-training-image:latest
      resources:
        requests:
          nvidia.com/gpu: '1'
          cpu: '4'
          memory: 16Gi
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  terminationGracePeriodSeconds: 120

5. NVIDIA GPU Operator Integration

GPU Operator Overview

[Diagram: NVIDIA GPU Operator components]

Installing GPU Operator

# Add NVIDIA GPU Operator Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false \
  --set gfd.enabled=true

Verifying GPU Operator with Karpenter

# Check NVIDIA labels on GPU nodes
kubectl get nodes -l node-type=gpu -o json | \
  jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia")))'

# Verify GPU resources
kubectl describe node gpu-node-name | grep -A 5 "nvidia.com/gpu"

# Check DCGM Exporter pods
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

GPU Workload Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
            - containerPort: 8001
            - containerPort: 8002
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc

6. Multi-Architecture Support (x86 + ARM/Graviton)

Multi-Architecture NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
spec:
  template:
    spec:
      requirements:
        # Allow both x86 and ARM
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64', 'arm64']
        # Include Graviton instances
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['5']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
  limits:
    cpu: '1000'
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

GPU Alternative: AWS Inferentia/Trainium

# AWS Inferentia inference-only NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
spec:
  template:
    metadata:
      labels:
        accelerator: inferentia
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: aws.amazon.com/neuron
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
  limits:
    aws.amazon.com/neuron: '32'
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m

7. Cost Optimization Strategies

Mixed Spot and On-Demand Strategy

+-------------------------------------------------------------+

| Cost Optimization Decision Tree |

+-------------------------------------------------------------+

| |

| Identify workload type |

| | |

| +-- Inference (Stateless) --> Spot first + OD fallback |

| | |

| +-- Fine-tuning (Short) --> Spot + checkpoint strategy |

| | |

| +-- Large Training (Long) --> On-Demand + Reserved |

| | |

| +-- Batch Processing --> Spot only |

| |

+-------------------------------------------------------------+
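The decision tree can be read as a lookup from workload type to the `karpenter.sh/capacity-type` values a NodePool would allow. The Python sketch below shows one such reading; the workload keys and the `capacity_types` helper are hypothetical names, and Reserved pricing is a billing arrangement on top of On-Demand capacity, so it does not appear as a separate capacity type here.

```python
# Workload type -> allowed capacity types, in the order the tree prefers them.
CAPACITY_BY_WORKLOAD = {
    "inference": ["spot", "on-demand"],   # Spot first, On-Demand fallback
    "fine-tuning": ["spot"],              # Spot, protected by checkpoints
    "large-training": ["on-demand"],      # On-Demand (optionally Reserved)
    "batch": ["spot"],                    # Spot only
}


def capacity_types(workload: str) -> list:
    """Return the capacity-type values for a workload, or raise if unknown."""
    try:
        return CAPACITY_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload type: {workload!r}") from None


if __name__ == "__main__":
    for kind in CAPACITY_BY_WORKLOAD:
        print(f"{kind}: values: {capacity_types(kind)}")
```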

Weighted Priority Instance Family Strategy

# Tier 1: G5 Spot (most cost-effective for inference)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
  weight: 100
  limits:
    nvidia.com/gpu: '20'
---
# Tier 2: G4dn Spot (fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
  weight: 50
  limits:
    nvidia.com/gpu: '20'
---
# Tier 3: G5 On-Demand (last resort)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
  weight: 10
  limits:
    nvidia.com/gpu: '10'
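Karpenter tries NodePools in descending `weight` order, so the three tiers act as a fallback chain. The Python sketch below is a simplified model of that order, assuming the only reason to fall through is a pool reaching its GPU limit; in practice unavailable Spot capacity also pushes pods to the next tier. The pool names are placeholders, since the manifests above do not show `metadata.name`.

```python
# (name, weight, GPU limit) for the three tiers above. Names are hypothetical.
TIERS = [
    ("tier1-g5-spot", 100, 20),
    ("tier2-g4dn-spot", 50, 20),
    ("tier3-g5-on-demand", 10, 10),
]


def place(gpus_requested: int) -> dict:
    """Fill pools in descending weight order, stopping at each pool's limit."""
    placement = {}
    remaining = gpus_requested
    for name, _weight, limit in sorted(TIERS, key=lambda t: t[1], reverse=True):
        take = min(remaining, limit)
        if take:
            placement[name] = take
        remaining -= take
    if remaining:
        # Every pool is at its limit; these GPUs stay pending.
        placement["unschedulable"] = remaining
    return placement


if __name__ == "__main__":
    print(place(30))
    print(place(55))
```

With these limits the chain tops out at 50 GPUs in total, and only 10 of them are On-Demand, which is worth keeping in mind when sizing the last-resort tier.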

Consolidation Policy Optimization

# GPU node consolidation settings
disruption:
  # Use WhenEmpty only for GPU (protect running GPU jobs)
  consolidationPolicy: WhenEmpty
  # Wait 5 minutes after detecting empty node (handle temporary inactivity)
  consolidateAfter: 5m
  budgets:
    # Disrupt at most 1 node at a time
    - nodes: '1'
    # Block disruption during business hours (09:00-19:00, Monday to Friday; schedules are evaluated in UTC)
    - nodes: '0'
      schedule: '0 9 * * MON-FRI'
      duration: 10h

8. Node Disruption Budgets for GPU Workloads

Disruption Budget for GPU Workloads

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      # Block disruption completely during training hours
      - nodes: '0'
        schedule: '0 0 * * *'
        duration: 23h
      # Maintenance window (1 hour daily)
      - nodes: '1'
        schedule: '0 23 * * *'
        duration: 1h
      # Manage drift-related disruption separately
      - nodes: '1'
        reasons:
          - 'Drifted'
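When several budgets are active at once, Karpenter applies the most restrictive one. The Python sketch below models the three budgets above as daily windows to show which limit is in force at a given hour. It is a simplification: it assumes UTC hours, handles only daily schedules rather than full cron syntax, and ignores the `reasons` filter on the drift budget. The `Budget` class is an illustrative name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    nodes: int                        # max nodes that may be disrupted at once
    start_hour: Optional[int] = None  # daily window start; None = always active
    duration_hours: int = 0

    def active_at(self, hour: float) -> bool:
        """Return True if this budget applies at the given hour of the day."""
        if self.start_hour is None:
            return True
        return (hour - self.start_hour) % 24 < self.duration_hours


BUDGETS = [
    Budget(nodes=0, start_hour=0, duration_hours=23),  # training hours
    Budget(nodes=1, start_hour=23, duration_hours=1),  # maintenance window
    Budget(nodes=1),                                   # drift budget, no schedule
]


def allowed_disruptions(hour: float) -> int:
    """Return the node limit of the most restrictive active budget."""
    return min(b.nodes for b in BUDGETS if b.active_at(hour))


if __name__ == "__main__":
    for hour in (3, 12, 22.9, 23.5):
        print(f"{hour:>5} h -> {allowed_disruptions(hour)} node(s)")
```

The zero-node budget wins for 23 hours a day, so drift replacement can only proceed during the one-hour maintenance window even though the drift budget itself has no schedule.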

Pod-Level Protection

# Long-running training pod: prevent Karpenter disruption
apiVersion: v1
kind: Pod
metadata:
  annotations:
    # This annotation prevents Karpenter voluntary disruption
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: trainer
      image: my-training-image:v1
      resources:
        requests:
          nvidia.com/gpu: '4'
          cpu: '16'
          memory: 64Gi
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # Sufficient grace period for checkpoint saving
  terminationGracePeriodSeconds: 300
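The grace period only helps if the training process reacts to SIGTERM, which Kubernetes sends when the grace period starts. The Python sketch below shows one way a training loop might save and resume a checkpoint. The JSON state, the `save_every` interval, and the function names are assumptions for illustration; a real job would save model and optimizer state with its framework's own tools.

```python
import json
import os
import signal
import tempfile

stop_requested = False


def handle_sigterm(signum, frame):
    """Record that termination was requested; the loop exits at the next step."""
    global stop_requested
    stop_requested = True


# Kubernetes delivers SIGTERM to the container's main process on termination.
signal.signal(signal.SIGTERM, handle_sigterm)


def save_checkpoint(state: dict, path: str) -> None:
    """Write to a temp file, then rename, so a kill never leaves a torn file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def train(total_steps: int, path: str, save_every: int = 100) -> dict:
    """Resume from `path`, run until done or SIGTERM, then save a final time."""
    state = load_checkpoint(path)
    while state["step"] < total_steps and not stop_requested:
        state["step"] += 1  # stand-in for one real training step
        if state["step"] % save_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state
```

A container entrypoint could call `train(total_steps, "/checkpoints/state.json")`, where the path is a placeholder on a mounted volume. The final save has to complete within `terminationGracePeriodSeconds`, which is why large checkpoints call for a longer grace period.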

PDB (Pod Disruption Budget) Configuration

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  namespace: ml-serving
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-server
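With `minAvailable: 2` and the three-replica `inference-server` Deployment shown earlier, only one pod can be evicted at a time. The arithmetic is small enough to sketch in Python; the helper name is illustrative.

```python
def allowed_evictions(healthy_pods: int, min_available: int) -> int:
    """Return how many voluntary evictions a minAvailable PDB permits right now."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    print(allowed_evictions(healthy_pods=3, min_available=2))
    print(allowed_evictions(healthy_pods=2, min_available=2))
```

When only two pods are healthy the result is zero, so a node drain waits until a replacement pod becomes ready. Setting `minAvailable` equal to the replica count would block voluntary drains entirely.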

9. Monitoring with Prometheus and Grafana

Karpenter Metrics Collection

# Karpenter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics

Key Karpenter Metrics

+-----------------------------------------------+------------------------------------------+

| Metric | Description |

+-----------------------------------------------+------------------------------------------+

| karpenter_nodeclaims_launched_total | Total NodeClaims launched |

| karpenter_nodeclaims_registered_total | Total NodeClaims registered |

| karpenter_nodeclaims_terminated_total | Total NodeClaims terminated |

| karpenter_pods_state | Pod state (node, namespace, etc.) |

| karpenter_nodepool_usage | Resource usage per NodePool |

| karpenter_nodepool_limit | Resource limits per NodePool |

| karpenter_voluntary_disruption_eligible_nodes | Nodes eligible for voluntary disruption |

| karpenter_disruption_actions_performed_total | Total disruption actions performed |

| karpenter_nodes_allocatable | Allocatable resources per node |

| karpenter_nodes_total_daemon_requests | Total daemon set resource requests |

+-----------------------------------------------+------------------------------------------+

GPU-Specific Grafana Dashboard Queries

# Track GPU node count
count(karpenter_nodes_allocatable{resource_type="nvidia.com/gpu"} > 0)

# GPU utilization (requires DCGM Exporter)
DCGM_FI_DEV_GPU_UTIL

# GPU usage vs limits per NodePool
karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}
karpenter_nodepool_limit{resource_type="nvidia.com/gpu"}

# Provisioning latency
histogram_quantile(0.99,
  rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
)

DCGM Exporter Metrics

# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

Alert Rules Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  namespace: monitoring
spec:
  groups:
    - name: karpenter-gpu
      rules:
        # GPU NodePool reaching 90% of limit
        - alert: GPUNodePoolNearLimit
          expr: |
            karpenter_nodepool_usage{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
              /
            karpenter_nodepool_limit{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
              > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU NodePool approaching resource limit'
        # Low GPU utilization detected
        - alert: LowGPUUtilization
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 1h
          labels:
            severity: info
          annotations:
            summary: 'GPU utilization below 10 percent for 1 hour'
        # Karpenter provisioning failure
        - alert: KarpenterProvisioningFailed
          expr: |
            increase(karpenter_nodeclaims_terminated_total{reason="ProvisioningFailed"}[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: 'Karpenter failed to provision GPU node'

10. Real-World Example: Training Cluster

Distributed Training Cluster Configuration

# PyTorch distributed training Job
apiVersion: batch/v1
kind: Job
metadata:
  namespace: ml-training
spec:
  # Indexed mode is required for the job-completion-index annotation used below
  completionMode: Indexed
  parallelism: 4
  completions: 4
  template:
    metadata:
      labels:
        app: distributed-training
      annotations:
        karpenter.sh/do-not-disrupt: 'true'
    spec:
      containers:
        - name: pytorch-trainer
          image: my-pytorch-training:v1
          command: ['torchrun']
          args:
            - '--nproc_per_node=1'
            - '--nnodes=4'
            - '--node_rank=$(JOB_COMPLETION_INDEX)'
            - '--master_addr=training-master'
            - '--master_port=29500'
            - 'train.py'
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: '8'
              memory: 32Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-training
      restartPolicy: OnFailure
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-pvc

11. Real-World Example: Inference Cluster

Auto-Scaling Inference Service

# Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Llama-3-8B'
            - '--tensor-parallel-size'
            - '1'
            - '--gpu-memory-utilization'
            - '0.9'
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
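Since GPU memory is the primary constraint, it helps to estimate whether the model fits before choosing a GPU type. The Python sketch below is a back-of-the-envelope estimate that assumes roughly 8 billion parameters stored in 16-bit precision and ignores CUDA context and framework overhead. The per-GPU memory figures come from the instance table, and the helper names are illustrative.

```python
# Per-GPU memory in GB for the GPU types the inference NodePool allows.
GPU_VRAM_GB = {"t4": 16, "a10g": 24, "l4": 24}


def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size; 2 bytes per parameter assumes FP16 or BF16."""
    return params_billion * bytes_per_param


def kv_cache_budget_gb(gpu: str, params_billion: float,
                       utilization: float = 0.9) -> float:
    """Memory left for KV cache and activations; negative means no fit."""
    return GPU_VRAM_GB[gpu] * utilization - weights_gb(params_billion)


if __name__ == "__main__":
    for gpu in GPU_VRAM_GB:
        left = kv_cache_budget_gb(gpu, params_billion=8)
        verdict = "fits" if left > 0 else "does not fit"
        print(f"{gpu}: {left:+.1f} GB after weights ({verdict})")
```

Under these assumptions an unquantized 8B model leaves a few gigabytes for KV cache on a 24 GB A10G or L4 but does not fit on a 16 GB T4, so a pool serving this Deployment would likely need to exclude `t4` or serve a quantized model.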

HPA Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
        target:
          type: AverageValue
          averageValue: '70'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
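The HPA computes its target as ceil(currentReplicas × currentValue / targetValue) and then applies the replica bounds and the `behavior` policies. The Python sketch below models one evaluation with the values above. It ignores stabilization windows and the controller's tolerance band, and `current_value` stands for the per-pod average of whichever metric the HPA tracks, since the manifest does not show the metric name.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float = 70,
                     min_replicas: int = 2, max_replicas: int = 10,
                     max_step_up: int = 2, max_step_down: int = 1) -> int:
    """Return the replica count after one HPA evaluation (simplified)."""
    desired = math.ceil(current_replicas * current_value / target_value)
    # scaleUp policy: add at most 2 pods per period.
    desired = min(desired, current_replicas + max_step_up)
    # scaleDown policy: remove at most 1 pod per period.
    desired = max(desired, current_replicas - max_step_down)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    print(desired_replicas(2, 140))  # metric at twice the target
    print(desired_replicas(2, 350))  # large spike, capped by the scale-up policy
    print(desired_replicas(4, 35))   # metric at half the target
```

Each new replica requests one GPU, so when existing nodes are full the extra pods stay pending until Karpenter provisions capacity. The readiness delay and node startup time together set how quickly a scale-up actually takes effect.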

12. Troubleshooting Guide

Common GPU Node Issues

# 1. GPU resources not showing on node
kubectl describe node gpu-node | grep -A 10 "Allocatable"
# If nvidia.com/gpu is missing, check GPU Operator

# 2. Check GPU Operator pod status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. Check Karpenter provisioning logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter \
  | grep -i "gpu\|nvidia\|instance-type"

# 4. Check NodeClaim status
kubectl get nodeclaims -o wide

# 5. Analyze pending pod causes
kubectl describe pod gpu-pod-name | grep -A 20 "Events"

Common Issues and Solutions

+---------------------------------------------+------------------------------------------+

| Issue | Solution |

+---------------------------------------------+------------------------------------------+

| GPU resources not showing on node | Reinstall GPU Operator or verify drivers |

| Cannot find Spot GPU instances | Add more GPU instance types and AZs |

| GPU node provisioning timeout | Check EC2NodeClass subnet/SG tags |

| Node disrupted during training | Add do-not-disrupt annotation |

| GPU out of memory (OOM) | Allow larger GPU instance types |

| Idle GPU nodes persisting | Review consolidation policy and |

| | consolidateAfter value |

| Only specific GPU type being provisioned | Expand NodePool requirements range |

+---------------------------------------------+------------------------------------------+

GPU Memory Debugging

# Check GPU status directly on node (using debug pod)
kubectl run gpu-debug --rm -it \
  --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
  --overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"nodeSelector":{"node-type":"gpu"}}}' \
  --restart=Never \
  -- nvidia-smi

13. Best Practices Summary

GPU Node Management Checklist

+---+------------------------------------------------------------+

| # | Best Practice |

+---+------------------------------------------------------------+

| 1 | Separate NodePools for inference and training workloads |

| 2 | Set GPU taints to prevent non-GPU workload scheduling |

| 3 | Use Spot for inference, On-Demand for training |

| 4 | Apply do-not-disrupt annotation for long training jobs |

| 5 | Implement checkpoint strategy to protect training progress |

| 6 | Automate driver management with GPU Operator |

| 7 | Collect GPU metrics with DCGM Exporter |

| 8 | Set GPU cost caps with NodePool limits |

| 9 | Allow multiple GPU instance types for availability |

| 10| Guarantee minimum availability with PDB for inference |

| 11| Block disruption during training with Disruption Budgets |

| 12| Combine HPA with Karpenter for auto-scaling |

+---+------------------------------------------------------------+

Cost Optimization Strategy Summary

Strategy 1: Tiered NodePools

- Spot GPU (high weight) -> On-Demand GPU (low weight)

- Optimal for inference workloads

Strategy 2: Instance Diversification

- Allow multiple GPU families (g4dn, g5, g6)

- Allow multiple instance sizes

- Maximize Spot availability

Strategy 3: Auto Scale-Down

- WhenEmpty consolidation to remove idle GPU nodes once consolidateAfter elapses

- Short consolidateAfter for inference

- Longer wait time for training nodes

Strategy 4: Appropriate Resource Limits

- Limit max GPUs with NodePool limits

- Prevent unexpected cost spikes

- Manage quotas per team/project

Final Architecture Diagram: Karpenter + GPU

[Diagram: EKS Cluster]
