Authors
- Youngju Kim (@fjvbn20031)

Table of Contents
1. GPU Node Provisioning with Karpenter
2. GPU NodePool Configuration
3. GPU-Optimized EC2NodeClass
4. Spot GPU Instance Strategy
5. NVIDIA GPU Operator Integration
6. Multi-Architecture Support (x86 + ARM/Graviton)
7. Cost Optimization Strategies
8. Node Disruption Budgets for GPU Workloads
9. Monitoring with Prometheus and Grafana
10. Real-World Example: Training Cluster
11. Real-World Example: Inference Cluster
12. Troubleshooting Guide
13. Best Practices Summary
1. GPU Node Provisioning with Karpenter
The Unique Nature of GPU Workloads
AI/ML workloads have distinct requirements compared to general-purpose workloads:
+---------------------------------------------------------------+
| GPU Workload Characteristics |
+---------------------------------------------------------------+
| - Expensive GPU instances (dollars to tens of dollars/hour) |
| - Long-running training jobs (hours to days) |
| - Low-latency requirements for inference |
| - GPU memory (VRAM) as the primary resource constraint |
| - Significant performance differences across instance types |
| - Risk of losing training progress on Spot interruption |
+---------------------------------------------------------------+
Why Karpenter Excels for GPU Management
+------------------------------------------+
| Traditional (Cluster Autoscaler) |
| |
| GPU Node Group A: p3.2xlarge |
| GPU Node Group B: g5.xlarge |
| GPU Node Group C: g5.2xlarge |
| GPU Node Group D: p4d.24xlarge |
| ... |
| Manage each Node Group separately |
| (inefficient) |
+------------------------------------------+
+------------------------------------------+
| Karpenter Approach |
| |
| Single GPU NodePool: |
| - Analyze pod requirements |
| - Auto-select optimal GPU instance |
| - Automatic Spot/On-Demand switching |
| - Cost-based instance optimization |
+------------------------------------------+
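To make the pod-driven flow concrete, here is a hedged sketch of a pending pod as Karpenter would see it. The pod name, image tag, and the choice of `a10g` are illustrative; the point is that Karpenter reads the resource requests and well-known node labels and then launches a fitting GPU instance on its own:

```yaml
# Hypothetical pending pod: Karpenter inspects these requests and
# node selectors, then provisions the cheapest GPU instance that fits.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo
spec:
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        requests:
          nvidia.com/gpu: '1'
        limits:
          nvidia.com/gpu: '1'
  nodeSelector:
    # Well-known Karpenter label: constrain to a specific GPU model
    karpenter.k8s.aws/instance-gpu-name: a10g
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```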
2. GPU NodePool Configuration
General-Purpose GPU NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-general
spec:
  template:
    metadata:
      labels:
        node-type: gpu
        workload: ai-ml
    spec:
      requirements:
        # Select only GPU instances
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        # GPU instance families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g', 'p']
        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
        # Availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']
        # x86 architecture only
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']
      # GPU-dedicated taint
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
      # Longer expiration for GPU nodes
      expireAfter: 336h # 14 days
  limits:
    cpu: '500'
    memory: 2000Gi
    nvidia.com/gpu: '100'
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
    budgets:
      - nodes: '1'
  weight: 80
AWS GPU Instance Type Guide
+------------------+----------+----------+------------------+---------------------+
| Instance Type | GPU | Count | GPU Memory | Primary Use Case |
+------------------+----------+----------+------------------+---------------------+
| g4dn.xlarge | T4 | 1 | 16 GB | Inference, light ML |
| g4dn.12xlarge | T4 | 4 | 64 GB | Multi-inference |
| g5.xlarge | A10G | 1 | 24 GB | Inference, fine-tune|
| g5.12xlarge | A10G | 4 | 96 GB | Medium training |
| g5.48xlarge | A10G | 8 | 192 GB | Large training |
| g6.xlarge | L4 | 1 | 24 GB | Inference optimized |
| g6.12xlarge | L4 | 4 | 96 GB | Multimodal inference|
| p3.2xlarge | V100 | 1 | 16 GB | General training |
| p3.8xlarge | V100 | 4 | 64 GB | Large training |
| p4d.24xlarge | A100 | 8 | 320 GB (40GB x8) | Ultra-large training|
| p5.48xlarge | H100 | 8 | 640 GB (80GB x8) | Maximum performance |
+------------------+----------+----------+------------------+---------------------+
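Since VRAM is usually the binding constraint, instance selection often starts from "how much GPU memory does the model need in total?". The helper below is an illustrative sketch (not part of Karpenter or any AWS API) that encodes a subset of the table above and returns the instance types with enough total GPU memory; the memory figures mirror the table and should be treated as approximate:

```python
# Illustrative helper: given the table above, list instance types whose
# total GPU memory fits a model's VRAM requirement, smallest first.
GPU_INSTANCES = {
    "g4dn.xlarge":  {"gpu": "T4",   "count": 1, "vram_gb": 16},
    "g5.xlarge":    {"gpu": "A10G", "count": 1, "vram_gb": 24},
    "g5.12xlarge":  {"gpu": "A10G", "count": 4, "vram_gb": 96},
    "p4d.24xlarge": {"gpu": "A100", "count": 8, "vram_gb": 320},
    "p5.48xlarge":  {"gpu": "H100", "count": 8, "vram_gb": 640},
}

def candidates(required_vram_gb: int) -> list[str]:
    """Return instance types with enough total GPU memory, smallest first."""
    fits = [(spec["vram_gb"], name)
            for name, spec in GPU_INSTANCES.items()
            if spec["vram_gb"] >= required_vram_gb]
    return [name for _, name in sorted(fits)]

# A 70B-parameter model needing ~140 GB of VRAM rules out single-GPU instances:
print(candidates(140))  # ['p4d.24xlarge', 'p5.48xlarge']
```

In practice you would express the same constraint declaratively via `karpenter.k8s.aws/instance-gpu-memory` or `instance-gpu-name` requirements rather than hand-picking instance types.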
Inference-Only NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: inference
    spec:
      requirements:
        # Inference-suitable instances
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']
        # Spot instances preferred (inference is stateless)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']
        # Instance size constraint
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  limits:
    nvidia.com/gpu: '50'
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m
  weight: 60
Training-Only NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training
spec:
  template:
    metadata:
      labels:
        node-type: gpu-training
        workload: training
    spec:
      requirements:
        # High-performance GPUs for training
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a100', 'h100', 'a10g']
        # On-Demand only (training interruption is costly)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        # Must carry at least one GPU
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training
      # Long expiration for training nodes
      expireAfter: 720h # 30 days
  limits:
    nvidia.com/gpu: '32'
  disruption:
    # Only reclaim empty nodes; the zero-node budget blocks all other disruption
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      - nodes: '0'
  weight: 90
3. GPU-Optimized EC2NodeClass
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-optimized
spec:
  # Base AMI (NVIDIA drivers are installed by the GPU Operator)
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        network-type: private
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  # Large disk for GPU workloads
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 6000
        throughput: 250
        encrypted: true
        deleteOnTermination: true
  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  tags:
    Environment: production
    NodeType: gpu
    ManagedBy: karpenter
  # Bootstrap script for GPU nodes
  userData: |
    #!/bin/bash
    echo "GPU node bootstrap"
    # NVIDIA drivers are handled by the GPU Operator
Training-Specific EC2NodeClass (Large Storage)
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-training
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  # Large, high-performance storage for training data
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
        deleteOnTermination: true
  tags:
    Environment: production
    NodeType: gpu-training
    ManagedBy: karpenter
4. Spot GPU Instance Strategy
Cost Savings with Spot GPU
+------------------+-------------------+-------------------+---------+
| Instance Type | On-Demand (hr) | Spot Est. (hr) | Savings |
+------------------+-------------------+-------------------+---------+
| g4dn.xlarge | ~0.526 | ~0.158 | ~70% |
| g5.xlarge | ~1.006 | ~0.302 | ~70% |
| g5.2xlarge | ~1.212 | ~0.364 | ~70% |
| g5.12xlarge | ~5.672 | ~1.702 | ~70% |
| p3.2xlarge | ~3.060 | ~0.918 | ~70% |
+------------------+-------------------+-------------------+---------+
(Prices vary by region and time)
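As a sanity check on the savings column, the relationship is simply savings = 1 - spot / on-demand. The snippet below recomputes it from the approximate table figures (not live quotes):

```python
# Recompute the savings column: savings = 1 - spot / on_demand.
# Prices are the approximate table figures, not live AWS quotes.
PRICES = {
    # instance: (on_demand_per_hr, spot_per_hr)
    "g4dn.xlarge": (0.526, 0.158),
    "g5.xlarge":   (1.006, 0.302),
    "p3.2xlarge":  (3.060, 0.918),
}

def savings_pct(on_demand: float, spot: float) -> int:
    """Percent saved by running on Spot instead of On-Demand, rounded."""
    return round((1 - spot / on_demand) * 100)

for name, (od, spot) in PRICES.items():
    print(f"{name}: ~{savings_pct(od, spot)}% cheaper on Spot")  # ~70% for each row
```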
Spot GPU NodePool for Inference
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-spot
        workload: inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        # Diverse GPU types for inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']
        # Various sizes to widen the Spot pool
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge']
        # Multiple AZs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  limits:
    nvidia.com/gpu: '40'
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
  weight: 70
Spot Interruption Mitigation
# Apply the do-not-disrupt annotation to long-running training jobs
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  annotations:
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: training
      image: my-training-image:latest
      resources:
        requests:
          nvidia.com/gpu: '1'
          cpu: '4'
          memory: 16Gi
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  terminationGracePeriodSeconds: 120
5. NVIDIA GPU Operator Integration
GPU Operator Overview
+----------------------------------------------------------------+
| NVIDIA GPU Operator |
| |
| +------------------+ +-------------------+ +--------------+ |
| | NVIDIA Driver | | Container Toolkit | | Device Plugin| |
| | (Auto Install) | | (Auto Config) | | (Auto Deploy)| |
| +------------------+ +-------------------+ +--------------+ |
| |
| +------------------+ +-------------------+ +--------------+ |
| | GPU Feature | | DCGM Exporter | | MIG Manager | |
| | Discovery | | (Metrics) | | (MIG Mgmt) | |
| +------------------+ +-------------------+ +--------------+ |
+----------------------------------------------------------------+
Installing GPU Operator
# Add NVIDIA GPU Operator Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set dcgmExporter.enabled=true \
--set migManager.enabled=false \
--set gfd.enabled=true
Verifying GPU Operator with Karpenter
# Check NVIDIA labels on GPU nodes
kubectl get nodes -l node-type=gpu -o json | \
jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia")))'
# Verify GPU resources
kubectl describe node gpu-node-name | grep -A 5 "nvidia.com/gpu"
# Check DCGM Exporter pods
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter
GPU Workload Deployment Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference-server
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc
6. Multi-Architecture Support (x86 + ARM/Graviton)
Multi-Architecture NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    spec:
      requirements:
        # Allow both x86 and ARM
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64', 'arm64']
        # Include Graviton instance families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['5']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: '1000'
    memory: 2000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
Graviton GPU Alternative: Inferentia/Trainium
# AWS Inferentia inference-only NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inferentia
spec:
  template:
    metadata:
      labels:
        accelerator: inferentia
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: aws.amazon.com/neuron
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: inferentia-nodes
  limits:
    aws.amazon.com/neuron: '32'
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
7. Cost Optimization Strategies
Mixed Spot and On-Demand Strategy
+-------------------------------------------------------------+
| Cost Optimization Decision Tree |
+-------------------------------------------------------------+
| |
| Identify workload type |
| | |
| +-- Inference (Stateless) --> Spot first + OD fallback |
| | |
| +-- Fine-tuning (Short) --> Spot + checkpoint strategy |
| | |
| +-- Large Training (Long) --> On-Demand + Reserved |
| | |
| +-- Batch Processing --> Spot only |
| |
+-------------------------------------------------------------+
Weighted Priority Instance Family Strategy
# Tier 1: G5 Spot (most cost-effective for inference)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier1-g5-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 100
  limits:
    nvidia.com/gpu: '20'
---
# Tier 2: G4dn Spot (fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier2-g4dn-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 50
  limits:
    nvidia.com/gpu: '20'
---
# Tier 3: G5 On-Demand (last resort)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier3-g5-ondemand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 10
  limits:
    nvidia.com/gpu: '10'
Consolidation Policy Optimization
# GPU node consolidation settings (fragment of a NodePool spec)
disruption:
  # Use WhenEmpty only for GPU (protect running GPU jobs)
  consolidationPolicy: WhenEmpty
  # Wait 5 minutes after a node becomes empty (ride out brief idle periods)
  consolidateAfter: 5m
  budgets:
    # Disrupt at most 1 node at a time
    - nodes: '1'
    # Block disruption during business hours
    - nodes: '0'
      schedule: '0 9 * * MON-FRI'
      duration: 10h
8. Node Disruption Budgets for GPU Workloads
Disruption Budget for GPU Workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training-protected
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      # Block disruption completely during training hours
      - nodes: '0'
        schedule: '0 0 * * *'
        duration: 23h
      # Maintenance window (1 hour daily)
      - nodes: '1'
        schedule: '0 23 * * *'
        duration: 1h
      # Manage drift-related disruption separately
      - nodes: '1'
        reasons:
          - 'Drifted'
Pod-Level Protection
# Long-running training pod: prevent Karpenter disruption
apiVersion: v1
kind: Pod
metadata:
  name: long-training-job
  annotations:
    # This annotation prevents Karpenter voluntary disruption
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: trainer
      image: my-training-image:v1
      resources:
        requests:
          nvidia.com/gpu: '4'
          cpu: '16'
          memory: 64Gi
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # Sufficient grace period for checkpoint saving
  terminationGracePeriodSeconds: 300
PDB (Pod Disruption Budget) Configuration
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-server-pdb
  namespace: ml-serving
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-server
9. Monitoring with Prometheus and Grafana
Karpenter Metrics Collection
# Karpenter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics
Key Karpenter Metrics
+-----------------------------------------------+------------------------------------------+
| Metric | Description |
+-----------------------------------------------+------------------------------------------+
| karpenter_nodeclaims_launched_total | Total NodeClaims launched |
| karpenter_nodeclaims_registered_total | Total NodeClaims registered |
| karpenter_nodeclaims_terminated_total | Total NodeClaims terminated |
| karpenter_pods_state | Pod state (node, namespace, etc.) |
| karpenter_nodepool_usage | Resource usage per NodePool |
| karpenter_nodepool_limit | Resource limits per NodePool |
| karpenter_voluntary_disruption_eligible_nodes | Nodes eligible for voluntary disruption |
| karpenter_disruption_actions_performed_total | Total disruption actions performed |
| karpenter_nodes_allocatable | Allocatable resources per node |
| karpenter_nodes_total_daemon_requests | Total daemon set resource requests |
+-----------------------------------------------+------------------------------------------+
GPU-Specific Grafana Dashboard Queries
# Track GPU node count
count(karpenter_nodes_allocatable{resource_type="nvidia.com/gpu"} > 0)
# GPU utilization (requires DCGM Exporter)
DCGM_FI_DEV_GPU_UTIL
# GPU usage vs limits per NodePool
karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}
/
karpenter_nodepool_limit{resource_type="nvidia.com/gpu"}
# Provisioning latency
histogram_quantile(0.99,
rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
)
DCGM Exporter Metrics
# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s
Alert Rules Example
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: karpenter-gpu
      rules:
        # GPU NodePool reaching 90% of its limit
        - alert: GPUNodePoolNearLimit
          expr: |
            karpenter_nodepool_usage{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            /
            karpenter_nodepool_limit{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU NodePool approaching resource limit'
        # Low GPU utilization detected
        - alert: LowGPUUtilization
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 1h
          labels:
            severity: info
          annotations:
            summary: 'GPU utilization below 10 percent for 1 hour'
        # Karpenter provisioning failure
        - alert: KarpenterProvisioningFailed
          expr: |
            increase(karpenter_nodeclaims_terminated_total{reason="ProvisioningFailed"}[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: 'Karpenter failed to provision GPU node'
10. Real-World Example: Training Cluster
Distributed Training Cluster Configuration
# PyTorch distributed training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  namespace: ml-training
spec:
  # Indexed mode is required for the job-completion-index annotation used below
  completionMode: Indexed
  parallelism: 4
  completions: 4
  template:
    metadata:
      labels:
        app: distributed-training
      annotations:
        karpenter.sh/do-not-disrupt: 'true'
    spec:
      containers:
        - name: pytorch-trainer
          image: my-pytorch-training:v1
          command: ['torchrun']
          args:
            - '--nproc_per_node=1'
            - '--nnodes=4'
            - '--node_rank=$(JOB_COMPLETION_INDEX)'
            - '--master_addr=training-master'
            - '--master_port=29500'
            - 'train.py'
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: '8'
              memory: 32Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-training
      restartPolicy: OnFailure
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-pvc
11. Real-World Example: Inference Cluster
Auto-Scaling Inference Service
# Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Llama-3-8B'
            - '--tensor-parallel-size'
            - '1'
            - '--gpu-memory-utilization'
            - '0.9'
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
---
# HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Custom pod metric: assumes an adapter (e.g. prometheus-adapter)
    # exposes DCGM GPU utilization to the custom metrics API as gpu_utilization
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '70'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
12. Troubleshooting Guide
Common GPU Node Issues
# 1. GPU resources not showing on node
kubectl describe node gpu-node | grep -A 10 "Allocatable"
# If nvidia.com/gpu is missing, check GPU Operator
# 2. Check GPU Operator pod status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset
# 3. Check Karpenter provisioning logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter \
| grep -i "gpu\|nvidia\|instance-type"
# 4. Check NodeClaim status
kubectl get nodeclaims -o wide
# 5. Analyze pending pod causes
kubectl describe pod gpu-pod-name | grep -A 20 "Events"
Common Issues and Solutions
+---------------------------------------------+------------------------------------------+
| Issue | Solution |
+---------------------------------------------+------------------------------------------+
| GPU resources not showing on node | Reinstall GPU Operator or verify drivers |
| Cannot find Spot GPU instances | Add more GPU instance types and AZs |
| GPU node provisioning timeout | Check EC2NodeClass subnet/SG tags |
| Node disrupted during training | Add do-not-disrupt annotation |
| GPU out of memory (OOM) | Allow larger GPU instance types |
| Idle GPU nodes persisting | Review consolidation policy and |
| | consolidateAfter value |
| Only specific GPU type being provisioned | Expand NodePool requirements range |
+---------------------------------------------+------------------------------------------+
GPU Memory Debugging
# Check GPU status directly on node (using debug pod)
kubectl run gpu-debug --rm -it \
--image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
--overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"nodeSelector":{"node-type":"gpu"}}}' \
--restart=Never \
-- nvidia-smi
13. Best Practices Summary
GPU Node Management Checklist
+---+------------------------------------------------------------+
| # | Best Practice |
+---+------------------------------------------------------------+
| 1 | Separate NodePools for inference and training workloads |
| 2 | Set GPU taints to prevent non-GPU workload scheduling |
| 3 | Use Spot for inference, On-Demand for training |
| 4 | Apply do-not-disrupt annotation for long training jobs |
| 5 | Implement checkpoint strategy to protect training progress |
| 6 | Automate driver management with GPU Operator |
| 7 | Collect GPU metrics with DCGM Exporter |
| 8 | Set GPU cost caps with NodePool limits |
| 9 | Allow multiple GPU instance types for availability |
| 10| Guarantee minimum availability with PDB for inference |
| 11| Block disruption during training with Disruption Budgets |
| 12| Combine HPA with Karpenter for auto-scaling |
+---+------------------------------------------------------------+
Cost Optimization Strategy Summary
Strategy 1: Tiered NodePools
- Spot GPU (high weight) -> On-Demand GPU (low weight)
- Optimal for inference workloads
Strategy 2: Instance Diversification
- Allow multiple GPU families (g4dn, g5, g6)
- Allow multiple instance sizes
- Maximize Spot availability
Strategy 3: Auto Scale-Down
- WhenEmpty consolidation to remove idle GPU nodes immediately
- Short consolidateAfter for inference
- Longer wait time for training nodes
Strategy 4: Appropriate Resource Limits
- Limit max GPUs with NodePool limits
- Prevent unexpected cost spikes
- Manage quotas per team/project
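Per-team quota management (Strategy 4) can also be enforced at the namespace level with a Kubernetes ResourceQuota on the extended GPU resource. A minimal sketch, with hypothetical namespace and limit values (note that extended resources only support the `requests.` prefix in quotas):

```yaml
# Illustrative per-team GPU quota; namespace and limit are examples
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-ml-gpu-quota
  namespace: team-ml
spec:
  hard:
    # Caps total requested GPUs across all pods in the namespace
    requests.nvidia.com/gpu: '8'
```

This complements NodePool limits: the NodePool caps cluster-wide GPU spend, while the quota partitions that capacity between teams.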
Final Architecture Diagram: Karpenter + GPU
+---------------------------------------------------------------------+
| EKS Cluster |
| |
| +-------------------+ +-------------------+ +-----------------+ |
| | NodePool: | | NodePool: | | NodePool: | |
| | gpu-inference | | gpu-training | | multi-arch | |
| | (Spot, weight:60) | | (OD, weight:90) | | (Mixed, w:50) | |
| +--------+----------+ +--------+----------+ +--------+--------+ |
| | | | |
| +--------v----------+ +--------v----------+ +--------v--------+ |
| | EC2NodeClass: | | EC2NodeClass: | | EC2NodeClass: | |
| | gpu-optimized | | gpu-training | | default | |
| | (200GB, gp3) | | (500GB, gp3) | | (100GB, gp3) | |
| +-------------------+ +-------------------+ +-----------------+ |
| |
| +-------------------+ +-------------------+ |
| | GPU Operator | | Prometheus + | |
| | (NVIDIA Driver, | | Grafana | |
| | Device Plugin, | | (Karpenter + | |
| | DCGM Exporter) | | DCGM Metrics) | |
| +-------------------+ +-------------------+ |
+---------------------------------------------------------------------+