[AWS] Managing GPU Nodes with Karpenter: AI/ML Workload Optimization


1. GPU Node Provisioning with Karpenter

The Unique Nature of GPU Workloads

AI/ML workloads have distinct requirements compared to general-purpose computing:

+---------------------------------------------------------------+
|               GPU Workload Characteristics                    |
+---------------------------------------------------------------+
| - Expensive GPU instances (dollars to tens of dollars/hour)   |
| - Long-running training jobs (hours to days)                  |
| - Low-latency requirements for inference                      |
| - GPU memory (VRAM) as the primary resource constraint        |
| - Significant performance differences across instance types   |
| - Risk of losing training progress on Spot interruption       |
+---------------------------------------------------------------+

Why Karpenter Excels for GPU Management

+------------------------------------------+
|     Traditional (Cluster Autoscaler)     |
|                                          |
|  GPU Node Group A: p3.2xlarge            |
|  GPU Node Group B: g5.xlarge             |
|  GPU Node Group C: g5.2xlarge            |
|  GPU Node Group D: p4d.24xlarge          |
|  ...                                     |
|  Manage each Node Group separately       |
|  (inefficient)                           |
+------------------------------------------+

+------------------------------------------+
|            Karpenter Approach            |
|                                          |
|  Single GPU NodePool:                    |
|  - Analyze pod requirements              |
|  - Auto-select optimal GPU instance      |
|  - Automatic Spot/On-Demand switching    |
|  - Cost-based instance optimization      |
+------------------------------------------+

2. GPU NodePool Configuration

General-Purpose GPU NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-general
spec:
  template:
    metadata:
      labels:
        node-type: gpu
        workload: ai-ml
    spec:
      requirements:
        # Select only GPU instances
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

        # GPU instance families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g', 'p']

        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

        # Availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

        # x86 architecture only
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']

      # GPU-dedicated taint
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

      # Longer expiration for GPU nodes
      expireAfter: 336h # 14 days

  limits:
    cpu: '500'
    memory: 2000Gi
    nvidia.com/gpu: '100'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
    budgets:
      - nodes: '1'

  weight: 80

AWS GPU Instance Type Guide

+------------------+----------+----------+------------------+---------------------+
| Instance Type    | GPU      | Count    | GPU Memory       | Primary Use Case    |
+------------------+----------+----------+------------------+---------------------+
| g4dn.xlarge      | T4       | 1        | 16 GB            | Inference, light ML |
| g4dn.12xlarge    | T4       | 4        | 64 GB            | Multi-inference     |
| g5.xlarge        | A10G     | 1        | 24 GB            | Inference, fine-tune|
| g5.12xlarge      | A10G     | 4        | 96 GB            | Medium training     |
| g5.48xlarge      | A10G     | 8        | 192 GB           | Large training      |
| g6.xlarge        | L4       | 1        | 24 GB            | Inference optimized |
| g6.12xlarge      | L4       | 4        | 96 GB            | Multimodal inference|
| p3.2xlarge       | V100     | 1        | 16 GB            | General training    |
| p3.8xlarge       | V100     | 4        | 64 GB            | Large training      |
| p4d.24xlarge     | A100     | 8        | 320 GB (40GB x8) | Ultra-large training|
| p5.48xlarge      | H100     | 8        | 640 GB (80GB x8) | Maximum performance |
+------------------+----------+----------+------------------+---------------------+
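
With a catalog like the table above, picking the right instance usually starts from the workload's total GPU-memory requirement. A minimal sketch of that selection logic; the `GPU_CATALOG` dict and `smallest_fit` helper are illustrative, not a Karpenter API:

```python
# Hypothetical helper: pick the smallest instance from the table above that
# satisfies a job's total GPU-memory requirement. Entries mirror the table:
# instance name -> (total VRAM in GB, GPU count).
GPU_CATALOG = {
    "g4dn.xlarge": (16, 1),
    "g5.xlarge": (24, 1),
    "g4dn.12xlarge": (64, 4),
    "g5.12xlarge": (96, 4),
    "g5.48xlarge": (192, 8),
    "p4d.24xlarge": (320, 8),
    "p5.48xlarge": (640, 8),
}

def smallest_fit(vram_gb_needed: int) -> str:
    """Return the smallest catalog entry with enough total GPU memory."""
    for name, (vram, _count) in sorted(GPU_CATALOG.items(), key=lambda kv: kv[1][0]):
        if vram >= vram_gb_needed:
            return name
    raise ValueError(f"no instance offers {vram_gb_needed} GB of GPU memory")

print(smallest_fit(80))  # → g5.12xlarge (96 GB across 4x A10G)
```

Karpenter performs an analogous fit internally when it matches pending pod requests against allowed instance types, which is why wide `requirements` ranges usually beat hand-picked instance lists.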

Inference-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: inference
    spec:
      requirements:
        # Inference-suitable instances
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Spot instances preferred (inference is stateless)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']

        # Instance size constraint
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '50'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m

  weight: 60

Training-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training
spec:
  template:
    metadata:
      labels:
        node-type: gpu-training
        workload: training
    spec:
      requirements:
        # High-performance GPUs for training
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a100', 'h100', 'a10g']

        # On-Demand only (training interruption is costly)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

        # Must include at least one GPU
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

      # Long expiration for training nodes
      expireAfter: 720h # 30 days

  limits:
    nvidia.com/gpu: '32'

  disruption:
    # Only reclaim empty nodes; the zero-node budget below blocks all voluntary disruption
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      - nodes: '0'

  weight: 90

3. GPU-Optimized EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-optimized
spec:
  # AMI with GPU driver support
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        network-type: private

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large disk for GPU workloads
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 6000
        throughput: 250
        encrypted: true
        deleteOnTermination: true

  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required

  tags:
    Environment: production
    NodeType: gpu
    ManagedBy: karpenter

  # Bootstrap script for GPU nodes
  userData: |
    #!/bin/bash
    echo "GPU node bootstrap"
    # NVIDIA drivers are handled by GPU Operator

Training-Specific EC2NodeClass (Large Storage)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-training
spec:
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large, high-performance storage for training data
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
        deleteOnTermination: true

  tags:
    Environment: production
    NodeType: gpu-training
    ManagedBy: karpenter

4. Spot GPU Instance Strategy

Cost Savings with Spot GPU

+------------------+-------------------+-------------------+---------+
| Instance Type    | On-Demand ($/hr)  | Spot Est. ($/hr)  | Savings |
+------------------+-------------------+-------------------+---------+
| g4dn.xlarge      | ~0.526            | ~0.158            | ~70%    |
| g5.xlarge        | ~1.006            | ~0.302            | ~70%    |
| g5.2xlarge       | ~1.212            | ~0.364            | ~70%    |
| g5.12xlarge      | ~5.672            | ~1.702            | ~70%    |
| p3.2xlarge       | ~3.060            | ~0.918            | ~70%    |
+------------------+-------------------+-------------------+---------+
 (Prices vary by region and time)
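
The "Savings" column follows directly from savings = 1 − spot / on-demand. A quick check against the sample prices from the table (illustrative figures, not live quotes):

```python
# Reproduce the "Savings" column: savings = 1 - spot_price / on_demand_price.
# Prices are the illustrative figures from the table above, not live quotes.
prices = {
    "g4dn.xlarge": (0.526, 0.158),
    "g5.xlarge": (1.006, 0.302),
    "p3.2xlarge": (3.060, 0.918),
}

for instance, (on_demand, spot) in prices.items():
    savings = 1 - spot / on_demand
    print(f"{instance}: {savings:.0%} cheaper on Spot")  # each prints ~70%
```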

Spot GPU NodePool for Inference

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-spot
        workload: inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']

        # Diverse GPU types for inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Various sizes for Spot availability
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge']

        # Multiple AZs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '40'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

  weight: 70

Spot Interruption Mitigation

# Apply do-not-disrupt annotation for long-running training jobs
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  annotations:
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: training
      image: my-training-image:latest
      resources:
        requests:
          nvidia.com/gpu: '1'
          cpu: '4'
          memory: 16Gi
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  terminationGracePeriodSeconds: 120
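
The grace period above only pays off if the training process actually traps SIGTERM (sent on Spot interruption or node drain) and checkpoints before exiting. A minimal sketch, assuming a hypothetical `save_checkpoint` that stands in for your framework's own snapshot call (e.g., `torch.save` of model/optimizer state):

```python
# Minimal sketch: catch SIGTERM and flush a checkpoint before exiting.
# save_checkpoint() is a hypothetical stand-in for your framework's snapshot
# logic; in practice it should write to durable storage (S3, EBS, PVC).
import signal
import sys

def save_checkpoint():
    # Placeholder: persist model/optimizer state here.
    print("checkpoint saved")

def handle_sigterm(signum, frame):
    save_checkpoint()
    sys.exit(0)  # exit cleanly within terminationGracePeriodSeconds

signal.signal(signal.SIGTERM, handle_sigterm)
```

Size `terminationGracePeriodSeconds` to comfortably exceed your worst-case checkpoint write time; Spot gives only a two-minute interruption notice, so checkpoints must fit inside that window.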

5. NVIDIA GPU Operator Integration

GPU Operator Overview

+----------------------------------------------------------------+
|                      NVIDIA GPU Operator                       |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | NVIDIA Driver    |  | Container Toolkit |  | Device Plugin| |
|  | (Auto Install)   |  | (Auto Config)     |  | (Auto Deploy)| |
|  +------------------+  +-------------------+  +--------------+ |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | GPU Feature      |  | DCGM Exporter     |  | MIG Manager  | |
|  | Discovery        |  | (Metrics)         |  | (MIG Mgmt)   | |
|  +------------------+  +-------------------+  +--------------+ |
+----------------------------------------------------------------+

Installing GPU Operator

# Add NVIDIA GPU Operator Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false \
  --set gfd.enabled=true

Verifying GPU Operator with Karpenter

# Check NVIDIA labels on GPU nodes
kubectl get nodes -l node-type=gpu -o json | \
  jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia")))'

# Verify GPU resources
kubectl describe node gpu-node-name | grep -A 5 "nvidia.com/gpu"

# Check DCGM Exporter pods
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

GPU Workload Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference-server
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc

6. Multi-Architecture Support (x86 + ARM/Graviton)

Multi-Architecture NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    spec:
      requirements:
        # Allow both x86 and ARM
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64', 'arm64']

        # Include Graviton instances
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']

        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['5']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default

  limits:
    cpu: '1000'
    memory: 2000Gi

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

GPU Alternatives on AWS Silicon: Inferentia/Trainium

# AWS Inferentia inference-only NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inferentia
spec:
  template:
    metadata:
      labels:
        accelerator: inferentia
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

      taints:
        - key: aws.amazon.com/neuron
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: inferentia-nodes

  limits:
    aws.amazon.com/neuron: '32'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m

7. Cost Optimization Strategies

Mixed Spot and On-Demand Strategy

+-------------------------------------------------------------+
|           Cost Optimization Decision Tree                    |
+-------------------------------------------------------------+
|                                                             |
|  Identify workload type                                     |
|      |                                                      |
|      +-- Inference (Stateless) --> Spot first + OD fallback |
|      |                                                      |
|      +-- Fine-tuning (Short) --> Spot + checkpoint strategy |
|      |                                                      |
|      +-- Large Training (Long) --> On-Demand + Reserved     |
|      |                                                      |
|      +-- Batch Processing --> Spot only                     |
|                                                             |
+-------------------------------------------------------------+
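
The decision tree above can be codified as a small lookup. The function name and strategy labels below are illustrative only; the categories are the ones from the diagram:

```python
# Hypothetical codification of the decision tree: map a workload profile to a
# capacity strategy. Labels are illustrative, not Karpenter configuration.
def capacity_strategy(workload: str, long_running: bool = False) -> str:
    if workload == "inference":  # stateless, safe on Spot
        return "spot-first-with-on-demand-fallback"
    if workload == "training":
        # Short fine-tunes tolerate Spot with checkpoints; long runs do not.
        return "on-demand-or-reserved" if long_running else "spot-with-checkpoints"
    if workload == "batch":  # retryable by design
        return "spot-only"
    raise ValueError(f"unknown workload type: {workload}")

print(capacity_strategy("inference"))       # → spot-first-with-on-demand-fallback
print(capacity_strategy("training", True))  # → on-demand-or-reserved
```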

Weighted Priority Instance Family Strategy

# Tier 1: G5 Spot (most cost-effective for inference)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier1-g5-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 100
  limits:
    nvidia.com/gpu: '20'
---
# Tier 2: G4dn Spot (fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier2-g4dn-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 50
  limits:
    nvidia.com/gpu: '20'
---
# Tier 3: G5 On-Demand (last resort)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier3-g5-ondemand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 10
  limits:
    nvidia.com/gpu: '10'
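
Karpenter evaluates higher-weight NodePools first and falls through when a pool cannot satisfy a pod (no Spot capacity, limits reached, requirements unmet). A sketch of the fallback order implied by the three tiers above; the pool data mirrors the YAML, and `provisioning_order` is an illustrative helper, not a Karpenter API:

```python
# Illustrate the fallback order implied by NodePool weights: Karpenter
# considers higher-weight pools first. Pool data mirrors the YAML above.
pools = [
    {"name": "gpu-tier1-g5-spot", "weight": 100},
    {"name": "gpu-tier2-g4dn-spot", "weight": 50},
    {"name": "gpu-tier3-g5-ondemand", "weight": 10},
]

def provisioning_order(pools):
    """Highest weight first, mirroring Karpenter's NodePool preference."""
    return [p["name"] for p in sorted(pools, key=lambda p: p["weight"], reverse=True)]

print(provisioning_order(pools))
# → ['gpu-tier1-g5-spot', 'gpu-tier2-g4dn-spot', 'gpu-tier3-g5-ondemand']
```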

Consolidation Policy Optimization

# GPU node consolidation settings
disruption:
  # Use WhenEmpty for GPU nodes so in-flight GPU jobs are never consolidated away
  consolidationPolicy: WhenEmpty
  # Wait 5 minutes after detecting empty node (handle temporary inactivity)
  consolidateAfter: 5m
  budgets:
    # Disrupt at most 1 node at a time
    - nodes: '1'
    # Block disruption during business hours
    - nodes: '0'
      schedule: '0 9 * * MON-FRI'
      duration: 10h

8. Node Disruption Budgets for GPU Workloads

Disruption Budget for GPU Workloads

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training-protected
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      # Block disruption completely during training hours
      - nodes: '0'
        schedule: '0 0 * * *'
        duration: 23h

      # Maintenance window (1 hour daily)
      - nodes: '1'
        schedule: '0 23 * * *'
        duration: 1h

      # Manage drift-related disruption separately
      - nodes: '1'
        reasons:
          - 'Drifted'

Pod-Level Protection

# Long-running training pod: prevent Karpenter disruption
apiVersion: v1
kind: Pod
metadata:
  name: long-training-job
  annotations:
    # This annotation prevents Karpenter voluntary disruption
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: trainer
      image: my-training-image:v1
      resources:
        requests:
          nvidia.com/gpu: '4'
          cpu: '16'
          memory: 64Gi
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # Sufficient grace period for checkpoint saving
  terminationGracePeriodSeconds: 300

PDB (Pod Disruption Budget) Configuration

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-server-pdb
  namespace: ml-serving
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-server

9. Monitoring with Prometheus and Grafana

Karpenter Metrics Collection

# Karpenter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics

Key Karpenter Metrics

+-----------------------------------------------+------------------------------------------+
| Metric                                        | Description                              |
+-----------------------------------------------+------------------------------------------+
| karpenter_nodeclaims_launched_total           | Total NodeClaims launched                |
| karpenter_nodeclaims_registered_total         | Total NodeClaims registered              |
| karpenter_nodeclaims_terminated_total         | Total NodeClaims terminated              |
| karpenter_pods_state                          | Pod state (node, namespace, etc.)        |
| karpenter_nodepool_usage                      | Resource usage per NodePool              |
| karpenter_nodepool_limit                      | Resource limits per NodePool             |
| karpenter_voluntary_disruption_eligible_nodes | Nodes eligible for voluntary disruption  |
| karpenter_disruption_actions_performed_total  | Total disruption actions performed       |
| karpenter_nodes_allocatable                   | Allocatable resources per node           |
| karpenter_nodes_total_daemon_requests         | Total daemon set resource requests       |
+-----------------------------------------------+------------------------------------------+

GPU-Specific Grafana Dashboard Queries

# Track GPU node count
count(karpenter_nodes_allocatable{resource_type="nvidia.com/gpu"} > 0)

# GPU utilization (requires DCGM Exporter)
DCGM_FI_DEV_GPU_UTIL

# GPU usage vs limits per NodePool
karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}
  /
karpenter_nodepool_limit{resource_type="nvidia.com/gpu"}

# Provisioning latency
histogram_quantile(0.99,
  rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
)

DCGM Exporter Metrics

# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

Alert Rules Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: karpenter-gpu
      rules:
        # GPU NodePool reaching 90% of limit
        - alert: GPUNodePoolNearLimit
          expr: |
            karpenter_nodepool_usage{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            /
            karpenter_nodepool_limit{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU NodePool approaching resource limit'

        # Low GPU utilization detected
        - alert: LowGPUUtilization
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 1h
          labels:
            severity: info
          annotations:
            summary: 'GPU utilization below 10 percent for 1 hour'

        # Karpenter provisioning failure
        - alert: KarpenterProvisioningFailed
          expr: |
            increase(karpenter_nodeclaims_terminated_total{reason="ProvisioningFailed"}[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: 'Karpenter failed to provision GPU node'

10. Real-World Example: Training Cluster

Distributed Training Cluster Configuration

# PyTorch distributed training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  namespace: ml-training
spec:
  parallelism: 4
  completions: 4
  completionMode: Indexed # required to populate the job-completion-index annotation
  template:
    metadata:
      labels:
        app: distributed-training
      annotations:
        karpenter.sh/do-not-disrupt: 'true'
    spec:
      containers:
        - name: pytorch-trainer
          image: my-pytorch-training:v1
          command: ['torchrun']
          args:
            - '--nproc_per_node=1'
            - '--nnodes=4'
            - '--node_rank=$(JOB_COMPLETION_INDEX)'
            - '--master_addr=training-master'
            - '--master_port=29500'
            - 'train.py'
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: '8'
              memory: 32Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-training
      restartPolicy: OnFailure
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-pvc

11. Real-World Example: Inference Cluster

Auto-Scaling Inference Service

# Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Meta-Llama-3-8B'
            - '--tensor-parallel-size'
            - '1'
            - '--gpu-memory-utilization'
            - '0.9'
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
---
# HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          # Custom metric; requires a metrics adapter (e.g., prometheus-adapter)
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '70'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

12. Troubleshooting Guide

Common GPU Node Issues

# 1. GPU resources not showing on node
kubectl describe node gpu-node | grep -A 10 "Allocatable"
# If nvidia.com/gpu is missing, check GPU Operator

# 2. Check GPU Operator pod status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. Check Karpenter provisioning logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter \
  | grep -i "gpu\|nvidia\|instance-type"

# 4. Check NodeClaim status
kubectl get nodeclaims -o wide

# 5. Analyze pending pod causes
kubectl describe pod gpu-pod-name | grep -A 20 "Events"

Common Issues and Solutions

+---------------------------------------------+------------------------------------------+
| Issue                                       | Solution                                 |
+---------------------------------------------+------------------------------------------+
| GPU resources not showing on node           | Reinstall GPU Operator or verify drivers |
| Cannot find Spot GPU instances              | Add more GPU instance types and AZs      |
| GPU node provisioning timeout               | Check EC2NodeClass subnet/SG tags        |
| Node disrupted during training              | Add do-not-disrupt annotation            |
| GPU out of memory (OOM)                     | Allow larger GPU instance types          |
| Idle GPU nodes persisting                   | Review consolidation policy and          |
|                                             | consolidateAfter value                   |
| Only specific GPU type being provisioned    | Expand NodePool requirements range       |
+---------------------------------------------+------------------------------------------+

GPU Memory Debugging

# Check GPU status directly on node (using debug pod)
kubectl run gpu-debug --rm -it \
  --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
  --overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"nodeSelector":{"node-type":"gpu"}}}' \
  --restart=Never \
  -- nvidia-smi

13. Best Practices Summary

GPU Node Management Checklist

+---+------------------------------------------------------------+
| # | Best Practice                                              |
+---+------------------------------------------------------------+
| 1 | Separate NodePools for inference and training workloads    |
| 2 | Set GPU taints to prevent non-GPU workload scheduling      |
| 3 | Use Spot for inference, On-Demand for training             |
| 4 | Apply do-not-disrupt annotation for long training jobs     |
| 5 | Implement checkpoint strategy to protect training progress |
| 6 | Automate driver management with GPU Operator               |
| 7 | Collect GPU metrics with DCGM Exporter                     |
| 8 | Set GPU cost caps with NodePool limits                     |
| 9 | Allow multiple GPU instance types for availability         |
| 10| Guarantee minimum availability with PDB for inference      |
| 11| Block disruption during training with Disruption Budgets   |
| 12| Combine HPA with Karpenter for auto-scaling                |
+---+------------------------------------------------------------+

Cost Optimization Strategy Summary

Strategy 1: Tiered NodePools
  - Spot GPU (high weight) -> On-Demand GPU (low weight)
  - Optimal for inference workloads

Strategy 2: Instance Diversification
  - Allow multiple GPU families (g4dn, g5, g6)
  - Allow multiple instance sizes
  - Maximize Spot availability

Strategy 3: Auto Scale-Down
  - WhenEmpty consolidation to remove idle GPU nodes immediately
  - Short consolidateAfter for inference
  - Longer wait time for training nodes

Strategy 4: Appropriate Resource Limits
  - Limit max GPUs with NodePool limits
  - Prevent unexpected cost spikes
  - Manage quotas per team/project

Final Architecture Diagram: Karpenter + GPU

+---------------------------------------------------------------------+
|                        EKS Cluster                                  |
|                                                                     |
|  +-------------------+  +-------------------+  +-----------------+  |
|  | NodePool:         |  | NodePool:         |  | NodePool:       |  |
|  | gpu-inference     |  | gpu-training      |  | multi-arch      |  |
|  | (Spot, weight:60) |  | (OD, weight:90)   |  | (Mixed, w:50)   |  |
|  +--------+----------+  +--------+----------+  +--------+--------+  |
|           |                      |                      |           |
|  +--------v----------+  +--------v----------+  +--------v--------+  |
|  | EC2NodeClass:     |  | EC2NodeClass:     |  | EC2NodeClass:   |  |
|  | gpu-optimized     |  | gpu-training      |  | default         |  |
|  | (200GB, gp3)      |  | (500GB, gp3)      |  | (100GB, gp3)    |  |
|  +-------------------+  +-------------------+  +-----------------+  |
|                                                                     |
|  +-------------------+  +-------------------+                       |
|  | GPU Operator      |  | Prometheus +      |                       |
|  | (NVIDIA Driver,   |  | Grafana           |                       |
|  |  Device Plugin,   |  | (Karpenter +      |                       |
|  |  DCGM Exporter)   |  |  DCGM Metrics)    |                       |
|  +-------------------+  +-------------------+                       |
+---------------------------------------------------------------------+