[AWS] Managing GPU Nodes with Karpenter: AI/ML Workload Optimization


1. GPU Node Provisioning with Karpenter

The Unique Nature of GPU Workloads

AI/ML workloads have distinct requirements compared to general-purpose computing:

+---------------------------------------------------------------+
|               GPU Workload Characteristics                    |
+---------------------------------------------------------------+
| - Expensive GPU instances (dollars to tens of dollars/hour)   |
| - Long-running training jobs (hours to days)                  |
| - Low-latency requirements for inference                      |
| - GPU memory (VRAM) as the primary resource constraint        |
| - Significant performance differences across instance types   |
| - Risk of losing training progress on Spot interruption       |
+---------------------------------------------------------------+

Why Karpenter Excels for GPU Management

+------------------------------------------+
|     Traditional (Cluster Autoscaler)     |
|                                          |
|  GPU Node Group A: p3.2xlarge            |
|  GPU Node Group B: g5.xlarge             |
|  GPU Node Group C: g5.2xlarge            |
|  GPU Node Group D: p4d.24xlarge          |
|  ...                                     |
|  Manage each Node Group separately       |
|  (inefficient)                           |
+------------------------------------------+

+------------------------------------------+
|            Karpenter Approach            |
|                                          |
|  Single GPU NodePool:                    |
|  - Analyze pod requirements              |
|  - Auto-select optimal GPU instance      |
|  - Automatic Spot/On-Demand switching    |
|  - Cost-based instance optimization      |
+------------------------------------------+

2. GPU NodePool Configuration

General-Purpose GPU NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-general
spec:
  template:
    metadata:
      labels:
        node-type: gpu
        workload: ai-ml
    spec:
      requirements:
        # Select only GPU instances
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

        # GPU instance families
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g', 'p']

        # Capacity type
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

        # Availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

        # x86 architecture only
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64']

      # GPU-dedicated taint
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

      # Longer expiration for GPU nodes
      expireAfter: 336h # 14 days

  limits:
    cpu: '500'
    memory: 2000Gi
    nvidia.com/gpu: '100'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
    budgets:
      - nodes: '1'

  weight: 80

AWS GPU Instance Type Guide

+------------------+----------+----------+------------------+---------------------+
| Instance Type    | GPU      | Count    | GPU Memory       | Primary Use Case    |
+------------------+----------+----------+------------------+---------------------+
| g4dn.xlarge      | T4       | 1        | 16 GB            | Inference, light ML |
| g4dn.12xlarge    | T4       | 4        | 64 GB            | Multi-inference     |
| g5.xlarge        | A10G     | 1        | 24 GB            | Inference, fine-tune|
| g5.12xlarge      | A10G     | 4        | 96 GB            | Medium training     |
| g5.48xlarge      | A10G     | 8        | 192 GB           | Large training      |
| g6.xlarge        | L4       | 1        | 24 GB            | Inference optimized |
| g6.12xlarge      | L4       | 4        | 96 GB            | Multimodal inference|
| p3.2xlarge       | V100     | 1        | 16 GB            | General training    |
| p3.8xlarge       | V100     | 4        | 64 GB            | Large training      |
| p4d.24xlarge     | A100     | 8        | 320 GB (40GB x8) | Ultra-large training|
| p5.48xlarge      | H100     | 8        | 640 GB (80GB x8) | Maximum performance |
+------------------+----------+----------+------------------+---------------------+
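
With a catalog like the table above, picking the right instance usually starts from the workload's total GPU-memory requirement. A minimal sketch of that selection logic; the `GPU_CATALOG` dict and `smallest_fit` helper are illustrative, not a Karpenter API:

```python
# Hypothetical helper: pick the smallest instance from the table above that
# satisfies a job's total GPU-memory requirement. Entries mirror the table:
# instance name -> (total VRAM in GB, GPU count).
GPU_CATALOG = {
    "g4dn.xlarge": (16, 1),
    "g5.xlarge": (24, 1),
    "g4dn.12xlarge": (64, 4),
    "g5.12xlarge": (96, 4),
    "g5.48xlarge": (192, 8),
    "p4d.24xlarge": (320, 8),
    "p5.48xlarge": (640, 8),
}

def smallest_fit(vram_gb_needed: int) -> str:
    """Return the smallest catalog entry with enough total GPU memory."""
    for name, (vram, _count) in sorted(GPU_CATALOG.items(), key=lambda kv: kv[1][0]):
        if vram >= vram_gb_needed:
            return name
    raise ValueError(f"no instance offers {vram_gb_needed} GB of GPU memory")

print(smallest_fit(80))  # → g5.12xlarge (96 GB across 4x A10G)
```

Karpenter performs an analogous fit internally when it matches pending pod requests against allowed instance types, which is why wide `requirements` ranges usually beat hand-picked instance lists.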

Inference-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-inference
        workload: inference
    spec:
      requirements:
        # Inference-suitable instances
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Spot instances preferred (inference is stateless)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot', 'on-demand']

        # Instance size constraint
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '50'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m

  weight: 60

Training-Only NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training
spec:
  template:
    metadata:
      labels:
        node-type: gpu-training
        workload: training
    spec:
      requirements:
        # High-performance GPUs for training
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a100', 'h100', 'a10g']

        # On-Demand only (training interruption is costly)
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

        # Must include at least one GPU
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

      # Long expiration for training nodes
      expireAfter: 720h # 30 days

  limits:
    nvidia.com/gpu: '32'

  disruption:
    # Only reclaim empty nodes; the zero-node budget below blocks all voluntary disruption
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      - nodes: '0'

  weight: 90

3. GPU-Optimized EC2NodeClass

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-optimized
spec:
  # AMI with GPU driver support
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
        network-type: private

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large disk for GPU workloads
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
        iops: 6000
        throughput: 250
        encrypted: true
        deleteOnTermination: true

  metadataOptions:
    httpEndpoint: enabled
    httpPutResponseHopLimit: 2
    httpTokens: required

  tags:
    Environment: production
    NodeType: gpu
    ManagedBy: karpenter

  # Bootstrap script for GPU nodes
  userData: |
    #!/bin/bash
    echo "GPU node bootstrap"
    # NVIDIA drivers are handled by GPU Operator

Training-Specific EC2NodeClass (Large Storage)

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-training
spec:
  amiSelectorTerms:
    - alias: al2023@latest

  role: KarpenterNodeRole-my-cluster

  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster

  # Large, high-performance storage for training data
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 500Gi
        volumeType: gp3
        iops: 16000
        throughput: 1000
        encrypted: true
        deleteOnTermination: true

  tags:
    Environment: production
    NodeType: gpu-training
    ManagedBy: karpenter

4. Spot GPU Instance Strategy

Cost Savings with Spot GPU

+------------------+-------------------+-------------------+---------+
| Instance Type    | On-Demand ($/hr)  | Spot Est. ($/hr)  | Savings |
+------------------+-------------------+-------------------+---------+
| g4dn.xlarge      | ~0.526            | ~0.158            | ~70%    |
| g5.xlarge        | ~1.006            | ~0.302            | ~70%    |
| g5.2xlarge       | ~1.212            | ~0.364            | ~70%    |
| g5.12xlarge      | ~5.672            | ~1.702            | ~70%    |
| p3.2xlarge       | ~3.060            | ~0.918            | ~70%    |
+------------------+-------------------+-------------------+---------+
 (Prices vary by region and time)
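
The "Savings" column follows directly from savings = 1 − spot / on-demand. A quick check against the sample prices from the table (illustrative figures, not live quotes):

```python
# Reproduce the "Savings" column: savings = 1 - spot_price / on_demand_price.
# Prices are the illustrative figures from the table above, not live quotes.
prices = {
    "g4dn.xlarge": (0.526, 0.158),
    "g5.xlarge": (1.006, 0.302),
    "p3.2xlarge": (3.060, 0.918),
}

for instance, (on_demand, spot) in prices.items():
    savings = 1 - spot / on_demand
    print(f"{instance}: {savings:.0%} cheaper on Spot")  # each prints ~70%
```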

Spot GPU NodePool for Inference

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-inference
spec:
  template:
    metadata:
      labels:
        node-type: gpu-spot
        workload: inference
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']

        # Diverse GPU types for inference
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4', 'a10g', 'l4']

        # Various sizes for Spot availability
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ['xlarge', '2xlarge', '4xlarge', '8xlarge', '12xlarge']

        # Multiple AZs
        - key: topology.kubernetes.io/zone
          operator: In
          values: ['us-east-1a', 'us-east-1b', 'us-east-1c']

      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized

  limits:
    nvidia.com/gpu: '40'

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

  weight: 70

Spot Interruption Mitigation

# Apply do-not-disrupt annotation for long-running training jobs
apiVersion: v1
kind: Pod
metadata:
  name: training-job
  annotations:
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: training
      image: my-training-image:latest
      resources:
        requests:
          nvidia.com/gpu: '1'
          cpu: '4'
          memory: 16Gi
        limits:
          nvidia.com/gpu: '1'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  terminationGracePeriodSeconds: 120
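
The grace period above only pays off if the training process actually traps SIGTERM (sent on Spot interruption or node drain) and checkpoints before exiting. A minimal sketch, assuming a hypothetical `save_checkpoint` that stands in for your framework's own snapshot call (e.g., `torch.save` of model/optimizer state):

```python
# Minimal sketch: catch SIGTERM and flush a checkpoint before exiting.
# save_checkpoint() is a hypothetical stand-in for your framework's snapshot
# logic; in practice it should write to durable storage (S3, EBS, PVC).
import signal
import sys

def save_checkpoint():
    # Placeholder: persist model/optimizer state here.
    print("checkpoint saved")

def handle_sigterm(signum, frame):
    save_checkpoint()
    sys.exit(0)  # exit cleanly within terminationGracePeriodSeconds

signal.signal(signal.SIGTERM, handle_sigterm)
```

Size `terminationGracePeriodSeconds` to comfortably exceed your worst-case checkpoint write time; Spot gives only a two-minute interruption notice, so checkpoints must fit inside that window.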

5. NVIDIA GPU Operator Integration

GPU Operator Overview

+----------------------------------------------------------------+
|                      NVIDIA GPU Operator                       |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | NVIDIA Driver    |  | Container Toolkit |  | Device Plugin| |
|  | (Auto Install)   |  | (Auto Config)     |  | (Auto Deploy)| |
|  +------------------+  +-------------------+  +--------------+ |
|                                                                |
|  +------------------+  +-------------------+  +--------------+ |
|  | GPU Feature      |  | DCGM Exporter     |  | MIG Manager  | |
|  | Discovery        |  | (Metrics)         |  | (MIG Mgmt)   | |
|  +------------------+  +-------------------+  +--------------+ |
+----------------------------------------------------------------+

Installing GPU Operator

# Add NVIDIA GPU Operator Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false \
  --set gfd.enabled=true

Verifying GPU Operator with Karpenter

# Check NVIDIA labels on GPU nodes
kubectl get nodes -l node-type=gpu -o json | \
  jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia")))'

# Verify GPU resources
kubectl describe node gpu-node-name | grep -A 5 "nvidia.com/gpu"

# Check DCGM Exporter pods
kubectl get pods -n gpu-operator -l app=nvidia-dcgm-exporter

GPU Workload Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference-server
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: inference
          image: nvcr.io/nvidia/tritonserver:24.01-py3
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: model-store
              mountPath: /models
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc

6. Multi-Architecture Support (x86 + ARM/Graviton)

Multi-Architecture NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: multi-arch
spec:
  template:
    spec:
      requirements:
        # Allow both x86 and ARM
        - key: kubernetes.io/arch
          operator: In
          values: ['amd64', 'arm64']

        # Include Graviton instances
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']

        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ['5']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand', 'spot']

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default

  limits:
    cpu: '1000'
    memory: 2000Gi

  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m

GPU Alternatives on AWS Silicon: Inferentia/Trainium

# AWS Inferentia inference-only NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: inferentia
spec:
  template:
    metadata:
      labels:
        accelerator: inferentia
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ['inf2.xlarge', 'inf2.8xlarge', 'inf2.24xlarge', 'inf2.48xlarge']

        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']

      taints:
        - key: aws.amazon.com/neuron
          value: 'true'
          effect: NoSchedule

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: inferentia-nodes

  limits:
    aws.amazon.com/neuron: '32'

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m

7. Cost Optimization Strategies

Mixed Spot and On-Demand Strategy

+-------------------------------------------------------------+
|           Cost Optimization Decision Tree                    |
+-------------------------------------------------------------+
|                                                             |
|  Identify workload type                                     |
|      |                                                      |
|      +-- Inference (Stateless) --> Spot first + OD fallback |
|      |                                                      |
|      +-- Fine-tuning (Short) --> Spot + checkpoint strategy |
|      |                                                      |
|      +-- Large Training (Long) --> On-Demand + Reserved     |
|      |                                                      |
|      +-- Batch Processing --> Spot only                     |
|                                                             |
+-------------------------------------------------------------+
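
The decision tree above can be codified as a small lookup. The function name and strategy labels below are illustrative only; the categories are the ones from the diagram:

```python
# Hypothetical codification of the decision tree: map a workload profile to a
# capacity strategy. Labels are illustrative, not Karpenter configuration.
def capacity_strategy(workload: str, long_running: bool = False) -> str:
    if workload == "inference":  # stateless, safe on Spot
        return "spot-first-with-on-demand-fallback"
    if workload == "training":
        # Short fine-tunes tolerate Spot with checkpoints; long runs do not.
        return "on-demand-or-reserved" if long_running else "spot-with-checkpoints"
    if workload == "batch":  # retryable by design
        return "spot-only"
    raise ValueError(f"unknown workload type: {workload}")

print(capacity_strategy("inference"))       # → spot-first-with-on-demand-fallback
print(capacity_strategy("training", True))  # → on-demand-or-reserved
```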

Weighted Priority Instance Family Strategy

# Tier 1: G5 Spot (most cost-effective for inference)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier1-g5-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 100
  limits:
    nvidia.com/gpu: '20'
---
# Tier 2: G4dn Spot (fallback)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier2-g4dn-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['t4']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 50
  limits:
    nvidia.com/gpu: '20'
---
# Tier 3: G5 On-Demand (last resort)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-tier3-g5-ondemand
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values: ['a10g']
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['g']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-optimized
  weight: 10
  limits:
    nvidia.com/gpu: '10'
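
Karpenter evaluates higher-weight NodePools first and falls through when a pool cannot satisfy a pod (no Spot capacity, limits reached, requirements unmet). A sketch of the fallback order implied by the three tiers above; the pool data mirrors the YAML, and `provisioning_order` is an illustrative helper, not a Karpenter API:

```python
# Illustrate the fallback order implied by NodePool weights: Karpenter
# considers higher-weight pools first. Pool data mirrors the YAML above.
pools = [
    {"name": "gpu-tier1-g5-spot", "weight": 100},
    {"name": "gpu-tier2-g4dn-spot", "weight": 50},
    {"name": "gpu-tier3-g5-ondemand", "weight": 10},
]

def provisioning_order(pools):
    """Highest weight first, mirroring Karpenter's NodePool preference."""
    return [p["name"] for p in sorted(pools, key=lambda p: p["weight"], reverse=True)]

print(provisioning_order(pools))
# → ['gpu-tier1-g5-spot', 'gpu-tier2-g4dn-spot', 'gpu-tier3-g5-ondemand']
```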

Consolidation Policy Optimization

# GPU node consolidation settings
disruption:
  # Use WhenEmpty for GPU nodes so in-flight GPU jobs are never consolidated away
  consolidationPolicy: WhenEmpty
  # Wait 5 minutes after detecting empty node (handle temporary inactivity)
  consolidateAfter: 5m
  budgets:
    # Disrupt at most 1 node at a time
    - nodes: '1'
    # Block disruption during business hours
    - nodes: '0'
      schedule: '0 9 * * MON-FRI'
      duration: 10h

8. Node Disruption Budgets for GPU Workloads

Disruption Budget for GPU Workloads

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-training-protected
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: Gt
          values: ['0']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
      taints:
        - key: nvidia.com/gpu
          value: 'true'
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-training

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30m
    budgets:
      # Block disruption completely during training hours
      - nodes: '0'
        schedule: '0 0 * * *'
        duration: 23h

      # Maintenance window (1 hour daily)
      - nodes: '1'
        schedule: '0 23 * * *'
        duration: 1h

      # Manage drift-related disruption separately
      - nodes: '1'
        reasons:
          - 'Drifted'

Pod-Level Protection

# Long-running training pod: prevent Karpenter disruption
apiVersion: v1
kind: Pod
metadata:
  name: long-training-job
  annotations:
    # This annotation prevents Karpenter voluntary disruption
    karpenter.sh/do-not-disrupt: 'true'
spec:
  containers:
    - name: trainer
      image: my-training-image:v1
      resources:
        requests:
          nvidia.com/gpu: '4'
          cpu: '16'
          memory: 64Gi
        limits:
          nvidia.com/gpu: '4'
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  # Sufficient grace period for checkpoint saving
  terminationGracePeriodSeconds: 300

PDB (Pod Disruption Budget) Configuration

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-server-pdb
  namespace: ml-serving
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: inference-server

9. Monitoring with Prometheus and Grafana

Karpenter Metrics Collection

# Karpenter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: karpenter
  namespace: karpenter
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: karpenter
  endpoints:
    - port: http-metrics
      interval: 15s
      path: /metrics

Key Karpenter Metrics

+-----------------------------------------------+------------------------------------------+
| Metric                                        | Description                              |
+-----------------------------------------------+------------------------------------------+
| karpenter_nodeclaims_launched_total           | Total NodeClaims launched                |
| karpenter_nodeclaims_registered_total         | Total NodeClaims registered              |
| karpenter_nodeclaims_terminated_total         | Total NodeClaims terminated              |
| karpenter_pods_state                          | Pod state (node, namespace, etc.)        |
| karpenter_nodepool_usage                      | Resource usage per NodePool              |
| karpenter_nodepool_limit                      | Resource limits per NodePool             |
| karpenter_voluntary_disruption_eligible_nodes | Nodes eligible for voluntary disruption  |
| karpenter_disruption_actions_performed_total  | Total disruption actions performed       |
| karpenter_nodes_allocatable                   | Allocatable resources per node           |
| karpenter_nodes_total_daemon_requests         | Total daemon set resource requests       |
+-----------------------------------------------+------------------------------------------+

GPU-Specific Grafana Dashboard Queries

# Track GPU node count
count(karpenter_nodes_allocatable{resource_type="nvidia.com/gpu"} > 0)

# GPU utilization (requires DCGM Exporter)
DCGM_FI_DEV_GPU_UTIL

# GPU usage vs limits per NodePool
karpenter_nodepool_usage{resource_type="nvidia.com/gpu"}
  /
karpenter_nodepool_limit{resource_type="nvidia.com/gpu"}

# Provisioning latency
histogram_quantile(0.99,
  rate(karpenter_provisioner_scheduling_duration_seconds_bucket[5m])
)

DCGM Exporter Metrics

# DCGM Exporter ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: metrics
      interval: 15s

Alert Rules Example

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: karpenter-gpu-alerts
  namespace: monitoring
spec:
  groups:
    - name: karpenter-gpu
      rules:
        # GPU NodePool reaching 90% of limit
        - alert: GPUNodePoolNearLimit
          expr: |
            karpenter_nodepool_usage{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            /
            karpenter_nodepool_limit{nodepool="gpu-general", resource_type="nvidia.com/gpu"}
            > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: 'GPU NodePool approaching resource limit'

        # Low GPU utilization detected
        - alert: LowGPUUtilization
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 10
          for: 1h
          labels:
            severity: info
          annotations:
            summary: 'GPU utilization below 10 percent for 1 hour'

        # Karpenter provisioning failure
        - alert: KarpenterProvisioningFailed
          expr: |
            increase(karpenter_nodeclaims_terminated_total{reason="ProvisioningFailed"}[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: 'Karpenter failed to provision GPU node'

10. Real-World Example: Training Cluster

Distributed Training Cluster Configuration

# PyTorch distributed training Job
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
  namespace: ml-training
spec:
  parallelism: 4
  completions: 4
  completionMode: Indexed # required to populate the job-completion-index annotation
  template:
    metadata:
      labels:
        app: distributed-training
      annotations:
        karpenter.sh/do-not-disrupt: 'true'
    spec:
      containers:
        - name: pytorch-trainer
          image: my-pytorch-training:v1
          command: ['torchrun']
          args:
            - '--nproc_per_node=1'
            - '--nnodes=4'
            - '--node_rank=$(JOB_COMPLETION_INDEX)'
            - '--master_addr=training-master'
            - '--master_port=29500'
            - 'train.py'
          env:
            - name: JOB_COMPLETION_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
          resources:
            requests:
              cpu: '8'
              memory: 32Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          volumeMounts:
            - name: shared-data
              mountPath: /data
            - name: checkpoints
              mountPath: /checkpoints
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-training
      restartPolicy: OnFailure
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: training-data-pvc
        - name: checkpoints
          persistentVolumeClaim:
            claimName: checkpoint-pvc

11. Real-World Example: Inference Cluster

Auto-Scaling Inference Service

# Inference Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - '--model'
            - 'meta-llama/Meta-Llama-3-8B'
            - '--tensor-parallel-size'
            - '1'
            - '--gpu-memory-utilization'
            - '0.9'
          ports:
            - containerPort: 8000
              name: http
          resources:
            requests:
              cpu: '4'
              memory: 16Gi
              nvidia.com/gpu: '1'
            limits:
              nvidia.com/gpu: '1'
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-type: gpu-inference
---
# HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: ml-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          # Custom metric; requires a metrics adapter (e.g., prometheus-adapter)
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: '70'
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

12. Troubleshooting Guide

Common GPU Node Issues

# 1. GPU resources not showing on node
kubectl describe node gpu-node | grep -A 10 "Allocatable"
# If nvidia.com/gpu is missing, check GPU Operator

# 2. Check GPU Operator pod status
kubectl get pods -n gpu-operator
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# 3. Check Karpenter provisioning logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter \
  | grep -i "gpu\|nvidia\|instance-type"

# 4. Check NodeClaim status
kubectl get nodeclaims -o wide

# 5. Analyze pending pod causes
kubectl describe pod gpu-pod-name | grep -A 20 "Events"

Common Issues and Solutions

+---------------------------------------------+------------------------------------------+
| Issue                                       | Solution                                 |
+---------------------------------------------+------------------------------------------+
| GPU resources not showing on node           | Reinstall GPU Operator or verify drivers |
| Cannot find Spot GPU instances              | Add more GPU instance types and AZs      |
| GPU node provisioning timeout               | Check EC2NodeClass subnet/SG tags        |
| Node disrupted during training              | Add do-not-disrupt annotation            |
| GPU out of memory (OOM)                     | Allow larger GPU instance types          |
| Idle GPU nodes persisting                   | Review consolidation policy and          |
|                                             | consolidateAfter value                   |
| Only specific GPU type being provisioned    | Expand NodePool requirements range       |
+---------------------------------------------+------------------------------------------+

GPU Memory Debugging

# Check GPU status directly on node (using debug pod)
kubectl run gpu-debug --rm -it \
  --image=nvidia/cuda:12.0.0-base-ubuntu22.04 \
  --overrides='{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Exists","effect":"NoSchedule"}],"nodeSelector":{"node-type":"gpu"}}}' \
  --restart=Never \
  -- nvidia-smi

13. Best Practices Summary

GPU Node Management Checklist

+---+------------------------------------------------------------+
| # | Best Practice                                              |
+---+------------------------------------------------------------+
| 1 | Separate NodePools for inference and training workloads    |
| 2 | Set GPU taints to prevent non-GPU workload scheduling      |
| 3 | Use Spot for inference, On-Demand for training             |
| 4 | Apply do-not-disrupt annotation for long training jobs     |
| 5 | Implement checkpoint strategy to protect training progress |
| 6 | Automate driver management with GPU Operator               |
| 7 | Collect GPU metrics with DCGM Exporter                     |
| 8 | Set GPU cost caps with NodePool limits                     |
| 9 | Allow multiple GPU instance types for availability         |
| 10| Guarantee minimum availability with PDB for inference      |
| 11| Block disruption during training with Disruption Budgets   |
| 12| Combine HPA with Karpenter for auto-scaling                |
+---+------------------------------------------------------------+

Cost Optimization Strategy Summary

Strategy 1: Tiered NodePools
  - Spot GPU (high weight) -> On-Demand GPU (low weight)
  - Optimal for inference workloads

Strategy 2: Instance Diversification
  - Allow multiple GPU families (g4dn, g5, g6)
  - Allow multiple instance sizes
  - Maximize Spot availability

Strategy 3: Auto Scale-Down
  - WhenEmpty consolidation to remove idle GPU nodes immediately
  - Short consolidateAfter for inference
  - Longer wait time for training nodes

Strategy 4: Appropriate Resource Limits
  - Limit max GPUs with NodePool limits
  - Prevent unexpected cost spikes
  - Manage quotas per team/project

Final Architecture Diagram: Karpenter + GPU

+---------------------------------------------------------------------+
|                        EKS Cluster                                  |
|                                                                     |
|  +-------------------+  +-------------------+  +-----------------+  |
|  | NodePool:         |  | NodePool:         |  | NodePool:       |  |
|  | gpu-inference     |  | gpu-training      |  | multi-arch      |  |
|  | (Spot, weight:60) |  | (OD, weight:90)   |  | (Mixed, w:50)   |  |
|  +--------+----------+  +--------+----------+  +--------+--------+  |
|           |                      |                      |           |
|  +--------v----------+  +--------v----------+  +--------v--------+  |
|  | EC2NodeClass:     |  | EC2NodeClass:     |  | EC2NodeClass:   |  |
|  | gpu-optimized     |  | gpu-training      |  | default         |  |
|  | (200GB, gp3)      |  | (500GB, gp3)      |  | (100GB, gp3)    |  |
|  +-------------------+  +-------------------+  +-----------------+  |
|                                                                     |
|  +-------------------+  +-------------------+                       |
|  | GPU Operator      |  | Prometheus +      |                       |
|  | (NVIDIA Driver,   |  | Grafana           |                       |
|  |  Device Plugin,   |  | (Karpenter +      |                       |
|  |  DCGM Exporter)   |  |  DCGM Metrics)    |                       |
|  +-------------------+  +-------------------+                       |
+---------------------------------------------------------------------+