Kubernetes AI Training Pipeline: Analyzing Volcano, Training Operator, and Kueue


1. Limitations of the Default Kubernetes Scheduler

The default Kubernetes scheduler (kube-scheduler) performs scheduling at the Pod level. Its core role is to place each Pod on an appropriate Node based on the resources it requests (CPU, Memory, GPU). While this architecture works well for typical workloads like web services and API servers, it has fundamental limitations when it comes to batch workloads such as AI/ML distributed training.

1.1 Lack of Gang Scheduling

The biggest challenge in distributed training is the absence of Gang Scheduling. Consider a PyTorch DDP (Distributed Data Parallel) job with 4 Worker Pods: all 4 must start simultaneously for training to proceed. Because the default scheduler places Pods individually, it may schedule 3 of the 4 while the last stays Pending due to insufficient resources. The 3 already-placed Pods hold their GPUs, yet training cannot begin, so those GPUs are simply wasted.

1.2 Deadlock Problem

An even more serious issue is deadlock. Suppose two distributed training Jobs each require 4 GPUs, but the cluster only has 6 GPUs in total. The default scheduler might allocate 3 GPUs to Job A and 3 to Job B. Since both Jobs need a minimum of 4, neither can start training. This is exactly the "All or Nothing" problem that Gang Scheduling solves.
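The two behaviors can be contrasted with a toy simulation (illustrative Python, not scheduler source; the job names and the round-robin placement order are assumptions made for the example):

```python
# Toy model of Section 1.2: 6 GPUs total, two jobs that each need 4.

def schedule_pod_by_pod(total_gpus, jobs):
    """Default-scheduler style: place pods one at a time, round-robin."""
    free = total_gpus
    placed = {name: 0 for name, _ in jobs}
    progress = True
    while progress and free > 0:
        progress = False
        for name, need in jobs:
            if placed[name] < need and free > 0:
                placed[name] += 1
                free -= 1
                progress = True
    # A job "runs" only if ALL of its pods were placed.
    running = [name for name, need in jobs if placed[name] == need]
    return placed, running

def schedule_gang(total_gpus, jobs):
    """Gang style: admit a job only if all of its pods fit at once."""
    free = total_gpus
    running = []
    for name, need in jobs:
        if need <= free:
            free -= need
            running.append(name)
    return running

jobs = [("job-a", 4), ("job-b", 4)]
print(schedule_pod_by_pod(6, jobs))  # both jobs partially placed, neither runs
print(schedule_gang(6, jobs))        # job-a admitted whole; job-b waits intact
```

Pod-by-pod placement strands 3 GPUs on each job and neither ever runs; the gang variant runs job-a to completion, after which job-b gets its 4 GPUs.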

1.3 Lack of Fair-share and Queue-based Resource Management

The default scheduler has no concept of Queues for resource sharing across teams. In AI organizations, multiple research teams share GPU clusters, and implementing fair-share policies — such as setting per-team quotas and sharing idle resources — is difficult with the default scheduler alone.

These limitations gave rise to projects like Volcano, Kubeflow Training Operator, and Kueue.


2. Volcano Scheduler Official Documentation Analysis

Volcano is a project under the CNCF (Cloud Native Computing Foundation) that serves as both a scheduler and controller for handling high-performance batch workloads on Kubernetes. It is a batch scheduling system designed for AI/ML, BigData, HPC (High Performance Computing), and similar workloads.

2.1 Architecture

Volcano consists of three core components:

  • Volcano Scheduler: A scheduler that replaces or runs alongside kube-scheduler. It performs scheduling at the Job/PodGroup level rather than the Pod level.
  • Volcano Controller Manager: A controller that manages the lifecycle of CRDs such as VolcanoJob, Queue, and PodGroup.
  • Volcano Webhook (Admission Controller): An Admission Webhook responsible for validation and default value injection when CRDs are created.

The Volcano Scheduler is compatible with Kubernetes' Scheduling Framework and performs the PreFilter/Filter and Score phases of the default scheduler through predicates and nodeorder plugins. On top of this, it adds advanced scheduling strategies like Gang Scheduling, Fair-share, and Binpack as plugins.

2.2 Queue CRD

Queue is a CRD that logically partitions cluster resources to provide resource isolation in multi-tenant environments.

apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-research-queue
spec:
  weight: 4
  reclaimable: true
  capability:
    cpu: '64'
    memory: '256Gi'
    nvidia.com/gpu: '8'

  • weight: The resource allocation ratio among Queues. A higher weight means more resources are allocated.
  • reclaimable: Whether idle resources can be lent to other Queues. When set to true, idle resources in this Queue can be used by other Queues.
  • capability: The maximum resource limits the Queue can use. Specifies CPU, Memory, GPU, and more.

2.3 PodGroup CRD

PodGroup is a CRD that defines a group of tightly coupled Pods and serves as the core unit for Gang Scheduling.

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pytorch-training-pg
  namespace: default
spec:
  minMember: 4
  minResources:
    cpu: '16'
    memory: '64Gi'
    nvidia.com/gpu: '4'
  priorityClassName: high-priority
  queue: ml-research-queue

  • minMember: The minimum number of Pods that must run within the PodGroup. If cluster resources cannot satisfy this number, no Pods in the PodGroup will be scheduled.
  • minResources: The minimum resources required to run the PodGroup. If available resources do not meet this threshold, scheduling is deferred.
  • priorityClassName: The priority of the PodGroup. Used by the scheduler to sort PodGroups within a Queue.
  • queue: The Queue name to which the PodGroup belongs. The Queue must be in an Open state.

The PodGroup Status Phase transitions as follows: Pending (minimum resources not yet satisfied) -> Inqueue (passed the enqueue check; Pods may be created and wait for node binding) -> Running (at least minMember Pods running). Unknown indicates that some Pods could not be scheduled after admission. When a VolcanoJob is created, its PodGroup is generated automatically.


3. Volcano Gang Scheduling and Fair-share Scheduling Policies

3.1 Gang Scheduling

Gang Scheduling is the most critical feature of Volcano. Using an "All or Nothing" strategy, it only places Pods when all Pods comprising a Job (or at least the minimum number specified by minMember) can be scheduled simultaneously.

Gang Scheduling is essential in the following scenarios:

  • MPI-based distributed training: All Workers must start simultaneously for allreduce communication to work.
  • PyTorch DDP: Both Master and Workers must be ready before training can begin.
  • Spark/Big Data: Both Driver and Executors must be ready before a Job can run.

The Gang Plugin checks the PodGroup's minMember field to determine whether that number of Pods can be simultaneously scheduled on the cluster. If not, no Pods are scheduled at all, preventing resource waste and deadlock.
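The minMember check can be sketched as follows (a simplified model that only counts free GPUs per node; the real Gang Plugin also runs predicates such as affinity and taint checks):

```python
# Illustrative sketch (not Volcano source) of the gang plugin's
# all-or-nothing decision: count how many pods of a PodGroup could be
# bound right now, and admit none unless at least minMember fit.

def can_admit(pod_gpu_request, min_member, free_gpus_per_node):
    """Return True only if min_member pods fit simultaneously."""
    schedulable = sum(free // pod_gpu_request for free in free_gpus_per_node)
    return schedulable >= min_member

# 4-worker DDP job, 1 GPU per pod.
print(can_admit(1, 4, [1, 1, 1]))  # False -> no pods are placed at all
print(can_admit(1, 4, [2, 2]))     # True  -> all 4 are placed together
```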

3.2 DRF (Dominant Resource Fairness) Scheduling

DRF is an algorithm that implements fair resource distribution in multi-tenant environments. It calculates fair-share based on each Job's Dominant Resource.

For example, if Job A is CPU-intensive (8 CPU cores, 4GB Memory) and Job B is memory-intensive (2 CPU cores, 32GB Memory), the scheduler prioritizes the Job with the lowest dominant resource ratio to ensure fairness. The DRF Plugin considers multi-dimensional resources (CPU, Memory, GPU) simultaneously to achieve fair distribution without bias toward any particular resource type.
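The dominant-share arithmetic behind this example can be written out directly (a hedged sketch of the DRF idea, not the plugin's code; the 64-CPU/256 GB capacity is borrowed from the Queue example above):

```python
# A job's dominant share is the largest fraction of any single
# resource type it consumes relative to cluster capacity; DRF serves
# the job with the smallest dominant share next.

def dominant_share(alloc, capacity):
    return max(alloc[r] / capacity[r] for r in capacity)

capacity = {"cpu": 64, "memory_gb": 256}
job_a = {"cpu": 8, "memory_gb": 4}    # CPU-dominant:    8/64  = 0.125
job_b = {"cpu": 2, "memory_gb": 32}   # memory-dominant: 32/256 = 0.125
print(dominant_share(job_a, capacity))
print(dominant_share(job_b, capacity))
```

With these numbers the two jobs have equal dominant shares, so neither starves the other; if job_b grew to 64 GB its share would double and job_a would be served first.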

3.3 Proportion Plugin

The Proportion Plugin manages resource ratios per Queue. It proportionally distributes the total cluster resources based on each Queue's weight and allocates resources within the Queue's capability limits. When multiple teams share a GPU cluster, you can create per-team Queues and set weights to adjust resource allocation ratios.
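Weight-based division with a capability cap can be sketched like this (a simplified model: the real plugin also redistributes any quota left over after clamping, which this sketch omits; queue names and GPU counts are illustrative):

```python
# Split total GPUs by Queue weight, then clamp each share to the
# Queue's capability limit.

def proportional_shares(total, queues):
    """queues: {name: (weight, capability)} -> {name: gpu share}"""
    total_weight = sum(w for w, _ in queues.values())
    shares = {}
    for name, (weight, capability) in queues.items():
        share = total * weight / total_weight
        shares[name] = min(share, capability)
    return shares

# 24 GPUs shared by two teams; the research queue has weight 4 but a
# capability of only 8 GPUs.
print(proportional_shares(24, {
    "ml-research-queue": (4, 8),   # raw share 16, clamped to 8
    "ml-serving-queue":  (2, 16),  # raw share 8
}))
```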


4. Volcano Job CRD and Plugin System

4.1 VolcanoJob CRD

VolcanoJob is Volcano's core CRD for defining batch workloads. It allows you to group multiple Tasks (roles) into a single Job.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-ddp-training
spec:
  minAvailable: 5
  schedulerName: volcano
  queue: ml-research-queue
  plugins:
    env: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: master
      template:
        spec:
          containers:
            - image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
              name: master
              command: ['torchrun']
              args: ['--nproc_per_node=1', '--nnodes=5', '--node_rank=0', 'train.py']
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: OnFailure
    - replicas: 4
      name: worker
      template:
        spec:
          containers:
            - image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
              name: worker
              command: ['torchrun']
              args: ['--nproc_per_node=1', '--nnodes=5', 'train.py']
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: OnFailure

  • minAvailable: The minimum number of Pods for Gang Scheduling. Serves the same role as minMember in PodGroup.
  • schedulerName: Set to volcano to use the Volcano Scheduler.
  • queue: The Queue name where the Job is submitted.
  • plugins: List of Volcano plugins to activate. env handles automatic environment variable injection, and svc handles automatic service creation.
  • policies: Event-based action policies. Defines actions such as restarting the Job when a Pod is evicted.
  • tasks: List of Tasks comprising the Job. Each Task defines replicas and a Pod Template per role (master, worker).

4.2 Plugin System Details

The Volcano Scheduler adopts a plugin architecture that allows flexible extension of scheduling strategies. The major plugins are as follows:

| Plugin | Function | Suitable Workloads |
| --- | --- | --- |
| Gang | All or Nothing scheduling | Distributed training, MPI, Spark |
| Binpack | Maximize utilization of existing nodes | Small Jobs, resource efficiency optimization |
| DRF | Dominant Resource-based Fair-share | Multi-tenant batch processing |
| Proportion | Per-Queue resource ratio management | Team-based resource isolation |
| Priority | Priority-based scheduling | Multiple Jobs with different priorities |
| Task-Topology | Task Affinity/Anti-Affinity | Distributed training requiring network optimization |
| TDM | Time-division resource sharing | Kubernetes + YARN hybrid environments |
| SLA | Wait time limits | Batches with real-time service requirements |
| Numa-aware | NUMA topology-aware scheduling | HPC where CPU cache affinity matters |
| Predicates | GPU resource-based pre-filtering | GPU-intensive AI workloads |
| Nodeorder | Multi-dimensional node scoring | Custom composite scheduling |

The Volcano Scheduler configuration file uses actions and tiers to compose plugin combinations:

actions: 'enqueue, allocate, backfill'
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
  - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

5. Kubeflow Training Operator Official Documentation Analysis

Kubeflow Training Operator is an Operator that manages distributed AI training Jobs on Kubernetes. It supports various frameworks including PyTorch, TensorFlow, MPI, and XGBoost, providing framework-specific CRDs.

5.1 V1 (Legacy) and V2 Architecture

The Training Operator currently has two coexisting versions.

V1 (Legacy): Uses separate CRDs per framework (PyTorchJob, TFJob, MPIJob, etc.). Each CRD provides ReplicaSpecs (Master, Worker, PS, etc.) tailored to the distributed training patterns of the respective framework.

V2 (Current): Unified into a single TrainJob CRD. Framework-specific configurations are managed through TrainingRuntime and ClusterTrainingRuntime. It supports a broader range of frameworks including PyTorch, MLX, HuggingFace, DeepSpeed, JAX, and XGBoost.

This article focuses on the V1 PyTorchJob, which is widely used in production, while also examining the direction of V2.

5.2 Role of the Training Operator

The Training Operator automatically handles the following:

  1. Automatic environment variable configuration: Injects environment variables required for distributed training — such as WORLD_SIZE, RANK, MASTER_ADDR, and MASTER_PORT — into Pods automatically.
  2. torchrun CLI configuration: Sets up the environment so that PyTorch's torchrun (formerly torch.distributed.launch) operates correctly.
  3. Job state management: Manages the lifecycle from Created -> Running -> Succeeded/Failed, applying restart policies on failure.
  4. Service creation: Automatically creates Headless Services for inter-Pod communication.
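From the training code's point of view, these injected variables are simply entries in the environment. A small validation helper like the following can fail fast with a clear message if a script is launched outside the operator (a sketch; the helper name and error text are my own, only the variable names come from the operator's contract):

```python
import os

# Variables the Training Operator injects into every training Pod.
REQUIRED = ("WORLD_SIZE", "RANK", "MASTER_ADDR", "MASTER_PORT")

def read_dist_env(environ=os.environ):
    """Validate and parse the distributed-training environment."""
    missing = [k for k in REQUIRED if k not in environ]
    if missing:
        raise RuntimeError(f"not launched by the operator? missing: {missing}")
    world_size = int(environ["WORLD_SIZE"])
    rank = int(environ["RANK"])
    assert 0 <= rank < world_size, "RANK must lie in [0, WORLD_SIZE)"
    return {"world_size": world_size, "rank": rank,
            "master": f'{environ["MASTER_ADDR"]}:{environ["MASTER_PORT"]}'}

# The values worker 2 of a 4-pod job might see (hypothetical addresses):
print(read_dist_env({"WORLD_SIZE": "4", "RANK": "2",
                     "MASTER_ADDR": "pytorch-ddp-mnist-master-0",
                     "MASTER_PORT": "23456"}))
```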

6. PyTorchJob CRD Detailed Analysis and YAML Examples

PyTorchJob is a CRD for running PyTorch distributed training Jobs on Kubernetes.

6.1 PyTorchJob Structure

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-ddp-mnist
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-training:latest
              command:
                - 'torchrun'
                - '--nproc_per_node=2'
                - '--nnodes=3'
                - '--node_rank=0'
                - '--master_addr=$(MASTER_ADDR)'
                - '--master_port=$(MASTER_PORT)'
                - 'train.py'
                - '--epochs=100'
                - '--batch-size=256'
              env:
                - name: NCCL_DEBUG
                  value: 'INFO'
              resources:
                limits:
                  nvidia.com/gpu: 2
                  memory: '32Gi'
                  cpu: '8'
              volumeMounts:
                - name: training-data
                  mountPath: /data
                - name: checkpoints
                  mountPath: /checkpoints
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: training-data-pvc
            - name: checkpoints
              persistentVolumeClaim:
                claimName: checkpoints-pvc
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-training:latest
              command:
                - 'torchrun'
                - '--nproc_per_node=2'
                - '--nnodes=3'
                - '--master_addr=$(MASTER_ADDR)'
                - '--master_port=$(MASTER_PORT)'
                - 'train.py'
                - '--epochs=100'
                - '--batch-size=256'
              env:
                - name: NCCL_DEBUG
                  value: 'INFO'
              resources:
                limits:
                  nvidia.com/gpu: 2
                  memory: '32Gi'
                  cpu: '8'
              volumeMounts:
                - name: training-data
                  mountPath: /data
                - name: checkpoints
                  mountPath: /checkpoints
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: training-data-pvc
            - name: checkpoints
              persistentVolumeClaim:
                claimName: checkpoints-pvc

6.2 Key Field Descriptions

  • Master: Must always have replicas: 1. It represents the state of the training Job and serves as the reference point for model parameter aggregation. The Training Operator injects this Pod's address as MASTER_ADDR into other Pods.
  • Worker: The Pods that perform the actual training. Adjust the number of replicas to set the scale of distributed training.
  • restartPolicy: When set to OnFailure, Pods are automatically restarted upon failure.
  • sidecar.istio.io/inject: "false": Disables Istio sidecar injection so that inter-Pod training traffic is not routed through the mesh; the official documentation explicitly recommends this annotation for clusters with Istio installed.

6.3 Job Status Monitoring

kubectl get pytorchjobs -n ml-training
kubectl describe pytorchjob pytorch-ddp-mnist -n ml-training
kubectl get pods -l training.kubeflow.org/job-name=pytorch-ddp-mnist -n ml-training

The PyTorchJob status can be checked in the conditions field, transitioning in order: Created -> Running -> Succeeded.


7. MPIJob and TFJob Support

7.1 MPIJob

MPIJob is a CRD for MPI (Message Passing Interface)-based distributed training. It is primarily used with allreduce-based frameworks like Horovod.

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: horovod-training
spec:
  slotsPerWorker: 2
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: mpi-launcher
              image: horovod/horovod:latest
              command:
                - mpirun
                - --allow-run-as-root
                - -np
                - '8'
                - -bind-to
                - none
                - -map-by
                - slot
                - python
                - train.py
              resources:
                limits:
                  cpu: '2'
                  memory: '4Gi'
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: mpi-worker
              image: horovod/horovod:latest
              resources:
                limits:
                  nvidia.com/gpu: 2
                  cpu: '8'
                  memory: '32Gi'

Key characteristics of MPIJob:

  • Launcher: The Pod that runs the mpirun command. It does not perform actual training but distributes work to Workers.
  • Worker: The Pods that perform the actual training. slotsPerWorker specifies the number of processes (GPUs) per Worker.
  • Being framework-agnostic, it can be used with any framework that supports MPI, including Horovod, PyTorch, TensorFlow, and MXNet.

7.2 TFJob

TFJob is a CRD designed for TensorFlow's Parameter Server distributed training strategy.

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-ps-training
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.14.0-gpu
              command: ['python', 'train.py']
              resources:
                limits:
                  cpu: '4'
                  memory: '16Gi'
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.14.0-gpu
              command: ['python', 'train.py']
              resources:
                limits:
                  nvidia.com/gpu: 1
                  cpu: '8'
                  memory: '32Gi'

Key characteristics of TFJob:

  • PS (Parameter Server): Stores model parameters, receives gradients from Workers, and updates the parameters.
  • Worker: Processes a portion of the training data, computes gradients, and sends them to the PS.
  • The Training Operator automatically configures the TF_CONFIG environment variable to ensure TensorFlow's distribution strategy works correctly.

8. Multi-node Multi-GPU Distributed Training Setup (PyTorch DDP on K8s)

8.1 How PyTorch DDP Works

PyTorch DDP (Distributed Data Parallel) is a data-parallel distributed training strategy. It places a model replica on each GPU, splits mini-batches across the GPUs for training, and then synchronizes gradients using an allreduce operation.

Key concepts you need to understand when setting up Multi-node Multi-GPU DDP on Kubernetes:

  • WORLD_SIZE: The total number of processes. (number of nodes) x (GPUs per node). Example: 3 nodes, 2 GPUs each = WORLD_SIZE 6.
  • RANK: A unique ID for each process (from 0 to WORLD_SIZE-1).
  • LOCAL_RANK: The process ID within a node (from 0 to GPUs per node - 1).
  • MASTER_ADDR: The address of the Master node. Automatically set by the Training Operator.
  • MASTER_PORT: The communication port on the Master node. Automatically set by the Training Operator.
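The rank arithmetic in the bullets above reduces to two formulas, shown here with the 3-node, 2-GPU example:

```python
# WORLD_SIZE = nnodes * nproc_per_node, and a process's global RANK is
# node_rank * nproc_per_node + LOCAL_RANK (as torchrun computes it).

def global_rank(node_rank, local_rank, nproc_per_node):
    return node_rank * nproc_per_node + local_rank

nnodes, nproc_per_node = 3, 2
world_size = nnodes * nproc_per_node
print(world_size)                          # 6

# Second GPU (LOCAL_RANK 1) on the last node (node_rank 2):
print(global_rank(2, 1, nproc_per_node))   # 5, i.e. WORLD_SIZE - 1
```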

8.2 Training Code Example

When using the Training Operator, the training code itself uses PyTorch's native Distributed API:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets the rendezvous environment variables (MASTER_ADDR,
    # MASTER_PORT, RANK, WORLD_SIZE, LOCAL_RANK) that the Training
    # Operator injects, so init_process_group needs no explicit arguments.
    dist.init_process_group(backend="nccl")

    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # MyModel, train_dataset, optimizer, and num_epochs stand in for
    # your own model, dataset, optimizer, and training configuration.
    model = MyModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Use DistributedSampler so each process sees a distinct data shard
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=64,
        sampler=train_sampler
    )

    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # reshuffle shards every epoch
        for batch in train_loader:
            loss = model(batch)  # assumes the forward pass returns a loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        # Save checkpoints only on rank 0 to avoid concurrent writes
        if dist.get_rank() == 0:
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.module.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
            }, f'/checkpoints/checkpoint_epoch_{epoch}.pt')

    dist.destroy_process_group()

8.3 NCCL Communication Optimization

In multi-node GPU communication, the performance of NCCL (NVIDIA Collective Communication Library) directly impacts training speed. Tips for optimizing NCCL in a Kubernetes environment:

env:
  - name: NCCL_DEBUG
    value: 'INFO' # Check NCCL communication logs during debugging
  - name: NCCL_IB_DISABLE
    value: '0' # Enable when InfiniBand is available
  - name: NCCL_SOCKET_IFNAME
    value: 'eth0' # Specify the network interface for communication
  - name: NCCL_P2P_LEVEL
    value: 'NVL' # Set P2P communication level when using NVLink

In environments with RDMA networks like InfiniBand or RoCE, you can also set hostNetwork: true on Pods to maximize network performance.


9. Managing Training Data and Checkpoints with PVC/PV

In distributed training, storage design is critical for training data access, checkpoint persistence, and model artifact management.

9.1 PVC for Training Data

Training data must be readable by all Workers, requiring ReadOnlyMany (ROX) or ReadWriteMany (RWX) Access Modes.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
  namespace: ml-training
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: nfs-storage
  resources:
    requests:
      storage: 500Gi

9.2 PVC for Checkpoints

Since only the Rank 0 process saves checkpoints, ReadWriteOnce (RWO) is sufficient. However, ReadWriteMany (RWX) offers more flexibility when considering Fault Tolerance.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkpoints-pvc
  namespace: ml-training
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi

9.3 Storage Backend Selection

| Storage Backend | Access Mode | Performance Characteristics | Suitable Use Cases |
| --- | --- | --- | --- |
| NFS | RWX/ROX | Medium bandwidth, high compatibility | Training data sharing |
| CephFS | RWX | High bandwidth, distributed storage | Large-scale datasets |
| Lustre / GPFS | RWX | Very high bandwidth | HPC-level I/O |
| Local SSD | RWO | Highest IOPS | Checkpoints, local cache |
| S3 (CSI) | RWX | High bandwidth, unlimited capacity | Object storage-based datasets |

For large-scale training datasets, direct access on NFS or CephFS can cause I/O bottlenecks. In such cases, consider using an Init Container pattern to copy data to Local SSD before training starts, or mounting S3-compatible object storage via a CSI Driver.


10. Kueue: The Next-Generation Job Queuing System

Kueue is a cloud-native Job queuing system developed under the Kubernetes SIG (Special Interest Group). It provides Job-level Admission Control for Batch, HPC, and AI/ML workloads.

10.1 Kueue vs Volcano

Kueue and Volcano operate at different layers.

  • Volcano: Operates at the Scheduler level. Decides which Node to place Pods on. Replaces kube-scheduler.
  • Kueue: Operates at the Admission Control level. Decides when a Job can be admitted (started). Works alongside kube-scheduler without replacing it.

Kueue decides "should this Job start?" while Volcano (or kube-scheduler) decides "where should the Pods of a started Job be placed?" The two systems can be used complementarily.

10.2 Core Concepts of Kueue

The Admission process in Kueue works as follows:

  1. When a user creates a Job, Kueue converts it into a Workload.
  2. The Workload is submitted to a LocalQueue.
  3. The LocalQueue references a ClusterQueue.
  4. The ClusterQueue checks quota from the resource pool defined in ResourceFlavor.
  5. If quota is sufficient, the Workload is Admitted and actual Pods are created.
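The steps above can be modeled in a few lines (a toy model, not Kueue code; class and method names are my own, and real admission also involves ResourceFlavors and cohorts):

```python
# A Workload is admitted only if the ClusterQueue behind its
# LocalQueue has enough unused quota for every requested resource.

class ClusterQueue:
    def __init__(self, nominal_quota):
        self.quota = nominal_quota
        self.usage = {r: 0 for r in nominal_quota}

    def try_admit(self, request):
        if all(self.usage[r] + v <= self.quota[r] for r, v in request.items()):
            for r, v in request.items():
                self.usage[r] += v
            return True   # Admitted: pods get created
        return False      # Stays queued: no pods exist yet

# LocalQueue -> ClusterQueue mapping (one queue, 8-GPU quota).
local_queues = {"ml-research-queue": ClusterQueue({"nvidia.com/gpu": 8})}

def submit(queue_name, request):
    return local_queues[queue_name].try_admit(request)

print(submit("ml-research-queue", {"nvidia.com/gpu": 6}))  # True: admitted
print(submit("ml-research-queue", {"nvidia.com/gpu": 4}))  # False: queued
```

The key contrast with a plain scheduler is the False case: the second Workload's Pods are never created, so they cannot hold partial resources while waiting.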

11. Kueue's ResourceFlavor, ClusterQueue, and LocalQueue

11.1 ResourceFlavor

ResourceFlavor defines the types of resources available in the cluster. It is used to distinguish node characteristics such as pricing, architecture, and availability.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-a100
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-t4-spot
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/gke-provisioning: spot
  tolerations:
    - key: cloud.google.com/gke-spot
      operator: Equal
      value: 'true'
      effect: NoSchedule

In the example above, gpu-a100 represents on-demand A100 GPU nodes, while gpu-t4-spot represents Spot T4 GPU nodes. Tolerations are configured to allow Pods to be scheduled on Spot nodes.

11.2 ClusterQueue

ClusterQueue defines a cluster-scoped resource pool and manages Quota and Fair Sharing rules.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-cluster-queue
spec:
  cohort: ml-team-cohort
  namespaceSelector: {}
  preemption:
    withinClusterQueue: LowerPriority
    reclaimWithinCohort: LowerPriority
    borrowWithinCohort:
      policy: LowerPriority
      maxPriorityThreshold: 100
  resourceGroups:
    - coveredResources: ['cpu', 'memory', 'nvidia.com/gpu']
      flavors:
        - name: gpu-a100
          resources:
            - name: 'cpu'
              nominalQuota: 64
            - name: 'memory'
              nominalQuota: 256Gi
            - name: 'nvidia.com/gpu'
              nominalQuota: 8
              borrowingLimit: 4
              lendingLimit: 2
        - name: gpu-t4-spot
          resources:
            - name: 'cpu'
              nominalQuota: 128
            - name: 'memory'
              nominalQuota: 512Gi
            - name: 'nvidia.com/gpu'
              nominalQuota: 16

Key field descriptions:

  • cohort: ClusterQueues belonging to the same Cohort can borrow idle quota from each other.
  • nominalQuota: The guaranteed baseline amount of resources.
  • borrowingLimit: The maximum amount of resources that can be borrowed from other ClusterQueues in the Cohort.
  • lendingLimit: The maximum amount of resources that can be lent to other ClusterQueues.
  • preemption.withinClusterQueue: Whether lower-priority Workloads within the same ClusterQueue can be preempted. When set to LowerPriority, higher-priority Workloads can interrupt lower-priority ones.
  • preemption.reclaimWithinCohort: Whether resources borrowed by other ClusterQueues in the Cohort can be reclaimed (preempted).
  • flavorFungibility: When a Workload can fit into multiple ResourceFlavors, controls whether Kueue attempts borrowing or preemption within the current Flavor before moving on to the next one.
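The borrowing arithmetic can be illustrated with the values from the YAML above (a sketch of the quota math only, not Kueue internals; the peer-queue figures are invented for the example):

```python
# A ClusterQueue may exceed nominalQuota by borrowing idle quota from
# cohort peers, capped by its own borrowingLimit and each peer's
# lendingLimit.

def max_usable_gpus(nominal, borrowing_limit, peers):
    """peers: list of (peer_idle_gpus, peer_lending_limit) tuples."""
    lendable = sum(min(idle, lend) for idle, lend in peers)
    return nominal + min(borrowing_limit, lendable)

# ml-cluster-queue: nominalQuota 8, borrowingLimit 4 (gpu-a100 flavor).
# One cohort peer has 6 idle GPUs but a lendingLimit of 2:
print(max_usable_gpus(8, 4, [(6, 2)]))          # 8 + 2 = 10
# Two generous peers: our own borrowingLimit caps usage at 8 + 4 = 12.
print(max_usable_gpus(8, 4, [(6, 6), (4, 4)]))  # 12
```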

11.3 LocalQueue

LocalQueue is the namespace-scoped entry point where users submit Jobs.

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-research-queue
  namespace: ml-training
spec:
  clusterQueue: ml-cluster-queue

Users submit Jobs to a LocalQueue by adding the kueue.x-k8s.io/queue-name label:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: my-training-job
  namespace: ml-training
  labels:
    kueue.x-k8s.io/queue-name: ml-research-queue
spec:
  # ... PyTorchJob spec

11.4 Kueue Queuing Strategies

Kueue supports two queuing strategies:

  • StrictFIFO: Strict first-in, first-out. If a Workload at the front of the queue cannot be admitted, Workloads behind it are also blocked from admission.
  • BestEffortFIFO: Even if a Workload at the front cannot be admitted due to insufficient resources, a Workload further back may be admitted first if it has lower resource requirements.
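The difference between the two strategies comes down to what happens when the head of the queue does not fit (an illustrative sketch; real Kueue admission also re-evaluates quota, flavors, and preemption):

```python
# With 4 free GPUs and a 6-GPU workload at the head, StrictFIFO admits
# nothing, while BestEffortFIFO skips the head and admits the smaller
# workload queued behind it.

def admit(queue, free_gpus, strategy):
    admitted = []
    for name, need in queue:
        if need <= free_gpus:
            free_gpus -= need
            admitted.append(name)
        elif strategy == "StrictFIFO":
            break  # the blocked head also blocks everything behind it
        # BestEffortFIFO: leave it queued and keep scanning
    return admitted

queue = [("big-job", 6), ("small-job", 2)]
print(admit(queue, 4, "StrictFIFO"))      # []
print(admit(queue, 4, "BestEffortFIFO"))  # ['small-job']
```

StrictFIFO preserves ordering guarantees at the cost of utilization; BestEffortFIFO keeps the cluster busy but can delay large Workloads indefinitely under a stream of small ones.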

12. Training on Spot/Preemptible Instances (Fault Tolerance)

12.1 Advantages and Risks of Spot Instances

Cloud Spot Instances (AWS), Preemptible VMs (GCP), and Spot VMs (Azure) can reduce costs by 60-90% compared to on-demand pricing. Since the bulk of AI training costs comes from GPU computing, leveraging Spot Instances is highly attractive.

However, Spot Instances can be reclaimed (preempted) by the cloud provider at any time, with only a short notice period of typically 30 seconds to 2 minutes. If a training Job is preempted without a checkpoint, all progress is lost.

12.2 Fault-tolerant Training Strategies

Checkpoint-based Recovery

The most fundamental strategy is periodic checkpoint saving. The Rank 0 process saves model parameters, optimizer state, and training progress to persistent storage at regular intervals (per epoch or per step). When Pods restart, training resumes from the last checkpoint.

# Kubernetes sends SIGTERM when a Spot node is reclaimed, so a signal
# handler can save an emergency checkpoint before the grace period
# expires. (A preStop Hook, shown below, is an alternative trigger.)
import signal
import sys

import torch
import torch.distributed as dist

def graceful_shutdown(signum, frame):
    """Save an emergency checkpoint when the reclamation notice arrives."""
    # current_epoch, current_step, model, and optimizer refer to the
    # live objects from your training loop.
    if dist.get_rank() == 0:
        torch.save({
            'epoch': current_epoch,
            'step': current_step,
            'model_state_dict': model.module.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
        }, '/checkpoints/emergency_checkpoint.pt')
    dist.barrier()
    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_shutdown)

Kubernetes preStop Hook Configuration

lifecycle:
  preStop:
    exec:
      command: ['/bin/sh', '-c', 'python save_checkpoint.py --emergency']

TorchElastic (Elastic Training)

PyTorch's TorchElastic (now part of torchrun as torch.distributed.elastic) provides elastic, fault-tolerant launching of distributed training, which makes it well suited to Spot Instance environments.

Key features:

  • Elastic scaling: Training continues even when Workers are added or removed.
  • Automatic restart: Training automatically resumes upon Worker failure.
  • Partial execution: Training can start even when only a portion of the requested GPUs are available.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-training
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: etcd-service
    rdzvPort: 2379
    minReplicas: 2
    maxReplicas: 8
  pytorchReplicaSpecs:
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            cloud.google.com/gke-provisioning: spot
          tolerations:
            - key: cloud.google.com/gke-spot
              operator: Equal
              value: 'true'
              effect: NoSchedule
          containers:
            - name: pytorch
              image: my-registry/elastic-training:latest
              resources:
                limits:
                  nvidia.com/gpu: 1

12.3 Hybrid Cluster Strategy

In production environments, a Hybrid strategy mixing on-demand and Spot nodes is commonly used:

  • On-demand nodes: Host core cluster components (etcd, Kueue Controller, Training Operator). Set Taints to prevent general workloads from being placed on these nodes.
  • Spot nodes: Host the actual training Worker Pods. Manage Spot nodes separately using Kueue's ResourceFlavor, and select GPUs with the best cost-efficiency ratio.

When combined with Kueue's Preemption policies, Kueue can automatically reassign Workloads to other available resources when Spot nodes are reclaimed.


13. References

This article was written based on the following official documentation and resources.

Volcano Official Documentation

Kubeflow Training Operator Official Documentation

Kueue Official Documentation
