- 1. Limitations of the Default Kubernetes Scheduler
- 2. Volcano Scheduler Official Documentation Analysis
- 3. Volcano Gang Scheduling and Fair-share Scheduling Policies
- 4. Volcano Job CRD and Plugin System
- 5. Kubeflow Training Operator Official Documentation Analysis
- 6. PyTorchJob CRD Detailed Analysis and YAML Examples
- 7. MPIJob and TFJob Support
- 8. Multi-node Multi-GPU Distributed Training Setup (PyTorch DDP on K8s)
- 9. Managing Training Data and Checkpoints with PVC/PV
- 10. Kueue: The Next-Generation Job Queuing System
- 11. Kueue's ResourceFlavor, ClusterQueue, and LocalQueue
- 12. Training on Spot/Preemptible Instances (Fault Tolerance)
- 13. References
1. Limitations of the Default Kubernetes Scheduler
The default Kubernetes scheduler (kube-scheduler) performs scheduling at the Pod level. Its core role is to place each Pod on an appropriate Node based on the resources it requests (CPU, Memory, GPU). While this architecture works well for typical workloads like web services and API servers, it has fundamental limitations when it comes to batch workloads such as AI/ML distributed training.
1.1 Lack of Gang Scheduling
The biggest challenge in distributed training is the absence of Gang Scheduling. For example, in PyTorch DDP (Distributed Data Parallel) training, all 4 Worker Pods must start simultaneously for training to proceed. Since the default scheduler schedules Pods individually, it may schedule 3 out of 4 while the remaining one stays in a Pending state due to insufficient resources. The 3 already-placed Pods occupy GPU resources, but training cannot begin — resulting in resource waste.
1.2 Deadlock Problem
An even more serious issue is deadlock. Suppose two distributed training Jobs each require 4 GPUs, but the cluster only has 6 GPUs in total. The default scheduler might allocate 3 GPUs to Job A and 3 to Job B. Since both Jobs need a minimum of 4, neither can start training. This is exactly the "All or Nothing" problem that Gang Scheduling solves.
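The failure mode above can be made concrete with a toy simulation. This is an illustrative sketch, not scheduler code: it contrasts per-Pod placement with all-or-nothing admission for two 4-GPU Jobs on a 6-GPU cluster.

```python
# Hypothetical sketch (not Volcano/kube-scheduler code): two jobs, each
# needing 4 GPUs, competing for a 6-GPU cluster.

def schedule_per_pod(jobs, total_gpus):
    """Mimic the default scheduler: place Pods one by one, round-robin."""
    placed = {name: 0 for name, _ in jobs}
    free = total_gpus
    progress = True
    while free > 0 and progress:
        progress = False
        for name, need in jobs:
            if placed[name] < need and free > 0:
                placed[name] += 1
                free -= 1
                progress = True
    # A job can only start if ALL of its Pods were placed.
    return [name for name, need in jobs if placed[name] == need]

def schedule_gang(jobs, total_gpus):
    """All or Nothing: admit a job only if its full demand fits at once."""
    free = total_gpus
    started = []
    for name, need in jobs:
        if need <= free:
            free -= need
            started.append(name)
    return started

jobs = [("job-a", 4), ("job-b", 4)]
print(schedule_per_pod(jobs, total_gpus=6))  # [] -- deadlock: 3+3 GPUs held, neither job starts
print(schedule_gang(jobs, total_gpus=6))     # ['job-a'] -- job-b waits; no GPUs sit idle-but-held
```

Per-Pod placement ends with each job holding 3 GPUs and neither able to start; gang admission runs one job to completion while the other waits in the queue.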
1.3 Lack of Fair-share and Queue-based Resource Management
The default scheduler has no concept of Queues for resource sharing across teams. In AI organizations, multiple research teams share GPU clusters, and implementing fair-share policies — such as setting per-team quotas and sharing idle resources — is difficult with the default scheduler alone.
These limitations gave rise to projects like Volcano, Kubeflow Training Operator, and Kueue.
2. Volcano Scheduler Official Documentation Analysis
Volcano is a project under the CNCF (Cloud Native Computing Foundation) that serves as both a scheduler and controller for handling high-performance batch workloads on Kubernetes. It is a batch scheduling system designed for AI/ML, BigData, HPC (High Performance Computing), and similar workloads.
2.1 Architecture
Volcano consists of three core components:
- Volcano Scheduler: A scheduler that replaces or runs alongside kube-scheduler. It performs scheduling at the Job/PodGroup level rather than the Pod level.
- Volcano Controller Manager: A controller that manages the lifecycle of CRDs such as VolcanoJob, Queue, and PodGroup.
- Volcano Webhook (Admission Controller): An Admission Webhook responsible for validation and default value injection when CRDs are created.
The Volcano Scheduler is compatible with Kubernetes' Scheduling Framework and performs the PreFilter/Filter and Score phases of the default scheduler through predicates and nodeorder plugins. On top of this, it adds advanced scheduling strategies like Gang Scheduling, Fair-share, and Binpack as plugins.
2.2 Queue CRD
Queue is a CRD that logically partitions cluster resources to provide resource isolation in multi-tenant environments.
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-research-queue
spec:
  weight: 4
  reclaimable: true
  capability:
    cpu: '64'
    memory: '256Gi'
    nvidia.com/gpu: '8'
```
- weight: The resource allocation ratio among Queues. A higher weight means more resources are allocated.
- reclaimable: Whether idle resources can be lent to other Queues. When set to `true`, idle resources in this Queue can be used by other Queues.
- capability: The maximum resource limits the Queue can use. Specifies CPU, Memory, GPU, and more.
2.3 PodGroup CRD
PodGroup is a CRD that defines a group of tightly coupled Pods and serves as the core unit for Gang Scheduling.
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: pytorch-training-pg
  namespace: default
spec:
  minMember: 4
  minResources:
    cpu: '16'
    memory: '64Gi'
    nvidia.com/gpu: '4'
  priorityClassName: high-priority
  queue: ml-research-queue
```
- minMember: The minimum number of Pods that must run within the PodGroup. If cluster resources cannot satisfy this number, no Pods in the PodGroup will be scheduled.
- minResources: The minimum resources required to run the PodGroup. If available resources do not meet this threshold, scheduling is deferred.
- priorityClassName: The priority of the PodGroup. Used by the scheduler to sort PodGroups within a Queue.
- queue: The Queue name to which the PodGroup belongs. The Queue must be in an Open state.
The PodGroup Status Phase transitions as follows: Pending (resource requirements not yet met) -> Inqueue (admission passed, waiting for node binding) -> Running (at least minMember Pods running). Unknown is an abnormal state entered when some Pods in the group cannot be scheduled. When a VolcanoJob is created, its PodGroup is generated automatically.
3. Volcano Gang Scheduling and Fair-share Scheduling Policies
3.1 Gang Scheduling
Gang Scheduling is the most critical feature of Volcano. Using an "All or Nothing" strategy, it only places Pods when all Pods comprising a Job (or at least the minimum number specified by minMember) can be scheduled simultaneously.
Gang Scheduling is essential in the following scenarios:
- MPI-based distributed training: All Workers must start simultaneously for allreduce communication to work.
- PyTorch DDP: Both Master and Workers must be ready before training can begin.
- Spark/Big Data: Both Driver and Executors must be ready before a Job can run.
The Gang Plugin checks the PodGroup's minMember field to determine whether that number of Pods can be simultaneously scheduled on the cluster. If not, no Pods are scheduled at all, preventing resource waste and deadlock.
3.2 DRF (Dominant Resource Fairness) Scheduling
DRF is an algorithm that implements fair resource distribution in multi-tenant environments. It calculates fair-share based on each Job's Dominant Resource.
For example, if Job A is CPU-intensive (8 CPU cores, 4GB Memory) and Job B is memory-intensive (2 CPU cores, 32GB Memory), the scheduler prioritizes the Job with the lowest dominant resource ratio to ensure fairness. The DRF Plugin considers multi-dimensional resources (CPU, Memory, GPU) simultaneously to achieve fair distribution without bias toward any particular resource type.
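A small sketch makes the dominant-share arithmetic concrete. This is illustrative only, not Volcano's implementation; the cluster totals are assumed figures matching the example above.

```python
# Assumed cluster capacity for this illustration.
CLUSTER = {"cpu": 64, "memory_gb": 256}

def dominant_share(usage: dict) -> float:
    """A job's dominant share is the largest fraction it consumes
    of any single resource dimension."""
    return max(usage[r] / CLUSTER[r] for r in usage)

job_a = {"cpu": 8, "memory_gb": 4}    # CPU-intensive
job_b = {"cpu": 2, "memory_gb": 32}   # memory-intensive

print(dominant_share(job_a))  # 0.125 -- dominant resource: CPU (8/64)
print(dominant_share(job_b))  # 0.125 -- dominant resource: memory (32/256)
```

After one allocation each, both jobs hold an equal dominant share of 0.125, which is exactly the balance DRF aims for: the scheduler would next serve whichever job's dominant share falls behind the other's.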
3.3 Proportion Plugin
The Proportion Plugin manages resource ratios per Queue. It proportionally distributes the total cluster resources based on each Queue's weight and allocates resources within the Queue's capability limits. When multiple teams share a GPU cluster, you can create per-team Queues and set weights to adjust resource allocation ratios.
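The weight arithmetic can be sketched as follows. This is a simplified model that ignores capability caps and reclaiming; the queue names and GPU counts are illustrative.

```python
def proportional_shares(weights: dict, total_gpus: int) -> dict:
    """Divide total capacity among queues in proportion to their weights,
    as the proportion plugin does (simplified: no capability caps)."""
    w_sum = sum(weights.values())
    return {queue: total_gpus * w / w_sum for queue, w in weights.items()}

queues = {"ml-research-queue": 4, "data-eng-queue": 2, "default": 2}
print(proportional_shares(queues, total_gpus=16))
# ml-research-queue gets 4/8 of 16 = 8 GPUs; the other two queues get 4 each
```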
4. Volcano Job CRD and Plugin System
4.1 VolcanoJob CRD
VolcanoJob is Volcano's core CRD for defining batch workloads. It allows you to group multiple Tasks (roles) into a single Job.
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-ddp-training
spec:
  minAvailable: 5
  schedulerName: volcano
  queue: ml-research-queue
  plugins:
    env: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: master
      template:
        spec:
          containers:
            - image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
              name: master
              command: ['torchrun']
              args: ['--nproc_per_node=1', '--nnodes=5', '--node_rank=0', 'train.py']
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: OnFailure
    - replicas: 4
      name: worker
      template:
        spec:
          containers:
            - image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
              name: worker
              command: ['torchrun']
              args: ['--nproc_per_node=1', '--nnodes=5', 'train.py']
              resources:
                limits:
                  nvidia.com/gpu: 1
          restartPolicy: OnFailure
```
- minAvailable: The minimum number of Pods for Gang Scheduling. Serves the same role as `minMember` in PodGroup.
- schedulerName: Set to `volcano` to use the Volcano Scheduler.
- queue: The Queue name where the Job is submitted.
- plugins: List of Volcano plugins to activate. `env` handles automatic environment variable injection, and `svc` handles automatic Service creation.
- policies: Event-based action policies. Defines actions such as restarting the Job when a Pod is evicted.
- tasks: List of Tasks comprising the Job. Each Task defines replicas and a Pod Template per role (master, worker).
4.2 Plugin System Details
The Volcano Scheduler adopts a plugin architecture that allows flexible extension of scheduling strategies. The major plugins are as follows:
| Plugin | Function | Suitable Workloads |
|---|---|---|
| Gang | All or Nothing scheduling | Distributed training, MPI, Spark |
| Binpack | Maximize utilization of existing nodes | Small Jobs, resource efficiency optimization |
| DRF | Dominant Resource-based Fair-share | Multi-tenant batch processing |
| Proportion | Per-Queue resource ratio management | Team-based resource isolation |
| Priority | Priority-based scheduling | Multiple Jobs with different priorities |
| Task-Topology | Task Affinity/Anti-Affinity | Distributed training requiring network optimization |
| TDM | Time-division resource sharing | Kubernetes + YARN hybrid environments |
| SLA | Wait time limits | Batches with real-time service requirements |
| Numa-aware | NUMA topology-aware scheduling | HPC where CPU cache affinity matters |
| Predicates | GPU resource-based pre-filtering | GPU-intensive AI workloads |
| Nodeorder | Multi-dimensional node scoring | Custom composite scheduling |
The Volcano Scheduler configuration file uses actions and tiers to compose plugin combinations:
```yaml
actions: 'enqueue, allocate, backfill'
tiers:
  - plugins:
      - name: priority
      - name: gang
      - name: conformance
  - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```
5. Kubeflow Training Operator Official Documentation Analysis
Kubeflow Training Operator is an Operator that manages distributed AI training Jobs on Kubernetes. It supports various frameworks including PyTorch, TensorFlow, MPI, and XGBoost, providing framework-specific CRDs.
5.1 V1 (Legacy) and V2 Architecture
The Training Operator currently has two coexisting versions.
V1 (Legacy): Uses separate CRDs per framework (PyTorchJob, TFJob, MPIJob, etc.). Each CRD provides ReplicaSpecs (Master, Worker, PS, etc.) tailored to the distributed training patterns of the respective framework.
V2 (Current): Unified into a single TrainJob CRD. Framework-specific configurations are managed through TrainingRuntime and ClusterTrainingRuntime. It supports a broader range of frameworks including PyTorch, MLX, HuggingFace, DeepSpeed, JAX, and XGBoost.
This article focuses on the V1 PyTorchJob, which is widely used in production, while also examining the direction of V2.
5.2 Role of the Training Operator
The Training Operator automatically handles the following:
- Automatic environment variable configuration: Injects the environment variables required for distributed training, such as `WORLD_SIZE`, `RANK`, `MASTER_ADDR`, and `MASTER_PORT`, into Pods automatically.
- torchrun CLI configuration: Sets up the environment so that PyTorch's `torchrun` (formerly `torch.distributed.launch`) operates correctly.
- Job state management: Manages the lifecycle from Created -> Running -> Succeeded/Failed, applying restart policies on failure.
- Service creation: Automatically creates Headless Services for inter-Pod communication.
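From the training script's point of view, all of this arrives as plain environment variables. A minimal sketch of reading them (the fallback defaults are illustrative values for single-process local runs, not something the operator sets):

```python
import os

def distributed_context() -> dict:
    """Read the distributed-training context that the Training Operator
    (or torchrun) injects into each Pod as environment variables."""
    return {
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
    }
```

Keeping cluster wiring in the environment means the same script runs unchanged on a laptop and in a multi-node PyTorchJob.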
6. PyTorchJob CRD Detailed Analysis and YAML Examples
PyTorchJob is a CRD for running PyTorch distributed training Jobs on Kubernetes.
6.1 PyTorchJob Structure
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-ddp-mnist
  namespace: ml-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-training:latest
              command:
                - 'torchrun'
                - '--nproc_per_node=2'
                - '--nnodes=3'
                - '--node_rank=0'
                - '--master_addr=$(MASTER_ADDR)'
                - '--master_port=$(MASTER_PORT)'
                - 'train.py'
                - '--epochs=100'
                - '--batch-size=256'
              env:
                - name: NCCL_DEBUG
                  value: 'INFO'
              resources:
                limits:
                  nvidia.com/gpu: 2
                  memory: '32Gi'
                  cpu: '8'
              volumeMounts:
                - name: training-data
                  mountPath: /data
                - name: checkpoints
                  mountPath: /checkpoints
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: training-data-pvc
            - name: checkpoints
              persistentVolumeClaim:
                claimName: checkpoints-pvc
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: 'false'
        spec:
          containers:
            - name: pytorch
              image: my-registry/pytorch-training:latest
              command:
                - 'torchrun'
                - '--nproc_per_node=2'
                - '--nnodes=3'
                - '--master_addr=$(MASTER_ADDR)'
                - '--master_port=$(MASTER_PORT)'
                - 'train.py'
                - '--epochs=100'
                - '--batch-size=256'
              env:
                - name: NCCL_DEBUG
                  value: 'INFO'
              resources:
                limits:
                  nvidia.com/gpu: 2
                  memory: '32Gi'
                  cpu: '8'
              volumeMounts:
                - name: training-data
                  mountPath: /data
                - name: checkpoints
                  mountPath: /checkpoints
          volumes:
            - name: training-data
              persistentVolumeClaim:
                claimName: training-data-pvc
            - name: checkpoints
              persistentVolumeClaim:
                claimName: checkpoints-pvc
```
6.2 Key Field Descriptions
- Master: Must always have `replicas: 1`. It represents the state of the training Job and serves as the rendezvous point for distributed process-group initialization. The Training Operator injects this Pod's address as `MASTER_ADDR` into the other Pods.
- Worker: The Pods that perform the actual training. Adjust the number of replicas to set the scale of distributed training.
- restartPolicy: When set to `OnFailure`, Pods are automatically restarted upon failure.
- sidecar.istio.io/inject: "false": Disables sidecar injection so that PyTorchJob works correctly in clusters with Istio installed. This is explicitly recommended in the official documentation.
6.3 Job Status Monitoring
```bash
kubectl get pytorchjobs -n ml-training
kubectl describe pytorchjob pytorch-ddp-mnist -n ml-training
kubectl get pods -l training.kubeflow.org/job-name=pytorch-ddp-mnist -n ml-training
```
The PyTorchJob status can be checked in the conditions field, transitioning in order: Created -> Running -> Succeeded.
7. MPIJob and TFJob Support
7.1 MPIJob
MPIJob is a CRD for MPI (Message Passing Interface)-based distributed training. It is primarily used with allreduce-based frameworks like Horovod.
```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: horovod-training
spec:
  slotsPerWorker: 2
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: mpi-launcher
              image: horovod/horovod:latest
              command:
                - mpirun
                - --allow-run-as-root
                - -np
                - '8'
                - -bind-to
                - none
                - -map-by
                - slot
                - python
                - train.py
              resources:
                limits:
                  cpu: '2'
                  memory: '4Gi'
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: mpi-worker
              image: horovod/horovod:latest
              resources:
                limits:
                  nvidia.com/gpu: 2
                  cpu: '8'
                  memory: '32Gi'
```
Key characteristics of MPIJob:
- Launcher: The Pod that runs the `mpirun` command. It does not perform actual training but distributes work to the Workers.
- Worker: The Pods that perform the actual training. `slotsPerWorker` specifies the number of processes (GPUs) per Worker.
- Being framework-agnostic, MPIJob can be used with any framework that supports MPI, including Horovod, PyTorch, TensorFlow, and MXNet.
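One detail worth making explicit: the `-np` value passed to `mpirun` must equal the total process count, i.e. worker replicas times `slotsPerWorker`. A trivial helper (illustrative, not part of the operator) shows the arithmetic for the manifest above:

```python
def total_mpi_processes(worker_replicas: int, slots_per_worker: int) -> int:
    """Total MPI processes = the -np value mpirun should receive."""
    return worker_replicas * slots_per_worker

print(total_mpi_processes(4, 2))  # 8 -- matches the '-np 8' in the manifest above
```

If the numbers disagree, `mpirun` either oversubscribes slots or leaves GPUs idle, so it is worth double-checking whenever replicas change.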
7.2 TFJob
TFJob is a CRD designed for TensorFlow's Parameter Server distributed training strategy.
```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-ps-training
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.14.0-gpu
              command: ['python', 'train.py']
              resources:
                limits:
                  cpu: '4'
                  memory: '16Gi'
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:2.14.0-gpu
              command: ['python', 'train.py']
              resources:
                limits:
                  nvidia.com/gpu: 1
                  cpu: '8'
                  memory: '32Gi'
```
Key characteristics of TFJob:
- PS (Parameter Server): Stores model parameters, receives gradients from Workers, and updates the parameters.
- Worker: Processes a portion of the training data, computes gradients, and sends them to the PS.
- The Training Operator automatically configures the `TF_CONFIG` environment variable to ensure TensorFlow's distribution strategy works correctly.
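The injected `TF_CONFIG` is a JSON document describing the cluster membership and the Pod's own role. A sketch of its shape (the hostnames are illustrative; the real values are generated by the operator from the Job's Services):

```python
import json

def build_tf_config(ps_hosts, worker_hosts, task_type, task_index) -> str:
    """Assemble a TF_CONFIG-shaped JSON string: a cluster map plus
    this process's own task identity."""
    return json.dumps({
        "cluster": {"ps": ps_hosts, "worker": worker_hosts},
        "task": {"type": task_type, "index": task_index},
    })

cfg = build_tf_config(
    ps_hosts=["tf-ps-training-ps-0:2222", "tf-ps-training-ps-1:2222"],
    worker_hosts=[f"tf-ps-training-worker-{i}:2222" for i in range(4)],
    task_type="worker",
    task_index=0,
)
```

Each Pod receives the same `cluster` section but a different `task` section, which is how TensorFlow's distribution strategy tells the replicas apart.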
8. Multi-node Multi-GPU Distributed Training Setup (PyTorch DDP on K8s)
8.1 How PyTorch DDP Works
PyTorch DDP (Distributed Data Parallel) is a data-parallel distributed training strategy. It places a model replica on each GPU, splits mini-batches across the GPUs for training, and then synchronizes gradients using an allreduce operation.
Key concepts you need to understand when setting up Multi-node Multi-GPU DDP on Kubernetes:
- WORLD_SIZE: The total number of processes. (number of nodes) x (GPUs per node). Example: 3 nodes, 2 GPUs each = WORLD_SIZE 6.
- RANK: A unique ID for each process (from 0 to WORLD_SIZE-1).
- LOCAL_RANK: The process ID within a node (from 0 to GPUs per node - 1).
- MASTER_ADDR: The address of the Master node. Automatically set by the Training Operator.
- MASTER_PORT: The communication port on the Master node. Automatically set by the Training Operator.
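The relationships between these variables reduce to two identities, shown here as a checkable sketch:

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """WORLD_SIZE = number of nodes x processes (GPUs) per node."""
    return nnodes * nproc_per_node

def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """RANK = node_rank * nproc_per_node + LOCAL_RANK."""
    return node_rank * nproc_per_node + local_rank

print(world_size(3, 2))      # 6 -- the example above: 3 nodes, 2 GPUs each
print(global_rank(2, 2, 1))  # 5 -- the last process; ranks run 0..WORLD_SIZE-1
```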
8.2 Training Code Example
When using the Training Operator, the training code itself uses PyTorch's native Distributed API:
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE in the
    # environment, so init_process_group needs no explicit arguments.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # MyModel, train_dataset, optimizer, and num_epochs are defined
    # elsewhere in the training script.
    model = MyModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Use DistributedSampler so each process sees a distinct shard of the data
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=64,
        sampler=train_sampler,
    )

    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for batch in train_loader:
            loss = model(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        # Save checkpoints only on rank 0 to avoid concurrent writes
        if dist.get_rank() == 0:
            torch.save({
                'epoch': epoch,
                'model_state_dict': model.module.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
            }, f'/checkpoints/checkpoint_epoch_{epoch}.pt')

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
8.3 NCCL Communication Optimization
In multi-node GPU communication, the performance of NCCL (NVIDIA Collective Communication Library) directly impacts training speed. Tips for optimizing NCCL in a Kubernetes environment:
```yaml
env:
  - name: NCCL_DEBUG
    value: 'INFO'   # Check NCCL communication logs during debugging
  - name: NCCL_IB_DISABLE
    value: '0'      # Enable when InfiniBand is available
  - name: NCCL_SOCKET_IFNAME
    value: 'eth0'   # Specify the network interface for communication
  - name: NCCL_P2P_LEVEL
    value: 'NVL'    # Set P2P communication level when using NVLink
```
In environments with RDMA networks like InfiniBand or RoCE, you can also set hostNetwork: true on Pods to maximize network performance.
9. Managing Training Data and Checkpoints with PVC/PV
In distributed training, storage design is critical for training data access, checkpoint persistence, and model artifact management.
9.1 PVC for Training Data
Training data must be readable by all Workers, requiring ReadOnlyMany (ROX) or ReadWriteMany (RWX) Access Modes.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc
  namespace: ml-training
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: nfs-storage
  resources:
    requests:
      storage: 500Gi
```
9.2 PVC for Checkpoints
Since only the Rank 0 process saves checkpoints, ReadWriteOnce (RWO) is sufficient. However, ReadWriteMany (RWX) offers more flexibility when considering Fault Tolerance.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: checkpoints-pvc
  namespace: ml-training
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
```
9.3 Storage Backend Selection
| Storage Backend | Access Mode | Performance Characteristics | Suitable Use Cases |
|---|---|---|---|
| NFS | RWX/ROX | Medium bandwidth, high compatibility | Training data sharing |
| CephFS | RWX | High bandwidth, distributed storage | Large-scale datasets |
| Lustre / GPFS | RWX | Very high bandwidth | HPC-level I/O |
| Local SSD | RWO | Highest IOPS | Checkpoints, local cache |
| S3 (CSI) | RWX | High bandwidth, unlimited capacity | Object storage-based datasets |
For large-scale training datasets, direct access on NFS or CephFS can cause I/O bottlenecks. In such cases, consider using an Init Container pattern to copy data to Local SSD before training starts, or mounting S3-compatible object storage via a CSI Driver.
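The Init Container staging pattern mentioned above boils down to a recursive copy from the shared mount to local disk before the training process starts. A minimal sketch (the directory layout and paths are illustrative):

```python
import pathlib
import shutil

def stage_dataset(shared_dir: str, local_dir: str) -> int:
    """Copy every file under the shared (e.g. NFS-backed) mount to fast
    local storage, preserving the directory layout. Returns bytes copied."""
    src, dst = pathlib.Path(shared_dir), pathlib.Path(local_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for f in src.rglob("*"):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)
            copied += target.stat().st_size
    return copied
```

In a real Pod this would run in an Init Container with the shared PVC mounted read-only and an `emptyDir` or local-SSD volume as the destination, so the training container only ever touches local disk.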
10. Kueue: The Next-Generation Job Queuing System
Kueue is a cloud-native Job queuing system developed under the Kubernetes SIG (Special Interest Group). It provides Job-level Admission Control for Batch, HPC, and AI/ML workloads.
10.1 Kueue vs Volcano
Kueue and Volcano operate at different layers.
- Volcano: Operates at the Scheduler level. Decides which Node to place Pods on. Replaces kube-scheduler.
- Kueue: Operates at the Admission Control level. Decides when a Job can be admitted (started). Works alongside kube-scheduler without replacing it.
Kueue decides "should this Job start?" while Volcano (or kube-scheduler) decides "where should the Pods of a started Job be placed?" The two systems can be used complementarily.
10.2 Core Concepts of Kueue
The Admission process in Kueue works as follows:
- When a user creates a Job, Kueue converts it into a Workload.
- The Workload is submitted to a LocalQueue.
- The LocalQueue references a ClusterQueue.
- The ClusterQueue checks quota from the resource pool defined in ResourceFlavor.
- If quota is sufficient, the Workload is Admitted and actual Pods are created.
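The quota check at the heart of this flow can be modeled in a few lines. This is a toy model, not Kueue's code; the resource names and figures follow the examples in this article.

```python
def can_admit(request: dict, quota: dict, used: dict) -> bool:
    """Admit a Workload only if every requested resource still fits
    within the ClusterQueue's remaining quota."""
    return all(used.get(r, 0) + amount <= quota.get(r, 0)
               for r, amount in request.items())

quota = {"nvidia.com/gpu": 8, "cpu": 64}
used = {"nvidia.com/gpu": 6, "cpu": 24}

print(can_admit({"nvidia.com/gpu": 2, "cpu": 8}, quota, used))  # True  -- fits exactly
print(can_admit({"nvidia.com/gpu": 4, "cpu": 8}, quota, used))  # False -- would exceed GPU quota
```

Only after this admission decision succeeds are the Workload's Pods actually created and handed to the scheduler for placement.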
11. Kueue's ResourceFlavor, ClusterQueue, and LocalQueue
11.1 ResourceFlavor
ResourceFlavor defines the types of resources available in the cluster. It is used to distinguish node characteristics such as pricing, architecture, and availability.
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-a100
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-t4-spot
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
    cloud.google.com/gke-provisioning: spot
  tolerations:
    - key: cloud.google.com/gke-spot
      operator: Equal
      value: 'true'
      effect: NoSchedule
```
In the example above, gpu-a100 represents on-demand A100 GPU nodes, while gpu-t4-spot represents Spot T4 GPU nodes. Tolerations are configured to allow Pods to be scheduled on Spot nodes.
11.2 ClusterQueue
ClusterQueue defines a cluster-scoped resource pool and manages Quota and Fair Sharing rules.
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-cluster-queue
spec:
  cohort: ml-team-cohort
  namespaceSelector: {}
  preemption:
    withinClusterQueue: LowerPriority
    reclaimWithinCohort: LowerPriority
    borrowWithinCohort:
      policy: LowerPriority
      maxPriorityThreshold: 100
  resourceGroups:
    - coveredResources: ['cpu', 'memory', 'nvidia.com/gpu']
      flavors:
        - name: gpu-a100
          resources:
            - name: 'cpu'
              nominalQuota: 64
            - name: 'memory'
              nominalQuota: 256Gi
            - name: 'nvidia.com/gpu'
              nominalQuota: 8
              borrowingLimit: 4
              lendingLimit: 2
        - name: gpu-t4-spot
          resources:
            - name: 'cpu'
              nominalQuota: 128
            - name: 'memory'
              nominalQuota: 512Gi
            - name: 'nvidia.com/gpu'
              nominalQuota: 16
```
Key field descriptions:
- cohort: ClusterQueues belonging to the same Cohort can borrow idle quota from each other.
- nominalQuota: The guaranteed baseline amount of resources.
- borrowingLimit: The maximum amount of resources that can be borrowed from other ClusterQueues in the Cohort.
- lendingLimit: The maximum amount of resources that can be lent to other ClusterQueues.
- preemption.withinClusterQueue: Whether lower-priority Workloads within the same ClusterQueue can be preempted. When set to `LowerPriority`, higher-priority Workloads can interrupt lower-priority ones.
- preemption.reclaimWithinCohort: Whether resources borrowed by other ClusterQueues in the Cohort can be reclaimed (preempted).
- flavorFungibility: When multiple ResourceFlavors exist, determines the priority of Borrowing and Preemption.
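The capacity arithmetic implied by these fields can be sketched as follows. This is a simplified reading of the semantics, illustrative rather than Kueue's exact accounting:

```python
def max_usable(nominal: int, borrowing_limit: int) -> int:
    """Peak capacity a ClusterQueue can use of a resource:
    its nominal quota plus what it may borrow from the cohort."""
    return nominal + borrowing_limit

def reserved_when_lending(nominal: int, lending_limit: int) -> int:
    """Capacity a ClusterQueue keeps for itself even while idle:
    nominal quota minus the amount it is allowed to lend out."""
    return nominal - lending_limit

# Figures from the gpu-a100 flavor above: nominalQuota 8, borrowingLimit 4, lendingLimit 2
print(max_usable(8, 4))            # 12 GPUs usable at peak via borrowing
print(reserved_when_lending(8, 2)) # 6 GPUs always held back for this queue
```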
11.3 LocalQueue
LocalQueue is the namespace-scoped entry point where users submit Jobs.
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-research-queue
  namespace: ml-training
spec:
  clusterQueue: ml-cluster-queue
```
Users submit Jobs to a LocalQueue by adding the kueue.x-k8s.io/queue-name label:
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: my-training-job
  namespace: ml-training
  labels:
    kueue.x-k8s.io/queue-name: ml-research-queue
spec:
  # ... PyTorchJob spec
```
11.4 Kueue Queuing Strategies
Kueue supports two queuing strategies:
- StrictFIFO: Strict first-in, first-out. If a Workload at the front of the queue cannot be admitted, Workloads behind it are also blocked from admission.
- BestEffortFIFO: Even if a Workload at the front cannot be admitted due to insufficient resources, a Workload further back may be admitted first if it has lower resource requirements.
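The difference is easy to see in a toy model (illustrative only, not Kueue's implementation): with 4 free GPUs and a 6-GPU Workload at the head of the queue, StrictFIFO blocks everything behind it, while BestEffortFIFO admits the smaller Workload.

```python
def admit(queue, free_gpus, strategy):
    """Walk the queue in order, admitting Workloads that fit.
    StrictFIFO stops at the first Workload that does not fit."""
    admitted = []
    for name, need in queue:
        if need <= free_gpus:
            admitted.append(name)
            free_gpus -= need
        elif strategy == "StrictFIFO":
            break  # head-of-line blocking
    return admitted

q = [("big-job", 6), ("small-job", 2)]
print(admit(q, 4, "StrictFIFO"))      # [] -- big-job blocks the whole queue
print(admit(q, 4, "BestEffortFIFO"))  # ['small-job'] -- smaller workload slips through
```

The trade-off: BestEffortFIFO improves utilization but can starve large Workloads if small ones keep arriving, which is why both strategies exist.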
12. Training on Spot/Preemptible Instances (Fault Tolerance)
12.1 Advantages and Risks of Spot Instances
Cloud Spot Instances (AWS), Preemptible VMs (GCP), and Spot VMs (Azure) can reduce costs by 60-90% compared to on-demand pricing. Since the bulk of AI training costs comes from GPU computing, leveraging Spot Instances is highly attractive.
However, Spot Instances can be reclaimed (preempted) by the cloud provider at any time, with only a short notice period of typically 30 seconds to 2 minutes. If a training Job is preempted without a checkpoint, all progress is lost.
12.2 Fault-tolerant Training Strategies
Checkpoint-based Recovery
The most fundamental strategy is periodic checkpoint saving. The Rank 0 process saves model parameters, optimizer state, and training progress to persistent storage at regular intervals (per epoch or per step). When Pods restart, training resumes from the last checkpoint.
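The resume side of this strategy is symmetric to the save side. A framework-free sketch, assuming the checkpoint layout used elsewhere in this article (the loader is injected so the logic runs without torch; in a real script it would be `torch.load`):

```python
import glob
import os

def latest_checkpoint(ckpt_dir="/checkpoints"):
    """Return the newest checkpoint_epoch_*.pt path, or None on a fresh start."""
    files = glob.glob(os.path.join(ckpt_dir, "checkpoint_epoch_*.pt"))
    return max(files, key=os.path.getmtime) if files else None

def resume_epoch(path, model, optimizer, load_fn):
    """Restore model/optimizer state from a checkpoint and return the
    epoch to continue from (0 if there is no checkpoint yet)."""
    if path is None:
        return 0
    state = load_fn(path)
    model.load_state_dict(state["model_state_dict"])
    optimizer.load_state_dict(state["optimizer_state_dict"])
    return state["epoch"] + 1
```

Run at startup, this makes a restarted Pod indistinguishable from one that never died, apart from the work lost since the last checkpoint.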
```python
# Register a SIGTERM handler so the Spot reclamation notice (delivered to
# the container as SIGTERM during Pod termination) triggers an emergency
# checkpoint before the process exits. dist, torch, model, optimizer,
# current_epoch, and current_step come from the surrounding training script.
import signal
import sys

def graceful_shutdown(signum, frame):
    """Save an emergency checkpoint when the reclamation notice arrives."""
    if dist.get_rank() == 0:
        torch.save({
            'epoch': current_epoch,
            'step': current_step,
            'model_state_dict': model.module.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
        }, '/checkpoints/emergency_checkpoint.pt')
    dist.barrier()
    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_shutdown)
```
Kubernetes preStop Hook Configuration
```yaml
lifecycle:
  preStop:
    exec:
      command: ['/bin/sh', '-c', 'python save_checkpoint.py --emergency']
```
TorchElastic (Elastic Training)
PyTorch's TorchElastic (now integrated into torchrun) is a distributed training framework optimized for Spot Instance environments.
Key features:
- Elastic scaling: Training continues even when Workers are added or removed.
- Automatic restart: Training automatically resumes upon Worker failure.
- Partial execution: Training can start even when only a portion of the requested GPUs are available.
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-training
spec:
  elasticPolicy:
    rdzvBackend: etcd
    rdzvHost: etcd-service
    rdzvPort: 2379
    minReplicas: 2
    maxReplicas: 8
  pytorchReplicaSpecs:
    Worker:
      replicas: 4
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            cloud.google.com/gke-provisioning: spot
          tolerations:
            - key: cloud.google.com/gke-spot
              operator: Equal
              value: 'true'
              effect: NoSchedule
          containers:
            - name: pytorch
              image: my-registry/elastic-training:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
```
12.3 Hybrid Cluster Strategy
In production environments, a Hybrid strategy mixing on-demand and Spot nodes is commonly used:
- On-demand nodes: Host core cluster components (etcd, Kueue Controller, Training Operator). Set Taints to prevent general workloads from being placed on these nodes.
- Spot nodes: Host the actual training Worker Pods. Manage Spot nodes separately using Kueue's ResourceFlavor, and select GPUs with the best cost-efficiency ratio.
When combined with Kueue's Preemption policies, Kueue can automatically reassign Workloads to other available resources when Spot nodes are reclaimed.
13. References
This article was written based on the following official documentation and resources.
Volcano Official Documentation
- Volcano Official Site
- Volcano Introduction
- Volcano PodGroup
- Volcano Plugins
- Volcano Unified Scheduling
- Volcano GitHub Repository
Kubeflow Training Operator Official Documentation
- Kubeflow Trainer Overview
- PyTorchJob (Legacy V1)
- Distributed Training Reference
- MPIJob
- TFJob
- Migrating to Trainer V2
- Kubeflow Trainer GitHub Repository
- Volcano Integration for Job Scheduling
Kueue Official Documentation
- Kueue Official Site
- Kueue Overview
- Kueue Concepts
- ClusterQueue
- ResourceFlavor
- Run a PyTorchJob with Kueue
- Kueue GitHub Repository