- 1. What Is Slurm
- 2. Architecture
- 3. Core Concepts
- 4. Essential Commands
- 5. Job Script Examples
- 6. GPU Scheduling (GRES)
- 7. AI/ML Distributed Training
- 8. Container Integration
- 9. Configuration (slurm.conf)
- 10. Advanced Features
- 11. Monitoring and Troubleshooting
- 12. Comparison with Other Schedulers
- 13. References
1. What Is Slurm
Slurm (originally an acronym for Simple Linux Utility for Resource Management) is an open-source, fault-tolerant, highly scalable workload manager for Linux clusters. It is the de facto standard scheduler for HPC (High-Performance Computing) and AI training infrastructure.
1.1 Three Core Capabilities
- Resource Allocation: Grants users exclusive or shared access to compute nodes for a specified duration
- Job Execution Framework: Launches, runs, and monitors parallel tasks on allocated nodes
- Queue Management: Resolves resource contention through sophisticated scheduling algorithms
1.2 History
| Year | Event |
|---|---|
| 2002 | First release at Lawrence Livermore National Laboratory (LLNL) |
| 2010 | Core developers founded SchedMD (commercial support, development, training) |
| 2025 | NVIDIA announced its acquisition of SchedMD, committing to keep Slurm open source and vendor-neutral |
1.3 Who Uses It
- ~60-65% of TOP500 supercomputers run Slurm
- Notable systems: Frontier (Oak Ridge), Perlmutter (NERSC), Polaris (Argonne)
- Cloud: AWS ParallelCluster, Google Cloud HPC, Azure CycleCloud
- AI companies: Large-scale LLM training, image generation model training
- Industries: Autonomous driving, healthcare, energy, finance, government research labs
License: GNU GPL v2 (open-source)
2. Architecture
┌─────────────────────────────────────────────────────────────┐
│ Slurm Architecture │
│ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ slurmctld │ │ slurmctld │ │
│ │ (Primary) │◄──────►│ (Backup) │ │
│ │ Head Node │ HA │ Backup Node │ │
│ └───────┬────────┘ └────────────────┘ │
│ │ RPC (TCP) │
│ ┌───────┼──────────────────────────────────┐ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ slurmd │ │ slurmd │ │ slurmd │ │ Compute Nodes │
│ │ │ Node 01 │ │ Node 02 │ │ Node N │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ slurmdbd │ │ slurmrestd │ │
│ │ (Database) │ │ (REST API) │ │
│ │ MySQL/MariaDB │ │ JSON / JWT │ │
│ └────────────────┘ └────────────────┘ │
└─────────────────────────────────────────────────────────────┘
2.1 Daemon Roles
| Daemon | Role | Location |
|---|---|---|
| slurmctld | Central management (scheduling, resource monitoring, job queue) | Head node |
| slurmd | Task execution, resource usage monitoring, status reporting | All compute nodes |
| slurmdbd | Job accounting, history, usage statistics (MySQL/MariaDB backend) | DB server |
| slurmrestd | HTTP RESTful API (JSON, JWT authentication) | API server |
2.2 Plugin Architecture
Slurm supports an extensible plugin architecture:
- Authentication (auth/munge, auth/jwt)
- Containers (OCI, Singularity, Enroot)
- GPU/GRES management
- MPI implementations (PMIx, PMI2)
- Scheduling algorithms (backfill, priority multifactor)
- Process tracking (cgroup, linuxproc)
3. Core Concepts
3.1 Node, Partition, Job
| Concept | Description |
|---|---|
| Node | The basic compute resource. Has CPU, memory, GPU, and disk attributes |
| Partition | A logical grouping of nodes = job queue. Defines access control, resource limits, and priority |
| Job | Resources allocated to a user for a specified time. Has a unique ID, resource requirements, and state |
| Job Step | A set of parallel tasks within a job. Lower overhead than separate jobs |
Job 1234
├── Step 0: Data preprocessing (1 node)
├── Step 1: Training (4 nodes, 32 tasks)
└── Step 2: Evaluation (1 node)
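The step structure above maps directly onto srun calls inside one batch script; a sketch (script and file names are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=pipeline
#SBATCH --nodes=4
#SBATCH --time=08:00:00

# Step 0: data preprocessing on a single node
srun --nodes=1 --ntasks=1 python preprocess.py

# Step 1: training across all four nodes, 8 tasks each
srun --nodes=4 --ntasks=32 python train.py

# Step 2: evaluation back on a single node
srun --nodes=1 --ntasks=1 python evaluate.py
```

Each srun launches a numbered step inside the existing allocation, so sacct reports them separately as 1234.0, 1234.1, and 1234.2.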
3.2 Account, QoS, Fairshare
Account: A hierarchical organizational unit for tracking resource usage
root
├── engineering
│ ├── ml-team
│ └── platform-team
└── research
├── physics
└── biology
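A hierarchy like this is typically built with sacctmgr; a sketch mirroring the tree above (assumes slurmdbd accounting is configured, and the user name is illustrative):

```shell
# Parent accounts under root
sacctmgr add account engineering Description="Engineering org"
sacctmgr add account research Description="Research org"

# Child accounts nest via parent=
sacctmgr add account ml-team parent=engineering
sacctmgr add account platform-team parent=engineering
sacctmgr add account physics parent=research
sacctmgr add account biology parent=research

# Associate a user with an account
sacctmgr add user alice account=ml-team
```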
QoS (Quality of Service): A set of limits and priorities that control job behavior
- Scheduling priority
- Preemption policy
- Resource limits (CPU, GPU, memory, number of running jobs)
Fairshare: Fair allocation scheduling that considers historical resource usage
- Each account is assigned a share proportional to its investment/entitlement
- Users who have used fewer resources get higher priority; heavy users get lower priority
- More recent usage is weighted more heavily (decay factor)
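The decay behavior can be sketched as an exponential weighted by the configured half-life (PriorityDecayHalfLife); this is a toy model, not Slurm's exact implementation:

```python
def decayed_usage(usage_events, now, half_life_days=7.0):
    """Sum past usage, halving each event's weight per half-life elapsed.

    usage_events: list of (timestamp_in_days, cpu_hours) tuples.
    """
    return sum(
        cpu_hours * 0.5 ** ((now - t) / half_life_days)
        for t, cpu_hours in usage_events
    )

# With a 7-day half-life, usage from 7 days ago counts half as much as today's
events = [(0.0, 100.0), (7.0, 100.0)]
print(decayed_usage(events, now=7.0))  # 50.0 + 100.0 = 150.0
```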
3.3 Priority Calculation (Multifactor)
Job Priority = site_factor
+ (WeightAge) × age_factor -- Wait time
+ (WeightFairshare) × fairshare_factor -- Fair distribution
+ (WeightJobSize) × job_size_factor -- Job size
+ (WeightPartition) × partition_factor -- Partition tier
+ (WeightQOS) × QOS_factor -- QoS tier
+ TRES weights -- Resource weights (GPU, etc.)
            - nice_factor                           -- User-requested nice adjustment (--nice)
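The weighted sum can be sketched directly; the weights below echo the slurm.conf example in section 9, while the factor values are illustrative (on a real cluster slurmctld normalizes each factor to [0, 1]):

```python
def job_priority(factors, weights, site_factor=0, nice_factor=0):
    """Toy multifactor priority: weighted sum of normalized factors."""
    total = site_factor
    for name, weight in weights.items():
        total += weight * factors.get(name, 0.0)  # missing factor counts as 0
    return total - nice_factor

weights = {"age": 1000, "fairshare": 10000, "job_size": 500, "qos": 2000}
factors = {"age": 0.5, "fairshare": 0.8, "job_size": 0.1, "qos": 1.0}
print(int(job_priority(factors, weights)))  # 500 + 8000 + 50 + 2000 = 10550
```

On a real cluster, `sprio -l` shows the per-factor breakdown for pending jobs.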
4. Essential Commands
4.1 sbatch — Submit Batch Jobs
# Basic submission
sbatch job.sh
# Override options
sbatch --partition=gpu --nodes=2 --gres=gpu:4 --time=24:00:00 train.sh
# Specify job name and output
sbatch --job-name=my_training --output=train_%j.log job.sh
# Specific account and QoS
sbatch --account=ml-team --qos=high job.sh
4.2 srun — Execute Parallel Jobs/Steps
# Print hostname across 3 nodes
srun -N3 -l /bin/hostname
# Interactive GPU job
srun --partition=gpu --gres=gpu:1 --pty bash
# Run with GPU binding
srun --ntasks=4 --gpus-per-task=1 --gpu-bind=closest python train.py
4.3 salloc — Interactive Resource Allocation
# Allocate 2 GPU nodes for 4 hours
salloc --nodes=2 --gres=gpu:4 --time=04:00:00 --partition=gpu
# Request specific features
salloc -N1 --constraint=a100 --gres=gpu:8 --mem=512G
4.4 squeue — View the Job Queue
# Show only my jobs
squeue -u $USER
# Custom format
squeue -o "%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"
# Show PENDING jobs and reasons
squeue -t PENDING -o "%.10i %.20j %.8u %.10M %R"
4.5 sinfo — View System Information
# Partition summary
sinfo
# Show GPU resources
sinfo -o "%20N %10c %10m %20G %10t"
# Idle nodes
sinfo -t idle
4.6 scancel — Cancel Jobs
scancel 12345 # Cancel a specific job
scancel -u $USER # Cancel all my jobs
scancel -t PENDING -u $USER # Cancel only PENDING jobs
scancel 12345_[1-10] # Cancel specific array tasks
4.7 sacct — Job Accounting
# View completed jobs
sacct -u $USER
# Detailed format
sacct -j 12345 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,TotalCPU
# Query by date range
sacct --starttime=2026-01-01 --endtime=2026-01-31 -u $USER
4.8 scontrol — Administrative Control
scontrol show job 12345 # Job details
scontrol show node gpu-001 # Node details
scontrol hold 12345 # Hold a job
scontrol release 12345 # Release a hold
scontrol update JobId=12345 TimeLimit=48:00:00 # Modify time limit
scontrol ping # Test controller connectivity
5. Job Script Examples
5.1 Basic CPU Job
#!/bin/bash
#SBATCH --job-name=basic_job
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log
module load gcc/12.2.0
module load openmpi/4.1.5
echo "Job started on $(hostname) at $(date)"
echo "SLURM_JOB_ID: $SLURM_JOB_ID"
echo "SLURM_NODELIST: $SLURM_NODELIST"
srun ./my_simulation --input data.csv --output results/
echo "Job completed at $(date)"
5.2 Single-Node GPU Job
#!/bin/bash
#SBATCH --job-name=gpu_training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:a100:2
#SBATCH --time=12:00:00
#SBATCH --output=train_%j.log
module load cuda/12.2
module load anaconda/2024
conda activate ml_env
echo "Available GPUs: $CUDA_VISIBLE_DEVICES"
nvidia-smi
python train.py \
--model resnet50 \
--batch-size 256 \
--epochs 100 \
--data /shared/datasets/imagenet \
--output /scratch/$USER/checkpoints/
5.3 Multi-Node MPI Job
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --partition=compute
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=24:00:00
#SBATCH --exclusive
module load openmpi/4.1.5
echo "Running on $SLURM_NNODES nodes with $SLURM_NTASKS total tasks"
srun ./weather_simulation --grid-size 4096x4096x128 --timesteps 10000
5.4 Job Array (Hyperparameter Sweep)
#!/bin/bash
#SBATCH --job-name=hparam_sweep
#SBATCH --partition=gpu
#SBATCH --array=0-19%5 # 20 tasks, max 5 running concurrently
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=06:00:00
#SBATCH --output=sweep_%A_%a.log # %A=Array ID, %a=Task ID
LEARNING_RATES=(0.1 0.01 0.001 0.0001 0.00001)
BATCH_SIZES=(32 64 128 256)
LR_IDX=$((SLURM_ARRAY_TASK_ID / 4))
BS_IDX=$((SLURM_ARRAY_TASK_ID % 4))
LR=${LEARNING_RATES[$LR_IDX]}
BS=${BATCH_SIZES[$BS_IDX]}
echo "Task $SLURM_ARRAY_TASK_ID: LR=$LR, BS=$BS"
python train.py --lr $LR --batch-size $BS \
--experiment-name "sweep_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
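The index arithmetic in the script maps task IDs 0-19 onto the 5x4 grid of (learning rate, batch size) pairs; the same mapping, checked in Python:

```python
learning_rates = [0.1, 0.01, 0.001, 0.0001, 0.00001]
batch_sizes = [32, 64, 128, 256]

combos = []
for task_id in range(20):
    lr = learning_rates[task_id // 4]  # LR_IDX = id / 4
    bs = batch_sizes[task_id % 4]      # BS_IDX = id % 4
    combos.append((lr, bs))

# Every (lr, bs) pair appears exactly once
assert len(set(combos)) == 20
print(combos[0], combos[19])  # (0.1, 32) (1e-05, 256)
```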
5.5 Job Dependencies (Pipeline)
# Step 1: Preprocessing
JOB1=$(sbatch --parsable preprocess.sh)
# Step 2: Training (after preprocessing succeeds)
JOB2=$(sbatch --parsable --dependency=afterok:$JOB1 train.sh)
# Step 3: Evaluation (after training succeeds)
JOB3=$(sbatch --parsable --dependency=afterok:$JOB2 evaluate.sh)
# Step 4: Cleanup (after all complete, regardless of success/failure)
JOB4=$(sbatch --parsable --dependency=afterany:$JOB1:$JOB2:$JOB3 cleanup.sh)
Dependency Types:
| Type | Meaning |
|---|---|
| after:jobid | After the job starts |
| afterok:jobid | After the job completes successfully |
| afternotok:jobid | After the job fails |
| afterany:jobid | After the job completes (regardless of success/failure) |
| aftercorr:jobid | Array: after the corresponding task succeeds |
| singleton | Only one job with the same name runs at a time |
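singleton is handy for recurring jobs: submissions sharing a job name queue behind each other (job and script names here are illustrative):

```shell
# Each submission waits until no earlier "nightly-backup" job
# from the same user is still running or pending.
sbatch --job-name=nightly-backup --dependency=singleton backup.sh
```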
6. GPU Scheduling (GRES)
6.1 Configuration
slurm.conf:
GresTypes=gpu,mps,shard
NodeName=gpu-node[01-08] Gres=gpu:a100:8 CPUs=128 RealMemory=1024000
AccountingStorageTres=gres/gpu
gres.conf:
# Auto-detection (recommended)
AutoDetect=nvml # NVIDIA
# AutoDetect=rsmi # AMD
# AutoDetect=oneapi # Intel GPU
# AutoDetect=nrt    # AWS Trainium/Inferentia
6.2 How to Request GPUs
# 2 GPUs of any type
sbatch --gres=gpu:2 job.sh
# Specific GPU type
sbatch --gres=gpu:a100:4 job.sh
# GPUs per node
sbatch --nodes=4 --gpus-per-node=8 job.sh
# GPUs per task + CPU/memory affinity
sbatch --ntasks=8 --gpus-per-task=1 --cpus-per-gpu=8 --mem-per-gpu=32G job.sh
Slurm automatically sets CUDA_VISIBLE_DEVICES for GPU isolation.
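A task can inspect its isolated devices by parsing that variable; a minimal sketch, assuming numeric indices (sites using MIG may see UUID strings instead):

```python
import os

def visible_gpus():
    """Return the GPU indices Slurm exposed to this task."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(x) for x in raw.split(",") if x.strip()]

# Simulate what Slurm might set for a task granted --gres=gpu:2
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(visible_gpus())  # [0, 1]
```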
6.3 GPU Sharing Options
| Method | Description | Configuration |
|---|---|---|
| MPS (Multi-Process Service) | Multi-process GPU sharing with concurrent kernel execution | Gres=gpu:a100:8,mps:800 |
| MIG (Multi-Instance GPU) | Partitions A100/H100 into independent instances | AutoDetect=nvml (auto-detected) |
| Shard | GPU sharing without isolation (lightweight inference) | Gres=gpu:2,shard:64 |
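Jobs request these shared resources with the same --gres syntax; a sketch (the counts and the MIG profile name are illustrative and site-dependent):

```shell
# MPS: request a slice of one GPU's compute capacity
sbatch --gres=mps:100 job.sh

# Shard: request 8 shards of a GPU (no memory/compute isolation)
sbatch --gres=shard:8 job.sh

# MIG: profiles appear as GPU type names when auto-detected via NVML
sbatch --gres=gpu:1g.10gb:1 job.sh
```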
7. AI/ML Distributed Training
7.1 PyTorch DDP (Multi-Node)
sbatch script (4 nodes x 8 GPUs = 32 GPUs):
#!/bin/bash
#SBATCH --job-name=ddp-multigpu
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=64
#SBATCH --mem=0
#SBATCH --exclusive
#SBATCH --time=72:00:00
#SBATCH --output=ddp_%j.log
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1) + 30000 ))
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export WORLD_SIZE=$(( SLURM_NNODES * 8 ))
echo "MASTER_ADDR=$MASTER_ADDR, MASTER_PORT=$MASTER_PORT, WORLD_SIZE=$WORLD_SIZE"
# NCCL configuration
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=^docker0,lo
srun torchrun \
--nnodes $SLURM_NNODES \
--nproc_per_node 8 \
--rdzv_id $SLURM_JOB_ID \
--rdzv_backend c10d \
--rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
train.py \
--model llama-7b \
--data /shared/datasets/openwebtext \
--batch-size 32 \
--gradient-accumulation-steps 4
Python training script pattern:
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler
def setup():
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
return rank, world_size, local_rank
def main():
rank, world_size, local_rank = setup()
model = MyModel().cuda(local_rank)
model = DDP(model, device_ids=[local_rank])
dataset = MyDataset(...)
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)
for epoch in range(num_epochs):
sampler.set_epoch(epoch) # Required for proper shuffling
for batch in dataloader:
...
dist.destroy_process_group()
7.2 DeepSpeed (Multi-Node)
#!/bin/bash
#SBATCH --job-name=deepspeed-train
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=64
#SBATCH --mem=0
#SBATCH --exclusive
#SBATCH --time=96:00:00
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1) + 30000 ))
export NNODES=$SLURM_NNODES
export NUM_PROCESSES=$(( NNODES * 8 ))
srun bash -c 'accelerate launch \
--main_process_ip $MASTER_ADDR \
--main_process_port $MASTER_PORT \
--machine_rank $SLURM_NODEID \
--num_processes $NUM_PROCESSES \
--num_machines $NNODES \
--use_deepspeed \
--zero_stage 2 \
--mixed_precision fp16 \
train.py \
--model_name_or_path meta-llama/Llama-2-7b \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 8'
7.3 Horovod (MPI-Based)
#!/bin/bash
#SBATCH --job-name=horovod-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
module load openmpi/4.1.5 cuda/12.2
srun --mpi=pmix python train_horovod.py --epochs 100 --batch-size 64
8. Container Integration
8.1 Singularity / Apptainer
#!/bin/bash
#SBATCH --job-name=container_job
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
# --nv flag enables GPU support
srun singularity exec --nv \
--bind /scratch/$USER:/data \
--bind /shared/datasets:/datasets \
/shared/containers/pytorch_24.03.sif \
python train.py --data /datasets/imagenet
8.2 Enroot + Pyxis (NVIDIA)
Enroot is NVIDIA's lightweight container runtime, and Pyxis is a Slurm SPANK plugin that provides the srun --container-* flags.
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --exclusive
srun --container-image=nvcr.io/nvidia/pytorch:24.03-py3 \
--container-mounts=/shared/data:/data,/scratch/$USER:/workspace \
--container-workdir=/workspace \
torchrun \
--nnodes=$SLURM_NNODES \
--nproc_per_node=8 \
--rdzv_backend=c10d \
--rdzv_endpoint=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1):29500 \
train.py --data /data
Widely used on NVIDIA DGX SuperPOD and DGX Cloud.
9. Configuration (slurm.conf)
9.1 Key Configuration Parameters
# Cluster identification
ClusterName=my_cluster
SlurmctldHost=controller01
SlurmctldHost=controller02 # Backup controller
# Authentication
AuthType=auth/munge
# Scheduling
SchedulerType=sched/backfill
SelectType=select/cons_tres # Consumable trackable resources
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=500
PriorityWeightQOS=2000
PriorityDecayHalfLife=7-0 # 7 days
# Resource management
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
GresTypes=gpu,mps,shard
# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTres=gres/gpu
# Job defaults
DefMemPerCPU=4096 # 4GB
MaxMemPerCPU=16384 # 16GB
DisableRootJobs=YES
MpiDefault=pmix_v4
# Node definitions
NodeName=compute[001-100] CPUs=64 RealMemory=256000 State=UNKNOWN
NodeName=gpu[001-032] CPUs=128 RealMemory=1024000 Gres=gpu:a100:8 Feature=a100,nvlink
# Partition definitions
PartitionName=compute Nodes=compute[001-100] Default=YES MaxTime=7-00:00:00
PartitionName=gpu Nodes=gpu[001-032] MaxTime=3-00:00:00 AllowGroups=gpu-users
PartitionName=debug Nodes=compute[001-004],gpu[001-002] MaxTime=01:00:00 PriorityTier=100
9.2 cgroup.conf
ConstrainCores=yes # CPU core pinning
ConstrainRAMSpace=yes # Enforce memory limits
AllowedRAMSpace=100 # % of allocated memory (OOM Kill if exceeded)
ConstrainSwapSpace=yes
ConstrainDevices=yes # Device isolation (GPU)
10. Advanced Features
10.1 Backfill Scheduling
Slurm's secondary scheduling loop. It allows short, lower-priority jobs to run ahead of schedule as long as they do not delay longer, higher-priority jobs.
SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30,bf_resolution=300,bf_max_job_test=1200
For backfill to work effectively, jobs must specify realistic time limits (--time); without them, the scheduler must assume each job runs to the partition's maximum and finds few backfill opportunities.
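The core backfill decision can be sketched as: a lower-priority job may start now only if it fits in currently idle resources and finishes before the reserved start time of the highest-priority pending job. A toy model with a single reservation:

```python
def can_backfill(now, job_runtime, free_nodes, job_nodes, reservation_start):
    """Toy backfill check: the job must fit in idle nodes AND finish
    before the higher-priority job's reserved start time."""
    fits = job_nodes <= free_nodes
    finishes_in_time = now + job_runtime <= reservation_start
    return fits and finishes_in_time

# A 2-hour, 4-node job slips into the 6-hour gap before the big job starts
print(can_backfill(now=0, job_runtime=2, free_nodes=8, job_nodes=4,
                   reservation_start=6))  # True
# Without a time limit, the scheduler assumes the worst case (e.g. 7 days)
print(can_backfill(now=0, job_runtime=168, free_nodes=8, job_nodes=4,
                   reservation_start=6))  # False
```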
10.2 Preemption
# slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE # CANCEL, REQUEUE, SUSPEND, GANG
PreemptExemptTime=00:05:00 # Grace period before preemption
| Mode | Behavior |
|---|---|
| CANCEL | Terminates the lower-priority job |
| REQUEUE | Requeues if possible, cancels otherwise |
| SUSPEND | Suspends the job |
| GANG | Time-sharing between jobs |
10.3 Large-Scale Job Array Submission
# 1000 tasks, max 50 running concurrently
sbatch --array=0-999%50 sweep.sh
# Environment variables: SLURM_ARRAY_JOB_ID, SLURM_ARRAY_TASK_ID
# MaxArraySize: up to 4,000,001 (configurable)
11. Monitoring and Troubleshooting
11.1 Diagnostic Commands
scontrol ping # Test controller connectivity
sdiag # Scheduler diagnostics (threads, queue, backfill cycles)
scontrol show node X # Check node status
sacct -j ID --format=JobID,Elapsed,TotalCPU,ReqMem,MaxRSS,State # Job efficiency
11.2 Common Problems and Solutions
| Problem | Diagnosis | Solution |
|---|---|---|
| Node in DRAIN state | scontrol show node | Fix the issue, then scontrol update NodeName=X State=RESUME |
| Job stuck in PENDING | squeue -j ID -o "%R" (check Reason) | Check resources, partition limits, QoS, and dependencies |
| GPU not detected | slurmd -C, slurmd -G | Verify driver, gres.conf AutoDetect, and device files |
| OOM Kill | sacct --format=MaxRSS,ReqMem | Request more memory or adjust cgroup limits |
| slurmctld overloaded | sdiag (thread count) | Enable RPC rate limiting, reduce polling frequency |
11.3 Common PENDING Reasons
| Reason | Meaning |
|---|---|
| Resources | Waiting for resources to become available |
| Priority | Higher-priority jobs are ahead in the queue |
| Dependency | Waiting for dependent jobs to complete |
| QOSMaxJobsPerUserLimit | QoS per-user job count limit reached |
| PartitionTimeLimit | Requested time exceeds the partition time limit |
| ReqNodeNotAvail | Requested node is unavailable |
12. Comparison with Other Schedulers
| Feature | Slurm | PBS Pro / Torque | IBM LSF | Kubernetes |
|---|---|---|---|---|
| License | GPL v2 (open-source) | AGPL / Commercial | Commercial (IBM) | Apache 2.0 |
| Primary Use | HPC, AI training | HPC, traditional batch | Enterprise HPC | Cloud microservices |
| Scalability | 100K+ nodes | 50K+ | 100K+ | 5K+ |
| GPU Support | Native GRES, MIG, MPS | Hook-based | GPU-aware | Device Plugin |
| MPI Support | Native (PMIx) | Native | Native | MPI Operator |
| Fairshare | Built-in | Requires Maui/Moab | Built-in | Not built-in |
| TOP500 Adoption | ~60-65% | ~10-15% | ~10-15% | Rare |
Trend: Hybrid setups with Slurm (training) + Kubernetes (inference/serving) are becoming mainstream.
13. References
Official Documentation
- Slurm Official Documentation — The authoritative reference for all features
- Quick Start User Guide
- Quick Start Admin Guide
- Slurm GRES Scheduling
- Slurm Job Array
- Slurm Containers Guide
- Slurm Configuration Tool — Web-based slurm.conf generator
- Slurm Rosetta Stone (PDF) — PBS/LSF/SGE command comparison chart
- GitHub: SchedMD/slurm
AI/ML Distributed Training
- PyTorch DDP Multi-Node Slurm Examples
- PyTorch Multi-Node Training Tutorial
- NVIDIA DGX Cloud DeepSpeed Examples
- Multi-Node Training on Slurm (GitHub Gist)