Mastering Slurm: A Practical Guide to the HPC/AI Cluster Workload Manager


1. What Is Slurm

Slurm (Simple Linux Utility for Resource Management) is an open-source, fault-tolerant, highly scalable workload manager for Linux clusters. It is the de facto standard in the world of HPC (High-Performance Computing) and AI training infrastructure.

1.1 Three Core Capabilities

  1. Resource Allocation: Grants users exclusive or shared access to compute nodes for a specified duration
  2. Job Execution Framework: Launches, runs, and monitors parallel tasks on allocated nodes
  3. Queue Management: Resolves resource contention through sophisticated scheduling algorithms

1.2 History

| Year | Event |
|------|-------|
| 2002 | First release at Lawrence Livermore National Laboratory (LLNL) |
| 2010 | Core developers founded SchedMD (commercial support, development, training) |
| 2025-12 | NVIDIA acquired SchedMD, committing to open-source maintenance and vendor neutrality |

1.3 Who Uses It

  • ~60-65% of TOP500 supercomputers run Slurm
  • Notable systems: Frontier (Oak Ridge), Perlmutter (NERSC), Polaris (Argonne)
  • Cloud: AWS ParallelCluster, Google Cloud HPC, Azure CycleCloud
  • AI companies: Large-scale LLM training, image generation model training
  • Industries: Autonomous driving, healthcare, energy, finance, government research labs

License: GNU GPL v2 (open-source)


2. Architecture

┌────────────────────────────────────────────────────────────┐
│                     Slurm Architecture                     │
│                                                            │
│  ┌────────────────┐          ┌────────────────┐            │
│  │  slurmctld     │ ◄──HA──► │  slurmctld     │            │
│  │  (Primary)     │          │  (Backup)      │            │
│  │  Head Node     │          │  Backup Node   │            │
│  └───────┬────────┘          └────────────────┘            │
│          │ RPC (TCP)                                       │
│  ┌───────┼──────────────────────────────┐                  │
│  │       ▼           ▼           ▼      │  Compute Nodes   │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ │                  │
│  │  │ slurmd  │ │ slurmd  │ │ slurmd  │ │                  │
│  │  │ Node 01 │ │ Node 02 │ │ Node N  │ │                  │
│  │  └─────────┘ └─────────┘ └─────────┘ │                  │
│  └──────────────────────────────────────┘                  │
│                                                            │
│  ┌────────────────┐          ┌────────────────┐            │
│  │  slurmdbd      │          │  slurmrestd    │            │
│  │  (Database)    │          │  (REST API)    │            │
│  │  MySQL/MariaDB │          │  JSON / JWT    │            │
│  └────────────────┘          └────────────────┘            │
└────────────────────────────────────────────────────────────┘

2.1 Daemon Roles

| Daemon | Role | Location |
|--------|------|----------|
| slurmctld | Central management (scheduling, resource monitoring, job queue) | Head node |
| slurmd | Task execution, resource usage monitoring, status reporting | All compute nodes |
| slurmdbd | Job accounting, history, usage statistics (MySQL/MariaDB backend) | DB server |
| slurmrestd | HTTP RESTful API (JSON, JWT authentication) | API server |

2.2 Plugin Architecture

Slurm supports an extensible plugin architecture:

  • Authentication (auth/munge, auth/jwt)
  • Containers (OCI, Singularity, Enroot)
  • GPU/GRES management
  • MPI implementations (PMIx, PMI2)
  • Scheduling algorithms (backfill, priority multifactor)
  • Process tracking (cgroup, linuxproc)

3. Core Concepts

3.1 Node, Partition, Job

| Concept | Description |
|---------|-------------|
| Node | The basic compute resource, with CPU, memory, GPU, and disk attributes |
| Partition | A logical grouping of nodes that acts as a job queue. Defines access control, resource limits, and priority |
| Job | Resources allocated to a user for a specified time. Has a unique ID, resource requirements, and state |
| Job Step | A set of parallel tasks within a job; lower overhead than submitting separate jobs |

Example job structure:

Job 1234
  ├── Step 0: Data preprocessing (1 node)
  ├── Step 1: Training (4 nodes, 32 tasks)
  └── Step 2: Evaluation (1 node)

3.2 Account, QoS, Fairshare

Account: A hierarchical organizational unit for tracking resource usage

root
├── engineering
│   ├── ml-team
│   └── platform-team
└── research
    ├── physics
    └── biology

QoS (Quality of Service): A set of limits and priorities that control job behavior

  1. Scheduling priority
  2. Preemption policy
  3. Resource limits (CPU, GPU, memory, number of running jobs)

Fairshare: Fair allocation scheduling that considers historical resource usage

  • Each account is assigned a share proportional to its investment/entitlement
  • Users who have used fewer resources get higher priority; heavy users get lower priority
  • More recent usage is weighted more heavily (decay factor)
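The decay weighting can be sketched in a few lines of Python. This is a toy model: the exponential form mirrors the idea behind Slurm's PriorityDecayHalfLife, but the function and constants here are illustrative, not Slurm's actual accounting code.

```python
def decayed_usage(usage_events, now, half_life_secs):
    """Sum past usage with exponential decay: each usage event loses
    half its weight every half-life (illustrative sketch)."""
    total = 0.0
    for timestamp, cpu_seconds in usage_events:
        age = now - timestamp
        total += cpu_seconds * 0.5 ** (age / half_life_secs)
    return total

WEEK = 7 * 24 * 3600  # matches PriorityDecayHalfLife=7-0
events = [(0, 1000.0), (WEEK, 1000.0)]  # two equal usage events, a week apart
# At t = one week, the older event counts only half as much:
print(decayed_usage(events, now=WEEK, half_life_secs=WEEK))  # 1500.0
```

With a shorter half-life, past usage is forgiven faster and fairshare priorities rebound more quickly.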

3.3 Priority Calculation (Multifactor)

Job Priority = site_factor
             + (WeightAge)       × age_factor        -- Wait time
             + (WeightFairshare) × fairshare_factor  -- Fair-share standing
             + (WeightJobSize)   × job_size_factor   -- Job size
             + (WeightPartition) × partition_factor  -- Partition tier
             + (WeightQOS)       × QOS_factor        -- QoS tier
             + Σ (TRES weight × TRES factor)         -- Resource weights (GPU, etc.)
             - nice_factor                           -- User-set nice value
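The formula is a plain weighted sum, which a short Python sketch makes concrete. This is illustrative only: the weights echo the example values in §9.1, each factor is normalized to [0, 1], and the partition and TRES terms are omitted.

```python
def job_priority(weights, factors, nice=0, site_factor=0):
    """Weighted sum of normalized priority factors (each in [0, 1]),
    mirroring the multifactor formula above. Partition/TRES terms omitted."""
    p = site_factor
    for name, weight in weights.items():
        p += weight * factors[name]
    return p - nice

# Weights as in the §9.1 example configuration
weights = {"age": 1000, "fairshare": 10000, "job_size": 500, "qos": 2000}
# Hypothetical normalized factors for one pending job
factors = {"age": 0.5, "fairshare": 0.25, "job_size": 0.1, "qos": 1.0}

print(job_priority(weights, factors))  # 500 + 2500 + 50 + 2000 = 5050.0
```

Note how the fairshare term dominates under these weights: a user's historical usage moves priority far more than wait time does.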

4. Essential Commands

4.1 sbatch — Submit Batch Jobs

# Basic submission
sbatch job.sh

# Override options
sbatch --partition=gpu --nodes=2 --gres=gpu:4 --time=24:00:00 train.sh

# Specify job name and output
sbatch --job-name=my_training --output=train_%j.log job.sh

# Specific account and QoS
sbatch --account=ml-team --qos=high job.sh

4.2 srun — Execute Parallel Jobs/Steps

# Print hostname across 3 nodes
srun -N3 -l /bin/hostname

# Interactive GPU job
srun --partition=gpu --gres=gpu:1 --pty bash

# Run with GPU binding
srun --ntasks=4 --gpus-per-task=1 --gpu-bind=closest python train.py

4.3 salloc — Interactive Resource Allocation

# Allocate 2 GPU nodes for 4 hours
salloc --nodes=2 --gres=gpu:4 --time=04:00:00 --partition=gpu

# Request specific features
salloc -N1 --constraint=a100 --gres=gpu:8 --mem=512G

4.4 squeue — View the Job Queue

# Show only my jobs
squeue -u $USER

# Custom format
squeue -o "%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"

# Show PENDING jobs and reasons
squeue -t PENDING -o "%.10i %.20j %.8u %.10M %R"

4.5 sinfo — View System Information

# Partition summary
sinfo

# Show GPU resources
sinfo -o "%20N %10c %10m %20G %10t"

# Idle nodes
sinfo -t idle

4.6 scancel — Cancel Jobs

scancel 12345              # Cancel a specific job
scancel -u $USER           # Cancel all my jobs
scancel -t PENDING -u $USER  # Cancel only PENDING jobs
scancel 12345_[1-10]       # Cancel specific array tasks

4.7 sacct — Job Accounting

# View completed jobs
sacct -u $USER

# Detailed format
sacct -j 12345 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,TotalCPU

# Query by date range
sacct --starttime=2026-01-01 --endtime=2026-01-31 -u $USER

4.8 scontrol — Administrative Control

scontrol show job 12345       # Job details
scontrol show node gpu-001    # Node details
scontrol hold 12345           # Hold a job
scontrol release 12345        # Release a hold
scontrol update JobId=12345 TimeLimit=48:00:00  # Modify time limit
scontrol ping                 # Test controller connectivity

5. Job Script Examples

5.1 Basic CPU Job

#!/bin/bash
#SBATCH --job-name=basic_job
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log

module load gcc/12.2.0
module load openmpi/4.1.5

echo "Job started on $(hostname) at $(date)"
echo "SLURM_JOB_ID: $SLURM_JOB_ID"
echo "SLURM_NODELIST: $SLURM_NODELIST"

srun ./my_simulation --input data.csv --output results/
echo "Job completed at $(date)"

5.2 Single-Node GPU Job

#!/bin/bash
#SBATCH --job-name=gpu_training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:a100:2
#SBATCH --time=12:00:00
#SBATCH --output=train_%j.log

module load cuda/12.2
module load anaconda/2024
conda activate ml_env

echo "Available GPUs: $CUDA_VISIBLE_DEVICES"
nvidia-smi

python train.py \
    --model resnet50 \
    --batch-size 256 \
    --epochs 100 \
    --data /shared/datasets/imagenet \
    --output /scratch/$USER/checkpoints/

5.3 Multi-Node MPI Job

#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --partition=compute
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=24:00:00
#SBATCH --exclusive

module load openmpi/4.1.5

echo "Running on $SLURM_NNODES nodes with $SLURM_NTASKS total tasks"
srun ./weather_simulation --grid-size 4096x4096x128 --timesteps 10000

5.4 Job Array (Hyperparameter Sweep)

#!/bin/bash
#SBATCH --job-name=hparam_sweep
#SBATCH --partition=gpu
#SBATCH --array=0-19%5            # 20 tasks, max 5 running concurrently
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=06:00:00
#SBATCH --output=sweep_%A_%a.log  # %A=Array ID, %a=Task ID

LEARNING_RATES=(0.1 0.01 0.001 0.0001 0.00001)
BATCH_SIZES=(32 64 128 256)

LR_IDX=$((SLURM_ARRAY_TASK_ID / 4))
BS_IDX=$((SLURM_ARRAY_TASK_ID % 4))
LR=${LEARNING_RATES[$LR_IDX]}
BS=${BATCH_SIZES[$BS_IDX]}

echo "Task $SLURM_ARRAY_TASK_ID: LR=$LR, BS=$BS"
python train.py --lr $LR --batch-size $BS \
    --experiment-name "sweep_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
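The index arithmetic above maps the 20 task IDs onto the 5 × 4 parameter grid. A quick sanity check of the same mapping in plain Python:

```python
# Same decoding as the sweep script: task_id // 4 picks the learning
# rate, task_id % 4 picks the batch size.
LEARNING_RATES = [0.1, 0.01, 0.001, 0.0001, 0.00001]
BATCH_SIZES = [32, 64, 128, 256]

def decode(task_id):
    return LEARNING_RATES[task_id // 4], BATCH_SIZES[task_id % 4]

assert decode(0) == (0.1, 32)        # first combination
assert decode(5) == (0.01, 64)       # second LR, second BS
assert decode(19) == (0.00001, 256)  # last of the 20 tasks
```

Verifying the decode locally before submitting saves burning GPU hours on a mis-indexed sweep.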

5.5 Job Dependencies (Pipeline)

# Step 1: Preprocessing
JOB1=$(sbatch --parsable preprocess.sh)

# Step 2: Training (after preprocessing succeeds)
JOB2=$(sbatch --parsable --dependency=afterok:$JOB1 train.sh)

# Step 3: Evaluation (after training succeeds)
JOB3=$(sbatch --parsable --dependency=afterok:$JOB2 evaluate.sh)

# Step 4: Cleanup (after all complete, regardless of success/failure)
JOB4=$(sbatch --parsable --dependency=afterany:$JOB1:$JOB2:$JOB3 cleanup.sh)

Dependency Types:

| Type | Meaning |
|------|---------|
| `after:jobid` | After the job starts |
| `afterok:jobid` | After the job completes successfully |
| `afternotok:jobid` | After the job fails |
| `afterany:jobid` | After the job completes (regardless of success/failure) |
| `aftercorr:jobid` | Arrays: after the corresponding task succeeds |
| `singleton` | Only one job with the same name runs at a time |

6. GPU Scheduling (GRES)

6.1 Configuration

slurm.conf:

GresTypes=gpu,mps,shard
NodeName=gpu-node[01-08] Gres=gpu:a100:8 CPUs=128 RealMemory=1024000
AccountingStorageTres=gres/gpu

gres.conf:

# Auto-detection (recommended)
AutoDetect=nvml      # NVIDIA (NVML)
# AutoDetect=rsmi    # AMD (ROCm SMI)
# AutoDetect=oneapi  # Intel
# AutoDetect=nrt     # AWS Trainium/Inferentia

6.2 How to Request GPUs

# 2 GPUs of any type
sbatch --gres=gpu:2 job.sh

# Specific GPU type
sbatch --gres=gpu:a100:4 job.sh

# GPUs per node
sbatch --nodes=4 --gpus-per-node=8 job.sh

# GPUs per task + CPU/memory affinity
sbatch --ntasks=8 --gpus-per-task=1 --cpus-per-gpu=8 --mem-per-gpu=32G job.sh

Slurm automatically sets CUDA_VISIBLE_DEVICES for GPU isolation.

6.3 GPU Sharing Options

| Method | Description | Configuration |
|--------|-------------|---------------|
| MPS (Multi-Process Service) | Multi-process GPU sharing with concurrent kernel execution | `Gres=gpu:a100:8,mps:800` |
| MIG (Multi-Instance GPU) | Partitions A100/H100 GPUs into independent instances | `AutoDetect=nvml` (auto-detected) |
| Shard | GPU sharing without isolation (lightweight inference) | `Gres=gpu:2,shard:64` |

7. AI/ML Distributed Training

7.1 PyTorch DDP (Multi-Node)

sbatch script (4 nodes x 8 GPUs = 32 GPUs):

#!/bin/bash
#SBATCH --job-name=ddp-multigpu
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=64
#SBATCH --mem=0
#SBATCH --exclusive
#SBATCH --time=72:00:00
#SBATCH --output=ddp_%j.log

export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1) + 30000 ))
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export WORLD_SIZE=$(( SLURM_NNODES * 8 ))

echo "MASTER_ADDR=$MASTER_ADDR, MASTER_PORT=$MASTER_PORT, WORLD_SIZE=$WORLD_SIZE"

# NCCL configuration
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=^docker0,lo

srun torchrun \
    --nnodes $SLURM_NNODES \
    --nproc_per_node 8 \
    --rdzv_id $SLURM_JOB_ID \
    --rdzv_backend c10d \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    train.py \
        --model llama-7b \
        --data /shared/datasets/openwebtext \
        --batch-size 32 \
        --gradient-accumulation-steps 4

Python training script pattern:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

def setup():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank

def main():
    rank, world_size, local_rank = setup()

    model = MyModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = MyDataset(...)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Required for proper shuffling
        for batch in dataloader:
            ...

    dist.destroy_process_group()

7.2 DeepSpeed (Multi-Node)

#!/bin/bash
#SBATCH --job-name=deepspeed-train
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=64
#SBATCH --mem=0
#SBATCH --exclusive
#SBATCH --time=96:00:00

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1) + 30000 ))
export NNODES=$SLURM_NNODES
export NUM_PROCESSES=$(( NNODES * 8 ))

srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank $SLURM_NODEID \
    --num_processes $NUM_PROCESSES \
    --num_machines $NNODES \
    --use_deepspeed \
    --zero_stage 2 \
    --mixed_precision fp16 \
    train.py \
        --model_name_or_path meta-llama/Llama-2-7b \
        --per_device_train_batch_size 4 \
        --gradient_accumulation_steps 8'

7.3 Horovod (MPI-Based)

#!/bin/bash
#SBATCH --job-name=horovod-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

module load openmpi/4.1.5 cuda/12.2
srun --mpi=pmix python train_horovod.py --epochs 100 --batch-size 64

8. Container Integration

8.1 Singularity / Apptainer

#!/bin/bash
#SBATCH --job-name=container_job
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# --nv flag enables GPU support
srun singularity exec --nv \
    --bind /scratch/$USER:/data \
    --bind /shared/datasets:/datasets \
    /shared/containers/pytorch_24.03.sif \
    python train.py --data /datasets/imagenet

8.2 Enroot + Pyxis (NVIDIA)

Enroot is NVIDIA's lightweight container runtime, and Pyxis is a Slurm SPANK plugin that provides the srun --container-* flags.

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --exclusive

srun --container-image=nvcr.io/nvidia/pytorch:24.03-py3 \
     --container-mounts=/shared/data:/data,/scratch/$USER:/workspace \
     --container-workdir=/workspace \
     torchrun \
         --nnodes=$SLURM_NNODES \
         --nproc_per_node=8 \
         --rdzv_backend=c10d \
         --rdzv_endpoint=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1):29500 \
         train.py --data /data

Widely used on NVIDIA DGX SuperPOD and DGX Cloud.


9. Configuration (slurm.conf)

9.1 Key Configuration Parameters

# Cluster identification
ClusterName=my_cluster
SlurmctldHost=controller01
SlurmctldHost=controller02     # Backup controller

# Authentication
AuthType=auth/munge

# Scheduling
SchedulerType=sched/backfill
SelectType=select/cons_tres     # Consumable trackable resources
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=500
PriorityWeightQOS=2000
PriorityDecayHalfLife=7-0       # 7 days

# Resource management
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
GresTypes=gpu,mps,shard

# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTres=gres/gpu

# Job defaults
DefMemPerCPU=4096               # 4GB
MaxMemPerCPU=16384              # 16GB
DisableRootJobs=YES
MpiDefault=pmix_v4

# Node definitions
NodeName=compute[001-100] CPUs=64 RealMemory=256000 State=UNKNOWN
NodeName=gpu[001-032] CPUs=128 RealMemory=1024000 Gres=gpu:a100:8 Feature=a100,nvlink

# Partition definitions
PartitionName=compute Nodes=compute[001-100] Default=YES MaxTime=7-00:00:00
PartitionName=gpu Nodes=gpu[001-032] MaxTime=3-00:00:00 AllowGroups=gpu-users
PartitionName=debug Nodes=compute[001-004],gpu[001-002] MaxTime=01:00:00 PriorityTier=100

9.2 cgroup.conf

ConstrainCores=yes          # CPU core pinning
ConstrainRAMSpace=yes       # Enforce memory limits
AllowedRAMSpace=100         # % of allocated memory (OOM Kill if exceeded)
ConstrainSwapSpace=yes
ConstrainDevices=yes        # Device isolation (GPU)

10. Advanced Features

10.1 Backfill Scheduling

Slurm's secondary scheduling loop. It allows lower-priority jobs to start early in otherwise-idle gaps, as long as doing so does not delay the expected start time of any higher-priority job.

SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30,bf_resolution=300,bf_max_job_test=1200

For backfill to work effectively, specifying a job time limit (--time) is essential.
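The core test backfill applies can be caricatured in a couple of lines. This is a deliberately simplified sketch: real backfill builds per-node resource timelines and tests many candidates per cycle, but the principle, and the reason `--time` matters, is the same.

```python
def can_backfill(now, reserved_start, requested_secs):
    """A lower-priority job may be backfilled into an idle window only if
    its time limit ends before the reserved start of the top-priority job.
    Without a time limit, requested_secs is unbounded and this never holds."""
    return now + requested_secs <= reserved_start

# Nodes are idle now; the top-priority job is reserved to start at t=3600.
assert can_backfill(now=0, reserved_start=3600, requested_secs=1800)      # 30 min job fits
assert not can_backfill(now=0, reserved_start=3600, requested_secs=7200)  # 2 h job would delay it
```

This is why tight, realistic `--time` requests often get jobs running sooner: they fit into gaps that padded requests cannot.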

10.2 Preemption

# slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE              # CANCEL, REQUEUE, SUSPEND, GANG
PreemptExemptTime=00:05:00       # Grace period before preemption

| Mode | Behavior |
|------|----------|
| CANCEL | Terminates the lower-priority job |
| REQUEUE | Requeues the job if possible, cancels it otherwise |
| SUSPEND | Suspends the job in place |
| GANG | Time-shares resources between jobs |

10.3 Large-Scale Job Array Submission

# 1000 tasks, max 50 running concurrently
sbatch --array=0-999%50 sweep.sh

# Environment variables: SLURM_ARRAY_JOB_ID, SLURM_ARRAY_TASK_ID
# MaxArraySize: up to 4,000,001 (configurable)

11. Monitoring and Troubleshooting

11.1 Diagnostic Commands

scontrol ping           # Test controller connectivity
sdiag                   # Scheduler diagnostics (threads, queue, backfill cycles)
scontrol show node X    # Check node status
sacct -j ID --format=JobID,Elapsed,TotalCPU,ReqMem,MaxRSS,State  # Job efficiency
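Given sacct's Elapsed and TotalCPU fields, CPU efficiency can be computed with a short script. A sketch only: the `[D-]HH:MM:SS` parsing assumes sacct's default duration format, and the `seff` utility reports similar numbers where installed.

```python
def to_secs(duration):
    """Parse a [D-]HH:MM:SS duration string as reported by sacct."""
    days = 0
    if "-" in duration:
        day_part, duration = duration.split("-")
        days = int(day_part)
    h, m, s = (int(x) for x in duration.split(":"))
    return ((days * 24 + h) * 3600) + m * 60 + s

def cpu_efficiency(total_cpu, elapsed, ncpus):
    """TotalCPU / (Elapsed x allocated CPUs), as a percentage."""
    return 100.0 * to_secs(total_cpu) / (to_secs(elapsed) * ncpus)

# A job that ran 2 hours on 8 CPUs but accumulated only 12 CPU-hours:
print(round(cpu_efficiency("12:00:00", "02:00:00", 8), 1))  # 75.0
```

Low efficiency usually means the job requested more CPUs than it can use, or spent its time blocked on I/O.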

11.2 Common Problems and Solutions

| Problem | Diagnosis | Solution |
|---------|-----------|----------|
| Node in DRAIN state | `scontrol show node` | Fix the underlying issue, then `scontrol update NodeName=X State=RESUME` |
| Job stuck in PENDING | `squeue -j ID -o "%R"` (check Reason) | Check resources, partition limits, QoS, and dependencies |
| GPU not detected | `slurmd -C`, `slurmd -G` | Verify driver, `gres.conf` AutoDetect, and device files |
| OOM kill | `sacct --format=MaxRSS,ReqMem` | Request more memory or adjust cgroup limits |
| slurmctld overloaded | `sdiag` (thread count) | Enable RPC rate limiting, reduce client polling frequency |

11.3 Common PENDING Reasons

| Reason | Meaning |
|--------|---------|
| Resources | Waiting for resources to become available |
| Priority | Higher-priority jobs are ahead in the queue |
| Dependency | Waiting for dependent jobs to complete |
| QOSMaxJobsPerUserLimit | QoS per-user job count limit reached |
| PartitionTimeLimit | Requested time exceeds the partition time limit |
| ReqNodeNotAvail | Requested node is unavailable |
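To see which reasons dominate across the whole queue, the Reason column can be tallied with a short script. The sample output below is made up for illustration; `%r` is squeue's reason format specifier and `-h` suppresses the header.

```python
from collections import Counter

# Sample output of: squeue -t PENDING -h -o "%i %r"   (job id, reason)
sample = """\
1001 Priority
1002 Resources
1003 Priority
1004 QOSMaxJobsPerUserLimit
1005 Dependency
"""

reasons = Counter(line.split()[1] for line in sample.splitlines())
print(reasons.most_common(1))  # [('Priority', 2)]
```

In practice you would pipe live `squeue` output into such a script (or just `... | awk '{print $2}' | sort | uniq -c`).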

12. Comparison with Other Schedulers

| Feature | Slurm | PBS Pro / Torque | IBM LSF | Kubernetes |
|---------|-------|------------------|---------|------------|
| License | GPL v2 (open-source) | AGPL / Commercial | Commercial (IBM) | Apache 2.0 |
| Primary use | HPC, AI training | HPC, traditional batch | Enterprise HPC | Cloud microservices |
| Scalability | 100K+ nodes | 50K+ | 100K+ | 5K+ |
| GPU support | Native GRES, MIG, MPS | Hook-based | GPU-aware | Device Plugin |
| MPI support | Native (PMIx) | Native | Native | MPI Operator |
| Fairshare | Built-in | Requires Maui/Moab | Built-in | Not built-in |
| TOP500 adoption | ~60-65% | ~10-15% | ~10-15% | Rare |

Trend: Hybrid setups with Slurm (training) + Kubernetes (inference/serving) are becoming mainstream.


13. References

Official Documentation

  1. Slurm Official Documentation — The authoritative reference for all features
  2. Quick Start User Guide
  3. Quick Start Admin Guide
  4. Slurm GRES Scheduling
  5. Slurm Job Array
  6. Slurm Containers Guide
  7. Slurm Configuration Tool — Web-based slurm.conf generator
  8. Slurm Rosetta Stone (PDF) — PBS/LSF/SGE command comparison chart
  9. GitHub: SchedMD/slurm

AI/ML Distributed Training

  1. PyTorch DDP Multi-Node Slurm Examples
  2. PyTorch Multi-Node Training Tutorial
  3. NVIDIA DGX Cloud DeepSpeed Examples
  4. Multi-Node Training on Slurm (GitHub Gist)

Containers

  1. NVIDIA Pyxis GitHub
  2. AWS ParallelCluster Pyxis Tutorial

Tutorials

  1. LLNL Slurm Quick Start Guide
  2. Princeton Slurm Resources
  3. NERSC Training Libraries Documentation
  4. Nebius Slurm Blog
  5. NVIDIA Acquires SchedMD