Mastering Slurm: A Practical Guide to the HPC/AI Cluster Workload Manager


1. What Is Slurm

Slurm (Simple Linux Utility for Resource Management) is an open-source, fault-tolerant, highly scalable workload manager for Linux clusters. It is the de facto standard in the world of HPC (High-Performance Computing) and AI training infrastructure.

1.1 Three Core Capabilities

  1. Resource Allocation: Grants users exclusive or shared access to compute nodes for a specified duration
  2. Job Execution Framework: Launches, runs, and monitors parallel tasks on allocated nodes
  3. Queue Management: Resolves resource contention through sophisticated scheduling algorithms

1.2 History

| Year | Event |
|------|-------|
| 2002 | First release at Lawrence Livermore National Laboratory (LLNL) |
| 2010 | Core developers founded SchedMD (commercial support, development, training) |
| 2025-12 | NVIDIA acquired SchedMD, committing to open-source maintenance and vendor neutrality |

1.3 Who Uses It

  • ~60-65% of TOP500 supercomputers run Slurm
  • Notable systems: Frontier (Oak Ridge), Perlmutter (NERSC), Polaris (Argonne)
  • Cloud: AWS ParallelCluster, Google Cloud HPC, Azure CycleCloud
  • AI companies: Large-scale LLM training, image generation model training
  • Industries: Autonomous driving, healthcare, energy, finance, government research labs

License: GNU GPL v2 (open-source)


2. Architecture

┌────────────────────────────────────────────────────────────┐
│                     Slurm Architecture                     │
│                                                            │
│  ┌────────────────┐          ┌────────────────┐            │
│  │  slurmctld     │ ◄──HA──► │  slurmctld     │            │
│  │  (Primary)     │          │  (Backup)      │            │
│  │  Head Node     │          │  Backup Node   │            │
│  └───────┬────────┘          └────────────────┘            │
│          │ RPC (TCP)                                       │
│  ┌───────┼──────────────────────────────┐                  │
│  │       ▼           ▼           ▼      │  Compute Nodes   │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ │                  │
│  │  │ slurmd  │ │ slurmd  │ │ slurmd  │ │                  │
│  │  │ Node 01 │ │ Node 02 │ │ Node N  │ │                  │
│  │  └─────────┘ └─────────┘ └─────────┘ │                  │
│  └──────────────────────────────────────┘                  │
│                                                            │
│  ┌────────────────┐          ┌────────────────┐            │
│  │  slurmdbd      │          │  slurmrestd    │            │
│  │  (Database)    │          │  (REST API)    │            │
│  │  MySQL/MariaDB │          │  JSON / JWT    │            │
│  └────────────────┘          └────────────────┘            │
└────────────────────────────────────────────────────────────┘

2.1 Daemon Roles

| Daemon | Role | Location |
|--------|------|----------|
| slurmctld | Central management (scheduling, resource monitoring, job queue) | Head node |
| slurmd | Task execution, resource usage monitoring, status reporting | All compute nodes |
| slurmdbd | Job accounting, history, usage statistics (MySQL/MariaDB backend) | DB server |
| slurmrestd | HTTP RESTful API (JSON, JWT authentication) | API server |

2.2 Plugin Architecture

Slurm supports an extensible plugin architecture:

  • Authentication (auth/munge, auth/jwt)
  • Containers (OCI, Singularity, Enroot)
  • GPU/GRES management
  • MPI implementations (PMIx, PMI2)
  • Scheduling algorithms (backfill, priority multifactor)
  • Process tracking (cgroup, linuxproc)

3. Core Concepts

3.1 Node, Partition, Job

| Concept | Description |
|---------|-------------|
| Node | The basic compute resource, with CPU, memory, GPU, and disk attributes |
| Partition | A logical grouping of nodes that acts as a job queue. Defines access control, resource limits, and priority |
| Job | Resources allocated to a user for a specified time. Has a unique ID, resource requirements, and state |
| Job Step | A set of parallel tasks within a job; lower overhead than submitting separate jobs |

Example job structure:

Job 1234
  ├── Step 0: Data preprocessing (1 node)
  ├── Step 1: Training (4 nodes, 32 tasks)
  └── Step 2: Evaluation (1 node)

3.2 Account, QoS, Fairshare

Account: A hierarchical organizational unit for tracking resource usage

root
├── engineering
│   ├── ml-team
│   └── platform-team
└── research
    ├── physics
    └── biology

QoS (Quality of Service): A set of limits and priorities that control job behavior

  1. Scheduling priority
  2. Preemption policy
  3. Resource limits (CPU, GPU, memory, number of running jobs)

Fairshare: Fair allocation scheduling that considers historical resource usage

  • Each account is assigned a share proportional to its investment/entitlement
  • Users who have used fewer resources get higher priority; heavy users get lower priority
  • More recent usage is weighted more heavily (decay factor)
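The decay weighting can be sketched in a few lines of Python. This is a toy model: the exponential form mirrors the idea behind Slurm's PriorityDecayHalfLife, but the function and constants here are illustrative, not Slurm's actual accounting code.

```python
def decayed_usage(usage_events, now, half_life_secs):
    """Sum past usage with exponential decay: each usage event loses
    half its weight every half-life (illustrative sketch)."""
    total = 0.0
    for timestamp, cpu_seconds in usage_events:
        age = now - timestamp
        total += cpu_seconds * 0.5 ** (age / half_life_secs)
    return total

WEEK = 7 * 24 * 3600  # matches PriorityDecayHalfLife=7-0
events = [(0, 1000.0), (WEEK, 1000.0)]  # two equal usage events, a week apart
# At t = one week, the older event counts only half as much:
print(decayed_usage(events, now=WEEK, half_life_secs=WEEK))  # 1500.0
```

With a shorter half-life, past usage is forgiven faster and fairshare priorities rebound more quickly.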

3.3 Priority Calculation (Multifactor)

Job Priority = site_factor
             + (WeightAge)       × age_factor        -- Wait time
             + (WeightFairshare) × fairshare_factor  -- Fair-share standing
             + (WeightJobSize)   × job_size_factor   -- Job size
             + (WeightPartition) × partition_factor  -- Partition tier
             + (WeightQOS)       × QOS_factor        -- QoS tier
             + Σ (TRES weight × TRES factor)         -- Resource weights (GPU, etc.)
             - nice_factor                           -- User-set nice value
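The formula is a plain weighted sum, which a short Python sketch makes concrete. This is illustrative only: the weights echo the example values in §9.1, each factor is normalized to [0, 1], and the partition and TRES terms are omitted.

```python
def job_priority(weights, factors, nice=0, site_factor=0):
    """Weighted sum of normalized priority factors (each in [0, 1]),
    mirroring the multifactor formula above. Partition/TRES terms omitted."""
    p = site_factor
    for name, weight in weights.items():
        p += weight * factors[name]
    return p - nice

# Weights as in the §9.1 example configuration
weights = {"age": 1000, "fairshare": 10000, "job_size": 500, "qos": 2000}
# Hypothetical normalized factors for one pending job
factors = {"age": 0.5, "fairshare": 0.25, "job_size": 0.1, "qos": 1.0}

print(job_priority(weights, factors))  # 500 + 2500 + 50 + 2000 = 5050.0
```

Note how the fairshare term dominates under these weights: a user's historical usage moves priority far more than wait time does.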

4. Essential Commands

4.1 sbatch — Submit Batch Jobs

# Basic submission
sbatch job.sh

# Override options
sbatch --partition=gpu --nodes=2 --gres=gpu:4 --time=24:00:00 train.sh

# Specify job name and output
sbatch --job-name=my_training --output=train_%j.log job.sh

# Specific account and QoS
sbatch --account=ml-team --qos=high job.sh

4.2 srun — Execute Parallel Jobs/Steps

# Print hostname across 3 nodes
srun -N3 -l /bin/hostname

# Interactive GPU job
srun --partition=gpu --gres=gpu:1 --pty bash

# Run with GPU binding
srun --ntasks=4 --gpus-per-task=1 --gpu-bind=closest python train.py

4.3 salloc — Interactive Resource Allocation

# Allocate 2 GPU nodes for 4 hours
salloc --nodes=2 --gres=gpu:4 --time=04:00:00 --partition=gpu

# Request specific features
salloc -N1 --constraint=a100 --gres=gpu:8 --mem=512G

4.4 squeue — View the Job Queue

# Show only my jobs
squeue -u $USER

# Custom format
squeue -o "%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"

# Show PENDING jobs and reasons
squeue -t PENDING -o "%.10i %.20j %.8u %.10M %R"

4.5 sinfo — View System Information

# Partition summary
sinfo

# Show GPU resources
sinfo -o "%20N %10c %10m %20G %10t"

# Idle nodes
sinfo -t idle

4.6 scancel — Cancel Jobs

scancel 12345              # Cancel a specific job
scancel -u $USER           # Cancel all my jobs
scancel -t PENDING -u $USER  # Cancel only PENDING jobs
scancel 12345_[1-10]       # Cancel specific array tasks

4.7 sacct — Job Accounting

# View completed jobs
sacct -u $USER

# Detailed format
sacct -j 12345 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,MaxVMSize,TotalCPU

# Query by date range
sacct --starttime=2026-01-01 --endtime=2026-01-31 -u $USER

4.8 scontrol — Administrative Control

scontrol show job 12345       # Job details
scontrol show node gpu-001    # Node details
scontrol hold 12345           # Hold a job
scontrol release 12345        # Release a hold
scontrol update JobId=12345 TimeLimit=48:00:00  # Modify time limit
scontrol ping                 # Test controller connectivity

5. Job Script Examples

5.1 Basic CPU Job

#!/bin/bash
#SBATCH --job-name=basic_job
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log

module load gcc/12.2.0
module load openmpi/4.1.5

echo "Job started on $(hostname) at $(date)"
echo "SLURM_JOB_ID: $SLURM_JOB_ID"
echo "SLURM_NODELIST: $SLURM_NODELIST"

srun ./my_simulation --input data.csv --output results/
echo "Job completed at $(date)"

5.2 Single-Node GPU Job

#!/bin/bash
#SBATCH --job-name=gpu_training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:a100:2
#SBATCH --time=12:00:00
#SBATCH --output=train_%j.log

module load cuda/12.2
module load anaconda/2024
conda activate ml_env

echo "Available GPUs: $CUDA_VISIBLE_DEVICES"
nvidia-smi

python train.py \
    --model resnet50 \
    --batch-size 256 \
    --epochs 100 \
    --data /shared/datasets/imagenet \
    --output /scratch/$USER/checkpoints/

5.3 Multi-Node MPI Job

#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --partition=compute
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=24:00:00
#SBATCH --exclusive

module load openmpi/4.1.5

echo "Running on $SLURM_NNODES nodes with $SLURM_NTASKS total tasks"
srun ./weather_simulation --grid-size 4096x4096x128 --timesteps 10000

5.4 Job Array (Hyperparameter Sweep)

#!/bin/bash
#SBATCH --job-name=hparam_sweep
#SBATCH --partition=gpu
#SBATCH --array=0-19%5            # 20 tasks, max 5 running concurrently
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --time=06:00:00
#SBATCH --output=sweep_%A_%a.log  # %A=Array ID, %a=Task ID

LEARNING_RATES=(0.1 0.01 0.001 0.0001 0.00001)
BATCH_SIZES=(32 64 128 256)

LR_IDX=$((SLURM_ARRAY_TASK_ID / 4))
BS_IDX=$((SLURM_ARRAY_TASK_ID % 4))
LR=${LEARNING_RATES[$LR_IDX]}
BS=${BATCH_SIZES[$BS_IDX]}

echo "Task $SLURM_ARRAY_TASK_ID: LR=$LR, BS=$BS"
python train.py --lr $LR --batch-size $BS \
    --experiment-name "sweep_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
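The index arithmetic above maps the 20 task IDs onto the 5 × 4 parameter grid. A quick sanity check of the same mapping in plain Python:

```python
# Same decoding as the sweep script: task_id // 4 picks the learning
# rate, task_id % 4 picks the batch size.
LEARNING_RATES = [0.1, 0.01, 0.001, 0.0001, 0.00001]
BATCH_SIZES = [32, 64, 128, 256]

def decode(task_id):
    return LEARNING_RATES[task_id // 4], BATCH_SIZES[task_id % 4]

assert decode(0) == (0.1, 32)        # first combination
assert decode(5) == (0.01, 64)       # second LR, second BS
assert decode(19) == (0.00001, 256)  # last of the 20 tasks
```

Verifying the decode locally before submitting saves burning GPU hours on a mis-indexed sweep.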

5.5 Job Dependencies (Pipeline)

# Step 1: Preprocessing
JOB1=$(sbatch --parsable preprocess.sh)

# Step 2: Training (after preprocessing succeeds)
JOB2=$(sbatch --parsable --dependency=afterok:$JOB1 train.sh)

# Step 3: Evaluation (after training succeeds)
JOB3=$(sbatch --parsable --dependency=afterok:$JOB2 evaluate.sh)

# Step 4: Cleanup (after all complete, regardless of success/failure)
JOB4=$(sbatch --parsable --dependency=afterany:$JOB1:$JOB2:$JOB3 cleanup.sh)

Dependency Types:

| Type | Meaning |
|------|---------|
| `after:jobid` | After the job starts |
| `afterok:jobid` | After the job completes successfully |
| `afternotok:jobid` | After the job fails |
| `afterany:jobid` | After the job completes (regardless of success/failure) |
| `aftercorr:jobid` | Arrays: after the corresponding task succeeds |
| `singleton` | Only one job with the same name runs at a time |

6. GPU Scheduling (GRES)

6.1 Configuration

slurm.conf:

GresTypes=gpu,mps,shard
NodeName=gpu-node[01-08] Gres=gpu:a100:8 CPUs=128 RealMemory=1024000
AccountingStorageTres=gres/gpu

gres.conf:

# Auto-detection (recommended)
AutoDetect=nvml      # NVIDIA (NVML)
# AutoDetect=rsmi    # AMD (ROCm SMI)
# AutoDetect=oneapi  # Intel
# AutoDetect=nrt     # AWS Trainium/Inferentia

6.2 How to Request GPUs

# 2 GPUs of any type
sbatch --gres=gpu:2 job.sh

# Specific GPU type
sbatch --gres=gpu:a100:4 job.sh

# GPUs per node
sbatch --nodes=4 --gpus-per-node=8 job.sh

# GPUs per task + CPU/memory affinity
sbatch --ntasks=8 --gpus-per-task=1 --cpus-per-gpu=8 --mem-per-gpu=32G job.sh

Slurm automatically sets CUDA_VISIBLE_DEVICES for GPU isolation.

6.3 GPU Sharing Options

| Method | Description | Configuration |
|--------|-------------|---------------|
| MPS (Multi-Process Service) | Multi-process GPU sharing with concurrent kernel execution | `Gres=gpu:a100:8,mps:800` |
| MIG (Multi-Instance GPU) | Partitions A100/H100 GPUs into independent instances | `AutoDetect=nvml` (auto-detected) |
| Shard | GPU sharing without isolation (lightweight inference) | `Gres=gpu:2,shard:64` |

7. AI/ML Distributed Training

7.1 PyTorch DDP (Multi-Node)

sbatch script (4 nodes x 8 GPUs = 32 GPUs):

#!/bin/bash
#SBATCH --job-name=ddp-multigpu
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=64
#SBATCH --mem=0
#SBATCH --exclusive
#SBATCH --time=72:00:00
#SBATCH --output=ddp_%j.log

export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1) + 30000 ))
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export WORLD_SIZE=$(( SLURM_NNODES * 8 ))

echo "MASTER_ADDR=$MASTER_ADDR, MASTER_PORT=$MASTER_PORT, WORLD_SIZE=$WORLD_SIZE"

# NCCL configuration
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=^docker0,lo

srun torchrun \
    --nnodes $SLURM_NNODES \
    --nproc_per_node 8 \
    --rdzv_id $SLURM_JOB_ID \
    --rdzv_backend c10d \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    train.py \
        --model llama-7b \
        --data /shared/datasets/openwebtext \
        --batch-size 32 \
        --gradient-accumulation-steps 4

Python training script pattern:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

def setup():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank

def main():
    rank, world_size, local_rank = setup()

    model = MyModel().cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    dataset = MyDataset(...)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # Required for proper shuffling
        for batch in dataloader:
            ...

    dist.destroy_process_group()

7.2 DeepSpeed (Multi-Node)

#!/bin/bash
#SBATCH --job-name=deepspeed-train
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=64
#SBATCH --mem=0
#SBATCH --exclusive
#SBATCH --time=96:00:00

export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(( RANDOM % (50000 - 30000 + 1) + 30000 ))
export NNODES=$SLURM_NNODES
export NUM_PROCESSES=$(( NNODES * 8 ))

srun bash -c 'accelerate launch \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank $SLURM_NODEID \
    --num_processes $NUM_PROCESSES \
    --num_machines $NNODES \
    --use_deepspeed \
    --zero_stage 2 \
    --mixed_precision fp16 \
    train.py \
        --model_name_or_path meta-llama/Llama-2-7b \
        --per_device_train_batch_size 4 \
        --gradient_accumulation_steps 8'

7.3 Horovod (MPI-Based)

#!/bin/bash
#SBATCH --job-name=horovod-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

module load openmpi/4.1.5 cuda/12.2
srun --mpi=pmix python train_horovod.py --epochs 100 --batch-size 64

8. Container Integration

8.1 Singularity / Apptainer

#!/bin/bash
#SBATCH --job-name=container_job
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# --nv flag enables GPU support
srun singularity exec --nv \
    --bind /scratch/$USER:/data \
    --bind /shared/datasets:/datasets \
    /shared/containers/pytorch_24.03.sif \
    python train.py --data /datasets/imagenet

8.2 Enroot + Pyxis (NVIDIA)

Enroot is NVIDIA's lightweight container runtime, and Pyxis is a Slurm SPANK plugin that provides the srun --container-* flags.

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --exclusive

srun --container-image=nvcr.io/nvidia/pytorch:24.03-py3 \
     --container-mounts=/shared/data:/data,/scratch/$USER:/workspace \
     --container-workdir=/workspace \
     torchrun \
         --nnodes=$SLURM_NNODES \
         --nproc_per_node=8 \
         --rdzv_backend=c10d \
         --rdzv_endpoint=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1):29500 \
         train.py --data /data

Widely used on NVIDIA DGX SuperPOD and DGX Cloud.


9. Configuration (slurm.conf)

9.1 Key Configuration Parameters

# Cluster identification
ClusterName=my_cluster
SlurmctldHost=controller01
SlurmctldHost=controller02     # Backup controller

# Authentication
AuthType=auth/munge

# Scheduling
SchedulerType=sched/backfill
SelectType=select/cons_tres     # Consumable trackable resources
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=500
PriorityWeightQOS=2000
PriorityDecayHalfLife=7-0       # 7 days

# Resource management
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
GresTypes=gpu,mps,shard

# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageTres=gres/gpu

# Job defaults
DefMemPerCPU=4096               # 4GB
MaxMemPerCPU=16384              # 16GB
DisableRootJobs=YES
MpiDefault=pmix_v4

# Node definitions
NodeName=compute[001-100] CPUs=64 RealMemory=256000 State=UNKNOWN
NodeName=gpu[001-032] CPUs=128 RealMemory=1024000 Gres=gpu:a100:8 Feature=a100,nvlink

# Partition definitions
PartitionName=compute Nodes=compute[001-100] Default=YES MaxTime=7-00:00:00
PartitionName=gpu Nodes=gpu[001-032] MaxTime=3-00:00:00 AllowGroups=gpu-users
PartitionName=debug Nodes=compute[001-004],gpu[001-002] MaxTime=01:00:00 PriorityTier=100

9.2 cgroup.conf

ConstrainCores=yes          # CPU core pinning
ConstrainRAMSpace=yes       # Enforce memory limits
AllowedRAMSpace=100         # % of allocated memory (OOM Kill if exceeded)
ConstrainSwapSpace=yes
ConstrainDevices=yes        # Device isolation (GPU)

10. Advanced Features

10.1 Backfill Scheduling

Slurm's secondary scheduling loop. It allows lower-priority jobs to start early in otherwise-idle gaps, as long as doing so does not delay the expected start time of any higher-priority job.

SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30,bf_resolution=300,bf_max_job_test=1200

For backfill to work effectively, specifying a job time limit (--time) is essential.
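The core test backfill applies can be caricatured in a couple of lines. This is a deliberately simplified sketch: real backfill builds per-node resource timelines and tests many candidates per cycle, but the principle, and the reason `--time` matters, is the same.

```python
def can_backfill(now, reserved_start, requested_secs):
    """A lower-priority job may be backfilled into an idle window only if
    its time limit ends before the reserved start of the top-priority job.
    Without a time limit, requested_secs is unbounded and this never holds."""
    return now + requested_secs <= reserved_start

# Nodes are idle now; the top-priority job is reserved to start at t=3600.
assert can_backfill(now=0, reserved_start=3600, requested_secs=1800)      # 30 min job fits
assert not can_backfill(now=0, reserved_start=3600, requested_secs=7200)  # 2 h job would delay it
```

This is why tight, realistic `--time` requests often get jobs running sooner: they fit into gaps that padded requests cannot.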

10.2 Preemption

# slurm.conf
PreemptType=preempt/qos
PreemptMode=REQUEUE              # CANCEL, REQUEUE, SUSPEND, GANG
PreemptExemptTime=00:05:00       # Grace period before preemption

| Mode | Behavior |
|------|----------|
| CANCEL | Terminates the lower-priority job |
| REQUEUE | Requeues the job if possible, cancels it otherwise |
| SUSPEND | Suspends the job in place |
| GANG | Time-shares resources between jobs |

10.3 Large-Scale Job Array Submission

# 1000 tasks, max 50 running concurrently
sbatch --array=0-999%50 sweep.sh

# Environment variables: SLURM_ARRAY_JOB_ID, SLURM_ARRAY_TASK_ID
# MaxArraySize: up to 4,000,001 (configurable)

11. Monitoring and Troubleshooting

11.1 Diagnostic Commands

scontrol ping           # Test controller connectivity
sdiag                   # Scheduler diagnostics (threads, queue, backfill cycles)
scontrol show node X    # Check node status
sacct -j ID --format=JobID,Elapsed,TotalCPU,ReqMem,MaxRSS,State  # Job efficiency
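Given sacct's Elapsed and TotalCPU fields, CPU efficiency can be computed with a short script. A sketch only: the `[D-]HH:MM:SS` parsing assumes sacct's default duration format, and the `seff` utility reports similar numbers where installed.

```python
def to_secs(duration):
    """Parse a [D-]HH:MM:SS duration string as reported by sacct."""
    days = 0
    if "-" in duration:
        day_part, duration = duration.split("-")
        days = int(day_part)
    h, m, s = (int(x) for x in duration.split(":"))
    return ((days * 24 + h) * 3600) + m * 60 + s

def cpu_efficiency(total_cpu, elapsed, ncpus):
    """TotalCPU / (Elapsed x allocated CPUs), as a percentage."""
    return 100.0 * to_secs(total_cpu) / (to_secs(elapsed) * ncpus)

# A job that ran 2 hours on 8 CPUs but accumulated only 12 CPU-hours:
print(round(cpu_efficiency("12:00:00", "02:00:00", 8), 1))  # 75.0
```

Low efficiency usually means the job requested more CPUs than it can use, or spent its time blocked on I/O.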

11.2 Common Problems and Solutions

| Problem | Diagnosis | Solution |
|---------|-----------|----------|
| Node in DRAIN state | `scontrol show node` | Fix the underlying issue, then `scontrol update NodeName=X State=RESUME` |
| Job stuck in PENDING | `squeue -j ID -o "%R"` (check Reason) | Check resources, partition limits, QoS, and dependencies |
| GPU not detected | `slurmd -C`, `slurmd -G` | Verify driver, `gres.conf` AutoDetect, and device files |
| OOM kill | `sacct --format=MaxRSS,ReqMem` | Request more memory or adjust cgroup limits |
| slurmctld overloaded | `sdiag` (thread count) | Enable RPC rate limiting, reduce client polling frequency |

11.3 Common PENDING Reasons

| Reason | Meaning |
|--------|---------|
| Resources | Waiting for resources to become available |
| Priority | Higher-priority jobs are ahead in the queue |
| Dependency | Waiting for dependent jobs to complete |
| QOSMaxJobsPerUserLimit | QoS per-user job count limit reached |
| PartitionTimeLimit | Requested time exceeds the partition time limit |
| ReqNodeNotAvail | Requested node is unavailable |
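To see which reasons dominate across the whole queue, the Reason column can be tallied with a short script. The sample output below is made up for illustration; `%r` is squeue's reason format specifier and `-h` suppresses the header.

```python
from collections import Counter

# Sample output of: squeue -t PENDING -h -o "%i %r"   (job id, reason)
sample = """\
1001 Priority
1002 Resources
1003 Priority
1004 QOSMaxJobsPerUserLimit
1005 Dependency
"""

reasons = Counter(line.split()[1] for line in sample.splitlines())
print(reasons.most_common(1))  # [('Priority', 2)]
```

In practice you would pipe live `squeue` output into such a script (or just `... | awk '{print $2}' | sort | uniq -c`).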

12. Comparison with Other Schedulers

| Feature | Slurm | PBS Pro / Torque | IBM LSF | Kubernetes |
|---------|-------|------------------|---------|------------|
| License | GPL v2 (open-source) | AGPL / Commercial | Commercial (IBM) | Apache 2.0 |
| Primary use | HPC, AI training | HPC, traditional batch | Enterprise HPC | Cloud microservices |
| Scalability | 100K+ nodes | 50K+ | 100K+ | 5K+ |
| GPU support | Native GRES, MIG, MPS | Hook-based | GPU-aware | Device Plugin |
| MPI support | Native (PMIx) | Native | Native | MPI Operator |
| Fairshare | Built-in | Requires Maui/Moab | Built-in | Not built-in |
| TOP500 adoption | ~60-65% | ~10-15% | ~10-15% | Rare |

Trend: Hybrid setups with Slurm (training) + Kubernetes (inference/serving) are becoming mainstream.


13. References

Official Documentation

  1. Slurm Official Documentation — The authoritative reference for all features
  2. Quick Start User Guide
  3. Quick Start Admin Guide
  4. Slurm GRES Scheduling
  5. Slurm Job Array
  6. Slurm Containers Guide
  7. Slurm Configuration Tool — Web-based slurm.conf generator
  8. Slurm Rosetta Stone (PDF) — PBS/LSF/SGE command comparison chart
  9. GitHub: SchedMD/slurm

AI/ML Distributed Training

  1. PyTorch DDP Multi-Node Slurm Examples
  2. PyTorch Multi-Node Training Tutorial
  3. NVIDIA DGX Cloud DeepSpeed Examples
  4. Multi-Node Training on Slurm (GitHub Gist)

Containers

  1. NVIDIA Pyxis GitHub
  2. AWS ParallelCluster Pyxis Tutorial

Tutorials

  1. LLNL Slurm Quick Start Guide
  2. Princeton Slurm Resources
  3. NERSC Training Libraries Documentation
  4. Nebius Slurm Blog
  5. NVIDIA Acquires SchedMD