Commands
List of useful Slurm commands
- Check Slurm cluster state:
  - `sinfo` will display general cluster information.
  - `sinfo -s` will display a summary of cluster information.
  - `sinfo -N -l` will display the status of each node.
  - `scontrol show nodes` will display detailed information about the state of each node, which is helpful for debugging.
  - `scontrol show partitions` will display all available partitions.
  - `scontrol show partition <partition_name>` will display detailed information about the given partition.
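When debugging cluster problems it is often enough to look only at unhealthy nodes. A minimal sketch using standard `sinfo` options (`-t` filters by node state, `-R` lists the reason a node is down or drained):

```bash
# List only nodes that are currently down or draining, in long per-node format.
sinfo -N -l -t down,drain

# Show the reason Slurm recorded for each down/drained node.
sinfo -R
```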
- Check Slurm user information:
  - `sacctmgr show user` will display general user information.
  - `sacctmgr show association` will display user associations (quotas, resource limits, etc.).
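The full association table can be wide and noisy. A hedged sketch for narrowing it down to your own user with a reduced set of columns (the field names follow the standard `sacctmgr` format options and may need adjusting on your cluster):

```bash
# Show only the associations of the current user, with a few selected columns.
sacctmgr show association where user=$USER format=Account,User,Partition,QOS,GrpTRES
```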
- Check Slurm job states:
  - `watch squeue -u $USER` will display information about jobs that are scheduled for execution or are currently running.
  - `squeue -u $USER -o "%.18i %.20P %.15j %.8u %.8T %.10M %.20R"` will also display information about the scheduled jobs, with more details.
  - `watch sacct` will display the current state of each job (press Ctrl+C to exit).
  - `sacct -N slurm-worker-cpu-1` will show the list of executed jobs in which the given node was involved.
  - `sacct -u konrad --format=JobID,JobName,Partition,State,Elapsed -S now-1hour` will display a given user's recent job history (here, jobs of user konrad from the last hour).
  - `scontrol show job <job_id>` will display general information about the job.
  - `sstat <job_id>` will display summary information about a running job.
  - `scontrol getaddrs $(scontrol show job 10 | grep "NodeList=slurm" | cut -d '=' -f 2) | col2 | cut -d ':' -f 1` will display the IP of the worker node on which the given interactive job (here, job 10) is running; a more readable variant is sketched below.
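The last one-liner is dense, so here is a hedged, step-by-step sketch of the same idea. It assumes an interactive job whose ID is stored in `JOB_ID`, and that your Slurm version provides `scontrol getaddrs` (as the one-liner above already implies):

```bash
JOB_ID=10                                # hypothetical job ID; replace with your own
NODES=$(squeue -j "$JOB_ID" -h -o "%N")  # compressed node list of the job, e.g. slurm-worker-cpu-[1-2]

scontrol show hostnames "$NODES"         # expand the node list to one hostname per line
scontrol getaddrs "$NODES"               # resolve the nodes to addresses; drop the port if you only need the IP
```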
- Schedule Slurm jobs:
  - use `sbatch test_job.batch` to schedule a job (an example batch script is sketched below).
  - to schedule a job against one particular partition, for example a GPU partition, use `sbatch -p gpu_large test_job.batch` (note that sbatch options must come before the script name; anything after it is passed to the script as arguments).
  - use `scancel <job_id>` to cancel a job.
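The contents of `test_job.batch` are not shown in this document, so the following is only a minimal sketch of what such a batch script could look like; the job name, partition, resource values, and the `srun hostname` workload are all illustrative and should be adapted to your cluster:

```bash
#!/bin/bash
#SBATCH --job-name=test_job        # name shown in squeue/sacct
#SBATCH --partition=gpu_large      # cluster-specific partition name; adjust as needed
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=00:30:00            # wall-time limit (HH:MM:SS)
#SBATCH --output=%x_%j.out         # %x = job name, %j = job ID

# The actual workload goes here.
srun hostname
```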
- Check resource limitations:
  - use the `quota -u $USER -s` command to check the available storage space.
  - `sacctmgr show qos format=name%30,MaxJobsPerUser%30,MaxSubmitJobsPerUser%30,MaxTRESPerJob%30` will display detailed information about the partition-level resource limits (see the sketch below for narrowing this down to a single QOS).
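To see only the limits that actually apply to you, one option is to first list the QOS values attached to your associations and then inspect one of them. A hedged sketch, where `normal` is just an example QOS name:

```bash
# List the QOS values attached to your associations (header suppressed with -n).
sacctmgr -n show association where user=$USER format=QOS%40

# Inspect the limits of one QOS; replace "normal" with a name from the previous command.
sacctmgr show qos where name=normal format=Name,MaxJobsPerUser,MaxSubmitJobsPerUser,MaxTRESPerJob
```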