Slurm is a batch scheduler that enables users to submit compute 'jobs' of varying scope to our compute clusters. It will queue up jobs such that the compute resources available in SDF are fairly shared and distributed for all users. This page describes basic usage of the Slurm batch scheduler at SLAC. It will provide some simple examples of how to request common resources.
Whilst your desktop or laptop computer has a fast processor and quick access to data stored on its local hard disk/SSD, you may want to run large compute tasks that require more CPU/GPU/memory, or to process a large amount of data. Our compute servers are part of a Batch system that allows you to accomplish such tasks in a reasonable amount of time. Our servers also have fast access to centralised storage, come with widely-used software packages and images pre-installed, and enable you to run these larger compute tasks without impacting your own local desktop/laptop resources.
Historically, SLAC has used IBM's LSF as our Batch scheduler software. However, with the addition of new hardware such as our NVIDIA GPUs, we have decided to switch to Slurm to schedule compute jobs as it is also commonly used across other academic and laboratory environments. We hope that this commonality and consistency with other facilities will enable easier usage for users, as well as simpler administration for the Science Computing team here at SLAC.
The purpose of a batch system is to enable efficient sharing of the CPUs, GPUs, memory, and ephemeral storage that exist in a compute Cluster. The cluster is composed of many servers, often called Batch Nodes. As the number of these batch nodes and their resources (such as GPUs) in our environment is finite, we need to keep account of which users consume which resources so that we can provide access to all users in a fair manner. At the same time, as groups/teams can purchase their own hardware resources to be added to the SDF cluster, we must provide a method by which authorised users in those groups/teams can have priority access to their purchased resources. Slurm users are associated with one or more slurm Accounts, which serve the dual function of granting access to requested resources and of tracking usage, so that, in fairness to other users, no individual user can consume all of the resources in the cluster. Partitions are used to define the restrictions on Batch Nodes in the cluster.
In order to use the Batch cluster, a user logs in to a Login Node. These dedicated servers are used for the sole purpose of interacting with the batch system and should not be used as compute resources to actually run any intensive work (using considerable CPU, memory, disk etc for more than a few minutes). These login nodes are shared resources where many people will concurrently log in to use the batch system, and running intensive jobs on such machines will often cause issues for other users on those nodes and prevent them from submitting jobs. One would typically SSH into such a node and run batch commands (like `sbatch`, `squeue` etc) in order to queue work onto the cluster and monitor it. These units of work are often called Jobs. The batch scheduler will then use the defined rules and algorithms to prioritise (or deprioritise) a user's Jobs against those of everyone else who is effectively competing to use the cluster. The Jobs themselves will then actually run on the Batch Nodes in the cluster. Depending on local policies, you may or may not be able to log in to these Batch Nodes directly. SLAC has typically forbidden logging in to the Batch Nodes, as multiple users' Jobs are typically running on a single Batch Node.
It is also possible to request an interactive session on a Batch Node. This would be akin to SSH'ing into a Batch Node directly - however, because these interactive batch jobs are run in Linux cgroups, they are effectively run in a sandbox, and hence are more isolated from other processes than if you had simply used SSH to log in. As all nodes in the cluster should be relatively homogeneous, one should not need to request a specific Batch Node by name (although it is possible). One should typically not use interactive batch sessions for long periods of time, as such usage is often idle time and not an efficient use of the cluster's resources. It is recommended, however, to use such sessions to help debug issues with your usual batch Jobs.
A Partition is a logical grouping of Batch Nodes. These may be servers of a similar technical specification (eg Cascade Lake CPUs, Tesla GPUs etc), or grouped by ownership - eg the SUNCAT group may have purchased a number of servers, so we put them all into a Partition. For the SDF, we partition machines according to the science and engineering groups who have purchased servers for the SDF. We do this so that members (or associates) of those groups can have priority access to their hardware. Whilst we give everyone access to all hardware via the shared partition, users who belong to groups that do not own any hardware in SDF will have lower priority access to stakeholders' resources.
Users should contact their Coordinators to be added to appropriate group Partitions to get priority access to resources.
You can also view the active Partitions and associated hardware by using the `sinfo` command - each line shows the list and number of nodes (nodelist) that are in a specific state for that partition. As this command provides a point-in-time snapshot of the status of the SDF cluster, your output may vary.
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
shared* up 5-00:00:00 26 drain* rome[0011-0012,0081-0084,0101-0104,0181-0184,0191-0194,0201-0204,0211-0214]
shared* up 5-00:00:00 5 down* rome[0041-0044,0052]
shared* up 5-00:00:00 4 drain rome[0132-0133,0151],tur000
shared* up 5-00:00:00 24 mix rome[0001,0013],tur[001-017],volt[000-004]
shared* up 5-00:00:00 52 alloc rome[0002-0004,0014,0021-0024,0031-0034,0051,0053-0054,0061-0064,0071-0074,0091-0094,0111-0114,0121-0124,0131,0134,0141-0144,0152-0154,0161-0164,0171-0174]
supercdms up 5-00:00:00 1 mix rome0001
supercdms up 5-00:00:00 1 alloc rome0002
cryoem up 10-00:00:0 1 drain* rome0011
cryoem up 10-00:00:0 1 drain tur000
cryoem up 10-00:00:0 5 mix tur[004-005,009-011]
cryoem up 10-00:00:0 2 alloc rome[0003-0004]
suncat up 5-00:00:00 9 drain* rome[0012,0081-0084,0101-0104]
suncat up 5-00:00:00 5 down* rome[0041-0044,0052]
suncat up 5-00:00:00 1 mix rome0013
suncat up 5-00:00:00 27 alloc rome[0014,0021-0024,0031-0034,0051,0053-0054,0061-0064,0071-0074,0091-0094,0111-0113]
fermi up 5-00:00:00 3 drain rome[0132-0133,0151]
fermi up 5-00:00:00 13 alloc rome[0114,0121-0124,0131,0134,0141-0144,0152-0153]
hps up 5-00:00:00 3 alloc rome[0154,0161-0162]
lcls up 5-00:00:00 16 drain* rome[0181-0184,0191-0194,0201-0204,0211-0214]
lcls up 5-00:00:00 1 mix tur003
lcls up 5-00:00:00 6 alloc rome[0163-0164,0171-0174]
neutrino up 5-00:00:00 4 mix tur001,volt[002-004]
ml up 5-00:00:00 8 mix tur[002,006-008,014-017]
atlas up 5-00:00:00 2 mix tur[012-013]
usatlas up 5-00:00:00 2 mix tur[012-013]
All SLAC employees, user-facility users and researchers are entitled to compute and storage resources at SLAC. This is provided through SLAC's indirect funding. The shared partition is the means by which all users can use the compute resources of SDF without having to buy hardware.
We require that all servers added to the SDF be placed both in the owner's partition and in this global shared partition.
However, as compute is a limited resource, all batch jobs submitted into the shared partition are subject to pre-emption.
!> TODO: Add more information on the shared partition and its policies of use.
As we typically have fewer resources than the aggregate requirements of all user groups at SLAC, we cannot provide resources to everyone whenever they need them. We therefore have two classes of users on our systems: those whose groups have purchased servers, and those who use the shared partition backed by resources provided through SLAC's indirect funding. In order to be fair to the owners of servers who have contributed their resources to the SDF, we provide them immediate access to their servers whenever they need it. At the same time, users on the shared partition are allowed to use the owners' servers when the owners do not need them. In order to provide this level of guarantee to the owners, we have to 'kick off' any shared scavenger jobs that may be running on a server at the time the owners request access to their hardware. This is known as pre-emption.
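If you want to see which QoS your own queued or running jobs were submitted with (and hence whether they are candidates for pre-emption), squeue can report it. A minimal sketch, assuming a squeue version that supports the `--Format` fields used here:

# list your jobs with their partition, QoS and current state
squeue -u $USER --Format=jobid,partition,qos,state,reason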
Accounts are used to allow us to track, monitor and report on usage of SDF resources. As such, users who are members of stakeholder groups that own SDF hardware should charge their jobs against the relevant Account. We do not currently associate any monetary value with Accounts, but we do require all Jobs to be charged against an Account.
In order to map a User to the appropriate hardware (Partition) and charge against the relevant Account, an Allocation must be created in slurm. For all intents and purposes, we have a one-to-one mapping between Partition and Account (ie the shared Partition is charged against the shared Account). An Allocation therefore acts as an authorisation definition for resource use in SDF.
We delegate authority to Coordinators to allow them to define their own Allocations. As such, you should contact your local Coordinator to obtain permissions to submit jobs into desired Partitions.
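To check which Allocations (Associations) you already have - ie which Accounts and Partitions you are permitted to charge against - you can query the slurm accounting database. A minimal sketch, assuming `sacctmgr` is available on the login nodes:

# show your own Associations: account, partition and QoS
sacctmgr show associations user=$USER format=account,partition,qos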
!> TODO:
There are two ways to interact with slurm:
- using command line tools on the SDF login hosts
- using the ondemand web interface
Common actions that you may want to perform are:

| Action | Command | Description |
| --- | --- | --- |
| Submit a job | `srun` or `sbatch` | request a quick job to be run - eg an interactive terminal with `srun`, or longer job(s) with `sbatch` |
| Show information about a job | `scontrol show job <jobid>` | shows detailed information about the state, resources requested etc. for a job |
| Cancel or terminate a job | `scancel <jobid>` | cancel a job; you can also use `--signal=INT` to send a unix signal to the job to cleanly terminate |
| Show position in queue | `sprio` | shows the fairshare calculations that determine your place in line for the job to start |
| Show running statistics about a job | `sstat` | show job usage details |
| Modify accounts / add users to partitions etc. | `sacctmgr` | manage Associations |
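As a quick illustration of the `sprio` and `sstat` rows in the table above, the following sketch shows how you might check your queue priority and the live resource usage of a running job (replace `<jobid>` with a real job id):

# show the fairshare/priority factors for your pending jobs
sprio -u $USER

# show CPU and memory usage statistics for a running job
sstat -j <jobid> --format=JobID,AveCPU,AveRSS,MaxRSS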
We run partitions based upon the groups that own the hardware associated with each partition. A list of partitions and coordinators can be found here. This allows you to quickly identify suitable resources that have been organised by your local Coordinator and ourselves, ensuring that your jobs run on suitable hardware.
We assign accounts such that they have the same name as the relevant partition. This simplifies usage, as it becomes obvious which account to charge your usage against (don't worry, we do not bill for your usage). We collect such information to aid with planning and scheduler optimisations.
Group Coordinators have the power to add and remove users from their partitions (actually it would be an Allocation).
?> TODO: add instructions for coordinators
To also simplify usage for you, our users:
- if you do not define an account with `--account`, then we will assume you want to use the same account name as that of the partition name.
- if you do not specify a partition with `--partition`, we will assume you want your job to run in the shared partition (see the example below).
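Under these defaults, the two submissions below should therefore be equivalent (shown as a sketch; `script.sh` is a placeholder for your own batch script):

# explicit account and partition
sbatch --account shared --partition shared script.sh

# the same, relying on the defaults described above
sbatch script.sh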
Use the `srun` command:
srun --partition shared -n 1 --time=01:00:00 --pty /bin/bash
This will execute `/bin/bash` on a (scheduled) server in the Partition shared and charge against the Account shared. It will request a single CPU for one hour and launch a pseudo terminal (pty) in which bash will run. You may be provided different Accounts and Partitions by your Coordinator and should use them when possible.
Note that when you 'exit' the interactive session, it will relinquish the resources for someone else to use. This also means that if your terminal is disconnected (you turn your laptop off, lose network connectivity etc), then the Job will also terminate (similar to ssh).
Same as for a normal Interactive Session (above), but add the `--x11` option:
srun --x11 --partition shared -n 1 --time=01:00:00 --pty /bin/bash
In order to submit a batch job, you have to:
- create a text file containing some slurm directives (lines starting with `#SBATCH`) and a list of commands/programs that you wish to run. This is called a batch script.
- submit this batch script to the cluster using the `sbatch` command
- monitor the job using `scontrol show job`

Create a job submission script (text file) `script.sh` (or whatever filename you wish):
#!/bin/bash
#SBATCH --partition=shared
#
#SBATCH --job-name=test
#SBATCH --output=output-%j.txt
#SBATCH --error=output-%j.txt
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --mem-per-cpu=1g
#
#SBATCH --time=0-00:10:00
#
#SBATCH --gpus 1
<commands here>
In the above example, we write a batch script for a job named 'test' (using `--job-name`). You can choose whatever name you wish to give the job so that you can quickly identify it. Both stdout and stderr will be written to the same file `output-%j.txt` in the current working directory - %j will be replaced with the slurm job id (using the `--output` and `--error` options). We request a single Task (think of it as an MPI rank) and that single task will request 12 CPUs, each of which will be allocated 1GB of RAM - so a total of 12GB. By default, `--ntasks` will be equivalent to the number of nodes (servers) asked for. In order to aid scheduling (and potentially prioritising the Job), we limit the duration of the Job to 10 minutes. The format of the time limit field is `D-HH:MM:SS`. We also request a single GPU with the Job. This will be exposed via CUDA_VISIBLE_DEVICES.
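As an illustration of what might replace `<commands here>`, the following hypothetical payload simply reports the resources the job was given; note that `nvidia-smi` will only work on nodes that actually have GPUs:

# report the allocated resources for this job
echo "Running on $(hostname) with ${SLURM_CPUS_PER_TASK} CPUs"
echo "GPUs visible to this job: ${CUDA_VISIBLE_DEVICES}"
nvidia-smi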
?> TIP: only lines starting with `#SBATCH` will be processed by the slurm interpreter. As the script itself is just a bash script, any line beginning with `#` will be ignored. As such, you may also comment out slurm directives by using something like `##SBATCH`.
We can define where the job will run using the `--partition` option. All SLAC users have access to the shared partition. Your group may also have access to other partitions that will provide you priority (immediate) access to resources.
?> You can think of the batch script as a shell script with some extra directives that only slurm understands. As such, you can also just run the same script on the command line to ensure that your job will work; ie `sh script.sh` will run the same set of commands, but on the local host. This also means that if you already have a shell script that runs your code, you can 'slurmify' it by adding slurm directives with `#SBATCH`. Please note, however, that if your code uses GPUs, the login node may not have any GPUs and hence your local run will fail.
?> add something about submitting sbatch commands directly without using a batch script - ie --wrap
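For reference, `sbatch` does support a `--wrap` option that wraps a single command string in a minimal batch script for you, so a trivial job can be submitted without writing a script at all. A small sketch:

# submit a one-line job without writing a batch script
sbatch --partition shared --time=00:10:00 --wrap "hostname"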
It is important that you specify a meaningful duration for which you expect your job to run. This allows the slurm scheduler to appropriately prioritize your job against other jobs that are competing for the limited resources in the cluster. The duration of a job may depend upon many different factors, such as the type of hardware that you may be constraining your job to run on, how well your code/application scales across multiple nodes, the speed of memory and disk access, etc.
You can specify the expected duration with the `--time` option. Valid time formats are:
M (M minutes)
M:S (M minutes, S seconds)
H:M:S (H hours, M minutes, S seconds)
D-H (D days, H hours)
D-H:M (D days, H hours, M minutes)
D-H:M:S (D days, H hours, M minutes, S seconds)
Once the job exceeds the specified time limit, it will be terminated. Unless you checkpoint your application as it progresses, this may result in wasted cycles and the need to submit the job again with a longer duration.
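If you want to check how much of its requested time a running job has left, squeue can report it. A minimal sketch (replace `<jobid>` with your job id; `%M` is the time used so far and `%L` the time remaining):

# show elapsed time and time remaining for a job
squeue -j <jobid> -o "%.18i %.10M %.10L"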
!> TODO
!> TODO
!> TODO
!> TODO
!> diff between cd'ing on the script and --workingdir
?> note stuff about working directories etc.
After you have created a batch script, you then need to tell slurm to queue it so that it may run. The command to use is `sbatch`, which is synonymous with the `bsub` command in LSF. Therefore, to submit the script `script.sh` we simply run
sbatch script.sh
If successful, it will provide you with the job id under which the script will run. You can use this job id to monitor your job's progress.
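A typical session might look like the following (the job id shown is made up):

$ sbatch script.sh
Submitted batch job 123456
$ scontrol show job 123456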
?> TIP: you can also provide slurm directives directly on the command line rather than within the batch script. When given as arguments to `srun` or `sbatch`, they will take precedence over the same directives specified in the batch script. e.g. if you run `sbatch --partition ml script.sh` and your script.sh contains a directive to use the shared partition, your job will be submitted into the ml partition.
You can then use the following command to monitor your job's progress
squeue
And you can cancel the job with
scancel <jobid>
!> TODO
You can use the `--gpus` option to specify gpus for your jobs. Using a number will request that number of any available gpu (what you get depends upon your Account/Association and what is available when you request it). You can also specify the type of gpu by prefixing the number with the model name, eg
# request single gpu
srun -A shared -p shared -n 1 --gpus 1 --pty /bin/bash
# request a gtx 1080 gpu
srun -A shared -p shared -n 1 --gpus geforce_gtx_1080_ti:1 --pty /bin/bash
# request an rtx 2080 gpu
srun -A shared -p shared -n 1 --gpus geforce_rtx_2080_ti:1 --pty /bin/bash
# request a v100 gpu
srun -A shared -p shared -n 1 --gpus v100:1 --pty /bin/bash
# sinfo -o "%12P %5D %14F %7z %7m %10d %11l %42G %38N %f"
PARTITION NODES NODES(A/I/O/T) S:C:T MEMORY TMP_DISK TIMELIMIT GRES NODELIST AVAIL_FEATURES
shared* 1 0/1/0/1 2:8:2 191567 0 7-00:00:00 gpu:v100:4 nu-gpu02 CPU_GEN:SKX,CPU_SKU:4110,CPU_FRQ:2.10GHz,GPU_GEN:VLT,GPU_SKU:V100,GPU_MEM:32GB,GPU_CC:7.0
shared* 8 0/1/7/8 2:12:2 257336 0 7-00:00:00 gpu:geforce_gtx_1080_ti:10 cryoem-gpu[02-09] CPU_GEN:HSW,CPU_SKU:E5-2670v3,CPU_FRQ:2.30GHz,GPU_GEN:PSC,GPU_SKU:GTX1080TI,GPU_MEM:11GB,GPU_CC:6.1
shared* 14 0/0/14/14 2:12:2 191552 0 7-00:00:00 gpu:geforce_rtx_2080_ti:10 cryoem-gpu[11-15],ml-gpu[02-10] CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
shared* 1 0/1/0/1 2:12:2 257336 0 7-00:00:00 gpu:geforce_gtx_1080_ti:10(S:0) cryoem-gpu01 CPU_GEN:HSW,CPU_SKU:E5-2670v3,CPU_FRQ:2.30GHz,GPU_GEN:PSC,GPU_SKU:GTX1080TI,GPU_MEM:11GB,GPU_CC:6.1
shared* 3 0/3/0/3 2:12:2 191552 0 7-00:00:00 gpu:geforce_rtx_2080_ti:10(S:0) cryoem-gpu10,ml-gpu[01,11] CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
shared* 3 0/3/0/3 2:8:2 191567 0 7-00:00:00 gpu:v100:4(S:0-1) cryoem-gpu50,nu-gpu[01,03] CPU_GEN:SKX,CPU_SKU:4110,CPU_FRQ:2.10GHz,GPU_GEN:VLT,GPU_SKU:V100,GPU_MEM:32GB,GPU_CC:7.0
shared* 1 0/1/0/1 2:12:2 257330 0 7-00:00:00 gpu:geforce_gtx_1080_ti:8(S:0),gpu:titan_x hep-gpu01 CPU_GEN:HSW,CPU_SKU:E5-2670v3,CPU_FRQ:2.30GHz,GPU_GEN:PSC,GPU_SKU:GTX1080TI,GPU_MEM:11GB,GPU_CC:6.1
ml 9 0/0/9/9 2:12:2 191552 0 infinite gpu:geforce_rtx_2080_ti:10 ml-gpu[02-10] CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
ml 2 0/2/0/2 2:12:2 191552 0 infinite gpu:geforce_rtx_2080_ti:10(S:0) ml-gpu[01,11] CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
neutrino 1 0/1/0/1 2:8:2 191567 0 infinite gpu:v100:4 nu-gpu02 CPU_GEN:SKX,CPU_SKU:4110,CPU_FRQ:2.10GHz,GPU_GEN:VLT,GPU_SKU:V100,GPU_MEM:32GB,GPU_CC:7.0
neutrino 2 0/2/0/2 2:8:2 191567 0 infinite gpu:v100:4(S:0-1) nu-gpu[01,03] CPU_GEN:SKX,CPU_SKU:4110,CPU_FRQ:2.10GHz,GPU_GEN:VLT,GPU_SKU:V100,GPU_MEM:32GB,GPU_CC:7.0
cryoem 8 0/1/7/8 2:12:2 257336 0 infinite gpu:geforce_gtx_1080_ti:10 cryoem-gpu[02-09] CPU_GEN:HSW,CPU_SKU:E5-2670v3,CPU_FRQ:2.30GHz,GPU_GEN:PSC,GPU_SKU:GTX1080TI,GPU_MEM:11GB,GPU_CC:6.1
cryoem 5 0/0/5/5 2:12:2 191552 0 infinite gpu:geforce_rtx_2080_ti:10 cryoem-gpu[11-15] CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
cryoem 1 0/1/0/1 2:12:2 257336 0 infinite gpu:geforce_gtx_1080_ti:10(S:0) cryoem-gpu01 CPU_GEN:HSW,CPU_SKU:E5-2670v3,CPU_FRQ:2.30GHz,GPU_GEN:PSC,GPU_SKU:GTX1080TI,GPU_MEM:11GB,GPU_CC:6.1
cryoem 1 0/1/0/1 2:12:2 191552 0 infinite gpu:geforce_rtx_2080_ti:10(S:0) cryoem-gpu10 CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
cryoem 1 0/1/0/1 2:8:2 191567 0 infinite gpu:v100:4(S:0-1)
?> TBA... something about using Constraints. Maybe get the gres for gpu memory working.
This is often due to limited resources. The simplest fix is to request fewer CPUs (`--cpus-per-task`) or less memory (`--mem`) for your Job. However, this will also likely increase the amount of time that your Job needs to complete. Note that perfect scaling is often very difficult (ie using 16 CPUs will not run twice as fast as 8 CPUs, nor will using 4 nodes via MPI run twice as fast as 2 nodes), so it may be beneficial to submit many smaller Jobs if your code allows it. You can also set the `--time` option to specify that your job will only run up to that amount of time, so that the scheduler can better fit your job in.
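It can also help to check why slurm thinks your job is still pending; for pending jobs the last column below shows the reason (eg Resources when no nodes are free, or Priority when other jobs are ahead of you). A minimal sketch:

# show the state and pending reason for your jobs
squeue -u $USER -o "%.18i %.9P %.8T %.10M %.20R"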
The more expensive option is to buy more hardware for the SDF and have it added to your group/team's Partition. Please contact your Coordinator, or contact us, to discuss.
You can also make use of the Scavenger QoS such that your job may run on any available resources at SLAC. This, however, has the disadvantage that should the owners of the hardware your job runs on require their resources, your job may be terminated (preempted) - possibly before it has completed.
A Quality of Service (QoS) for a job defines restrictions on how the job is run. In relation to an Allocation, a user's job may preempt, or be preempted by, other jobs with a 'higher' QoS. We define 2 levels of QoS:
scavenger: Everyone has access to all resources; however, the job is run with the lowest priority and will be terminated if another job with a higher priority needs the resources.
normal: Standard QoS for owners of hardware; jobs will (attempt to) run until completion and will not be preempted. normal jobs will therefore preempt scavenger jobs.
Scavenger QoS is useful if you have jobs that can be resumed (checkpointed) and if there are resources available (ie the owners are not using all of their resources).
You may submit to multiple Partitions with the same QoS level:
#!/bin/bash
#SBATCH --account=cryoem
#SBATCH --partition=cryoem,shared
#SBATCH --qos=scavenger
In the above example, a cryoem user is charging against their cryoem Account and is willing to run the job wherever resources are available (the use of the cryoem Partition is somewhat moot, as the cryoem nodes are a subset of the shared Partition anyway).
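The same submission can also be made entirely from the command line, which can be convenient for quick tests (a sketch, assuming your batch script is called script.sh):

# equivalent command-line submission into multiple partitions with the scavenger QoS
sbatch --account=cryoem --partition=cryoem,shared --qos=scavenger script.sh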
?> is it possible to define multiple? ie cryoem with normal + shared with scavenger?
You can use slurm Constraints. We tag every server with Features that help identify its specific hardware: whether that is the kind of CPU, or the kind of GPU installed. You can view a server's specific Features using
$ scontrol show node ml-gpu01
NodeName=ml-gpu01 Arch=x86_64 CoresPerSocket=12
CPUAlloc=0 CPUTot=48 CPULoad=1.41
AvailableFeatures=CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
ActiveFeatures=CPU_GEN:SKX,CPU_SKU:5118,CPU_FRQ:2.30GHz,GPU_GEN:TUR,GPU_SKU:RTX2080TI,GPU_MEM:11GB,GPU_CC:7.5
Gres=gpu:geforce_rtx_2080_ti:10(S:0)
NodeAddr=ml-gpu01 NodeHostName=ml-gpu01 Version=19.05.2
OS=Linux 3.10.0-1062.4.1.el7.x86_64 #1 SMP Fri Oct 18 17:15:30 UTC 2019
RealMemory=191552 AllocMem=0 FreeMem=182473 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=gpu
BootTime=2019-11-12T11:18:04 SlurmdStartTime=2019-12-06T16:42:16
CfgTRES=cpu=48,mem=191552M,billing=48,gres/gpu=10
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
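You can then pass one or more of these Feature tags to the `--constraint` option so that your job only runs on matching nodes. A sketch using the Features shown above (the exact tags available will vary across nodes):

# request a node whose GPUs are Turing-generation RTX 2080 Ti cards
srun --partition shared --constraint "GPU_SKU:RTX2080TI" --gpus 1 --pty /bin/bash

# constraints can be combined, eg a Skylake CPU AND 11GB of GPU memory
srun --partition shared --constraint "CPU_GEN:SKX&GPU_MEM:11GB" --gpus 1 --pty /bin/bash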
We are openly investigating additional Features to add. Comments and suggestions welcome.
Documentation PENDING.
Possibly add: GPU_DRV, OS_VER, OS_TYPE