This is a global Snakemake profile for CeMM's SLURM HPC, using the slurm-executor-plugin. Global Snakemake profiles are the interface between Snakemake workflows and the workload manager of your cluster (here: SLURM).
- Install the slurm-executor-plugin (tested with v0.10.0) inside your existing Snakemake conda environment.

  ```bash
  conda activate snakemake
  conda install snakemake-executor-plugin-slurm
  ```
- Clone this GitHub repository.

  ```bash
  git clone https://github.com/epigen/cemm.slurm.sm.git
  ```
- Adapt the entries in `config.v8+.yaml` to your setup (e.g., set your `slurm_account`).
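  For illustration, the relevant entries could look like the following (a minimal sketch with placeholder values; the key names follow the slurm executor plugin's profile conventions, but check the actual `config.v8+.yaml` in this repository):

  ```yaml
  # Illustrative excerpt - replace the placeholder values with your own.
  executor: slurm
  default-resources:
    slurm_account: "<your_group>"    # your SLURM account
    slurm_partition: "<partition>"   # default partition for your jobs
  ```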
- There are two options to let Snakemake know about the profile:
  - (Recommended) Set an environment variable once, e.g., by adding it to `~/.bashrc`:

    ```bash
    export SNAKEMAKE_PROFILE=<path/to/this/repo/cemm.slurm.sm>
    ```
  - Provide it as a command line argument with every Snakemake call:

    ```bash
    snakemake --profile <path/to/this/repo/cemm.slurm.sm>
    ```
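To check that the profile is picked up, a dry run from a workflow directory is a quick test (Snakemake should report which profile it is using):

```bash
snakemake -n   # dry run; no jobs are submitted
```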
There are three different flavors of running Snakemake with this profile, depending on personal preference and on whether the workflow submits many jobs (hundreds) with "small" ones at the beginning.
- Interactive
  - Start Snakemake on your head/login node (recommended by the Snakemake developers).
  - This is not allowed on CeMM's HPC.
- Conductor job (recommended)
  - One job (on `longq`) to rule them all: use an sbatch job script to call and manage (conduct) the execution of all workflow jobs.
  - Set `immediate-submit: false` in `config.v8+.yaml`.
  - For details, see the next section and `snakejob_conductor.sh`.
  - Advantages
    - Snakemake orchestrates the job submission.
    - One place to check progress, log errors/failed jobs, and document performance (e.g., duration).
  - Disadvantages
    - If the conductor job is canceled, the workflow directory might be "locked" → use `snakemake --unlock`.
    - Incomplete files (i.e., files whose creation started but never finished) might persist → delete the contents of this folder: `rm -rf <path/to/workflow>/.snakemake/incomplete/*`
- immediate-submit (not recommended)
  - All jobs are submitted at once with their respective dependencies; if one job fails, all jobs depending on it are cancelled automatically.
  - Set `immediate-submit: true` in `config.v8+.yaml`.
  - Add `--kill-on-invalid-dep=yes` to `slurm_extra` in `config.v8+.yaml` (see the config sketch after this list).
  - Advantages
    - Everything is submitted at once with SLURM dependencies.
    - Maximum parallelization is achieved.
  - Disadvantages
    - An error is thrown if a job with a dependency gets submitted but the dependency has already finished. This can happen if early jobs are very small.
    - To find failed jobs, one has to investigate many `.log` files and/or look at the remaining/unfinished jobs in a new Snakemake DAG.
    - If you submit a lot of jobs (e.g., >500), it might take some time (ca. 1 s/job) until all jobs are submitted.
    - Open question: the behaviour of the `--retries` flag is unknown. If someone finds out, please let me know.
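For illustration, the immediate-submit settings could look like the following in `config.v8+.yaml` (a minimal sketch; the exact placement and quoting of `slurm_extra` should be checked against this repository's config and the slurm executor plugin's documentation):

```yaml
# Illustrative excerpt for the immediate-submit flavor.
immediate-submit: true
default-resources:
  slurm_extra: "'--kill-on-invalid-dep=yes'"  # terminate jobs whose dependencies can never be satisfied
```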
If you want to use a conductor job for the submission and execution of your workflow, follow these steps:
- Copy `snakejob_conductor.sh` to the workflow/project root directory.
- Go through every line and adapt it to your setup (e.g., set paths to the log folder and use absolute paths).
- Use `sbatch snakejob_conductor.sh` to submit the conductor job.
- Watch the queue and/or check the .out/.err files for progress.
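For orientation, a conductor script could look roughly like the following (a minimal sketch with placeholder paths, walltime, and environment name; `longq` is taken from the flavor description above, and the `snakejob_conductor.sh` shipped in this repository is the authoritative version):

```bash
#!/bin/bash
#SBATCH --job-name=snakejob_conductor
#SBATCH --partition=longq                           # long queue, so the conductor outlives all workflow jobs
#SBATCH --time=3-00:00:00                           # placeholder walltime
#SBATCH --mem=4G                                    # the conductor itself needs little memory
#SBATCH --output=/absolute/path/to/logs/%x_%j.out   # placeholder log paths
#SBATCH --error=/absolute/path/to/logs/%x_%j.err

# Activate the conda environment containing Snakemake and the SLURM executor plugin
# (environment name is a placeholder; adapt to how conda is set up on your system).
eval "$(conda shell.bash hook)"
conda activate snakemake

# Point Snakemake at this profile and run the workflow from its root directory.
export SNAKEMAKE_PROFILE=/absolute/path/to/cemm.slurm.sm
cd /absolute/path/to/workflow
snakemake --rerun-incomplete
```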
The SLURM job's comment contains the rule and wildcard information and can be accessed using `%.50k` in `squeue`:

```bash
squeue -u $USER -o %i,%P,%.50j,%.50k
```
You can also create an alias for checking on all your jobs in a structured manner (replace `sreichl` with your username):

```bash
alias mq="squeue -u sreichl -o '%.22i %.9P %.50j %.50k %.8u %.2t %.10M %.4C %.9m %.6D %R'"
```
Snakemake will print the following upon each SLURM job submission, pointing to the respective SLURM job log file containing the standard output:

```
Job {XX} has been submitted with SLURM jobid {jobid} (log: /path/to/workflow/.snakemake/slurm_logs/rule_{rulename}/{wildcards}/{jobid}.log).
```
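To follow a specific job while it runs, you can tail that log file (the path pattern below mirrors the message above; fill in the actual rule name, wildcards, and job ID):

```bash
tail -f .snakemake/slurm_logs/rule_{rulename}/{wildcards}/{jobid}.log
```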
For a similar profile working for the MedUni HPC cluster, refer to https://github.com/moritzschaefer/muwhpc_slurm