Verena Kutschera edited this page Nov 26, 2024 · 12 revisions

Where can I learn more about Snakemake?

Here you can find the official tutorial: https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html

This is the link to the NBIS reproducible research workshop, with a Snakemake tutorial: https://nbis-reproducible-research.readthedocs.io/en/latest/snakemake/

Where can I learn more about tmux?

The pipeline runs for a very long time, so you should send the Snakemake process to the background so that you don't have to keep the terminal open until it is finished. To do that, you can use a terminal multiplexer such as tmux or screen. Here is a link to a crash course in tmux: https://robots.thoughtbot.com/a-tmux-crash-course

To run GenErode, you actually only need the very basic commands:

  • check which sessions are currently running: $ tmux ls
  • start a new session with the name "mysession": $ tmux new -s mysession
  • detach from a running session: type CTRL + b, d
  • re-attach to the session: $ tmux a -t mysession
  • kill a session: $ tmux kill-session -t mysession

What do I do if my GenErode run failed because a specific rule/job failed?

If the pipeline run failed, you will get an error message in the log file (i.e. the standard output that we redirect to a file) that reads like this: "Error in rule XYZ". Each rule writes its own log file that you can find in the directory results/logs and its subdirectories. If you are using a system with the slurm workload manager, you will also find, a few lines below the error, the slurm job ID of the job that was run on the cluster and the path to the corresponding slurm output file (e.g. "slurm-1234567.out"), where you will hopefully find another error message explaining why the job failed.
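As a sketch of the search pattern, the log excerpt below is invented for illustration; real rule names, job IDs and log paths will differ:

```shell
# write an invented log excerpt standing in for the redirected pipeline output
cat > pipeline.log <<'EOF'
Error in rule repeat_masking:
    jobid: 42
    log: results/logs/repeat_masking.log
Submitted batch job 1234567
EOF

# locate the failing rule and the slurm job ID in the main log
grep -A3 'Error in rule' pipeline.log

# then inspect the rule's own log file and the slurm output file, e.g.:
# less results/logs/repeat_masking.log
# less slurm-1234567.out
```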

I changed a metadata table and now GenErode attempts to rerun everything from the start. How do I rerun GenErode only for the new samples, or only for the set of samples remaining in the metadata table and for the remaining rules?

Snakemake changed its rerun behaviour in version 7.8 (see https://github.com/snakemake/snakemake/issues/1694): when the set of input files changes (e.g. after editing a metadata table), Snakemake now reruns everything from the beginning, stating "Set of input files has changed since last execution". The same applies to local changes in the code or in other parameters. To get around this, add --rerun-triggers mtime to the Snakemake command when starting the pipeline from the command line, so that only file modification times trigger reruns.
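A command-line sketch (the number of cores and the config file path are assumptions; add the flag to whatever command you normally use to start the pipeline):

```shell
# rerun only jobs whose input files are newer than their outputs
snakemake --cores 1 \
    --configfile config/config.yaml \
    --rerun-triggers mtime
```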

How do I change parameters for slurm jobs?

Open the file slurm/config.yaml and find the rule (or group job) that needs to be adjusted:

  • Adjust the number of cores under set-threads, and under set-resources via cpus_per_task
  • Adjust the memory under set-resources via mem_mb
  • Adjust the run time under set-resources via runtime

Please note that in several cases, rules were grouped together to be run as one job on the cluster. In that case, you need to adjust the parameters for the entire group ID in the file slurm/config.yaml.

Some rules (in the .smk files within the workflow/rules/ directory) have a default number of threads specified under threads that corresponds to the number under set-threads in the slurm/config.yaml file. If you change this number in the slurm/config.yaml file, the number of threads should be adjusted automatically.
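A hypothetical excerpt of slurm/config.yaml (the rule name and all values are invented for illustration; the exact syntax may differ between Snakemake versions, so mirror the entries already present in your copy of the file):

```yaml
# invented example entry; look up the real rule/group names in slurm/config.yaml
set-threads:
  - repeat_masking=16                 # number of cores for the rule
set-resources:
  - repeat_masking:cpus_per_task=16   # should match set-threads
  - repeat_masking:mem_mb=64000       # memory in MB
  - repeat_masking:runtime=7200       # run time in minutes
```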

Why did I get the warning "Multiple include of [...] ignored"?

Each step of GenErode depends on most previous steps (except the mitogenome mapping, which depends on the fastq file processing but is not automatically loaded for subsequent steps). The pipeline is written so that all required steps are automatically included when you set a step to True in the config file. If you set several steps to True at the same time, Snakemake therefore tries to include the same steps multiple times and prints this warning.

I want to rerun the pipeline with changed parameter settings, but I get the message "nothing to be done". How do I fix that?

GenErode checks the presence of the final output file of each step to decide if it should rerun the analyses. For most steps of the pipeline, these are the output files of the MultiQC analysis. You can find them in the stats directory of the step you were running and its subdirectories. Deleting, moving or renaming these files forces the pipeline to rerun the analyses leading to these files, using the parameters specified in the config file.

For downstream analyses (mlRho, PCA, ROH, snpEff, gerp), delete, move or rename the final output files (tables, figures) to trigger a rerun (see next question).

Alternatively, you can add the flag -R path/to/file.out to the Snakemake command when starting the pipeline, where path/to/file.out is the file you want to re-create.
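As an illustration of the rename approach with a made-up path (in a real run you would target the MultiQC or downstream output files described above):

```shell
# fake final output standing in for e.g. a MultiQC report (path is invented)
mkdir -p results/example/stats
touch results/example/stats/multiqc_report.html

# rename rather than delete, so the old version is kept
mv results/example/stats/multiqc_report.html \
   results/example/stats/multiqc_report.html.bak

# the output file is now missing, so the next pipeline run re-creates it
ls results/example/stats/
```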

Is it possible to rerun the pipeline with a different optional filtering step in the same directory, or will it overwrite everything? For example, would the output from a run without subsampling be overwritten when rerunning the pipeline with subsampling?

Subsampled BAM files, the resulting mlRho output and the per-individual BCF files have different file names than the corresponding files without subsampling, so a rerun would not overwrite the non-subsampled files. The merged BCF files and all downstream files, however, have the same file names regardless of the filtering options, so they would be overwritten. If it is important to keep both versions, please rename the files that should be protected from overwriting before rerunning the pipeline (or move them to a new directory).

I want to keep intermediate files that would be otherwise automatically deleted by Snakemake (marked as "temporary"). How do I do that?

This is only recommended when you have double-checked that you have enough storage space to keep the intermediate files, as GenErode creates a very large number of (large) files. Also, please remove them as soon as you don't need them anymore. If you are sure you want to prevent intermediate files from being deleted, run the pipeline from the command line with the additional flag --notemp.

When you later want to remove all temporary files at once, start a run with the additional flag --delete-temp-output. It is recommended to do a dry run first to see which files will be deleted.
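Put together as a command-line sketch (cores and config file path are assumptions; adapt to your usual invocation):

```shell
# keep files marked as temporary during the run
snakemake --cores 1 --configfile config/config.yaml --notemp

# later: dry run first (-n) to preview which temporary files would be removed
snakemake --cores 1 --configfile config/config.yaml --delete-temp-output -n

# then delete them for real
snakemake --cores 1 --configfile config/config.yaml --delete-temp-output
```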

Snakemake tells me that the working directory is locked by another Snakemake process. I've tried to run --unlock but the error message remains.

GenErode expects the config file to be config/config.yaml and cannot unlock the working directory if you saved the config under a different name. To unlock the working directory, type snakemake --unlock --cores 1 --configfile config/my_config.yaml (replacing the file name with the one you chose).

The links to MultiQC files in the GenErode report are broken. How do I fix that?

This seems to be a bug related to certain browsers in GenErode versions prior to 0.5.0. When accessing the MultiQC files from a report downloaded to a Mac, the links were broken in Chrome and Firefox but worked in Safari.
