6. Test datasets

Sumatran rhinoceros test dataset

A test dataset composed of Sumatran rhinoceros (Dicerorhinus sumatrensis) whole-genome re-sequencing data from historical and modern samples, several reference FASTA files, a GTF file with annotations, outgroup genomes and a dated phylogenetic tree for GERP is available from the Scilifelab Data Repository (DOI: 10.17044/scilifelab.19248172.v2). The included data corresponds to scaffold ‘Sc9M7eS_2_HRSCAF_41’ from the Sumatran rhinoceros genome (GenBank accession number GCA_014189135.1) and can be used to get familiar with running GenErode on a high-performance compute cluster.

Workflow descriptions and all scripts used to generate the test dataset are provided in this GitHub repository (docs/extras/test_dataset_generation).

Note that the phylogenetic tree provided with the test dataset that was used in Kutschera et al. 2022 is scaled to billions of years whereas the recommended scale is millions of years.
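To convert such a tree to millions of years, one option is to multiply all branch lengths by 1000. A minimal sketch using a perl one-liner (file names are placeholders; it assumes plain decimal branch lengths in the Newick file, so check the output tree before use):

# Multiply all Newick branch lengths by 1000 (billions of years -> millions of years)
perl -pe 's/:(\d+\.?\d*)/":".($1*1000)/ge' tree_billions.nwk > tree_millions.nwk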

Detailed description of how the pipeline was run with the test dataset in Kutschera et al. 2022 (GenErode version 0.4.1)

The following steps were performed to analyze the test dataset with GenErode on an HPC cluster running Slurm, with both Conda and Singularity installed on the system. The pipeline was run three times with different settings, as described below.

Preparations for the first pipeline run

  1. Cloned the GitHub repository to a directory on the HPC cluster with git clone https://github.com/NBISweden/GenErode.git and moved into the directory GenErode/
  2. Generated a conda environment with Snakemake and other required libraries from the environment.yml file with conda env create -n generode -f environment.yml (only if the conda environment had not been created yet)
  3. Created a slurm profile for the pipeline run:

a) Installed cookiecutter with conda install -c conda-forge cookiecutter (only if cookiecutter was not installed yet)
b) Downloaded the Snakemake slurm profile with cookiecutter https://github.com/Snakemake-Profiles/slurm.git, answering the prompts with profile_name: slurm, option 2 for cluster_sidecar_help (no sidecar), and cluster_config: ../config/cluster.yaml
c) Edited the Snakemake slurm profile as follows: moved into the folder slurm, opened the file config.yaml, and deleted the line cluster-cancel: "scancel" (the full command sequence for steps 1 to 3 is sketched below)
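Taken together, steps 1 to 3 correspond roughly to the following commands (the conda and cookiecutter prompts are interactive, with the cookiecutter answers given above; the sed line is just one way to delete the cluster-cancel entry):

# Clone the repository and enter it
git clone https://github.com/NBISweden/GenErode.git
cd GenErode/
# Create the conda environment (skip if it already exists)
conda env create -n generode -f environment.yml
# Install cookiecutter and download the Snakemake slurm profile
conda install -c conda-forge cookiecutter
cookiecutter https://github.com/Snakemake-Profiles/slurm.git
# Remove the cluster-cancel line from the generated profile
sed -i '/cluster-cancel: "scancel"/d' slurm/config.yaml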

  4. Placed the test dataset into a dedicated directory on the HPC cluster (available for download from the Scilifelab Data Repository; DOI: 10.17044/scilifelab.19248172)
  5. Edited the metadata files for three historical and three modern Sumatran rhinoceros samples (rhino_3_historical_samples.txt and rhino_3_modern_samples.txt; available for download from the Scilifelab Data Repository along with the test dataset) as follows

a) Placed the metadata files for the test dataset into the subdirectory of the pipeline GenErode/config/
b) Updated the paths to the FASTQ files (columns path_to_R1_fastq_file and path_to_R2_fastq_file) to point to the corresponding FASTQ files from the test dataset on the HPC cluster (one way to script this is sketched below)
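One way to script the path update is a search-and-replace over both metadata files; a sketch, where both path prefixes are hypothetical examples:

# Replace the placeholder path prefix with the location of the test data on your cluster
sed -i 's|/old/path/to/fastq|/proj/mydata/test_dataset/fastq|g' \
    config/rhino_3_historical_samples.txt config/rhino_3_modern_samples.txt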

  6. Updated the configuration files with the correct paths on the HPC cluster to the test dataset (three configuration files are available for download from the Scilifelab Data Repository)

a) config_mitogenomes.yaml: Updated the path to the Sumatran rhinoceros mitochondrial genome FASTA file in line 97 (species_mt_path) with the correct path to the corresponding FASTA file from the test dataset
b) config_sum_rhino.yaml: Updated the path to the Sumatran rhinoceros reference FASTA file in line 21 (ref_path) with the correct full path to the corresponding FASTA file from the test dataset
c) config_white_rhino.yaml: Updated the path to the White rhinoceros reference FASTA file in line 21 (ref_path), the path to the White rhinoceros annotation in GTF format in line 455 (gtf_path), the paths to the GERP outgroup FASTA files in line 492 (gerp_ref_path), and the path to the phylogenetic tree in line 501 (tree) with the correct paths to the corresponding files from the test dataset (one way to script such edits is sketched below)
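The keys named above can be located and edited by hand, or scripted; a sketch, assuming the keys sit at the start of a line in the YAML files and using a hypothetical path:

# Locate the keys to edit in the three configuration files
grep -n -E '^(species_mt_path|ref_path|gtf_path|gerp_ref_path|tree):' config/config_*.yaml
# Example edit (the path is a placeholder for the real location on your cluster)
sed -i 's|^ref_path:.*|ref_path: /proj/mydata/test_dataset/sum_rhino_scaffold.fasta|' config/config_sum_rhino.yaml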

Started a GenErode pipeline run

  7. Chose one of the following configuration files for the pipeline run and created a copy of it as config/config.yaml

a) config_mitogenomes.yaml: Maps reads from three historical and three modern Sumatran rhinoceros samples to the Sumatran rhinoceros mitochondrial genome and the mitochondrial genomes from several other species to identify samples with elevated mapping success to another species. Run cp config/config_mitogenomes.yaml config/config.yaml
b) config_sum_rhino.yaml: Maps reads from three historical and three modern Sumatran rhinoceros samples to a Sumatran rhinoceros reference. After data processing, runs mlRho, PCA, and ROH. Run cp config/config_sum_rhino.yaml config/config.yaml
c) config_white_rhino.yaml: Maps reads from three historical and three modern Sumatran rhinoceros samples to a White rhinoceros reference. After data processing, runs PCA, snpEff, and GERP. Run cp config/config_white_rhino.yaml config/config.yaml

  8. Opened a tmux session to be able to run GenErode in the background with tmux new-session -s generode
  9. Activated the conda environment with conda activate generode
  10. Started a dry run to check that the pipeline would work as expected with snakemake -j 100 --use-singularity --profile slurm -npr &> YYMMDD_dry_run.out (replace YYMMDD with the current date) and checked the file YYMMDD_dry_run.out for any errors
  11. Started the main run of the pipeline with snakemake -j 100 --use-singularity --profile slurm &> YYMMDD_main_run.out and checked the file YYMMDD_main_run.out regularly during the run
  12. After the pipeline run finished successfully, repeated steps 7 to 11 for the remaining configuration files
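For reference, steps 8 to 11 correspond to the following shell session (YYMMDD is a date placeholder, as above; the snakemake commands are run inside the tmux session):

tmux new-session -s generode
conda activate generode
# Dry run first; inspect the log for errors before starting the main run
snakemake -j 100 --use-singularity --profile slurm -npr &> YYMMDD_dry_run.out
# Main run; check the log regularly while the pipeline is running
snakemake -j 100 --use-singularity --profile slurm &> YYMMDD_main_run.out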

Minimal test dataset

A minimal test dataset has been compiled to set up automatic testing of the pipeline on GitHub and is located in .test/data. This data can be used to try out the pipeline with limited computational resources, e.g. on a local computer. Note that the results are not biologically meaningful because of the small size of the minimal test dataset.

Description of the minimal test dataset

Test data reference genome

  • .test/data/references/sumatran_rhino.fasta is a reference FASTA file with a short scaffold from the Sumatran rhinoceros reference genome.
  • .test/data/references/sumatran_rhino.gtf is a GTF file with gene predictions for the short scaffold from the Sumatran rhinoceros reference genome.
  • .test/data/references/rhino_NC_012684.fasta is a FASTA file containing the mitochondrial genome of the Sumatran rhinoceros, used to test mapping of reads from historical samples to different mitochondrial genomes.

Test data for whole-genome resequencing data processing and analyses

  • .test/data/shortread_data/historical/ contains FASTQ files with Illumina paired-end reads of historical Sumatran rhinoceros samples.
  • .test/data/shortread_data/modern/ contains FASTQ files with Illumina paired-end reads of modern Sumatran rhinoceros samples.
  • .test/config/historical_paths.txt is the metadata file for the historical samples and .test/config/modern_paths.txt is the metadata file for the modern samples.

The FASTQ files were generated as follows: reads from whole-genome re-sequencing of the historical and modern samples were mapped to the full Sumatran rhinoceros reference genome with bwa mem and default settings. The test scaffold was extracted from the BAM files and converted to FASTQ format (containing only mapped paired-end reads). The same approach was applied to identify mitochondrial reads for the historical samples, which were then downsampled to 500 paired-end reads per sample. The mitochondrial reads were finally merged with the reads that had mapped to the test scaffold to create two FASTQ files per sample (forward and reverse reads).
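In outline, this extraction could be scripted as follows; a sketch assuming bwa, samtools, and seqtk, with placeholder file and scaffold names (the actual scripts used to generate the test data are in docs/extras/test_dataset_generation):

# Index the reference and map one sample
bwa index reference.fasta
bwa mem reference.fasta sample_R1.fastq.gz sample_R2.fastq.gz | samtools sort -o sample.bam -
samtools index sample.bam
# Extract properly paired reads mapping to the test scaffold and convert to FASTQ
samtools view -b -f 2 sample.bam test_scaffold | samtools sort -n - | \
    samtools fastq -1 scaffold_R1.fastq -2 scaffold_R2.fastq -
# Downsample mitochondrial read pairs to 500 per sample (seed 42 is arbitrary)
seqtk sample -s42 mito_R1.fastq 500 > mito_R1_sub.fastq
seqtk sample -s42 mito_R2.fastq 500 > mito_R2_sub.fastq
# Merge scaffold and mitochondrial reads into the final pair of FASTQ files
cat scaffold_R1.fastq mito_R1_sub.fastq > sample_final_R1.fastq
cat scaffold_R2.fastq mito_R2_sub.fastq > sample_final_R2.fastq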

Test data for GERP analyses

  • .test/data/gerp_data/ contains gzipped FASTA files for outgroup species and a time-calibrated tree (generated on www.timetree.org) that can be used as input data for a GERP++ test run.

The FASTA files for the outgroup species were generated as follows: CDS were extracted from the Sumatran rhinoceros test scaffold and searched against each outgroup species' full genome with tblastx. The top BLAST hits per CDS were then searched back against the Sumatran rhinoceros CDS with blastn. All scaffolds with reciprocal BLAST hits were extracted per outgroup species.
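A rough sketch of that reciprocal search with the BLAST+ command-line tools (all file names are placeholders, and -max_target_seqs 1 is just one way to approximate "top hit per CDS"):

# Build a nucleotide BLAST database for one outgroup genome
makeblastdb -in outgroup_genome.fasta -dbtype nucl
# Forward search: rhino CDS against the outgroup genome (translated-translated)
tblastx -query rhino_cds.fasta -db outgroup_genome.fasta -outfmt 6 -max_target_seqs 1 > forward_hits.tsv
# Collect the hit scaffolds and search them back against the rhino CDS
cut -f2 forward_hits.tsv | sort -u > hit_scaffolds.txt
seqtk subseq outgroup_genome.fasta hit_scaffolds.txt > hit_scaffolds.fasta
blastn -query hit_scaffolds.fasta -subject rhino_cds.fasta -outfmt 6 > reciprocal_hits.tsv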

Set up to run the pipeline with the minimal test dataset on a MacBook Pro in a virtual machine

The following should also work on a Windows 10 PC with the WSL2 subsystem but hasn't been tested.

Install all dependencies

brew install virtualbox

brew install vagrant

brew install vagrant-manager

Create a new directory and clone the git repository

mkdir vm-singularity

cd vm-singularity

git clone https://github.com/NBISweden/GenErode.git

Create and bring up the virtual machine

The first time you create and bring up the virtual machine, follow these instructions:

export VM=sylabs/singularity-3.7-ubuntu-bionic64 && vagrant init $VM && vagrant up && vagrant ssh

The GenErode folder was not present in /home/vagrant the first time I entered the virtual machine. If this is the case for you, try to follow these steps:

  1. Exit the virtual machine with CTRL + d and edit the Vagrantfile, uncommenting the synced-folder line shown below. Make sure you replace /fullpath/to/ with the full path on your Mac:

config.vm.synced_folder "/fullpath/to/vm-singularity/GenErode/", "/home/vagrant/vagrant_GenErode", disabled: false

  2. Reload the virtual machine (on the host, from within the directory vm-singularity)

vagrant reload

  3. Bring up the virtual machine and check that the folder is shared (you should find a folder named vagrant_GenErode)

vagrant up && vagrant ssh

If this was successful, continue with these steps:

Install Miniconda & mamba

The first time the virtual machine is run, you need to install Miniconda and mamba.

wget https://repo.continuum.io/miniconda/Miniconda3-4.7.12.1-Linux-x86_64.sh

bash Miniconda3-4.7.12.1-Linux-x86_64.sh

rm Miniconda3-4.7.12.1-Linux-x86_64.sh

. ~/.bashrc

conda install -c conda-forge mamba

Install the conda environment

The pipeline's conda environment needs to be created the first time you want to run the pipeline:

cd vagrant_GenErode

mamba env create -f environment.yml -n generode

conda activate generode

Start the pipeline

You can test the pipeline with one of the existing configuration files from the .test/config folder, for example to run the mapping of historical samples to mitochondrial genomes. Edit the config/config.yaml file accordingly (or replace it with the file .test/config/config_mitogenomes.yaml by running cp .test/config/config_mitogenomes.yaml config/config.yaml).

snakemake --cores 1 --use-singularity -npr &> dry_run_test_mitos.out

snakemake --cores 1 --use-singularity &> main_run_test_mitos.out

  • All results should be stored on your host OS, in the folder vm-singularity/GenErode.
  • After a successful main run, test other configurations, e.g. using the other config files in .test/config
  • You can exit the virtual machine by typing CTRL + d

Bring up the virtual machine again

Whenever you want to bring up the virtual machine again to run the pipeline, move into the folder vm-singularity and type:

vagrant up && vagrant ssh

In the virtual machine, you need to activate the conda environment:

conda activate generode

Now you can run the pipeline as described above.