Skip to content

6. Test datasets

Verena Kutschera edited this page Mar 30, 2022 · 8 revisions

Sumatran rhinoceros test dataset

A test dataset composed of Sumatran rhinoceros (Dicerorhinus sumatrensis) whole-genome re-sequencing data from historical and modern samples, several reference FASTA files, a GTF file with annotations, outgroup genomes and a dated phylogenetic tree for GERP is available from the Scilifelab Data Repository (DOI: 10.17044/scilifelab.19248172). The included data corresponds to scaffold ‘Sc9M7eS_2_HRSCAF_41’ from the Sumatran rhinoceros genome (GenBank accession number GCA_014189135.1) and can be used to get familiar with running GenErode on a high-performance compute cluster.

Workflow descriptions and all scripts used to generate the test dataset are provided in this GitHub repository (docs/extras/test_dataset_generation).

Minimal test dataset

A minimal test dataset has been compiled to set up automatic testing of the pipeline on GitHub and is located in .test/data. This data can be used to try out the pipeline with limited computational resources, e.g. on a local computer. Note that the results are not biologically meaningful because of the small size of the minimal test dataset.

Description of the minimal test dataset

Test data reference genome

  • .test/data/references/sumatran_rhino.fasta is a reference fasta file with a short scaffold from the Sumatran rhinoceros reference genome.
  • .test/data/references/sumatran_rhino.gtf is a GTF file with gene predictions for the short scaffold from the Sumatran rhinoceros reference genome.
  • .test/data/references/rhino_NC_012684.fasta is a fasta file containing the mitochondrial genome from Sumatran rhinoceros to test mapping of reads from historical samples to different mitochondrial genomes.

Test data for whole-genome resequencing data processing and analyses

  • .test/data/shortread_data/historical/ contains fastq files with Illumina paired-end reads of historical Sumatran rhinoceros samples.
  • .test/data/shortread_data/modern/ contains fastq files with Illumina paired-end reads of modern Sumatran rhinoceros samples.
  • .test/config/historical_paths.txt is the metadata file for the historical samples and .test/config/modern_paths.txt is the metadata file for the modern samples.

Fastq files were generated as follows: reads from whole genome re-sequencing of the historical and modern samples were mapped to the full Sumatran rhinoceros reference genome with bwa mem and default settings. The test scaffold was extracted from the BAM files and converted to fastq format (containing only mapped paired-end reads). The same approach was applied to identify mitochondrial reads for historical samples, which were then sampled down to 500 paired-end reads per sample. The mitochondrial reads were finally merged with the reads that had mapped to the test scaffold to create two fastq files per sample (forward and reverse reads).

Test data for GERP analyses

  • .test/data/gerp_data/ contains gzipped fasta files for outgroup species and a time-calibrated tree (generated on www.timetree.org) that can be used as input data for a GERP++ test run.

Fasta files for outgroup species were generated as follows: CDS were extracted from the Sumatran rhinoceros test scaffold and blasted to each of the outgroup species' full genomes with tblastx. The top blast hits per CDS were blasted back to the Sumatran rhinoceros CDS with blastn. All scaffolds with reciprocal blast hits were extracted per outgroup species.

Set up to run the pipeline with the minimal test dataset on a MacBook Pro in a virtual machine

The following should also work on a Windows 10 PC with the WSL2 subsystem but hasn't been tested.

Install all dependencies

brew install virtualbox

brew install vagrant

brew install vagrant-manager

Create a new directory and clone the git repository

mkdir vm-singularity

cd vm-singularity

git clone https://github.com/NBISweden/GenErode.git

Create and bring up the virtual machine

The first time you create and bring up the virtual machine, follow these instructions:

export VM=sylabs/singularity-3.7-ubuntu-bionic64 && vagrant init $VM && vagrant up && vagrant ssh

The GenErode folder was not present in /home/vagrant the first time I entered the virtual machine. If this is the case for you, try to follow these steps:

  1. Exit the virtual machine with CTRL + d and edit the Vagrantfile, uncommenting the respective line. Make sure you replace /fullpath/to/ with the full path on your Mac:

config.vm.synced_folder "/fullpath/to/vm-singularity/GenErode/", "/home/vagrant/vagrant_GenErode", disabled: false

  1. Reload the virtual machine (on the host, standing in the directory vm-singularity)

vagrant reload

  1. Bring up the virtual machine & check if the folder is shared (you should find a folder named vagrant_GenErode)

vagrant up && vagrant ssh

If this was successful, continue with these steps:

Install Miniconda & mamba

The first time the virtual machine is run, you need to install Miniconda and mamba.

wget https://repo.continuum.io/miniconda/Miniconda3-4.7.12.1-Linux-x86_64.sh

bash Miniconda3-4.7.12.1-Linux-x86_64.sh

rm Miniconda3-4.7.12.1-Linux-x86_64.sh

. .bashrc

conda install -c conda-forge mamba

Install the conda environment

The pipelines' conda environment needs to be created the first time you want to run the pipeline:

cd vagrant_GenErode

mamba env create -f environment.yml -n generode

conda activate generode

Start the pipeline

You can test the pipeline with one of the existing configuration files from the .test/config folder, for example to run the mapping of historical samples to mitochondrial genomes. Edit the config/config.yaml file accordingly (or replace it with the file .test/config/config_mitogenomes.yaml by running cp .test/config/config_mitogenomes.yaml config/config.yaml).

snakemake --cores 1 --use-singularity -npr &> dry_run_test_mitos.out

snakemake --cores 1 --use-singularity &> main_run_test_mitos.out

  • All results should be stored on your host OS, in the folder vm-singularity/GenErode.
  • After a successful main run, test other configurations, e.g. using the other config files in .test/config
  • You can exit the virtual machine by typing CTRL + d

Bring up the virtual machine again

Whenever you want to bring up the virtual machine again to run the pipeline, move into the folder vm-singularity and type:

vagrant up && vagrant ssh

In the virtual machine, you need to activate the conda environment:

conda activate generode

Now you can run the pipeline as described above.