This repository contains a genome alignment workflow that takes raw FASTQ files, aligns them to a reference genome and outputs analysis-ready BAM files. The workflow is designed for the National Computational Infrastructure's (NCI) Gadi supercomputer and leverages multiple nodes to run all stages of the workflow in parallel, either massively parallel using a scatter-gather approach or parallel by sample. Parallelisation across the cluster is achieved with Open MPI v.4.1.0 (Graham et al. 2005), together with NCI's utility script nci-parallel. The workflow consists of a number of stages and follows the Broad Institute's best practice recommendations. The stages of the pipeline are as follows:
Paired FASTQ files are initially split into smaller files of approximately 500 K read pairs with fastp v.0.20.0 (Chen et al. 2018) for parallel alignment. Reads are then aligned to Hg38 + alt contigs as downloaded by bwakit v. 0.7.17 ‘run-gen-ref’.
Alignment of reads to the reference assembly is then performed with BWA-MEM v.0.7.17 (Li 2013), using the '-M' flag to mark split hits as secondary. Post-processing of the reads is performed with bwa-postalt.js from bwakit v.0.7.17 to improve the mapping quality of alt hits. Scattered BAM files are merged into sample-level BAMs with Sambamba v.0.7.1 (Tarasov et al. 2015).
Duplicate reads are marked with SAMblaster v. 0.1.24 (Faust and Hall 2014), then reads are sorted by genomic coordinate and indexed with SAMtools v. 1.10 (Li et al. 2009). During duplicate read marking, split and discordant reads can be extracted to BAM format for later analysis.
Base quality score recalibration is performed with GATK v.4.1.2.0 (Van der Auwera et al. 2013). GATK SplitIntervals is used to define a number of evenly-sized genomic intervals over which GATK BaseRecalibrator is run for each sample. The per-interval recalibration tables are then merged into one table per sample with GATK GatherBQSRReports. Base-recalibrated BAM files are produced with GATK ApplyBQSR in parallel over each of the contigs in the reference genome assembly, plus the unmapped reads. These are then merged into a final BAM per sample with Sambamba merge or GATK GatherBamFiles.
Clone the Fastq-to-bam repository:
```
module load git
git clone https://github.com/Sydney-Informatics-Hub/Fastq-to-BAM.git
```
NOTE: This pipeline is currently compatible with NCI Gadi only.
The following software is required to run the Fastq-to-BAM pipeline:
- bwa/0.7.17
- bwakit/0.7.11
- fastp/0.20.0
- fastqc/0.11.7
- gatk/4.1.8.1
- java/jdk1.8.0_60
- k8/0.2.5
- nci-parallel/1.0.0
- openmpi/4.1.0
- sambamba/0.7.1
- samblaster/0.1.24
- samtools/1.10
- seqtk/1.3-r113-dirty
This pipeline has specific input and directory structure requirements. The set up instructions below will help you achieve the required structure, which will resemble the following:
```
|-- Reference
|-- Fastq-to-BAM
|-- <cohort>.config
|-- create_project.bash
|-- Fastq
```
We refer to this directory as the base directory. The working directory (where you run scripts) is `Fastq-to-BAM`.
The Fastq-to-BAM pipeline requires users to supply the following inputs (a full description of each is provided below):
- A `<cohort>.config` file listing all samples and required metadata in TSV format
- Short read sequences in FASTQ format (in a `Fastq` directory)
- An indexed reference genome assembly in FASTA format (in the `Reference` directory)
- A set of high-confidence variants in VCF format (if performing BQSR)
The config file must have the suffix '.config' (i.e. cohortname.config) and contain one row per unique sample, matching the format:
#SampleID | LabSampleID | Seq_centre | Library(default=1)
---|---|---|---
SAMPLE1 | SAMPLE1 | AGRF |
SAMPLE2 | SAMPLE2 | KCCG |
- SampleID is the unique identifier enabling one to recognise which FASTQs belong to the same sample, including samples that have been multiplexed (e.g. IDs provided by your sequencing company; this can often be the same as the LabSampleID)
- LabSampleID is the desired name for the output files for that sample, i.e. the in-house ID
  - If you have tumour/matched normal samples and are using the Somatic-ShortV pipeline, normal samples denoted by `<patientID>-N` are considered to match tumour samples denoted by `<patientID>-<tumourID>`
- Seq_centre is the centre where the samples were sequenced (this metadata will be stored in the final BAM files)
- Library (default=1)
Save the config file to the base directory, as in set up.
Create a `Fastq` directory in the base directory with `mkdir -p Fastq` and download your raw FASTQ files into it (see the data transfer section of our Intro to Gadi course for tips and scripts to do this). We also recommend copying the .md5 files that should be provided by your sequencing company and performing checksums to confirm the validity of the files after transfer.
Human genome: hg38 + alternate contigs
Provided with this pipeline is the hg38 (+ alt contigs) reference assembly. The complete set of required hg38 files can be downloaded to the base directory, as in set up, with the following commands:
```
wget https://cloudstor.aarnet.edu.au/plus/s/CHeAuEsBkHalDvI/download -O Reference.tar.gz
wget https://cloudstor.aarnet.edu.au/plus/s/0fN5X8tksH2PGso/download -O Reference.tar.gz.md5

# Perform checksums
md5sum -c Reference.tar.gz.md5
```
Next, unpack the Reference tar file:
```
tar -xvf Reference.tar.gz
```
The `Reference` directory must be located within the base directory to ensure the relative paths used within the scripts are correct.
Non-human reference genomes
If you would like to use this pipeline with a different reference genome, you will need to first copy or download your reference genome into a `Reference` directory in the base directory:

```
mkdir -p Reference # in the base directory
cd Reference
# Copy or download your reference genome into Reference
```

Next, prepare the required index files:

```
# edit ref= to the name of your reference.fasta file
qsub index_reference.pbs
```
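For orientation, the sketch below shows the kind of indexing commands a script like `index_reference.pbs` typically runs. This is an illustrative sketch only: the module versions and FASTA name are assumptions, and the script shipped in this repository is the authoritative version.

```
#!/bin/bash
# Hedged sketch of reference indexing; index_reference.pbs is authoritative.
module load bwa/0.7.17 samtools/1.10 gatk/4.1.8.1

ref=my_species.fasta                     # edit to your reference FASTA (placeholder)

bwa index ${ref}                         # builds the BWA-MEM alignment index
samtools faidx ${ref}                    # builds the .fai FASTA index
gatk CreateSequenceDictionary -R ${ref}  # builds the .dict sequence dictionary
```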
Prepare split intervals for GATK scatter-gather parallelism (this enables multi-node, embarrassingly parallel execution). Split intervals should be prepared both for BQSR and for variant calling. Split intervals for BQSR must be a minimum of 100 Mbp (e.g. for the 3.2 Gbp hg38 + alt assembly, this is 32 intervals). For ApplyBQSR, adjust the number of contigs over which to print recalibrated BAMs. You will also need to source the known variants databases to be used for BQSR and VQSR and edit the scripts that use them accordingly. We have provided a script to prepare reference intervals; use it by editing and running:
```
qsub create_gatk_ref_intervals.pbs
```
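As a rough illustration of what this interval preparation involves, the GATK SplitIntervals call below produces 32 evenly-sized intervals (~100 Mbp each for a ~3.2 Gbp genome). The paths, module version and output directory are assumptions; the PBS script above is the authoritative version.

```
# Hedged sketch only; create_gatk_ref_intervals.pbs is authoritative.
module load gatk/4.1.8.1

# 32 scatter intervals of ~100 Mbp each for a ~3.2 Gbp assembly (BQSR)
gatk SplitIntervals \
    -R Reference/Hg38.fasta \
    --scatter-count 32 \
    -O Reference/BQSR_intervals
```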
Finally, run the `create_project.bash` script from the base directory. This script will edit all pipeline scripts and create the required directory set up. Users will be prompted to provide the following:
- NCI project e.g. ab00
- Path to the directory containing your scripts, i.e. `Fastq-to-BAM` (relative or absolute) e.g. ./Fastq-to-BAM or /scratch/ab00/Fastq-to-BAM
- Basename of the config file e.g. cohortname for cohortname.config
- Full path to the reference directory e.g. /scratch/ab00/Fastq-to-BAM/Reference
- Name of your reference genome sequence (include suffix) e.g. Hg38.fasta
- Required read/write access to any Gadi storage other than the supplied NCI project e.g. ab00
Users will run a series of scripts for each stage of the Fastq-to-BAM pipeline. After completing set up, follow the steps in the subheadings below. In brief, each job consists of a `make_input.sh` script, which creates a list of input files, and a `run_parallel.pbs` script, which submits the corresponding `<task>.sh` in parallel to the PBS job scheduler. Some stages have additional 'checker' scripts that can be run to confirm the integrity of a process' output. Each stage is detailed below.
All scripts must be run from the `Fastq-to-BAM` directory. For how to calculate the resource requirements for each `_run_parallel.pbs` script, please see the Resource usage and Benchmarking metrics sections.
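To illustrate the general pattern, the sketch below shows how a `run_parallel.pbs` script typically distributes tasks across nodes with Open MPI and nci-parallel. The PBS resource requests, project code and file names are placeholders; the scripts shipped with Fastq-to-BAM are the authoritative versions.

```
#!/bin/bash
# Hedged sketch of the generic run_parallel.pbs pattern; placeholders throughout.
#PBS -P ab00
#PBS -q normal
#PBS -l ncpus=48,mem=190GB,walltime=01:00:00,wd
#PBS -l storage=scratch/ab00

module load nci-parallel/1.0.0
module load openmpi/4.1.0

NCPUS=4    # CPUs per task; PBS_NCPUS/NCPUS tasks run concurrently

# Prefix each line of the inputs list with the task script to build a command list
sed "s|^|${PWD}/task.sh |" Inputs/task.inputs > Inputs/task.cmds

mpirun --np $((PBS_NCPUS / NCPUS)) --map-by node:PE=${NCPUS} \
    nci-parallel --input-file Inputs/task.cmds --timeout 4000
```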
IMPORTANT NOTE: Due to variability in FASTQ file naming between sequencing companies, the FastQC and split FASTQ steps require users to check the regular expression (regex) within their respective `make_input.sh` and `task.sh` scripts before running, to ensure their FASTQ files have been correctly captured. Be sure to check the regex and edit where necessary.
This step will produce quality reports for all fastq files in a list. Each fastq file will be run as a separate task and each task is processed in parallel. For an explanation of reports, see the FastQC documentation. We recommend creating aggregate reports of FastQC results for all FASTQ files using MultiQC. See the MultiQC documentation for more information.
Steps for running fastqc scripts are:
1. Run `bash fastqc_make_input.sh /path/to/fastqs`
This creates a list of inputs to be submitted as separate tasks. The script assumes your FASTQ files are stored in a directory called `Fastq` and will output a list of FASTQ files to the `Inputs` directory. From this list, FASTQ files will be ordered by size and run from largest to smallest in the next step to improve overall job efficiency.
2. Run `qsub fastqc_run_parallel.pbs`
This script runs `fastqc.sh` for each FASTQ file as an independent task (a sketch of one task follows these steps).
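For context, a single task reduces to a FastQC call along these lines; this is a hedged sketch in which the output directory and file name are assumptions.

```
# Hedged sketch of one FastQC task; fastqc.sh is authoritative.
module load fastqc/0.11.7
fastqc --outdir ./FastQC sample1_R1.fastq.gz
```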
In this step paired FASTQ files are split into smaller files of approximately 500 K read pairs for parallel alignment to the reference genome assembly.
Steps for running the split_fastq scripts are:
1. Run `bash split_fastq_make_input.sh <config prefix>`
This script will search the `Fastq` directory for all FASTQ pairs belonging to the samples listed in the config file. It will then print a list of unique FASTQ identifiers to the .inputs file. This script assumes FASTQ files are paired. For samples with multiple pairs, please do not concatenate them into one massive pair per sample (general advice that extends beyond the scope of this analysis).
2. Run `qsub split_fastq_run_parallel.pbs`
This script will run `split_fastq.sh`, which runs a splitting task for each of the input files created previously (a sketch of the underlying fastp command follows these steps). Before running, confirm the fq1 and fq2 regex in `split_fastq.sh` match your FASTQ files and edit as required.
3. Run `qsub split_fastq_check_fastq_input_vs_split_output_run_parallel.pbs`
This checks that the number of reads in the split FASTQ files matches that detected by FastQC in the unsplit FASTQ files. This step assumes FastQC has previously been run, with unzipped results in `./FastQC`. Before running, check that the regex in `split_fastq_check_fastq_input_vs_split_output.sh` match your FASTQ files and edit as required.
In this step, parallel alignment of the split FASTQ pairs (~500 K read pairs each) is performed with bwakit.
Steps for running the align scripts are:
1. Run `bash align_make_inputs.sh <config prefix>`
This script will report the total number of split FASTQ pairs to be aligned. If there are more than 20 K input files, users will need to manually batch the list of inputs into approximately evenly-sized input files of ~20 K inputs each. Name these files `align.input-chunk<N>` and modify the `run_parallel.pbs` script accordingly between submissions. Use the number of tasks printed to stdout to determine resource requirements.
2. Run `qsub align_run_parallel.pbs`
This script will run `align.sh` or `align_nonhuman.sh`. These scripts run BWA-MEM and post-processing over parallel tasks for each chunk of ~20,000 inputs (a sketch of the alignment pipeline follows these steps). Before running, confirm the correct `SCRIPT=` is unhashed in `align_run_parallel.pbs` and edit accordingly.
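For context, each alignment task pipes BWA-MEM output through bwakit's post-alt processing, roughly as sketched below. The read-group fields and file names are placeholders, and the path to `bwa-postalt.js` depends on your bwakit installation; `align.sh` is the authoritative version.

```
# Hedged sketch of one alignment task; align.sh is authoritative.
module load bwa/0.7.17 k8/0.2.5 samtools/1.10

ref=Reference/Hg38.fasta

bwa mem -M -t 2 \
    -R "@RG\tID:SAMPLE1.1\tSM:SAMPLE1\tPL:ILLUMINA\tCN:AGRF\tLB:1" \
    ${ref} sample1_R1.fastq.gz sample1_R2.fastq.gz \
  | k8 bwa-postalt.js ${ref}.alt \
  | samtools view -b -o sample1_chunk1.bam -
```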
In this step all parallel alignment output files will be merged to form a single BAM file per sample.
Steps for running the merge_align scripts are:
1. Run `bash merge_align_make_input.sh <config prefix>`
This script will make a list of input files for parallel merging of alignment outputs. The script can be adjusted for cohorts with a binomial difference in coverage (e.g. 30x/60x normal/tumour samples): if binomial grouping is desired, change `group=false` to `group=true`.
2. Run `qsub merge_align_run_parallel.pbs`
This script will run `merge_align.sh`, which merges the chunked small BAM files into a single BAM file for each sample (a sketch follows these steps).
3. Run `bash merge_align_check_fastq_input_vs_bam_output_sizes.sh <config prefix>`
This script will check that the merged alignments meet the expected size. We expect merged BAM files to be ~1.4-1.5x larger than the input FASTQ files. If a preprocessing step has output FASTQ files at a different compression level, this ratio can deviate from the expected value.
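For context, the per-sample merge reduces to a Sambamba call along these lines; this is a hedged sketch in which the file names and thread count are placeholders.

```
# Hedged sketch of one merge task; merge_align.sh is authoritative.
module load sambamba/0.7.1
sambamba merge -t 28 SAMPLE1.merged.bam SAMPLE1_chunk*.bam
```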
In this step, duplicate reads are marked, then sorted by genomic coordinate and indexed. During duplicate read marking, split and discordant reads are extracted to SAM format and reformatted as BAM for later analysis.
Steps for running dedup_sort scripts are:
1. Run `bash dedup_sort_make_input.sh <config prefix>`
This script reads samples from the config file and creates an inputs list for duplicate read marking and sorting. The script can be adjusted for cohorts with a binomial difference in coverage (e.g. 30x/60x normal/tumour samples): if binomial grouping is desired, change `group=false` to `group=true`.
2. Run `qsub dedup_sort_run_parallel.pbs`
This script will run `dedup_sort.sh`, which marks duplicate reads and sorts reads by genomic coordinate for each sample (a sketch follows these steps). It will also extract split and discordant reads to separate files.
3. Run `bash reformat_split_disc_make_input.sh <config prefix>`
This script reads samples from the config file and creates an inputs list for converting the SAM files containing discordant and split reads to BAM format. As above, change `group=false` to `group=true` if binomial grouping is desired.
4. Run `qsub reformat_split_disc_run_parallel.pbs`
This script will run `reformat_split_disc.sh`, which reformats split and discordant reads from SAM to BAM format (and indexes the BAM). SAMblaster optionally outputs split and discordant reads to SAM format; these reads are useful for downstream SV detection and analysis. After this script has run successfully, the SAM files can be deleted.
In this step, BAI index files are created for the dedup/sorted BAM files.
Steps for running the index scripts are:
1. Run `bash index_make_input.sh <config prefix>`
This script creates an inputs list for indexing of the BAM files. As above, change `group=false` to `group=true` for cohorts with a binomial difference in coverage (e.g. 30x/60x normal/tumour samples).
2. Run `qsub index_run_parallel.pbs`
This script will run `index.sh` for all dedup/sorted BAM files in parallel.
This step will create base quality score recalibration (BQSR) tables based on high-confidence, population-level variant call sets. For more information on this process, see GATK's BaseRecalibrator tool documentation.
The steps for running the BQSR_recal scripts are:
1. Run `bash bqsr_recal_make_input.sh <config prefix>`
This script creates an inputs list for parallel execution of GATK's BaseRecalibrator tool. As above, change `group=false` to `group=true` for cohorts with a binomial difference in coverage (e.g. 30x/60x normal/tumour samples). The comma-delimited list of genomic intervals should consist of 100 Mbp chunks per sample, i.e. for hg38 this is 32 entries per sample; GATK recommends that BQSR operate over a minimum of 100 Mbp per chunk. This script uses the human reference assembly files by default; if using a different reference assembly, edit the script accordingly.
2. Run `qsub bqsr_recal_run_parallel.pbs`
This script will run `bqsr_recal.sh` or `bqsr_recal_nonhuman.sh` in parallel over the number of interval chunks for the supplied reference assembly, per sample (a sketch of one task follows these steps). Before running, confirm the correct `SCRIPT=` is unhashed in `bqsr_recal_run_parallel.pbs` and edit accordingly.
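For context, a single recalibration task reduces to a BaseRecalibrator call along these lines. This is a hedged sketch: the known-sites VCF, interval list name and paths are placeholders.

```
# Hedged sketch of one BaseRecalibrator task; bqsr_recal.sh is authoritative.
module load gatk/4.1.8.1
gatk BaseRecalibrator \
    -R Reference/Hg38.fasta \
    -I SAMPLE1.dedup.sort.bam \
    -L Reference/BQSR_intervals/0001-scattered.interval_list \
    --known-sites Reference/dbsnp.vcf.gz \
    -O SAMPLE1.0001.recal_data.table
```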
This step gathers the BQSR reports into a single file for each sample.
The steps for running the BQSR_gather scripts are:
1. Run `bash bqsr_gather_make_input.sh <config prefix>`
This script makes an inputs list for parallel or serial execution of GATK's GatherBQSRReports. It will create a sample list for gathering the chunked BQSR recalibration tables into one table per sample.
2. Run `qsub bqsr_gather_run_parallel.pbs` OR `qsub bqsr_gather_serial.pbs`
BaseRecalibrator produces one table file per BQSR task (number of samples x number of 100 Mbp intervals); these tables need to be gathered into a merged table for each sample (a sketch follows these steps). Running `bqsr_gather_run_parallel.pbs` will run `bqsr_gather.sh` for each BAM in parallel; given the resource requirements of this process, it is best suited to larger batches of samples. Running `bqsr_gather_serial.pbs` will run `bqsr_gather.sh` for each BAM serially, and is best suited to small batches of samples.
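For context, the per-sample gather step reduces to a GatherBQSRReports call along these lines; this is a hedged sketch in which the table names are placeholders, with one `-I` per interval table.

```
# Hedged sketch of one gather task; bqsr_gather.sh is authoritative.
module load gatk/4.1.8.1
gatk GatherBQSRReports \
    -I SAMPLE1.0000.recal_data.table \
    -I SAMPLE1.0001.recal_data.table \
    -O SAMPLE1.recal_data.table
```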
This process runs GATK's ApplyBQSR tool over each of the contigs for each sample in parallel to adjust base quality scores.
The steps for running the BQSR_apply scripts are:
1. Run `bash bqsr_apply_make_input.sh <config prefix>`
This script creates an inputs list for applying the BQSR reports to each BAM file. As above, change `group=false` to `group=true` for cohorts with a binomial difference in coverage (e.g. 30x/60x normal/tumour samples). This script uses the human reference assembly files by default; if using a different reference assembly, edit the script accordingly.
2. Run `qsub bqsr_apply_run_parallel.pbs`
This script will run `bqsr_apply.sh` for each contig of each sample in parallel (a sketch of one task follows these steps).
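For context, a single task reduces to an ApplyBQSR call along these lines; this is a hedged sketch in which the file names and contig are placeholders.

```
# Hedged sketch of one ApplyBQSR task; bqsr_apply.sh is authoritative.
module load gatk/4.1.8.1
gatk ApplyBQSR \
    -R Reference/Hg38.fasta \
    -I SAMPLE1.dedup.sort.bam \
    -L chr1 \
    --bqsr-recal-file SAMPLE1.recal_data.table \
    -O SAMPLE1.chr1.recal.bam
```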
This is the final step which merges recalibrated split BAM files into a final BAM file for each sample using GATK's GatherBamFiles tool.
1. Run `bash bqsr_merge_make_input.sh <config prefix>`
This script creates an inputs list of split BAMs to be merged for each sample. As above, change `group=false` to `group=true` for cohorts with a binomial difference in coverage (e.g. 30x/60x normal/tumour samples). This script uses the human reference assembly files by default; if using a different reference assembly, edit the script accordingly.
2. Run `qsub bqsr_merge_GATK_run_parallel.pbs`
This script will run `bqsr_merge_GATK.sh` for each sample in parallel (a sketch follows these steps).
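For context, the final per-sample merge reduces to a GatherBamFiles call along these lines; this is a hedged sketch in which the file names are placeholders, with one `-I` per contig-level BAM, supplied in reference contig order.

```
# Hedged sketch of one final merge task; bqsr_merge_GATK.sh is authoritative.
module load gatk/4.1.8.1
gatk GatherBamFiles \
    -I SAMPLE1.chr1.recal.bam \
    -I SAMPLE1.chr2.recal.bam \
    -O SAMPLE1.final.bam \
    --CREATE_INDEX true
```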
These resource request recommendations are based on benchmarking performed using 6x Platinum Genomes samples. All computation was performed on NCI's 'Gadi' HPC, running CentOS 8 and PBS Pro v.19, on Intel Xeon Cascade Lake 2 x 24-core nodes, each with 192 GB ('normal' queue), 128/256 GB ('normalbw' queue) or 1,536 GB ('hugemem' queue) of RAM.
Compute performance is dependent on genome size, composition, and complexity; we therefore recommend that those working with non-human datasets perform one round of benchmarking before running the pipeline on their full dataset.
Job | Queue | Job distribution | NCPU/task | Mem/task | Walltime | Notes |
---|---|---|---|---|---|---|
FastQC | Normalbw | 1 task/FASTQ | 1 | 4-5 GB | ~30min | |
Split_fastq | Normal | 1 task/FASTQ pair | 4 | see note | ~40min | Request whole nodes only |
Align | Normal | 1 task/Split FASTQ | 2 | see note | ~30min | Request 120 nodes/1 hr per 20 K chunk |
Merge_align | Normalbw | 1 task/Sample | 28 | 128 GB | ~1hr | Batch samples by size if binomial |
Dedup_sort | Normalbw | 1 task/Sample | 28 | 256 GB | ~2hr | Batch samples by size if binomial |
Index | Normal | 1 task/Sample | 24 | half node | 3-6.5min | |
BQSR_recal | Normal | BQSR intervals/sample | 2 | see note | 10-20min | |
BQSR_gather | Normal | 1 task/Sample | 1 | 4 GB | 10min | |
BQSR_apply | Normal | Contigs/sample | 1.5-2.5 node/sample | see NCPU/task | 1.5hr | Avg time to print a contig is 1min. max is chr1 (40-60min) |
BQSR_merge | Normal | 1 task/Sample | 3 | 12 GB | 1hr | 1 CPU/task on hugemem |
6x Platinum Genomes human samples (~100x coverage)
#JobName | CPUs_requested | CPUs_used | Mem_requested(GB) | Mem_used(GB) | CPUtime | CPUtime_mins | Walltime_req | Walltime_used | Walltime_mins | JobFS_req | JobFS_used | Efficiency | Service_units(CPU_hours) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fastqc | 28 | 28 | 128.0GB | 128.0GB | 17:56:01 | 1076.02 | 2:00:00 | 0:45:20 | 45.33 | 100.0MB | 8.11MB | 0.85 | 26.44 |
split_fastq | 96 | 96 | 380 | 198.2 | 29:19:58 | 1759.97 | 1:00:00 | 0:22:48 | 22.8 | 209715200b | 16714kb | 0.80 | 72.96 |
check_split_fastq | 96 | 96 | 380 | 189.87 | 6:27:50 | 387.83 | 0:45:00 | 0:05:36 | 5.6 | 209715200b | 16714kb | 0.72 | 17.92 |
align_5 | 5760 | 5760 | 22.27TB | 18.47TB | 968:51:47 | 58131.78 | 2:00:00 | 0:12:49 | 12.82 | 11.72GB | 9.02MB | 0.79 | 2460.8 |
merge_align | 168 | 168 | 768.0GB | 687.34GB | 57:51:12 | 3471.2 | 1:30:00 | 0:55:31 | 55.52 | 600.0MB | 8.11MB | 0.37 | 194.31 |
dedup_sort_3 | 168 | 168 | 1.5TB | 1.3TB | 76:09:56 | 4569.93 | 6:00:00 | 2:06:53 | 126.88 | 600.0MB | 8.11MB | 0.21 | 451.14 |
index | 144 | 144 | 570.0GB | 281.75GB | 4:48:21 | 288.35 | 0:20:00 | 0:05:49 | 5.82 | 300.0MB | 8.16MB | 0.34 | 27.92 |
bqsr_recal | 384 | 384 | 1.48TB | 1.37TB | 43:45:07 | 2625.12 | 0:40:00 | 0:16:28 | 16.47 | 800.0MB | 8.18MB | 0.42 | 210.77 |
bqsr_gather | 6 | 6 | 24.0GB | 22.44GB | 0:04:06 | 4.1 | 0:10:00 | 0:00:49 | 0.82 | 100.0MB | 150.0B | 0.83 | 0.16 |
bqsr_apply | 720 | 720 | 2.78TB | 1.08TB | 137:59:57 | 8279.95 | 1:30:00 | 0:27:58 | 27.97 | 1.46GB | 9.17MB | 0.41 | 671.2 |
bqsr_merge_2 | 6 | 6 | 72.0GB | 72.0GB | 4:32:52 | 272.87 | 2:00:00 | 0:55:59 | 55.98 | 100.0MB | 174.0B | 0.81 | 33.59 |
16x Canine samples
#JobName | CPUs_requested | CPUs_used | Mem_requested | Mem_used | CPUtime | CPUtime_mins | Walltime_req | Walltime_used | Walltime_mins | JobFS_req | JobFS_used | Efficiency | Service_units(CPU_hours) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
fastqc | 28 | 28 | 128.0GB | 128.0GB | 10:42:42 | 642.7 | 2:00:00 | 0:30:44 | 30.73 | 100.0MB | 8.11MB | 0.75 | 17.93 |
split_fastq | 96 | 96 | 200.0GB | 197.53GB | 18:44:06 | 1124.1 | 0:40:00 | 0:27:07 | 27.12 | 200.0MB | 8.16MB | 0.43 | 86.77 |
align | 5760 | 5760 | 22.27TB | 14.53TB | 524:08:31 | 31448.52 | 1:00:00 | 0:15:18 | 15.3 | 11.72GB | 8.74MB | 0.36 | 2937.6 |
merge_align | 448 | 448 | 2.0TB | 1.28TB | 31:43:02 | 1903.03 | 2:00:00 | 0:12:34 | 12.57 | 1.56GB | 8.11MB | 0.34 | 117.29 |
index | 384 | 384 | 1.48TB | 467.44GB | 2:51:33 | 171.55 | 0:20:00 | 0:02:47 | 2.78 | 800.0MB | 8.16MB | 0.16 | 35.63 |
dedup_sort | 448 | 448 | 4.0TB | 2.88TB | 37:50:32 | 2270.53 | 1:00:00 | 0:25:31 | 25.52 | 1.56GB | 8.11MB | 0.2 | 241.94 |
bqsr_recal | 768 | 768 | 2.97TB | 2.23TB | 25:25:12 | 1525.2 | 0:40:00 | 0:05:26 | 5.43 | 1.56GB | 8.19MB | 0.37 | 139.09 |
bqsr_gather | 16 | 16 | 64.0GB | 16.2GB | 0:05:28 | 5.47 | 0:10:00 | 0:00:40 | 0.67 | 100.0MB | 388.0B | 0.51 | 0.36 |
bqsr_apply | 1920 | 1920 | 7.42TB | 51.51GB | 1:09:03 | 69.05 | 3:00:00 | 0:00:13 | 0.22 | 800.0GB | 115.18KB | 0.16 | 13.87 |
bqsr_merge | 48 | 48 | 190.0GB | 130.35GB | 3:13:45 | 193.75 | 2:00:00 | 0:19:44 | 19.73 | 100.0MB | 8.16MB | 0.2 | 31.57 |
- Users with non-human datasets will need to create the required reference interval files using `create_gatk_ref_intervals.pbs` and edit the `align_run_parallel.pbs` and `bqsr_recal_run_parallel.pbs` scripts to run the non-human versions of these steps.
- Given the impact genome complexity and size can have on compute resource requirements, we recommend users with non-human datasets perform a round of benchmarking using their own data before running this pipeline.
- Users can adjust scripts for cohorts with a binomial difference in coverage (e.g. 30x/60x normal/tumour samples). If binomial grouping is desired, change `group=false` to `group=true` in the relevant scripts.
Acknowledgements (and co-authorship, where appropriate) are an important way for us to demonstrate the value we bring to your research. Your research outcomes are vital for ongoing funding of the Sydney Informatics Hub and national compute facilities.
- Cali Willet (Sydney Informatics Hub, University of Sydney)
- Tracy Chew (Sydney Informatics Hub, University of Sydney)
- Georgina Samaha (Sydney Informatics Hub, University of Sydney)
- Rosemarie Sadsad (Sydney Informatics Hub, University of Sydney)
- Ben Menadue (National Computational Infrastructure)
- Rika Kobayashi (National Computational Infrastructure)
- Matthew Downton (National Computational Infrastructure)
- Yue Sun (National Computational Infrastructure)
The authors acknowledge the technical assistance provided by the Sydney Informatics Hub, a Core Research Facility of the University of Sydney and the Australian BioCommons which is enabled by NCRIS via Bioplatforms Australia. The authors acknowledge the use of the National Computational Infrastructure (NCI) supported by the Australian Government and the Sydney Informatics Hub HPC Allocation Scheme, supported by the Deputy Vice-Chancellor (Research), University of Sydney and the ARC LIEF, 2019: Smith, Muller, Thornber et al., Sustaining and strengthening merit-based access to National Computational Infrastructure (LE190100021).
Willet, C., Chew, T., Samaha, G., Menadue, B. J., Downton, M., Sun, Y., Kobayashi, R., & Sadsad, R. (2021). Fastq-to-BAM (Version 2.0) [Computer software]. https://doi.org/10.48546/workflowhub.workflow.146.1
Chen, S., Zhou Y., Chen Y., Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, 01 September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560
Faust G.G., Hall I.M., SAMBLASTER: fast duplicate marking and structural variant read extraction, Bioinformatics, Volume 30, Issue 17, 1 September 2014, Pages 2503–2505, https://doi.org/10.1093/bioinformatics/btu314
Graham, R.L., Woodall, T.S., Squyres, J.M. 2005. Open MPI: a flexible high performance MPI. In Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics (PPAM'05). Springer-Verlag, Berlin, Heidelberg, 228–239. DOI:https://doi.org/10.1007/11752578_29
Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., & 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
Li, H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997
Tarasov A., Vilella A.J., Cuppen E., Nijman I.J., Prins P., Sambamba: fast processing of NGS alignment formats, Bioinformatics, Volume 31, Issue 12, 15 June 2015, Pages 2032–2034, https://doi.org/10.1093/bioinformatics/btv098
Van der Auwera G.A., Carneiro M.O., Hartl C., Poplin R., del Angel G., Levy-Moonshine A., Jordan T., Shakir K., Roazen D., Thibault J., Banks E., Garimella K.V., Altshuler D., Gabriel S. and DePristo M.A. (2013), From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Current Protocols in Bioinformatics, 43: 11.10.1-11.10.33. https://doi.org/10.1002/0471250953.bi1110s43