A Nextflow pipeline script for processing aDNASeq samples.
The pipeline was written by The Bioinformatics & Biostatistics Group in collaboration with The Ancient Genomics Lab @ The Francis Crick Institute.
- nextflow: http://www.nextflow.io
- nextflow-quickstart: http://www.nextflow.io/docs/latest/getstarted.html#get-started
To run a BABS-aDNASeq analysis, complete the following steps. Each is explained in more detail further down.
- Obtain BABS-aDNASeq files from GitHub.
- Install/load nextflow-0.32.0 or higher.
- Configure reference genome file paths (genome.yml).
- Configure environment profile if running software via a module system.
- Create a sample design file.
- Run nextflow pipeline.
To obtain BABS-aDNASeq files run the following git command.
git clone https://github.com/crickbabs/BABS-aDNASeq
- BABS-aDNASeq.nf : The Nextflow script.
- BABS-aDNASeq : Wrapper script to run an analysis.
- nextflow.config : Main BABS-aDNASeq config file.
- conf/babs_profile.config : Profile configuration for running the script @ The Crick.
- conf/genomes.config : Genomes configuration file for defining reference data.
- conf/multiqc_config.yml : MultiQC configuration used to generate the integrated QC report.
If you are working within a module environment such as that at The Crick, load the nextflow module.
module purge
module load nextflow/0.30.2
FASTQ files are specified in a CSV design file with the following columns, one row per sequencing library.
column 1 : Individual ID
column 2 : Sequencing library ID
column 3 : full path to fastq file R1
column 4 : full path to fastq file R2
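As an illustration, a design file with this four-column layout could be written and sanity-checked as sketched below. The sample IDs and paths are made up, the helper names are hypothetical, and the sketch assumes a comma-separated file with no header row, which the column description does not state.

```python
import csv
import io

# Hypothetical example rows: individual ID, library ID, R1 path, R2 path.
ROWS = [
    ("ind1", "lib1", "/data/fastq/lib1_R1.fastq.gz", "/data/fastq/lib1_R2.fastq.gz"),
    ("ind1", "lib2", "/data/fastq/lib2_R1.fastq.gz", "/data/fastq/lib2_R2.fastq.gz"),
    ("ind2", "lib3", "/data/fastq/lib3_R1.fastq.gz", "/data/fastq/lib3_R2.fastq.gz"),
]


def write_design(rows, handle):
    """Write rows as a four-column CSV (assumption: no header row)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerows(rows)


def check_design(handle):
    """Return a list of problems found in a design file; empty if none."""
    problems = []
    for line_no, row in enumerate(csv.reader(handle), start=1):
        if len(row) != 4:
            problems.append(f"line {line_no}: expected 4 columns, got {len(row)}")
            continue
        _individual, _library, r1, r2 = row
        for path in (r1, r2):
            # The design file asks for full paths, so flag relative ones.
            if not path.startswith("/"):
                problems.append(f"line {line_no}: {path} is not a full path")
    return problems


# Write to an in-memory buffer here; use open("design.csv", "w") for a real file.
buffer = io.StringIO()
write_design(ROWS, buffer)
design_text = buffer.getvalue()
print(design_text, end="")
print(check_design(io.StringIO(design_text)))
```

Checking the column count and path form before launching the pipeline catches the most common design-file mistakes early.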
BABS-aDNASeq --outdir ./ --design design.csv --profile babs --genome hg19 --resume
Adapter trimming and paired-end overlap consensus building. Only the overlap is saved here. Non-overlapping read-pairs are discarded.
https://github.com/jstjohn/SeqPrep
Consensus overlaps are aligned to the specified reference using BWA. BAM files with read groups are created.
Duplicate alignments are removed using Picard.
VCFs are created using samtools mpileup. QC metrics are produced using bcftools stats.
Ambiguity encoded consensus fasta files are produced using vcftools consensus.
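Ambiguity encoding means that a heterozygous site is written with the IUPAC code covering both alleles. The sketch below shows only that mapping for biallelic SNP genotypes. It is not how the consensus tool is implemented, and the function names and the toy reference are hypothetical.

```python
# IUPAC nucleotide ambiguity codes for two-allele combinations.
IUPAC = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def ambiguity_code(allele1, allele2):
    """Return the base itself for a homozygous call, else the IUPAC code."""
    a, b = allele1.upper(), allele2.upper()
    if a == b:
        return a
    return IUPAC[frozenset((a, b))]


def apply_genotypes(reference, genotypes):
    """Overlay SNP genotypes on a reference string.

    genotypes maps a 0-based position to a pair of single-base alleles.
    """
    seq = list(reference)
    for pos, (a, b) in genotypes.items():
        seq[pos] = ambiguity_code(a, b)
    return "".join(seq)


# Toy example: a C/T heterozygote at position 1, a G/G homozygote at position 4.
print(apply_genotypes("ACGTAC", {1: ("C", "T"), 4: ("G", "G")}))  # AYGTGC
```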
Random allele fasta files are produced using htsbox pileup -R.
https://github.com/lh3/htsbox
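The idea behind a random-allele sequence is to pick one observed base per site at random rather than calling a genotype. The sketch below illustrates that idea on a toy pileup. It is not htsbox's implementation: the input structure, the function name, and the use of N for uncovered sites are assumptions made for illustration.

```python
import random


def random_allele_sequence(pileup, rng):
    """Pick one observed base per site at random.

    pileup is a list with one entry per site, each entry a list of the
    bases observed in reads covering that site. Sites without coverage
    are written as N (an assumption of this sketch).
    """
    return "".join(rng.choice(bases) if bases else "N" for bases in pileup)


# Toy pileup: four sites, the third has no coverage.
pileup = [["A", "A"], ["C", "T", "C"], [], ["G"]]
rng = random.Random(42)  # seeded so the example is repeatable
print(random_allele_sequence(pileup, rng))
```

Sampling a single read per site avoids the bias towards the reference allele that genotype calling can introduce at low coverage.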
BAM files from the same individual are merged using samtools merge. Variant calling and QC are carried out at both the library and individual level.
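Which libraries end up in each merge follows from the first two columns of the design file. A minimal sketch of that grouping is shown below. The rows are made-up examples and the function name is hypothetical; the pipeline's own grouping logic lives in the Nextflow script and may differ.

```python
from collections import defaultdict


def libraries_by_individual(rows):
    """Group library IDs by individual ID, keeping design-file order."""
    groups = defaultdict(list)
    for individual, library, _r1, _r2 in rows:
        groups[individual].append(library)
    return dict(groups)


# Hypothetical design rows: individual ID, library ID, R1 path, R2 path.
rows = [
    ("ind1", "lib1", "/data/fastq/lib1_R1.fastq.gz", "/data/fastq/lib1_R2.fastq.gz"),
    ("ind1", "lib2", "/data/fastq/lib2_R1.fastq.gz", "/data/fastq/lib2_R2.fastq.gz"),
    ("ind2", "lib3", "/data/fastq/lib3_R1.fastq.gz", "/data/fastq/lib3_R2.fastq.gz"),
]

for individual, libraries in libraries_by_individual(rows).items():
    # Each individual's libraries would be merged into one BAM.
    print(individual, libraries)
```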
Alignment QC is assessed using PMDtools and the Picard tools CollectWgsMetrics, CollectWgsMetricsWithNonZeroCoverage, CollectOxoGMetrics and CollectAlignmentSummaryMetrics. An integrated QC report is generated using MultiQC.
https://github.com/pontussk/PMDtools
https://github.com/broadinstitute/picard
https://github.com/ewels/MultiQC
The BABS-aDNASeq nextflow pipeline was written and developed by Philip East & Pontus Skoglund.
The Bioinformatics & Biostatistics Group (BABS) @ The Francis Crick Institute. Ancient Genomics @ The Francis Crick Institute.
This project is licensed under the MIT License - see the LICENSE.md file for details.