This pipeline processes dengue virus sequencing data to perform quality assurance of genomic data. To do so it runs steps of genotyping, sequence alignment, genome coverage calculation, and molecular clock analysis.
-
Clone the repository:
git clone https://github.com/InstitutoTodosPelaSaude/dengueQA.git cd dengueQA
-
Install dependencies:
conda env create -f config/dengueQA.yaml conda activate dengueQA
- Edit the Snakefile to specify your input files and parameters.
- Run the pipeline:
snakemake all --cores all
The pipeline consists of the following steps:
- BLAST searches: Perform BLAST searches for serotyping, genotyping, and lineage assignment according to reference files provided by 'dengue-lineages' nomenclature system.
- Dataset splitting: Split datasets based on dengue serotypes.
- Sequence alignment: Align newly sequenced genomes with reference genomes using
augur align
. - Coverage calculation: Calculate coverage per coding region (CDS) and whole genomes.
- Contextual genome selection: Select contextual genomes for root-to-tip analysis.
- Dataset compilation: Join new sequences with contextual genomes.
- Metadata processing: Add missing columns and process metadata.
- Phylogenetic analysis: Align genomes (
augur align
), mask 5' and 3' untranslated regions (augur mask
), build phylogenetic trees (iqtree
), and perform root-to-tip analysis (treetime
). - Quality assurance: Generate quality assurance report containing information about genomic coverage (flagging low coverage coding regions (CDS) and whole genomes) and sequence quality (flagging molecular clock outliers).
Using this pipeline, the quality assurance report reveals important statistics to evaluate genomes under the following criteria:
- Evolutionary rates with strong deviations from the molecular clock
- Minimum coverage of both whole genome and specific coding regions.
The molecular clock quality analysis (1) aims to identify genomes whose evolutionary rates deviate more than 10 interquartile ranges from the trend line produced by high-quality genomes. This cutoff is rather permissive, and was set on purpose to avoid exclusion of genomes that may have naturally higher evolutionary rates.
In this example, the correlation between genetic distance (y axis) and time (x axis) of the root-to-tip molecular clock analysis identified one sequence with evolutionary rate the deviates from the expected, being flagged as a potentially low quality sequence.
The genome coverage analysis (2) aims to identify genomes with less than 70% coverage, the minimum cutoff used for submitting the sequence as 'complete genome' or 'near-complete genome' to public databases. When sequences have less than 70% coverage, specific genomic fragments are evaluated for possible submission as partial sequences, thery are: the C-prM-E region (~2300 bp) and the viral envelope region (E, ~1500 bp) (see image below). In this order, if any of these regions have at least 95% coverage, it could be submitted to databases as genomic fragments, still relevant for specific analyses.
Examples of sequences of interest for submission to public genomic databases, according to their levels of completeness.
- Anderson Brito, Instituto Todos pela Saúde (ITpS) - Website - anderson.brito@itps.org.br
This project is licensed under the MIT License.