Split by dengue serotype (denv1-denv4) #19

j23414 · 2024-02-06T23:13:55Z

Description

Implement strategies (or an ensemble of strategies for cross-validation) to produce pairs of sequences_{serotype}.fasta and metadata_{serotype}.tsv files.

Context

Following the merge of #13, all ingested dengue records now exist in a unified pair of sequences.fasta and metadata.tsv files.

For subsequent analysis #18, and to maintain consistency with the previous approach and ensure a seamless integration with the phylogenetic pipeline, it is necessary to separate these files based on dengue serotypes (e.g., from sequences_denv1.fasta to sequences_denv4.fasta).

Possible solution(s)

Rely on NCBI taxon id annotations for serotype segregation

Historically, for dengue, we obtained each serotype individually in this code, leading to redundant fetching and processing of each sequence. Now that we're using NCBI Datasets, the numeric taxon IDs are recorded in a virus-tax-id field that we can use to separate the serotypes.

Notably, this method risks omitting sequences from individual serotype builds when NCBI did not annotate the record with the lineage ID (~3k records). Those records could potentially be assigned with an all-serotype Nextclade dataset.
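A minimal sketch of the taxon-ID split, not the workflow's actual implementation. It assumes the metadata column is named virus_tax_id (as in #20), that an accession column matches the FASTA record IDs, and that the four IDs below are the NCBI Taxonomy IDs for dengue virus 1 through 4; all of these should be checked against the real ingest output. Records whose taxon ID is not one of the four (for example, records annotated only at the species level) are collected under "unassigned" rather than dropped, so they can be reviewed or classified another way.

```python
import csv
import io
from collections import defaultdict

# Assumed NCBI Taxonomy IDs for dengue virus 1-4; verify before relying on them.
TAXID_TO_SEROTYPE = {
    "11053": "denv1",
    "11060": "denv2",
    "11069": "denv3",
    "11070": "denv4",
}


def serotype_for(taxid):
    """Map a taxon ID to a serotype label, or 'unassigned' if it is not recognized."""
    return TAXID_TO_SEROTYPE.get(str(taxid).strip(), "unassigned")


def split_metadata(tsv_text, taxid_field="virus_tax_id"):
    """Group metadata rows (dicts) by serotype based on the taxon ID column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[serotype_for(row.get(taxid_field, ""))].append(row)
    return dict(groups)


def split_fasta(fasta_text, id_to_serotype):
    """Group FASTA records by serotype, keyed on the first token of each header."""
    groups = defaultdict(list)
    current = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            current = id_to_serotype.get(seq_id, "unassigned")
        if current is not None:
            groups[current].append(line)
    return {name: "\n".join(lines) + "\n" for name, lines in groups.items()}
```

Each group could then be written out as metadata_{serotype}.tsv and sequences_{serotype}.fasta. Working on in-memory text keeps the sketch short; a real rule would likely stream the files instead.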

Create a Nextclade dataset for finer subtype classification

Originally, the plan was to leverage Nextclade assignment to categorize records into major dengue serotypes and subsequent minor subtypes #16. However, due to the diversity within dengue, the major serotype classification did not align with expectations. Therefore, the idea is to rely on NCBI taxon IDs for major serotypes and on Nextclade datasets for within-serotype sub-classification.

An ensemble method

Ideally, employ a combination of the above methods to consistently and accurately classify records into major serotypes and minor subtypes.
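One way the ensemble could be sketched: take the serotype from the NCBI taxon ID when present, fall back to a Nextclade-derived serotype when it is missing, and flag disagreements for review. The function name, the status labels, and the policy of preferring the NCBI call on conflict are all assumptions made for illustration; the issue does not specify a resolution rule.

```python
def resolve_serotype(ncbi_serotype, nextclade_serotype):
    """Combine two serotype calls into (serotype, status).

    Either argument may be None or an empty string when that method made no call.
    Hypothetical policy: prefer the NCBI call, but report conflicts.
    """
    ncbi = ncbi_serotype or None
    nextclade = nextclade_serotype or None

    if ncbi and nextclade:
        if ncbi == nextclade:
            return ncbi, "agree"
        # Keep the NCBI call but surface the disagreement for manual review.
        return ncbi, "conflict"
    if ncbi:
        return ncbi, "ncbi_only"
    if nextclade:
        # Covers records NCBI did not annotate at the serotype level.
        return nextclade, "nextclade_only"
    return None, "unassigned"
```

Tallying the status values across all records would give a quick cross-validation summary of how often the two methods agree.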

Tasks to solve this issue

Split by serotype using NCBI virus_tax_id #20
Nextclade assignment #16
- which possibly requires Add workflow for producing the Nextclade dengue dataset #25


joverlee521 · 2024-03-14T19:59:48Z

Learned in today's Nextstrain meeting that this can probably be done by nextclade sort similar to how RSV separates A/B subtypes

j23414 added the enhancement New feature or request label Feb 6, 2024

This was referenced Feb 6, 2024

Split by serotype using NCBI virus_tax_id #20

Merged

Add E gene builds #17

Closed

j23414 self-assigned this Feb 11, 2024

j23414 closed this as completed Mar 24, 2024
