Implement strategies (or an ensemble of strategies for cross-validation) to produce pairs of sequences_{serotype}.fasta and metadata_{serotype}.tsv files.
Context
Following the merge of #13, all ingested dengue records now exist in a unified pair of sequences.fasta and metadata.tsv files.
For the subsequent analysis in #18, and to stay consistent with the previous approach and integrate seamlessly with the phylogenetic pipeline, these files need to be split by dengue serotype (e.g., sequences_denv1.fasta through sequences_denv4.fasta, each with a matching metadata file).
Possible solution(s)
Rely on NCBI taxon id annotations for serotype segregation
Historically, for dengue, we fetched each serotype individually in this code, which led to redundant fetching and processing of each sequence. Now that we're using ncbi datasets, the numeric NCBI taxon IDs are recorded in a virus-tax-id field, which we can use to separate the serotypes.
Notably, this method risks dropping sequences from the individual serotype builds when NCBI did not annotate the record with the lineage ID (~3k records). These records could potentially be classified with a Nextclade "all" dataset.
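As a rough illustration of the taxon-ID approach, the sketch below splits a metadata table and its FASTA by the virus-tax-id field. It is only a sketch under assumptions: the taxon IDs in the mapping are the NCBI Taxonomy IDs for dengue virus 1 to 4 as I understand them and should be verified, the `accession` column name and both function names are hypothetical, and records annotated at a different taxonomic level simply fall into an `unassigned` group.

```python
import csv
import io
from collections import defaultdict

# Assumed NCBI Taxonomy IDs for the four serotypes; verify before relying on them.
SEROTYPE_BY_TAXID = {
    "11053": "denv1",
    "11060": "denv2",
    "11069": "denv3",
    "11070": "denv4",
}


def split_metadata_by_serotype(tsv_text, taxid_field="virus-tax-id"):
    """Group metadata rows by serotype; any other taxon ID goes to 'unassigned'."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        taxid = (row.get(taxid_field) or "").strip()
        groups[SEROTYPE_BY_TAXID.get(taxid, "unassigned")].append(row)
    return dict(groups)


def split_fasta_by_serotype(fasta_text, groups, id_field="accession"):
    """Route each FASTA record to the serotype of its metadata row.

    `id_field` is a hypothetical column name holding the FASTA record ID.
    """
    serotype_of = {
        row[id_field]: serotype
        for serotype, rows in groups.items()
        for row in rows
    }
    out = defaultdict(list)
    current = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            record_id = parts[0] if parts else ""
            current = serotype_of.get(record_id, "unassigned")
        if current is not None:
            out[current].append(line)
    return {s: "\n".join(lines) + "\n" for s, lines in out.items()}
```

Each key of the two returned dictionaries would then be written to metadata_{serotype}.tsv and sequences_{serotype}.fasta. Keeping the `unassigned` group explicit makes the unannotated records easy to count and to hand over to a second classifier.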
Create a Nextclade dataset for finer subtype classification
Originally, the plan was to use Nextclade assignments to categorize records into major dengue serotypes and then into minor subtypes (#16). However, due to the diversity within dengue, the major serotype classification did not align with expectations. The idea is therefore to rely on NCBI taxon IDs for major serotypes, and on Nextclade datasets for within-serotype sub-classification.
An ensemble method
Ideally, employ a combination of the above methods to consistently and accurately classify records into major serotypes and minor subtypes.
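One possible shape for the cross-validation step is sketched below. The function name, the status labels, and the tie-breaking policy (prefer the NCBI call on conflict, but flag it) are all assumptions made for illustration; they follow the preference stated above for NCBI taxon IDs at the major-serotype level and are not a decided design.

```python
from typing import Optional, Tuple


def reconcile_serotype(
    ncbi: Optional[str], nextclade: Optional[str]
) -> Tuple[Optional[str], str]:
    """Combine two serotype calls into one call plus a status label.

    Hypothetical policy: NCBI wins on conflict, Nextclade fills gaps.
    """
    if ncbi and nextclade:
        status = "agree" if ncbi == nextclade else "conflict"
        return ncbi, status
    if ncbi:
        return ncbi, "ncbi_only"
    if nextclade:
        return nextclade, "nextclade_only"
    return None, "unclassified"
```

Records with a `conflict` or `unclassified` status could be written to a separate report for manual review rather than silently included in a serotype build.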
Tasks to solve this issue