Nextclade assignment #16

j23414 · 2023-12-05T21:51:20Z

Description of proposed changes

This PR is a continuation of breaking out the dengue ingest changes PR#6 into more focused and manageable pull requests. Following the merge of copy ingest, the focus of this PR is to split the dengue sequences by serotype (e.g. sequences_denv1.fasta to sequences_denv4.fasta).

Previously for dengue, we had been fetching each serotype separately in this code, leading to redundant fetching and processing of each sequence. Additionally, this method posed the risk of missing sequences in individual serotype builds if NCBI did not annotate the record with the lineage ID.

Therefore, as an alternative, this PR proposes to leverage Nextclade assignment to categorize the records into the major dengue serotypes, and subsequent subtypes.

The primary scope of this PR includes:

Adding a Nextclade dataset (v2)
Splitting dengue sequences into denv1 to denv4 based on assignment
Adding Nextclade subtype assignments to the metadata files

Related issue(s)

Checklist

Checks pass
Manual check

git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout nextclade_assignment
nextstrain build ingest results/metadata_denv4.tsv

joverlee521 · 2024-01-05T22:23:38Z

nextclade_data/README.md

I haven't done an in-depth review yet but wanted to ask about the Nextclade dataset in general:

Should we start with V3 datasets instead of V2 so we don't have worry about migrating them later?

Should the dengue datasets be added to https://github.com/nextstrain/nextclade_data and then pulled down through the usual nextclade commands?

This is a good point.

I'll run a test conversion to V3 to see if any errors occur, similar to https://github.com/j23414/test_convert_v3 . Happy to pair-code on this, but no pressure.

Made a quick attempt:

python convert_v2_to_v3.py \ --input-dir ~/github/nextstrain/dengue_branches/nextclade_assignment/nextclade_data/denv1 \ --output-dir dengue_v3/denv1

Wonder what's causing "valueFriendly" errors (nextclade_v3/dengue/denv1)

Wonder what's causing "valueFriendly" errors

Do you see errors when running Nextclade v3 via the CLI instead of through the test web UI? I suspect the dataset-url param has not been updated to support V3 datasets (yet??). Actually, they probably will not support the dataset-url param going forward considering they are asking people to directly submit datasets to the nextclade_data repo.

Different error, I went and converted the "dengue/all"

nextclade3 run \ --input-dataset ../nextclade_data/all \ #<====== This is created with the convert_v2_to_v3.py script -j 4 \ --output-tsv data/nextclade_results/nextclade_all.tsv \ --min-length 1000 \ --silent \ data/sequences.fasta

And hitting:

Error: 0: When preprocessing Nextclade graph 1: When retrieving nuc mutations from reference tree node NC_002640.1 2: Encountered a mutation (T447C) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch. Mutations origin state is 'T', but tree (inferred from the reference sequence as no prior mutations were observed at this position) has state 'C'. This is likely an inconsistency between reference tree and reference sequence in the Nextclade dataset. Reference sequence should either correspond to the root of the reference tree or the root of the reference tree needs to account for difference between the tree and reference sequence. Please check that your reference tree is consistent with your reference sequence. Location: packages_rs/nextclade/src/tree/tree_preprocess.rs:145 Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it. Run with RUST_BACKTRACE=full to include source snippets.

The error message makes me think the reference tree is somehow violating one of the requirements:

The tree must be rooted at the sample that matches the reference (root) sequence.

You mentioned separately that

dengue (multiple spillovers, long diversity, there is no root strain for all 4 major spillovers) is probably problematic for nextclade v3.

Maybe @corneliusroemer or @ivan-aksamentov can comment on what's the best way to validate this dataset and whether there needs to be some adjustments specifically for Dengue.

Sorry for the troubles! Dataset creation is still a bit rough at edges. I'll mostly provide technical help and let scientists to help with the science.

Wonder what's causing "valueFriendly" errors

The reason for this error is simply that v3 is not released and so we've been breaking things in the dataset formats lately without telling anyone :). You used the latest conversion script from master branch of data repo, but also a random old version of Nextclade Web from a couple of months ago (the "UI2" prototype, https://nextclade-git-ui-2-nextstrain.vercel.app/), and the dataset format is simply incompatible between those versions.

We are pretty much done with the changes and we plan to make the v3 release soon. You can use https://master.clades.nextstrain.org for testing latest Web features in the meantime - this is a snapshot of master branch. And I released 3.0.0-alpha.2 of the CLI yesterday. Likely the last one.

The master deployment of Nextclade Web should show the same error as the recent Nextclade CLI:
https://master.clades.nextstrain.org/?dataset-url=https://github.com/j23414/test_convert_v3/tree/main/dengue_v3/denv1

Until the release, please also refer to the "latest" version of docs, sourced from master branch (where there's v3), rather than "stable" version sourced from release branch (where there's still v2):
https://docs.nextstrain.org/projects/nextclade/en/latest/

2: Encountered a mutation (T447C) in reference tree branch attributes, for which the origin state of the mutation is inconsistent with the state at the parental branch [...]

This is an error from the new tree placement algo. It now has some simple tree building capabilities (it can migrate new nodes deeper into the tree when appropriate, rather than just appending them as leaves). Here the new algo tried to calculate mutations resulting from adjusting node placement, but cannot make sense of the mutations on the ref tree, and cannot proceed.

As stated in the error and as Jover mentioned, it is likely due to a mismatch between the ref sequence fasta and the sequence for root node in the tree JSON. Augur usually provides consistent trees. However, if you then feeding a different ref sequence to Nextclade, the root and ref don't match - the evolution breaks (this is my simplified, perhaps limited, understanding as an engineer). A workaround is to add mutations between ref seq and root node seq to the root node JSON. I think Richard and Cornelius have some scripts for that in their dataset workflow repos. We need some scientific expertise there.

Although it worked in v2, it was likely silently producing incorrect results. Nextclade has no sufficient information to catch the inconsistency between the ref and the root, so the only thing we do is we warn users in the docs and shift responsibility to handle this correctly onto them. In v3 however, this can sometimes manifest if the mutations are inconsistent and it breaks the algorithm. Thus the error.

We did not have many custom datasets, and most datasets were created by Richard and Cornelius, so this was not a critical issue in v2. In v3 we actively encourage people to create datasets, so I am wondering what we can do for dataset authors to catch this earlier, during dataset creation, rather than during Nextclade run. Perhaps we can have a script in the data repo to check the ref tree JSON? Any ideas/help is very welcome here.

If you absolutely have no time to deal with this now, you can disable tree builder in CLI with --without-greedy-tree-builder flag. This will fall back to the v2 placement algo. And you can revisit this later. But, as I mentioned, it might produce results that are not 100% correct, so this is not recommended in general. It is also not currently possible to disable tree builder in the Web version.

directly submit datasets to the nextclade_data repo.

You are very welcome to submit!

will not support the dataset-url param going forward

Nextclade v3 will retain support of the dataset-url and dataset-server as the ways to fetch external datasets! It is important for 3rd party integrations (other apps linking to Nextclade). Also handy for dataset testing, just as you are doing it - it is an intended use-case :).

We also extended the format of accepted URLs - you can use the fancy shortcuts for URLs on GitHub - they are a bit easier to write by hand :) Although in this case it does not make it much shorter
https://master.clades.nextstrain.org/?dataset-url=gh:j23414/test_convert_v3@main@dengue_v3/denv1

You can also omit the branch if it's the default branch:
https://master.clades.nextstrain.org/?dataset-url=gh:j23414/test_convert_v3/dengue_v3/denv1

P.S. Don't hesitate to ping me on Slack or on GitHub for any questions!

A few points/questions reading through this thread:

I'd probably not do the v2 -> v3 conversion using the script but rather use an existing v2 dataset with v2 nextclade and make a new v3 compliant dataset in a v3 dataset workflow.

I don't know much about Dengue, it may make sense to have one dataset for each serotype, and maybe one "all serotypes" dataset in addition. I would think it should be possible to align different serotypes against each other but the result might contain a lot of indels and mutations making it hard to interpret in web UI

There's a new v3 subcommand split that should be helpful here to split by serotypes. But unfortunately it requires creating a minimizer index for which the tooling is a bit hidden at the moment. For now it might still be easiest to make a single "all-serotypes" dataset and use that to split into subtypes.

The inconsistency error you get because we require the genotype on the root branch to be the same as the reference. If the tree root is X and the reference is Y, you need to put all mutations from Y to X on the first branch leading to the tree root. Augur ancestral does this automatically now if you give it the reference as --root-sequence (the terminology here is confusing, root-sequence should really be called reference-sequence), see here for how the mpox Nextclade workflow does it: https://github.com/nextstrain/mpox/blob/master/nextclade/Snakefile#L343C1-L370C12 If this is a struggle, you can start by having the root and reference be the same.

corneliusroemer · 2024-01-10T14:13:31Z

Looking at the PR, given that you already have working v2 datasets, I don't think it would be unreasonable to continue using nextclade2 for now (you can add the 2 to future-proof).

As the purpose of this PR is to solely split into serotypes, there's no need for v3 features. Transitioning from 2 -> 3 can happen once this is merged to reduce complexity.

j23414 · 2024-02-14T19:30:19Z

Rebased after switching to relying on NCBI virus-tax-id to split serotypes.

Split by serotype using NCBI virus_tax_id #20

The subsequent nextclade subtype classification looks much cleaner.

joverlee521

Excited to see the dengue subtype assignments through Nextclade!

I'm less experienced on Nextclade datasets themselves, so I focused on the workflow changes. We can discuss the Nextclade dataset specifics in #25.

joverlee521 · 2024-02-14T20:50:41Z

ingest/workflow/snakemake_rules/nextclade.smk

+        min_length=1000, # E gene length is approximately 1400
+        min_seed_cover=0.1,


Do these flags still need to be provided even through they are defined within the Nextclade datasets under alignmentParams?

If we need to keep them or if this is a means for users to override the default params, I recommend moving these to configs that can be easily changed.

You are correct that these flags are equivalent to the alignmentParams and therefore do not need to be provided. Nonetheless, I'll move these params to the config file (a9b4d01) allowing us to refine them further in accordance with the present diversity in denegue sequences (for genome length and E gene length).

Subsequently, we can update the pathogen.json file once a definitive value is established.

I see, that's good to know! So once they've been settled in pathogen.json then we can remove them from the config and the workflow?

Yes, exactly :D

ingest/config/optional.yaml

joverlee521 · 2024-02-14T22:09:34Z

ingest/workflow/snakemake_rules/nextclade.smk

+        """
+        echo "genbank_accession,nextclade_subtype,nextclade_type" \
+        | tr ',' '\t' \
+        > results/nextclade_subtype.tsv
+
+        tsv-select -H -f "seqName,clade" {input.nextclade_denv1} \
+        | awk 'NR>1 {{print $0"\tDENV1"}}' \
+        >> results/nextclade_subtype.tsv
+        tsv-select -H -f "seqName,clade" {input.nextclade_denv2} \
+        | awk 'NR>1 {{print $0"\tDENV2"}}' \
+        >> results/nextclade_subtype.tsv
+        tsv-select -H -f "seqName,clade" {input.nextclade_denv3} \
+        | awk 'NR>1 {{print $0"\tDENV3"}}' \
+        >> results/nextclade_subtype.tsv
+        tsv-select -H -f "seqName,clade" {input.nextclade_denv4} \
+        | awk 'NR>1 {{print $0"\tDENV4"}}' \
+        >> results/nextclade_subtype.tsv
+
+        tsv-join -H \
+            --filter-file results/nextclade_subtype.tsv \
+            --key-fields genbank_accession \
+            --append-fields 'nextclade_subtype,nextclade_type' \
+            --write-all ? \
+            {input.metadata} \
+        > {output.metadata_all}
+
+        tsv-filter -H --str-eq ncbi_serotype:denv1 {output.metadata_all} > {output.metadata_denv1}
+        tsv-filter -H --str-eq ncbi_serotype:denv2 {output.metadata_all} > {output.metadata_denv2}
+        tsv-filter -H --str-eq ncbi_serotype:denv3 {output.metadata_all} > {output.metadata_denv3}
+        tsv-filter -H --str-eq ncbi_serotype:denv4 {output.metadata_all} > {output.metadata_denv4}
+        """


@j23414 and I spoke about this online because I was confused about what was going on here.

Background that was discussed during our chat: The metadata_all.tsv has all dengue metadata, including sequences that are not assigned serotypes. The metadata must then be joined with each individual serotype outputs from Nextclade to get the sequence subtype. The metadata + subtype TSV is then split back out into individual serotypes to make them available via data.nextstrain.org.

This rule can be simplified by splitting into 3 different rules.

Aggregates all serotype Nextclade outputs to create a single nextclade_subtype.tsv, i.e. lines 44 - 59.

Joins metadata with nextclade_subtype.tsv, i.e. lines 61 - 67

Splits out the all metadata + nextclade file into individual serotype files, i.e. lines 69 - 72

Thanks for chatting! Refactored and (hopefully) clarified here: 11a5cbc

I was wondering if it was possible to deduplicate some of the code in the three rules and ended up pushing up
831aa0d. Feel free to rework it!

Thanks for refactoring and deduplicating the code! Works as expected and looks more streamlined now :D

ingest/workflow/snakemake_rules/nextclade.smk

j23414 · 2024-02-22T22:19:27Z

After chatting with people, I'm going to fixup the commits and merge since this branch fixes DENV2/AII (aka DENV2/IV) assignment: #28 (comment)

Restore nextclade assignment rules to identify dengue subtypes The final subtype are added to the metadata_all.tsv file and subsequent metadata_denv1.tsv to metadata_denv4.tsv files.

Set final targets to be the pair of metadata_denvX.tsv and sequences_denvX.fasta files. The insertion, alignment and translation files had been produced by nextclade in other pathogen repos, however, we are not using them in dengue yet. If we need them later on, we can add in alignment and translation files (split by serotype) back in at that time.

Since we moved to using the ncbi dataset's 'virus_tax_id' field instead.

Only run nextclade on sequences at least 1000 nt long, this would avoid misclassification of short sequences. The dengue E gene has an approximate length of 1400.

While the alignment parameters can be set in the pathogen.json file, specifying the nextclade alignment parameters in the config file makes it easier to fine tune based on current sequence diversity being analyzed. This will allow us to then later propagate them to the pathogen.json file for Nextclade Web.

…arate rules for clarity This rule can be simplified by splitting into 3 different rules. 1. Aggregates all serotype Nextclade outputs to create a single nextclade_subtype.tsv mapping file 2. Joins metadata with nextclade_subtype.tsv into the final metadata_all.tsv file 3. Splits out the all metadata + nextclade file into individual serotype metadata.tsv files The reasoning is that the metadata_all.tsv has all dengue metadata, including sequences that are not assigned serotypes. Sequences without serotype are not listed in any of the subtype results. Therefore the nextclade_subtypes are added as a new field in metadata_all.tsv and then subsequently split back out into individual serotypes to make them available via data.nextstrain.org. The serotypes still need to be hardcoded because we only have Nextclade datasets for these specific serotypes. Use these hardcoded serotypes as wildcard constraints in the Nextclade rules to deduplicate the shell commands and parallelize the final `split_metadata_by_serotype` rule. co-authored-by: Jover Lee <joverlee521@gmail.com>

This is currently an empty file to indicate the top level pathogen repo. The inclusion of this file allows the nextstrain build invocation to work from any directory regardless of runtime. This is necessary so that rules in "ingest" can access the the external directory "nextclade_data".

j23414 force-pushed the nextclade_assignment branch 2 times, most recently from 9ae8383 to e638db2 Compare December 6, 2023 21:13

j23414 requested a review from a team December 12, 2023 22:59

j23414 marked this pull request as ready for review December 12, 2023 22:59

j23414 force-pushed the nextclade_assignment branch from e638db2 to 3321925 Compare December 12, 2023 23:11

j23414 force-pushed the nextclade_assignment branch from ceae8c5 to 5205fd0 Compare January 2, 2024 21:21

joverlee521 reviewed Jan 5, 2024

View reviewed changes

j23414 force-pushed the nextclade_assignment branch 2 times, most recently from 3425884 to 9b9a10d Compare January 9, 2024 21:49

j23414 force-pushed the nextclade_assignment branch 4 times, most recently from 1ee5ac5 to 2dce01b Compare January 31, 2024 21:55

j23414 force-pushed the nextclade_assignment branch from 2dce01b to de11274 Compare February 5, 2024 19:39

This was referenced Feb 6, 2024

Split by dengue serotype (denv1-denv4) #19

Closed

Harmonize with pathogen repo guide #22

Closed

Add E gene builds #17

Closed

Add workflow for producing the Nextclade dengue dataset #25

Merged

j23414 force-pushed the nextclade_assignment branch from 35ee5d3 to e4e78b0 Compare February 14, 2024 19:25

joverlee521 requested changes Feb 14, 2024

View reviewed changes

j23414 force-pushed the nextclade_assignment branch 2 times, most recently from 6c4facd to 37c800f Compare February 16, 2024 19:31

j23414 commented Feb 21, 2024

View reviewed changes

ingest/workflow/snakemake_rules/nextclade.smk Show resolved Hide resolved

j23414 force-pushed the nextclade_assignment branch 2 times, most recently from 3049169 to fbc972a Compare February 21, 2024 21:46

j23414 added 2 commits February 23, 2024 11:58

ingest/nextclade: Add nextclade dengue dataset

9a371df

ingest/nextclade: Restore nextclade assignment rules

e73fd31

Restore nextclade assignment rules to identify dengue subtypes The final subtype are added to the metadata_all.tsv file and subsequent metadata_denv1.tsv to metadata_denv4.tsv files.

j23414 and others added 9 commits February 23, 2024 11:59

ingest/nextclade: Update to Nextclade v3

cf42ced

ingest/nextclade: Add labeled example data

603cd24

ingest/nextclade: Delete the serotype nextclade dataset

467845f

Since we moved to using the ncbi dataset's 'virus_tax_id' field instead.

ingest/nextclade: Adjust input length on nextclade runs

299bbb4

Only run nextclade on sequences at least 1000 nt long, this would avoid misclassification of short sequences. The dengue E gene has an approximate length of 1400.

ingest: Update docs

b5dc665

j23414 force-pushed the nextclade_assignment branch from fbc972a to b5dc665 Compare February 23, 2024 20:09

j23414 merged commit 36a0bb1 into main Feb 24, 2024
32 checks passed

j23414 deleted the nextclade_assignment branch February 24, 2024 00:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nextclade assignment #16

Nextclade assignment #16

j23414 commented Dec 5, 2023 •

edited

Loading

joverlee521 Jan 5, 2024

j23414 Jan 8, 2024

j23414 Jan 9, 2024

joverlee521 Jan 9, 2024

j23414 Jan 9, 2024

joverlee521 Jan 10, 2024

ivan-aksamentov Jan 10, 2024 •

edited

Loading

ivan-aksamentov Jan 10, 2024 •

edited

Loading

ivan-aksamentov Jan 10, 2024 •

edited

Loading

corneliusroemer Jan 10, 2024

corneliusroemer commented Jan 10, 2024

j23414 commented Feb 14, 2024

joverlee521 left a comment

joverlee521 Feb 14, 2024

j23414 Feb 16, 2024

joverlee521 Feb 16, 2024

j23414 Feb 16, 2024

joverlee521 Feb 14, 2024

j23414 Feb 16, 2024

joverlee521 Feb 17, 2024

j23414 Feb 21, 2024 •

edited

Loading

j23414 commented Feb 22, 2024

		min_length=1000, # E gene length is approximately 1400
		min_seed_cover=0.1,

Nextclade assignment #16

Nextclade assignment #16

Conversation

j23414 commented Dec 5, 2023 • edited Loading

Description of proposed changes

Related issue(s)

Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ivan-aksamentov Jan 10, 2024 • edited Loading

Choose a reason for hiding this comment

ivan-aksamentov Jan 10, 2024 • edited Loading

Choose a reason for hiding this comment

ivan-aksamentov Jan 10, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

corneliusroemer commented Jan 10, 2024

j23414 commented Feb 14, 2024

joverlee521 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

j23414 Feb 21, 2024 • edited Loading

Choose a reason for hiding this comment

j23414 commented Feb 22, 2024

j23414 commented Dec 5, 2023 •

edited

Loading

ivan-aksamentov Jan 10, 2024 •

edited

Loading

ivan-aksamentov Jan 10, 2024 •

edited

Loading

ivan-aksamentov Jan 10, 2024 •

edited

Loading

j23414 Feb 21, 2024 •

edited

Loading