Migrate to Nextclade v3 #281

joverlee521 · 2024-10-18T23:46:05Z

Description of proposed changes

Migrates the ingest workflow to Nextclade v3, which simplifies the workflow since we no longer need to use two Nextclade datasets. Also includes cleanups that switch to `augur curate rename` and `augur merge`; see the commits for details.

Related issue(s)

Resolves #280

Checklist

Checks pass
Trial run of ingest workflow

Since the v3 Nextclade datasets for MPXV and hMPXV both use NC_063383.1 as the reference, we can just run Nextclade once using the MPXV dataset.

Move the downloaded Nextclade dataset to the `data/` directory so that it's automatically ignored by git. Move Nextclade outputs to `results` as they are part of the final outputs of the ingest workflow.

Modified from <nextstrain/measles@faebd64>. I switched the order of `augur curate rename` and `tsv-select` because the Nextclade data includes fields that we don't use that hit the csv field size limit. This commit also includes automated reformatting by `snakefmt`.
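For context, Python's standard `csv` module rejects any single field longer than its configured limit (131,072 characters by default). Assuming that is the limit referred to here, the sketch below reproduces the failure with a made-up oversized column. The helper `parses_cleanly` and the column names are hypothetical and are not taken from the workflow or the Nextclade output.

```python
import csv
import io


def parses_cleanly(tsv_text: str) -> bool:
    """Return True if every row of the TSV text can be read by csv.DictReader."""
    try:
        for _ in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
            pass
    except csv.Error:
        # Raised, for example, when a field exceeds csv.field_size_limit().
        return False
    return True


# Build one field that is a single character longer than the current limit.
limit = csv.field_size_limit()
oversized = "A" * (limit + 1)

# Hypothetical table with an unused, very large column.
wide = f"seqName\tunused_column\nseq1\t{oversized}\n"
# The same table after the unused column has been dropped upstream.
narrow = "seqName\nseq1\n"

print(parses_cleanly(wide))
print(parses_cleanly(narrow))
```

Dropping the unused columns before any `csv`-based step sees the data, as this commit does by running `tsv-select` first, sidesteps the limit without having to raise it.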

Modified from <nextstrain/measles@4d73b7f>

Update the `files_to_upload` to match the output files from Nextclade v3.

Remove reference to nextalign and update available files to match the output files from Nextclade v3.

joverlee521 · 2024-10-21T23:43:21Z

Trial run successfully completed and uploaded results to s3://nextstrain-data/files/workflows/mpox/branch/ingest-nextclade-v3/

There are many changes across the metadata.tsv compared to the production metadata.tsv because of changes in the Nextclade columns. A majority of these changes come from the divergence column, which is expected because the v3 MPXV Nextclade dataset uses NC_063383.1 as the reference instead of a reconstructed ancestor.

I took a quick look at clade changes:

- A majority of clade changes are a result of the new Ia & Ib clade distinctions
- 42 sequences that used to fail now get assigned clades
- 12 sequences that were labelled as outgroup now fail

Detailed clade change counts

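| prod_clade | v3_clade | count |
| --- | --- | --- |
|  | I | 6 |
|  | II | 4 |
|  | IIa | 5 |
|  | IIb | 10 |
|  | Ia | 2 |
|  | outgroup | 15 |
| I | II | 1 |
| I | IIb | 1 |
| I | Ia | 518 |
| I | Ib | 133 |
| I | outgroup | 22 |
| II | IIa | 1 |
| II | IIb | 14 |
| II | outgroup | 7 |
| IIa | II | 2 |
| IIa | IIb | 24 |
| IIa | outgroup | 1 |
| IIb | outgroup | 2 |
| outgroup |  | 12 |
| outgroup | I | 1 |
| outgroup | II | 4 |
| outgroup | IIa | 4 |
| outgroup | IIb | 28 |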
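A tally like the one above could be produced along these lines. This is a minimal sketch, not the commands actually used for this comparison: it assumes both metadata files are TSVs keyed by an `accession` column with the assignment in a `clade` column, and the function name `clade_change_counts` is hypothetical. An empty clade value stands for a sequence with no clade assigned.

```python
import csv
import io
from collections import Counter


def clade_change_counts(prod_tsv, v3_tsv, id_col="accession", clade_col="clade"):
    """Count (old_clade, new_clade) pairs for sequences whose clade changed.

    Both arguments are open text handles for tab-separated metadata.
    The column names are assumptions and may differ in the real files.
    """
    def read_clades(handle):
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[id_col]: row[clade_col] for row in reader}

    prod = read_clades(prod_tsv)
    v3 = read_clades(v3_tsv)

    counts = Counter()
    for seq_id, old_clade in prod.items():
        if seq_id not in v3:
            continue  # only compare sequences present in both files
        new_clade = v3[seq_id]
        if old_clade != new_clade:
            counts[(old_clade, new_clade)] += 1
    return counts


# Tiny made-up inputs to show the shape of the result.
prod_example = io.StringIO("accession\tclade\nA\tI\nB\tI\nC\t\nD\tIIb\n")
v3_example = io.StringIO("accession\tclade\nA\tIa\nB\tIa\nC\tIIb\nD\tIIb\n")
example_counts = clade_change_counts(prod_example, v3_example)
for (old, new), n in sorted(example_counts.items()):
    print(f"{old}\t{new}\t{n}")
```

With real files, the two handles would come from `open("metadata.tsv", newline="")` on the production and trial-run copies.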

These changes seem reasonable to me, but I would like a quick review from @corneliusroemer

ingest/rules/nextclade.smk

Adding log/benchmark to all rules in nextclade.smk according to the Nextstrain Snakemake style guide <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html>

Chatter on Slack regarding workflow runtimes made me realize that since we only run a single Nextclade job now, we no longer need to put a hard limit on the threads. <https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1729629899871479?thread_ts=1729629155.721859&cid=C01LCTT7JNN>

joverlee521 · 2024-10-22T23:33:51Z

ingest/rules/nextclade.smk

@@ -30,7 +30,7 @@ rule run_nextclade:
        # The lambda is used to deactivate automatic wildcard expansion.
        # https://github.com/snakemake/snakemake/blob/384d0066c512b0429719085f2cf886fdb97fd80a/snakemake/rules.py#L997-L1000
        translations=lambda w: "results/translations/{cds}.fasta",
-    threads: 4
+    threads: workflow.cores


This shaved off 8 mins for the Nextclade job, but only 3 mins off the total workflow.

Using all of the cores prevents Snakemake from scheduling other concurrent jobs such as upload_to_s3 jobs. The upload jobs can be slow as well, so we could adjust the threads to something like workflow.cores * 0.75.

It's only a difference of a couple of minutes, so I'm not going to worry about it for now.
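If the fractional approach is revisited, note that `workflow.cores * 0.75` evaluates to a float, so it would likely need rounding to a whole number of at least 1. A minimal sketch of that arithmetic follows; the helper name `nextclade_threads` is hypothetical and not part of the workflow.

```python
def nextclade_threads(total_cores: int, fraction: float = 0.75) -> int:
    """Return a whole number of threads that leaves some cores free.

    Rounds down so the result never exceeds the requested share, and
    never returns fewer than one thread.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return max(1, int(total_cores * fraction))


# In a Snakefile, total_cores would be workflow.cores (an assumption here).
for cores in (1, 2, 4, 8):
    print(cores, nextclade_threads(cores))
```

Rounding down keeps at least one core free for other jobs whenever more than one core is available.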

joverlee521 added 6 commits October 18, 2024 16:16

ingest: Update to nextclade3

3850af5

ingest: move Nextclade dataset to data/ and outputs to results/

cbcc7d4

ingest: Merge Nextclade metadata with augur merge

54ca356

ingest/nextstrain_automation: Update files_to_upload

475b64f

phylogenetic: Update description.md for Nextclade v3

6b5153c

joverlee521 force-pushed the ingest-nextclade-v3 branch from 98b856c to 6b5153c Compare October 19, 2024 00:06

genehack approved these changes Oct 22, 2024

joverlee521 added 2 commits October 22, 2024 11:19

ingest/nextclade: Add log and benchmark

42c1040

joverlee521 commented Oct 22, 2024

genehack approved these changes Oct 25, 2024

joverlee521 merged commit 4ce0b5b into master Oct 30, 2024
14 checks passed

joverlee521 deleted the ingest-nextclade-v3 branch October 30, 2024 16:14

This was referenced Oct 30, 2024

phylogenetic: Clade I build failed during filter #283

Closed

Fix Clade I build #284

Merged
