Refactor workflow #76

rneher · 2022-01-12T08:03:43Z

This PR refactors the old work flow that was spread out over files like Snakefile_base, Snakefile_WHO, etc into a structure that follows the recommendations. In particular, it aims at splitting the workflows into substeps that have defined input and output files to minimize the inter-dependencies of different parts of the workflow. Complex input functions are avoided as much as possible to increase legibility.

TODO

Rename new build outputs to match the current Auspice filenames (for example, flu_seasonal_h3n2_2y_ha.json from the refactored workflow needs to be flu_seasonal_h3n2_ha_2y.json to match the live builds). Some options include:
- Allow the build config to define an auspice_name that parallels but remains decoupled from the build_name so the build name could be flu_seasonal_{lineage}_6m but the auspice_name could be flu_seasonal_{lineage}_{{segment}}_6m. There is some precedent for this approach in the auspice_json_prefix option in the ncov workflow.
- Add a custom rule to rename public build outputs (e.g., with cp) like we do for dated JSONs in ncov.
- Update on Jan 25, 2023: This is now implemented for the public builds through custom renaming rule plus an auspice_name build parameter. To get all of the properly named public builds we just need to run: snakemake --configfile profiles/nextstrain-public.yaml -- all_public. This all_public target will generate the expected files in the auspice_renamed/ directory.
Root trees with the alignment reference and then remove that root sequence after rooting. Some options include:
- Implement proposal in Include option to prune outgroup in augur tree or augur refine augur#340
- Copy custom rule to prune outgroup after refine step from the workflow on master branch
- Update on Jan 27, 2023 by @huddlej: I ended up copying and modifying the prune outgroup script to prune the outgroup using the same reference FASTA used for alignment. I also moved the pruning immediately after tree building and removed the midpoint rooting logic from the "sanitize" script to avoid undoing the rooting that happens during tree building.
Port rules to build Auspice JSONs for private.nextflu.org
Verify that public and private builds run and deploy as expected

Running public builds (40 trees)

nextstrain build \
  --aws-batch \
  --detach \
  --cpus 36 \
  --memory 72gib \
  . \
  all_public \
  -p \
  --configfile profiles/nextstrain-public.yaml

Output is in auspice_renamed/.

Running private.nextflu.org builds (80 trees)

nextstrain build \
  --aws-batch \
  --detach \
  --cpus 36 \
  --memory 72gib \
  . \
  all_who \
  -p \
  --configfile profiles/private.nextflu.org.yaml

Output is in auspice-who/.

Future TODOs

Copy all reference sequences into their respective directories and split them into FASTA and GFF files (using a script to convert GenBank to FASTA + GFF)

huddlej

Thank you for putting this together, @rneher! This is already a major improvement over the current workflow both in the separation of concerns into separate rules and the abstraction of the build configs into separate config files. I like that we can use the powerful array builds approach when we want, but that the simpler flat config approach works just as well.

My biggest structural questions have to do with who we think will use this workflow:

Do we want to support this as a canonical seasonal flu workflow that other public health organizations can run on their own data like we do for SARS-CoV-2?
Can we support definition of specific input sequences and metadata in the config, when fauna inputs are not an option (both for our own internal uses and for external users).

In reviewing this and what a user's experience would be like working from GISAID, I noticed that we can now download ~20,000 sequences at once from GISAID. This means most people could run some powerful analyses from a couple of simple GISAID downloads. For an H3N2 HA and NA analysis, for example, this would produce two FASTA files like gisaid_epiflu_sequence_2015-2018_ha.fasta and gisaid_epiflu_sequence_2015-2018_na.fasta that would need to run through augur parse to get the separate sequences and metadata. Can we support this simple use case in this workflow even for our own work?

It would also be nice to eventually move all of the configuration bits out of the Snakefiles and into config files. For example, clock rates, gene names, etc. are all candidates to move into config files where they are more visible, easier for non-Snakemake experts to modify, etc.

Snakefile

huddlej · 2022-01-13T23:58:26Z

Snakefile

-        "data"
-    shell:
-        "rm -rfv {params}"
+        [f"auspice/{b}_ha.json" for b in config[build_dir + ""]] + \


I didn’t understand why we needed this additional syntax/functionality for composing build directories. Is this the approach to creating "WHO" and "live" builds?

I this should be rethought. this was mostly my quick way of hacking together various targets for testing. But being able to specify an auspice dir has been useful to run different analysis and profile in the same directory and not struggle with clean removing stuff you want to keep.

config/h1n1pdm/ha/genemap.gff

profiles/europe/builds.yaml

huddlej · 2022-01-14T00:13:32Z

profiles/europe/builds.yaml

@@ -0,0 +1,86 @@
+data_source: fauna


I like that fauna is abstracted away in this implementation.

Do we want to allow users (including us) to define their own input sequences/metadata as we do in ncov? This would provide a path for public health partners to use our workflow from GISAID downloads without needing access to fauna. The bigger question is: do we want to support this as a public Nextstrain workflow for flu?

huddlej · 2022-01-14T22:58:23Z

workflow/snakemake_rules/core.smk

+            --nthreads {threads}
+        """
+
+rule sanitize_trees:


Is there a reason we don't pair strains by segment earlier than this in the workflow? It seems like the metadata merge would be the earliest place to ensure we have the same strains. Then we wouldn't need this additional rule to sanitize the trees...

Just a note that the new script for joining metadata (scripts/join_metadata.py) supports inner joins of the data, so we could choose to limit analyses to strains with all requested segments at the beginning of the workflow. The steps that follow that join and work from the joined metadata should always pull the same strains for tree building...

huddlej · 2022-01-14T23:01:36Z

workflow/snakemake_rules/export.smk

+    if config.get('titer-models',False):
+        inputs.append(rules.titers_sub.output.titers_model)
+        inputs.append(rules.titers_tree.output.titers_model)
+    if config.get('glycosilation', False) and wildcards.segment in ['ha', 'na']:
+        inputs.append(rules.glyc.output.glyc)
+    if config.get('lbi', False) and wildcards.segment in ['ha', 'na']:
+        inputs.append(rules.lbi.output.lbi)


These top-level boolean keys like titer-models don't allow us to limit which segments to apply the models to from the config, so we'll run titer models on NA, PB1, etc. Could we move these types of parameters into the build definitions (like the vaccines option below) instead so we have more control over which builds use which annotations?

Then we can eliminate all of these additional segment tests that happen for glycosylation, LBI, etc.

huddlej · 2022-01-14T23:03:46Z

workflow/snakemake_rules/select_strains.smk

+def get_subsample_input(w):
+    files = {"metadata": f"data/{config['builds'][w.build_name]['lineage']}/metadata.tsv"}
+    if config['builds'][w.build_name]['subsamples'][w.subsample].get('priorities', '')=='titers':
+        files['titers']=build_dir + f"/{w.build_name}/titer_priorities.tsv"
+    return files
+
+rule subsample:
+    input:
+        unpack(get_subsample_input)


Instead of unpacking output from a wildcard function like this, what if we allow users to define the “titer priorities” file as a subsample input in the config file (e.g., priorities_file: "{build_name}/titer_priorities.tsv")? This would allow Snakemake to build the titer priorities on the fly, as needed, and also allow users to define their own priorities file. This would decouple the titers bit from the priorities bit.

workflow/snakemake_rules/select_strains.smk

huddlej · 2022-01-20T21:21:42Z

workflow/snakemake_rules/select_strains.smk

+and the meta data subset.
+
+input:
+ - "data/lineage/metadata.tsv"


In our discussion today, we want to support users who download separate segment FASTAs from GISAID. We may need to support files like data/lineage/metadata_segment.tsv

huddlej · 2022-01-20T21:42:45Z

@joverlee521 made the point that we also want to support example data from open data with an example Auspice JSON so users can confirm that the basic workflow will work for them as a tutorial.

We also need to move the distance files, fitness components, etc. still.

I will try running the workflow with standard GISAID downloads and determine what minimal changes need to happen for this to work.

workflow/snakemake_rules/download_from_fauna.smk

huddlej · 2022-05-24T22:31:03Z

workflow/snakemake_rules/core.smk

+
+localrules: clades, sanitize_trees
+
+build_dir = config.get("build_dir", "builds")


Using this workflow more, I was surprised by this directory as the place to look for my output. The rest of our workflows use results/ with the contents of that directory depending on the pathogen workflow (zika stores a single set of outputs top-level while ncov nests build-specific outputs in subdirectories). Can we switch back to that here?

Gene maps represent configuration data and not workflow functionality. This change sets us up to move these data into a configuration file in the future.

Fixes several bugs that I added during the port from the old workflow to the new including: - passing genes as a space-delimited list instead of comma-delimited - setting the same max date for all regions, so pivot arrays are the same size during the global frequency estimate step - quote region-specific translations file names since they have spaces in their names

Fixes a typo in the "--gene-map" argument to the entropy command and passes the alignments for gene translations with the internal nodes to the sequence export script. Adds the missing sequence export script.

Fixes a bug with the entropy panel caused by passing a comma-delimited list where a space-delimited list was expected.

The public builds rely on CDC ferret titers only which are not available yet in our S3 bucket, so we need to download from fauna for now.

Keeps 6-year and 12-year Yam trees for historical reasons.

Actually, we don't seem to be running these builds on the public site currently, so there is no need to keep updating the 6- and 12-year builds.

The public flu builds that don't use multiple forecasting models don't need to use the HI antigenic novelty model with a "cell_fra" suffix to distinguish it from other similar models.

Restores public Vic clade annotations.

Allows the list of regions to be a superset of the list of frequencies available per region, to handle the case when a given region has no data for a build and therefore has no frequencies. The new logic checks for the presence of each given region name (e.g., "Japan Korea") in a given frequencies JSON filename to find a list of available regions.

Fixes a data-dependent bug when the workflow tries to estimate region-specific frequencies for a region that has no aligned amino acid sequences. This fix includes two important steps: 1. augur filter run on amino acid sequence alignments to select data for a specific region is allowed to have empty output (printing only a warning, so we can track these events in logs), so the workflow can create an empty FASTA file for regions without data. These empty files represent that state of missing data. This implementation uses a new argument for augur filter provided in version 21.1.0. This commit updates the Conda environment accordingly. 2. The workflow conditionally builds a list of regional mutation frequencies to estimate for each build based on which amino acid sequence files are not empty. Because there can be multiple "genes" per segment, the workflow requires each region to have non-empty amino acid sequences for all genes in a segment. This implementation relies on converting the augur filter rule to a "checkpoint", so Snakemake knows to update the DAG after that rule runs. The aggregation of which region frequencies to estimate happens by calling an input function to the final global frequencies rule that checks the content of the filtered FASTA files and builds the list of region frequency JSONs to create. This last step works because the global frequency estimation script accepts a list of all possible regions and a list of available frequency JSONs.

Sorts rows in the measurements panel by the minimum y-axis position of the clade associated with a reference strain, clade, haplotype, or serum id in descending order. This sorting aligns the measurements panel entries more closely to the phylogenetic layout. Within each clade, rows are sorted alphabetically in ascending order, to facilitate lookups by name.

Adds vaccine strains recently selected for H1, corrects H3 vaccine strains, and adds recent egg-passaged vaccine strains to the JSONs, since these strains are included in our private nextflu builds.

Corrects the nucleotide positions of specific alleles expected per clade to reflect the new reference sequence coordinates for H3N2 and Vic.

Adds a `year_month` annotation through the existing `epiweeks` annotation that provides a coarser grain filtering by date. This kind of filtering is most useful in narrative writing, where we want to easily update a prior filter query string to reflect the latest data. Adding a new month to a list of year/months is much easier than adding all of the epiweeks.

Fix merge conflicts with `git merge -s ours master`, which effectively ignores changes in `master` when there are conflicts. Subsequent commits will add back changes that we _do_ want from master.

Corrects the HA gene coordinates for B/Vic such that HA1 and HA2 translations have the expected lengths (347 and 223, respectively) where previously the HA1 sequence was 1 amino acid longer at the end and the HA2 sequence was 1 shorter at the beginning. Documents the source of the new coordinates with the PDB record and corresponding publication.

Applies a standard UNIX sort to all outliers files, to simplify merging with outliers from `master` branch.

Updates outliers for H1N1pdm and H3N2 that were present in the master branch and not in this branch.

Uses UNIX sort -k 1,1 to sort strains ahead of merge with strain lists from master branch.

Updates lists of reference strains for H1N1pdm, H3N2, and Vic using strains in the master branch.

This adds GH Action workflows for both the Nextstrain public builds and the private Nextflu builds. The `nextstrain build` commands are copied from the example commands in #76

huddlej reviewed Jan 14, 2022

View reviewed changes

huddlej reviewed Jan 20, 2022

View reviewed changes

corneliusroemer reviewed Jan 21, 2022

View reviewed changes

workflow/snakemake_rules/download_from_fauna.smk Outdated Show resolved Hide resolved

huddlej mentioned this pull request Apr 19, 2022

Add measurements panel in nextflu-private builds #81

Closed

joverlee521 self-requested a review April 21, 2022 01:14

huddlej reviewed May 24, 2022

View reviewed changes

rneher and others added 23 commits June 24, 2022 14:11

added titer models

94fb775

fix annotations

b72fe40

make titer rule wait for translations

84e7815

add wildcard constraints

082de81

add standard build and second array build

a4453ca

small edits

41f6eb6

add glyc and lbi

03a37b9

add vic and yam configs

2f9e552

add europe profile

a500b95

extend nextstrain profile

b37eabb

fixes and add forgotten fitness.smk

f3e6ca9

add simple example profile

e85fa19

add vaccine markers

7297113

remove some old scripts

e504092

add scicore cluster config

2220c22

make download rules local rules

b40e016

small changes in profiles and strain selection

ff480b1

add stub to make report figures

efb8929

add allflu profile

2d3ea21

add missing report rules file

5f75284

Fix misspelling of "glycosylation"

b61f2f6

Remove unnecessary list comprehension

a487279

Define gene maps with a dict instead of a function

37f96a9

Gene maps represent configuration data and not workflow functionality. This change sets us up to move these data into a configuration file in the future.

huddlej added 8 commits February 24, 2023 16:20

Fix bugs with sequence export and entropy

aa39e27

Fixes a typo in the "--gene-map" argument to the entropy command and passes the alignments for gene translations with the internal nodes to the sequence export script. Adds the missing sequence export script.

Pass space-delimited list of genes to entropy

56f1604

Fixes a bug with the entropy panel caused by passing a comma-delimited list where a space-delimited list was expected.

Use fauna to download data for public builds

bcbe56c

The public builds rely on CDC ferret titers only which are not available yet in our S3 bucket, so we need to download from fauna for now.

Don't build recent Yam trees

36d39ce

Keeps 6-year and 12-year Yam trees for historical reasons.

Don't run Yam builds at all

38f3202

Actually, we don't seem to be running these builds on the public site currently, so there is no need to keep updating the 6- and 12-year builds.

Use cross-immunity metric without cell FRA suffix

e1f455f

The public flu builds that don't use multiple forecasting models don't need to use the HI antigenic novelty model with a "cell_fra" suffix to distinguish it from other similar models.

Only use emerging Vic clades in private group

de78753

Restores public Vic clade annotations.

joverlee521 mentioned this pull request Mar 8, 2023

filter: Add --empty-results-reporting={error,warn,skip} option nextstrain/augur#1175

Merged

1 task

huddlej and others added 12 commits March 14, 2023 12:25

Update vaccine strains

5c9a1a5

Adds vaccine strains recently selected for H1, corrects H3 vaccine strains, and adds recent egg-passaged vaccine strains to the JSONs, since these strains are included in our private nextflu builds.

Fix clade definitions based on nucleotides

f3ccb9e

Corrects the nucleotide positions of specific alleles expected per clade to reflect the new reference sequence coordinates for H3N2 and Vic.

Merge branch 'master' into refactor-workflow

94a3ed4

Fix merge conflicts with `git merge -s ours master`, which effectively ignores changes in `master` when there are conflicts. Subsequent commits will add back changes that we _do_ want from master.

Sort outliers with sort -k 1,1

62659e2

Applies a standard UNIX sort to all outliers files, to simplify merging with outliers from `master` branch.

Add outliers from master branch

936689b

Updates outliers for H1N1pdm and H3N2 that were present in the master branch and not in this branch.

Sort reference strains

692a4cc

Uses UNIX sort -k 1,1 to sort strains ahead of merge with strain lists from master branch.

Add reference strains from master branch

b0f444f

Updates lists of reference strains for H1N1pdm, H3N2, and Vic using strains in the master branch.

joverlee521 force-pushed the refactor-workflow branch from 16065ea to b0f444f Compare March 31, 2023 22:16

Add auspice_renamed/ to .gitignore

8bdf310

joverlee521 merged commit d2ce80e into master Mar 31, 2023

joverlee521 deleted the refactor-workflow branch March 31, 2023 22:21

This was referenced Nov 27, 2023

Use augur filter --min/max-date #83

Closed

select_strains.py fails if metadata age column has missing values #72

Closed

huddlej mentioned this pull request Jan 30, 2024

Use nextalign instead of custom Python codon alignment script #64

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor workflow #76

Refactor workflow #76

rneher commented Jan 12, 2022 •

edited by joverlee521

Loading

huddlej left a comment

huddlej Jan 13, 2022

rneher Jan 15, 2022

huddlej Jan 14, 2022

huddlej Jan 14, 2022

huddlej May 25, 2022

huddlej Jan 14, 2022

huddlej Jan 14, 2022

huddlej Jan 20, 2022

huddlej commented Jan 20, 2022

huddlej May 24, 2022


		localrules: clades, sanitize_trees

		build_dir = config.get("build_dir", "builds")

Refactor workflow #76

Refactor workflow #76

Conversation

rneher commented Jan 12, 2022 • edited by joverlee521 Loading

TODO

Running public builds (40 trees)

Running private.nextflu.org builds (80 trees)

Future TODOs

huddlej left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

huddlej commented Jan 20, 2022

Choose a reason for hiding this comment

rneher commented Jan 12, 2022 •

edited by joverlee521

Loading