Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor workflow #76

Merged
merged 257 commits into from
Mar 31, 2023
Merged

Refactor workflow #76

merged 257 commits into from
Mar 31, 2023

Conversation

rneher
Copy link
Member

@rneher rneher commented Jan 12, 2022

This PR refactors the old work flow that was spread out over files like Snakefile_base, Snakefile_WHO, etc into a structure that follows the recommendations. In particular, it aims at splitting the workflows into substeps that have defined input and output files to minimize the inter-dependencies of different parts of the workflow. Complex input functions are avoided as much as possible to increase legibility.

TODO

  • Rename new build outputs to match the current Auspice filenames (for example, flu_seasonal_h3n2_2y_ha.json from the refactored workflow needs to be flu_seasonal_h3n2_ha_2y.json to match the live builds). Some options include:
    • Allow the build config to define an auspice_name that parallels but remains decoupled from the build_name so the build name could be flu_seasonal_{lineage}_6m but the auspice_name could be flu_seasonal_{lineage}_{{segment}}_6m. There is some precedent for this approach in the auspice_json_prefix option in the ncov workflow.
    • Add a custom rule to rename public build outputs (e.g., with cp) like we do for dated JSONs in ncov.
    • Update on Jan 25, 2023: This is now implemented for the public builds through custom renaming rule plus an auspice_name build parameter. To get all of the properly named public builds we just need to run: snakemake --configfile profiles/nextstrain-public.yaml -- all_public. This all_public target will generate the expected files in the auspice_renamed/ directory.
  • Root trees with the alignment reference and then remove that root sequence after rooting. Some options include:
  • Port rules to build Auspice JSONs for private.nextflu.org
  • Verify that public and private builds run and deploy as expected

Running public builds (40 trees)

nextstrain build \
  --aws-batch \
  --detach \
  --cpus 36 \
  --memory 72gib \
  . \
  all_public \
  -p \
  --configfile profiles/nextstrain-public.yaml

Output is in auspice_renamed/.

Running private.nextflu.org builds (80 trees)

nextstrain build \
  --aws-batch \
  --detach \
  --cpus 36 \
  --memory 72gib \
  . \
  all_who \
  -p \
  --configfile profiles/private.nextflu.org.yaml

Output is in auspice-who/.

Future TODOs

  • Copy all reference sequences into their respective directories and split them into FASTA and GFF files (using a script to convert GenBank to FASTA + GFF)

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for putting this together, @rneher! This is already a major improvement over the current workflow both in the separation of concerns into separate rules and the abstraction of the build configs into separate config files. I like that we can use the powerful array builds approach when we want, but that the simpler flat config approach works just as well.

My biggest structural questions have to do with who we think will use this workflow:

  1. Do we want to support this as a canonical seasonal flu workflow that other public health organizations can run on their own data like we do for SARS-CoV-2?
  2. Can we support definition of specific input sequences and metadata in the config, when fauna inputs are not an option (both for our own internal uses and for external users).

In reviewing this and what a user's experience would be like working from GISAID, I noticed that we can now download ~20,000 sequences at once from GISAID. This means most people could run some powerful analyses from a couple of simple GISAID downloads. For an H3N2 HA and NA analysis, for example, this would produce two FASTA files like gisaid_epiflu_sequence_2015-2018_ha.fasta and gisaid_epiflu_sequence_2015-2018_na.fasta that would need to run through augur parse to get the separate sequences and metadata. Can we support this simple use case in this workflow even for our own work?

It would also be nice to eventually move all of the configuration bits out of the Snakefiles and into config files. For example, clock rates, gene names, etc. are all candidates to move into config files where they are more visible, easier for non-Snakemake experts to modify, etc.

Snakefile Show resolved Hide resolved
Snakefile Outdated
"data"
shell:
"rm -rfv {params}"
[f"auspice/{b}_ha.json" for b in config[build_dir + ""]] + \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t understand why we needed this additional syntax/functionality for composing build directories. Is this the approach to creating "WHO" and "live" builds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I this should be rethought. this was mostly my quick way of hacking together various targets for testing. But being able to specify an auspice dir has been useful to run different analysis and profile in the same directory and not struggle with clean removing stuff you want to keep.

config/h1n1pdm/ha/genemap.gff Show resolved Hide resolved
profiles/europe/builds.yaml Outdated Show resolved Hide resolved
@@ -0,0 +1,86 @@
data_source: fauna
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that fauna is abstracted away in this implementation.

Do we want to allow users (including us) to define their own input sequences/metadata as we do in ncov? This would provide a path for public health partners to use our workflow from GISAID downloads without needing access to fauna. The bigger question is: do we want to support this as a public Nextstrain workflow for flu?

--nthreads {threads}
"""

rule sanitize_trees:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we don't pair strains by segment earlier than this in the workflow? It seems like the metadata merge would be the earliest place to ensure we have the same strains. Then we wouldn't need this additional rule to sanitize the trees...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that the new script for joining metadata (scripts/join_metadata.py) supports inner joins of the data, so we could choose to limit analyses to strains with all requested segments at the beginning of the workflow. The steps that follow that join and work from the joined metadata should always pull the same strains for tree building...

Comment on lines 16 to 22
if config.get('titer-models',False):
inputs.append(rules.titers_sub.output.titers_model)
inputs.append(rules.titers_tree.output.titers_model)
if config.get('glycosilation', False) and wildcards.segment in ['ha', 'na']:
inputs.append(rules.glyc.output.glyc)
if config.get('lbi', False) and wildcards.segment in ['ha', 'na']:
inputs.append(rules.lbi.output.lbi)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These top-level boolean keys like titer-models don't allow us to limit which segments to apply the models to from the config, so we'll run titer models on NA, PB1, etc. Could we move these types of parameters into the build definitions (like the vaccines option below) instead so we have more control over which builds use which annotations?

Then we can eliminate all of these additional segment tests that happen for glycosylation, LBI, etc.

Comment on lines 43 to 48
def get_subsample_input(w):
files = {"metadata": f"data/{config['builds'][w.build_name]['lineage']}/metadata.tsv"}
if config['builds'][w.build_name]['subsamples'][w.subsample].get('priorities', '')=='titers':
files['titers']=build_dir + f"/{w.build_name}/titer_priorities.tsv"
return files

rule subsample:
input:
unpack(get_subsample_input)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of unpacking output from a wildcard function like this, what if we allow users to define the “titer priorities” file as a subsample input in the config file (e.g., priorities_file: "{build_name}/titer_priorities.tsv")? This would allow Snakemake to build the titer priorities on the fly, as needed, and also allow users to define their own priorities file. This would decouple the titers bit from the priorities bit.

workflow/snakemake_rules/select_strains.smk Outdated Show resolved Hide resolved
workflow/snakemake_rules/select_strains.smk Outdated Show resolved Hide resolved
and the meta data subset.

input:
- "data/lineage/metadata.tsv"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our discussion today, we want to support users who download separate segment FASTAs from GISAID. We may need to support files like data/lineage/metadata_segment.tsv

@huddlej
Copy link
Contributor

huddlej commented Jan 20, 2022

@joverlee521 made the point that we also want to support example data from open data with an example Auspice JSON so users can confirm that the basic workflow will work for them as a tutorial.

We also need to move the distance files, fitness components, etc. still.

I will try running the workflow with standard GISAID downloads and determine what minimal changes need to happen for this to work.


localrules: clades, sanitize_trees

build_dir = config.get("build_dir", "builds")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using this workflow more, I was surprised by this directory as the place to look for my output. The rest of our workflows use results/ with the contents of that directory depending on the pathogen workflow (zika stores a single set of outputs top-level while ncov nests build-specific outputs in subdirectories). Can we switch back to that here?

Fixes several bugs that I added during the port from the old workflow to
the new including:

  - passing genes as a space-delimited list instead of comma-delimited
  - setting the same max date for all regions, so pivot arrays are the
    same size during the global frequency estimate step
  - quote region-specific translations file names since they have spaces
    in their names
Fixes a typo in the "--gene-map" argument to the entropy command and
passes the alignments for gene translations with the internal nodes to
the sequence export script. Adds the missing sequence export script.
Fixes a bug with the entropy panel caused by passing a comma-delimited
list where a space-delimited list was expected.
The public builds rely on CDC ferret titers only which are not available
yet in our S3 bucket, so we need to download from fauna for now.
Keeps 6-year and 12-year Yam trees for historical reasons.
Actually, we don't seem to be running these builds on the public site
currently, so there is no need to keep updating the 6- and 12-year
builds.
The public flu builds that don't use multiple forecasting models don't
need to use the HI antigenic novelty model with a "cell_fra" suffix to
distinguish it from other similar models.
Restores public Vic clade annotations.
huddlej and others added 12 commits March 14, 2023 12:25
Allows the list of regions to be a superset of the list of frequencies
available per region, to handle the case when a given region has no data
for a build and therefore has no frequencies. The new logic checks for
the presence of each given region name (e.g., "Japan Korea") in a given
frequencies JSON filename to find a list of available regions.
Fixes a data-dependent bug when the workflow tries to estimate
region-specific frequencies for a region that has no aligned amino acid
sequences. This fix includes two important steps:

1. augur filter run on amino acid sequence alignments to select data for
a specific region is allowed to have empty output (printing only a
warning, so we can track these events in logs), so the workflow can
create an empty FASTA file for regions without data. These empty files
represent that state of missing data. This implementation uses a new
argument for augur filter provided in version 21.1.0. This commit
updates the Conda environment accordingly.

2. The workflow conditionally builds a list of regional mutation
frequencies to estimate for each build based on which amino acid
sequence files are not empty. Because there can be multiple "genes" per
segment, the workflow requires each region to have non-empty amino acid
sequences for all genes in a segment. This implementation relies on
converting the augur filter rule to a "checkpoint", so Snakemake knows
to update the DAG after that rule runs. The aggregation of which region
frequencies to estimate happens by calling an input function to the
final global frequencies rule that checks the content of the filtered
FASTA files and builds the list of region frequency JSONs to create.

This last step works because the global frequency estimation script
accepts a list of all possible regions and a list of available frequency
JSONs.
Sorts rows in the measurements panel by the minimum y-axis position of
the clade associated with a reference strain, clade, haplotype, or serum
id in descending order. This sorting aligns the measurements panel
entries more closely to the phylogenetic layout. Within each clade, rows
are sorted alphabetically in ascending order, to facilitate lookups by
name.
Adds vaccine strains recently selected for H1, corrects H3 vaccine
strains, and adds recent egg-passaged vaccine strains to the JSONs,
since these strains are included in our private nextflu builds.
Corrects the nucleotide positions of specific alleles expected per clade
to reflect the new reference sequence coordinates for H3N2 and Vic.
Adds a `year_month` annotation through the existing `epiweeks`
annotation that provides a coarser grain filtering by date. This kind of
filtering is most useful in narrative writing, where we want to easily
update a prior filter query string to reflect the latest data. Adding a
new month to a list of year/months is much easier than adding all of the
epiweeks.
Fix merge conflicts with `git merge -s ours master`, which
effectively ignores changes in `master` when there are conflicts.

Subsequent commits will add back changes that we _do_ want from master.
Corrects the HA gene coordinates for B/Vic such that HA1 and HA2
translations have the expected lengths (347 and 223, respectively) where
previously the HA1 sequence was 1 amino acid longer at the end and the
HA2 sequence was 1 shorter at the beginning. Documents the source of the
new coordinates with the PDB record and corresponding publication.
Applies a standard UNIX sort to all outliers files, to simplify merging
with outliers from `master` branch.
Updates outliers for H1N1pdm and H3N2 that were present in the master
branch and not in this branch.
Uses UNIX sort -k 1,1 to sort strains ahead of merge with strain lists
from master branch.
Updates lists of reference strains for H1N1pdm, H3N2, and Vic using
strains in the master branch.
@joverlee521 joverlee521 merged commit d2ce80e into master Mar 31, 2023
@joverlee521 joverlee521 deleted the refactor-workflow branch March 31, 2023 22:21
joverlee521 added a commit that referenced this pull request Apr 28, 2023
This adds GH Action workflows for both the Nextstrain public builds
and the private Nextflu builds.

The `nextstrain build` commands are copied from the example commands in
#76
joverlee521 added a commit that referenced this pull request Apr 28, 2023
This adds GH Action workflows for both the Nextstrain public builds
and the private Nextflu builds.

The `nextstrain build` commands are copied from the example commands in
#76
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

4 participants