-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New ingest one snakefile #10
Conversation
Future commits will change this to work with Dengue data
If pathogen is not listed in nextclade_data, remove nextclade rules and scripts until it is added. https://github.com/nextstrain/nextclade_data/tree/release/data/datasets
If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)
Dengue requires special handling because it has multiple serotypes. Added dengue serotypes: all, denv1, denv2, denv3 and denv4 Co-authored-by: Jover Lee <joverlee521@gmail.com>
The dengue genome is approximately 10k or 11k. Therefore, we can filter out any sequences that are less than 5k or greater than 15k. A list of added GenBank filters is below: * Pull sequences longer than 5k but less than 15k * Only pull VRL (viral) datasets (no PAT or patents) * Pull UpdateDate_dt entry to potentially only pull "recent data sets" in case the dataset gets too large Co-authored-by: Jover Lee <joverlee521@gmail.com>
Since strain name may be in Isolate_s or Strain_s, we need to check both columns for a reasonable strain name. Dengue virus types denv1 to 4 can be derived if their NCBI taxon IDs are listed in ViralLineage_IDs. * derive strain name from Strain_s if Isolate_s is blank * derive denv1 to 4 depending on ViralLineage_IDs
* update help statement * make --outfile required * simplify reordering output columns * nuanced viruslineage_ids processing * when multiple paper urls, pick one * 'strain' and 'strain_s' were populated by 'Isolate_s' and 'Strain_s' pulled from genbank_url The following was added after discussion with trs Check for the non-"happy path" cases first and then return early (or erroring early, as the case may be). This leaves the "happy path" (or "expected path") as the remainder of the function. * return early if publications is empty Co-authored-by: Thomas Sibley <tom@zulutango.org>
Search for valid strain name in the following order: 'strain', 'strain_s', 'accession'. Move the order into configs instead of hardcoding it in the post_process_metadata.py script.
Co-authored-by: Thomas Sibley <tom@zulutango.org>
Since some strains (or isolates) may be resequenced resulting in duplicate strain names in the dengue dataset, index entries by GenBank Accession IDs.
Could not find genbank accession from GenBank or prior sequences.fasta.zst files.
Compromise by duplicating scripts from monkeypox until a generalized pathogen repository exists or these scripts get enfolded into an augur subcommands
Since fetch_from_genbank can query NCBI up to 5 times for each of the serotypes, try to limit concurrent queries to under 3. Using 2 to be cautious. Following the format shown at: nextstrain/ncov#1045
Since align may be running in 5 parallel jobs (all, denv1, denv2, denv3 denv4), reverted this rule to original code of using 1 thread. However, added a threads parameter in the align rule so that this is easy to modify.
To simplify the workflow, instead of post processing metadata to clean up strain names and set dengue serotype based on virus lineage ID after the transform step, incorporate post processing directly into the transform step. This step was moved above any manual annotations. This also simplified the code so we were not having two code blocks determining the final metadata columns which may have become inconsistent.
…n build system The running behavior is: * If `data` directory exists, proceed with build as normal. * If `s3_src` is pointing to a pre-cleaned dengue dataset, fetch dataset and proceed as normal. * Otherwise, fetch dataset using ingest, clean it, and proceed as normal. In order to implement the above, we are using Snakemake modules¹ which allows us to define a `prefix` for the ingest input/output paths. However, these do not get applied to the scripts called within `shelL` . So we are using this workaround² to add a base dir to the scripts. ¹ https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#modules ² https://stackoverflow.com/a/66890412 Co-authored-by: Jover Lee <joverlee521@gmail.com>
Changes copied from ncov-ingest repo: nextstrain/ncov-ingest@3dbf473
Only use the default configfiles if no configs are provided. This allows users to override the default configs with Snakemake commandline options.
Use the config to set the ingest basedir so that we're not using some arbitrary variable.
bca3415
to
dafdd12
Compare
This commit modifies the configurations and snakemake rules to treat geolocation-rules.tsv and annotations.tsv as input files instead of strings. This change addresses a warning message that occurs when using relative file paths starting with './'. The warning suggests omitting the './' to avoid redundancy and ensure consistent file matching in Snakemake. ``` Relative file path './source-data/geolocation-rules.tsv' starts with './'. This is redundant and strongly discouraged. It can also lead to inconsistent results of the file-matching approach used by Snakemake. You can simply omit the './' for relative file paths. ``` Furthermore, by treating source-data/annotations.tsv as a file, it provides hints to snakemake modules to add a prefix, such as "ingest/source-data/", if necessary. This modification aligns with the changes made in the following pull request: nextstrain/dengue#10
|
||
serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4'] | ||
|
||
rule all: | ||
input: | ||
auspice_json = expand("auspice/dengue_{serotype}.json", serotype=serotypes) | ||
|
||
module ingest_workflow: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't used modules myself, so I'm interested to know how you and @joverlee521 found them to work with - especially related to debugging / developing?
As a counterexample, I've been playing with an HBV build which uses our convention of separating out ingest data and code into ./ingest/
, but still runs everything from a top-level ./Snakemake
(which calls include: "ingest/ingest.smk"
).
The advantage of modules (as used in this context) is that they continue to allow ./ingest/
to function in a completely stand-alone way, which has obvious advantages. Perhaps using modules also enforces the separation of concerns between ingest + phylo in a desirable way? The cost is increased complexity. After seeing how complex the ncov
snakemake workflow became over the years I'm trying to gauge whether the trade-offs here are worth it.
) This commit modifies the configurations and snakemake rules to treat geolocation-rules.tsv and annotations.tsv as input files instead of strings. This change addresses a warning message that occurs when using relative file paths starting with './'. The warning suggests omitting the './' to avoid redundancy and ensure consistent file matching in Snakemake. ``` Relative file path './source-data/geolocation-rules.tsv' starts with './'. This is redundant and strongly discouraged. It can also lead to inconsistent results of the file-matching approach used by Snakemake. You can simply omit the './' for relative file paths. ``` Furthermore, by treating source-data/annotations.tsv as a file, it provides hints to snakemake modules to add a prefix, such as "ingest/source-data/", if necessary. This modification aligns with the changes made in the following pull request: nextstrain/dengue#10
The purpose of this was some early exploration of dafdd12 to support placing workflows in subdirectories. Closing this, since we ended up going a different direction: https://github.com/nextstrain/cli/releases/tag/8.2.0. |
Description of proposed changes
Explore using Snakemake modules for ingesting data into the Nextstrain build system.
This PR would merge into the
new_ingest
branch and continue thedengue ingest
work.The running behavior is:
data
directory exists, proceed with build as normal.s3_src
is pointing to a pre-cleaned dengue dataset, fetch dataset and proceed as normal.In order to implement the above, we are using Snakemake modules¹
which allows us to define a
prefix
for the ingest input/output paths.However, these do not get applied to the scripts called within
shelL
.So we are using this workaround² to add a base dir to the scripts.
¹ https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#modules
² https://stackoverflow.com/a/66890412
Testing
Can run a manual check with the following commands:
Although the following should still work too: