Generalize Ingest #6
I haven't done a detailed review, but a couple high-level comments to start off:
|
This is great, @j23414! It's a big lift to convert tidyverse logic to pandas logic especially since pandas has a huge learning curve and remains confusing in many ways even to people who have used it for years. Most of the comments below touch on Python conventions used throughout the other Nextstrain Python codebase, but there is an important note about the mapping of NCBI lineage ids to serotype names that's more of a data formatting consideration.
Point 2 is addressed with 2c8553a.
Re: Point 1, I still disagree... mostly because I'm currently trying to propagate changes across zika and ebola ingests, which is already tedious when copying changes across their snakefiles. One might argue to only polish ingest scripts on one repo (dengue); however, in my experience that results in over-specialized scripts that become really difficult to generalize later. To meet the final end-goal of Point 1 (if not the immediate developmental path of Point 1), I'm happy to make final copies in a future PR after I'm sure the pipeline works for:
|
@j23414 Nod. I don't want to get in the way of your active development. As a final state, though, I still think we should not rely on this sort of dynamic downloading. It has all the problems of a package management system (dependencies, versioning, updates, etc.) without any of the solutions a mature package management system provides. There are also issues with the brittle, close coupling it introduces. Consider that someone changing files in the monkeypox workflow would (rightfully so, I'd argue) not think that modifying one of the programs in |
040c503 to 0ca3400
I normally review PRs commit-by-commit, but this was not feasible here given the commit structure and sequence. Instead, I reviewed this PR as one large diff after excluding the initial verbatim copy of the ingest machinery from monkeypox ("add ingest from monkeypox repo" (645acb2)).
# HEAD was "fix: switch to augur curate normalize-strings" (b41fb45)
git diff --compact-summary --patch 645acb2..@
This exclusion helped reduce the amount under review, since it's been reviewed previously, and highlight what had to change from monkeypox to here.
A couple general top-level comments, before specifics:
- It'd be good to incorporate changes from Parameterize ncbi_id in fetch_sequences mpox#146 here, one way or another. I'm sure that's your plan, but I'm calling it out so we don't forget.
- Running nextstrain build ingest produces files in ingest/.snakemake/ and ingest/logs/ that should be in a git ignore file.
b9865c9 to c932c00
099600c to 430ff78
Working on rebasing this! Do not review yet! Thanks @victorlin!!! |
The dengue genome is approximately 10k to 11k bases long. Therefore, we can filter out any sequences that are shorter than 5k or longer than 15k. A list of added GenBank filters is below:
* Pull sequences longer than 5k but shorter than 15k
* Only pull VRL (viral) datasets (no PAT, i.e. patent, entries)
* Pull the UpdateDate_dt entry to potentially only pull "recent data sets" in case the dataset gets too large
Co-authored-by: Jover Lee <joverlee521@gmail.com>
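The three filters above can be sketched in Python as a predicate over already-fetched records. This is only an illustration under assumptions: the PR most likely applies the filters in the GenBank query itself, and the `Record` class, its field names, and `passes_filters` are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Length bounds taken from the commit message (strictly between 5k and 15k).
MIN_LENGTH = 5_000
MAX_LENGTH = 15_000


@dataclass
class Record:
    """Hypothetical, minimal view of one GenBank record."""
    accession: str
    length: int
    division: str      # GenBank division code, e.g. "VRL" (viral) or "PAT" (patent)
    update_date: date  # assumed to be parsed from UpdateDate_dt


def passes_filters(record: Record, updated_since: Optional[date] = None) -> bool:
    """Return True if the record survives the length, division, and date filters."""
    if not (MIN_LENGTH < record.length < MAX_LENGTH):
        return False
    if record.division != "VRL":
        return False
    # The date filter is optional: it only applies when a cutoff is given.
    if updated_since is not None and record.update_date < updated_since:
        return False
    return True
```

Keeping the date cutoff optional mirrors the "potentially only pull recent data sets" wording: the field is available, but nothing is dropped by date unless a cutoff is supplied.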
Since the strain name may be in Isolate_s or Strain_s, we need to check both columns for a reasonable strain name. Dengue virus types denv1 to denv4 can be derived if their NCBI taxon IDs are listed in ViralLineage_IDs.
* derive strain name from Strain_s if Isolate_s is blank
* derive denv1 to denv4 depending on ViralLineage_IDs
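The serotype derivation in this commit message might look roughly like the following. It is a sketch, not the PR's code: it assumes ViralLineage_IDs arrives as a comma-separated string of NCBI taxon IDs, the function name `derive_serotype` is hypothetical, and the taxon IDs in the mapping are given from memory and should be checked against NCBI Taxonomy before use.

```python
from typing import Optional

# Assumed NCBI Taxonomy IDs for the four dengue serotypes (verify before use).
SEROTYPE_BY_TAXON_ID = {
    "11053": "denv1",
    "11060": "denv2",
    "11069": "denv3",
    "11070": "denv4",
}


def derive_serotype(viral_lineage_ids: str) -> Optional[str]:
    """Map a comma-separated lineage ID string to a serotype name.

    Returns None when no serotype ID is present, or when the lineage
    ambiguously lists more than one serotype.
    """
    ids = [part.strip() for part in viral_lineage_ids.split(",")]
    matches = {SEROTYPE_BY_TAXON_ID[i] for i in ids if i in SEROTYPE_BY_TAXON_ID}
    if len(matches) == 1:
        return matches.pop()
    return None
```

Returning None for missing or ambiguous lineages leaves the decision about such records (drop, flag, or annotate manually) to the caller.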
* update help statement
* make --outfile required
* simplify reordering output columns
* nuanced viruslineage_ids processing
* when multiple paper urls, pick one
* 'strain' and 'strain_s' were populated by 'Isolate_s' and 'Strain_s' pulled from genbank_url

The following was added after discussion with trs:
Check for the non-"happy path" cases first and then return early (or error early, as the case may be). This leaves the "happy path" (or "expected path") as the remainder of the function.
* return early if publications is empty
Co-authored-by: Thomas Sibley <tom@zulutango.org>
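The early-return pattern described in this commit message, applied to the "pick one paper URL" case, could be sketched as below. The function name `pick_paper_url`, the comma-separated PubMed ID input format, and the URL template are all assumptions made for illustration rather than details taken from the PR.

```python
from typing import Optional


def pick_paper_url(publications: str) -> Optional[str]:
    """Pick a single paper URL from a comma-separated list of PubMed IDs."""
    # Non-happy path 1: nothing to do, so return early.
    if not publications or not publications.strip():
        return None

    first_id = publications.split(",")[0].strip()

    # Non-happy path 2: malformed input, so error early.
    if not first_id.isdigit():
        raise ValueError(f"unexpected publication ID: {first_id!r}")

    # Happy path: the remainder of the function, with no extra nesting.
    return f"https://pubmed.ncbi.nlm.nih.gov/{first_id}/"
```

Handling the empty and malformed cases up front keeps the expected path at the lowest indentation level, which is the readability benefit the discussion points at.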
Search for valid strain name in the following order: 'strain', 'strain_s', 'accession'. Move the order into configs instead of hardcoding it in the post_process_metadata.py script.
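A minimal sketch of a configurable search order follows. The config key `strain_name_order` and the function `choose_strain` are hypothetical names; only the three field names and their order come from the commit message.

```python
from typing import Mapping, Optional, Sequence

# Hypothetical config fragment; in the workflow this would be read from the config file.
config = {"strain_name_order": ["strain", "strain_s", "accession"]}


def choose_strain(record: Mapping[str, Optional[str]], fields: Sequence[str]) -> str:
    """Return the first non-blank value among `fields`, searched in order."""
    for field in fields:
        value = (record.get(field) or "").strip()
        if value:
            return value
    raise ValueError(f"no usable strain name in fields {list(fields)}")


record = {"strain": "", "strain_s": "  ", "accession": "XX000001"}
strain = choose_strain(record, config["strain_name_order"])
```

Because the order is data rather than code, another pathogen's ingest can reuse the same function with a different field list.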
Co-authored-by: Thomas Sibley <tom@zulutango.org>
Since some strains (or isolates) may be resequenced resulting in duplicate strain names in the dengue dataset, index entries by GenBank Accession IDs.
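To illustrate why the accession is the safer index key, here is a small sketch with made-up records. The helper `index_by` and the placeholder accessions and strain name are hypothetical.

```python
from typing import Dict, Iterable, Mapping


def index_by(records: Iterable[Mapping[str, str]], key: str) -> Dict[str, Mapping[str, str]]:
    """Build a lookup keyed on `key`, refusing to silently overwrite duplicates."""
    index: Dict[str, Mapping[str, str]] = {}
    for record in records:
        value = record[key]
        if value in index:
            raise ValueError(f"duplicate {key}: {value!r}")
        index[value] = record
    return index


# Two sequencing runs of the same isolate: same strain name, distinct accessions.
records = [
    {"accession": "XX000001", "strain": "example/2001"},
    {"accession": "XX000002", "strain": "example/2001"},
]

by_accession = index_by(records, "accession")  # both records are kept
```

Calling `index_by(records, "strain")` on the same data raises a `ValueError`, which is the collision that indexing by accession avoids.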
Could not find genbank accession from GenBank or prior sequences.fasta.zst files.
Compromise by duplicating scripts from monkeypox until a generalized pathogen repository exists or these scripts get folded into augur subcommands.
Since fetch_from_genbank can query NCBI up to 5 times for each of the serotypes, try to limit concurrent queries to under 3. Using 2 to be cautious. Following the format shown at: nextstrain/ncov#1045
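The workflow presumably expresses this limit in its Snakemake configuration, which is not shown here. As a plain-Python analogy only, the same idea of capping concurrent queries at 2 can be sketched with a bounded semaphore. The `fetch` function, the `time.sleep` stand-in for the network request, and the bookkeeping variables are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_FETCHES = 2  # "Using 2 to be cautious"

_fetch_slots = threading.BoundedSemaphore(MAX_CONCURRENT_FETCHES)
_lock = threading.Lock()
_active = 0
peak_active = 0  # highest number of simultaneous fetches observed


def fetch(dataset: str) -> str:
    """Pretend to query NCBI for one dataset while holding a fetch slot."""
    global _active, peak_active
    with _fetch_slots:  # blocks while 2 fetches are already in flight
        with _lock:
            _active += 1
            peak_active = max(peak_active, _active)
        time.sleep(0.01)  # stand-in for the real network request
        with _lock:
            _active -= 1
    return dataset


datasets = ["all", "denv1", "denv2", "denv3", "denv4"]
with ThreadPoolExecutor(max_workers=len(datasets)) as pool:
    results = list(pool.map(fetch, datasets))
```

Even though five worker threads are started, the semaphore ensures that no more than two of them are inside the simulated request at any time.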
Since align may be running in 5 parallel jobs (all, denv1, denv2, denv3, denv4), reverted this rule to the original code, which uses 1 thread. However, added a threads parameter to the align rule so that this is easy to modify.
To simplify the workflow, incorporate post-processing directly into the transform step, instead of post-processing metadata after the transform step to clean up strain names and set the dengue serotype based on virus lineage ID. This step was moved above any manual annotations. This also simplified the code, so that we no longer have two code blocks determining the final metadata columns, which could have become inconsistent.
This PR is superseded by merged PRs: |
Description of proposed changes
Ingest data from genbank to generate:
for a dengue build.
Instead of separately pulling denv1 to denv4, all types are combined in one file with an annotated column: (2023-03-25, to avoid confusion keep serotypes separate)
Unordered list of remaining tasks that may change:
- Pull and merge cached datasets to avoid recompute
- Pull title from PubMed instead of Description line
Related issue(s)
Testing
Local Test
Can test this locally by running
May need to install tidyverse (No longer need R since the script was refactored into Python)