Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize Ingest #6

Closed
wants to merge 29 commits into from
Closed

Generalize Ingest #6

wants to merge 29 commits into from

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Nov 17, 2022

Description of proposed changes

Ingest data from genbank to generate:

  • data/sequences_all.fasta
  • data/metadata_all.tsv

for a dengue build.

Instead of separately pulling denv1 to denv4, all types are combined in one file with an annotated column: (2023-03-25, to avoid confusion keep serotypes separate)

  • data/sequences_denv1.fasta
  • data/metadata_denv1.tsv
  • data/sequences_denv2.fasta
  • data/metadata_denv2.tsv
  • data/sequences_denv3.fasta
  • data/metadata_denv3.tsv
  • data/sequences_denv4.fasta
  • data/metadata_denv4.tsv

Screen Shot 2022-11-17 at 3 50 30 PM

Unordered list of remaining tasks that may change:

  • Update docs
  • Pull and merge cached datasets to avoid recompute
  • Pull title from PubMed instead of Description line

Related issue(s)

Testing

  • Checks pass

Local Test

Can test this locally by running

git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout new_ingest
cd ingest
nextstrain build .

ls -ltr data
wc -l data/metadata_all.tsv

May need to install tidyverse (No longer need R since the script was refactored into python)

R
install.packages(tidyverse)

@tsibley
Copy link
Member

tsibley commented Nov 23, 2022

I haven't done a detailed review, but a couple high-level comments to start off:

  1. I don't think just-in-time downloading of programs from monkeypox is the way we want to reuse stuff. It seems to me that it will be fragile and hard to maintain over time.

  2. While I personally enjoy multi-lingual projects and am sympathetic to the best tools being the ones you know (and do like tidyverse), I think we should replace the lone R program here with something in Python for consistency with the rest of the repo/ecosystem.

    The R program also won't work in our Docker runtime (e.g. try it with nextstrain build --docker ingest/) and if it works in our Conda and ambient runtimes for someone, it's only by chance. We could add R support to our runtimes, but I think that's a bigger scope of work. I also noted that the R program loads the whole metadata file into memory when it could instead do a streaming transform (e.g. as csv-to-ndjson does).

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, @j23414! It's a big lift to convert tidyverse logic to pandas logic especially since pandas has a huge learning curve and remains confusing in many ways even to people who have used it for years. Most of the comments below touch on Python conventions used throughout the other Nextstrain Python codebase, but there is an important note about the mapping of NCBI lineage ids to serotype names that's more of a data formatting consideration.

ingest/bin/post_process_metadata.py Outdated Show resolved Hide resolved
ingest/bin/post_process_metadata.py Outdated Show resolved Hide resolved
ingest/bin/post_process_metadata.py Outdated Show resolved Hide resolved
ingest/bin/post_process_metadata.py Outdated Show resolved Hide resolved
ingest/bin/post_process_metadata.py Outdated Show resolved Hide resolved
ingest/bin/post_process_metadata.py Outdated Show resolved Hide resolved
ingest/bin/post_process_metadata.py Outdated Show resolved Hide resolved
ingest/bin/post_process_metadata.py Outdated Show resolved Hide resolved
@j23414
Copy link
Contributor Author

j23414 commented Dec 6, 2022

I haven't done a detailed review, but a couple high-level comments to start off:

  1. I don't think just-in-time downloading of programs from monkeypox is the way we want to reuse stuff. It seems to me that it will be fragile and hard to maintain over time.
  2. While I personally enjoy multi-lingual projects and am sympathetic to the best tools being the ones you know (and do like tidyverse), I think we should replace the lone R program here with something in Python for consistency with the rest of the repo/ecosystem.
    The R program also won't work in our Docker runtime (e.g. try it with nextstrain build --docker ingest/) and if it works in our Conda and ambient runtimes for someone, it's only by chance. We could add R support to our runtimes, but I think that's a bigger scope of work. I also noted that the R program loads the whole metadata file into memory when it could instead do a streaming transform (e.g. as csv-to-ndjson does).

Point 2 is addressed with 2c8553a

Re: Point 1, I still disagree...mostly because I'm currently trying to propagate changes across zika and ebola ingests which is already tedious when copying changes across their snakefiles. One might argue to only polish ingest scripts on one repo (dengue), however in my experience that results in over-specialized scripts that become really difficult to generalize later.

To meet the final end-goal of Point 1 (if not the immediate developmental path of Point 1), I'm happy to make final copies in a future PR after I'm sure the pipeline works for:

Otherwise my dev path has already branched into:

@tsibley
Copy link
Member

tsibley commented Dec 12, 2022

@j23414 Nod. I don't want to get in the way of your active development.

As a final state, though, I still think we should not rely on this sort of dynamic downloading. It's got all the problems of a package management system (dependencies, versioning, updates, etc), without any of the solutions a mature package management system provides.

There's also issues of brittle close coupling it introduces. Consider that someone changing files in the monkeypox workflow would (rightfully so, I'd argue) not think that modifying one of the programs in scripts/ could break the unrelated dengue workflow.

@huddlej huddlej mentioned this pull request Dec 14, 2022
1 task
@j23414 j23414 changed the title WIP: New ingest [do not merge yet] Generalize Ingest Jan 4, 2023
@j23414 j23414 mentioned this pull request Mar 24, 2023
1 task
@tsibley tsibley requested a review from a team March 24, 2023 23:37
@j23414 j23414 force-pushed the new_ingest branch 2 times, most recently from 040c503 to 0ca3400 Compare March 25, 2023 10:18
@tsibley tsibley self-requested a review March 27, 2023 22:30
Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I normally review PRs commit-by-commit, but this was not feasible here given the commit structure and sequence. Instead, I reviewed this PR as one large diff after excluding the initial verbatim copy of the ingest machinery from monkeypox ("add ingest from monkeypox repo" (645acb2)).

# HEAD was "fix: switch to augur curate normalize-strings" (b41fb45)
git diff --compact-summary --patch 645acb2..@

This exclusion helped reduce the amount under review, since it's been reviewed previously, and highlight what had to change from monkeypox to here.

A couple general top-level comments, before specifics:

  • It'd be good to incorporate changes from Parameterize ncbi_id in fetch_sequences mpox#146 here, one way or another. I'm sure that's your plan! but calling it out so we don't forget.

  • Running nextstrain build ingest produces files in ingest/.snakemake/ and ingest/logs/ that should be in a git ignore file.

README.md Outdated Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
Snakefile Outdated Show resolved Hide resolved
bin/set_final_strain_name.py Outdated Show resolved Hide resolved
ingest/config/optional.yaml Outdated Show resolved Hide resolved
ingest/workflow/snakemake_rules/fetch_sequences.smk Outdated Show resolved Hide resolved
ingest/workflow/snakemake_rules/transform.smk Outdated Show resolved Hide resolved
ingest/workflow/snakemake_rules/transform.smk Outdated Show resolved Hide resolved
ingest/config/config.yaml Outdated Show resolved Hide resolved
@j23414 j23414 force-pushed the new_ingest branch 2 times, most recently from b9865c9 to c932c00 Compare March 28, 2023 18:34
@j23414 j23414 force-pushed the new_ingest branch 10 times, most recently from 099600c to 430ff78 Compare April 5, 2023 02:40
@j23414 j23414 requested a review from a team April 5, 2023 04:20
@j23414 j23414 marked this pull request as draft April 11, 2023 23:59
@j23414
Copy link
Contributor Author

j23414 commented Apr 12, 2023

Working on rebasing this! Do not review yet! Thanks @victorlin!!!

j23414 and others added 22 commits August 18, 2023 21:42
The dengue genome is approximately 10k or 11k. Therefore, we can filter
out any sequences that are less than 5k or greater than 15k.

A list of added GenBank filters is below:

* Pull sequences longer than 5k but less than 15k
* Only pull VRL (viral) datasets (no PAT or patents)
* Pull UpdateDate_dt entry to potentially only pull "recent data sets"
  in case the dataset gets too large

Co-authored-by: Jover Lee <joverlee521@gmail.com>
Since strain name may be in Isolate_s or Strain_s, we need to check both
columns for a reasonable strain name. Dengue virus types denv1 to 4 can
be derived if their NCBI taxon IDs are listed in ViralLineage_IDs.

* derive strain name from Strain_s if Isolate_s is blank
* derive denv1 to 4 depending on ViralLineage_IDs
* update help statement
* make --outfile required
* simplify reordering output columns
* nuanced viruslineage_ids processing
* when multiple paper urls, pick one
* 'strain' and 'strain_s' were populated by 'Isolate_s' and 'Strain_s'
pulled from genbank_url

The following was added after discussion with trs

Check for the non-"happy path" cases first and then return early (or
erroring early, as the case may be). This leaves the "happy path" (or
"expected path") as the remainder of the function.

* return early if publications is empty

Co-authored-by: Thomas Sibley <tom@zulutango.org>
Search for valid strain name in the following order: 'strain', 'strain_s',
'accession'. Move the order into configs instead of hardcoding it in the
post_process_metadata.py script.
Co-authored-by: Thomas Sibley <tom@zulutango.org>
Since some strains (or isolates) may be resequenced resulting in duplicate
strain names in the dengue dataset, index entries by GenBank Accession IDs.
Could not find genbank accession from GenBank or prior sequences.fasta.zst files.
Compromise by duplicating scripts from monkeypox until a generalized
pathogen repository exists or these scripts get enfolded into an augur
subcommands
Since fetch_from_genbank can query NCBI up to 5 times for each of the serotypes, try to limit concurrent queries to under 3. Using 2 to be cautious.

Following the format shown at:
nextstrain/ncov#1045
Since align may be running in 5 parallel jobs (all, denv1, denv2, denv3
denv4), reverted this rule to original code of using 1 thread. However,
added a threads parameter in the align rule so that this is easy to modify.
To simplify the workflow, instead of post processing metadata to clean up
strain names and set dengue serotype based on virus lineage ID after the
transform step, incorporate post processing directly into the transform step.
This step was moved above any manual annotations. This also simplified the
code so we were not having two code blocks determining the final metadata columns
which may have become inconsistent.
@j23414
Copy link
Contributor Author

j23414 commented May 23, 2024

@j23414 j23414 closed this May 23, 2024
@j23414 j23414 deleted the new_ingest branch May 23, 2024 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

5 participants