Generalize Ingest #6
Commits on Aug 19, 2023
Ingest: Copy ingest from monkeypox repo
Future commits will change this to work with Dengue data
Commit 95b1718
Commit eefe5c1
If the pathogen is not listed in nextclade_data, remove the nextclade rules and scripts until it is added. See https://github.com/nextstrain/nextclade_data/tree/release/data/datasets
Commit aa8b868
Remove bin/scripts duplication
If a script does not need to be modified for a pathogen ingest, pull the script at runtime instead of maintaining a potentially diverging copy. Use a permalink for each script so that we can version the software used in this workflow without being affected by upstream changes until we choose to bump the version. This design adds maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control.

Discussed in nextstrain/ebola#6 (comment). However, this commit is reverted later based on discussions.
Commit a71cc55
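The permalink idea in the commit above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the repository name, commit hash, and script path are hypothetical placeholders, and the helper names `permalink` and `curl_command` are invented here. It only builds the pinned URL and the curl command string; it does not perform the download.

```python
import shlex

# Hypothetical values for illustration only; the real repository, commit,
# and script paths are not given in the commit message.
PINNED_REPO = "example-org/example-ingest"
PINNED_COMMIT = "0123456789abcdef0123456789abcdef01234567"


def permalink(script_path: str, repo: str = PINNED_REPO,
              commit: str = PINNED_COMMIT) -> str:
    """Build a raw-file URL pinned to one commit, so upstream changes on a
    branch cannot alter what the workflow downloads."""
    return f"https://raw.githubusercontent.com/{repo}/{commit}/{script_path}"


def curl_command(script_path: str, dest_dir: str = "bin") -> str:
    """Return a shell command that fetches the pinned script with curl."""
    url = permalink(script_path)
    dest = f"{dest_dir}/{script_path.rsplit('/', 1)[-1]}"
    # --fail makes curl exit non-zero on HTTP errors; --location follows redirects.
    return (
        "curl --fail --silent --show-error --location "
        f"{shlex.quote(url)} --output {shlex.quote(dest)}"
    )
```

Bumping the version then amounts to changing `PINNED_COMMIT` in one place, which is the maintenance cost the commit message mentions.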
fix: Use curl for downloading files
Pick curl instead of detecting curl/wget, as discussed in nextstrain/ebola#6 (comment).
Commit 05c181b
Commit 49b0764
Dengue-specific-ingest: Add dengue serotype wildcards
Dengue requires special handling because it has multiple serotypes. Added dengue serotypes: all, denv1, denv2, denv3 and denv4.

Co-authored-by: Jover Lee <joverlee521@gmail.com>
Commit 29368b6
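As a rough illustration of what a serotype wildcard buys, the sketch below expands one path pattern into one target per serotype, similar in spirit to Snakemake's `expand`. The path pattern and the helper name `targets_for` are hypothetical; only the five serotype values come from the commit message.

```python
# Serotype values listed in the commit message above.
SEROTYPES = ["all", "denv1", "denv2", "denv3", "denv4"]


def targets_for(pattern: str, serotypes=SEROTYPES) -> list:
    """Fill the {serotype} placeholder once per serotype."""
    return [pattern.format(serotype=s) for s in serotypes]


# Hypothetical file layout, for illustration only.
sequence_files = targets_for("data/sequences_{serotype}.fasta")
```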
Dengue-specific-ingest: Add download filters
The dengue genome is approximately 10k to 11k bases long. Therefore, we can filter out any sequences that are shorter than 5k or longer than 15k. The added GenBank filters are:
* Pull sequences longer than 5k but shorter than 15k
* Only pull VRL (viral) records, not PAT (patent) records
* Pull the UpdateDate_dt entry so that we can potentially pull only "recent data sets" in case the dataset gets too large

Co-authored-by: Jover Lee <joverlee521@gmail.com>
Commit c2ad70f
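The length and division filters listed above are applied in the GenBank query itself; the sketch below only restates the same rules as a local check, which can be handy for sanity-testing downloaded records. The record keys `length` and `division` and the function names are assumptions made for this example, not the actual field names of the download.

```python
# Bounds from the commit message: keep sequences longer than 5k and
# shorter than 15k.
MIN_LENGTH = 5_000
MAX_LENGTH = 15_000


def passes_filters(record: dict) -> bool:
    """Return True for viral-division records within the length bounds.

    The keys "length" and "division" are hypothetical names.
    """
    try:
        length = int(record.get("length", ""))
    except (TypeError, ValueError):
        # A missing or non-numeric length cannot satisfy the filter.
        return False
    return record.get("division") == "VRL" and MIN_LENGTH < length < MAX_LENGTH


def filter_records(records) -> list:
    """Keep only the records that pass every filter."""
    return [r for r in records if passes_filters(r)]
```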
Since the strain name may be in Isolate_s or Strain_s, we need to check both columns for a reasonable strain name. Dengue serotypes denv1 to denv4 can be derived if their NCBI taxon IDs are listed in ViralLineage_IDs.
* derive strain name from Strain_s if Isolate_s is blank
* derive denv1 to denv4 depending on ViralLineage_IDs
Commit 159b928
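The two derivations described above might look roughly like this. It is a sketch under assumptions: the taxon-ID table is supplied here for illustration and should be verified against NCBI Taxonomy, ViralLineage_IDs is assumed to be a comma-separated string, and the function names are invented for this example.

```python
import re

# Assumed NCBI Taxonomy IDs for the four dengue serotypes; verify against
# NCBI Taxonomy before relying on them.
SEROTYPE_BY_TAXON_ID = {
    "11053": "denv1",
    "11060": "denv2",
    "11069": "denv3",
    "11070": "denv4",
}


def derive_strain(record: dict) -> str:
    """Prefer Isolate_s; fall back to Strain_s when Isolate_s is blank."""
    isolate = (record.get("Isolate_s") or "").strip()
    strain = (record.get("Strain_s") or "").strip()
    return isolate or strain


def derive_serotype(record: dict) -> str:
    """Return the first serotype whose taxon ID appears in ViralLineage_IDs,
    or an empty string when none is listed."""
    lineage_ids = str(record.get("ViralLineage_IDs") or "")
    # Assumption: IDs are separated by commas, semicolons, or whitespace.
    for taxon_id in re.split(r"[,;\s]+", lineage_ids):
        if taxon_id in SEROTYPE_BY_TAXON_ID:
            return SEROTYPE_BY_TAXON_ID[taxon_id]
    return ""
```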
Replace post processing R script with python
* update help statement
* make --outfile required
* simplify reordering output columns
* nuanced viruslineage_ids processing
* when multiple paper urls, pick one
* 'strain' and 'strain_s' were populated by 'Isolate_s' and 'Strain_s' pulled from genbank_url

The following was added after discussion with trs:

Check for the non-"happy path" cases first and then return early (or error early, as the case may be). This leaves the "happy path" (or "expected path") as the remainder of the function.
* return early if publications is empty

Co-authored-by: Thomas Sibley <tom@zulutango.org>
Commit 9935631
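The early-return style described in the commit above can be illustrated with the "pick one paper URL" case. This is a hedged sketch, not the script's actual code: the function name, the comma-separated input format, and the choice of the first entry are all assumptions.

```python
def pick_paper_url(paper_urls: str) -> str:
    """Pick a single URL from a comma-separated string of URLs."""
    # Non-happy path first: nothing to pick from.
    if not paper_urls or not paper_urls.strip():
        return ""

    candidates = [url.strip() for url in paper_urls.split(",") if url.strip()]

    # Second non-happy path: only separators, no actual URLs.
    if not candidates:
        return ""

    # Happy path: taking the first entry is an arbitrary but deterministic
    # choice made for this example.
    return candidates[0]
```

With the edge cases handled up front, the final line is the only one a reader needs to follow for the expected case.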
[ingest] Simplify finding strain name
Search for valid strain name in the following order: 'strain', 'strain_s', 'accession'. Move the order into configs instead of hardcoding it in the post_process_metadata.py script.
Commit 03c0819
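A config-driven search order like the one described above could be sketched as follows. The field order comes from the commit message; the function name and the definition of "valid" (non-empty after trimming whitespace) are assumptions for this example.

```python
# Order taken from the commit message; in the workflow it lives in the
# config rather than in the script.
STRAIN_FIELD_ORDER = ["strain", "strain_s", "accession"]


def find_strain_name(record: dict, field_order=STRAIN_FIELD_ORDER) -> str:
    """Return the first non-blank value found in field_order, else ""."""
    for field in field_order:
        value = (record.get(field) or "").strip()
        # Assumption: any non-blank value counts as a valid strain name.
        if value:
            return value
    return ""
```

Because the order is passed in as data, changing the preference needs only a config edit, not a change to the script.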
Commit 931f5ae
fix: makes the compress rule more generic
Co-authored-by: Thomas Sibley <tom@zulutango.org>
Commit 205165a
Build: Index by genbank accession instead of duplicate strain names
Since some strains (or isolates) may be resequenced, resulting in duplicate strain names in the dengue dataset, index entries by GenBank accession ID instead.
Commit 5d7aa6d
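A small made-up example shows why the index column matters. The records and the helper name `index_by` are invented for illustration; the point is only that a key with duplicates silently drops rows, while a unique accession keeps them all.

```python
# Invented records: the same strain sequenced twice under two accessions.
records = [
    {"accession": "AB000001", "strain": "DENV-X"},
    {"accession": "AB000002", "strain": "DENV-X"},
    {"accession": "AB000003", "strain": "DENV-Y"},
]


def index_by(rows, key):
    """Build a dict keyed by the given column; later rows overwrite earlier
    ones when the key repeats."""
    return {row[key]: row for row in rows}


by_strain = index_by(records, "strain")        # duplicate strain collapses
by_accession = index_by(records, "accession")  # every record is kept
```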
fix: remove entries where accession is not found
Remove entries whose GenBank accession could not be found in GenBank or in prior sequences.fasta.zst files.
Commit 7c2da29
Ingest: Compromise by duplicating scripts
Compromise by duplicating scripts from monkeypox until a generalized pathogen repository exists or these scripts are folded into augur subcommands.
Commit 269837e
Commit cc8731c
Commit 0a9a2a5
[wip] attempt at limiting concurrent deploys
Since fetch_from_genbank can query NCBI up to 5 times for each of the serotypes, try to limit concurrent queries to under 3. Using 2 to be cautious. Following the format shown at: nextstrain/ncov#1045
Commit 641c020
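The commit above limits concurrency through the workflow manager, following nextstrain/ncov#1045. The sketch below is not that mechanism; it shows the same idea (at most 2 queries in flight across 5 serotypes) with a plain Python semaphore. The class name `LimitedFetcher` is invented, and the sleep is a stand-in for a real NCBI request.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

SEROTYPES = ["all", "denv1", "denv2", "denv3", "denv4"]


class LimitedFetcher:
    """Run fetches in parallel while capping how many are active at once."""

    def __init__(self, max_concurrent: int = 2):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0  # highest number of simultaneous fetches seen

    def fetch(self, serotype: str) -> str:
        with self._slots:  # blocks while max_concurrent fetches are running
            with self._lock:
                self._active += 1
                self.peak_active = max(self.peak_active, self._active)
            time.sleep(0.01)  # placeholder for the actual network query
            with self._lock:
                self._active -= 1
        return f"fetched {serotype}"

    def fetch_all(self, serotypes) -> list:
        # One worker per serotype; the semaphore, not the pool, is the limit.
        with ThreadPoolExecutor(max_workers=len(serotypes)) as pool:
            return list(pool.map(self.fetch, serotypes))
```

Calling `LimitedFetcher(2).fetch_all(SEROTYPES)` starts five workers but lets only two run the placeholder query at any moment.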
Build: parameterize threads in align rule
Since align may be running in 5 parallel jobs (all, denv1, denv2, denv3, denv4), reverted this rule to the original code, which uses 1 thread. However, added a threads parameter to the align rule so that this is easy to modify.
Commit 0ec11a9
Commit dec2ec3
Commit e80947d
refactor: move post_process_metadata to rule transform
To simplify the workflow, incorporate post processing directly into the transform step, instead of running it after the transform step to clean up strain names and set the dengue serotype based on virus lineage ID. This step was moved above any manual annotations. This also simplifies the code: there are no longer two code blocks determining the final metadata columns, which could have become inconsistent.
Commit 30fbeef
Commit d25a3e7
Commit a309e5f
Commit 5c6baf4
Commit f82298f
Commit ac34243
Commit 9dbccc4