Generalize Ingest #6

Closed
wants to merge 29 commits into from

Commits on Aug 19, 2023

  1. Ingest: Copy ingest from monkeypox repo

    Future commits will change this to work with Dengue data
    j23414 authored and j23414 committed Aug 19, 2023
    95b1718
  2. eefe5c1
  3. Ingest: Remove Nextclade

    If a pathogen is not listed in nextclade_data, remove the Nextclade rules
    and scripts until it is added.
    
    https://github.com/nextstrain/nextclade_data/tree/release/data/datasets
    j23414 authored and j23414 committed Aug 19, 2023
    aa8b868
  4. Remove bin/scripts duplication

    If a script does not need to be modified for a pathogen ingest, pull
    script during runtime instead of maintaining a potentially diverging
    copy.
    
    Use a permalink for each script to allow us to version the software we
    use in this workflow without being affected by upstream changes until
    we want to bump the version. This design adds more maintenance to this
    workflow, but it also protects users against unexpected issues that are
    outside of their control.
    Discussed in nextstrain/ebola#6 (comment)
    
    However, this commit gets reverted later based on discussions.
    j23414 authored and j23414 committed Aug 19, 2023
    a71cc55
  5. fix: Use curl for downloading files

    Pick curl instead of detecting curl/wget as discussed in:
    nextstrain/ebola#6 (comment)
    j23414 committed Aug 19, 2023
    05c181b
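The permalink-plus-curl approach described in commits 4 and 5 can be sketched roughly as follows. This is only an illustration, not the workflow's actual rule: the helper names, the all-zero commit hash, and the script path are hypothetical, and the sketch only builds the command rather than running it.

```python
import shlex


def pinned_raw_url(owner: str, repo: str, commit: str, path: str) -> str:
    """Build a raw.githubusercontent.com URL pinned to one commit (a permalink)."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{commit}/{path}"


def curl_command(url: str, dest: str) -> list:
    """Return a curl invocation: fail on HTTP errors, stay quiet, follow redirects."""
    return ["curl", "-fsSL", "--output", dest, url]


# Hypothetical commit hash and script path, for illustration only.
PLACEHOLDER_COMMIT = "0" * 40
url = pinned_raw_url("nextstrain", "monkeypox", PLACEHOLDER_COMMIT, "ingest/bin/example-script")
cmd = curl_command(url, "bin/example-script")

# Print the shell command instead of executing it, so the sketch needs no network.
print(shlex.join(cmd))
```

Pinning to a commit hash rather than a branch name is what keeps upstream changes from reaching the workflow until the hash is bumped deliberately.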
  6. Ingest: Replace monkeypox text and parameters with dengue

    j23414 authored and j23414 committed Aug 19, 2023
    49b0764
  7. Dengue-specific-ingest: Add dengue serotype wildcards

    Dengue requires special handling because it has multiple serotypes.
    Added dengue serotypes: all, denv1, denv2, denv3 and denv4
    
    Co-authored-by: Jover Lee <joverlee521@gmail.com>
    2 people authored and j23414 committed Aug 19, 2023
    29368b6
  8. Dengue-specific-ingest: Add download filters

    The dengue genome is approximately 10-11 kb long. Therefore, we can filter
    out any sequences shorter than 5 kb or longer than 15 kb.
    
    A list of added GenBank filters is below:
    
    * Pull sequences longer than 5 kb but shorter than 15 kb
    * Only pull VRL (viral) records, excluding PAT (patent) records
    * Pull UpdateDate_dt entry to potentially only pull "recent data sets"
      in case the dataset gets too large
    
    Co-authored-by: Jover Lee <joverlee521@gmail.com>
    2 people authored and j23414 committed Aug 19, 2023
    c2ad70f
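The length and division filters from commit 8 are applied at download time in the GenBank query. As a rough local equivalent, the same logic can be sketched as a predicate over already-fetched records. The record layout, field names, and accessions below are assumptions made for illustration.

```python
MIN_LENGTH = 5_000   # sequences must be longer than this
MAX_LENGTH = 15_000  # and shorter than this


def passes_filters(record: dict) -> bool:
    """Keep only viral (VRL) records whose length is between 5 kb and 15 kb."""
    if record["division"] != "VRL":
        return False  # drops PAT (patent) and any other division
    return MIN_LENGTH < record["length"] < MAX_LENGTH


# Made-up records; accessions and lengths are illustrative only.
records = [
    {"accession": "EX000001", "length": 10_700, "division": "VRL"},
    {"accession": "EX000002", "length": 1_500, "division": "VRL"},   # too short
    {"accession": "EX000003", "length": 10_700, "division": "PAT"},  # patent
    {"accession": "EX000004", "length": 20_000, "division": "VRL"},  # too long
]

kept = [r["accession"] for r in records if passes_filters(r)]
print(kept)
```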
  9. Add post processing script

    Since strain name may be in Isolate_s or Strain_s, we need to check both
    columns for a reasonable strain name. Dengue virus types denv1 to 4 can
    be derived if their NCBI taxon IDs are listed in ViralLineage_IDs.
    
    * derive strain name from Strain_s if Isolate_s is blank
    * derive denv1 to 4 depending on ViralLineage_IDs
    j23414 authored and j23414 committed Aug 19, 2023
    159b928
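The two derivations in commit 9 could look something like the sketch below. It is not the script from the PR: the function names are hypothetical, ViralLineage_IDs is assumed to be a comma-separated string, and the taxon ID table reflects what are believed to be the NCBI taxonomy IDs for the four serotypes and should be checked against NCBI before reuse.

```python
# Assumed NCBI taxon IDs for dengue virus 1 to 4 (verify before relying on them).
SEROTYPE_BY_TAXON_ID = {
    "11053": "denv1",
    "11060": "denv2",
    "11069": "denv3",
    "11070": "denv4",
}


def derive_strain(row: dict) -> str:
    """Prefer Isolate_s; fall back to Strain_s when Isolate_s is blank."""
    isolate = (row.get("Isolate_s") or "").strip()
    if isolate:
        return isolate
    return (row.get("Strain_s") or "").strip()


def derive_serotype(row: dict) -> str:
    """Return the first serotype whose taxon ID appears in ViralLineage_IDs."""
    lineage_ids = (row.get("ViralLineage_IDs") or "").split(",")
    for taxon_id in lineage_ids:
        serotype = SEROTYPE_BY_TAXON_ID.get(taxon_id.strip())
        if serotype:
            return serotype
    return ""  # no serotype-level taxon ID present


# Illustrative row; the values are made up.
row = {"Isolate_s": "", "Strain_s": "example-strain", "ViralLineage_IDs": "10239, 12637, 11060"}
print(derive_strain(row), derive_serotype(row))
```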
  10. Replace post processing R script with python

    * update help statement
    * make --outfile required
    * simplify reordering output columns
    * nuanced viruslineage_ids processing
    * when multiple paper urls, pick one
    * 'strain' and 'strain_s' were populated by 'Isolate_s' and 'Strain_s'
    pulled from genbank_url
    
    The following was added after discussion with trs
    
    Check for the non-"happy path" cases first and return early (or
    error early, as the case may be). This leaves the "happy path" (or
    "expected path") as the remainder of the function.
    
    * return early if publications is empty
    
    Co-authored-by: Thomas Sibley <tom@zulutango.org>
    2 people authored and j23414 committed Aug 19, 2023
    9935631
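The early-return style described in commit 10 can be illustrated with the "pick one paper URL" case. This is a minimal sketch under assumptions: the function name is hypothetical, and the publications field is assumed to be a comma-separated string of URLs.

```python
def pick_paper_url(publications: str) -> str:
    """Return a single paper URL, handling the non-happy paths first."""
    # Non-happy path 1: nothing recorded at all, so return early.
    if not publications or not publications.strip():
        return ""

    candidates = [url.strip() for url in publications.split(",") if url.strip()]

    # Non-happy path 2: only separators or whitespace were present.
    if not candidates:
        return ""

    # Happy path: one or more URLs; keep the first.
    return candidates[0]


print(pick_paper_url("https://example.org/paper-a, https://example.org/paper-b"))
```

With the guard clauses at the top, the final line is the only place the expected case is handled, which keeps the function flat instead of nested.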
  11. [ingest] Simplify finding strain name

    Search for valid strain name in the following order: 'strain', 'strain_s',
    'accession'. Move the order into configs instead of hardcoding it in the
    post_process_metadata.py script.
    j23414 committed Aug 19, 2023
    03c0819
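A config-driven search order, as commit 11 describes, might be sketched like this. The config key name and the function name are hypothetical; only the field order ('strain', 'strain_s', 'accession') comes from the commit message.

```python
# Hypothetical config entry; in the workflow this would live in a config file.
config = {"strain_fields": ["strain", "strain_s", "accession"]}


def find_strain_name(record: dict, fields: list) -> str:
    """Return the first non-blank value among the configured fields, in order."""
    for field in fields:
        value = (record.get(field) or "").strip()
        if value:
            return value
    return ""


# Illustrative record with made-up values: 'strain' is blank, so 'strain_s' wins.
record = {"strain": "", "strain_s": "example-strain", "accession": "EX000001"}
print(find_strain_name(record, config["strain_fields"]))
```

Because the order is data rather than code, changing the priority only requires editing the config.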
  12. zstd compress output files

    j23414 authored and j23414 committed Aug 19, 2023
    931f5ae
  13. fix: makes the compress rule more generic

    Co-authored-by: Thomas Sibley <tom@zulutango.org>
    j23414 and tsibley committed Aug 19, 2023
    205165a
  14. Build: Index by genbank accession instead of duplicate strain names

    Since some strains (or isolates) may be resequenced resulting in duplicate
    strain names in the dengue dataset, index entries by GenBank Accession IDs.
    j23414 authored and j23414 committed Aug 19, 2023
    5d7aa6d
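The problem that commit 14 addresses can be shown with a small example: when two records share a strain name, an index keyed by strain silently keeps only one of them, while an index keyed by accession keeps both. The records below are made up for illustration.

```python
from collections import Counter

# Two made-up records for the same strain, e.g. after resequencing.
records = [
    {"accession": "EX000001", "strain": "example-strain"},
    {"accession": "EX000002", "strain": "example-strain"},
    {"accession": "EX000003", "strain": "other-strain"},
]

# Keyed by strain: the later record overwrites the earlier one.
by_strain = {r["strain"]: r for r in records}

# Keyed by accession: every record keeps its own entry.
by_accession = {r["accession"]: r for r in records}

# Strain names that occur more than once.
duplicated = [name for name, n in Counter(r["strain"] for r in records).items() if n > 1]

print(len(by_strain), len(by_accession), duplicated)
```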
  15. fix: remove entries where accession is not found

    Remove entries whose GenBank accession could not be found in GenBank or
    in prior sequences.fasta.zst files.
    j23414 committed Aug 19, 2023
    7c2da29
  16. Ingest: Compromise by duplicating scripts

    Compromise by duplicating scripts from monkeypox until a generalized
    pathogen repository exists or these scripts are folded into augur
    subcommands.
    j23414 authored and j23414 committed Aug 19, 2023
    269837e
  17. cc8731c
  18. 0a9a2a5
  19. [wip] attempt at limiting concurrent deploys

    Since fetch_from_genbank can query NCBI up to 5 times for each of the
    serotypes, try to limit concurrent queries to under 3. Using 2 to be
    cautious.
    
    Following the format shown at:
    nextstrain/ncov#1045
    j23414 committed Aug 19, 2023
    641c020
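Commit 19 limits concurrency through the workflow configuration, following nextstrain/ncov#1045. The sketch below is not that mechanism. It swaps in a plain Python semaphore to illustrate the same idea: five serotype jobs are submitted at once, but at most two "queries" run at a time. The fetch function is a stand-in that sleeps instead of contacting NCBI, and all names are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_QUERIES = 2  # "Using 2 to be cautious"
SEROTYPES = ["all", "denv1", "denv2", "denv3", "denv4"]

slots = threading.BoundedSemaphore(MAX_CONCURRENT_QUERIES)
lock = threading.Lock()
state = {"active": 0, "peak": 0}  # tracks how many fetches overlap


def fetch(serotype: str) -> str:
    """Stand-in for a rate-limited query; holds one slot while it runs."""
    with slots:
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.05)  # placeholder for the real network request
        with lock:
            state["active"] -= 1
    return serotype


# One worker per serotype, so only the semaphore limits the overlap.
with ThreadPoolExecutor(max_workers=len(SEROTYPES)) as pool:
    results = list(pool.map(fetch, SEROTYPES))

print(results, "peak concurrent fetches:", state["peak"])
```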
  20. Build: parameterize threads in align rule

    Since align may be running in 5 parallel jobs (all, denv1, denv2, denv3,
    denv4), reverted this rule to the original code, which uses 1 thread.
    However, added a threads parameter to the align rule so that this is easy
    to modify.
    j23414 committed Aug 19, 2023
    0ec11a9
  21. dec2ec3
  22. e80947d
  23. refactor: move post_process_metadata to rule transform

    To simplify the workflow, incorporate post processing (cleaning up strain
    names and setting the dengue serotype based on virus lineage ID) directly
    into the transform step instead of running it afterwards. This step was
    moved above any manual annotations. This also simplified the code so that
    two separate code blocks no longer determine the final metadata columns,
    which could have become inconsistent.
    j23414 committed Aug 19, 2023
    30fbeef
  24. d25a3e7
  25. mark temp intermediate files

    j23414 committed Aug 19, 2023
    a309e5f
  26. 5c6baf4
  27. f82298f
  28. ac34243
  29. 9dbccc4