Improve parsing of GenBank / GFF files #1351

Merged: 10 commits, Dec 20, 2023

Commits on Dec 20, 2023

  1. [utils] refactor 'load_features', add docstrings

    This follows my own advice in <#953 (comment)>
    and sets the stage for subsequent commits to improve the code. There are
    no functional changes here.
    jameshadfield committed Dec 20, 2023
    8f17677
  2. [utils] lift error handling to load_features

    Both functions that call `load_features` checked for a return value of
    None (indicating that the file didn't exist) and caused the augur
    command to exit. It's cleaner to lift this check into `load_features`,
    which also makes it easier for that function to raise errors in the
    future (e.g. on malformed or empty reference files).
    jameshadfield committed Dec 20, 2023
    3b27cdc
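
As a rough illustration of the pattern described in commit 2, the sketch below raises from inside the loader instead of returning None for each caller to check. The names `load_features_sketch`, `ReferenceFileError` and the `parser` argument are hypothetical and are not augur's actual API; the empty-result check is an assumed example of the "future" errors the commit message mentions.

```python
import os


class ReferenceFileError(Exception):
    """Hypothetical error type; augur's real exception class may differ."""


def load_features_sketch(reference, parser):
    """Return parsed features, raising instead of returning None.

    `parser` is any callable taking a path and returning a dict of features.
    """
    if not os.path.isfile(reference):
        # Previously each caller would have tested for None and exited itself.
        raise ReferenceFileError(f"reference file '{reference}' not found")
    features = parser(reference)
    if not features:
        # Assumed extension: an empty result is also treated as an error.
        raise ReferenceFileError(f"no features could be read from '{reference}'")
    return features
```

Callers then need only a single `try`/`except` (or none, if a top-level handler reports the error), rather than repeating the None check.
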
  3. [utils] only allow GFFs with one record

    Also catches the edge case where a GFF has no valid rows. The printed
    error messages should be helpful enough to identify the GFF formatting
    error(s).
    jameshadfield committed Dec 20, 2023
    e57e808
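
The check in commit 3 could look something like the following minimal sketch. It assumes that a "record" corresponds to a distinct sequence ID in the first GFF column and that a valid row has nine tab-separated fields; the function name `single_seqid` is hypothetical, and how augur itself counts records is not shown in this PR description.

```python
def single_seqid(gff_text):
    """Return the one sequence ID used in `gff_text`, or raise ValueError."""
    seqids = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and pragmas
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # assumption: rows without nine columns are not valid
        seqids.add(fields[0])
    if not seqids:
        raise ValueError("GFF contains no valid rows")
    if len(seqids) > 1:
        raise ValueError(
            "GFF contains multiple records: " + ", ".join(sorted(seqids))
        )
    return next(iter(seqids))
```
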
  4. [utils] GFF parsing doesn't depend on --genes

    See the comment added in the code, which details the previous behaviour. While
    silently skipping genes without the necessary attributes isn't the best
    solution in my opinion, it's at least now consistent (and also
    consistent with how we handle GenBank parsing).
    
    Closes #1349
    jameshadfield committed Dec 20, 2023
    a59272c
  5. [utils] always extract 'nuc' annotation (GFF)

    This fixes a long-standing oversight in Augur where GFF files could only
    define the nucleotide coordinates via a row with GFF type 'source'. We
    now parse the (preferred) GFF type 'region' as well as the
    '##sequence-region' pragma. This allows us to exit if the nuc
    coordinates are not defined, and the error message should help users
    correct their GFF files.
    
    Note that the code which guarantees that the extracted nuc coordinates
    will be exported in the JSON is not yet implemented.
    
    The changes here required a number of tests to be updated, largely
    because we now parse an additional feature from GFF files (the 'nuc'
    feature).
    jameshadfield committed Dec 20, 2023
    afa8bd6
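
To make the three sources of nucleotide coordinates in commit 5 concrete, here is a small sketch that reads them from raw GFF text. It is not augur's implementation: the function name `nuc_coordinates` is hypothetical, and the precedence (a 'region' or 'source' row first, then the `##sequence-region` pragma) is an assumption, since the commit message does not state which wins. Coordinates are returned as they appear in the file (GFF uses 1-based, inclusive positions).

```python
def nuc_coordinates(gff_text):
    """Return (start, end) for the genome, or raise ValueError if undefined."""
    pragma = None
    for line in gff_text.splitlines():
        if line.startswith("##sequence-region"):
            parts = line.split()  # ##sequence-region <seqid> <start> <end>
            if len(parts) == 4:
                pragma = (int(parts[2]), int(parts[3]))
            continue
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) == 9 and fields[2] in ("region", "source"):
            # Assumed precedence: an explicit feature row beats the pragma.
            return int(fields[3]), int(fields[4])
    if pragma is None:
        raise ValueError(
            "GFF does not define nucleotide coordinates; add a 'region' row "
            "or a '##sequence-region' pragma"
        )
    return pragma
```
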
  6. [utils] always extract 'nuc' annotation (GenBank)

    A companion commit to the previous one, this time using GenBank files.
    jameshadfield committed Dec 20, 2023
    af62d50
  7. [utils] Forbid gene/CDS with name 'nuc'

    This name is reserved in our annotations schema to refer to the genome /
    segment / sequence nucleotide annotation.
    jameshadfield committed Dec 20, 2023
    580666a
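
The reserved-name rule in commit 7 amounts to a simple guard, sketched below under the assumption that gene/CDS names are available as a collection of strings. The names `RESERVED_NAME` and `check_gene_names` are hypothetical and not taken from augur.

```python
RESERVED_NAME = "nuc"  # reserved for the genome / segment annotation


def check_gene_names(gene_names):
    """Raise ValueError if any gene/CDS uses the reserved name."""
    if RESERVED_NAME in gene_names:
        raise ValueError(
            f"a gene/CDS may not be named '{RESERVED_NAME}'; this name is "
            "reserved for the nucleotide annotation"
        )
```
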
  8. [translate] guarantee nuc annotation produced

    This builds off the preceding 3 commits which guarantee that a 'nuc'
    feature will be parsed from the reference file. We now guarantee it'll
    be exported in the node-data JSON.
    
    Note that the change to the TB aa_muts.json test file was due to a bug
    in the previous code, where `'type': feat['type']` would incorrectly
    reuse the last defined `feat` in the preceding loop. (I think this is a
    pitfall of using large "real-life" test files as it's impractical to
    manually check the source-of-truth we are comparing against.)
    
    Since the 'nuc' feature is guaranteed to exist, we can also check it
    against the existing nuc annotation within `augur ancestral`, where
    applicable.
    
    Closes #953, although there is good commentary in that issue about
    improving our parsing of GFFs more generally than that implemented here.
    
    Closes #1346
    jameshadfield committed Dec 20, 2023
    1d17699
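
The bug mentioned in commit 8 is an instance of a general Python pitfall: a `for` loop variable stays bound after the loop ends, so later code that reuses it silently reads the last item. The sketch below illustrates that pitfall with made-up data; it is not a reconstruction of the original augur code, and the function names are hypothetical.

```python
features = {
    "nuc": {"type": "source", "start": 1, "end": 100},
    "geneA": {"type": "gene", "start": 5, "end": 40},
    "geneB": {"type": "CDS", "start": 50, "end": 90},
}


def nuc_type_buggy(features):
    for name, feat in features.items():
        pass  # per-feature work would happen here
    # Bug: `feat` is whatever the loop saw last, not the 'nuc' feature.
    return feat["type"]


def nuc_type_fixed(features):
    # Look the feature up explicitly instead of relying on a leftover name.
    return features["nuc"]["type"]


print(nuc_type_buggy(features))  # CDS
print(nuc_type_fixed(features))  # source
```

Because dicts preserve insertion order, the buggy version reports the type of `geneB`, the last entry, which is the kind of mistake that is hard to spot in a large expected-output file.
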
  9. [utils] warn if unreadable gene/CDS feature

    Previously these were either silently skipped or, in some cases, caused
    an unhandled exception to be raised.
    jameshadfield committed Dec 20, 2023
    cd7055a
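
A minimal sketch of the behaviour described in commit 9: unreadable features are skipped with a warning rather than silently, and parse errors are caught rather than propagating. The input shape (a list of dicts with `name`, `start` and `end` keys) and the function name `readable_features` are assumptions made for illustration only.

```python
import sys


def readable_features(rows):
    """Return {name: (start, end)}, warning about rows that cannot be read."""
    kept = {}
    for row in rows:
        try:
            name = row["name"]
            start, end = int(row["start"]), int(row["end"])
        except (KeyError, ValueError) as err:
            # Warn instead of skipping silently or crashing.
            print(
                f"WARNING: skipping unreadable feature {row!r}: {err!r}",
                file=sys.stderr,
            )
            continue
        kept[name] = (start, end)
    return kept
```
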
  10. changelog

    jameshadfield committed Dec 20, 2023
    c91ca33