Improve parsing of GenBank / GFF files #1351
Commits on Dec 20, 2023
[utils] refactor 'load_features', add docstrings
This follows my own advice in <#953 (comment)> and sets the stage for subsequent commits to improve the code. There are no functional changes here.
Commit 8f17677
[utils] lift error handling to load_features
Both callers of this function checked for a return value of None (indicating that the file didn't exist) and then made the augur command exit. It's cleaner to lift this check into `load_features`, which also makes it easier for that function to raise errors in the future (e.g. on malformed or empty reference files).
Commit 3b27cdc
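The change of responsibility described in this commit can be sketched as below. This is only an illustration under stated assumptions, not augur's actual code: `ReferenceFileError` and `run_command` are hypothetical names, and the body of `load_features` is a placeholder that parses nothing.

```python
import os


class ReferenceFileError(Exception):
    """Hypothetical exception type, used only for this illustration."""


def load_features(reference):
    """Return a dict of features, raising instead of returning None."""
    if not os.path.isfile(reference):
        raise ReferenceFileError(f"reference file {reference!r} not found")
    # Placeholder: real GFF / GenBank parsing would happen here.
    return {}


def run_command(reference):
    """Hypothetical caller: no `if features is None` check is needed."""
    try:
        load_features(reference)
    except ReferenceFileError as err:
        print(f"ERROR: {err}")
        return 2  # non-zero exit code
    return 0
```

With the check inside the loader, every caller gets the same behaviour for a missing file, and further checks (empty or malformed files) can be added in one place.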
[utils] only allow GFFs with one record
Also catches the edge case where a GFF has no valid rows. The printed error messages should be helpful enough to identify the GFF formatting error(s).
Commit e57e808
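A minimal sketch of the single-record check, assuming a "record" corresponds to a distinct sequence ID in the first GFF column. The function name `check_single_record` and the error wording are hypothetical, and rows without nine tab-separated columns are treated as invalid purely for illustration.

```python
def check_single_record(gff_text):
    """Return the single sequence ID in a GFF, or raise ValueError."""
    seqids = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines, comments and pragmas
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a valid feature row
        if fields[0] not in seqids:
            seqids.append(fields[0])
    if not seqids:
        raise ValueError("GFF contains no valid rows")
    if len(seqids) > 1:
        raise ValueError(
            f"GFF contains {len(seqids)} records ({', '.join(seqids)}); "
            "only one is allowed"
        )
    return seqids[0]
```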
[utils] GFF parsing doesn't depend on --genes
See comment added in code which details the previous behaviour. While silently skipping genes without the necessary attributes isn't the best solution in my opinion, it's at least now consistent (and also consistent with how we handle GenBank parsing). Closes #1349
Commit a59272c
[utils] always extract 'nuc' annotation (GFF)
This fixes a long-standing oversight in Augur where GFF files could only define the nucleotide coordinates via a row with GFF type 'source'. We now also parse the (preferred) GFF type 'region' as well as the '##sequence-region' pragma. This allows us to exit if the nuc coordinates are not defined, and the error message should help users correct their GFF files. Note that the code which guarantees that the extracted nuc coordinates will be exported in the JSON is not yet implemented. These changes required a number of tests to be updated, largely because we now parse an additional feature from GFF files (the 'nuc' feature).
Commit afa8bd6
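The three places the nucleotide coordinates can come from ('region' row, 'source' row, '##sequence-region' pragma) can be sketched as follows. This is an illustration only: `find_nuc_coordinates` is a hypothetical name, and the order in which the three sources are consulted is an assumption, not something stated in the commit.

```python
def find_nuc_coordinates(gff_text):
    """Return (start, end), 1-based inclusive as in GFF, or raise ValueError."""
    pragma = None
    rows = {}
    for line in gff_text.splitlines():
        if line.startswith("##sequence-region"):
            parts = line.split()  # pragma, seqid, start, end
            if len(parts) == 4:
                pragma = (int(parts[2]), int(parts[3]))
            continue
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) == 9 and fields[2] in ("region", "source"):
            # Keep the first row seen for each type.
            rows.setdefault(fields[2], (int(fields[3]), int(fields[4])))
    # Assumed precedence: 'region' row, then 'source' row, then the pragma.
    for candidate in (rows.get("region"), rows.get("source"), pragma):
        if candidate is not None:
            return candidate
    raise ValueError(
        "GFF does not define nucleotide coordinates: add a 'region' row, "
        "a 'source' row or a '##sequence-region' pragma"
    )
```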
[utils] always extract 'nuc' annotation (GenBank)
A companion commit to the previous one, this time using GenBank files.
Commit af62d50
[utils] Forbid gene/CDS with name 'nuc'
This name is reserved in our annotations schema to refer to the genome / segment / sequence nucleotide annotation.
Commit 580666a
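A reserved-name check of this kind might look like the sketch below. The function name `check_feature_names` and the error wording are hypothetical; only the reserved name 'nuc' comes from the commit message.

```python
RESERVED_NAME = "nuc"  # reserved for the genome / segment / sequence annotation


def check_feature_names(names):
    """Return the gene/CDS names unchanged, or raise if one is reserved."""
    names = list(names)
    if RESERVED_NAME in names:
        raise ValueError(
            f"a gene/CDS may not be named {RESERVED_NAME!r}: "
            "this name is reserved for the nucleotide annotation"
        )
    return names
```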
[translate] guarantee nuc annotation produced
This builds on the preceding 3 commits, which guarantee that a 'nuc' feature will be parsed from the reference file. We now guarantee it'll be exported in the node-data JSON. Note that the change to the TB aa_muts.json test file was due to a bug in the previous code, where `'type': feat['type']` would incorrectly reuse the last defined `feat` from the preceding loop. (I think this is a pitfall of using large "real-life" test files, as it's impractical to manually check the source-of-truth we are comparing against.) Since the 'nuc' feature is guaranteed to exist, we can also check it against the existing nuc annotation within `augur ancestral`, where applicable.
Closes #953, although there is good commentary in that issue about improving our parsing of GFFs more generally than is implemented here.
Closes #1346
Commit 1d17699
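The loop-variable pitfall mentioned in this commit message can be reproduced in a few lines. This is a hypothetical reconstruction of that kind of bug, not the code that was in augur: the function names and the feature dictionaries are invented for illustration.

```python
def buggy_annotations(features):
    """Builds annotations, but reuses the loop variable after the loop."""
    annotations = {}
    for name, feat in features.items():
        if name == "nuc":
            continue
        annotations[name] = {"type": feat["type"]}
    # Bug: `feat` still refers to the last item the loop visited,
    # not to the 'nuc' feature.
    annotations["nuc"] = {"type": feat["type"]}
    return annotations


def fixed_annotations(features):
    """Same as above, but looks the 'nuc' feature up explicitly."""
    annotations = {}
    for name, feat in features.items():
        if name == "nuc":
            continue
        annotations[name] = {"type": feat["type"]}
    annotations["nuc"] = {"type": features["nuc"]["type"]}
    return annotations


features = {
    "nuc": {"type": "source"},
    "gene1": {"type": "gene"},
    "gene2": {"type": "CDS"},
}
print(buggy_annotations(features)["nuc"])  # {'type': 'CDS'}
print(fixed_annotations(features)["nuc"])  # {'type': 'source'}
```

Python keeps the loop variable alive after the loop ends, so the buggy version runs without error and the wrong value only shows up on inspection of the output.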
[utils] warn if unreadable gene/CDS feature
These were previously either silently skipped or, in some cases, caused an unhandled exception.
Commit cd7055a
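One way to turn both failure modes (silent skipping and unhandled exceptions) into a warning is sketched below. The row format (dictionaries with `name`, `start` and `end` keys) and the function name `parse_features` are assumptions made for this example, not augur's data structures.

```python
import sys


def parse_features(rows):
    """Return {name: (start, end)}, warning about rows that cannot be read."""
    features = {}
    for row in rows:
        try:
            name = row["name"]
            start, end = int(row["start"]), int(row["end"])
        except (KeyError, ValueError) as err:
            # Warn and skip, rather than skipping silently or crashing.
            print(
                f"WARNING: skipping unreadable feature {row!r} ({err!r})",
                file=sys.stderr,
            )
            continue
        features[name] = (start, end)
    return features
```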
Commit c91ca33