-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve parsing of GenBank / GFF files #1351
Conversation
Including such information will become mandatory in Augur following <nextstrain/augur#1351>. Segment length from <https://www.ncbi.nlm.nih.gov/nuccore/CY114383>
5a1fa2c
to
17f76c1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making such a clearly ordered and documented list of commits for this PR, @jameshadfield. It is a pleasure to read through them in order and easy to follow your logic.
I didn't make it through all of the commits (only halfway through), but I wanted to share my thoughts from the first half now, to get the conversation going. I'll continue reviewing tomorrow.
CHANGES.md
Outdated
@@ -2,6 +2,14 @@ | |||
|
|||
## __NEXT__ | |||
|
|||
* ancestral, translate: Major improvements to how we parse GFF and GenBank reference files. [#1351][] (@jameshadfield) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: We should add the changes from this PR into "Major changes" or "Features", depending on whether we end up requiring sequence pragmas, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you normally categorise changes? e.g. would
For GenBank files we now require the (mandatory) source feature to be present
be a breaking change and therefore a "Major change". Similarly for things like
For GFF files, we extract the genome/sequence coordinates by inspecting the sequence-region pragma, region type and/or source type. This information is now required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd classify anything that potentially breaks an existing interface as a major change. In following description, the first sentence describes a new feature since it's adding some new functionality that didn't exist (although you could interpret this as fixing a bug in our GFF parsing 😄). The second sentence makes the change a major change since the new requirement for the nucleotide coordinates could break an existing workflow.
For GFF files, we extract the genome/sequence coordinates by inspecting the sequence-region pragma, region type and/or source type. This information is now required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Good general advice. Updated changelog accordingly.
if feat.type == "region": | ||
nuc['region'] = feat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there documentation we could cite here to explain why we look for (or prefer) a region
type? I only see this referenced once in the GFF3 spec under the "circular genomes" section. The rest of the examples in the GFF spec use the sequence region pragma.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in this thread...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! But just to make sure I understand, did you find the region
type when looking into circular genome annotations in the GFF spec or did it popup somewhere else outside of the spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen it elsewhere, but I can't remember where exactly. It's also part of the SO list. (I've never seen source used outside of Nextstrain.) When working on these issues I was surprised that GFF and (especially) GenBank specs are hard to find / somewhat incomplete.
augur/utils.py
Outdated
if 'pragma' in nuc: | ||
return nuc['pragma'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to prefer the region
entry over the pragma
entry here? I'm interpreting this to mean that if the GFF file has both entries, we take the region
entry, but maybe that's wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, as per your earlier review comment I'm happy to prefer pragma over region. But it doesn't matter very much in practice - if there are multiple entries then we guarantee they agree with each other and export one. The only difference is that one will have "type: region"
and the other "type: ##sequence-region pragma"
. Auspice doesn't read this information, and I'm guessing no scripts do either!
Similarly for the 'seqid' field in our annotations. I don't think anything uses this, and I'd like to remove it, but it feels like more trouble than it's worth as maybe there is something that uses it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. So, since it doesn't matter which one we pick, we could skip the check for the different keys and return the first value when the dict isn't empty:
if len(nuc) > 0:
return list(nuc.values())[0]
else:
raise AugurError(f"Reference {reference!r} didn't define any information we can use to create the 'nuc' annotation. You can use a line with a 'record' or 'source' GFF type or a ##sequence-region pragma.")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to use the "##sequence-region pragma" if available (part of this diff). I'd like to keep the checks for contradicting coordinates (see my comment in another thread).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jameshadfield I didn't mean to suggest replacing the check for contradicting coordinates. The code example above would replace the if/elif/else branching logic that checks for which coordinates to return. Since they are all the same, it doesn't matter which one we return and we can just pick the first one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah - and in practice this'll be deterministic because dictionary key creation order is preserved, so we could guarantee the pragma is exported if it exists, but I prefer the explicit (but more verbose) form you suggested here.
augur/utils.py
Outdated
return nuc['pragma'] | ||
if 'source' in nuc: | ||
return nuc['source'] | ||
raise AugurError("Internal error (_read_nuc_annotation_from_gff fallthrough). This is a bug in Augur's code.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we could replace this exception with the exception associated with an empty nuc
dictionary here and use an if
/elif
/else
structure to choose what to return like so:
if 'region' in nuc:
return nuc['region']
elif 'pragma' in nuc:
return nuc['pragma']
elif 'source' in nuc:
return nuc['source']
else:
raise AugurError(f"Reference {reference!r} didn't define any information we can use to create the 'nuc' annotation. You can use a line with a 'record' or 'source' GFF type or a ##sequence-region pragma.")
We shouldn't need a "fallthrough" exception then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed (part of this diff]
# Ensure the already-created nuc annotation coordinates match those parsed from the reference file | ||
if (features['nuc'].location.start+1 != anc_seqs['annotations']['nuc']['start'] or | ||
features['nuc'].location.end != anc_seqs['annotations']['nuc']['end']): | ||
raise AugurError(f"The 'nuc' annotation coordinates parsed from {args.annotation!r} ({features['nuc'].location.start+1}..{features['nuc'].location.end})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bummer that this + 1
logic needs to exist in so many places (ancestral, translate, and utils) to communicate or compare coordinates between different systems. It's the kind of logic that is easy to forget to include in future code and then lead to subtle bugs. I don't have a great suggestion to fix this except maybe encapsulating or replacing the SeqFeature
functionality so we can interact with objects that format coordinates in our expected "one-origin, inclusive" format. That's outside of the scope of this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally. Short of adding another layer on top of SeqFeature
(which might alleviate this problem but at the cost of another layer to understand) I've leaned heavily on writing tests to catch this. Auspice has this issue as well and tests were the only path to sanity!
If this were JavaScript you could monkeypatch SeqFeature
to add the method (prototype) startOneBased
, but in python you would have to create a separate subclass as far as I understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh - so you can monkeypatch things in Python. For instance, adding
from Bio.SeqFeature import FeatureLocation
def start_one_based(self):
return self.start + 1
FeatureLocation.start_one_based = start_one_based
at the top of utils.py
seemingly allows us to call <feature>.location.start_one_based()
, even when those features are created by SeqIO.read(...)
. I found this a bit surprising because that code is (somewhere) going to have its own import FeatureLocation
, so I presume Python's import logic de-duplicates those imports and thus the above code snippet refers to the one-and-only FeatureLocation
class.
Adding (or rebinding) methods also flows through to already instantiated objects. My understanding of how this works in Python isn't great (JS is much clearer for me). E.g. <instance>.<method>
is clearly pointing to the (one-and-only) class definition, but <instance>.<some_data>
is clearly pointing to an instance-specific address. My guess would be that instances have a lookup table of attributes (which in this example would include <some_data>
), and if an attribute's not there we go look for it in the class (and then the parent class etc). It's python, so it's probably using namespaces.
To complicate matters a bit more, .location.start
is an instance-attribute (of .location
) and is itself an instance of <class 'Bio.SeqFeature.ExactPosition'>
, so it couldn't be monkeypatched the way I demonstrate above. We might be able to emulate enough of the behaviour by using something like:
FeatureLocation.start_one_based = property(start_one_based)
So long story short, I'm not going to do this in this PR, but we absolutely could do this across Augur to improve the code readability.
I didn't end up having any more major comments today. Thanks again for doing all of the heavy lifting to fix this parsing logic so rigorously, @jameshadfield. Thanks also for the fix to the NA gene coordinates. |
357345f
to
84343d8
Compare
95e427a
to
454c53c
Compare
84343d8
to
03c1965
Compare
d11ba4e
to
d1f3247
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #1351 +/- ##
==========================================
+ Coverage 65.96% 66.14% +0.17%
==========================================
Files 68 68
Lines 7172 7221 +49
Branches 1761 1777 +16
==========================================
+ Hits 4731 4776 +45
Misses 2176 2176
- Partials 265 269 +4 ☔ View full report in Codecov by Sentry. |
This follows my own advice in <#953 (comment)> and sets the stage for subsequent commits to improve the code. There are no functional changes here.
Both functions which call this would check for a return value of None (indicating that the file didn't exist) and cause the augur function to exit. It's cleaner to lift this into `load_features` and this makes it easier for that function to raise errors in the future (e.g. on malformed/empty reference files).
Also catches the edge case where a GFF has no valid rows. The printed error messages should be helpful enough to identify the GFF formatting error(s).
See comment added in code which details the previous behaviour. While silently skipping genes without the necessary attributes isn't the best solution in my opinion, it's at least now consistent (and also consistent with how we handle GenBank parsing). Closes #1349
This fixes a long-standing oversight in Augur where GFF files could only define the nucleotide coordinates via a row with GFF type 'source'. We now parse the (preferred) GFF type 'region' as well as the '##sequence-region pragma'. This allows us to exit if the nuc coordinates are not defined, and the error message should help users correct their GFF files. Note that the code is not yet implemented which guarantees the extracted nuc coordinates will be exported in the JSON. The changes here resulted in a number of tests needing updated, largely due to us now parsing an additional feature from GFF files (the 'nuc' feature)
A companion commit to the previous one, this time using GenBank files.
This name is reserved in our annotations schema to refer to the genome / segment / sequence nucleotide annotation.
This builds off the preceding 3 commits which guarantee that a 'nuc' feature will be parsed from the reference file. We now guarantee it'll be exported in the node-data JSON. Note that the change to the TB aa_muts.json test file was due to a bug in the previous code, where `'type: feat['type']` would incorrectly reuse the last defined `feat` in the preceding loop. (I think this is a pitfall of using large "real-life" test files as it's impractical to manually check the source-of-truth we are comparing against.) Since the 'nuc' feature is guaranteed to exist, we can also check it against the existing nuc annotation within `augur ancestral`, where applicable. Closes #953, although there is good commentary in that issue about improving our parsing of GFFs more generally than that implemented here. Closes #1346
These were previously either silently skipped or, in some cases an unhandled exception was raised.
d1f3247
to
c91ca33
Compare
Thanks again for the review @huddlej Rebased onto master. Merging despite CI failure, as that's unrelated to this PR - it's failing while running the |
See commit messages for full details. Lots of tests added and general improvements beyond just addressing the issues closed.
Closes #1349
Closes #1346
Closes #953