Improve parsing of GenBank / GFF files #1351

jameshadfield · 2023-12-05T22:59:20Z

See commit messages for full details. Lots of tests added and general improvements beyond just addressing the issues closed.

Closes #1349
Closes #1346
Closes #953

Including such information will become mandatory in Augur following <nextstrain/augur#1351>. Segment length from <https://www.ncbi.nlm.nih.gov/nuccore/CY114383>

huddlej

Thank you for making such a clearly ordered and documented list of commits for this PR, @jameshadfield. It is a pleasure to read through them in order and easy to follow your logic.

I didn't make it through all of the commits (only halfway through), but I wanted to share my thoughts from the first half now, to get the conversation going. I'll continue reviewing tomorrow.

huddlej · 2023-12-06T23:08:31Z

CHANGES.md

@@ -2,6 +2,14 @@

 ## __NEXT__

+* ancestral, translate: Major improvements to how we parse GFF and GenBank reference files. [#1351][] (@jameshadfield)


Minor: We should add the changes from this PR into "Major changes" or "Features", depending on whether we end up requiring sequence pragmas, etc.

#1344 (comment) 😉

How do you normally categorise changes? e.g. would

For GenBank files we now require the (mandatory) source feature to be present

be a breaking change and therefore a "Major change". Similarly for things like

For GFF files, we extract the genome/sequence coordinates by inspecting the sequence-region pragma, region type and/or source type. This information is now required.

I'd classify anything that potentially breaks an existing interface as a major change. In following description, the first sentence describes a new feature since it's adding some new functionality that didn't exist (although you could interpret this as fixing a bug in our GFF parsing 😄). The second sentence makes the change a major change since the new requirement for the nucleotide coordinates could break an existing workflow.

For GFF files, we extract the genome/sequence coordinates by inspecting the sequence-region pragma, region type and/or source type. This information is now required.

Thanks! Good general advice. Updated changelog accordingly.

augur/utils.py

huddlej · 2023-12-07T00:17:55Z

augur/utils.py

+        if feat.type == "region":
+            nuc['region'] = feat


Is there documentation we could cite here to explain why we look for (or prefer) a region type? I only see this referenced once in the GFF3 spec under the "circular genomes" section. The rest of the examples in the GFF spec use the sequence region pragma.

Addressed in this thread...

Thanks! But just to make sure I understand, did you find the region type when looking into circular genome annotations in the GFF spec or did it popup somewhere else outside of the spec?

I've seen it elsewhere, but I can't remember where exactly. It's also part of the SO list. (I've never seen source used outside of Nextstrain.) When working on these issues I was surprised that GFF and (especially) GenBank specs are hard to find / somewhat incomplete.

huddlej · 2023-12-07T00:19:18Z

augur/utils.py

+    if 'pragma' in nuc:
+        return nuc['pragma']


Is there a reason to prefer the region entry over the pragma entry here? I'm interpreting this to mean that if the GFF file has both entries, we take the region entry, but maybe that's wrong?

Yeah, as per your earlier review comment I'm happy to prefer pragma over region. But it doesn't matter very much in practice - if there are multiple entries then we guarantee they agree with each other and export one. The only difference is that one will have "type: region" and the other "type: ##sequence-region pragma". Auspice doesn't read this information, and I'm guessing no scripts do either!

Similarly for the 'seqid' field in our annotations. I don't think anything uses this, and I'd like to remove it, but it feels like more trouble than it's worth as maybe there is something that uses it.

Got it. So, since it doesn't matter which one we pick, we could skip the check for the different keys and return the first value when the dict isn't empty:

if len(nuc) > 0: return list(nuc.values())[0] else: raise AugurError(f"Reference {reference!r} didn't define any information we can use to create the 'nuc' annotation. You can use a line with a 'record' or 'source' GFF type or a ##sequence-region pragma.")

changed to use the "##sequence-region pragma" if available (part of this diff). I'd like to keep the checks for contradicting coordinates (see my comment in another thread).

@jameshadfield I didn't mean to suggest replacing the check for contradicting coordinates. The code example above would replace the if/elif/else branching logic that checks for which coordinates to return. Since they are all the same, it doesn't matter which one we return and we can just pick the first one.

Yeah - and in practice this'll be deterministic because dictionary key creation order is preserved, so we could guarantee the pragma is exported if it exists, but I prefer the explicit (but more verbose) form you suggested here.

huddlej · 2023-12-07T00:24:36Z

augur/utils.py

+        return nuc['pragma']
+    if 'source' in nuc:
+        return nuc['source']
+    raise AugurError("Internal error (_read_nuc_annotation_from_gff fallthrough). This is a bug in Augur's code.")


It seems like we could replace this exception with the exception associated with an empty nuc dictionary here and use an if/elif/else structure to choose what to return like so:

if 'region' in nuc: return nuc['region'] elif 'pragma' in nuc: return nuc['pragma'] elif 'source' in nuc: return nuc['source'] else: raise AugurError(f"Reference {reference!r} didn't define any information we can use to create the 'nuc' annotation. You can use a line with a 'record' or 'source' GFF type or a ##sequence-region pragma.")

We shouldn't need a "fallthrough" exception then.

Fixed (part of this diff]

augur/utils.py

huddlej · 2023-12-07T21:16:11Z

augur/ancestral.py

+        # Ensure the already-created nuc annotation coordinates match those parsed from the reference file
+        if (features['nuc'].location.start+1 != anc_seqs['annotations']['nuc']['start'] or
+            features['nuc'].location.end != anc_seqs['annotations']['nuc']['end']):
+            raise AugurError(f"The 'nuc' annotation coordinates parsed from {args.annotation!r} ({features['nuc'].location.start+1}..{features['nuc'].location.end})"


It's a bummer that this + 1 logic needs to exist in so many places (ancestral, translate, and utils) to communicate or compare coordinates between different systems. It's the kind of logic that is easy to forget to include in future code and then lead to subtle bugs. I don't have a great suggestion to fix this except maybe encapsulating or replacing the SeqFeature functionality so we can interact with objects that format coordinates in our expected "one-origin, inclusive" format. That's outside of the scope of this PR though.

Totally. Short of adding another layer on top of SeqFeature (which might alleviate this problem but at the cost of another layer to understand) I've leaned heavily on writing tests to catch this. Auspice has this issue as well and tests were the only path to sanity!

If this were JavaScript you could monkeypatch SeqFeature to add the method (prototype) startOneBased, but in python you would have to create a separate subclass as far as I understand.

Oh - so you can monkeypatch things in Python. For instance, adding

from Bio.SeqFeature import FeatureLocation def start_one_based(self): return self.start + 1 FeatureLocation.start_one_based = start_one_based

at the top of utils.py seemingly allows us to call <feature>.location.start_one_based(), even when those features are created by SeqIO.read(...). I found this a bit surprising because that code is (somewhere) going to have its own import FeatureLocation, so I presume Python's import logic de-duplicates those imports and thus the above code snippet refers to the one-and-only FeatureLocation class.

Adding (or rebinding) methods also flows through to already instantiated objects. My understanding of how this works in Python isn't great (JS is much clearer for me). E.g. <instance>.<method> is clearly pointing to the (one-and-only) class definition, but <instance>.<some_data> is clearly pointing to an instance-specific address. My guess would be that instances have a lookup table of attributes (which in this example would include <some_data>), and if an attribute's not there we go look for it in the class (and then the parent class etc). It's python, so it's probably using namespaces.

To complicate matters a bit more, .location.start is an instance-attribute (of .location) and is itself an instance of <class 'Bio.SeqFeature.ExactPosition'>, so it couldn't be monkeypatched the way I demonstrate above. We might be able to emulate enough of the behaviour by using something like:

FeatureLocation.start_one_based = property(start_one_based)

So long story short, I'm not going to do this in this PR, but we absolutely could do this across Augur to improve the code readability.

huddlej · 2023-12-07T21:17:45Z

I didn't end up having any more major comments today. Thanks again for doing all of the heavy lifting to fix this parsing logic so rigorously, @jameshadfield. Thanks also for the fix to the NA gene coordinates.

codecov · 2023-12-20T00:59:34Z

Codecov Report

Attention: 20 lines in your changes are missing coverage. Please review.

Comparison is base (fc900aa) 65.96% compared to head (c91ca33) 66.14%.

Files	Patch %	Lines
augur/utils.py	81.05%	9 Missing and 9 partials ⚠️
augur/ancestral.py	0.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1351      +/-   ##
==========================================
+ Coverage   65.96%   66.14%   +0.17%     
==========================================
  Files          68       68              
  Lines        7172     7221      +49     
  Branches     1761     1777      +16     
==========================================
+ Hits         4731     4776      +45     
  Misses       2176     2176              
- Partials      265      269       +4

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

This follows my own advice in <#953 (comment)> and sets the stage for subsequent commits to improve the code. There are no functional changes here.

Both functions which call this would check for a return value of None (indicating that the file didn't exist) and cause the augur function to exit. It's cleaner to lift this into `load_features` and this makes it easier for that function to raise errors in the future (e.g. on malformed/empty reference files).

Also catches the edge case where a GFF has no valid rows. The printed error messages should be helpful enough to identify the GFF formatting error(s).

See comment added in code which details the previous behaviour. While silently skipping genes without the necessary attributes isn't the best solution in my opinion, it's at least now consistent (and also consistent with how we handle GenBank parsing). Closes #1349

This fixes a long-standing oversight in Augur where GFF files could only define the nucleotide coordinates via a row with GFF type 'source'. We now parse the (preferred) GFF type 'region' as well as the '##sequence-region pragma'. This allows us to exit if the nuc coordinates are not defined, and the error message should help users correct their GFF files. Note that the code is not yet implemented which guarantees the extracted nuc coordinates will be exported in the JSON. The changes here resulted in a number of tests needing updated, largely due to us now parsing an additional feature from GFF files (the 'nuc' feature)

A companion commit to the previous one, this time using GenBank files.

This name is reserved in our annotations schema to refer to the genome / segment / sequence nucleotide annotation.

This builds off the preceding 3 commits which guarantee that a 'nuc' feature will be parsed from the reference file. We now guarantee it'll be exported in the node-data JSON. Note that the change to the TB aa_muts.json test file was due to a bug in the previous code, where `'type: feat['type']` would incorrectly reuse the last defined `feat` in the preceding loop. (I think this is a pitfall of using large "real-life" test files as it's impractical to manually check the source-of-truth we are comparing against.) Since the 'nuc' feature is guaranteed to exist, we can also check it against the existing nuc annotation within `augur ancestral`, where applicable. Closes #953, although there is good commentary in that issue about improving our parsing of GFFs more generally than that implemented here. Closes #1346

These were previously either silently skipped or, in some cases an unhandled exception was raised.

jameshadfield · 2023-12-20T01:26:32Z

Thanks again for the review @huddlej

Rebased onto master.

Merging despite CI failure, as that's unrelated to this PR - it's failing while running the pathogen-repo-rsv workflow (so did #1348). Issue created at nextstrain/rsv#50

jameshadfield mentioned this pull request Dec 5, 2023

Add h3n2/na segment length to GFF nextstrain/seasonal-flu#132

Merged

jameshadfield force-pushed the james/translate/nuc-annotation branch from 5a1fa2c to 17f76c1 Compare December 5, 2023 23:13

huddlej reviewed Dec 7, 2023

View reviewed changes

jameshadfield force-pushed the james/ancestral-translate-fixes branch from 357345f to 84343d8 Compare December 11, 2023 01:11

jameshadfield force-pushed the james/translate/nuc-annotation branch 2 times, most recently from 95e427a to 454c53c Compare December 11, 2023 04:04

jameshadfield force-pushed the james/ancestral-translate-fixes branch from 84343d8 to 03c1965 Compare December 19, 2023 23:24

jameshadfield force-pushed the james/translate/nuc-annotation branch 2 times, most recently from d11ba4e to d1f3247 Compare December 20, 2023 00:40

Base automatically changed from james/ancestral-translate-fixes to master December 20, 2023 00:43

jameshadfield added 10 commits December 20, 2023 14:03

[utils] refactor 'load_features', add docstrings

8f17677

This follows my own advice in <#953 (comment)> and sets the stage for subsequent commits to improve the code. There are no functional changes here.

[utils] only allow GFFs with one record

e57e808

Also catches the edge case where a GFF has no valid rows. The printed error messages should be helpful enough to identify the GFF formatting error(s).

[utils] always extract 'nuc' annotation (GenBank)

af62d50

A companion commit to the previous one, this time using GenBank files.

[utils] Forbid gene/CDS with name 'nuc'

580666a

This name is reserved in our annotations schema to refer to the genome / segment / sequence nucleotide annotation.

[utils] warn if unreadable gene/CDS feature

cd7055a

These were previously either silently skipped or, in some cases an unhandled exception was raised.

changelog

c91ca33

jameshadfield force-pushed the james/translate/nuc-annotation branch from d1f3247 to c91ca33 Compare December 20, 2023 01:03

jameshadfield merged commit e346693 into master Dec 20, 2023
23 of 24 checks passed

jameshadfield deleted the james/translate/nuc-annotation branch December 20, 2023 01:26

corneliusroemer mentioned this pull request Sep 25, 2024

Also accept "region" as synonym of "source" for gb files? #1643

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve parsing of GenBank / GFF files #1351

Improve parsing of GenBank / GFF files #1351

jameshadfield commented Dec 5, 2023

huddlej left a comment

huddlej Dec 6, 2023

jameshadfield Dec 7, 2023

jameshadfield Dec 11, 2023

huddlej Dec 11, 2023

jameshadfield Dec 20, 2023

huddlej Dec 7, 2023

jameshadfield Dec 11, 2023

huddlej Dec 11, 2023

jameshadfield Dec 11, 2023

huddlej Dec 7, 2023

jameshadfield Dec 7, 2023 •

edited

Loading

huddlej Dec 7, 2023

jameshadfield Dec 11, 2023

huddlej Dec 11, 2023

jameshadfield Dec 11, 2023

huddlej Dec 7, 2023

jameshadfield Dec 11, 2023

huddlej Dec 7, 2023

jameshadfield Dec 7, 2023 •

edited

Loading

jameshadfield Dec 11, 2023 •

edited

Loading

huddlej commented Dec 7, 2023

codecov bot commented Dec 20, 2023 •

edited

Loading

jameshadfield commented Dec 20, 2023 •

edited

Loading

		@@ -2,6 +2,14 @@

		## __NEXT__

		* ancestral, translate: Major improvements to how we parse GFF and GenBank reference files. [#1351][] (@jameshadfield)

Improve parsing of GenBank / GFF files #1351

Improve parsing of GenBank / GFF files #1351

Conversation

jameshadfield commented Dec 5, 2023

huddlej left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jameshadfield Dec 7, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jameshadfield Dec 7, 2023 • edited Loading

Choose a reason for hiding this comment

jameshadfield Dec 11, 2023 • edited Loading

Choose a reason for hiding this comment

huddlej commented Dec 7, 2023

codecov bot commented Dec 20, 2023 • edited Loading

Codecov Report

jameshadfield commented Dec 20, 2023 • edited Loading

jameshadfield Dec 7, 2023 •

edited

Loading

jameshadfield Dec 7, 2023 •

edited

Loading

jameshadfield Dec 11, 2023 •

edited

Loading

codecov bot commented Dec 20, 2023 •

edited

Loading

jameshadfield commented Dec 20, 2023 •

edited

Loading