-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: VRS 1.3 digests may be computed for Alleles and Sequence Locations #427
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some docstrings / comments would be nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#428 will resolve failing tests in 3.10 + 3.11.
There are some minor things that we can clean up in separate PRs since you want to get this out fast. Below are my thoughts/suggestions that I think should be addressed in this PR.
src/ga4gh/vrs/models.py
Outdated
else: | ||
raise ValueError(f'{value} is not int or list.') | ||
out.append(result) | ||
return f'{{"interval":{{"end":{out[1]},"start":{out[0]},"type":"SequenceInterval"}},"sequence_id":"{self.sequenceReference.refgetAccession.split('.')[1]}","type":"SequenceLocation"}}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes that sequenceReference
is not null and that it is type SequenceReference
. We should handle cases where sequenceReference
is null and cases where sequenceReference
is an IRI
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agreed.
@@ -122,7 +122,7 @@ def parse_ga4gh_identifier(ir): | |||
raise ValueError(ir) from e | |||
|
|||
|
|||
def ga4gh_identify(vro, in_place='default'): | |||
def ga4gh_identify(vro, in_place='default', as_version=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably good to use an enum here if we are expected to support all previous versions (1.2, 1.1, 1.0) in future PRs. If not, we could probably just rename as_version
to to_1_3_version
or something and have it be a bool that defaults to False
. Just so we're not hardcoding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If supporting multiple versions and using enum, it might be good to fail fast in the beginning of the function and raise an exception if version is not supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree–intent is to support multiple versions down the road and we should leave that door open. Also agree with fail fast approach.
"""This method will return a serialized string following the conventions for | ||
Allele serialization as defined in the VRS version specified by 'as_version`.""" | ||
location_digest = self.location.compute_digest(as_version=as_version) | ||
sequence = get_pydantic_root(self.state.sequence) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to check that state
is a LiteralSequenceExpression
and raise exception if not. LengthExpression
does not have a sequence
field and not sure if we want ReferenceLengthExpression
to be converted to LSE
in 1.3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LSE or RLE is okay here; both have a sequence
field, and for small variants the RLE sequence
is populated by default. However, I agree that we want to do some processing here as a separate PR; ideally we would implement the RLE to LSE function (it should work like this) and just apply that in the case this is an RLE without a defined sequence.
Co-authored-by: Kori Kuzma <korikuzma@gmail.com>
@korikuzma I agree with your recommended changes and left comments on each. However, for the purpose of the ASHG demo this is good enough and I am going to merge. Would you kindly take on implementing the above as follow-up PRs? |
@ahwagner what's the priority for this? |
Medium-to-low priority: I would like to see these improvements implemented by the end of August. |
…ze_as_version` Building off of #382 and #427 * Add `PrevVrsVersion` enum to store previous versions of VRS that is supported for computing digests/identifiers * Updates function signatures + docstrings for ga4gh digest/serialize/identifier * Adds restrictions to `ga4gh_serialize_as_version` * For `SequenceLocation`: `sequenceReference` must be provided and must be a valid `SequenceReference` obj * For `Allele`: Only `LiteralSequenceExpression` and `ReferenceLengthExpression` are supported and must provide a `sequence` nonnull attribute.
…e_as_version` (#431) Building off of #382 and #427 * Add `PrevVrsVersion` enum to store previous versions of VRS that is supported for computing digests/identifiers * Updates function signatures + docstrings for ga4gh digest/serialize/identifier * Adds restrictions to `ga4gh_serialize_as_version` * For `SequenceLocation`: `sequenceReference` must be provided and must be a valid `SequenceReference` obj * For `Allele`: Only `LiteralSequenceExpression` and `ReferenceLengthExpression` are supported and must provide a `sequence` nonnull attribute.
Note: tests only pass for Python 3.12 due to changes in f-string implementation introduced in 3.12. These should be addressed before merge.
closes #382