Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VRS Hackathon Genotype draft #394

Merged
merged 31 commits into from
Oct 7, 2022
Merged

VRS Hackathon Genotype draft #394

merged 31 commits into from
Oct 7, 2022

Conversation

ahwagner
Copy link
Member

@ahwagner ahwagner commented Jul 11, 2022

This is a draft model of the Genotype class that was refined and tested during the ISMB 2022 VRS Hackathon.

This draft addresses #393.

TODO: add maturity labels to classes.

@ahwagner ahwagner mentioned this pull request Jul 19, 2022
Copy link
Member

@mbaudis mbaudis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from my question regarding the min 2 alleles for a haplotype - great :-)

Copy link
Contributor

@larrybabb larrybabb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these changes should be made before we approve a "1.3" release, but I'm unclear if we are just trying to get this into the main branch for now with the intention of doing a 1.3 later. If this is about approving 1.3 then I would say we need to make some of these changes. Which we should do together as they are a bit subjective and should include some discussion to finalize.

schema/defs/vrs/Genotype.rst Show resolved Hide resolved
docs/source/terms_and_model.rst Outdated Show resolved Hide resolved
schema/defs/vrs/GenotypeMember.rst Outdated Show resolved Hide resolved
schema/defs/vrs/Haplotype.rst Outdated Show resolved Hide resolved
schema/ga4gh.yaml Show resolved Hide resolved
schema/ga4gh.yaml Show resolved Hide resolved
schema/vrs-source.yaml Show resolved Hide resolved
schema/vrs-source.yaml Show resolved Hide resolved
schema/vrs-source.yaml Show resolved Hide resolved
schema/vrs-source.yaml Show resolved Hide resolved
schema/ga4gh.yaml Show resolved Hide resolved
schema/defs/vrs/Genotype.rst Outdated Show resolved Hide resolved
schema/defs/vrs/GenotypeMember.rst Show resolved Hide resolved
schema/defs/vrs/Haplotype.rst Outdated Show resolved Hide resolved
@ahwagner
Copy link
Member Author

ahwagner commented Jul 28, 2022

In addition to the above comments (thank you all!) I have two additional issues with this model, and how it would be used in VRS, that need to be addressed before I think we can merge this.

The first is an issue with GenotypeMember. I like this as a Basic Type, and I think it would be a mistake to make this identifiable as a dependent structure for representing the number of Alleles/Haplotypes. However, leaving it as non-identifiable (as it currently is) means that the Genotype.members array will not be sorted during serialization (but it should be). To address this, we must do one of the following:

  1. consider alternative structures for conveying the number/range of a given Allele / Haplotype in a Genotype
    a. possibly use the CopyNumber class instead of GenotypeMember to assign the copies/count to the MolecularVariation? I think this doesn't cleanly align with the stated purpose of a Genotype, but it does solve the above technical issue cleanly.
    b. repeat the MolecularVariation explicitly inside the Genotype (doing so would remove our ability to represent definite/indefinite range counts of a given Allele / Haplotype within the Genotype)
  2. revise our rules / add custom attributes to JSON Schema array properties indicating whether or not to preserve the order of ANY array during serialization (generalizable strategy that can help elsewhere – my preference)
  3. make GenotypeMember identifiable (ick)
    a. or make it sortable via the nested GenotypeMember.variation digest (complex)

My second issue is with the semantics and implementation guidance for this data class. We have described GenotypeMembers as "in-trans" but I think we need to be very clear about what this means. Revisiting the point @rrfreimuth raised, most modern assays allow us to estimate the number of any allele that aligns to a reference locus, even if the origin of one or more copies of that allele is at a duplicated locus elsewhere. In cancer, we often see this at high copies, but in my experience we also don't care about the semantics of "in-trans" / "phasing" and the notion of "Genotype" when evaluating such loci–just the total number of copies of that Allele system-wide.

So I think we should add some implementation guidance to accommodate this, clarifying that all Alleles / Haplotypes represented in a Genotype are presumed to be in-trans on homologous sequences, and therefore are constrained to representation on the same sequence. We should also add to our implementation guidance that no two GenotypeMember objects can have the same Allele subject (unless we change the model to accommodate my first concern above).

Alternatively, we revise the Genotype class to be a system-wide set of homologous Alleles / Haplotypes from any locus, in which case it may be appropriate to drop language regarding phasing and just leverage the existing CopyNumber structure in lieu of GenotypeMember (this is also solution 1a above).

@larrybabb
Copy link
Contributor

However, leaving it as non-identifiable (as it currently is) means that the Genotype.members array will not be sorted during serialization (but it should be).

@ahwagner I'm not happy with any of the 3 options above. Couldn't we modify the serializer/digester function to always generate a computed identifier on any "array" elements like this and use that to generate the sorted/parent digest without explicitly making the GenotypeMember truly identifiable. In others words, do the identifier work in hidden mode for the purpose of generating the next level up identifier. I don't think we want or need to expose a computed identifier for a GenotypeMember, but it doesn't mean that one can't be created behind the scenes for the purposes of keeping things consistent as we grow.

@larrybabb
Copy link
Contributor

So I think we should add some implementation guidance to accommodate this, clarifying that all Alleles / Haplotypes represented in a Genotype are presumed to be in-trans on homologous sequences, and therefore are constrained to representation on the same sequence. We should also add to our implementation guidance that no two GenotypeMember objects can have the same Allele subject (unless we change the model to accommodate my first concern above).

I think this is the right direction. But I'm not sure I really get why having alleles at different loci as GenotypeMember's is so problematic. If we are clear in that every GenotypeMember instance (including each "count" occurrence) is in-trans withe every other one within the genotype, then we should still be able to express genotype members as alleles from different loci.

Although, if we want to be super precise we could make this constraint and then if someone wants to represent a compound het, then they would need to create the haplotype for the two loci containing one "same as ref" allele such that the in-trans haplotype represented the compound het. This would guarantee that someone is clearly stating that there are two changes that are in-trans but not on the same copy. I suppose this is cleaner and avoids the ability for folks to create different forms of the same equivalent genotypes.

@ahwagner
Copy link
Member Author

However, leaving it as non-identifiable (as it currently is) means that the Genotype.members array will not be sorted during serialization (but it should be).

@ahwagner I'm not happy with any of the 3 options above. Couldn't we modify the serializer/digester function to always generate a computed identifier on any "array" elements like this and use that to generate the sorted/parent digest without explicitly making the GenotypeMember truly identifiable. In others words, do the identifier work in hidden mode for the purpose of generating the next level up identifier. I don't think we want or need to expose a computed identifier for a GenotypeMember, but it doesn't mean that one can't be created behind the scenes for the purposes of keeping things consistent as we grow.

Taking this approach is viable, but would also require us to add explicit indices for ComposedSequenceExpressions and other array structures where order is meaningful. @reece and @andreasprlic do you have any opinions on a preferred strategy here?

@larrybabb
Copy link
Contributor

larrybabb commented Jul 28, 2022

Taking this approach is viable, but would also require us to add explicit indices for ComposedSequenceExpressions and other array structures where order is meaningful. @reece and @andreasprlic do you have any opinions on a preferred strategy here?

OK. so maybe if order is meaningful we don't decompose the array members into digests and sort, but when order isn't meaningful we do. Everyone wins. right?

Honestly, it would be useful to explicitly note on any of our attributes that are arrays whether or not order is critical. Right now, it's not very transparent. I had lost track of the array structures that we decided had a meaningful order.

The other way to look at this (for consistency sake) is to say the physical message array order is never a requirement, if there is a meaningful "logical" order of things then we will clearly express that in the implementation guidance. Then we can "sort" array's with meaningful order based on the logic rules and assure that the message sender doesn't ever have to worry about getting it wrong.

@ahwagner
Copy link
Member Author

Taking this approach is viable, but would also require us to add explicit indices for ComposedSequenceExpressions and other array structures where order is meaningful. @reece and @andreasprlic do you have any opinions on a preferred strategy here?

OK. so maybe if order is meaningful we don't decompose the array members into digests and sort, but when order isn't meaningful we do. Everyone wins. right?

Honestly, it would be useful to explicitly note on any of our attributes that are arrays whether or not order is critical. Right now, it's not very transparent. I had lost track of the array structures that we decided had a meaningful order.

The other way to look at this (for consistency sake) is to say the physical message array order is never a requirement, if there is a meaningful "logical" order of things then we will clearly express that in the implementation guidance. Then we can "sort" array's with meaningful order based on the logic rules and assure that the message sender doesn't ever have to worry about getting it wrong.

I agree with this, and think we can take this a step further by adding an explicit attribute to arrays (similar to minItems, items, and other JSON Schema attributes) that tells VRS-Python and other downstream tools when to sort / not sort any given array for serialization. This is what I meant in option 2 above. To be clear, these attributes are not data properties–they would not be instantiated in messages, only present in the schema.

@larrybabb
Copy link
Contributor

@ahwagner Thank you for clarifying that. I didn't see that initially. Then #2 would be my preference also.

Copy link
Contributor

@larrybabb larrybabb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, assuming you get the one last item in.

@ahwagner ahwagner linked an issue Sep 13, 2022 that may be closed by this pull request
* In diploid organisms, there are typically two instances of each autosomal chromosome,
and therefore two instances of sequence at a particular location. Thus, Genotypes will
and therefore two instances of sequence at a particular locus. Thus, Genotypes will

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add that if the desire is to express a specific diplotype, it could be represented as a genotype of two haplotypes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Added some text for this in 40ef2af.

@@ -31,4 +31,4 @@ Some Genotype attributes are inherited from :ref:`Variation`.
* - count
- :ref:`Number` | :ref:`IndefiniteRange` | :ref:`DefiniteRange`
- 1..1
- The total number of copies of all :ref:`MolecularVariation` at this locus, MUST be greater than or equal to the sum of :ref:`GenotypeMember` copy counts. If greater than the total counts, this implies additional :ref:`MolecularVariation` that are expected to exist but are not explicitly indicated.
- The total number of copies of all :ref:`MolecularVariation` at this locus, MUST be greater than or equal to the sum of :ref:`GenotypeMember` copy counts and MUST be greater than or equal to 1. If greater than the total of GenotypeMember counts, this field describes additional :ref:`MolecularVariation` that exist but are not explicitly described.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand why the total nr of genotypemember counts can be different from the overall count. Do you have an example where this might be needed?

Copy link
Member Author

@ahwagner ahwagner Oct 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example we referenced when considering this use case was provided by @larrybabb, in this example report from eMERGE. What was found is that in these cases a heterozygous variant is reported, but no mention to the second allele (presumably reference-agree) is given.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, should we have a recommendation then how to represent knowledge about reference-state on one of the chromosomes as part of this? It feels like this is a common enough scenario so it might be good to provide more documentation.

Copy link
Contributor

@larrybabb larrybabb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work. looks spot on.

@ahwagner ahwagner mentioned this pull request Oct 7, 2022
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Genotypes
7 participants