-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VRS Hackathon Genotype draft #394
Conversation
remove ComposedSequenceExpression from 1.2, belongs in 1.3+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from my question regarding the min 2 alleles for a haplotype - great :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these changes should be made before we approve a "1.3" release, but I'm unclear if we are just trying to get this into the main branch for now with the intention of doing a 1.3 later. If this is about approving 1.3 then I would say we need to make some of these changes. Which we should do together as they are a bit subjective and should include some discussion to finalize.
In addition to the above comments (thank you all!) I have two additional issues with this model, and how it would be used in VRS, that need to be addressed before I think we can merge this. The first is an issue with
My second issue is with the semantics and implementation guidance for this data class. We have described GenotypeMembers as "in-trans" but I think we need to be very clear about what this means. Revisiting the point @rrfreimuth raised, most modern assays allow us to estimate the number of any allele that aligns to a reference locus, even if the origin of one or more copies of that allele is at a duplicated locus elsewhere. In cancer, we often see this at high copies, but in my experience we also don't care about the semantics of "in-trans" / "phasing" and the notion of "Genotype" when evaluating such loci–just the total number of copies of that Allele system-wide. So I think we should add some implementation guidance to accommodate this, clarifying that all Alleles / Haplotypes represented in a Genotype are presumed to be in-trans on homologous sequences, and therefore are constrained to representation on the same sequence. We should also add to our implementation guidance that no two GenotypeMember objects can have the same Allele subject (unless we change the model to accommodate my first concern above). Alternatively, we revise the |
@ahwagner I'm not happy with any of the 3 options above. Couldn't we modify the serializer/digester function to always generate a computed identifier on any "array" elements like this and use that to generate the sorted/parent digest without explicitly making the GenotypeMember truly identifiable. In others words, do the identifier work in hidden mode for the purpose of generating the next level up identifier. I don't think we want or need to expose a computed identifier for a GenotypeMember, but it doesn't mean that one can't be created behind the scenes for the purposes of keeping things consistent as we grow. |
I think this is the right direction. But I'm not sure I really get why having alleles at different loci as GenotypeMember's is so problematic. If we are clear in that every GenotypeMember instance (including each "count" occurrence) is in-trans withe every other one within the genotype, then we should still be able to express genotype members as alleles from different loci. Although, if we want to be super precise we could make this constraint and then if someone wants to represent a compound het, then they would need to create the haplotype for the two loci containing one "same as ref" allele such that the in-trans haplotype represented the compound het. This would guarantee that someone is clearly stating that there are two changes that are in-trans but not on the same copy. I suppose this is cleaner and avoids the ability for folks to create different forms of the same equivalent genotypes. |
Taking this approach is viable, but would also require us to add explicit indices for ComposedSequenceExpressions and other |
OK. so maybe if order is meaningful we don't decompose the array members into digests and sort, but when order isn't meaningful we do. Everyone wins. right? Honestly, it would be useful to explicitly note on any of our attributes that are arrays whether or not order is critical. Right now, it's not very transparent. I had lost track of the array structures that we decided had a meaningful order. The other way to look at this (for consistency sake) is to say the physical message array order is never a requirement, if there is a meaningful "logical" order of things then we will clearly express that in the implementation guidance. Then we can "sort" array's with meaningful order based on the logic rules and assure that the message sender doesn't ever have to worry about getting it wrong. |
I agree with this, and think we can take this a step further by adding an explicit attribute to arrays (similar to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good, assuming you get the one last item in.
* In diploid organisms, there are typically two instances of each autosomal chromosome, | ||
and therefore two instances of sequence at a particular location. Thus, Genotypes will | ||
and therefore two instances of sequence at a particular locus. Thus, Genotypes will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add that if the desire is to express a specific diplotype, it could be represented as a genotype of two haplotypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Added some text for this in 40ef2af.
@@ -31,4 +31,4 @@ Some Genotype attributes are inherited from :ref:`Variation`. | |||
* - count | |||
- :ref:`Number` | :ref:`IndefiniteRange` | :ref:`DefiniteRange` | |||
- 1..1 | |||
- The total number of copies of all :ref:`MolecularVariation` at this locus, MUST be greater than or equal to the sum of :ref:`GenotypeMember` copy counts. If greater than the total counts, this implies additional :ref:`MolecularVariation` that are expected to exist but are not explicitly indicated. | |||
- The total number of copies of all :ref:`MolecularVariation` at this locus, MUST be greater than or equal to the sum of :ref:`GenotypeMember` copy counts and MUST be greater than or equal to 1. If greater than the total of GenotypeMember counts, this field describes additional :ref:`MolecularVariation` that exist but are not explicitly described. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand why the total nr of genotypemember counts can be different from the overall count. Do you have an example where this might be needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example we referenced when considering this use case was provided by @larrybabb, in this example report from eMERGE. What was found is that in these cases a heterozygous variant is reported, but no mention to the second allele (presumably reference-agree) is given.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, should we have a recommendation then how to represent knowledge about reference-state on one of the chromosomes as part of this? It feels like this is a common enough scenario so it might be good to provide more documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice work. looks spot on.
This is a draft model of the Genotype class that was refined and tested during the ISMB 2022 VRS Hackathon.
This draft addresses #393.
TODO: add maturity labels to classes.