VRS Hackathon Genotype draft #394

ahwagner · 2022-07-11T22:32:08Z

This is a draft model of the Genotype class that was refined and tested during the ISMB 2022 VRS Hackathon.

This draft addresses #393.

TODO: add maturity labels to classes.

remove ComposedSequenceExpression from 1.2, belongs in 1.3+

schema/defs/vrs/Genotype.rst

schema/vrs.json

mbaudis

Apart from my question regarding the min 2 alleles for a haplotype - great :-)

larrybabb

I think these changes should be made before we approve a "1.3" release, but I'm unclear whether we are just trying to get this into the main branch for now, with the intention of doing a 1.3 later. If this is about approving 1.3, then I would say we need to make some of these changes. We should make them together, since they are a bit subjective and need some discussion to finalize.

schema/defs/vrs/Genotype.rst

docs/source/terms_and_model.rst

schema/defs/vrs/GenotypeMember.rst

schema/defs/vrs/Haplotype.rst

schema/ga4gh.yaml

schema/vrs-source.yaml

schema/ga4gh.yaml

schema/defs/vrs/Genotype.rst

schema/defs/vrs/GenotypeMember.rst

schema/defs/vrs/Haplotype.rst

ahwagner · 2022-07-28T04:42:48Z

In addition to the above comments (thank you all!) I have two additional issues with this model, and how it would be used in VRS, that need to be addressed before I think we can merge this.

The first is an issue with GenotypeMember. I like this as a Basic Type, and I think it would be a mistake to make this identifiable as a dependent structure for representing the number of Alleles/Haplotypes. However, leaving it as non-identifiable (as it currently is) means that the Genotype.members array will not be sorted during serialization (but it should be). To address this, we must do one of the following:

1. Consider alternative structures for conveying the number/range of a given Allele / Haplotype in a Genotype.
   a. Possibly use the CopyNumber class instead of GenotypeMember to assign the copies/count to the MolecularVariation? I think this doesn't cleanly align with the stated purpose of a Genotype, but it does solve the above technical issue cleanly.
   b. Repeat the MolecularVariation explicitly inside the Genotype (doing so would remove our ability to represent definite/indefinite range counts of a given Allele / Haplotype within the Genotype).
2. Revise our rules / add custom attributes to JSON Schema array properties indicating whether or not to preserve the order of ANY array during serialization (generalizable strategy that can help elsewhere; my preference).
3. Make GenotypeMember identifiable (ick),
   a. or make it sortable via the nested GenotypeMember.variation digest (complex).
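To make the serialization problem concrete, here is a minimal sketch in Python. It is not the VRS-Python serializer: `naive_digest` is a hypothetical helper, counts are simplified to plain integers, and the variation identifiers are placeholders. It only shows that when an unordered array is serialized in message order, two logically identical Genotypes can receive different digests.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """Truncated SHA-512 digest (first 24 bytes), base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def naive_digest(obj: dict) -> str:
    """Hypothetical digest that sorts object keys but keeps arrays in message order."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return sha512t24u(blob)


# Placeholder members; the variation values are not real VRS identifiers.
member_a = {"type": "GenotypeMember", "count": 1, "variation": "ga4gh:VA.example-A"}
member_b = {"type": "GenotypeMember", "count": 1, "variation": "ga4gh:VA.example-B"}

genotype_ab = {"type": "Genotype", "count": 2, "members": [member_a, member_b]}
genotype_ba = {"type": "Genotype", "count": 2, "members": [member_b, member_a]}

# Same members, different message order: the digests disagree.
same_digest = naive_digest(genotype_ab) == naive_digest(genotype_ba)
print(same_digest)  # False
```

Each of the three options above is a different way of making these two messages serialize identically.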

My second issue is with the semantics and implementation guidance for this data class. We have described GenotypeMembers as "in-trans", but I think we need to be very clear about what this means. Revisiting the point @rrfreimuth raised, most modern assays allow us to estimate the number of any allele that aligns to a reference locus, even if the origin of one or more copies of that allele is at a duplicated locus elsewhere. In cancer, we often see this at high copy numbers, but in my experience we also don't care about the semantics of "in-trans" / "phasing" and the notion of "Genotype" when evaluating such loci, just the total number of copies of that Allele system-wide.

So I think we should add some implementation guidance to accommodate this, clarifying that all Alleles / Haplotypes represented in a Genotype are presumed to be in-trans on homologous sequences, and therefore are constrained to representation on the same sequence. We should also add to our implementation guidance that no two GenotypeMember objects can have the same Allele subject (unless we change the model to accommodate my first concern above).
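As a rough illustration of what that guidance could look like in a validator, here is a sketch in Python. The function name `genotype_problems` and the flat `id` / `sequence_id` keys on each variation are assumptions made for the example; real Alleles and Haplotypes nest their location and sequence information more deeply.

```python
from collections import Counter


def genotype_problems(genotype: dict) -> list:
    """Return human-readable violations of the two proposed constraints."""
    problems = []
    variations = [member["variation"] for member in genotype["members"]]

    # Constraint 1: no two GenotypeMembers share the same variation.
    id_counts = Counter(v["id"] for v in variations)
    repeated = sorted(i for i, n in id_counts.items() if n > 1)
    if repeated:
        problems.append("repeated variation: " + ", ".join(repeated))

    # Constraint 2: all members are represented on the same sequence.
    sequences = sorted({v["sequence_id"] for v in variations})
    if len(sequences) > 1:
        problems.append("multiple sequences: " + ", ".join(sequences))
    return problems


def member(variation_id: str, sequence_id: str, count: int = 1) -> dict:
    """Build a simplified, hypothetical GenotypeMember for the example."""
    return {"count": count, "variation": {"id": variation_id, "sequence_id": sequence_id}}


ok = {"members": [member("allele-1", "seq-1"), member("allele-2", "seq-1")]}
bad = {"members": [member("allele-1", "seq-1"), member("allele-1", "seq-2")]}

print(genotype_problems(ok))   # []
print(genotype_problems(bad))  # both constraints are reported
```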

Alternatively, we revise the Genotype class to be a system-wide set of homologous Alleles / Haplotypes from any locus, in which case it may be appropriate to drop language regarding phasing and just leverage the existing CopyNumber structure in lieu of GenotypeMember (this is also solution 1a above).

larrybabb · 2022-07-28T14:03:19Z

> However, leaving it as non-identifiable (as it currently is) means that the Genotype.members array will not be sorted during serialization (but it should be).

@ahwagner I'm not happy with any of the 3 options above. Couldn't we modify the serializer/digester function to always generate a computed identifier on any "array" elements like this, and use that to generate the sorted/parent digest without explicitly making the GenotypeMember truly identifiable? In other words, do the identifier work in hidden mode for the purpose of generating the next-level-up identifier. I don't think we want or need to expose a computed identifier for a GenotypeMember, but that doesn't mean one can't be created behind the scenes for the purpose of keeping things consistent as we grow.
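A minimal sketch of this "hidden identifier" idea in Python, under the assumption that a member's internal digest can be taken over its sorted-key JSON form. `hidden_digest` and `genotype_digest` are hypothetical names, not VRS-Python functions, and the member contents are placeholders.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """Truncated SHA-512 digest (first 24 bytes), base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def canonical(obj) -> bytes:
    """Sorted-key, whitespace-free JSON used as digest input in this sketch."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def hidden_digest(obj: dict) -> str:
    """Internal sort key only; it is never written back into the message."""
    return sha512t24u(canonical(obj))


def genotype_digest(genotype: dict) -> str:
    """Sort members by their hidden digests, then digest the parent."""
    ordered_members = sorted(genotype["members"], key=hidden_digest)
    return sha512t24u(canonical({**genotype, "members": ordered_members}))


m1 = {"type": "GenotypeMember", "count": 1, "variation": "ga4gh:VA.example-A"}
m2 = {"type": "GenotypeMember", "count": 1, "variation": "ga4gh:VA.example-B"}

d1 = genotype_digest({"type": "Genotype", "count": 2, "members": [m1, m2]})
d2 = genotype_digest({"type": "Genotype", "count": 2, "members": [m2, m1]})
print(d1 == d2)  # True
```

The members themselves are left untouched, so no identifier is exposed on GenotypeMember; the digest exists only while the parent is being serialized.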

larrybabb · 2022-07-28T14:13:42Z

> So I think we should add some implementation guidance to accommodate this, clarifying that all Alleles / Haplotypes represented in a Genotype are presumed to be in-trans on homologous sequences, and therefore are constrained to representation on the same sequence. We should also add to our implementation guidance that no two GenotypeMember objects can have the same Allele subject (unless we change the model to accommodate my first concern above).

I think this is the right direction. But I'm not sure I really get why having alleles at different loci as GenotypeMembers is so problematic. If we are clear that every GenotypeMember instance (including each "count" occurrence) is in-trans with every other one within the genotype, then we should still be able to express genotype members as alleles from different loci.

Although, if we want to be super precise, we could make this constraint. Then, if someone wants to represent a compound het, they would need to create the haplotype for the two loci containing one "same as ref" allele, such that the in-trans haplotype represents the compound het. This would guarantee that someone is clearly stating that there are two changes that are in-trans but not on the same copy. I suppose this is cleaner and prevents folks from creating different forms of equivalent genotypes.

ahwagner · 2022-07-28T15:55:15Z

> > However, leaving it as non-identifiable (as it currently is) means that the Genotype.members array will not be sorted during serialization (but it should be).

> @ahwagner I'm not happy with any of the 3 options above. Couldn't we modify the serializer/digester function to always generate a computed identifier on any "array" elements like this, and use that to generate the sorted/parent digest without explicitly making the GenotypeMember truly identifiable? In other words, do the identifier work in hidden mode for the purpose of generating the next-level-up identifier. I don't think we want or need to expose a computed identifier for a GenotypeMember, but that doesn't mean one can't be created behind the scenes for the purpose of keeping things consistent as we grow.

Taking this approach is viable, but would also require us to add explicit indices for ComposedSequenceExpressions and other array structures where order is meaningful. @reece and @andreasprlic do you have any opinions on a preferred strategy here?

larrybabb · 2022-07-28T15:58:19Z

> Taking this approach is viable, but would also require us to add explicit indices for ComposedSequenceExpressions and other array structures where order is meaningful. @reece and @andreasprlic do you have any opinions on a preferred strategy here?

OK, so maybe if order is meaningful we don't decompose the array members into digests and sort, but when order isn't meaningful we do. Everyone wins, right?

Honestly, it would be useful to explicitly note on any of our attributes that are arrays whether or not order is critical. Right now, it's not very transparent. I had lost track of the array structures that we decided had a meaningful order.

The other way to look at this (for consistency's sake) is to say that the physical message array order is never a requirement; if there is a meaningful "logical" order of things, then we will clearly express that in the implementation guidance. Then we can "sort" arrays with meaningful order based on the logic rules and ensure that the message sender never has to worry about getting it wrong.

ahwagner · 2022-07-28T16:52:50Z

> > Taking this approach is viable, but would also require us to add explicit indices for ComposedSequenceExpressions and other array structures where order is meaningful. @reece and @andreasprlic do you have any opinions on a preferred strategy here?

> OK, so maybe if order is meaningful we don't decompose the array members into digests and sort, but when order isn't meaningful we do. Everyone wins, right?

> Honestly, it would be useful to explicitly note on any of our attributes that are arrays whether or not order is critical. Right now, it's not very transparent. I had lost track of the array structures that we decided had a meaningful order.

> The other way to look at this (for consistency's sake) is to say that the physical message array order is never a requirement; if there is a meaningful "logical" order of things, then we will clearly express that in the implementation guidance. Then we can "sort" arrays with meaningful order based on the logic rules and ensure that the message sender never has to worry about getting it wrong.

I agree with this, and think we can take this a step further by adding an explicit attribute to arrays (similar to minItems, items, and other JSON Schema attributes) that tells VRS-Python and other downstream tools when to sort / not sort any given array for serialization. This is what I meant in option 2 above. To be clear, these attributes are not data properties; they would not be instantiated in messages, only present in the schema.
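A sketch of how a downstream tool might consume such a schema-level attribute, in Python. The keyword name `ordered`, the `SCHEMA` table, and the `normalize` function are all assumptions for illustration; this thread does not fix the attribute name, and the property names shown are simplified.

```python
import json

# Hypothetical schema fragment: "ordered" lives in the schema, never in messages.
SCHEMA = {
    "Genotype": {"members": {"type": "array", "ordered": False}},
    "ComposedSequenceExpression": {"components": {"type": "array", "ordered": True}},
}


def normalize(obj):
    """Return a copy of obj with every array marked ordered=False sorted."""
    if isinstance(obj, list):
        return [normalize(item) for item in obj]
    if not isinstance(obj, dict):
        return obj
    props = SCHEMA.get(obj.get("type"), {})
    out = {}
    for key, value in obj.items():
        value = normalize(value)
        # Arrays without an annotation keep message order in this sketch.
        if isinstance(value, list) and not props.get(key, {}).get("ordered", True):
            value = sorted(value, key=lambda item: json.dumps(item, sort_keys=True))
        out[key] = value
    return out


g1 = {"type": "Genotype", "members": [{"variation": "B"}, {"variation": "A"}]}
g2 = {"type": "Genotype", "members": [{"variation": "A"}, {"variation": "B"}]}
cse = {"type": "ComposedSequenceExpression", "components": ["second", "first"]}

print(normalize(g1) == normalize(g2))   # True
print(normalize(cse)["components"])     # ['second', 'first']
```

Because the decision is read from the schema, the same serializer handles both unordered arrays such as Genotype.members and order-bearing arrays such as those in ComposedSequenceExpression, without explicit indices in the message.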

larrybabb · 2022-07-28T17:25:09Z

@ahwagner Thank you for clarifying that. I didn't see that initially. Then #2 would be my preference also.

larrybabb

looks good, assuming you get the one last item in.

andreasprlic · 2022-10-03T21:34:32Z

docs/source/terms_and_model.rst

 * In diploid organisms, there are typically two instances of each autosomal chromosome,
-  and therefore two instances of sequence at a particular location. Thus, Genotypes will
+  and therefore two instances of sequence at a particular locus. Thus, Genotypes will


I'd add that if the desire is to express a specific diplotype, it could be represented as a genotype of two haplotypes.

ahwagner · 2022-10-03

Sure! Added some text for this in 40ef2af.
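For illustration only, a diplotype expressed as a Genotype of two Haplotypes might be assembled as below. This is a Python sketch: the helper names are hypothetical, Alleles are referenced by placeholder identifiers, and sorting the Haplotype members is an assumption made so that equal Haplotypes compare equal.

```python
def number(value: int) -> dict:
    """Build a Number count object."""
    return {"type": "Number", "value": value}


def haplotype(allele_ids: list) -> dict:
    """Build a simplified Haplotype; this PR restricts Haplotypes to 2+ members."""
    if len(allele_ids) < 2:
        raise ValueError("a Haplotype needs at least two members")
    return {"type": "Haplotype", "members": sorted(allele_ids)}


def diplotype(hap1: dict, hap2: dict) -> dict:
    """Build a Genotype with total count 2 from two Haplotypes."""
    if hap1 == hap2:
        # Identical Haplotypes collapse into one member with count 2.
        members = [{"type": "GenotypeMember", "count": number(2), "variation": hap1}]
    else:
        members = [
            {"type": "GenotypeMember", "count": number(1), "variation": hap}
            for hap in (hap1, hap2)
        ]
    return {"type": "Genotype", "count": number(2), "members": members}


hap_x = haplotype(["ga4gh:VA.example-1", "ga4gh:VA.example-2"])
hap_y = haplotype(["ga4gh:VA.example-3", "ga4gh:VA.example-4"])

heterozygous = diplotype(hap_x, hap_y)
homozygous = diplotype(hap_x, hap_x)
print(len(heterozygous["members"]), len(homozygous["members"]))  # 2 1
```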

andreasprlic · 2022-10-03T21:39:25Z

schema/defs/vrs/Genotype.rst

@@ -31,4 +31,4 @@ Some Genotype attributes are inherited from :ref:`Variation`.
   *  - count
      - :ref:`Number` | :ref:`IndefiniteRange` | :ref:`DefiniteRange`
      - 1..1
-      - The total number of copies of all :ref:`MolecularVariation` at this locus, MUST be greater than or equal to the sum of :ref:`GenotypeMember` copy counts. If greater than the total counts, this implies additional  :ref:`MolecularVariation` that are expected to exist but are not explicitly indicated.
+      - The total number of copies of all :ref:`MolecularVariation` at this locus, MUST be greater than or equal to the sum of :ref:`GenotypeMember` copy counts and MUST be greater than or equal to 1. If greater than the total of GenotypeMember counts, this field describes  additional :ref:`MolecularVariation` that exist but are not  explicitly described.


I am not sure I understand why the total number of GenotypeMember counts can be different from the overall count. Do you have an example where this might be needed?

ahwagner · 2022-10-03

The example we referenced when considering this use case was provided by @larrybabb, in this example report from eMERGE. What was found is that in these cases a heterozygous variant is reported, but no mention of the second allele (presumably reference-agree) is made.

andreasprlic · 2022-10-04

OK, should we then have a recommendation for how to represent knowledge about reference state on one of the chromosomes as part of this? This feels like a common enough scenario that it would be good to provide more documentation.
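The count rule from the diff above can be sketched as a small check in Python. The example assumes every count is a plain Number (DefiniteRange and IndefiniteRange are not handled), and `unspecified_copies` is a hypothetical helper name.

```python
def number(value: int) -> dict:
    """Build a Number count object."""
    return {"type": "Number", "value": value}


def unspecified_copies(genotype: dict) -> int:
    """Validate Genotype.count and return how many copies are not described."""
    total = genotype["count"]["value"]
    described = sum(member["count"]["value"] for member in genotype["members"])
    if total < 1:
        raise ValueError("Genotype.count MUST be greater than or equal to 1")
    if total < described:
        raise ValueError("Genotype.count MUST be >= the sum of GenotypeMember counts")
    return total - described


# Heterozygous report: one variant copy is described, the second allele
# (presumably reference-agree) is not mentioned.
het_report = {
    "type": "Genotype",
    "count": number(2),
    "members": [
        {"type": "GenotypeMember", "count": number(1), "variation": "ga4gh:VA.example"},
    ],
}

print(unspecified_copies(het_report))  # 1
```

If the reference-state copy is known rather than presumed, it could instead be added as a second GenotypeMember, in which case the function returns 0.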

larrybabb

nice work. looks spot on.

ahwagner added 5 commits February 24, 2022 12:14

remove ComposedSequenceExpression from 1.2, belongs in 1.3+

ddecafe

Merge pull request #380 from ga4gh/1.2.2-patch

f44aef1

remove ComposedSequenceExpression from 1.2, belongs in 1.3+

update requirements for RTD builds

beb269a

restricting Haplotypes to 2+ members (Tristan)

7364d01

add Genotype

26dfffc

ahwagner requested review from andreasprlic and larrybabb as code owners July 11, 2022 22:32

ahwagner added 5 commits July 12, 2022 16:46

add defaults

c4192bf

squash that bug

7d6582c

fixed genotype molecularvariation construction error

9dc4b3b

Merge branch 'swsu' of github.com:ga4gh/vrs into swsu

17bdd17

genotype prefix

fd7a3b8

ahwagner mentioned this pull request Jul 19, 2022

Implement Genotypes #202

Closed

andreasprlic reviewed Jul 19, 2022

schema/defs/vrs/Genotype.rst (resolved)

mbaudis reviewed Jul 20, 2022

schema/vrs.json (resolved)

mbaudis approved these changes Jul 20, 2022

larrybabb reviewed Jul 20, 2022

reece approved these changes Jul 22, 2022

schema/ga4gh.yaml (resolved)

schema/defs/vrs/Genotype.rst (outdated, resolved)

schema/defs/vrs/GenotypeMember.rst (resolved)

schema/defs/vrs/Haplotype.rst (outdated, resolved)

use strict mode from gks.metaschema

72e0933

This was referenced Jul 25, 2022

Moving to a generic TextConcept entity #395

Closed

Deprecate CopyNumber #396

Closed

ahwagner mentioned this pull request Jul 29, 2022

Create a GenotypeMembers anchor #397

Closed

update Genotype definition

2d2532d

ahwagner added 2 commits September 12, 2022 10:58

add definition

983f8b3

update inheritance model

36b1268

ahwagner requested review from larrybabb and andreasprlic September 12, 2022 16:05

ahwagner added 3 commits September 12, 2022 13:45

update note

6edd4e7

remove note intro

ed3306a

reverse the VRS

429087e

wesleygoar approved these changes Sep 12, 2022

larrybabb approved these changes Sep 12, 2022

addresses #394 (comment)

63d86d1

ahwagner linked an issue Sep 13, 2022 that may be closed by this pull request

Implement Genotypes #202

Closed

closes #401

3984b52

ahwagner requested a review from larrybabb October 3, 2022 21:03

andreasprlic reviewed Oct 3, 2022

addresses #394 (comment)

40ef2af

ahwagner requested a review from andreasprlic October 3, 2022 22:51

larrybabb approved these changes Oct 4, 2022

ahwagner added 5 commits October 7, 2022 10:52

merge main

fa70d4e

update metaschema proc version

ffdc666

enable pre-releases

0f72199

fix pip install command

90893a4

merge gt & rcn docs

2a718ab

ahwagner mentioned this pull request Oct 7, 2022

review documentation for new classes #403

Closed

4 tasks

ahwagner added 2 commits October 7, 2022 11:11

Merge branch 'main' into swsu

013fe65

update vrs.yaml

92e88fd

ahwagner merged commit 1750ddb into main Oct 7, 2022

ahwagner deleted the swsu branch October 7, 2022 15:22

This was referenced Nov 3, 2022

Update digest serialization rules in docs #410

Closed

indicating sort behavior across VRS classes #411

Closed
