VRS Hackathon Genotype draft #394

Merged
merged 31 commits into from
Oct 7, 2022
ddecafe
remove ComposedSequenceExpression from 1.2, belongs in 1.3+
ahwagner Feb 24, 2022
f44aef1
Merge pull request #380 from ga4gh/1.2.2-patch
ahwagner Feb 25, 2022
beb269a
update requirements for RTD builds
ahwagner Mar 28, 2022
7364d01
restricting Haplotypes to 2+ members (Tristan)
ahwagner Jul 11, 2022
26dfffc
add Genotype
ahwagner Jul 11, 2022
c4192bf
add defaults
ahwagner Jul 12, 2022
7d6582c
squash that bug
ahwagner Jul 12, 2022
9dc4b3b
fixed genotype molecularvariation construction error
ahwagner Jul 12, 2022
17bdd17
Merge branch 'swsu' of github.com:ga4gh/vrs into swsu
ahwagner Jul 12, 2022
fd7a3b8
genotype prefix
ahwagner Jul 12, 2022
72e0933
use strict mode from gks.metaschema
ahwagner Jul 24, 2022
2d2532d
update Genotype definition
ahwagner Jul 29, 2022
40909a3
update genotypemember definition. closees #397
ahwagner Jul 29, 2022
2def58c
update docs
ahwagner Sep 12, 2022
dc2c85a
update tests
ahwagner Sep 12, 2022
969a52b
remove unused imports
ahwagner Sep 12, 2022
983f8b3
add definition
ahwagner Sep 12, 2022
36b1268
update inheritance model
ahwagner Sep 12, 2022
6edd4e7
update note
ahwagner Sep 12, 2022
ed3306a
remove note intro
ahwagner Sep 12, 2022
429087e
reverse the VRS
ahwagner Sep 12, 2022
63d86d1
addresses https://github.com/ga4gh/vrs/pull/394#discussion_r932147364
ahwagner Sep 13, 2022
3984b52
closes #401
ahwagner Oct 3, 2022
40ef2af
addresses https://github.com/ga4gh/vrs/pull/394#discussion_r986247682
ahwagner Oct 3, 2022
fa70d4e
merge main
ahwagner Oct 7, 2022
ffdc666
update metaschema proc version
ahwagner Oct 7, 2022
0f72199
enable pre-releases
ahwagner Oct 7, 2022
90893a4
fix pip install command
ahwagner Oct 7, 2022
2a718ab
merge gt & rcn docs
ahwagner Oct 7, 2022
013fe65
Merge branch 'main' into swsu
ahwagner Oct 7, 2022
92e88fd
update vrs.yaml
ahwagner Oct 7, 2022
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
Expand Up @@ -20,7 +20,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip setuptools
pip install -r .requirements.txt
pip install --pre -r .requirements.txt

- name: Test with pytest
run: |
Expand Down
2 changes: 1 addition & 1 deletion .requirements.txt
Expand Up @@ -3,5 +3,5 @@ python-jsonschema-objects>=0.4.0
jsonschema==3.2.0
ipython
pyyaml
ga4gh.gks.metaschema>=0.1.1
ga4gh.gks.metaschema==0.2.0rc4
sphinx ~= 3.5
123 changes: 0 additions & 123 deletions docs/source/appendices/future_plans.rst
Expand Up @@ -96,129 +96,6 @@ Under consideration. See https://github.com/ga4gh/vrs/issues/28.
t(9;22)(q34;q11) in BCR-ABL


.. _genotype:

Genotype
########

The genetic state of an organism, whether complete (defined over the
whole genome) or incomplete (defined over a subset of the genome).

**Computational definition**

A list of Haplotypes.

**Information model**

.. list-table::
:class: reece-wrap
:header-rows: 1
:align: left
:widths: auto

* - Field
- Type
- Limits
- Description
* - _id
- :ref:`CURIE`
- 0..1
- Variation Id; MUST be unique within document
* - type
- string
- 1..1
- Variation type; MUST be set to '**Genotype**'
* - completeness
- enum
- 1..1
- Declaration of completeness of the Haplotype definition.
Values are:

* UNKNOWN: Other Haplotypes may exist.
* PARTIAL: Other Haplotypes exist but are unspecified.
* COMPLETE: The Genotype declares a complete set of Haplotypes.

* - members
- :ref:`Haplotype`\[] or :ref:`CURIE`\[]
- 0..*
- List of Haplotypes or Haplotype identifiers; length MUST agree
with ploidy of genomic region


**Implementation guidance**

* Haplotypes in a Genotype MAY occur at different locations or on
different reference sequences. For example, an individual may have
haplotypes on two population-specific references.
* Haplotypes in a Genotype MAY contain differing numbers of Alleles or
Alleles at different Locations.

**Notes**

* The term "genotype" has two, related definitions in common use. The
narrower definition is a set of alleles observed at a single
location and with a ploidy of two, such as a pair of single residue
variants on an autosome. The broader, generalized definition is a
set of alleles at multiple locations and/or with ploidy other than
two. The VRS Genotype entity is based on this broader definition.
* The term "diplotype" is often used to refer to two haplotypes. The
VRS Genotype entity subsumes the conventional definition of
diplotype. Therefore, the VRS model does not include an explicit
entity for diplotypes. See :ref:`this note
<genotypes-represent-haplotypes-with-arbitrary-ploidy>` for a
discussion.
* The VRS model makes no assumptions about ploidy of an organism or
individual. The number of Haplotypes in a Genotype is the observed
ploidy of the individual.
* In diploid organisms, there are typically two instances of each
autosomal chromosome, and therefore two instances of sequence at a
particular location. Thus, Genotypes will often list two
Haplotypes. In the case of haploid chromosomes or
haploinsufficiency, the Genotype consists of a single Haplotype.
* A consequence of the computational definition is that Haplotypes at
overlapping or adjacent intervals MUST NOT be included in the same
Genotype. However, two or more Alleles MAY always be rewritten as an
equivalent Allele with a common sequence and interval context.
* The rationale for permitting Genotypes with Haplotypes defined on
different reference sequences is to enable the accurate
representation of segments of DNA with the most appropriate
population-specific reference sequence.

**Sources**

SO: `Genotype (SO:0001027)
<http://www.sequenceontology.org/browser/current_svn/term/SO:0001027>`__
— A genotype is a variant genome, complete or incomplete.

.. _genotypes-represent-haplotypes-with-arbitrary-ploidy:

.. note:: Genotypes represent Haplotypes with arbitrary ploidy
The VRS defines Haplotypes as a list of Alleles, and Genotypes as
a list of Haplotypes. In essence, Haplotypes and Genotypes represent
two distinct dimensions of containment: Haplotypes represent the "in
phase" relationship of Alleles while Genotypes represents sets of
Haplotypes of arbitrary ploidy.

There are two important consequences of these definitions: There is no
single-location Genotype. Users of SNP data will be familiar with
representations like rs7412 C/C, which indicates the diploid state at
a position. In the VRS, this is merely a special case of a
Genotype with two Haplotypes, each of which is defined with only one
Allele (the same Allele in this case). The VRS does not define a
diplotype type. A diplotype is a special case of a VRS Genotype
with exactly two Haplotypes. In practice, software data types that
assume a ploidy of 2 make it very difficult to represent haploid
states, copy number loss, and copy number gain, all of which occur
when representing human data. In addition, assuming ploidy=2 makes
software incompatible with organisms with other ploidy. The VRS
makes no assumptions about "normal" ploidy.

In other words, the VRS does not represent single-position
Genotypes or diplotypes because both concepts are subsumed by the
Allele, Haplotype, and Genotypes entities.



.. _GitHub issue: https://github.com/ga4gh/vrs/issues
.. _genetic variation: https://en.wikipedia.org/wiki/Genetic_variation

Expand Down
141 changes: 94 additions & 47 deletions docs/source/terms_and_model.rst
Expand Up @@ -267,11 +267,8 @@ genetic markers that tend to be transmitted together.
* The locations of Alleles within the Haplotype MUST be interpreted
independently. Alleles that create a net insertion or deletion of
sequence MUST NOT change the location of "downstream" Alleles.
* The `members` attribute is required and MUST contain at least one
Allele.
* Haplotypes with one Allele are intended to be distinct entities from
the Allele by itself. See discussion on :ref:`equivalence`.

* The `members` attribute is required and MUST contain at least two
Alleles.
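
The tightened constraint in this hunk can be sketched as a small check. This is an illustration only: the plain-dict layout, the helper name `check_haplotype_members`, and the placeholder identifiers are assumptions, not part of the VRS schema tooling.

```python
def check_haplotype_members(haplotype: dict) -> None:
    """Raise ValueError unless `members` is a list with at least two entries."""
    members = haplotype.get("members")
    if not isinstance(members, list) or len(members) < 2:
        raise ValueError("Haplotype.members MUST contain at least two Alleles")


# Members are shown as placeholder CURIE-style strings (hypothetical values).
check_haplotype_members(
    {"type": "Haplotype", "members": ["ga4gh:VA.example1", "ga4gh:VA.example2"]}
)
```

A Haplotype with a single member fails this check, which matches the removal of the single-Allele Haplotype bullet in the same hunk.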

**Sources**

Expand Down Expand Up @@ -435,6 +432,90 @@ Low-level copy gain of BRCA1:
"type": "RelativeCopyNumber"
}

.. _genotype:

Genotype
$$$$$$$$

A *genotype* is a representation of the variants present at a given genomic locus, and may be referred
to either by individual nucleotide representations (e.g. GT representation in VCF files) or symbolically
(e.g. A/B/O blood type reporting). To support these use cases, VRS genotypes enable representation of
genotypes using either :ref:`Allele` objects (as commonly done in VCF records) or larger :ref:`Haplotype`
objects (which would otherwise be represented using symbolic shorthand).

.. include:: defs/Genotype.rst

**Implementation guidance**

* Haplotypes or Alleles in :ref:`GenotypeMember` objects MAY occur at different locations or on
different reference sequences. For example, an individual may have haplotypes on two
population-specific references.

**Notes**

* The term "genotype" has two, related definitions in common use. The
narrower definition is a set of alleles observed at a single
location and often with a ploidy of two, such as a pair of single residue
variants on an autosome. The broader, generalized definition is a
set of alleles at multiple locations and/or with ploidy other than
two. The VRS Genotype entity is based on this broader definition.
* The term "diplotype" is often used to refer to two in-trans haplotypes at a locus.
The VRS Genotype entity subsumes the conventional definition of diplotype, though
it describes no explicit in-trans phase relationship. Therefore,
VRS does not include an explicit entity for diplotypes. See :ref:`this note
<genotypes-represent-haplotypes-with-arbitrary-ploidy>` for a discussion.
* VRS makes no assumptions about ploidy of an organism or individual nor any
polysomy affecting a locus. The `genotype.count` attribute explicitly captures the total
count of molecules associated with a genomic locus represented by the Genotype.
* In diploid organisms, there are typically two instances of each autosomal chromosome,
and therefore two instances of sequence at a particular locus. Thus, Genotypes will

I'd add that if the desire is to express a specific diplotype, it could be represented as a genotype of two haplotypes.

Member Author

Sure! Added some text for this in 40ef2af.

often list two GenotypeMembers each based on a distinct Haplotype or Allele. In the case
of haploid chromosomes or haploinsufficiency, the Genotype consists of a single GenotypeMember.
* A specific (heterozygous) diplotype SHOULD be represented as a Genotype of two GenotypeMember
instances each containing a constituent :ref:`Haplotype`. A homozygous diplotype SHOULD be
represented as a Genotype of one constituent GenotypeMember (with `GenotypeMember.count=2`).
* A consequence of the computational definition is that in-cis Haplotypes at overlapping or
adjacent intervals MUST be merged into a single Haplotype for the same Genotype.
* A `GenotypeMember.variation` value MUST be unique among GenotypeMembers within a Genotype.
When more than one GenotypeMember would have the same `variation` value (e.g. in the case
of a homozygous variant), it is represented as a single GenotypeMember with a corresponding
`count` (e.g. for a diploid homozygous variant, `GenotypeMember.count = 2`).
* The rationale for permitting Genotypes with Haplotypes defined on different reference
sequences is to enable the accurate representation of segments of DNA with the most
appropriate population-specific reference sequence.
* Deletion of sequence at a locus is represented by the presence of Alleles of deleted
sequence, not by the absence of Alleles; therefore Genotypes MUST NOT have count < 1.
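
The diplotype and uniqueness notes above can be sketched with plain dictionaries. This is a minimal illustration under stated assumptions: `GenotypeMember.variation`, `GenotypeMember.count` and `genotype.count` come from the text, while the `members` key, the `Number` wrapper layout, the helper names and the placeholder identifiers are assumed here and may differ from the generated `defs/Genotype.rst`.

```python
def make_member(variation: str, copies: int) -> dict:
    """Build a GenotypeMember-like dict; `variation` is a CURIE-style reference."""
    return {
        "type": "GenotypeMember",
        "variation": variation,
        "count": {"type": "Number", "value": copies},
    }


def make_genotype(members: list, copies: int) -> dict:
    """Build a Genotype-like dict, enforcing two of the rules from the notes."""
    if copies < 1:
        raise ValueError("Genotype count must be at least 1")
    seen = set()
    for member in members:
        # Each `variation` may appear only once; repeats are folded into `count`.
        if member["variation"] in seen:
            raise ValueError("duplicate variation; use one member with a higher count")
        seen.add(member["variation"])
    return {
        "type": "Genotype",
        "members": members,
        "count": {"type": "Number", "value": copies},
    }


# Heterozygous diplotype: two members, one copy each (placeholder identifiers).
het = make_genotype(
    [make_member("ga4gh:HT.exampleA", 1), make_member("ga4gh:HT.exampleB", 1)], 2
)

# Homozygous diplotype: a single member carrying both copies.
hom = make_genotype([make_member("ga4gh:HT.exampleA", 2)], 2)
```

The sketch keys uniqueness on string references; inline Haplotype or Allele objects would need a canonical identifier before they could be compared this way.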

**Sources**

SO: `Genotype (SO:0001027)
<http://www.sequenceontology.org/browser/current_svn/term/SO:0001027>`__
— A genotype is a variant genome, complete or incomplete.

.. _genotypes-represent-haplotypes-with-arbitrary-ploidy:

.. note::
VRS defines Genotypes using a list of GenotypeMembers defined by
Haplotypes or Alleles. In essence, Haplotypes and Genotypes represent
two distinct dimensions of containment: Haplotypes represent the "in
phase" relationship of Alleles while Genotypes represent sets of
Haplotypes of arbitrary ploidy.

There are two important consequences of these definitions: There is no
single-location Genotype. Users of SNP data will be familiar with
representations like rs7412 C/C, which indicates the diploid state at
a position. In VRS, this is merely a special case of a
Genotype with one GenotypeMember, defined by a single Allele with
two copies. VRS does not define a diplotype class. A diplotype
is a special case of a VRS Genotype with count = 2. In practice, software
data types that assume a ploidy of 2 make it very difficult to represent haploid
states, copy number loss, and copy number gain, all of which occur
when representing human data. In addition, inferred ploidy = 2 makes
software incompatible with organisms with other ploidy. VRS
requires explicit definition of the count of molecules associated with
a genomic locus using the `count` attribute, though this count may be inexact
(e.g. a :ref:`DefiniteRange` or :ref:`IndefiniteRange`).
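
As a sketch of the `count` forms mentioned in the note, the helper below renders an exact or inexact count as text. The field names `min`, `max` and `comparator`, the `members` key, the helper name `describe_count` and the placeholder identifier are assumptions for illustration; only `count`, `value` and the three type names appear in the text of this diff.

```python
def describe_count(count: dict) -> str:
    """Render a Number, DefiniteRange or IndefiniteRange count as text."""
    kind = count["type"]
    if kind == "Number":
        return f"exactly {count['value']}"
    if kind == "DefiniteRange":
        return f"between {count['min']} and {count['max']}"
    if kind == "IndefiniteRange":
        return f"{count['comparator']} {count['value']}"
    raise ValueError(f"unsupported count type: {kind}")


# rs7412 C/C as described in the note: one member, one Allele, two copies.
rs7412_cc = {
    "type": "Genotype",
    "count": {"type": "Number", "value": 2},
    "members": [
        {
            "type": "GenotypeMember",
            "variation": "ga4gh:VA.placeholder",  # hypothetical identifier
            "count": {"type": "Number", "value": 2},
        }
    ],
}

print(describe_count(rs7412_cc["count"]))  # exactly 2
```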

.. _UtilityVariation:

Utility Variation
Expand Down Expand Up @@ -946,47 +1027,6 @@ large-scale tandem duplications.
"type": "RepeatedSequenceExpression"
}

.. _ComposedSequenceExpression:

ComposedSequenceExpression
##########################

*Composed Sequence* is a class of sequence expression where two or more
constitutive sequence expressions are expressed as an ordered list,
representing a concatenated sequence. This class is useful for expressing
concepts such as the OPMD polyalanine alleles [2]_.

.. [2] Brais b, et al. *Short CCG expansions in the PABP2 gene cause
oculopharyngeal muscular dystrophy* Nat Genet. (1998).

.. include:: defs/ComposedSequenceExpression.rst

**Examples**

.. parsed-literal::

{
"type": "ComposedSequenceExpression",
"components": [
{
"type": "RepeatedSequenceExpression",
"seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
"count": { "type": "Number", "value": 11 }
},
{
"type": "RepeatedSequenceExpression",
"seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCA" },
"count": { "type": "Number", "value": 3 }
},
{
"type": "RepeatedSequenceExpression",
"seq_expr": { "type": "LiteralSequenceExpression", "sequence": "GCG" },
"count": { "type": "Number", "value": 1 }
}
]
}


.. _Feature:

Feature
Expand Down Expand Up @@ -1113,6 +1153,13 @@ This value is equivalent to the concept of "equal to or greater than
"value": 22
}

.. _genotypemember:

GenotypeMember
##############

.. include:: defs/GenotypeMember.rst

Primitives
@@@@@@@@@@

Expand Down Expand Up @@ -1222,7 +1269,7 @@ derived from the IUPAC one-letter nucleic acid and amino acid codes.
to define an :ref:`Allele`. A Sequence that replaces another Sequence is
called a "replacement sequence".
* In some contexts outside VRS, "reference sequence" may refer
to a member of set of sequences that comprise a genome assembly. In the VRS
to a member of set of sequences that comprise a genome assembly. In VRS
specification, any sequence may be a "reference sequence", including those in
a genome assembly.
* For the purposes of representing sequence variation, it is not
Expand Down
8 changes: 4 additions & 4 deletions schema/Makefile
Expand Up @@ -6,13 +6,13 @@
JSYAMLS:=vrs.yaml
JSONS:=${JSYAMLS:.yaml=.json}

all: vrs.json defs
all: ${JSONS} defs

vrs.json: vrs.yaml
%.json: %.yaml
jsy2js.py <$< >$@

vrs.yaml: vrs-source.yaml
source2jsy.py <$< >$@
%.yaml: %-source.yaml
source2jsy.py $< >$@

defs:
rm -rf defs
Expand Down
6 changes: 3 additions & 3 deletions schema/defs/vrs/AbsoluteCopyNumber.rst
@@ -1,6 +1,6 @@
**Computational Definition**

The absolute count of discrete copies of a :ref:`MolecularVariation`, :ref:`Feature`, :ref:`SequenceExpression`, or a :ref:`CURIE` reference within a system (e.g. genome, cell, etc.).
The absolute count of discrete copies of a :ref:`Location`, within a system (e.g. genome, cell, etc.).

**Information Model**

Expand All @@ -25,9 +25,9 @@ Some AbsoluteCopyNumber attributes are inherited from :ref:`Variation`.
- 1..1
- MUST be "AbsoluteCopyNumber"
* - subject
- :ref:`MolecularVariation` | :ref:`Feature` | :ref:`SequenceExpression` | :ref:`CURIE`
- :ref:`Location` | :ref:`CURIE`
- 1..1
- Subject of the Copy Number object
- A location for which the number of systemic copies is described.
* - copies
- :ref:`Number` | :ref:`IndefiniteRange` | :ref:`DefiniteRange`
- 1..1
Expand Down
6 changes: 2 additions & 4 deletions schema/defs/vrs/ComposedSequenceExpression.rst
Expand Up @@ -4,8 +4,6 @@ An expression of a sequence composed from multiple other :ref:`Sequence Expressi

**Information Model**

Some ComposedSequenceExpression attributes are inherited from :ref:`SequenceExpression`.

.. list-table::
:class: clean-wrap
:header-rows: 1
Expand All @@ -18,9 +16,9 @@ Some ComposedSequenceExpression attributes are inherited from :ref:`SequenceExpr
- Description
* - type
- string
- 1..1
- 0..1
- MUST be "ComposedSequenceExpression"
* - components
- :ref:`LiteralSequenceExpression` | :ref:`RepeatedSequenceExpression` | :ref:`DerivedSequenceExpression`
- 2..m
- An ordered list of :ref:`SequenceExpression` components comprising the expression.
- An ordered list of :ref:`SequenceExpression` components comprising the expression.
3 changes: 3 additions & 0 deletions schema/defs/vrs/CopyNumber.rst
@@ -0,0 +1,3 @@
**Computational Definition**

The copies of :ref:`Location` in a system, expressed as an absolute integer quantity (:ref:`AbsoluteCopyNumber`) or a qualitative description of copies relative to a baseline state (:ref:`RelativeCopyNumber`).