CopyNumber: Naming and Semantics #404

Closed
rrfreimuth opened this issue Oct 18, 2022 · 26 comments

Comments

@rrfreimuth
Contributor

As terms, "Absolute" and "Relative" are good opposites. In the proposed model, Absolute represents numerical counts, even if there is ambiguity in the count itself (in which case DefiniteRange and IndefiniteRange could be used). Relative represents levels relative to an unstated comparator (e.g., “partial loss”, “copy neutral”, “low-level gain” or “high-level gain”) but it also can represent an absolute count (“complete loss”, meaning copy count = 0).

  1. Terms and values
    1a. The last example - complete loss - throws me a bit because it is an absolute count that needs no comparator but which is included in a value set expressing relative copy number. I'd like to pull that value out, but we need a way to represent it, almost like a Text Expression datatype for Absolute.
    1b. I am not sure whether "low-level gain" and "high-level gain" would be differentiable in practice, so I'd prefer to merge those concepts.

  2. The difference in precision might need to be addressed. Is it possible to have a statement where there is a range of "Relative" terms (e.g., copy neutral to low-level gain)? If so, would we need a structure similar to Definite/Indefinite Range to support those terms?

  3. I was thinking about whether Absolute and Relative might describe Quantitative and Qualitative concepts, respectively, but I keep going back and forth on that. Fundamentally, both are quantitative, but the values for Relative are expressed using qualitative terms (that are derived from some type of quantitative data).

In summary, my head hurts because CopyNumber is a quantitative concept that can be expressed 1) as numerical counts or as textual terms, 2) using absolute or relative frames of reference, 3) with or without ambiguity in precision.

@theferrit32

@rrfreimuth, I was just discussing something with @larrybabb that I think overlaps with your discussion here. We have a desire to specify copy number gains and losses without specifying the level of gain or loss, because that's how they are recorded in ClinVar. One option is to just select the less severe case for each, low-level gain and partial loss, but that is maybe not ideal because it is implying information that is not present in the underlying source data. But this is what we will proceed with under the current spec.

There is an EFO ontology in the EBI OLS which has these possibilities:
relative copy number variation:

  • copy number gain
    • high-level gain
    • low-level gain
  • copy number loss
    • complete loss
    • low-level loss

(https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0030066&lang=en&viewMode=All&siblings=false)

We could potentially refer to it if the class allowed Codings rather than just an enum. This could present problems, because the value sits within a value object and accepting arbitrary codings increases the likelihood that values won't match. Or we could add generic gain and loss classes to the enum:

vrs/schema/vrs.json, lines 324 to 332 in f4eefa5:

"relative_copy_class": {
    "type": "string",
    "enum": [
       "complete loss",
       "partial loss",
       "copy neutral",
       "low-level gain",
       "high-level gain"
    ],
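
To make the Coding option a bit more concrete, here is a minimal sketch of how a consumer might use the hierarchy listed above to accept a generic "copy number gain" or "copy number loss" when the source data (e.g., ClinVar) does not state a level. The `PARENT` table simply transcribes the tree as listed in this comment; the function names are hypothetical and nothing here is part of the VRS schema.

```python
# Parent links transcribed from the hierarchy listed in this comment.
# Labels are as written above, not verified against the ontology itself.
PARENT = {
    "copy number gain": "relative copy number variation",
    "high-level gain": "copy number gain",
    "low-level gain": "copy number gain",
    "copy number loss": "relative copy number variation",
    "complete loss": "copy number loss",
    "low-level loss": "copy number loss",
}


def ancestors(term):
    """Return the chain of parents for a term, nearest first."""
    chain = []
    while term in PARENT:
        term = PARENT[term]
        chain.append(term)
    return chain


def is_a(term, ancestor):
    """True if `term` equals `ancestor` or is one of its descendants."""
    return term == ancestor or ancestor in ancestors(term)


# A record that only says "gain" can be stored at the generic level,
# and more specific records still match a query for any gain.
print(is_a("low-level gain", "copy number gain"))   # True
print(is_a("copy number gain", "low-level gain"))   # False
```

With a lookup like this, recording the generic parent term avoids implying a level that the source never reported.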

@larrybabb
Contributor

larrybabb commented Nov 10, 2022

@ahwagner I think we may want to stay away from the detailed copy_class values in VRS classes. Enumerations are a slippery slope in value objects. Adding to enumerated lists may work going forward, but modifying or removing values is a breaking change. Putting such specific gain and loss derivatives on the copy_class prevents users from providing the general notion of a copy number gain or loss as a consistent, referenceable value object. Having to base the identifier on a more specific class will reduce the ability to find like CNVs. I think the buck stops at simply discerning between gains and losses (and maybe neutral), since these are the primary distinction between CNVs that have identical locations. I think the more precise classifications can go on the descriptors for the CNVs.

Let's keep it super simple?

"relative_copy_class": { 
    "type": "string", 
    "enum": [ 
       "loss", 
       "neutral", 
       "gain" 
    ], 
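
As a rough illustration of why the coarse classes help with finding like CNVs, the sketch below collapses the five values from the current enum into the three proposed here. The mapping table and function names are my own assumptions for illustration and are not part of the spec.

```python
# Hypothetical mapping from the five-value enum (vrs.json at f4eefa5)
# to the three coarse classes proposed in this comment.
COARSE_CLASS = {
    "complete loss": "loss",
    "partial loss": "loss",
    "copy neutral": "neutral",
    "low-level gain": "gain",
    "high-level gain": "gain",
}


def coarsen(relative_copy_class):
    """Return 'loss', 'neutral', or 'gain' for a detailed class label."""
    try:
        return COARSE_CLASS[relative_copy_class]
    except KeyError:
        raise ValueError(
            f"unknown relative_copy_class: {relative_copy_class!r}"
        ) from None


def same_coarse_class(a, b):
    """True if two detailed labels fall into the same coarse class."""
    return coarsen(a) == coarsen(b)


# Two submissions at the same location with different levels of gain
# would share an identifier under the coarse scheme.
print(same_coarse_class("low-level gain", "high-level gain"))  # True
```

The detailed label would then live on the descriptor rather than in the value object.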

@larrybabb
Contributor

@rrfreimuth regarding your points (well made). Absolute and Relative should probably be renamed to Quantitative and Qualitative, respectively. The fact is that our current AbsoluteCNV is purely a quantitative representation and RelativeCNV is purely a qualitative representation. Both are necessary based on the way data is used in the community. I think if we simplify the "qualitative" or "relative" terms as I suggested above, then we hit the right level of representation for VRS. All other values that may further specify the CNVs can be applied at the Descriptor level.

It is true that Absolute CNVs can be characterized as gains and losses (or neutral), and that is fine when you have that precision, but Relative CNVs don't have the data to be expressed as Absolute CNVs. I think we need to accept this and presume those comparisons will need to be handled in higher order models and systems.

@larrybabb
Contributor

@rrfreimuth et al we are reworking the Copy Number concepts above in terms of naming and precision.

The Absolute CNV class will be renamed with the qualifying term Quantified. As @rrfreimuth pointed out above the term Absolute indicates a kind of exactness and since this is really about both exact counts, bounded range counts and unbounded range counts, we agree that it would be better represented as QuantifiedCopyNumber Variation.

The Relative CNV class also has its naming challenges, but this class is needed for situations where the CNV can only be represented in a qualified or assessed context. We are aiming to link this concept to the EFO concept of relative_copy_number_variation. That said, we could leave our class named as RelativeCopyNumber or we could change it to AssessedCopyNumber or maybe even QualifiedCopyNumber.

RelativeCopyNumber would draw a direct correlation to the corresponding EFO ontological term for which the instances of this class would be constrained by, so that would provide great consistency.
AssessedCopyNumber would draw a clear distinction as to the type of CNV representation it is, in that there are no quantified values to compare it to at the systemic level or relative to a single molecule type.
QualifiedCopyNumber would be a nice balance to the QuantifiedCopyNumber class that is being targeted to replace AbsoluteCopyNumber (at the top of this entry).

Anyway, if @rrfreimuth and others would weigh in with a decisive preference from this list, or offer a concrete alternative for distinguishing these two needed forms of CNVs, we would greatly appreciate it.

@rrfreimuth
Contributor Author

rrfreimuth commented Dec 12, 2022

@larrybabb Thanks for circling back to this. Great point about binding terminology values too tightly to an object (e.g., enum), such that changes to the terminology (value set) will be a breaking change to the value object. IMO, we should try to avoid that type of thing, which is why I tend to model the core and stable things structurally but outsource the stuff that is likely to evolve to an ontology.

I like QuantifiedCopyNumber as described (supporting exact counts, bounded ranges, and unbounded ranges).

The other concept is harder because it is textual, requires a comparator, and is effectively an interpretation. The latter is important to remember because it can layer over a QuantifiedCopyNumber so we need to think about how they would be used together. It is possible that the underlying quantitative data are not known/reported, however, so the concept must also be able to stand on its own. These are different concepts, not just different representations. Numerical copy count does not require context, but a textual interpretation does.

Examples below (feel free to add/modify).

The frame of reference is important; the interpretation is a different concept than the count itself:

| Example | Expected # Copies | Assayed/Reported Counts | Assayed/Reported Interpretation |
|---|---|---|---|
| Precise, neutral | 2 (e.g., autosome) | 2 | copy neutral |
| Precise, neutral | 1 (e.g., X in male) | 1 | copy neutral |

Counts known, interpretation known or inferred (given a known context/expected counts):

| Example | Expected # Copies | Assayed/Reported Counts | Assessed or Reported or Inferred Interpretation |
|---|---|---|---|
| Precise, neutral | 2 (e.g., autosome) | 2 | neutral |
| Precise, loss | 2 | 1 | loss |
| Precise, gain | 2 | 3 | gain |
| Bounded, loss | 2 | 0-1 | loss |
| Bounded, ambiguous | 2 | 1-3 | ? |
| Bounded, gain | 2 | 3-4 | gain |
| Unbounded, loss | 2 | <2 | loss |
| Unbounded, gain | 2 | >2 | gain |
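
The rows above can be reproduced with a small rule, sketched here under a few assumptions of mine: counts are integers, ranges are inclusive, `None` marks an unbounded side (so "<2" becomes `high=1` and ">2" becomes `low=3`), and the "?" row is reported as the string "ambiguous". The function name is hypothetical.

```python
from typing import Optional


def interpret(expected: int, low: Optional[int], high: Optional[int]) -> str:
    """Classify an inclusive copy-count range against an expected count.

    `None` for `low` or `high` means the range is unbounded on that side.
    """
    if low is not None and high is not None and low > high:
        raise ValueError("low must not exceed high")
    if high is not None and high < expected:
        return "loss"        # the whole range sits below the expected count
    if low is not None and low > expected:
        return "gain"        # the whole range sits above the expected count
    if low == expected and high == expected:
        return "neutral"     # exact count equal to the expected count
    return "ambiguous"       # the range straddles or touches the expected count


print(interpret(2, 0, 1))      # loss
print(interpret(2, 1, 3))      # ambiguous
print(interpret(2, 3, None))   # gain
```

Note that the rule needs the expected count as an input, which is the point about context being required for any interpretation.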

Interpretation and context (expected counts) known, actual counts unknown or not reported:
Note column headings differ from previous

| Example | Reported Interpretation | Expected # Copies | Inferred Counts |
|---|---|---|---|
| Textual with context, neutral | neutral | 2 | 2 |
| Textual with context, neutral | neutral | 1 | 1 |
| Textual with context, loss | loss | 2 | <2 |
| Textual with context, loss | loss | 1 | <1 |
| Textual with context, gain | gain | 2 | >2 |

Interpretation known only, context (expected counts) unknown, actual counts unknown or not reported:

| Example | Reported Interpretation | Inferred Counts |
|---|---|---|
| Textual without context, neutral | neutral | >=0 |
| Textual without context, loss | loss | >=0 |
| Textual without context, gain | gain | >0 |
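
Going in the other direction (interpretation to counts) is much weaker, as the last two tables show. A minimal sketch of that inference, assuming the interpretation is one of loss/neutral/gain and the expected count is optional; the function name and the string form of the result are my own choices:

```python
from typing import Optional


def infer_counts(interpretation: str, expected: Optional[int] = None) -> str:
    """Return the tightest count constraint implied by an interpretation.

    With an expected count the bound is relative to it; without one,
    only the trivial bounds from the last table remain.
    """
    if interpretation not in {"loss", "neutral", "gain"}:
        raise ValueError(f"unknown interpretation: {interpretation!r}")
    if expected is None:
        # No context: a gain implies at least one copy, nothing more.
        return ">0" if interpretation == "gain" else ">=0"
    if interpretation == "neutral":
        return str(expected)
    operator = "<" if interpretation == "loss" else ">"
    return f"{operator}{expected}"


print(infer_counts("loss", 2))   # <2
print(infer_counts("gain"))      # >0
```

Without the expected count, the result says almost nothing about the actual number of copies.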

My thoughts:

  1. We need to determine which of the above examples will be supported, and how
    1a. Model/schema (value object)
    1b. Model/schema (descriptor)
    1c. Extension
    1d. Terminology/ontology
    1e. Other
  2. Context (expected # copies) is important and needs to be explicitly defined
    2a. In the object (as data)
    2b. In the model (as object definition)
    2c. In a business rule (as implementation constraint/guidance)
  3. The interpretation of copy number (gain/neutral/loss) sounds a lot like variant pathogenicity (to me, anyway). We might want to think about using a similar approach to both.

@mbaudis
Member

mbaudis commented Feb 6, 2023

@theferrit32

There is an EFO ontology in the EBI OLS which has these possibilities: relative copy number variation:

Yes, that was created in discussions w/ the ELIXIR h-CNV members <cnvar.org> and the VRS group, starting with an SO issue I'd posted. The original VRS classes were partially derived from these discussions, but removed the PLOH and focal amp concepts as too ambiguous (i.e. dependent on several measures ....).

We've provided a comparison for the terms in Beacon and hCNV.

@mbaudis
Member

mbaudis commented Feb 6, 2023

| Beacon | VCF | SO | EFO | VRS | Notes |
|---|---|---|---|---|---|
| DUP [1] | DUP, SVCLAIM=D [2] | SO:0001742 copy_number_gain | EFO:0030070 copy number gain | low-level gain (implicit) | a sequence alteration whereby the copy number of a given genomic region is greater than the reference sequence |
| DUP [1] | DUP, SVCLAIM=D [2] | SO:0001742 copy_number_gain | EFO:0030071 low-level copy number gain | low-level gain | |
| DUP [1] | DUP, SVCLAIM=D [2] | SO:0001742 copy_number_gain | EFO:0030072 high-level copy number gain | high-level gain | commonly but not consistently used for >=5 copies on a bi-allelic genome region |
| DUP [1] | DUP, SVCLAIM=D [2] | SO:0001742 copy_number_gain | EFO:0030073 focal genome amplification | high-level gain | commonly but not consistently used for >=5 copies on a bi-allelic genome region, of limited size (operationally max. 1-5Mb) |
| DEL [1] | DEL, SVCLAIM=D [2] | SO:0001743 copy_number_loss | EFO:0030067 copy number loss | partial loss (implicit) | a sequence alteration whereby the copy number of a given genomic region is smaller than the reference sequence |
| DEL [1] | DEL, SVCLAIM=D [2] | SO:0001743 copy_number_loss | EFO:0030068 low-level copy number loss | partial loss | |
| DEL [1] | DEL, SVCLAIM=D [2] | SO:0001743 copy_number_loss | EFO:0030069 complete genomic deletion | complete loss | complete genomic deletion (e.g. homozygous deletion on a bi-allelic genome region) |

Footnotes

  1. While the use of VCF-derived (DUP, DEL) values was introduced with
    Beacon v1, usage of these terms has always been a recommendation rather than an integral part
    of the API. We now encourage the support of more specific terms (particularly EFO)
    by Beacon developers. As an example, the Progenetix Beacon API uses EFO terms but
    provides an internal term expansion for legacy DUP, DEL support.

  2. VCFv4.4 introduces an SVCLAIM field to disambiguate between in situ events (such as
    tandem duplications; known adjacency / break junction: SVCLAIM=J) and events where e.g. only the
    change in abundance / read depth (SVCLAIM=D) has been determined. Both J and D flags can be combined.
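
For implementers, the table above can be carried as a simple lookup. The sketch below transcribes the EFO, Beacon/VCF and VRS columns and shows one possible form of the legacy term expansion mentioned in footnote 1. It is an illustration only: the function names are hypothetical, and this is not the code of any actual Beacon implementation.

```python
# EFO id -> (EFO label, legacy Beacon/VCF term, VRS class), per the table above.
EFO_TERMS = {
    "EFO:0030070": ("copy number gain", "DUP", "low-level gain"),
    "EFO:0030071": ("low-level copy number gain", "DUP", "low-level gain"),
    "EFO:0030072": ("high-level copy number gain", "DUP", "high-level gain"),
    "EFO:0030073": ("focal genome amplification", "DUP", "high-level gain"),
    "EFO:0030067": ("copy number loss", "DEL", "partial loss"),
    "EFO:0030068": ("low-level copy number loss", "DEL", "partial loss"),
    "EFO:0030069": ("complete genomic deletion", "DEL", "complete loss"),
}


def expand_legacy(term):
    """Expand a legacy DUP/DEL query term to the matching EFO ids (sorted)."""
    return sorted(
        efo_id for efo_id, (_, legacy, _) in EFO_TERMS.items() if legacy == term
    )


def to_vrs_class(efo_id):
    """Return the VRS class listed in the table for an EFO id."""
    return EFO_TERMS[efo_id][2]


print(expand_legacy("DEL"))
# ['EFO:0030067', 'EFO:0030068', 'EFO:0030069']
print(to_vrs_class("EFO:0030073"))
# high-level gain
```

The mapping from EFO to VRS is many-to-one, so a round trip from VRS back to EFO would lose the distinction between, for example, EFO:0030072 and EFO:0030073.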

@mbaudis
Member

mbaudis commented Feb 6, 2023

@rrfreimuth Your examples demonstrate that one needs a nice list of example cases to support the implementation of the (really not so difficult and approximately shared throughout a wide "CNV community") relative CNV concepts.

@andreasprlic

@ahwagner
Member

Proposed resolution:
CopyNumber is left unchanged, and new class of CopyNumberAssessment is created, using EFO codes as described here, plus regional base ploidy assessment code.

@mbaudis
Member

mbaudis commented Feb 28, 2023

+1 - splendid solution!

@ahwagner ahwagner mentioned this issue Mar 3, 2023
@larrybabb
Contributor

larrybabb commented Mar 6, 2023

We discussed the idea of clearly distinguishing the two "kinds" of CNVs that we are aiming to include in VRS.

Proposal 1 CopyNumber and CopyNumberAssessment
IMO this makes it seem like one is a specialization of the other. This is what instigated the discussion that has led to renaming both classes so that it is clear they are both kinds of Copy Variation.

Proposal 2 QuantifiedCopyNumber and RelativeCopyNumber
These two terms have been proposed, but there seems to be some resistance to the Quantified qualifier, although I can't pinpoint who or what that resistance is.

Proposal 3 CopyNumber and CopyChange
This was put forth by @ahwagner on the vrs call on 3/6/2023. This clarifies that one of these is a "number" (or range of numbers) vs a "change" (relative, assessment, etc..). This seems like a nice compact name that provides distinction but I leave it to others to weigh in.

Any additional proposals should be posted below. If you like one of these three please post a +1 for 1 , 2 or 3 below.

@mbaudis
Member

mbaudis commented Mar 7, 2023

As per the discussions in the VRS call there are weak arguments pro/con various options all around. I'm +1 on hashtag-1 (see - no autolinking here) CopyNumberAssessment [1]; I think the "assessment" conveys that some judgement call or classification has been made w/o implying (to me) that it means any functional ... evaluation.

Only CopyChange looks naked to me; "copy" just is a tad too generic... Also see [1].

But i gladly follow any majority here.

Footnotes

  1. I even would consider going very verbose with CopyNumberVariationAssessment, since the terminus technicus "copy number variation" never has implied a knowledge about the absolute number, but this fills the 80 chars pretty fast.

@larrybabb
Contributor

@mbaudis going this route would potentially establish a new precedent that would require us to weigh which types of variant calling methods or SOPs crossed the line of "judgement calls or classification", such that we would end up with other types of xxxAssessment classes that are meant to represent variation. It seems to be similar to the reasoning behind creating "CategoricalVariation" (in some ways). It might feel more "variation-like" if we flipped the Assessment from a noun to an adjective-type qualifier and called it AssessedCopyNumber or AssessedCopyChange. If we are all in agreement that this concept we are calling CopyNumberAssessment is a type of CopyNumber variation, then we really should qualify it with a term like Assessed to differentiate it from the other type of CopyNumber, which currently has no qualifier in this proposal. If, however, this really isn't a kind of CopyNumber, then does it really belong within VRS as-is, or maybe under a new class of Variation like CategoricalVariation or AssessedVariation or even VariationAssessment? I'm struggling with the notion that CopyNumberAssessment is more of a kind of Assessment than a Variation.

@mbaudis
Member

mbaudis commented Mar 7, 2023

@larrybabb As indicated, I don't mind too much about the name. The most logical would be CopyNumberVariationClass or CopyNumberVariationCategory or CopyNumberVariationType which would clearly point out the scope although for you that might be too much towards the "categorical" instead of a "fuzzy value in some brackets" - I don't mind if this "feels like a categorical variation". I don't like the flipped version because there is nothing special IMO about "assessing" for this case; all variants are assessed, or not, depending on how wide you throw your semantic net.

@rrfreimuth
Contributor Author

rrfreimuth commented Mar 8, 2023

There are two types of data that have been discussed: a quantitative number of counts and some kind of interpretation/assessment.

I think the first is more straightforward. A number of counts (or a range of counts) can stand on its own and is unambiguous in meaning.

The second is trickier. I tend to agree with @mbaudis regarding the use of "assess" in this case - it sounds a little too fuzzy. The key, though, is that whatever we choose must be clearly defined as a concept. We should also discuss how the context (or expected number of copies) could be captured to provide the necessary information required to understand the data, which ideally would be interpreted relative to some expected state.

Given the 3 proposals above, I lean towards the second option but am open to persuasion.

@larrybabb
Contributor

larrybabb commented Mar 8, 2023

I'm weighing in that option 3 is the best IMO, because we really need to distinguish that a true CopyNumber variant is one that is defined by numbers, counts, or ranges, while CopyChange is a very different thing: it is an assessment of the quantity of copies (aka CopyNumber variant) to determine the "change" (be it a gain, loss or neutral change), and it defines that Copy Variation based on the change. While these are both general terms, they are distinct enough to clarify to all that these are different things.

BTW - CopyChange can be derived from CopyNumber but the reverse is not true.

And another justification is to find a simple set of terms to define these two different but related concepts.

@cmprocknow

It's not clear to me why CopyNumberAssessment aka RelativeCopyNumber aka CopyChange belongs in VRS at the variation level; it feels like a higher order concept that describes a set of (or inclusion in a set of) CopyNumber variations. Can someone throw out an example use case?

@mbaudis
Member

mbaudis commented Mar 8, 2023

@cmprocknow Basically the majority of molecular/cytogenetic assessments of regional copy number changes (by hybridization, sequencing) are relative. Any CGH, arrayCGH, low-pass WGS-based CNV assessment... It is basically something anybody who works with such data agrees upon.

@larrybabb
Contributor

@cmprocknow i'll try.

When a testing lab reports what most folks consider to be a CopyNumber variant from an assay, they can have varying levels of fidelity based on the technology used.

Sometimes these assays and methods call the exact count of copies of a region, as in GRCh37/hg19 1q21.1(chr1:143134063-143284670)x3, which states that there are exactly 3 copies of this sequence in the patient's genome. As you can also see in this ClinVar variation record, folks also refer to this variant using HGVS as a dup: NC_000001.10:g.(?_143134063)_(143284670_?)dup. Since this submitter specified that they called exactly 3 copies, this CNV can be expressed quantitatively with an absolute Copy Number count (3).
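
To show how the count can be read directly off such an expression, here is a minimal parsing sketch. The regular expression is an assumption written to fit the single example above; it is not a general ISCN parser, and the `CopyCall` type and `parse_copy_call` function are hypothetical names.

```python
import re
from typing import NamedTuple


class CopyCall(NamedTuple):
    assembly: str
    band: str
    chrom: str
    start: int
    end: int
    copies: int


# Matches strings shaped like "<assembly> <band>(<chrom>:<start>-<end>)x<copies>".
_PATTERN = re.compile(
    r"^(?P<assembly>\S+)\s+(?P<band>\S+?)"
    r"\((?P<chrom>chr\w+):(?P<start>\d+)-(?P<end>\d+)\)"
    r"x(?P<copies>\d+)$"
)


def parse_copy_call(text: str) -> CopyCall:
    """Parse one copy-count expression; raise ValueError if it does not fit."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized copy-count expression: {text!r}")
    return CopyCall(
        assembly=match["assembly"],
        band=match["band"],
        chrom=match["chrom"],
        start=int(match["start"]),
        end=int(match["end"]),
        copies=int(match["copies"]),
    )


call = parse_copy_call("GRCh37/hg19 1q21.1(chr1:143134063-143284670)x3")
print(call.chrom, call.start, call.end, call.copies)
# chr1 143134063 143284670 3
```

An HGVS dup expression carries no comparable count field, which is why it can only be read as a gain.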

If, however, the assay technology was not able to call the exact number of copies of a region but instead had a method that expressed this CNV qualitatively with a term like gain, loss or neutral (or some specialization of those general types), then this type of CNV would reflect a more ambiguous but still useful representation. As it turns out, many labs report these qualified CNV changes as variants using HGVS del and dup syntax (especially when the length of the gained or lost sequence is greater than a certain size ... 50bp typically). Since the technologies and methods that generate these types of qualified CNVs or CopyChanges have existed and will exist for some time, the community needs to be able to identify them, share them and attach knowledge to them.

VRS needs to provide a vehicle to represent both types of CNVs. As it turns out, the qualified or assessment type (CopyChange) can be derived from the quantified or count-based type (CopyNumber) but it is not possible (always) to do the reverse.

Let me know if this is helpful or not. It is important that we all understand what it is we are trying to address. ClinVar (and most reporting labs and knowledgebases) have both kinds of CNVs. For the longest time I presumed large dels and dupes were just Alleles (and they can be in VRS), but the CNV community looks at these large dels and dupes as qualified copy changes unless they know the exact count in which case they can also reference them more precisely.

@ahwagner
Member

ahwagner commented Mar 8, 2023

@cmprocknow there's also some background on this topic in #277; hope this helps!

@ahwagner
Member

ahwagner commented Mar 8, 2023

Proposal

We go with CopyNumber and QualitativeCopyChange.

If you find this agreeable, please 👍 this comment. If not, please 👎 this comment and provide rationale and an alternate proposal.

Rationale

Like @larrybabb, I was leaning towards option 3 as the best of the described options. Here were my reasons:

  1. I agree with this comment from @mbaudis: "all variants are assessed, or not, depending on how wide you throw your semantic net." While the data expressed by this class may reflect a manual judgement call (aligned with what we might describe as an "assessment"), it may instead reflect an "interpretation" of submitted data (such as reinterpreting an HGVS ambiguous-coordinate "dup" expression in ClinVar as a "copy gain"), or it may instead be a primary measure reported by a sequence analysis tool (i.e. GISTIC copy change calls). I do not think "assessment" definitively distinguishes all of these scenarios from that of the simpler integral copy number objects released in VRS 1.2. For this reason, I would move away from terms like CopyNumberAssessment or AssessedCopyNumber.
  2. The key distinguishing factor, in my view, is that while the current CopyNumber class describes the number of copies in a system, the other class describes the change of copies in a system relative to a baseline. These are often measured in terms that are not discretized copy counts but rather quantified log ratios of local signal to a regional baseline. So it is important to highlight that these classes capture this distinction of "absolute value" and "change from baseline". I think this is what @rrfreimuth was getting at in his last comment. Indeed, this is why we began with the terms AbsoluteCopyNumber and RelativeCopyNumber classes prior to this issue being raised. I think proposals 2 and 3 both meet this need.
  3. It is notable that this class doesn't actually describe relative changes as integers (i.e. +3 from baseline, -2 from baseline) but rather as categorized "levels" of change (low/high-amp gain from baseline, low/high-amp loss from baseline, at baseline). This is what I liked about proposal 3; it gets away from the notion that data in this class represents a number of copies.

However, after thinking carefully about the comments from @rrfreimuth, @mbaudis, @larrybabb, and others, I think what is nagging about this issue is that this class both describes copy change instead of count, and also does so qualitatively instead of quantitatively (I think this was how @reece also characterized these back in the day). And we are trying to find a class name that works across both of these axes, so we know where each concept naturally resides in this space:

| | Copy Count | Copy Change |
|---|---|---|
| Quantitative | CopyNumber | ? |
| Qualitative | ? | <New Class> |

So to define the new class we have to ask ourselves if something may exist in the future that would fill in the ? classes above, and what that might look like. It is easy to see what a Quantitative Copy Change may look like, as integer copy changes from a baseline (as discussed above). This is a data class we may create in the future, as I believe there are ISCN expressions based around relative, discrete changes without an established baseline. To my knowledge, there is no Qualitative Copy Count concept used in genomics today. So the question is (if we decided to create it) whether a Quantitative Copy Change would be best expressed as:

  1. a CopyNumber with a RelativeCount object type for the CopyNumber.copies value
  2. a CopyChange with a Number/IndefiniteRange/DefiniteRange for the CopyChange.change value
  3. an independent, third class type of NumericCopyChange or QuantitativeCopyChange

I am personally a fan of the third option here, which may result in more overall classes but would keep each simple and precise. This means we could take our current proposed data structures and label them as CopyNumber (current) and QualitativeCopyChange (revised from RelativeCopyNumber / CopyNumberAssessment).
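
A minimal sketch of what the third option could look like, with each class occupying one cell of the grid above. The attribute names `copies` and `change` follow the `CopyNumber.copies` and `CopyChange.change` names used in the options; `location`, `delta`, and `grid_cell` are hypothetical names for illustration, and ranges are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class CopyNumber:
    location: str   # hypothetical placeholder for the subject region
    copies: int     # absolute count, e.g. 3


@dataclass(frozen=True)
class QualitativeCopyChange:
    location: str
    change: str     # categorized level, e.g. "low-level gain"


@dataclass(frozen=True)
class QuantitativeCopyChange:
    location: str
    delta: int      # signed change from baseline, e.g. +3 or -2


CopyVariation = Union[CopyNumber, QualitativeCopyChange, QuantitativeCopyChange]


def grid_cell(obj: CopyVariation) -> Tuple[str, str]:
    """Return the (row, column) of the grid that a class instance occupies."""
    if isinstance(obj, CopyNumber):
        return ("quantitative", "count")
    if isinstance(obj, QuantitativeCopyChange):
        return ("quantitative", "change")
    if isinstance(obj, QualitativeCopyChange):
        return ("qualitative", "change")
    raise TypeError(f"not a copy variation: {type(obj).__name__}")


print(grid_cell(QualitativeCopyChange("chr1:143134063-143284670", "low-level gain")))
# ('qualitative', 'change')
```

Keeping the classes separate means a consumer can tell from the type alone which cell a record belongs to, without inspecting the value.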

@cmprocknow

@mbaudis, @larrybabb @ahwagner - thank you for the added clarity. My questions now center around implementation consequences given that some variations can only be described qualitatively, + the potential to define or derive qual from quant, but I'd rather discuss synchronously.

@ahwagner
Member

@larrybabb suggested that QualitativeCopyChange was too specific and that there is no clear use case for QuantitativeCopyChange (which we named to differentiate these classes). I reviewed the 2020 ISCN guidelines and could not identify examples where a quantitative copy change class would be required. Similarly, I did a cursory review of the cytogenetics atlas and found no data that would require such an expression.

It was also mentioned by @mbaudis that CopyChange (absent Number) "looks naked".

We are revising this proposal to accommodate both of these comments. The new proposed class names are:

  • CopyNumberCount
  • CopyNumberChange

To all participants: please 👍 or suggest revisions.

ahwagner added a commit that referenced this issue Mar 31, 2023
@ahwagner
Member

Seeing no further objections following the community review period and seeing a plethora of 👍 we have implemented as proposed in 3f29329.

@mbaudis
Member

mbaudis commented Apr 1, 2023

👏 to @ahwagner for converting a 404 into content!


7 participants