CopyNumber: Naming and Semantics #404
Comments
@rrfreimuth, I was just discussing something with @larrybabb that I think overlaps with your discussion here. We have a desire to specify copy number gains and losses without specifying the level of gain or loss, because that's how they are recorded in ClinVar. One option is to just select the less severe case for each. There is an EFO ontology in the EBI OLS which has these possibilities:
We could potentially refer to these if the class allowed Codings rather than just an enum. This could present problems because the coding would sit within a value object, and accepting arbitrary codings increases the likelihood that values won't match. Or we could add a generic gain and loss class to the enum: Lines 324 to 332 in f4eefa5
@ahwagner I think we may want to stay away from the detailed copy_class values in VRS classes. Enumerations are a slippery slope in value objects. Adding to enumerated lists may work going forward, but modifying or removing values is a breaking change. Putting such specific gain and loss derivatives on the copy_class prevents users from providing the general notion of a copy number gain or loss as a consistent, referenceable value object. Having to base the identifier on a more specific class will reduce the ability to find like CNVs. I think the buck stops at simply discerning between gains and losses (and maybe neutral), since these are the primary distinction between CNVs that have identical locations. I think the more precise classifications can go on the descriptors for the CNVs. Let's keep it super simple?
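To make the identifier concern concrete, here is a minimal sketch, assuming identifiers are digests computed over a value object's serialized content. The `toy_identifier` helper, the field names, and the JSON serialization are illustrative assumptions, not the actual VRS schema or serialization; only the general shape (SHA-512 truncated to 24 bytes, base64url-encoded) follows the GA4GH digest approach.

```python
import base64
import hashlib
import json


def toy_identifier(value_object: dict) -> str:
    """Hypothetical content-based identifier: digest of a canonical JSON form."""
    # Sort keys and drop whitespace so equal content always serializes the same way.
    canonical = json.dumps(value_object, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha512(canonical.encode("utf-8")).digest()[:24]
    return base64.urlsafe_b64encode(digest).decode("ascii")


location = "chr1:143134063-143284670"  # illustrative location string

# Two submitters describing the same region with different specific classes...
low = toy_identifier({"location": location, "copy_class": "low-level gain"})
high = toy_identifier({"location": location, "copy_class": "high-level gain"})

# ...versus both using the general notion of a gain.
general_a = toy_identifier({"location": location, "copy_class": "gain"})
general_b = toy_identifier({"copy_class": "gain", "location": location})

print(low == high)            # the specific classes yield different identifiers
print(general_a == general_b)  # the general class yields one shared identifier
```

Under these assumptions, any difference in the enumerated value changes the identifier, so two records for the same region only collide on lookup when they agree on the class. That is the argument for keeping the class coarse and moving finer distinctions to the descriptor.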
@rrfreimuth regarding your points (well made): Absolute and Relative should probably be renamed to Quantitative and Qualitative, respectively. The fact is that our current AbsoluteCNV is purely a quantitative representation and RelativeCNV is purely a qualitative representation. Both are necessary based on the way data is used in the community. I think if we simplify the "qualitative" or "relative" terms as I suggested above, then we hit the right level of representation for VRS. All other values that may further specify the CNVs can be applied at the Descriptor level. It is true that Absolute CNVs can be characterized as gains and losses (or neutral), and that is fine when you have that precision, but Relative CNVs don't have the data needed to be expressed as Absolute CNVs. I think we need to accept this and presume those comparisons will need to be handled in higher-order models and systems.
@rrfreimuth et al we are reworking the Copy Number concepts above in terms of naming and precision. The Absolute CNV class will be renamed with the qualifying term Quantified. As @rrfreimuth pointed out above the term Absolute indicates a kind of exactness and since this is really about both exact counts, bounded range counts and unbounded range counts, we agree that it would be better represented as The Relative CNV class also has its naming challenges, but this class is needed for situations where the CNV can only be represented in a
Anyway, if @rrfreimuth and others would weigh in with a decisive preference from this list, or offer a concrete alternative for distinguishing these two forms of CNVs that are needed, we would greatly appreciate it.
@larrybabb Thanks for circling back to this. Great point about binding terminology values too tightly to an object (e.g., enum), such that changes to the terminology (value set) will be a breaking change to the value object. IMO, we should try to avoid that type of thing, which is why I tend to model the core and stable things structurally but outsource the stuff that is likely to evolve to an ontology. I like The other concept is harder because it is textual, requires a comparator, and is effectively an interpretation. The latter is important to remember because it can layer over a Examples below (feel free to add/modify). The frame of reference is important; the interpretation is a different concept than the count itself:
Counts known, interpretation known or inferred (given a known context/expected counts):
Interpretation and context (expected counts) known, actual counts unknown or not reported:
Interpretation known only, context (expected counts) unknown, actual counts unknown or not reported:
My thoughts:
Yes, that was created in discussions w/ the ELIXIR h-CNV members <cnvar.org> and the VRS group, starting with an SO issue I'd posted. The original VRS classes were partially derived from these discussions, but removed the PLOH and focal amp concepts as too ambiguous (i.e. dependent on several measures ....). We've provided a comparison for the terms in Beacon and hCNV. |
@rrfreimuth Your examples demonstrate that one needs a nice list of example cases to support the implementation of the (really not so difficult and approximately shared throughout a wide "CNV community") relative CNV concepts. |
For reference, here are some examples of how CNVs can currently be represented.
Proposed resolution: |
+1 - splendid solution! |
We discussed the idea of clearly distinguishing the two "kinds" of CNVs that we are aiming to include in VRS. Proposal 1 Proposal 2 Proposal 3 Any additional proposals should be posted below. If you like one of these three, please post a +1 for 1, 2, or 3 below.
As per the discussions in the VRS call, there are weak arguments pro/con various options all around. I'm +1 on hashtag-1 (see - no autolinking here) Only But I gladly follow any majority here.
@mbaudis going this route would potentially establish a new precedent that would require us to weigh which types of variant calling methods or SOPs crossed the line of "judgement calls or classification", such that we would end up with other types of xxxAssessment classes that are meant to represent variation. It seems to be similar to the reasoning behind creating "CategoricalVariation" (in some ways). It might feel more "variation-like" if we flipped the Assessment from a noun to an adjective type qualifier and called it
@larrybabb As indicated, I don't mind too much about the name. The most logical would be |
There are two types of data that have been discussed: a quantitative number of counts and some kind of interpretation/assessment. I think the first is more straightforward. A number of counts (or a range of counts) can stand on its own and is unambiguous in meaning. The second is trickier. I tend to agree with @mbaudis regarding the use of "assess" in this case - it sounds a little too fuzzy. The key, though, is that whatever we choose must be clearly defined as a concept. We should also discuss how the context (or expected number of copies) could be captured to provide the necessary information required to understand the data, which ideally would be interpreted relative to some expected state. Given the 3 proposals above, I lean towards the second option but am open to persuasion. |
I'm weighing in that option 3 is the best IMO because we really need to distinguish that a true BTW - CopyChange can be derived from CopyNumber but the reverse is not true. And another justification is to find a simple set of terms to define these two different but related concepts. |
It's not clear to me why CopyNumberAssessment aka RelativeCopyNumber aka CopyChange belongs in VRS at the variation level. It feels like a higher-order concept that describes a set of (or inclusion in a set of) CopyNumber variations. Can someone throw out an example use case?
@cmprocknow Basically, the majority of molecular/cytogenetic assessments of regional copy number changes (by hybridization, sequencing) are relative. Any CGH, arrayCGH, or low-pass WGS-based CNV assessment... It is basically something anybody who works with such data agrees upon.
@cmprocknow I'll try. When a testing lab reports what most folks consider to be a CopyNumber variant from an assay, the call can have varying levels of fidelity based on the technology used. Sometimes these assays and methods call the exact count of copies of a region, like GRCh37/hg19 1q21.1(chr1:143134063-143284670)x3, which shows that the assay is calling exactly 3 copies of this sequence in the patient's genome. As you can also see in this ClinVar variation record, folks also refer to this variant using HGVS as a dupe: NC_000001.10:g.(?143134063)(143284670_?)dup. Since this submitter specified that they called exactly 3 copies this CNV can be expressed
If, however, the assay technology was not able to call the exact number or count of copies of a region but instead had a method that expressed this CNV
VRS needs to provide a vehicle to represent both types of CNVs. As it turns out, the qualified or assessment type (CopyChange) can be derived from the quantified or count-based type (CopyNumber), but it is not always possible to do the reverse. Let me know if this is helpful or not. It is important that we all understand what it is we are trying to address. ClinVar (and most reporting labs and knowledgebases) have both kinds of CNVs. For the longest time I presumed large dels and dupes were just Alleles (and they can be in VRS), but the CNV community looks at these large dels and dupes as qualified copy changes unless they know the exact count, in which case they can also reference them more precisely.
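The one-way derivation described above can be sketched as follows. This is a minimal illustration, not VRS code: the `CopyCount` class, the `gain`/`loss`/`neutral` labels, and the expected baseline of 2 copies (a diploid autosomal region) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CopyCount:
    """Hypothetical count-based CNV: a location plus an exact copy count."""
    location: str
    copies: int


def derive_copy_change(cnv: CopyCount, expected_copies: int) -> str:
    """Count -> change: always possible once an expected count is known."""
    if cnv.copies > expected_copies:
        return "gain"
    if cnv.copies < expected_copies:
        return "loss"
    return "neutral"


def possible_counts(change: str, expected_copies: int) -> Tuple[int, Optional[int]]:
    """Change -> count: only a (min, max) bound is recoverable; None means unbounded."""
    if change == "gain":
        return (expected_copies + 1, None)
    if change == "loss":
        if expected_copies == 0:
            raise ValueError("a loss is not possible when 0 copies are expected")
        return (0, expected_copies - 1)
    if change == "neutral":
        return (expected_copies, expected_copies)
    raise ValueError(f"unknown change label: {change!r}")


# The x3 call from the comment above, assuming 2 copies are expected.
called = CopyCount(location="chr1:143134063-143284670", copies=3)
change = derive_copy_change(called, expected_copies=2)
bounds = possible_counts(change, expected_copies=2)
print(change, bounds)
```

Going from the count to the change loses information: the exact count of 3 collapses to a gain, and going back from the gain only recovers "3 or more". Note that both directions need the expected count, which neither representation carries on its own in this sketch.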
@cmprocknow there's also some background on this topic in #277; hope this helps! |
Proposal
We go with If you find this agreeable, please 👍 this comment. If not, please 👎 this comment and provide rationale and an alternate proposal.
Rationale
Like @larrybabb, I was leaning towards option 3 as the best of the described options. Here were my reasons:
However, after thinking carefully about the comments from @rrfreimuth, @mbaudis, @larrybabb, and others, I think what is nagging about this issue is that this class both describes copy change instead of count, and also does so qualitatively instead of quantitatively (I think this was how @reece also characterized these back in the day). And we are trying to find a class name that works across both of these axes, so we know where each concept naturally resides in this space:
So to define the new class we have to ask ourselves if something may exist in the future that would fill in the
I am personally a fan of the third option here, which may result in more overall classes but would keep each simple and precise. This means we could take our current proposed data structures and label them as |
@mbaudis, @larrybabb @ahwagner - thank you for the added clarity. My questions now center around implementation consequences given that some variations can only be described qualitatively, + the potential to define or derive qual from quant, but I'd rather discuss synchronously. |
@larrybabb suggested that It was also mentioned by @mbaudis that We are revising this proposal to accommodate both of these comments. The new proposed class names are:
To all participants: please 👍 or suggest revisions. |
Seeing no further objections following the community review period and seeing a plethora of 👍 we have implemented as proposed in 3f29329. |
👏 to @ahwagner for converting a 404 into content! |
As terms, "Absolute" and "Relative" are good opposites. In the proposed model, Absolute represents numerical counts, even if there is ambiguity in the count itself (in which case DefiniteRange and IndefiniteRange could be used). Relative represents levels relative to an unstated comparator (e.g., “partial loss”, “copy neutral”, “low-level gain” or “high-level gain”) but it also can represent an absolute count (“complete loss”, meaning copy count = 0).
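As a rough illustration of how an "Absolute" count can still carry ambiguity, the sketch below models an exact number, a bounded range, and an unbounded range, and checks whether an observed count is consistent with each. The class and field names mirror the Number, DefiniteRange, and IndefiniteRange terms used in this thread, but the exact fields, the comparator strings, and the `admits` helper are assumptions for the example rather than the published schema.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Number:
    value: int  # exact count, e.g. 3 copies


@dataclass(frozen=True)
class DefiniteRange:
    min: int  # inclusive lower bound
    max: int  # inclusive upper bound


@dataclass(frozen=True)
class IndefiniteRange:
    value: int
    comparator: str  # assumed to be "<=" or ">="


Copies = Union[Number, DefiniteRange, IndefiniteRange]


def admits(copies: Copies, observed: int) -> bool:
    """Return True if an observed count is consistent with the stated copies."""
    if isinstance(copies, Number):
        return observed == copies.value
    if isinstance(copies, DefiniteRange):
        return copies.min <= observed <= copies.max
    if isinstance(copies, IndefiniteRange):
        if copies.comparator == ">=":
            return observed >= copies.value
        if copies.comparator == "<=":
            return observed <= copies.value
        raise ValueError(f"unsupported comparator: {copies.comparator!r}")
    raise TypeError(f"unsupported copies type: {type(copies).__name__}")


print(admits(Number(3), 3))
print(admits(DefiniteRange(2, 4), 5))
print(admits(IndefiniteRange(3, ">="), 7))
```

All three forms remain numerical and need no comparator state to interpret, which is what separates them from terms such as "partial loss" or "low-level gain".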
Terms and values
1a. The last example - complete loss - throws me a bit because it is an absolute count that needs no comparator but which is included in a value set expressing relative copy number. I'd like to pull that value out, but we need a way to represent it, almost like a Text Expression datatype for Absolute.
1b. I am not sure whether "low-level gain" and "high-level gain" would be differentiable in practice, so I'd prefer to merge those concepts.
The difference in precision might need to be addressed. Is it possible to have a statement where there is a range of "Relative" terms (e.g., copy neutral to low-level gain)? If so, would we need a structure similar to Definite/Indefinite Range to support those terms?
I was thinking about whether Absolute and Relative might describe Quantitative and Qualitative concepts, respectively, but I keep going back and forth on that. Fundamentally, both are quantitative, but the values for Relative are expressed using qualitative terms (that are derived from some type of quantitative data).
In summary, my head hurts because CopyNumber is a quantitative concept that can be expressed 1) as numerical counts or as textual terms, 2) using absolute or relative frames of reference, 3) with or without ambiguity in precision.