This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

add cnv interpretation #216

Merged
merged 7 commits into from
Nov 4, 2019

Conversation

jharenza
Collaborator

@jharenza jharenza commented Nov 3, 2019

Purpose/implementation

What scientific question is your analysis addressing?
What was your approach?
If this is not adding an analysis, describe your changes in this section.

CNVkit and ControlFreeC copy number outputs are not directly comparable. Updating this in the README.

Issue

What GitHub issue does your pull request address?
#182

Directions for reviewers

Tell potential reviewers what kind of feedback you are soliciting.
Are there particular areas that need a closer look?
Is there something you want to discuss further?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Is this clear or does this need more explanation?

Results

If your pull request includes code that produces scientific results, please summarize the results here.
This can help facilitate discussion around interpretation.
Please state what kinds of results are included (e.g., table, figure).

CNVkit and ControlFreeC copy number outputs are not directly comparable. Updating this in the README.
@jharenza
Collaborator Author

jharenza commented Nov 3, 2019

hmm not sure if that markdown table is showing up properly.

@jaclyn-taroni
Member

I will give formatting it a shot right now.

@jaclyn-taroni
Member

It needed newlines on both sides.

@jharenza
Collaborator Author

jharenza commented Nov 3, 2019

ahh ok good to know, thanks!

@@ -92,6 +92,32 @@ The release notes for each release are provided in the `release-notes.md` file t
* Somatic Copy Number Variant (CNV) data are provided in a modified [SEG format](https://software.broadinstitute.org/software/igv/SEG) for each of the [applied software packages](https://alexslemonade.github.io/OpenPBTA-manuscript/#somatic-copy-number-variant-calling).
* The CNVkit SEG file has an additional column `copy.num` to denote copy number of each segment, derived from the CNS file output of the algorithm described [here](https://cnvkit.readthedocs.io/en/stable/fileformats.html).
* The ControlFreeC TSV file is a merge of `*_CNVs` files produced from the algorithm, and columns are described [here](http://boevalab.inf.ethz.ch/FREEC/tutorial.html#OUTPUT).
* NOTE: The _copy number_ annotated in the CNVkit SEG file is annotated with respect to ploidy 2, however, the _copy number_ annotated in the ControlFreeC TSV file is annotated with respect to inferred ploidy from the algorithm, which is recorded in the `pbta_histologies.tsv` file. See the table below for examples of possible interpretations.

| Ploidy | Copy Number | Gain/Loss Interpretation |
Member

One is supposed to look at the tumor_ploidy column in the pbta_histologies.tsv file and the copy number column in the ControlFreeC TSV, is that correct? Two thoughts on this:

  • Why not include the ploidy information in the ControlFreeC file since we sometimes add columns to files (e.g., copy.num in CNVkit SEG) so an analyst has something all in one spot?
  • If we can't/don't want to put ploidy in the ControlFreeC file, can we change the column names here to: tumor_ploidy in pbta-histologies.tsv, copy number in pbta-cnv-controlfreec.tsv.gz, Gain/Loss Interpretation

Collaborator Author

Yes, I was thinking it would make more sense to add it to the ControlFreeC file, but it also may be confusing because in many cases, it would contradict the genotype column. The ploidy in the clinical file is overall tumor ploidy, not the segment ploidy, so we can add it if you think it is useful, but maybe then we modify genotype to segment_genotype.

The table was also meant to be inclusive of CNVkit, which is why I didn't label the columns specifically.

On another note, it looks like we have gain/loss info in the ControlFreeC TSV file, so that should also help the user not rely solely on copy number. The challenging part is deciding what then defines a homozygous loss: for ploidy 3, does homozygous loss mean losing all 3 copies? (I may have to dig more into my genetics for this answer.) I didn't realize this was the case when writing up the info for #182.

Thoughts on all of this? What would be easiest for the user? Maybe we default to using loss broadly rather than categorizing it as homozygous/hemizygous, confirm total copy loss with lack of RNA expression, use gain broadly, and set a cutoff for amplification. That way we would rely on the ControlFreeC gain/loss calls rather than on copy number values, which require ploidy interpretation.
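To make the proposal concrete, here is a minimal sketch of labeling a segment broadly as loss, gain, or amplification relative to the ploidy it was called against. It is not the project's method: the function name `interpret_segment` and the amplification cutoff (`amp_multiplier`, defaulting to twice the ploidy) are assumptions made only for illustration, since no cutoff was agreed on in this thread.

```python
def interpret_segment(copy_number: int, ploidy: int, amp_multiplier: float = 2.0) -> str:
    """Return a broad gain/loss label for one segment.

    copy_number    -- segment copy number reported by the caller
    ploidy         -- reference ploidy (2 for CNVkit, inferred tumor ploidy for ControlFreeC)
    amp_multiplier -- HYPOTHETICAL cutoff: copy_number >= amp_multiplier * ploidy
                      is labeled "amplification"; this value is an assumption
    """
    if copy_number < 0 or ploidy <= 0:
        raise ValueError("copy_number must be >= 0 and ploidy must be > 0")
    if amp_multiplier <= 1:
        raise ValueError("amp_multiplier must be greater than 1")

    if copy_number >= amp_multiplier * ploidy:
        return "amplification"
    if copy_number > ploidy:
        return "gain"
    if copy_number < ploidy:
        # Loss is reported broadly, without a homozygous/hemizygous split.
        return "loss"
    return "neutral"


# The same copy number reads differently depending on ploidy.
for cn, ploidy in [(3, 2), (3, 3), (2, 3), (0, 3), (6, 3)]:
    print(cn, ploidy, interpret_segment(cn, ploidy))
```

Keeping the ploidy as an explicit argument means the same function could cover both callers: pass 2 for CNVkit segments and the sample's inferred ploidy for ControlFreeC segments.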

Collaborator Author

I think we should add tumor_ploidy to the TSV file, and change genotype to segment_genotype.
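A rough sketch of what that change could look like, using only the Python standard library. The column names `tumor_ploidy`, `genotype`, and `segment_genotype` come from this discussion; the sample identifier column (`sample_id`), the function name, and the toy rows are assumptions for illustration and do not reflect the actual file contents.

```python
def add_tumor_ploidy(cnv_rows, histology_rows, sample_col="sample_id"):
    """Attach overall tumor ploidy to each CNV segment row.

    cnv_rows       -- list of dicts, one per ControlFreeC segment
    histology_rows -- list of dicts, one per sample, with a `tumor_ploidy` key
    sample_col     -- HYPOTHETICAL name of the shared sample identifier column
    """
    ploidy_by_sample = {row[sample_col]: row["tumor_ploidy"] for row in histology_rows}

    annotated = []
    for row in cnv_rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if "genotype" in new_row:
            # Rename to make clear this describes the segment, not the whole tumor.
            new_row["segment_genotype"] = new_row.pop("genotype")
        # Samples missing from the histology rows get None rather than a guess.
        new_row["tumor_ploidy"] = ploidy_by_sample.get(row[sample_col])
        annotated.append(new_row)
    return annotated


# Toy rows with made-up values, only to show the shape of the result.
histologies = [{"sample_id": "S1", "tumor_ploidy": 3}]
segments = [
    {"sample_id": "S1", "copy number": 2, "genotype": "AB"},
    {"sample_id": "S2", "copy number": 1, "genotype": "A"},
]
for row in add_tumor_ploidy(segments, histologies):
    print(row)
```

Leaving the ploidy as `None` for unmatched samples keeps missing metadata visible instead of silently assuming ploidy 2.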

Member

I would lean toward using loss broadly, but I don't think that we should rely on RNA expression as the only indicator of total loss; we should indicate total loss based on the seg calls as well.

The reason I am hesitant to rely on RNA expression is that it is quite possible to have a complete loss of some exons while others remain present. This could result in total loss of functionality, while the RNA level might indicate expression of the gene, as some exons are being expressed. This would be rare, I'd expect, but I feel like it is worth highlighting discrepancies between RNA and genomic data.

Collaborator Author

@jharenza jharenza Nov 3, 2019

This is definitely the case for ATRX, and we treat it a bit differently: we had been doing a coverage-based estimation of exons lost and reporting those as deletions and/or assessing SVs in this gene as a complement. However, the RNA-level check is something we did in the past to confirm loss of gene expression, especially in cases in which we know there should be complete loss (e.g., SMARCB1 in ATRT and some types of chordomas, or CDKN2A/B in leukemias, which is not relevant here), and/or to ensure CN calls were generally lining up with expectations. I agree that it does get a bit complicated trying to be broad yet somehow cover all bases. In cases of hemizygous loss, I did a lot of manual inspection to ensure CN calls were accurate (which is not always the case).

@jharenza
Collaborator Author

jharenza commented Nov 3, 2019

> It needed newlines on both sides.

Dumb question, and googling isn't super helpful: how did you do this? I tried `<br>`, `<br/>`, and `\`, and they did not work.

@jharenza jharenza mentioned this pull request Nov 4, 2019
@jaclyn-taroni jaclyn-taroni merged commit 262d15a into master Nov 4, 2019
@jaclyn-taroni jaclyn-taroni deleted the cnv-ploidy-explanation branch November 4, 2019 14:17
@yuankunzhu yuankunzhu mentioned this pull request Nov 4, 2019