CNVkit and ControlFreeC copy number outputs are not directly comparable. Updating this in the README.
hmm not sure if that markdown table is showing up properly.
I will give formatting it a shot right now.
It needed newlines on both sides.
ahh ok good to know, thanks!
@@ -92,6 +92,32 @@ The release notes for each release are provided in the `release-notes.md` file t
* Somatic Copy Number Variant (CNV) data are provided in a modified [SEG format](https://software.broadinstitute.org/software/igv/SEG) for each of the [applied software packages](https://alexslemonade.github.io/OpenPBTA-manuscript/#somatic-copy-number-variant-calling).
* The CNVkit SEG file has an additional column `copy.num` to denote copy number of each segment, derived from the CNS file output of the algorithm described [here](https://cnvkit.readthedocs.io/en/stable/fileformats.html).
* The ControlFreeC TSV file is a merge of `*_CNVs` files produced from the algorithm, and columns are described [here](http://boevalab.inf.ethz.ch/FREEC/tutorial.html#OUTPUT).
* NOTE: The _copy number_ annotated in the CNVkit SEG file is annotated with respect to ploidy 2, however, the _copy number_ annotated in the ControlFreeC TSV file is annotated with respect to inferred ploidy from the algorithm, which is recorded in the `pbta_histologies.tsv` file. See the table below for examples of possible interpretations.

| Ploidy | Copy Number | Gain/Loss Interpretation |
One is supposed to look at the `tumor_ploidy` column in the `pbta_histologies.tsv` file and the `copy number` column in the ControlFreeC TSV, is that correct? Two thoughts on this:
- Why not include the ploidy information in the ControlFreeC file since we sometimes add columns to files (e.g., `copy.num` in CNVkit SEG) so an analyst has something all in one spot?
- If we can't/don't want to put ploidy in the ControlFreeC file, can we change the column names here to: `tumor_ploidy` in `pbta-histologies.tsv`, `copy number` in `pbta-cnv-controlfreec.tsv.gz`, Gain/Loss Interpretation
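For reference, a minimal sketch of what the first option (carrying `tumor_ploidy` into the ControlFreeC table) could look like. This is only an illustration, not the project's implementation: the join column `sample_id` is a hypothetical placeholder because the thread does not name the shared identifier column, and the rows are made-up toy data.

```python
import csv
import io

# Hypothetical join key: the thread does not say which column links the two files.
SAMPLE_KEY = "sample_id"


def add_tumor_ploidy(cnv_rows, histology_rows, key=SAMPLE_KEY):
    """Return copies of cnv_rows with a tumor_ploidy field looked up by sample."""
    ploidy_by_sample = {row[key]: row["tumor_ploidy"] for row in histology_rows}
    annotated = []
    for row in cnv_rows:
        new_row = dict(row)
        # None marks a sample with no recorded ploidy instead of assuming 2.
        new_row["tumor_ploidy"] = ploidy_by_sample.get(row[key])
        annotated.append(new_row)
    return annotated


# Toy stand-ins for the histologies file and the ControlFreeC TSV (made-up values).
histologies_tsv = "sample_id\ttumor_ploidy\nS1\t3\nS2\t2\n"
cnv_tsv = "sample_id\tcopy number\nS1\t2\nS2\t3\nS3\t1\n"

histology_rows = list(csv.DictReader(io.StringIO(histologies_tsv), delimiter="\t"))
cnv_rows = list(csv.DictReader(io.StringIO(cnv_tsv), delimiter="\t"))
annotated = add_tumor_ploidy(cnv_rows, histology_rows)
```

Values stay as strings because `csv.DictReader` does not convert types; a sample missing from the histologies table gets `None` so it is not silently treated as diploid.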
Yes, I was thinking it would make more sense to add it to the ControlFreeC file, but it also may be confusing because in many cases, it would contradict the `genotype` column. The ploidy in the clinical file is overall tumor ploidy, not the segment ploidy, so we can add it if you think it is useful, but maybe then we modify `genotype` to `segment_genotype`.
The table was also meant to be inclusive of CNVkit, which is why I didn't label the columns specifically.
On another note, it looks like we have gain/loss info in the ControlFreeC TSV file, so that should also help the user not rely solely on copy number. The challenging thing would be defining a homozygous loss: for ploidy 3, does homozygous loss mean all 3 copies are lost? (I may have to dig more into my genetics for this answer). I didn't realize this was the case when writing up the info for #182.
Thoughts on all of this - easiest for the user? Maybe we default to just using `loss` broadly, rather than categorizing as homo/hemi, and confirm total copy loss with lack of RNA expression, use `gain` broadly, and set a cutoff for amplification? That way we are not relying on copy number numbers which require ploidy interpretation, but rather, the gain/loss calls for ControlFreeC?
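To make the ploidy dependence concrete, here is a minimal sketch of a broad loss/gain labelling with an amplification cutoff. It is an illustration under stated assumptions, not what either tool does: the function name, the "neutral" label, and the cutoff of twice the ploidy are hypothetical choices. It assumes a segment is read as a loss when its copy number is below the reference ploidy and a gain when above.

```python
def classify_segment(copy_number, ploidy=2, amplification_factor=2.0):
    """Label a segment as loss, neutral, gain, or amplification relative to ploidy.

    ploidy defaults to 2, matching the CNVkit convention described in the README
    note; for ControlFreeC, pass the inferred tumor ploidy instead.
    amplification_factor is a hypothetical cutoff: a copy number at or above
    ploidy * amplification_factor is labelled an amplification.
    """
    if ploidy <= 0:
        raise ValueError("ploidy must be positive")
    if copy_number < ploidy:
        return "loss"
    if copy_number == ploidy:
        return "neutral"
    if copy_number >= ploidy * amplification_factor:
        return "amplification"
    return "gain"


# The same copy number of 3 reads differently depending on the reference ploidy.
labels = {ploidy: classify_segment(3, ploidy) for ploidy in (2, 3, 4)}
```

With these assumptions, a copy number of 3 is a gain at ploidy 2, neutral at ploidy 3, and a loss at ploidy 4, which is why the two tools' copy number columns cannot be compared directly.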
I think we should add `tumor_ploidy` to the TSV file, and change `genotype` to `segment_genotype`.
I would lean toward using `loss` broadly, but I don't think that we should rely on RNA expression as the only indicator of total loss; we should indicate total loss based on the seg calls as well.
The reason I am hesitant to rely on RNA expression is that it is quite possible to have a complete loss of some exons while others remain present. This could result in total loss of functionality, while the RNA level might indicate expression of the gene, as some exons are being expressed. This would be rare, I'd expect, but I feel like it is worth highlighting discrepancies between RNA and genomic data.
This is definitely the case for ATRX, and we treat it a bit differently - we had been doing a coverage-based estimation of exons lost and reporting those as deletions and/or assessing SVs in this gene as a complement. However, checking for RNA-level loss is something we did in the past to confirm loss of gene expression, especially in cases in which we know there should be complete loss (e.g., SMARCB1 in ATRT and some types of chordomas, CDKN2A/B in leukemias - not relevant here) and/or to ensure CN calls were generally lining up with expectations. It does get a bit complicated, I agree, trying to be broad, yet somehow cover all bases. In cases of hemizygous loss, I did a lot of manual inspection to ensure CN calls were accurate (which is not always the case).
Dumb question and googling isn't super helpful - how did you do this? I tried
Ah, as in needs a blank line before and after the table: https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/216/files#diff-04c6e90faac2675aa89e2176d2eec7d8R96 and https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/216/files#diff-04c6e90faac2675aa89e2176d2eec7d8R120
Co-Authored-By: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
Purpose/implementation
What scientific question is your analysis addressing?
What was your approach?
If this is not adding an analysis, describe your changes in this section.
CNVkit and ControlFreeC copy number outputs are not directly comparable. Updating this in the README.
Issue
What GitHub issue does your pull request address?
#182
Directions for reviewers
Tell potential reviewers what kind of feedback you are soliciting.
Are there particular areas that need a closer look?
Is there something you want to discuss further?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Is this clear or does this need more explanation?
Results
If your pull request includes code that produces scientific results, please summarize the results here.
This can help facilitate discussion around interpretation.
Please state what kinds of results are included (e.g., table, figure).