This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

add cnv interpretation #216

Merged: 7 commits, Nov 4, 2019
26 changes: 26 additions & 0 deletions README.md
@@ -92,6 +92,32 @@ The release notes for each release are provided in the `release-notes.md` file t
* Somatic Copy Number Variant (CNV) data are provided in a modified [SEG format](https://software.broadinstitute.org/software/igv/SEG) for each of the [applied software packages](https://alexslemonade.github.io/OpenPBTA-manuscript/#somatic-copy-number-variant-calling).
* The CNVkit SEG file has an additional column `copy.num` to denote copy number of each segment, derived from the CNS file output of the algorithm described [here](https://cnvkit.readthedocs.io/en/stable/fileformats.html).
* The ControlFreeC TSV file is a merge of `*_CNVs` files produced from the algorithm, and columns are described [here](http://boevalab.inf.ethz.ch/FREEC/tutorial.html#OUTPUT).
* NOTE: The _copy number_ in the CNVkit SEG file is annotated with respect to ploidy 2, whereas the _status_ in the ControlFreeC TSV file is annotated with respect to the ploidy inferred by the algorithm, which is recorded in the `pbta_histologies.tsv` file. See the table below for examples of possible interpretations.

| Ploidy | Copy Number | Gain/Loss Interpretation |

Member

One is supposed to look at the tumor_ploidy column in the pbta_histologies.tsv file and the copy number column in the ControlFreeC TSV, is that correct? Two thoughts on this:

  • Why not include the ploidy information in the ControlFreeC file since we sometimes add columns to files (e.g., copy.num in CNVkit SEG) so an analyst has something all in one spot?
  • If we can't/don't want to put ploidy in the ControlFreeC file, can we change the column names here to: tumor_ploidy in pbta-histologies.tsv, copy number in pbta-cnv-controlfreec.tsv.gz, Gain/Loss Interpretation

Collaborator Author

Yes, I was thinking it would make more sense to add it to the ControlFreeC file, but it also may be confusing because in many cases, it would contradict the genotype column. The ploidy in the clinical file is overall tumor ploidy, not the segment ploidy, so we can add it if you think it is useful, but maybe then we modify genotype to segment_genotype.

The table was also meant to be inclusive of CNVkit, which is why I didn't label the columns specifically.

On another note, it looks like we have gain/loss info in the ControlFreeC TSV file, which should help the user avoid relying solely on copy number. The challenge is what then defines a homozygous loss: for ploidy 3, does homozygous loss mean all 3 copies are lost? (I may have to dig more into my genetics for this answer.) I didn't realize this was the case when writing up the info for #182.

Thoughts on all of this, and on what is easiest for the user? Maybe we default to using loss broadly rather than categorizing it as homo/hemi (confirming total copy loss with lack of RNA expression), use gain broadly, and set a cutoff for amplification? That way we rely on the ControlFreeC gain/loss calls rather than on copy numbers, which require ploidy interpretation.

Collaborator Author

I think we should add tumor_ploidy to the TSV file, and change genotype to segment_genotype.

Member

I would lean toward using loss broadly, but I don't think that we should rely on RNA expression as the only indicator of total loss; we should indicate total loss based on the seg calls as well.

The reason I am hesitant to rely on RNA expression is that it is quite possible to have a complete loss of some exons while others remain present. This could result in total loss of functionality, while the RNA level might indicate expression of the gene, as some exons are being expressed. This would be rare, I'd expect, but I feel like it is worth highlighting discrepancies between RNA and genomic data.

Collaborator Author (@jharenza, Nov 3, 2019)

This is definitely the case for ATRX, and we treat it a bit differently: we had been doing a coverage-based estimation of exons lost and reporting those as deletions, and/or assessing SVs in this gene as a complement. However, the RNA-level check is something we did in the past to confirm loss of gene expression, especially in cases in which we know there should be complete loss (e.g., SMARCB1 in ATRT and some types of chordomas, CDKN2A/B in leukemias, which is not relevant here), and/or to ensure CN calls were generally lining up with expectations. It does get a bit complicated, I agree, trying to be broad yet somehow cover all bases. In cases of hemizygous loss, I did a lot of manual inspection to ensure CN calls were accurate (which is not always the case).
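
The proposal in this thread (add `tumor_ploidy` to the ControlFreeC TSV and rename `genotype` to `segment_genotype`) could look roughly like the sketch below. It is only an illustration, not the code used in the repository: the join key `Kids_First_Biospecimen_ID`, the helper name `add_tumor_ploidy`, and the toy rows are all assumptions, and only the `tumor_ploidy`, `genotype`, and `segment_genotype` names come from the discussion.

```python
import csv
import io


def add_tumor_ploidy(cnv_rows, histology_rows, key="Kids_First_Biospecimen_ID"):
    """Attach overall tumor ploidy to each CNV segment row.

    `key` is an assumed sample identifier shared by both files.
    """
    # Look up the overall tumor ploidy once per sample.
    ploidy_by_sample = {row[key]: row["tumor_ploidy"] for row in histology_rows}

    annotated = []
    for row in cnv_rows:
        out = dict(row)
        # Rename so the per-segment genotype is not confused with overall ploidy.
        if "genotype" in out:
            out["segment_genotype"] = out.pop("genotype")
        # Samples without a recorded ploidy are marked "NA" rather than dropped.
        out["tumor_ploidy"] = ploidy_by_sample.get(out[key], "NA")
        annotated.append(out)
    return annotated


# Toy, made-up rows for illustration only (not real PBTA data).
histologies_tsv = "Kids_First_Biospecimen_ID\ttumor_ploidy\nBS_TOY1\t3\n"
cnv_tsv = (
    "Kids_First_Biospecimen_ID\tcopy number\tstatus\tgenotype\n"
    "BS_TOY1\t2\tloss\tAA\n"
)

histology_rows = list(csv.DictReader(io.StringIO(histologies_tsv), delimiter="\t"))
cnv_rows = list(csv.DictReader(io.StringIO(cnv_tsv), delimiter="\t"))
annotated = add_tumor_ploidy(cnv_rows, histology_rows)
```

Keeping `tumor_ploidy` as a separate column, rather than overwriting anything segment-level, keeps the distinction raised above between overall tumor ploidy and per-segment genotype.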

|--------|-------------|------------------------------|
| 2 | 0 | Loss; homozygous deletion |
| 2 | 1 | Loss; hemizygous deletion |
| 2 | 2 | Copy neutral |
| 2 | 3 | Gain; one copy gain |
| 2 | 4 | Gain; two copy gain |
| 2 | 5+ | Gain; possible amplification |
| 3 | 0 | Loss; 3 copy loss |
| 3 | 1 | Loss; 2 copy loss |
| 3 | 2 | Loss; 1 copy loss |
| 3 | 3 | Copy neutral |
| 3 | 4 | Gain; one copy gain |
| 3 | 5 | Gain; two copy gain |
| 3 | 6+ | Gain; possible amplification |
| 4 | 0 | Loss; 4 copy loss |
| 4 | 1 | Loss; 3 copy loss |
| 4 | 2 | Loss; 2 copy loss |
| 4 | 3 | Loss; 1 copy loss |
| 4 | 4 | Copy neutral |
| 4 | 5 | Gain; one copy gain |
| 4 | 6 | Gain; two copy gain |
| 4 | 7+ | Gain; possible amplification |

* Somatic Structural Variant Data (Somatic SV) are provided in the [Annotated Manta TSV](doc/format/manta-tsv-header.md) format produced by the [applied software packages](https://alexslemonade.github.io/OpenPBTA-manuscript/#somatic-structural-variant-calling).
* Gene expression estimates from the [applied software packages](https://alexslemonade.github.io/OpenPBTA-manuscript/#gene-expression-abundance-estimation) are provided as a gene by sample matrix.
* Gene Fusions produced by the [applied software packages](https://alexslemonade.github.io/OpenPBTA-manuscript/#rna-fusion-calling-and-prioritization) are provided as [Arriba TSV](doc/format/arriba-tsv-header.md) and [STARFusion TSV](doc/format/starfusion-tsv-header.md) respectively.
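
The ploidy/copy-number table added in this diff can be expressed as a small lookup function. The sketch below is one possible reading of the table, not code from the repository: the function name `interpret_copy_number` and the constant `AMPLIFICATION_DELTA` are hypothetical, and the "three or more copies above ploidy" cutoff is simply what the table rows (5+ at ploidy 2, 6+ at ploidy 3, 7+ at ploidy 4) imply.

```python
# Copies above ploidy at which the table switches to "possible amplification".
# Inferred from the table rows; treat it as an assumption, not a validated cutoff.
AMPLIFICATION_DELTA = 3

GAIN_WORDS = {1: "one", 2: "two"}


def interpret_copy_number(ploidy: int, copy_number: int) -> str:
    """Return the Gain/Loss Interpretation for a segment, mirroring the table."""
    if ploidy < 1 or copy_number < 0:
        raise ValueError("ploidy must be >= 1 and copy_number must be >= 0")

    delta = copy_number - ploidy
    if delta == 0:
        return "Copy neutral"

    if delta < 0:
        # The table uses homozygous/hemizygous wording only for ploidy 2.
        if ploidy == 2:
            kind = "homozygous" if copy_number == 0 else "hemizygous"
            return f"Loss; {kind} deletion"
        return f"Loss; {-delta} copy loss"

    if delta >= AMPLIFICATION_DELTA:
        return "Gain; possible amplification"
    return f"Gain; {GAIN_WORDS[delta]} copy gain"
```

For CNVkit, whose copy number is annotated with respect to ploidy 2, the call would be `interpret_copy_number(2, copy_num)`; for ControlFreeC, the first argument would be the inferred ploidy recorded for that sample.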