v20 CNV update part4 : Rerun gistic and molecular subtyping v20 #1127

kgaonkar6 · 2021-08-05T00:36:18Z

Please merge #1123 #1124 #1126 before this PR.

Purpose/implementation Section

What scientific question is your analysis addressing?

Rerun subtyping for v20 with updated CNV #1114

What was your approach?

Rerun all molecular subtyping modules with the updated run-for-subtyping.sh, which now includes the CNV modules required for subtyping.

What GitHub issue does your pull request address?

#1125

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

See summary for discussion about changes.

Is there anything that you want to discuss further?

Expected changes:

Fusion_counts and Fusion_evidence were added by Updated analysis: Add fusion as TP53 loss #1094.
The histology file was updated to remove the extra columns previous_cancer_predispositions and previous_parent_aliquot_id (v19 histologies file has extra columns #1079).

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

tables

What is your summary of the results?

Samples ("BS_SYGKQD8M","BS_SD28E0SC","BS_2JDVY4F7") are not annotated as CDKN2A/B losses because our criterion calls a loss only when cn <= 1:

> cnv_anno_cdkn2 %>% filter(biospecimen_id %in% c("BS_SYGKQD8M","BS_SD28E0SC","BS_2JDVY4F7"))
  biospecimen_id ploidy CDKN2A CDKN2B CDKN2A_DEL CDKN2B_DEL
1    BS_2JDVY4F7      3      2      2         No         No
2    BS_SD28E0SC      3      2      2         No         No
3    BS_SYGKQD8M      4      3      3         No         No
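
As a rough illustration of the criterion above, here is a minimal Python sketch. It assumes the annotation depends only on the absolute copy number (cn <= 1) and not on ploidy; the function name and the handling of missing values are hypothetical and are not taken from the subtyping module.

```python
# Threshold from the summary above: a loss is called only when cn <= 1.
LOSS_MAX_CN = 1


def is_cdkn2_deletion(copy_number):
    """Return "Yes" if the absolute copy number meets the loss criterion."""
    if copy_number is None:
        # Assumption: a region with no call is not annotated as a loss.
        return "No"
    return "Yes" if copy_number <= LOSS_MAX_CN else "No"


# Rows copied from the table above: (biospecimen_id, ploidy, CDKN2A, CDKN2B)
rows = [
    ("BS_2JDVY4F7", 3, 2, 2),
    ("BS_SD28E0SC", 3, 2, 2),
    ("BS_SYGKQD8M", 4, 3, 3),
]

annotated = {
    sample: (is_cdkn2_deletion(cdkn2a), is_cdkn2_deletion(cdkn2b))
    for sample, _ploidy, cdkn2a, cdkn2b in rows
}

for sample, calls in annotated.items():
    print(sample, *calls)
```

Under this rule all three samples stay "No" even though their copy number is below ploidy, which matches the CDKN2A_DEL and CDKN2B_DEL columns shown above.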

BS_B6K0N6HS now retains the CDKN deletion that was missing in Outdated PBTA Histologies v20 : Run molecular subtyping #1119.
BS_K07KNTFY retains the chr19 amplification that was lost in Outdated PBTA Histologies v20 : Run molecular subtyping #1119.
Some LGAT samples have gained a CDKN deletion from the Update manta FILTER=='PASS' Part1 : consensus cnv file generation #1114 update: BS_DW1CYEXP, BS_QDGHHS4S, BS_Z9PKZ4RT.
Many CDKN losses changed in analyses/molecular-subtyping-EPN/results/EPN_all_data.tsv because regions not called by any caller were assigned CN=2 in v19, but we have updated this to NA as of Part1: Freec as default and neutral NA #1066.

   biospecimen_id status copy_number ploidy         ensembl gene_symbol cytoband
1     BS_01DQH017   loss           2      4 ENSG00000147889      CDKN2A   9p21.3
2     BS_5V2XCW17   loss           2      4 ENSG00000147889      CDKN2A   9p21.3
3     BS_9PQPZPGM   loss           2      4 ENSG00000147889      CDKN2A   9p21.3
4     BS_C9TNGEA0   loss           2      4 ENSG00000147889      CDKN2A   9p21.3
5     BS_FF73TT6D   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
6     BS_HQFNQHVW   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
7     BS_K6A9Z04J   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
8     BS_KB9GJDCS   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
9     BS_W5P7SPDH   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
10    BS_W74D279Q   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
11    BS_XBE002WV   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
12    BS_ZBWQG7VR   loss           2      4 ENSG00000147889      CDKN2A   9p21.3
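
To make the v19 versus v20 difference concrete, here is a minimal Python sketch. It assumes the focal status is derived by comparing copy number with sample ploidy (below ploidy is a loss); that rule and every function name here are illustrative assumptions, not the module's actual code. Only the fill value of 2 for uncalled regions in v19 and NA afterwards comes from the summary above.

```python
def focal_status(copy_number, ploidy):
    """Assumed rule: compare copy number with ploidy; None means no call."""
    if copy_number is None:
        return None
    if copy_number < ploidy:
        return "loss"
    if copy_number > ploidy:
        return "gain"
    return "neutral"


V19_FILL = 2  # value assigned to uncalled regions in v19, per the summary above


def status_v19(called_cn, ploidy):
    """v19 behaviour: an uncalled region is treated as CN=2 before annotation."""
    cn = V19_FILL if called_cn is None else called_cn
    return focal_status(cn, ploidy)


def status_v20(called_cn, ploidy):
    """Updated behaviour: an uncalled region stays missing (NA)."""
    return focal_status(called_cn, ploidy)


# A region with no caller support in a sample with ploidy 3
print(status_v19(None, 3))  # loss
print(status_v20(None, 3))  # None
```

For a sample with ploidy 3 or 4, filling an uncalled region with 2 is enough to produce a "loss" under this assumed rule, which is consistent with every row of the table having copy_number 2.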

For example, I checked one instance where the focal annotation was a CDKN2A loss because CN was set to 2 in the consensus seg file, but the CNVkit and FREEC results showed no calls supporting the loss:

> focal[which(focal$biospecimen_id=="BS_HQFNQHVW" & focal$gene_symbol=="CDKN2A"),] %>% as.data.frame()
  biospecimen_id status copy_number ploidy         ensembl gene_symbol cytoband
1    BS_HQFNQHVW   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
> cnvkit_autosomes[which(cnvkit_autosomes$biospecimen_id=="BS_HQFNQHVW" & cnvkit_autosomes$gene_symbol=="CDKN2A"),] %>% as.data.frame()
[1] biospecimen_id status         copy_number    ploidy         ensembl       
[6] gene_symbol    cytoband      
<0 rows> (or 0-length row.names)
> freec_autosomes[which(freec_autosomes$biospecimen_id=="BS_HQFNQHVW" & freec_autosomes$gene_symbol=="CDKN2A"),] %>% as.data.frame()
[1] biospecimen_id status         copy_number    ploidy         ensembl       
[6] gene_symbol    cytoband      
<0 rows> (or 0-length row.names)

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

jharenza

This looks good to me - changes are minimal/expected and the criteria are still stringent.

kgaonkar6 · 2021-08-11T16:26:54Z

Just wanted to document here that I reran focal-cn, since we added logic (5c33bbe) to read the relative path only when subtyping, and the html files needed to be updated.

runjin326 · 2021-09-20T19:59:19Z

@jaclyn-taroni, sorry to reach out to you like this, but I have a question about running the GISTIC module - have you ever gotten an error like this:

Matrix size 326815    1176
Removing NaN probes...
Removing 326815 markers with NaNs
Matrix size 0  1176
 
GISTIC 2.0 input error detected:
All input data were removed after NaN processing.
  adding: cnv-consensus-gistic/ (stored 0%)
  adding: cnv-consensus-gistic/gistic_inputs.mat (deflated 2%)

Looks like this is coming from changing copy number from 2 to NA for neutral calls (in the cnv-consensus.seg.gz) and I also tried running it on the V21 OpenPBTA consensus release and it gave me the same error. Any suggestions would help!
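
One possible reading of the log above, sketched in Python: if a marker is dropped whenever any sample has a missing value at that position, then scattered NA segments across 1176 samples can empty the whole matrix. That removal rule is an assumption about GISTIC internals and is not confirmed anywhere in this thread; the toy matrix is made up.

```python
import math


def drop_markers_with_nan(matrix):
    """Keep only rows (markers) that have a value in every column (sample)."""
    return [row for row in matrix if not any(math.isnan(v) for v in row)]


nan = float("nan")

# Toy data: 4 markers x 3 samples; every marker is missing in at least one sample.
toy = [
    [0.0, nan, 0.1],
    [nan, -0.9, 0.0],
    [0.2, 0.0, nan],
    [nan, nan, 0.0],
]

kept = drop_markers_with_nan(toy)
print(f"Matrix size {len(kept)} {len(toy[0])}")
```

No single sample is entirely missing in the toy data, yet no marker survives, which mirrors the "Matrix size 0  1176" line in the log.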

jaclyn-taroni · 2021-09-20T20:04:17Z

Hi @runjin326 - I have never encountered this specific error unfortunately. I assume that this part of the error message

GISTIC 2.0 input error detected:
All input data were removed after NaN processing.

Means that all of the inputs are being converted to NaN during some internal GISTIC step. You might need to do what was done in the chromothripsis module and use inferred tumor ploidy instead of NA:

OpenPBTA-analysis/analyses/chromothripsis/02-run-shatterseek-and-classify-confidence.R

Line 67 in d31c927

# Replace rows of NA copy number with ploidy for that particular tumor
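
A minimal Python sketch of the idea in that comment: fill each missing copy number with the ploidy of the tumor it belongs to. The record layout, field names and sample IDs are hypothetical, and whether GISTIC accepts the resulting input has not been tested in this thread.

```python
def fill_missing_copy_number(segments, ploidy_by_sample):
    """Return new segment records with missing copy number set to sample ploidy."""
    filled = []
    for seg in segments:
        seg = dict(seg)  # copy so the input records are left untouched
        if seg["copy_num"] is None:
            seg["copy_num"] = ploidy_by_sample[seg["sample"]]
        filled.append(seg)
    return filled


# Hypothetical segments; None stands in for NA.
segments = [
    {"sample": "sample_A", "chrom": "chr9", "copy_num": None},
    {"sample": "sample_A", "chrom": "chr19", "copy_num": 5},
    {"sample": "sample_B", "chrom": "chr9", "copy_num": None},
]
ploidy_by_sample = {"sample_A": 3, "sample_B": 2}

filled = fill_missing_copy_number(segments, ploidy_by_sample)
for seg in filled:
    print(seg["sample"], seg["chrom"], seg["copy_num"])
```

Filling with ploidy rather than a fixed 2 keeps an uncalled region neutral relative to its own tumor, which avoids reintroducing the v19 behaviour where CN=2 read as a loss in higher-ploidy samples.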

runjin326 · 2021-09-20T20:05:58Z

Thanks so much for the prompt reply - I will try that method :)

runjin326 · 2021-09-20T20:51:37Z

@jaclyn-taroni, I am so sorry, but I have another related question. For OpenTargets, we made modifications to run consensus on WGS samples only and to use CNVkit results for WXS samples. Do you see any issue with running GISTIC directly on the CNVkit seg file? Also, when I tried to run it, I noticed some unusual chromosome names in the file:

[25] "chr1_KI270766v1_alt"     "chr7_KI270803v1_alt"     "chr15_KI270850v1_alt"    "chr17_KI270857v1_alt"   
[29] "chr17_KI270909v1_alt"    "chr19_KI270938v1_alt"    "chr22_KI270879v1_alt"    "chr1_KI270706v1_random" 
[33] "chr4_GL000008v2_random"  "chr1_KI270711v1_random"  "chr14_GL000194v1_random" "chr14_KI270846v1_alt"   
[37] "chr19_GL000209v2_alt"    "chr19_GL949753v2_alt"    "chr22_KI270928v1_alt"    "chr15_KI270851v1_alt"

Have we been ignoring those when calling consensus? Or how do we deal with these?
Thanks so much again!

jaclyn-taroni · 2021-09-20T22:08:01Z

I'll preface this by saying that I do not have a lot of experience running GISTIC on different datasets. A concern that comes to mind for me is whether or not GISTIC expects genome-wide measurements and if CNVkit on WXS provides genome-wide measurements. If GISTIC does expect genome-wide measurements and if CNVkit does not provide them, I would be concerned that the input data is then violating some assumptions.

I assume that we do not consider anything outside of the primary assembly for CN consensus based on what is in this file, which are the genomic regions that are callable in that pipeline: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/d31c927a27813ec0b8032fbe768002f31723636f/analyses/copy_number_consensus_call/ref/cnv_callable.bed

But I would need to ask someone more involved in writing that pipeline to be sure.

runjin326 · 2021-09-20T23:28:18Z

Thanks so much! I will definitely look into their assumptions about whether they expect genome-wide measurements. If you get a chance, could you please also check with someone that was involved in writing the pipeline as well? Greatly appreciate that!

kgaonkar6 · 2021-09-21T04:48:26Z

I am seeing the same error now. The gistic run only updates the .mat file in the pbta-cnv-consensus-gistic folder, which unfortunately I missed the last time I ran this module.

updating: pbta-cnv-consensus-gistic/ (stored 0%)
updating: pbta-cnv-consensus-gistic/gistic_inputs.mat (deflated 2%)

I believe the changes suggested by @jaclyn-taroni might work. I can also open a ticket to rerun the gistic module (this will also affect the HGG and EPN subtyping modules).

runjin326 · 2021-09-21T12:47:42Z

@jaclyn-taroni and @kgaonkar6, I looked into the original publication for GISTIC v2.0, and they describe their test data as follows:

We evaluate these improvements on a test set of 178 glioblastoma multiforme (GBM) cancer 
DNAs hybridized to the Affymetrix Single Nucleotide Polymorphism (SNP) 6.0 array as part of 
The Cancer Genome Atlas (TCGA) project [10] (the 'TCGA GBM set'), and on simulated data.

Looks like they used a SNP array, so I am assuming they don't require genome-wide measurements and it would be fine to run on WXS samples?

jaclyn-taroni · 2021-09-21T13:17:04Z

That array is genome-wide.

runjin326 · 2021-09-21T19:20:12Z

@jaclyn-taroni, oh right! I kept digging and asking around but still couldn't figure out whether it accepts panel or WXS data. Please let me know if anyone knows the answer!
In the meantime, I have one question to clarify about the HGG module - although the CNV, fusion tables were generated through the module (and into a combined table), only defining histone mutation, IDH and BRAF are actually used to call subtype, right?

jaclyn-taroni · 2021-09-21T21:19:41Z

Yes, all of that happens here.

runjin326 · 2021-09-21T21:23:46Z

Thanks so much!

jaclyn-taroni · 2021-09-23T13:41:57Z

@runjin326 mitochondrial and alt sequences are removed in the consensus pipeline:

OpenPBTA-analysis/analyses/copy_number_consensus_call/Snakefile

Lines 135 to 137 in d31c927

           # remove alt chromosomes and mitochondria
           "| grep -v '_' "
           "| grep -v 'chrM' > {output}"

For any other questions that are not directly related to this PR but are related to data in this project (OpenPBTA) specifically, I'd recommend filing a data question issue: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/new?assignees=&labels=data&template=data-question.md&title=
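
For anyone who wants to apply the same filter to a CNVkit seg file outside the consensus pipeline, here is a rough Python equivalent of the two `grep -v` steps in the Snakefile lines quoted above. The function name and the example lines are hypothetical; the contig names are taken from the list earlier in this thread.

```python
def keep_primary_contigs(lines):
    """Mimic `grep -v '_' | grep -v 'chrM'`: drop lines containing '_' or 'chrM'."""
    return [ln for ln in lines if "_" not in ln and "chrM" not in ln]


# Hypothetical tab-separated records: chromosome, start, end
bed_lines = [
    "chr1\t10000\t20000",
    "chr1_KI270766v1_alt\t0\t5000",
    "chr14_GL000194v1_random\t0\t5000",
    "chrM\t0\t16569",
    "chr22\t100\t200",
]

kept = keep_primary_contigs(bed_lines)
for line in kept:
    print(line)
```

Like the grep commands, this matches anywhere on the line, so an underscore in any other column (a sample ID, for instance) would also drop that record. Matching on the chromosome column alone would be stricter.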

runjin326 · 2021-09-23T13:43:28Z

@jaclyn-taroni, thanks so much! Sure I wasn't sure where to post it but will submit a data question ticket next time! :)

kgaonkar6 added 6 commits August 4, 2021 15:59

rerun subtyping with cnv overlap update

2c45af0

adding gistic rerun from cnv-update-rerun

a765447

add cnv to subtyping script

b187a80

gistic rerun

fc183f4

rerun with gistic in data

5e3969e

rerun with download v20

71c5d4a

kgaonkar6 requested a review from jharenza August 5, 2021 00:36

This was referenced Aug 5, 2021

Download data v20 #1118

Merged

V20 PBTA histologies: Add cancer group #1128

Merged

jharenza mentioned this pull request Aug 5, 2021

Outdated PBTA Histologies v20 : Run molecular subtyping #1119

Closed

5 tasks

jharenza approved these changes Aug 5, 2021

kgaonkar6 changed the title ~~v20 CNV update: Rerun subtyping v20~~ v20 CNV update Part4 : Rerun subtyping v20 Aug 6, 2021

kgaonkar6 changed the title ~~v20 CNV update Part4 : Rerun subtyping v20~~ v20 CNV update Part4 : Rerun gistic and molecular subtyping v20 Aug 6, 2021

kgaonkar6 changed the title ~~v20 CNV update Part4 : Rerun gistic and molecular subtyping v20~~ v20 CNV update part4 : Rerun gistic and molecular subtyping v20 Aug 6, 2021

jaclyn-taroni mentioned this pull request Aug 10, 2021

V20 CNV update part1: Update overlap criteria for consensus CNV #1123

Merged

5 tasks

kgaonkar6 and others added 2 commits August 11, 2021 15:06

rerun focal

b8a76dd

conflict resolution

6b88b5d

Merge branch 'master' into rerun-subtyping_v20

92e3f2b

jaclyn-taroni merged commit 87d5d4b into AlexsLemonade:master Aug 11, 2021

kgaonkar6 mentioned this pull request Sep 21, 2021

Updated analysis: GISTIC with ploidy instead of NA #1180

Closed
