This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

v20 CNV update part4 : Rerun gistic and molecular subtyping v20 #1127

Merged
merged 9 commits into from
Aug 11, 2021

Conversation

kgaonkar6
Collaborator

@kgaonkar6 kgaonkar6 commented Aug 5, 2021

Please merge #1123 #1124 #1126 before this PR.

Purpose/implementation Section

What scientific question is your analysis addressing?

Rerun subtyping for v20 with updated CNV #1114

What was your approach?

Rerun all molecular subtyping modules with the updated run-for-subtyping.sh, which now includes the CNV modules required for subtyping.

What GitHub issue does your pull request address?

#1125

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

See summary for discussion about changes.

Is there anything that you want to discuss further?

Expected changes:

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

tables

What is your summary of the results?

  • Samples ("BS_SYGKQD8M", "BS_SD28E0SC", "BS_2JDVY4F7") are not annotated as CDKN2A/B losses because our criterion calls a loss only when cn <= 1:
> cnv_anno_cdkn2 %>% filter(biospecimen_id %in% c("BS_SYGKQD8M","BS_SD28E0SC","BS_2JDVY4F7"))
  biospecimen_id ploidy CDKN2A CDKN2B CDKN2A_DEL CDKN2B_DEL
1    BS_2JDVY4F7      3      2      2         No         No
2    BS_SD28E0SC      3      2      2         No         No
3    BS_SYGKQD8M      4      3      3         No         No
   biospecimen_id status copy_number ploidy         ensembl gene_symbol cytoband
1     BS_01DQH017   loss           2      4 ENSG00000147889      CDKN2A   9p21.3
2     BS_5V2XCW17   loss           2      4 ENSG00000147889      CDKN2A   9p21.3
3     BS_9PQPZPGM   loss           2      4 ENSG00000147889      CDKN2A   9p21.3
4     BS_C9TNGEA0   loss           2      4 ENSG00000147889      CDKN2A   9p21.3
5     BS_FF73TT6D   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
6     BS_HQFNQHVW   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
7     BS_K6A9Z04J   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
8     BS_KB9GJDCS   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
9     BS_W5P7SPDH   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
10    BS_W74D279Q   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
11    BS_XBE002WV   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
12    BS_ZBWQG7VR   loss           2      4 ENSG00000147889      CDKN2A   9p21.3
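
The gap between the two tables above can be illustrated with a minimal base R sketch that uses values copied from the first table. Both helpers are hypothetical and are not the module's actual functions: `call_deletion()` implements the stated cn <= 1 criterion, while `relative_status()` is only an assumption about how a ploidy-relative status might be assigned, inferred from the rows above where a copy number of 2 with ploidy 3 or 4 is labeled a loss.

```r
# Rows copied from the cnv_anno_cdkn2 output above.
cnv_anno <- data.frame(
  biospecimen_id = c("BS_2JDVY4F7", "BS_SD28E0SC", "BS_SYGKQD8M"),
  ploidy = c(3, 3, 4),
  CDKN2A = c(2, 2, 3),
  CDKN2B = c(2, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: call a deletion only when absolute copy number is <= 1.
call_deletion <- function(copy_number, threshold = 1) {
  ifelse(!is.na(copy_number) & copy_number <= threshold, "Yes", "No")
}

# Assumed ploidy-relative rule (inferred from the focal table, not confirmed):
# below ploidy is a loss, above ploidy is a gain, equal is neutral.
relative_status <- function(copy_number, ploidy) {
  ifelse(copy_number < ploidy, "loss",
         ifelse(copy_number > ploidy, "gain", "neutral"))
}

cnv_anno$CDKN2A_DEL <- call_deletion(cnv_anno$CDKN2A)
cnv_anno$CDKN2B_DEL <- call_deletion(cnv_anno$CDKN2B)
cnv_anno$CDKN2A_relative <- relative_status(cnv_anno$CDKN2A, cnv_anno$ploidy)
print(cnv_anno)
```

Under these assumptions, all three samples come out as "loss" relative to ploidy but "No" under the stricter absolute cutoff, which matches the annotations reported above.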

For example, I checked one instance where the focal annotation was a CDKN2A loss because CN was set to 2 in the consensus seg file, but I didn't see any calls in CNVkit or Control-FREEC supporting the loss:

> focal[which(focal$biospecimen_id=="BS_HQFNQHVW" & focal$gene_symbol=="CDKN2A"),] %>% as.data.frame()
  biospecimen_id status copy_number ploidy         ensembl gene_symbol cytoband
1    BS_HQFNQHVW   loss           2      3 ENSG00000147889      CDKN2A   9p21.3
> cnvkit_autosomes[which(cnvkit_autosomes$biospecimen_id=="BS_HQFNQHVW" & cnvkit_autosomes$gene_symbol=="CDKN2A"),] %>% as.data.frame()
[1] biospecimen_id status         copy_number    ploidy         ensembl       
[6] gene_symbol    cytoband      
<0 rows> (or 0-length row.names)
> freec_autosomes[which(freec_autosomes$biospecimen_id=="BS_HQFNQHVW" & freec_autosomes$gene_symbol=="CDKN2A"),] %>% as.data.frame()
[1] biospecimen_id status         copy_number    ploidy         ensembl       
[6] gene_symbol    cytoband      
<0 rows> (or 0-length row.names)

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

Collaborator

@jharenza jharenza left a comment

This looks good to me - changes are minimal/expected and the criteria are still stringent.

@kgaonkar6 changed the title from "v20 CNV update: Rerun subtyping v20" to "v20 CNV update Part4 : Rerun subtyping v20" on Aug 6, 2021
@kgaonkar6 changed the title from "v20 CNV update Part4 : Rerun subtyping v20" to "v20 CNV update Part4 : Rerun gistic and molecular subtyping v20" on Aug 6, 2021
@kgaonkar6 changed the title from "v20 CNV update Part4 : Rerun gistic and molecular subtyping v20" to "v20 CNV update part4 : Rerun gistic and molecular subtyping v20" on Aug 6, 2021
@kgaonkar6
Collaborator Author

Just wanted to document here that I reran focal-cn because we added logic (5c33bbe) to read the relative path only when subtyping, so the HTML files needed to be updated.

@jaclyn-taroni jaclyn-taroni merged commit 87d5d4b into AlexsLemonade:master Aug 11, 2021
@runjin326
Collaborator

@jaclyn-taroni, sorry to reach out to you like this, but I have a question about running the GISTIC module. Have you ever gotten an error like this?

Matrix size 326815    1176
Removing NaN probes...
Removing 326815 markers with NaNs
Matrix size 0  1176
 
GISTIC 2.0 input error detected:
All input data were removed after NaN processing.
  adding: cnv-consensus-gistic/ (stored 0%)
  adding: cnv-consensus-gistic/gistic_inputs.mat (deflated 2%)

It looks like this comes from changing the copy number from 2 to NA for neutral calls (in cnv-consensus.seg.gz). I also tried running it on the V21 OpenPBTA consensus release and got the same error. Any suggestions would help!

@jaclyn-taroni
Member

Hi @runjin326 - I have never encountered this specific error unfortunately. I assume that this part of the error message

GISTIC 2.0 input error detected:
All input data were removed after NaN processing.

means that all of the inputs are being converted to NaN during some internal GISTIC step. You might need to do what was done in the chromothripsis module and use inferred tumor ploidy instead of NA:

# Replace rows of NA copy number with ploidy for that particular tumor
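
A minimal base R sketch of that suggestion follows. The column names (`ID`, `chrom`, `copy.num`), the toy sample IDs, and the `ploidy` lookup vector are assumptions made for illustration and are not taken from the chromothripsis module; the thread does not establish whether this substitution resolves the GISTIC error.

```r
# Toy consensus-style segments; NA marks calls that were set to missing.
# Column names are assumed for illustration.
seg <- data.frame(
  ID = c("BS_A", "BS_A", "BS_B"),
  chrom = c("chr1", "chr9", "chr1"),
  copy.num = c(NA, 1, NA),
  stringsAsFactors = FALSE
)

# Hypothetical per-tumor ploidy lookup, keyed by biospecimen ID.
ploidy <- c(BS_A = 3, BS_B = 2)

# Hypothetical helper: replace NA copy number with the ploidy of that tumor.
fill_na_with_ploidy <- function(seg, ploidy) {
  missing <- is.na(seg$copy.num)
  seg$copy.num[missing] <- unname(ploidy[seg$ID[missing]])
  seg
}

filled <- fill_na_with_ploidy(seg, ploidy)
print(filled)
```

Segments that already carry a copy number are left untouched; only the NA rows take the ploidy of their own tumor.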

@runjin326
Collaborator

> Hi @runjin326 - I have never encountered this specific error unfortunately. I assume that this part of the error message
>
> GISTIC 2.0 input error detected:
> All input data were removed after NaN processing.
>
> means that all of the inputs are being converted to NaN during some internal GISTIC step. You might need to do what was done in the chromothripsis module and use inferred tumor ploidy instead of NA:
>
> # Replace rows of NA copy number with ploidy for that particular tumor

Thanks so much for the prompt reply - I will try that method :)

@runjin326
Collaborator

@jaclyn-taroni, I am so sorry, but I have another related question. For OpenTargets, we made modifications to run consensus on WGS only and to use CNVkit results for WXS samples only. Do you see any issue with running GISTIC directly on the CNVkit seg file? Also, when I was trying to run it, I noticed that there are some unusual chromosome names in the file:

[25] "chr1_KI270766v1_alt"     "chr7_KI270803v1_alt"     "chr15_KI270850v1_alt"    "chr17_KI270857v1_alt"   
[29] "chr17_KI270909v1_alt"    "chr19_KI270938v1_alt"    "chr22_KI270879v1_alt"    "chr1_KI270706v1_random" 
[33] "chr4_GL000008v2_random"  "chr1_KI270711v1_random"  "chr14_GL000194v1_random" "chr14_KI270846v1_alt"   
[37] "chr19_GL000209v2_alt"    "chr19_GL949753v2_alt"    "chr22_KI270928v1_alt"    "chr15_KI270851v1_alt"   

Have we been ignoring those when calling consensus? Or how do we deal with these?
Thanks so much again!

@jaclyn-taroni
Member

I'll preface this by saying that I do not have a lot of experience running GISTIC on different datasets. A concern that comes to mind for me is whether or not GISTIC expects genome-wide measurements and if CNVkit on WXS provides genome-wide measurements. If GISTIC does expect genome-wide measurements and if CNVkit does not provide them, I would be concerned that the input data is then violating some assumptions.

I assume that we do not consider anything outside of the primary assembly for CN consensus, based on this file, which lists the genomic regions that are callable in that pipeline: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/d31c927a27813ec0b8032fbe768002f31723636f/analyses/copy_number_consensus_call/ref/cnv_callable.bed

But I would need to ask someone more involved in writing that pipeline to be sure.

@runjin326
Collaborator

Thanks so much! I will definitely look into their assumptions about whether they expect genome-wide measurements. If you get a chance, could you please also check with someone that was involved in writing the pipeline as well? Greatly appreciate that!

@kgaonkar6
Collaborator Author

kgaonkar6 commented Sep 21, 2021

> Hi @runjin326 - I have never encountered this specific error unfortunately. I assume that this part of the error message
>
> GISTIC 2.0 input error detected:
> All input data were removed after NaN processing.
>
> means that all of the inputs are being converted to NaN during some internal GISTIC step. You might need to do what was done in the chromothripsis module and use inferred tumor ploidy instead of NA:
>
> # Replace rows of NA copy number with ploidy for that particular tumor

I am seeing the same error now. The GISTIC run only updates the .mat file in the pbta-cnv-consensus-gistic folder, which unfortunately I missed the last time I ran this module.

updating: pbta-cnv-consensus-gistic/ (stored 0%)
updating: pbta-cnv-consensus-gistic/gistic_inputs.mat (deflated 2%)

I believe the changes suggested by @jaclyn-taroni might work. I can also open a ticket to rerun the GISTIC module (this will also affect the HGG and EPN subtyping modules).

@runjin326
Collaborator

runjin326 commented Sep 21, 2021

@jaclyn-taroni and @kgaonkar6, I looked into the original publication for GISTIC v2.0, and they describe their test data as follows:

We evaluate these improvements on a test set of 178 glioblastoma multiforme (GBM) cancer 
DNAs hybridized to the Affymetrix Single Nucleotide Polymorphism (SNP) 6.0 array as part of 
The Cancer Genome Atlas (TCGA) project [10] (the 'TCGA GBM set'), and on simulated data. 

It looks like they used a SNP array, so I am assuming they don't require genome-wide measurements and it would be fine to run on WXS samples?

@jaclyn-taroni
Member

That array is genome-wide.

@runjin326
Collaborator

@jaclyn-taroni, oh right! I kept digging and asking around, but I still couldn't figure out whether it accepts panel or WXS data. Please let me know if anyone knows the answer!
In the meantime, I have one question to clarify about the HGG module - although the CNV, fusion tables were generated through the module (and into a combined table), only defining histone mutation, IDH and BRAF are actually used to call subtype, right?

@jaclyn-taroni
Member

> In the meantime, I have one question to clarify about the HGG module - although the CNV, fusion tables were generated through the module (and into a combined table), only defining histone mutation, IDH and BRAF are actually used to call subtype, right?

Yes, all of that happens here.

@runjin326
Collaborator

> In the meantime, I have one question to clarify about the HGG module - although the CNV, fusion tables were generated through the module (and into a combined table), only defining histone mutation, IDH and BRAF are actually used to call subtype, right?
>
> Yes, all of that happens here.

Thanks so much!

@jaclyn-taroni
Member

@runjin326 mitochondrial and alt sequences are removed in the consensus pipeline:

# remove alt chromosomes and mitochondria
"| grep -v '_' "
"| grep -v 'chrM' > {output}"
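
For anyone applying the same filter to a CNVkit seg file in R, here is a minimal sketch of the equivalent logic. The helper name `is_primary()` and the example vector are hypothetical; note also that the `grep -v` commands above filter whole lines, whereas this sketch filters only the chromosome name.

```r
# Mix of primary-assembly names and names like those listed earlier in this thread.
chroms <- c("chr1", "chr17_KI270857v1_alt", "chr4_GL000008v2_random", "chrM", "chrX")

# Hypothetical helper mirroring the two `grep -v` filters:
# drop any name containing "_" (alt/random contigs) or "chrM" (mitochondria).
is_primary <- function(chrom) {
  !grepl("_", chrom, fixed = TRUE) & !grepl("chrM", chrom, fixed = TRUE)
}

primary <- chroms[is_primary(chroms)]
print(primary)
```

With `fixed = TRUE`, the patterns are matched as literal strings rather than regular expressions.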

For any other questions that are not directly related to this PR but are related to data in this project (OpenPBTA) specifically, I'd recommend filing a data question issue: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/new?assignees=&labels=data&template=data-question.md&title=

@runjin326
Collaborator

@jaclyn-taroni, thanks so much! I wasn't sure where to post it, but I will submit a data question ticket next time! :)

4 participants