MYCN amplification status thresholding #112

ewafula · 2021-09-10T00:48:16Z

Purpose/implementation Section

What scientific question is your analysis addressing?

The purpose of this analysis module is to assess the status of the MYCN gene amplification status between the OpenPedCan Neuroblastoma GMKF and TARGET consensus copy number variants (CNV) calls from WSX/WGS data and TARGET focal and segment SNP array data compared to the clinical information in the master histologies file.

What was your approach?

identify discordant samples in MYCN amplification status between the consensus CNV calls and the clinical information.
extract segments in the MYCN cytogenic region (2p24.3) reported in the CNVkit, Control-FREEC, and MantaSV variant callers analysis results.
computes copy number classification metrics using the clinical information in the histologies file as the ground truth compared to the inferred consensus CNV calls.
estimate overlap coverage between variant callers for samples that are discordant in MYCN amplification status between consensus CNV calls and clinical information.
plot chromosome 2 segment mean for the MYCN cytogenic region (2p24.3) reported in the CNVkit analysis results for discordant samples to illustrate the MYCN copy number status (deletion, amplification, or neutral).
computes copy number classification metrics using the clinical information in the histologies file as the ground truth compared to the inferred copy number calls from the TARGET focal and segment SNP array data files.

Input data

analyses/data/histologies.tsv
analyses/data/consensus_wgs_plus_cnvkit_wxs.tsv.gz
analyses/data/cnv-cnvkit.seg.gz
analyses/data/cnv-controlfreec.tsv.gz
analyses/data/sv-manta.tsv.gz
input/mycn_tumor_focal_SNParray.tsv
input/mycn_tumor_paired_segment_SNParray.tsv
input/mycn_tumor_single_segment_SNParray.tsv

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Check if the results are sufficient for determining copy number and overlap coverage cutoffs for the consensus CNV calling analysis and suggests other additional approaches if needed.

Review the accuracy of the analysis code

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

YES

Results

What types of results are included (e.g., table, figure)?

Tables

What is your summary of the results?

results/mycn_consensus_vs_clinical_status.tsv
results/mycn_consensus_vs_clinical_diff_status.tsv
results/mycn_consensus_vs_clinical_metrics.tsv
results/cnvkit_mycn_consensus_vs_clinical_diff_status.tsv
results/ccontrolfreec_mycn_consensus_vs_clinical_diff_status.tsv
results/mantaSV_mycn_consensus_vs_clinical_diff_status.tsv
02-mycn-consensus-clinical-discordant.html
results/mycn_focal_SNParray_vs_clinical_metrics.tsv
results/mycn_focal_SNParray_vs_clinical_status.tsv
results/mycn_paired_segment_SNParray_vs_clinical_metrics.tsv
results/mycn_paired_segment_SNParray_vs_clinical_status.tsv
results/mycn_single_segment_SNParray_vs_clinical_metrics.tsv
results/mycn_single_segment_SNParray_vs_clinical_status.tsv

Reproducibility Checklist

[x ] The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

[x ] This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
[x ] The analytical code is documented and contains comments.

runjin326 · 2021-09-28T01:28:24Z

@afarrel , just noting our discussion in Slack and recap the issue:

We are re-evaluating CNV consensus annotation since:

When we are comparing the annotated CNV status with molecular_subtype, it seemed like among 47 MYCN amp samples, 16 of them are not annotated as either amplification or gain and for the MYCN non-amp, 243 does not have annotation.
Some PIs were worried about ControlFreeC being too stringent and not as sensitive as CNVkit.

For all NBL samples, we can do the following to check the differences between clinical annotation and our consensus call results:

final_cn <- readr::read_tsv("../../data/consensus_wgs_plus_cnvkit_wxs.tsv.gz")

wgs_check <- ot_histology %>% 
    filter(cancer_group == "Neuroblastoma") %>% 
    filter(sample_type == "Tumor") %>% 
    filter(!is.na(molecular_subtype)) %>% 
    filter(experimental_strategy == "WGS") %>% 
    select(Kids_First_Biospecimen_ID, molecular_subtype)

wgs_id <- wgs_check %>% pull(Kids_First_Biospecimen_ID) # this step gives 335 entries

wgs_cn <- final_cn %>% filter(biospecimen_id %in% wgs_id) %>% 
    filter(gene_symbol == "MYCN") %>% 
    rename(Kids_First_Biospecimen_ID = biospecimen_id) # this step only returned 75 entries

wgs_check %>% left_join(wgs_cn) %>% select(status, molecular_subtype) %>% 
    mutate(status = case_when( is.na(status)~"not_called", 
                                                     TRUE~status)) %>% table()

And the results are:

status
molecular_subtype amplification gain not_called
     MYCN amp                22    9         16
     MYCN non-amp             2   42        243
     Unknown                  0    0          1

I was concerned about only have 75 samples in the annotated consensus file (not_called are samples that are not in the annotated file) and none of them have neutral calls but I think I know what the reason is -

I understand why there is not that many neutral calls - that is because in the consensus seg file, the copy.num for any neutral in the consensus was replaced with NA - and then when we annotate the consensus seg file, we remove copy.num=NA and hence, the annotated consensus seg file would not contain that many neutral calls.

This particular strategy was introduced since previously, neutral calls get copy.num=2 and that generated a lot of false loss in the dataset. Using ControlFreeC as default (as compared to CNVkit which does not take into account ploidy when calling amplification) and use copy.num=NA for neutral calls, the data is a lot cleaner - see this issue in OpenPBTA discussion for details. I think the analysis done by @kgaonkar6 was pretty comprehensive and looked at multiple genes in >20 samples.

I think we can potentially use ploidy for copy number and that would rescue some neutral calls but would not add any gain or amplification and we need to think whether that is necessary. If we do want to change the methodology, we would need to perform an in-depth benchmarking analyses (instead of just using MYNC status), which can get quite involved.
cc @kgaonkar6 for inputs and suggestions.

afarrel · 2021-09-28T01:30:52Z

Thanks @runjin326 !

runjin326 · 2021-09-28T01:37:23Z

Thanks @runjin326 !

Also attaching a file with the comparison :)
MYCN_threshold_compare.txt

jharenza · 2021-09-29T12:47:48Z

Thanks @runjin326 for your feedback. This issue is tough and we may need to do this on a cohort or cancer level. We've done a lot of work to benchmark the brain tumors so I don't want us to alter this methodology globally based on MYCN, which is a very highly amplified gene compared to others. I'm surprised these aren't captured, but we can also use SNP array as default CN and consensus as backup once we have those data integrated.

Another option is to determine what the overlaps are for consensus. We implemented consensus if 90% overlap in one direction (CNV A overlaps 90% with CNV B) or 50% in both. Can we assess the actual breaks and overlaps as well @ewafula to see why amps are missed? It's also possible a few clinically amp samples won't have dna amp due to DNA/protein differences and/or tumor collection heterogeneity per assay. FISH is used clinically.

ewafula · 2021-10-27T20:40:55Z

@afarrel, I have updated the module with comparisons between the TARGET SNP arrays (focal and segment at multiple copy number ratio cutoffs to infer MYCN amp) and clinical MYCN amp status in the histologies files. Details explained in the README.md

jharenza · 2022-09-23T20:37:43Z

Closing this as we will add SNP arrays at a later date.

initial PR submit of mycn-thresholding module

f3d3cfc

ewafula requested review from jharenza and afarrel September 10, 2021 00:48

ewafula added 11 commits September 9, 2021 20:49

update README

c62865f

update README

79fbeaa

update README

95f7333

update README

e2db21f

update README

6576a8b

update README

d16d9ed

additonal scripts and README annotations

26891be

Merge remote-tracking branch 'upstream/dev' into mycn-thresholding

721b335

add ROC AUC scores to classification results

7fc2140

Merge remote-tracking branch 'upstream/dev' into mycn-thresholding

682f19b

formating - rounding decimal place to 6

3673fb2

afarrel mentioned this pull request Oct 18, 2021

Proposed Analysis: MYCN AMP/Gain Thresholding using SNP Array data d3b-center/ticket-tracker-OPC#233

Closed

ewafula added 2 commits October 27, 2021 10:05

qMerge remote-tracking branch 'upstream/dev' into mycn-thresholding

7ae9bd8

update with SNP TARGET SNP array comparison with clinical

bd936dc

ewafula added 2 commits October 28, 2021 10:14

update README

94972c9

update README

9b99fde

jharenza added merge after repo release and removed merge after repo release labels Apr 20, 2022

jharenza added the low priority label Jun 1, 2022

jharenza added the review after release label Jun 9, 2022

jharenza removed the review after release label Jul 13, 2022

jharenza marked this pull request as draft August 13, 2022 01:21

chinwallaa closed this Sep 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MYCN amplification status thresholding #112

MYCN amplification status thresholding #112

ewafula commented Sep 10, 2021 •

edited

Loading

runjin326 commented Sep 28, 2021

afarrel commented Sep 28, 2021

runjin326 commented Sep 28, 2021 •

edited

Loading

jharenza commented Sep 29, 2021

ewafula commented Oct 27, 2021

jharenza commented Sep 23, 2022

MYCN amplification status thresholding #112

MYCN amplification status thresholding #112

Conversation

ewafula commented Sep 10, 2021 • edited Loading

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

Input data

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

Documentation Checklist

runjin326 commented Sep 28, 2021

afarrel commented Sep 28, 2021

runjin326 commented Sep 28, 2021 • edited Loading

jharenza commented Sep 29, 2021

ewafula commented Oct 27, 2021

jharenza commented Sep 23, 2022

ewafula commented Sep 10, 2021 •

edited

Loading

runjin326 commented Sep 28, 2021 •

edited

Loading