Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MYCN amplification status thresholding #112

Closed
wants to merge 16 commits into from

Conversation

ewafula
Copy link

@ewafula ewafula commented Sep 10, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

The purpose of this analysis module is to assess the status of the MYCN gene amplification status between the OpenPedCan Neuroblastoma GMKF and TARGET consensus copy number variants (CNV) calls from WSX/WGS data and TARGET focal and segment SNP array data compared to the clinical information in the master histologies file.

What was your approach?

  • identify discordant samples in MYCN amplification status between the consensus CNV calls and the clinical information.
  • extract segments in the MYCN cytogenic region (2p24.3) reported in the CNVkit, Control-FREEC, and MantaSV variant callers analysis results.
  • computes copy number classification metrics using the clinical information in the histologies file as the ground truth compared to the inferred consensus CNV calls.
  • estimate overlap coverage between variant callers for samples that are discordant in MYCN amplification status between consensus CNV calls and clinical information.
  • plot chromosome 2 segment mean for the MYCN cytogenic region (2p24.3) reported in the CNVkit analysis results for discordant samples to illustrate the MYCN copy number status (deletion, amplification, or neutral).
  • computes copy number classification metrics using the clinical information in the histologies file as the ground truth compared to the inferred copy number calls from the TARGET focal and segment SNP array data files.

Input data

  • analyses/data/histologies.tsv
  • analyses/data/consensus_wgs_plus_cnvkit_wxs.tsv.gz
  • analyses/data/cnv-cnvkit.seg.gz
  • analyses/data/cnv-controlfreec.tsv.gz
  • analyses/data/sv-manta.tsv.gz
  • input/mycn_tumor_focal_SNParray.tsv
  • input/mycn_tumor_paired_segment_SNParray.tsv
  • input/mycn_tumor_single_segment_SNParray.tsv

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Check if the results are sufficient for determining copy number and overlap coverage cutoffs for the consensus CNV calling analysis and suggests other additional approaches if needed.

Review the accuracy of the analysis code

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

YES

Results

What types of results are included (e.g., table, figure)?

Tables

What is your summary of the results?

  • results/mycn_consensus_vs_clinical_status.tsv
  • results/mycn_consensus_vs_clinical_diff_status.tsv
  • results/mycn_consensus_vs_clinical_metrics.tsv
  • results/cnvkit_mycn_consensus_vs_clinical_diff_status.tsv
  • results/ccontrolfreec_mycn_consensus_vs_clinical_diff_status.tsv
  • results/mantaSV_mycn_consensus_vs_clinical_diff_status.tsv
  • 02-mycn-consensus-clinical-discordant.html
  • results/mycn_focal_SNParray_vs_clinical_metrics.tsv
  • results/mycn_focal_SNParray_vs_clinical_status.tsv
  • results/mycn_paired_segment_SNParray_vs_clinical_metrics.tsv
  • results/mycn_paired_segment_SNParray_vs_clinical_status.tsv
  • results/mycn_single_segment_SNParray_vs_clinical_metrics.tsv
  • results/mycn_single_segment_SNParray_vs_clinical_status.tsv

Reproducibility Checklist

  • [x ] The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • [x ] This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • [x ] The analytical code is documented and contains comments.

@runjin326
Copy link

@afarrel , just noting our discussion in Slack and recap the issue:

We are re-evaluating CNV consensus annotation since:

  1. When we are comparing the annotated CNV status with molecular_subtype, it seemed like among 47 MYCN amp samples, 16 of them are not annotated as either amplification or gain and for the MYCN non-amp, 243 does not have annotation.
  2. Some PIs were worried about ControlFreeC being too stringent and not as sensitive as CNVkit.

For all NBL samples, we can do the following to check the differences between clinical annotation and our consensus call results:

final_cn <- readr::read_tsv("../../data/consensus_wgs_plus_cnvkit_wxs.tsv.gz")

wgs_check <- ot_histology %>% 
    filter(cancer_group == "Neuroblastoma") %>% 
    filter(sample_type == "Tumor") %>% 
    filter(!is.na(molecular_subtype)) %>% 
    filter(experimental_strategy == "WGS") %>% 
    select(Kids_First_Biospecimen_ID, molecular_subtype)

wgs_id <- wgs_check %>% pull(Kids_First_Biospecimen_ID) # this step gives 335 entries

wgs_cn <- final_cn %>% filter(biospecimen_id %in% wgs_id) %>% 
    filter(gene_symbol == "MYCN") %>% 
    rename(Kids_First_Biospecimen_ID = biospecimen_id) # this step only returned 75 entries

wgs_check %>% left_join(wgs_cn) %>% select(status, molecular_subtype) %>% 
    mutate(status = case_when( is.na(status)~"not_called", 
                                                     TRUE~status)) %>% table()

And the results are:

status
molecular_subtype amplification gain not_called
     MYCN amp                22    9         16
     MYCN non-amp             2   42        243
     Unknown                  0    0          1

I was concerned about only have 75 samples in the annotated consensus file (not_called are samples that are not in the annotated file) and none of them have neutral calls but I think I know what the reason is -

  • I understand why there is not that many neutral calls - that is because in the consensus seg file, the copy.num for any neutral in the consensus was replaced with NA - and then when we annotate the consensus seg file, we remove copy.num=NA and hence, the annotated consensus seg file would not contain that many neutral calls.

This particular strategy was introduced since previously, neutral calls get copy.num=2 and that generated a lot of false loss in the dataset. Using ControlFreeC as default (as compared to CNVkit which does not take into account ploidy when calling amplification) and use copy.num=NA for neutral calls, the data is a lot cleaner - see this issue in OpenPBTA discussion for details. I think the analysis done by @kgaonkar6 was pretty comprehensive and looked at multiple genes in >20 samples.

I think we can potentially use ploidy for copy number and that would rescue some neutral calls but would not add any gain or amplification and we need to think whether that is necessary. If we do want to change the methodology, we would need to perform an in-depth benchmarking analyses (instead of just using MYNC status), which can get quite involved.
cc @kgaonkar6 for inputs and suggestions.

@afarrel
Copy link

afarrel commented Sep 28, 2021

Thanks @runjin326 !

@runjin326
Copy link

runjin326 commented Sep 28, 2021

Thanks @runjin326 !

Also attaching a file with the comparison :)
MYCN_threshold_compare.txt

@jharenza
Copy link
Member

Thanks @runjin326 for your feedback. This issue is tough and we may need to do this on a cohort or cancer level. We've done a lot of work to benchmark the brain tumors so I don't want us to alter this methodology globally based on MYCN, which is a very highly amplified gene compared to others. I'm surprised these aren't captured, but we can also use SNP array as default CN and consensus as backup once we have those data integrated.

Another option is to determine what the overlaps are for consensus. We implemented consensus if 90% overlap in one direction (CNV A overlaps 90% with CNV B) or 50% in both. Can we assess the actual breaks and overlaps as well @ewafula to see why amps are missed? It's also possible a few clinically amp samples won't have dna amp due to DNA/protein differences and/or tumor collection heterogeneity per assay. FISH is used clinically.

@ewafula
Copy link
Author

ewafula commented Oct 27, 2021

@afarrel, I have updated the module with comparisons between the TARGET SNP arrays (focal and segment at multiple copy number ratio cutoffs to infer MYCN amp) and clinical MYCN amp status in the histologies files. Details explained in the README.md

@chinwallaa chinwallaa closed this Sep 23, 2022
@jharenza
Copy link
Member

Closing this as we will add SNP arrays at a later date.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants