Proposed Analysis: Create mutation frequencies for Ped OT platform #8

afarrel · 2021-05-14T19:38:16Z

What are the scientific goals of the analysis?

Update the oncoprint-landscape module to output a gene mutation frequency TSV per histology (or cohort) for the Pediatric Open Targets platform. For this, we will use all genes, not only genes of interest.

Note: we may want to do this on a mutation level and add significance values as the OT takes in those values. Still need to discuss the best way to do this.

What input data are required for this analysis?

Consensus MAFs

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

1 day

Who will complete the analysis (please add a GitHub handle here if relevant)?

@ewafula

jharenza · 2021-05-24T23:51:47Z

Adding to this - we should probably also consider which mutations are going into this matrix - probably will want to exclude synonymous, silent, RNA, and intergenic for now. Do we know how OT will rank mutations? Is it by frequency per histology/cohort or by functional consequence + frequency? Thoughts @kgaonkar6 and @taylordm? cc @allisonheath

logstar · 2021-06-18T20:54:57Z

Sorry for the delay on this analysis.

I was wondering which mutation files in the PediatricOpenTargets/OpenPedCan-analysis v5 data release I should use to generate the mutation frequency tables. The oncoprint-landscape module uses the following files, but they are from the AlexsLemonade/OpenPBTA-analysis data release.

maf_consensus=../../data/pbta-snv-consensus-mutation.maf.tsv.gz
fusion_file=../../data/pbta-fusion-putative-oncogenic.tsv
histologies_file=../../data/pbta-histologies.tsv
focal_directory=../focal-cn-file-preparation/results
focal_cnv_file=${focal_directory}/consensus_seg_most_focal_cn_status.tsv.gz

Should we rerun the focal-cn-file-preparation module on the PediatricOpenTargets/OpenPedCan-analysis release data? The focal_cnv_file is also generated using the AlexsLemonade/OpenPBTA-analysis release data.

After figuring out which mutation files to use, I am planning to merge them as the oncoprint-landscape module does:

maf_object <- prepare_maf_object(
  maf_df = maf_df,
  cnv_df = cnv_df,
  metadata = metadata,
  fusion_df = fusion_df
)

(link to the code)

Then, generate gene summary tables for the merged mutation object using maftools::getGeneSummary, which would output a table that contains the number of mutated samples like the following. I will compute mutation frequency as MutatedSamples / total.

Hugo_Symbol	Frame_Shift_Del	Frame_Shift_Ins	In_Frame_Del	In_Frame_Ins	Missense_Mutation	total	MutatedSamples	AlteredSamples
MUC3A	3	2	2	1	22	30	26	26
MUC5AC	0	0	0	0	32	32	24	24
MUC4	0	0	1	2	23	26	21	21
ALK	0	0	0	0	18	18	18	18
NBPF10	0	0	0	0	19	19	17	17
HLA-A	0	0	0	0	18	18	17	17
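A minimal sketch of the frequency calculation in base R. Note that in the table above, `total` is the sum of the per-class variant counts (for MUC3A, 3 + 2 + 2 + 1 + 22 = 30), not a sample count, so this sketch assumes the intended denominator is the number of samples in the histology. The helper name `add_mutation_frequency` and the value `n_samples = 100` are hypothetical and only for illustration.

```r
# Toy gene summary using three rows and three columns from the table above.
gene_summary <- data.frame(
  Hugo_Symbol = c("MUC3A", "MUC5AC", "ALK"),
  total = c(30, 32, 18),
  MutatedSamples = c(26, 24, 18),
  stringsAsFactors = FALSE
)

# Hypothetical helper: frequency = mutated samples / samples in the group.
add_mutation_frequency <- function(summary_df, n_samples) {
  stopifnot(n_samples > 0, all(summary_df$MutatedSamples <= n_samples))
  summary_df$Frequency <- summary_df$MutatedSamples / n_samples
  summary_df
}

# n_samples = 100 is an assumed group size, not a value from the data release.
freq_df <- add_mutation_frequency(gene_summary, n_samples = 100)
print(freq_df[, c("Hugo_Symbol", "MutatedSamples", "Frequency")])
```

If `total` really is the intended denominator, the same helper would not apply, because `MutatedSamples / total` measures mutated samples per variant rather than a per-sample frequency.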

Regarding the note:

Note: we may want to do this on a mutation level and add significance values as the OT takes in those values. Still need to discuss the best way to do this.

Could you clarify the procedure for doing this analysis on a mutation level?

For the significance values, I wonder if they describe whether each gene is significantly mutated in each histology. If so, is there an analysis module or R package for generating such significance values? I saw that Figure 2 of the maftools paper shows "log10 transformed Q-values estimated by MutSigCV", but MutSigCV requires a standalone MATLAB 2013a package and two additional input files (a coverage file and a covariate file), so MutSigCV may not be easy to implement for PediatricOpenTargets/OpenPedCan-analysis.

kgaonkar6 · 2021-06-18T21:10:24Z

The consensus maf in v5 is snv-consensus-plus-hotspots.maf.tsv.gz

logstar · 2021-06-18T21:16:58Z

Thank you for the quick reply! I will use snv-consensus-plus-hotspots.maf.tsv.gz to generate mutation frequency tables for each histology.

logstar · 2021-06-24T15:50:57Z

@kgaonkar6 I was wondering if I could directly use the following files for this analysis:

analyses/focal-cn-file-preparation/results/consensus_seg_most_focal_cn_status.tsv.gz
analyses/interaction-plots/results/gene_disease_top50.tsv
analyses/focal-cn-file-preparation/results/consensus_seg_focal_cn_recurrent_genes.tsv

My concern is that these files might have been generated using the AlexsLemonade/OpenPBTA-analysis data release. I am not sure if they are compatible with the PediatricOpenTargets/OpenPedCan-analysis data release.

kgaonkar6 · 2021-06-24T16:10:06Z

You are right! Those files (and pbta-fusion-putative-oncogenic.tsv) have not yet been updated to include the samples added as part of the OpenPedCan analysis, but we will rerun the modules before the next release. I will keep you posted about that.

To me, it looks like the requirement here is just the gene mutation frequency. Should we add another script to the module that generates the frequencies using an updated version of the prepare_maf_object function (making cnv_df optional so we can use it now) and maftools::getGeneSummary, as you suggested?
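One way an optional `cnv_df` could look, as a sketch only. The internals of `prepare_maf_object` are not shown in this thread, so the helper below (`build_read_maf_args`, a hypothetical name) just assembles the `maftools::read.maf` arguments that appear elsewhere in this issue and adds `cnTable` only when a CNV table is supplied.

```r
# Hypothetical helper (not part of the module): collect the arguments for
# maftools::read.maf, adding cnTable only when a CNV table is supplied.
build_read_maf_args <- function(maf_df, metadata, cnv_df = NULL) {
  read_maf_args <- list(
    maf = maf_df,
    clinicalData = metadata,
    removeDuplicatedVariants = FALSE
  )
  if (!is.null(cnv_df)) {
    read_maf_args$cnTable <- cnv_df
  }
  read_maf_args
}

# Toy inputs; real MAF and metadata tables have many more columns.
maf_df <- data.frame(Hugo_Symbol = "ALK", Tumor_Sample_Barcode = "BS_1")
metadata <- data.frame(Tumor_Sample_Barcode = "BS_1")

# With no CNV table, the argument list simply has no cnTable entry.
snv_only_args <- build_read_maf_args(maf_df, metadata)
print(names(snv_only_args))
```

With real tables, the list could then be passed on via `do.call(maftools::read.maf, snv_only_args)`; the toy inputs above are too small for that call to succeed.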

logstar · 2021-06-24T16:29:13Z

Thank you for the quick reply and the suggestion!

I will skip the CNV part for now. I am planning to use an empty CNV file as a placeholder for this analysis, so that the original code can be reused.

I am planning to update 01-plot-oncoprint.R, so that the mutation frequency tables are consistent with the corresponding plots.

jharenza · 2021-06-24T17:15:01Z

One other note: we do not want to do the top 50 here, but all genes, as mentioned in the comment above, excluding synonymous, silent, RNA, and intergenic mutations.

It also sounds like this analysis should be done on a cohort+cancer_group and then cancer_group level, separately for primary tumors and relapse/recurrent/progressive tumors using the independent specimen lists. This analysis has morphed a bit since initially written.

Regarding significance, MutSigCV doesn't perform well on low Ns, and many of these histologies have a low N. I think for now, we are not going to worry about designating significance, but rather try to come up with a file of mutation frequencies, plus the additional annotation from @taylordm's sample table. For the latter, we should probably just create a new ticket for producing the full annotated table.

Below is the sample table:
OT_SomaticTables_SNV_CNV.xlsx
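
The grouping described above (frequencies per cohort + cancer_group, restricted to an independent specimen list) could be sketched in base R roughly as follows. All data values are made up, and `independent_ids` stands in for the IDs read from one of the independent specimen lists; only the column names `cohort`, `cancer_group`, `Kids_First_Biospecimen_ID`, `Tumor_Sample_Barcode`, and `Hugo_Symbol` are taken from this thread.

```r
# Toy histologies and MAF tables (values are invented for illustration).
histologies_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5"),
  cohort = c("PBTA", "PBTA", "PBTA", "GMKF", "GMKF"),
  cancer_group = c("Ependymoma", "Ependymoma", "Ependymoma",
                   "Neuroblastoma", "Neuroblastoma"),
  stringsAsFactors = FALSE
)
maf_df <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_1", "BS_2", "BS_4", "BS_5"),
  Hugo_Symbol = c("TP53", "TP53", "TP53", "ALK", "ALK"),
  stringsAsFactors = FALSE
)
# Assumed stand-in for one independent specimen list (BS_5 is excluded).
independent_ids <- c("BS_1", "BS_2", "BS_3", "BS_4")

# Restrict both tables to the independent specimens.
hist_indep <- histologies_df[
  histologies_df$Kids_First_Biospecimen_ID %in% independent_ids, ]
# One row per sample and gene, so multiple hits count the sample once.
sample_gene <- unique(maf_df[maf_df$Tumor_Sample_Barcode %in% independent_ids,
                             c("Tumor_Sample_Barcode", "Hugo_Symbol")])
sample_gene <- merge(sample_gene, hist_indep,
                     by.x = "Tumor_Sample_Barcode",
                     by.y = "Kids_First_Biospecimen_ID")

# Denominator: samples per group. Numerator: mutated samples per group and gene.
n_total <- aggregate(Kids_First_Biospecimen_ID ~ cohort + cancer_group,
                     data = hist_indep, FUN = length)
names(n_total)[3] <- "Total_samples"
n_mutated <- aggregate(Tumor_Sample_Barcode ~ cohort + cancer_group + Hugo_Symbol,
                       data = sample_gene, FUN = length)
names(n_mutated)[4] <- "Mutated_samples"

freq_df <- merge(n_mutated, n_total, by = c("cohort", "cancer_group"))
freq_df$Frequency <- freq_df$Mutated_samples / freq_df$Total_samples
print(freq_df)
```

The cancer_group-only level would drop `cohort` from both `aggregate` formulas and from the final `merge`, and the primary versus relapse/recurrent/progressive split would amount to running the same steps with a different `independent_ids`.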

logstar · 2021-06-24T18:14:46Z

@jharenza Thank you for the detailed notes. They are very helpful for implementing this analysis.

I will generate the mutation frequency tables accordingly. Then, I will annotate the SNV table of mutation frequencies according to #64.

I will skip the significance part for now.

jharenza · 2021-06-24T18:18:36Z

Sure thing, let me know if you have any questions along the way!

logstar · 2021-06-24T19:40:19Z

Hi @kgaonkar6. I was wondering if I could use independent-specimens.rnaseq.primary-plus.tsv from the independent-samples module to subset the fusion table.

Although the fusion table is a work in progress at #7, the independent sample determination for the fusion table is related to the filtering of snv-consensus-plus-hotspots.maf.tsv.gz.

In the original code, fusion independent samples are determined by matching sample_ids to the Kids_First_Biospecimen_ID in independent-specimens.wgs.primary.tsv, so the sample_ids with more than 2 rows in histologies_df are removed from snv-consensus-plus-hotspots.maf.tsv.gz to allow unambiguous matching between WGS and RNA-seq samples. The relevant code is listed below:

# in 00-map-to-sample_id.R
# An ambiguous sample_id will have more than 2 rows associated with it in the
# histologies file when looking at tumor samples -- that means we won't be able
# to determine when an WGS/WXS assay maps to an RNA-seq assay for the purpose of
# the oncoprint plot
ambiguous_sample_ids <- histologies_df %>%
  filter(sample_type == "Tumor",
         composition == "Solid Tissue") %>%
  group_by(sample_id) %>%
  tally() %>%
  filter(n > 2) %>%
  pull(sample_id)

ambiguous_biospecimens <- histologies_df %>%
  filter(sample_id %in% ambiguous_sample_ids) %>%
  pull(Kids_First_Biospecimen_ID)
# ...
biospecimens_to_remove <- unique(c(ambiguous_biospecimens,
                                   not_tumor_biospecimens))

# Filter the files!
maf_df <- maf_df %>%
  dplyr::filter(!(Tumor_Sample_Barcode %in% biospecimens_to_remove))
# ...

I found that some sample IDs map to hundreds or even thousands of samples, so I am concerned about removing the ambiguous_biospecimens.

> histologies_df %>%
+     filter(sample_type == "Tumor",
+            composition == "Solid Tissue") %>%
+     group_by(sample_id) %>%
+     tally() %>%
+     filter(n > 2)
# A tibble: 19 x 2
   sample_id     n
   <chr>     <int>
 1 01        11073
 2 02           49
 3 03          470
 4 05            9
 5 06          394
 6 09            5
 7 7316-14       3
 8 7316-1463     4
 9 7316-158      3
10 7316-161      3
11 7316-1765     4
12 7316-178      3
13 7316-3214     3
14 7316-3230     4
15 7316-3231     6
16 7316-85       3
17 7316-87       3
18 A16915        3
19 A18777        3
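
To gauge how much data this filter would discard, the same tally can be rerun on a toy table and followed through to the biospecimen level. The rows below are invented; only the column names and the `n > 2` rule come from the code quoted above.

```r
# Toy histologies table: sample_id "01" is shared by four biospecimens,
# mimicking the short IDs in the tally above.
histologies_df <- data.frame(
  Kids_First_Biospecimen_ID = paste0("BS_", 1:7),
  sample_id = c("01", "01", "01", "01", "7316-14", "7316-14", "7316-99"),
  sample_type = "Tumor",
  composition = "Solid Tissue",
  stringsAsFactors = FALSE
)

# Same rule as 00-map-to-sample_id.R: more than 2 rows means ambiguous.
tumor_df <- histologies_df[histologies_df$sample_type == "Tumor" &
                             histologies_df$composition == "Solid Tissue", ]
id_counts <- table(tumor_df$sample_id)
ambiguous_sample_ids <- names(id_counts)[id_counts > 2]

# Every biospecimen that carries an ambiguous sample_id would be dropped.
ambiguous_biospecimens <- histologies_df$Kids_First_Biospecimen_ID[
  histologies_df$sample_id %in% ambiguous_sample_ids]
fraction_removed <- length(ambiguous_biospecimens) / nrow(histologies_df)
print(ambiguous_sample_ids)
print(fraction_removed)
```

On the real histologies file, a single shared ID such as `01` would by this logic remove all of the biospecimens that map to it, which is the concern raised above.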

jharenza · 2021-06-24T19:42:22Z

hey @logstar - you can disregard fusions and CNVs completely here, only looking for SNV frequencies using the indep DNA sample lists

logstar · 2021-06-24T19:49:12Z

Thank you for the quick reply. I will disregard the fusions and CNVs in this analysis.

Sorry for being distracted by the CNVs and fusions. I was trying to figure out the original code and make this module compatible with the full OT data release, so that we will not need to revise the code much when CNVs and fusions are available. Now, I will get the SNV mutation frequency table generated before worrying about CNVs or fusions.

logstar · 2021-06-25T21:02:48Z

Hi @jharenza. I was wondering whether the Translation_Start_Site Variant_Classification should be considered non-synonymous.

In the original code, only the following Variant_Classifications are considered non-synonymous, and Translation_Start_Site is not included.

    read.maf(
      maf = maf_df,
      clinicalData = metadata,
      cnTable = cnv_df,
      removeDuplicatedVariants = FALSE,
      vc_nonSyn = c(
        "Frame_Shift_Del",
        "Frame_Shift_Ins",
        "Splice_Site",
        "Nonsense_Mutation",
        "Nonstop_Mutation",
        "In_Frame_Del",
        "In_Frame_Ins",
        "Missense_Mutation",
        "Fusion",
        "Multi_Hit",
        "Multi_Hit_Fusion",
        "Hom_Deletion",
        "Hem_Deletion",
        "Amp",
        "Del"
      )
    )

However, the default non-synonymous classes in maftools are the following, which additionally include Translation_Start_Site.

  if(is.null(vc_nonSyn)){
    vc.nonSilent = c("Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
                     "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
                     "In_Frame_Ins", "Missense_Mutation")
  }

The unique Variant_Classification values in ../../data/snv-consensus-plus-hotspots.maf.tsv.gz are:

3'Flank
3'UTR
5'Flank
5'UTR
Frame_Shift_Del
Frame_Shift_Ins
IGR
In_Frame_Del
In_Frame_Ins
Intron
Missense_Mutation
Nonsense_Mutation
Nonstop_Mutation
RNA
Silent
Splice_Region
Splice_Site
Translation_Start_Site

jharenza · 2021-06-25T21:04:15Z

we can add it as non-silent!

logstar · 2021-06-25T21:04:37Z

Will do. Thank you for the quick reply!
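
Following the decision above, a minimal base R sketch of the non-silent filter. The class list is the maftools default quoted earlier in this thread, which already includes Translation_Start_Site; the toy rows and the variable names `non_silent_classes` and `non_silent_maf` are assumptions for illustration.

```r
# Non-silent classes: the maftools default list, which includes
# Translation_Start_Site.
non_silent_classes <- c(
  "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site",
  "Translation_Start_Site", "Nonsense_Mutation", "Nonstop_Mutation",
  "In_Frame_Del", "In_Frame_Ins", "Missense_Mutation"
)

# Toy MAF rows (invented); real MAF files carry many more columns.
maf_df <- data.frame(
  Hugo_Symbol = c("ALK", "ALK", "TP53", "MUC4", "MUC4"),
  Tumor_Sample_Barcode = c("BS_1", "BS_2", "BS_1", "BS_3", "BS_3"),
  Variant_Classification = c("Missense_Mutation", "Silent",
                             "Translation_Start_Site", "IGR", "RNA"),
  stringsAsFactors = FALSE
)

# Keep only rows whose classification is in the non-silent list.
non_silent_maf <- maf_df[
  maf_df$Variant_Classification %in% non_silent_classes, ]
print(non_silent_maf)
```

Because this is an allow-list, the other classes present in the file (Intron, Splice_Region, the UTR and Flank classes) are dropped along with Silent, RNA, and IGR.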

d3b-center/ticket-tracker-OPC#8 Update oncoprint-landscape module for PediatricOpenTargets/OpenPedCan-analysis. Generate mutation frequency tables for all genes.

@taylordm

Squashed commit of the following:

commit d87986b7ce1a517f4807430ce6beaac5950b50ca
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 17:21:59 2021 -0400

Rename mutation-frequencies to snv-frequencies

Rename module.

commit b2d2fd5c391b43214825e7b458d0edcb5ac22f1a
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 17:13:18 2021 -0400

Annotate SNV table with mutation frequencies

Issues addressed:
- <d3b-center/ticket-tracker-OPC#64>
- <d3b-center/ticket-tracker-OPC#8>. This issue is no longer compatible with the purpose of this module. This module intends to compute mutation frequencies for each variant, but this issue intends to compute the mutation frequencies for each gene. This issue is listed here for future reference.

commit 84cacf28927121037f4b9ba895e5baa5d12c7b31
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 16:23:20 2021 -0400

[WIP] Update run-mutation-frequencies.sh

commit 29ae8ef19f2339ae08f78c26ab42e6cf75d3556e
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 16:14:50 2021 -0400

[WIP] Generate annotated SNV frequency table

commit 2cb06741ca192f77a3043d03574649a184459b11
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 14:54:39 2021 -0400

[WIP] Replace NA with blank string

Also replaced HotSpot value 1 with Y and 0 with N.

commit 57776f61e576a5e3e2672370370fd1090f3aa478
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 14:02:13 2021 -0400

[WIP] Use mygene.info to query gene IDs

mygene.info seems to be actively maintained. The query results are more comprehensive than [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html).

Relevant URLs:
- <http://mygene.info/about>
- <https://bioconductor.org/packages/release/bioc/html/mygene.html>

mygene.info is suggested by @taylordm and @jharenza.

commit 76bb0f5236378648adce429e45d3827009735b58
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 10:55:30 2021 -0400

[WIP] Generate SNV frequency tables

Issue addressed: d3b-center/ticket-tracker-OPC#64

logstar · 2021-07-27T21:14:55Z

Closed with PR d3b-center/OpenPedCan-analysis#45 merged.

afarrel assigned ewafula May 14, 2021

jharenza changed the title ~~Proposed Analysis: Run SNV modules and oncoprints for Ped OT platform~~ Proposed Analysis: Create oncomatrix for Ped OT platform May 18, 2021

jharenza assigned logstar and unassigned ewafula May 20, 2021

jharenza changed the title ~~Proposed Analysis: Create oncomatrix for Ped OT platform~~ Proposed Analysis: Create mutation frequencies for Ped OT platform May 26, 2021

This was referenced Jun 24, 2021

Proposed Analysis: Annotate SNV table of mutation frequencies #64

Closed

Proposed Analysis: Annotate CNV frequency files #65

Closed

logstar mentioned this issue Jun 28, 2021

[WIP] Update oncoprint-landscape module to generate mutation frequency table d3b-center/OpenPedCan-analysis#32

Closed

5 tasks

logstar mentioned this issue Jun 30, 2021

Annotate SNV table with mutation frequencies d3b-center/OpenPedCan-analysis#45

Merged

5 tasks

logstar closed this as completed Jul 27, 2021
