This repository has been archived by the owner on Jun 16, 2023. It is now read-only.

Proposed Analysis: Create mutation frequencies for Ped OT platform #8

Closed
afarrel opened this issue May 14, 2021 · 17 comments

afarrel commented May 14, 2021

What are the scientific goals of the analysis?

Update oncoprint-landscape module to output a gene mutation frequency TSV per histology (or cohort) for Pediatric Open Targets platform. For this, we will use all genes, not genes of interest.

Note: we may want to do this on a mutation level and add significance values as the OT takes in those values. Still need to discuss the best way to do this.

What input data are required for this analysis?

Consensus MAFs

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

1 day

Who will complete the analysis (please add a GitHub handle here if relevant)?

@ewafula

jharenza changed the title from "Proposed Analysis: Run SNV modules and oncoprints for Ped OT platform" to "Proposed Analysis: Create oncomatrix for Ped OT platform" on May 18, 2021
jharenza assigned logstar and unassigned ewafula on May 20, 2021
@jharenza
Member

Adding to this - we should probably also consider which mutations are going into this matrix - probably will want to exclude synonymous, silent, RNA, and intergenic for now. Do we know how OT will rank mutations? Is it by frequency per histology/cohort or by functional consequence + frequency? Thoughts @kgaonkar6 and @taylordm? cc @allisonheath

jharenza changed the title from "Proposed Analysis: Create oncomatrix for Ped OT platform" to "Proposed Analysis: Create mutation frequencies for Ped OT platform" on May 26, 2021
@logstar

logstar commented Jun 18, 2021

Sorry for the delay on this analysis.

I was wondering which mutation files in the PediatricOpenTargets/OpenPedCan-analysis v5 data release I should use for generating the mutation frequency tables. The oncoprint-landscape module uses the following files, but they are from AlexsLemonade/OpenPBTA-analysis data release.

maf_consensus=../../data/pbta-snv-consensus-mutation.maf.tsv.gz
fusion_file=../../data/pbta-fusion-putative-oncogenic.tsv
histologies_file=../../data/pbta-histologies.tsv
focal_directory=../focal-cn-file-preparation/results
focal_cnv_file=${focal_directory}/consensus_seg_most_focal_cn_status.tsv.gz

Should we rerun focal-cn-file-preparation module on PediatricOpenTargets/OpenPedCan-analysis release data? The focal_cnv_file is also generated using the AlexsLemonade/OpenPBTA-analysis release data.

After figuring out which mutation files to use, I am planning to merge them the same way the oncoprint-landscape module does, as follows:

maf_object <- prepare_maf_object(
  maf_df = maf_df,
  cnv_df = cnv_df,
  metadata = metadata,
  fusion_df = fusion_df
)

(link to the code)

Then, generate gene summary tables for the merged mutation object using maftools::getGeneSummary, which would output a table that contains the number of mutated samples like the following. I will compute mutation frequency as MutatedSamples / total.

Hugo_Symbol  Frame_Shift_Del  Frame_Shift_Ins  In_Frame_Del  In_Frame_Ins  Missense_Mutation  Nonsense_Mutation  Nonstop_Mutation  Splice_Site  Translation_Start_Site  total  MutatedSamples  AlteredSamples
MUC3A                      3                2             2             1                 22                  0                 0            0                       0     30              26              26
MUC5AC                     0                0             0             0                 32                  0                 0            0                       0     32              24              24
MUC4                       0                0             1             2                 23                  0                 0            0                       0     26              21              21
ALK                        0                0             0             0                 18                  0                 0            0                       0     18              18              18
NBPF10                     0                0             0             0                 19                  0                 0            0                       0     19              17              17
HLA-A                      0                0             0             0                 18                  0                 0            0                       0     18              17              17
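
For reference, a minimal base R sketch of the gene-level frequency calculation described above. It assumes the denominator is the number of samples in the group being summarized (the `total` column in the table counts variants per gene, not samples). The input table is made up, and `gene_mutation_frequency` and `n_samples` are hypothetical names, not part of maftools or the module.

```r
# Toy MAF-like table; real input would come from the consensus MAF.
maf_df <- data.frame(
  Hugo_Symbol = c("ALK", "ALK", "ALK", "MUC4", "MUC4"),
  Tumor_Sample_Barcode = c("S1", "S2", "S2", "S1", "S3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fraction of samples with >= 1 mutation in each gene.
gene_mutation_frequency <- function(maf_df, n_samples) {
  # Keep one row per gene/sample pair so multi-hit samples count once.
  gene_sample <- unique(maf_df[, c("Hugo_Symbol", "Tumor_Sample_Barcode")])
  counts <- table(gene_sample$Hugo_Symbol)
  data.frame(
    Hugo_Symbol = names(counts),
    MutatedSamples = as.integer(counts),
    Frequency = as.integer(counts) / n_samples,
    stringsAsFactors = FALSE
  )
}

# Assumed: 4 samples in the group, one of them without any mutation here.
freq_df <- gene_mutation_frequency(maf_df, n_samples = 4)
print(freq_df)
```

Counting unique gene/sample pairs mirrors the `MutatedSamples` column, where a sample with several variants in one gene contributes once.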

Regarding the note:

Note: we may want to do this on a mutation level and add significance values as the OT takes in those values. Still need to discuss the best way to do this.

I wonder if you could clarify the procedure to do this analysis on a mutation level.

For the significance values, I wonder if they describe whether each gene is significantly mutated in each histology. If so, is there an analysis module or R package for generating such significance values? I saw that Figure 2 of the maftools paper shows "log10 transformed Q-values estimated by MutSigCV", but MutSigCV requires a standalone MATLAB 2013a package and two additional input files (a coverage file and a covariate file), so MutSigCV may not be easy to implement for PediatricOpenTargets/OpenPedCan-analysis.

@kgaonkar6
Contributor

The consensus maf in v5 is snv-consensus-plus-hotspots.maf.tsv.gz

@logstar

logstar commented Jun 18, 2021

The consensus maf in v5 is snv-consensus-plus-hotspots.maf.tsv.gz

Thank you for the quick reply! I will use snv-consensus-plus-hotspots.maf.tsv.gz to generate mutation frequency tables for each histology.

@logstar

logstar commented Jun 24, 2021

@kgaonkar6 I was wondering if I could directly use the following files for this analysis:

  • analyses/focal-cn-file-preparation/results/consensus_seg_most_focal_cn_status.tsv.gz
  • analyses/interaction-plots/results/gene_disease_top50.tsv
  • analyses/focal-cn-file-preparation/results/consensus_seg_focal_cn_recurrent_genes.tsv

My concern is that these files might be generated using AlexsLemonade/OpenPBTA-analysis data release. I am not sure if they are compatible with PediatricOpenTargets/OpenPedCan-analysis data release.

@kgaonkar6
Contributor

You are right! Those files (and pbta-fusion-putative-oncogenic.tsv) have not yet been updated to include the samples added as part of OpenPedCan analysis, but we will be rerunning them before the next release, so I can keep you posted about that.

To me it looks like the requirement here is just the gene mutation frequency. Should we add another script to the module that generates only the frequencies, using an updated version of the prepare_maf_object function (with cnv_df made optional so we can use it now) and maftools::getGeneSummary, as you suggested?

@logstar

logstar commented Jun 24, 2021

You are right! Those files and (pbta-fusion-putative-oncogenic.tsv) are not updated to include samples added as part of OpenPedCan analysis yet but we will be rerunning before the next release. So I can keep you posted about that.

To me it looks like the requirement here is just the gene mutation frequency so should we add another script in the module to just generate the frequencies using an updated version of prepare_maf_object function ( make cnv_df optional so we can use it now ) and maftools::getGeneSummary as you suggested ?

Thank you for the quick reply and the suggestion!

I will skip the CNV part for now. I am planning to use an empty CNV file as a placeholder, so that the original code can be reused for this analysis.

I am planning to update the 01-plot-oncoprint.R, so that the mutation frequency tables are consistent with the corresponding plots.

@jharenza
Member

I am planning to update the 01-plot-oncoprint.R, so that the mutation frequency tables are consistent with the corresponding plots.

One other note is we do not want to do top 50 here, but all genes mentioned in the comment above, excluding

synonymous, silent, RNA, and intergenic

It also sounds like this analysis should be done on a cohort+cancer_group and then cancer_group level, separately for primary tumors and relapse/recurrent/progressive tumors using the independent specimen lists. This analysis has morphed a bit since initially written.
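
A rough base R sketch of the grouping described above, computed once per cohort+cancer_group and once per cancer_group, restricted to an independent specimen list. All values are invented; `frequency_by_group` and `primary_independent_ids` are hypothetical names, and the assumption that the independent specimen list reduces to a vector of `Kids_First_Biospecimen_ID` values is mine.

```r
# Toy inputs; column names follow those mentioned in this thread, but the
# values are made up for illustration.
histologies_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4"),
  cohort = c("PBTA", "PBTA", "GMKF", "GMKF"),
  cancer_group = c("Ependymoma", "Ependymoma", "Neuroblastoma", "Neuroblastoma"),
  stringsAsFactors = FALSE
)
primary_independent_ids <- c("BS_1", "BS_2", "BS_3")  # assumed primary list
maf_df <- data.frame(
  Hugo_Symbol = c("ALK", "ALK", "TP53"),
  Tumor_Sample_Barcode = c("BS_3", "BS_4", "BS_1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: gene frequencies within each level of `group_cols`.
frequency_by_group <- function(maf_df, histologies_df, keep_ids, group_cols) {
  meta <- histologies_df[histologies_df$Kids_First_Biospecimen_ID %in% keep_ids, ]
  meta$group <- do.call(paste, c(unname(as.list(meta[group_cols])), sep = "|"))
  n_by_group <- table(meta$group)  # denominator: independent samples per group
  hits <- merge(maf_df, meta, by.x = "Tumor_Sample_Barcode",
                by.y = "Kids_First_Biospecimen_ID")
  # One row per group/gene/sample so multi-hit samples count once.
  hits <- unique(hits[, c("group", "Hugo_Symbol", "Tumor_Sample_Barcode")])
  out <- as.data.frame(table(group = hits$group, Hugo_Symbol = hits$Hugo_Symbol),
                       responseName = "MutatedSamples", stringsAsFactors = FALSE)
  out <- out[out$MutatedSamples > 0, ]
  out$TotalSamples <- as.integer(n_by_group[out$group])
  out$Frequency <- out$MutatedSamples / out$TotalSamples
  out
}

by_cohort_group <- frequency_by_group(
  maf_df, histologies_df, primary_independent_ids, c("cohort", "cancer_group"))
by_group <- frequency_by_group(
  maf_df, histologies_df, primary_independent_ids, "cancer_group")
print(by_cohort_group)
print(by_group)
```

The same helper could be called again with a relapse/recurrent/progressive ID list to get the second set of tables.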

For the significance values, I wonder if they describe whether each gene is significantly mutated in each histology. If so, are there any analysis module or R package for generating such significance values? I saw the maftools paper Figure 2 shows "log10 transformed Q-values estimated by MutSigCV", but MutSigCV requires a standalone MATLAB 2013a package and two additional input files (coverage file and covariate file), so MutSigCV may not be easy to implement for PediatricOpenTargets/OpenPedCan-analysis.

Regarding significance, MutSigCV doesn't perform well on low Ns, and many of these histologies have a low N. I think for now, we are not going to worry about designating significance, but rather try to come up with a file of mutation frequencies, plus the additional annotation from @taylordm 's sample table. For the latter, we should probably just come up with a new ticket for creating the full annotated table.

below is the sample table:
OT_SomaticTables_SNV_CNV.xlsx

@logstar

logstar commented Jun 24, 2021

I am planning to update the 01-plot-oncoprint.R, so that the mutation frequency tables are consistent with the corresponding plots.

One other note is we do not want to do top 50 here, but all genes mentioned in the comment above, excluding

synonymous, silent, RNA, and intergenic

It also sounds like this analysis should be done on a cohort+cancer_group and then cancer_group level, separately for primary tumors and relapse/recurrent/progressive tumors using the independent specimen lists. This analysis has morphed a bit since initially written.

For the significance values, I wonder if they describe whether each gene is significantly mutated in each histology. If so, are there any analysis module or R package for generating such significance values? I saw the maftools paper Figure 2 shows "log10 transformed Q-values estimated by MutSigCV", but MutSigCV requires a standalone MATLAB 2013a package and two additional input files (coverage file and covariate file), so MutSigCV may not be easy to implement for PediatricOpenTargets/OpenPedCan-analysis.

Regarding significance, MutSigCV doesn't perform well on low Ns, and many of these histologies have a low N. I think for now, we are not going to worry about designating significance, but rather try to come up with a file of mutation frequencies, plus the additional annotation from @taylordm 's sample table. For the latter, we should probably just come up with a new ticket for creating the full annotated table.

below is the sample table:
OT_SomaticTables_SNV_CNV.xlsx

@jharenza Thank you for the detailed notes. They are very helpful for implementing this analysis.

I will generate the mutation frequency tables accordingly. Then, I will annotate the SNV table of mutation frequencies according to #64.

I will skip the significance part for now.

@jharenza
Member

Sure thing, let me know if you have any questions along the way!

@logstar

logstar commented Jun 24, 2021

Hi @kgaonkar6. I was wondering if I could use independent-specimens.rnaseq.primary-plus.tsv from the independent-samples module to subset the fusion table.

Although the fusion table is a work in progress at #7, the independent sample determination in the fusion table is related to the filtering of snv-consensus-plus-hotspots.maf.tsv.gz.

In the original code, fusion independent samples are determined by matching sample_ids to the Kids_First_Biospecimen_ID in independent-specimens.wgs.primary.tsv, so biospecimens whose sample_id has more than 2 rows in histologies_df are removed from snv-consensus-plus-hotspots.maf.tsv.gz to allow unambiguous matching between WGS and RNA-seq samples. The relevant code is listed below:

# in 00-map-to-sample_id.R
# An ambiguous sample_id will have more than 2 rows associated with it in the
# histologies file when looking at tumor samples -- that means we won't be able
# to determine when an WGS/WXS assay maps to an RNA-seq assay for the purpose of
# the oncoprint plot
ambiguous_sample_ids <- histologies_df %>%
  filter(sample_type == "Tumor",
         composition == "Solid Tissue") %>%
  group_by(sample_id) %>%
  tally() %>%
  filter(n > 2) %>%
  pull(sample_id)

ambiguous_biospecimens <- histologies_df %>%
  filter(sample_id %in% ambiguous_sample_ids) %>%
  pull(Kids_First_Biospecimen_ID)
# ...
biospecimens_to_remove <- unique(c(ambiguous_biospecimens,
                                   not_tumor_biospecimens))

# Filter the files!
maf_df <- maf_df %>%
  dplyr::filter(!(Tumor_Sample_Barcode %in% biospecimens_to_remove))
# ...

I found some sample IDs are mapping to hundreds or even thousands of samples, so I am concerned about removing the ambiguous_biospecimens.

> histologies_df %>%
+     filter(sample_type == "Tumor",
+            composition == "Solid Tissue") %>%
+     group_by(sample_id) %>%
+     tally() %>%
+     filter(n > 2)
# A tibble: 19 x 2
   sample_id     n
   <chr>     <int>
 1 01        11073
 2 02           49
 3 03          470
 4 05            9
 5 06          394
 6 09            5
 7 7316-14       3
 8 7316-1463     4
 9 7316-158      3
10 7316-161      3
11 7316-1765     4
12 7316-178      3
13 7316-3214     3
14 7316-3230     4
15 7316-3231     6
16 7316-85       3
17 7316-87       3
18 A16915        3
19 A18777        3
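
To make the concern concrete, here is a base R sketch on made-up data that applies the same `n > 2` rule. Every biospecimen that shares a flagged `sample_id` is dropped, so one generic ID such as `01` removes all of its rows at once. The table values are invented for illustration.

```r
# Toy histologies table; values are invented for illustration.
histologies_df <- data.frame(
  Kids_First_Biospecimen_ID = paste0("BS_", 1:7),
  sample_id = c("01", "01", "01", "01", "7316-85", "7316-85", "7316-99"),
  sample_type = "Tumor",
  composition = "Solid Tissue",
  stringsAsFactors = FALSE
)

# Same rule as the quoted filter: more than 2 tumor rows per sample_id.
tumor_df <- histologies_df[
  histologies_df$sample_type == "Tumor" &
    histologies_df$composition == "Solid Tissue", ]
n_per_sample_id <- table(tumor_df$sample_id)
ambiguous_sample_ids <- names(n_per_sample_id)[n_per_sample_id > 2]

# Every biospecimen sharing an ambiguous sample_id would be removed.
ambiguous_biospecimens <- histologies_df$Kids_First_Biospecimen_ID[
  histologies_df$sample_id %in% ambiguous_sample_ids]

removed_fraction <- length(ambiguous_biospecimens) / nrow(histologies_df)
print(ambiguous_sample_ids)
print(removed_fraction)
```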

@jharenza
Member

hey @logstar - you can disregard fusions and CNVs completely here, only looking for SNV frequencies using the indep DNA sample lists

@logstar

logstar commented Jun 24, 2021

hey @logstar - you can disregard fusions and CNVs completely here, only looking for SNV frequencies using the indep DNA sample lists

Thank you for the quick reply. I will disregard the fusions and CNVs in this analysis.

Sorry for being distracted by the CNVs and fusions. I was trying to figure out the original code and make this module compatible with the full OT data release, so that we will not need to revise the code much when CNVs and fusions are available. Now, I will get the SNV mutation frequency table generated before worrying about CNVs or fusions.

@logstar

logstar commented Jun 25, 2021

Hi @jharenza. I was wondering whether the Translation_Start_Site Variant_Classification should be considered non-synonymous.

In the original code, only the following Variant_Classifications are considered non-synonymous, and Translation_Start_Site is not included.

    read.maf(
      maf = maf_df,
      clinicalData = metadata,
      cnTable = cnv_df,
      removeDuplicatedVariants = FALSE,
      vc_nonSyn = c(
        "Frame_Shift_Del",
        "Frame_Shift_Ins",
        "Splice_Site",
        "Nonsense_Mutation",
        "Nonstop_Mutation",
        "In_Frame_Del",
        "In_Frame_Ins",
        "Missense_Mutation",
        "Fusion",
        "Multi_Hit",
        "Multi_Hit_Fusion",
        "Hom_Deletion",
        "Hem_Deletion",
        "Amp",
        "Del"
      )
    )

However, the default non-synonymous classifications in maftools are the following, which additionally include Translation_Start_Site.

  if(is.null(vc_nonSyn)){
    vc.nonSilent = c("Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
                     "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
                     "In_Frame_Ins", "Missense_Mutation")
  }

All unique Variant_Classification values in ../../data/snv-consensus-plus-hotspots.maf.tsv.gz are:

3'Flank
3'UTR
5'Flank
5'UTR
Frame_Shift_Del
Frame_Shift_Ins
IGR
In_Frame_Del
In_Frame_Ins
Intron
Missense_Mutation
Nonsense_Mutation
Nonstop_Mutation
RNA
Silent
Splice_Region
Splice_Site
Translation_Start_Site

@jharenza
Member

we can add it as non-silent!
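
A minimal base R sketch of the resulting filter, assuming it is applied to the MAF before counting. The class list is the SNV/indel portion of the module's `vc_nonSyn` values plus Translation_Start_Site; the fusion and CNV classes are left out because they are disregarded in this analysis. The toy rows and the name `non_silent_classes` are hypothetical.

```r
# Non-silent classes: the SNV/indel entries of the module's vc_nonSyn list,
# plus Translation_Start_Site as agreed above.
non_silent_classes <- c(
  "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site",
  "Translation_Start_Site", "Nonsense_Mutation", "Nonstop_Mutation",
  "In_Frame_Del", "In_Frame_Ins", "Missense_Mutation"
)

# Toy MAF-like rows; real input would be the consensus MAF.
maf_df <- data.frame(
  Hugo_Symbol = c("ALK", "ALK", "TP53", "MUC4", "ATRX"),
  Variant_Classification = c("Missense_Mutation", "Silent",
                             "Translation_Start_Site", "Intron", "IGR"),
  stringsAsFactors = FALSE
)

# Drop silent, intronic, intergenic, and other excluded classes.
non_silent_maf_df <- maf_df[
  maf_df$Variant_Classification %in% non_silent_classes, ]
print(non_silent_maf_df)
```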

@logstar

logstar commented Jun 25, 2021

we can add it as non-silent!

Will do. Thank you for the quick reply!

logstar added a commit to logstar/OpenPedCan-analysis that referenced this issue Jun 28, 2021
d3b-center/ticket-tracker-OPC#8

Update oncoprint-landscape module for PediatricOpenTargets/
OpenPedCan-analysis.

Generate mutation frequency tables for all genes.
logstar added a commit to logstar/OpenPedCan-analysis that referenced this issue Jun 30, 2021
Squashed commit of the following:

commit d87986b7ce1a517f4807430ce6beaac5950b50ca
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 17:21:59 2021 -0400

    Rename mutation-frequencies to snv-frequencies

    Rename module.

commit b2d2fd5c391b43214825e7b458d0edcb5ac22f1a
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 17:13:18 2021 -0400

    Annotate SNV table with mutation frequencies

    Issues addressed:

    - <d3b-center/ticket-tracker-OPC#64>
    - <d3b-center/ticket-tracker-OPC#8>.
      This issue is no longer compatible with the purpose of this module.
      This module intends to compute mutation frequencies for each variant,
      but this issue intends to compute the mutation frequencies for each gene.
      This issue is listed here for future reference.

commit 84cacf28927121037f4b9ba895e5baa5d12c7b31
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 16:23:20 2021 -0400

    [WIP] Update run-mutation-frequencies.sh

commit 29ae8ef19f2339ae08f78c26ab42e6cf75d3556e
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 16:14:50 2021 -0400

    [WIP] Generate annotated SNV frequency table

commit 2cb06741ca192f77a3043d03574649a184459b11
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 14:54:39 2021 -0400

    [WIP] Replace NA with blank string

    Also replaced HotSpot value 1 with Y and 0 with N.

commit 57776f61e576a5e3e2672370370fd1090f3aa478
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 14:02:13 2021 -0400

    [WIP] Use mygene.info to query gene IDs

    mygene.info seems to be actively maintained. The query results are more
    comprehensive than
    [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html).

    Relevant URLs:
    - <http://mygene.info/about>
    - <https://bioconductor.org/packages/release/bioc/html/mygene.html>

    mygene.info is suggested by @taylordm and  @jharenza.

commit 76bb0f5236378648adce429e45d3827009735b58
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 10:55:30 2021 -0400

    [WIP] Generate SNV frequency tables

    Issue addressed:
    d3b-center/ticket-tracker-OPC#64
@logstar

logstar commented Jul 27, 2021

Closed with PR d3b-center/OpenPedCan-analysis#45 merged.

logstar closed this as completed on Jul 27, 2021