Proposed Analysis: Create mutation frequencies for Ped OT platform #8
Comments
Adding to this - we should probably also consider which mutations are going into this matrix - probably will want to exclude synonymous, silent, RNA, and intergenic for now. Do we know how OT will rank mutations? Is it by frequency per histology/cohort or by functional consequence + frequency? Thoughts @kgaonkar6 and @taylordm? cc @allisonheath |
Sorry for the delay on this analysis. I was wondering which mutation files in the PediatricOpenTargets/OpenPedCan-analysis v5 data release I should use for generating the mutation frequency tables. The
Should we rerun
After figuring out which mutation files to use, I am planning to merge them like the
maf_object <- prepare_maf_object(
  maf_df = maf_df,
  cnv_df = cnv_df,
  metadata = metadata,
  fusion_df = fusion_df
)
Then, generate gene summary tables for the merged mutation object using
Regarding the note:
I wonder if you could clarify the procedure for doing this analysis at the mutation level. For the significance values, I wonder if they describe whether each gene is significantly mutated in each histology. If so, is there an analysis module or R package for generating such significance values? I saw that Figure 2 of the maftools paper shows "log10 transformed Q-values estimated by MutSigCV", but MutSigCV requires a standalone MATLAB 2013a package and two additional input files (a coverage file and a covariate file), so MutSigCV may not be easy to implement for PediatricOpenTargets/OpenPedCan-analysis. |
The consensus maf in v5 is |
Thank you for the quick reply! I will use |
@kgaonkar6 I was wondering if I could directly use the following files for this analysis:
My concern is that these files might be generated using |
You are right! Those files (and pbta-fusion-putative-oncogenic.tsv) have not yet been updated to include the samples added as part of the OpenPedCan analysis, but we will be rerunning before the next release, so I can keep you posted about that. To me it looks like the requirement here is just the gene mutation frequency, so should we add another script in the module to just generate the frequencies using an updated version of |
Thank you for the quick reply and the suggestion! I will skip the CNV part for now. I am planning to use an empty CNV file as a placeholder, so the original code can be reused for this analysis. I am planning to update the |
One other note is we do not want to do top 50 here, but all genes mentioned in the comment above, excluding
It also sounds like this analysis should be done on a
Regarding significance, MutSigCV doesn't perform well on low Ns, and many of these histologies have a low N. I think for now we are not going to worry about designating significance, but rather try to come up with a file of mutation frequencies, plus the additional annotation from @taylordm's sample table. For the latter, we should probably just create a new ticket for building the full annotated table. Below is the sample table: |
@jharenza Thank you for the detailed notes. They are very helpful for implementing this analysis. I will generate the mutation frequency tables accordingly. Then, I will annotate the SNV table of mutation frequencies according to #64. I will skip the significance part for now. |
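As a rough illustration of the gene-level frequency table discussed above, here is a minimal base R sketch. It is not the module's code: the toy data, the `histology` column name, and the `excluded_classes` vector are assumptions made for the example, and the frequency is simply the number of samples with at least one retained mutation in a gene divided by the number of samples in that histology.

```r
# Toy MAF-like table, one row per variant call (all values are made up).
maf_df <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_1", "BS_2", "BS_3", "BS_4"),
  Hugo_Symbol = c("TP53", "TP53", "TP53", "H3F3A", "TP53"),
  Variant_Classification = c("Missense_Mutation", "Frame_Shift_Del", "Silent",
                             "Missense_Mutation", "Nonsense_Mutation"),
  stringsAsFactors = FALSE
)

# Toy table of independent DNA samples; "histology" is a hypothetical column name.
samples_df <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_2", "BS_3", "BS_4"),
  histology = c("HGG", "HGG", "HGG", "LGG"),
  stringsAsFactors = FALSE
)

# Assumption: classes to drop, following the "silent, RNA, intergenic" suggestion.
excluded_classes <- c("Silent", "RNA", "IGR")

# Keep non-excluded variants and attach the histology of each sample.
kept <- maf_df[!(maf_df$Variant_Classification %in% excluded_classes), ]
kept <- merge(kept, samples_df, by = "Tumor_Sample_Barcode")

# Count each sample at most once per gene, even if it carries several variants.
mutated <- unique(kept[, c("Hugo_Symbol", "histology", "Tumor_Sample_Barcode")])
n_mutated <- aggregate(Tumor_Sample_Barcode ~ Hugo_Symbol + histology,
                       data = mutated, FUN = length)
names(n_mutated)[3] <- "n_mutated"

# Denominator: all samples in the histology, mutated or not.
n_total <- aggregate(Tumor_Sample_Barcode ~ histology,
                     data = samples_df, FUN = length)
names(n_total)[2] <- "n_total"

freq_df <- merge(n_mutated, n_total, by = "histology")
freq_df$frequency <- freq_df$n_mutated / freq_df$n_total
print(freq_df)
```

In the toy data, BS_1 carries two TP53 variants but is counted once, and BS_2 contributes only to the HGG denominator because its single variant is silent.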
Sure thing, let me know if you have any questions along the way! |
Hi @kgaonkar6. I was wondering if I could use
Although the fusion table is work in progress at #7, the independent sample determination in the fusion table is related to the filtering of the
In the original code, fusion independent samples are determined by matching
# in 00-map-to-sample_id.R
# An ambiguous sample_id will have more than 2 rows associated with it in the
# histologies file when looking at tumor samples -- that means we won't be able
# to determine when an WGS/WXS assay maps to an RNA-seq assay for the purpose of
# the oncoprint plot
ambiguous_sample_ids <- histologies_df %>%
  filter(sample_type == "Tumor",
         composition == "Solid Tissue") %>%
  group_by(sample_id) %>%
  tally() %>%
  filter(n > 2) %>%
  pull(sample_id)
ambiguous_biospecimens <- histologies_df %>%
  filter(sample_id %in% ambiguous_sample_ids) %>%
  pull(Kids_First_Biospecimen_ID)
# ...
biospecimens_to_remove <- unique(c(ambiguous_biospecimens,
                                   not_tumor_biospecimens))
# Filter the files!
maf_df <- maf_df %>%
  dplyr::filter(!(Tumor_Sample_Barcode %in% biospecimens_to_remove))
# ...
I found some sample IDs are mapping to hundreds or even thousands of samples, so I am concerned about removing the
|
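To make the effect of the ambiguity filter concrete, the following base R sketch applies the same rule (more than 2 solid-tissue tumor rows per `sample_id`) to a toy histologies table. The data are invented for illustration, and the base R calls stand in for the dplyr pipeline quoted above; this is a sketch, not the module's implementation.

```r
# Toy histologies table (all IDs are made up).
histologies_df <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_A", "BS_B", "BS_C", "BS_D",
                                "BS_E", "BS_F", "BS_G"),
  sample_id = c("S1", "S1", "S1", "S2", "S2", "S2", "S3"),
  sample_type = c("Tumor", "Tumor", "Normal", "Tumor",
                  "Tumor", "Tumor", "Tumor"),
  composition = rep("Solid Tissue", 7),
  stringsAsFactors = FALSE
)

# Only solid-tissue tumor rows are counted, as in the quoted code.
tumor_df <- histologies_df[histologies_df$sample_type == "Tumor" &
                             histologies_df$composition == "Solid Tissue", ]

# Number of tumor biospecimens per sample_id.
n_per_sample <- table(tumor_df$sample_id)

# A sample_id with more than 2 tumor rows is treated as ambiguous.
ambiguous_sample_ids <- names(n_per_sample)[as.vector(n_per_sample) > 2]

# Every biospecimen sharing an ambiguous sample_id would be removed.
ambiguous_biospecimens <- histologies_df$Kids_First_Biospecimen_ID[
  histologies_df$sample_id %in% ambiguous_sample_ids
]

print(n_per_sample)
print(ambiguous_biospecimens)
```

Inspecting `n_per_sample` on the real histologies file before filtering would show how many biospecimens each `sample_id` maps to, which is the quantity behind the concern raised above.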
hey @logstar - you can disregard fusions and CNVs completely here, only looking for SNV frequencies using the indep DNA sample lists |
Thank you for the quick reply. I will disregard the fusions and CNVs in this analysis. Sorry for being distracted by the CNVs and fusions. I was trying to understand the original code and make this module compatible with the full OT data release, so that we would not need to revise the code much when CNVs and fusions become available. For now, I will generate the SNV mutation frequency table before worrying about CNVs or fusions. |
Hi @jharenza. I was wondering whether
In the original code, only the following
read.maf(
  maf = maf_df,
  clinicalData = metadata,
  cnTable = cnv_df,
  removeDuplicatedVariants = FALSE,
  vc_nonSyn = c(
    "Frame_Shift_Del",
    "Frame_Shift_Ins",
    "Splice_Site",
    "Nonsense_Mutation",
    "Nonstop_Mutation",
    "In_Frame_Del",
    "In_Frame_Ins",
    "Missense_Mutation",
    "Fusion",
    "Multi_Hit",
    "Multi_Hit_Fusion",
    "Hom_Deletion",
    "Hem_Deletion",
    "Amp",
    "Del"
  )
)
However, the default non-synonyms in maftools are the following, which has the additional
if(is.null(vc_nonSyn)){
  vc.nonSilent = c("Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
                   "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
                   "In_Frame_Ins", "Missense_Mutation")
}
All unique
|
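The difference between the two classification lists quoted above can be checked directly with `setdiff()`. The vectors below are copied from the two snippets; the variable names are chosen only for this example.

```r
# Variant classifications passed as vc_nonSyn in the original module code.
original_vc <- c(
  "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Nonsense_Mutation",
  "Nonstop_Mutation", "In_Frame_Del", "In_Frame_Ins", "Missense_Mutation",
  "Fusion", "Multi_Hit", "Multi_Hit_Fusion", "Hom_Deletion", "Hem_Deletion",
  "Amp", "Del"
)

# Default non-silent classifications from the maftools snippet quoted above.
maftools_default_vc <- c(
  "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site", "Translation_Start_Site",
  "Nonsense_Mutation", "Nonstop_Mutation", "In_Frame_Del",
  "In_Frame_Ins", "Missense_Mutation"
)

# Classes in the maftools default that the original list leaves out.
missing_vc <- setdiff(maftools_default_vc, original_vc)
print(missing_vc)

# One possible updated list that keeps the original order and appends the rest.
updated_vc <- union(original_vc, missing_vc)
```

On these two lists, `missing_vc` contains only `"Translation_Start_Site"`.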
we can add it as non-silent! |
Will do. Thank you for the quick reply! |
d3b-center/ticket-tracker-OPC#8 Update oncoprint-landscape module for PediatricOpenTargets/OpenPedCan-analysis. Generate mutation frequency tables for all genes.
Squashed commit of the following:

commit d87986b7ce1a517f4807430ce6beaac5950b50ca
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 17:21:59 2021 -0400

    Rename mutation-frequencies to snv-frequencies

    Rename module.

commit b2d2fd5c391b43214825e7b458d0edcb5ac22f1a
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 17:13:18 2021 -0400

    Annotate SNV table with mutation frequencies

    Issues addressed:
    - <d3b-center/ticket-tracker-OPC#64>
    - <d3b-center/ticket-tracker-OPC#8>. This issue is no longer compatible with the purpose of this module. This module intends to compute mutation frequencies for each variant, but this issue intends to compute the mutation frequencies for each gene. This issue is listed here for future reference.

commit 84cacf28927121037f4b9ba895e5baa5d12c7b31
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 16:23:20 2021 -0400

    [WIP] Update run-mutation-frequencies.sh

commit 29ae8ef19f2339ae08f78c26ab42e6cf75d3556e
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 16:14:50 2021 -0400

    [WIP] Generate annotated SNV frequency table

commit 2cb06741ca192f77a3043d03574649a184459b11
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 14:54:39 2021 -0400

    [WIP] Replace NA with blank string

    Also replaced HotSpot value 1 with Y and 0 with N.

commit 57776f61e576a5e3e2672370370fd1090f3aa478
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 14:02:13 2021 -0400

    [WIP] Use mygene.info to query gene IDs

    mygene.info seems to be actively maintained. The query results are more comprehensive than [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html).

    Relevant URLs:
    - <http://mygene.info/about>
    - <https://bioconductor.org/packages/release/bioc/html/mygene.html>

    mygene.info is suggested by @taylordm and @jharenza.

commit 76bb0f5236378648adce429e45d3827009735b58
Author: logstar <y.will.zhang@gmail.com>
Date: Wed Jun 30 10:55:30 2021 -0400

    [WIP] Generate SNV frequency tables

    Issue addressed: d3b-center/ticket-tracker-OPC#64
Closed with PR d3b-center/OpenPedCan-analysis#45 merged. |
What are the scientific goals of the analysis?
Update the oncoprint-landscape module to output a gene mutation frequency TSV per histology (or cohort) for the Pediatric Open Targets platform. For this, we will use all genes, not just genes of interest.
Note: we may want to do this at the mutation level and add significance values, since OT takes in those values. We still need to discuss the best way to do this.
What input data are required for this analysis?
Consensus MAFs
How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?
1 day
Who will complete the analysis (please add a GitHub handle here if relevant)?
@ewafula