This repository has been archived by the owner on Jun 16, 2023. It is now read-only.

Proposed Analysis: Annotate SNV table of mutation frequencies #64

Closed
3 of 4 tasks
jharenza opened this issue Jun 24, 2021 · 14 comments

@jharenza
Member

jharenza commented Jun 24, 2021

What are the scientific goals of the analysis?

Annotate the SNV TSV table of mutation frequencies per cohort+cancer group+primary/relapse as will be created in #8 for conversion to JSON format.

What methods do you plan to use to accomplish the scientific goals?

Annotate the table with headings as below:
OT_SomaticTables_SNV_CNV.xlsx

Much of this can be achieved by leveraging MAF fields corresponding to the exact variant calls. For ClinVar, we may need to download a version of the database.

Update June 30th

What input data are required for this analysis?

snv-consensus-plus-hotspots.maf.tsv.gz

How long do you expect is needed to complete the analysis? Will it be a multi-step analysis?

2-3 days

Who will complete the analysis (please add a GitHub handle here if relevant)?

@logstar ?

What relevant scientific literature relates to this analysis?

@logstar

logstar commented Jun 29, 2021

Hi @jharenza . Thank you for the analysis description.

I have a few questions about how to generate the columns in the "Example Somatic Mutation Table" sheet of OT_SomaticTables_SNV_CNV.xlsx.

  • To generate Frequency in Overall Dataset (example value "4.17%"), should I use all samples, or all independent-primary-plus samples, or other set of samples in a cancer_group/cancer_group_cohort?
  • To generate Mutations/Total Samples (example value ".1/24"), should I aggregate all variants of the corresponding gene and divide by "Total Samples"? Similarly, for "Total Samples", should I use all samples, all independent-primary-plus samples, or some other set of samples in a cancer_group/cancer_group_cohort?
  • To generate Identifer hg38 (See ClinVar ex) (example value "10_102599545_G_A"), should I concatenate Chromosome, Start_Position, End_Position, Reference_Allele, and Tumor_Seq_Allele2 with "_"? I briefly went over ClinVar identifier documentation and several examples, but I could not find any identifier exactly like "10_102599545_G_A".
  • To generate the following columns, do we have mapping tables available? If not, I will download them from their corresponding databases.
    • Protein Identifier/Name
    • Protein Refseq ID
    • Predicted Mutation Impact Score
    • Overall COSMIC frequency
    • OncoKB cancer gene
    • OncoKB oncogene/TS gene
  • To generate Hotspot, should I use the HotSpotAllele column in the MAF file? Update (Jun 29 17:39:21 2021): found the answer in doc/data-formats.md. I will use the HotSpotAllele column in the MAF file as the Hotspot column in OT_SomaticTables_SNV_CNV.xlsx.

I will work on the frequencies first.

@jharenza
Member Author

Hi @logstar

Let's tackle cancer_group_cohort first.

  • To generate Frequency in Overall Dataset (example value "4.17%"), should I use all samples, or all independent-primary-plus samples, or other set of samples in a cancer_group/cancer_group_cohort?
  • To generate Mutations/Total Samples (example value ".1/24"), should I aggregate all variants of the corresponding gene and divide by "Total Samples"? Similarly, for "Total Samples", should I use all samples, all independent-primary-plus samples, or some other set of samples in a cancer_group/cancer_group_cohort?

Ah, good question. I hadn't recalled this field. The way we would create it is to identify the unique variants per patient within each cancer_group_cohort. Right now the calls are at the sample level, but you could pull out the BS_IDs and the mutation metadata, merge them with the PT_IDs, drop the BS_IDs, and deduplicate to get patient-level variant calls. This is the data that goes into Frequency in Overall Dataset. Instead of Mutations/Total Samples, I think we should split this column into two, Total mutations and Total patients in dataset; otherwise it would just be the fraction that corresponds to the percent.

Then independent-primary will be used for Frequency in primary tumors and independent-relapse will be used for Frequency in relapse tumors. That being said, I think we need another four columns: Total primary tumors mutated, Total primary tumors in dataset, Total relapse tumors mutated, and Total relapse tumors in dataset. Let me update the Excel file, too.
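
The sample-to-patient collapse described above can be sketched in pandas as follows. This is only an illustration under stated assumptions, not the module's actual code: the mapping-table column names `BS_ID` and `PT_ID`, the use of `Tumor_Sample_Barcode` as the sample key in the MAF, and all toy values are hypothetical.

```python
import pandas as pd

# Columns that define "the same variant" (MAF column names mentioned in this thread).
VARIANT_COLS = ["Chromosome", "Start_Position", "Reference_Allele", "Tumor_Seq_Allele2"]

# Toy sample-level calls. Tumor_Sample_Barcode as the sample key is an assumption.
maf = pd.DataFrame({
    "Tumor_Sample_Barcode": ["BS_1", "BS_2", "BS_3", "BS_3"],
    "Chromosome": ["chr10", "chr10", "chr10", "chr7"],
    "Start_Position": [102599545, 102599545, 102599545, 140753336],
    "Reference_Allele": ["G", "G", "G", "A"],
    "Tumor_Seq_Allele2": ["A", "A", "A", "T"],
})

# Hypothetical sample-to-patient mapping (BS_ID -> PT_ID).
samples = pd.DataFrame({
    "BS_ID": ["BS_1", "BS_2", "BS_3"],
    "PT_ID": ["PT_A", "PT_A", "PT_B"],
})


def patient_level_calls(maf, samples):
    """Merge calls with patient IDs, drop sample IDs, and deduplicate."""
    merged = maf.merge(
        samples, left_on="Tumor_Sample_Barcode", right_on="BS_ID", how="inner"
    )
    # Keeping only PT_ID plus the variant columns drops the BS_IDs;
    # drop_duplicates then leaves one row per (patient, variant).
    return merged[["PT_ID"] + VARIANT_COLS].drop_duplicates().reset_index(drop=True)


patient_calls = patient_level_calls(maf, samples)
print(patient_calls)
```

In the toy data, PT_A carries the same variant in two samples, so it contributes a single patient-level row for that variant.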

  • To generate Identifer hg38 (See ClinVar ex) (example value "10_102599545_G_A"), should I concatenate Chromosome, Start_Position, End_Position, Reference_Allele, and Tumor_Seq_Allele2 with "_"? I briefly went over ClinVar identifier documentation and several examples, but I could not find any identifier exactly like "10_102599545_G_A".

I think you have this almost right; there would be no end position in the above identifier.
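
A minimal sketch of that identifier, joining Chromosome, Start_Position, Reference_Allele, and Tumor_Seq_Allele2 with "_" and leaving out End_Position. The function name is hypothetical, and stripping a leading "chr" is an assumption made only so the result matches the example value "10_102599545_G_A".

```python
def variant_identifier(chromosome, start_position, reference_allele, tumor_seq_allele2):
    """Build an hg38 identifier such as '10_102599545_G_A' (no End_Position)."""
    chrom = str(chromosome)
    # Assumption: the MAF may store chromosomes as 'chr10'; the example value has no prefix.
    if chrom.startswith("chr"):
        chrom = chrom[len("chr"):]
    return "_".join([chrom, str(start_position), reference_allele, tumor_seq_allele2])


example_id = variant_identifier("chr10", 102599545, "G", "A")
print(example_id)  # 10_102599545_G_A
```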

  • To generate the following columns, do we have mapping tables available? If not, I will download them from their corresponding databases.
    • Protein Identifier/Name
    • Protein Refseq ID
    • Predicted Mutation Impact Score
    • Overall COSMIC frequency
    • OncoKB cancer gene
    • OncoKB oncogene/TS gene

Yes, let me update the Excel file. Give me a few minutes.

  • To generate Hotspot, should I use the HotSpotAllele column in the MAF file?

yes

@logstar

logstar commented Jun 29, 2021

@jharenza Thank you for the detailed reply.

I will work on cancer_group_cohort first.

I agree it is more informative to have the revised columns.

If I understand correctly, Frequency in Overall Dataset = Total mutations / Total patients in dataset, where Total mutations is the number of patients that have the variant in the corresponding row.

@jharenza
Member Author

If I understand correctly, Frequency in Overall Dataset = Total mutations / Total patients in dataset, where Total mutations is the number of patients that have the variant in the corresponding row.

yes, where total mutations is total N of that specific mutation in the dataset
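
As a worked example of the definition agreed above, the sketch below counts the patients carrying each variant and divides by the number of patients in the dataset. It assumes patient-level calls have already been derived; the column names `PT_ID` and `variant_id`, the underscore-separated output column names, and the toy data are hypothetical. The denominator is passed in explicitly because patients with no calls still count towards Total patients in dataset.

```python
import pandas as pd

# Toy patient-level calls: one row per (patient, variant); values are made up.
patient_calls = pd.DataFrame({
    "PT_ID": ["PT_1", "PT_2", "PT_3"],
    "variant_id": ["10_102599545_G_A", "7_140753336_A_T", "7_140753336_A_T"],
})


def variant_frequencies(patient_calls, total_patients):
    """Per-variant patient counts and frequency, formatted as a percentage."""
    counts = (
        patient_calls.groupby("variant_id")["PT_ID"]
        .nunique()  # number of distinct patients carrying the variant
        .rename("Total_mutations")
        .reset_index()
    )
    counts["Total_patients_in_dataset"] = total_patients
    counts["Frequency_in_overall_dataset"] = (
        counts["Total_mutations"] / total_patients
    ).map("{:.2%}".format)
    return counts


# 24 patients in the dataset, one of whom carries 10_102599545_G_A: 1/24 = 4.17%.
frequencies = variant_frequencies(patient_calls, total_patients=24)
print(frequencies)
```

The same function could be applied separately to the independent-primary and independent-relapse subsets to fill the primary and relapse columns, each with its own denominator.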

@logstar

logstar commented Jun 29, 2021

If I understand correctly, Frequency in Overall Dataset = Total mutations / Total patients in dataset, where Total mutations is the number of patients that have the variant in the corresponding row.

yes, where total mutations is total N of that specific mutation in the dataset

Got it. Thank you for the quick reply. I will work on the analysis accordingly.

@kgaonkar6
Contributor

kgaonkar6 commented Jun 30, 2021

Just wanted to make a note from the June 30th call (feel free to update/edit)

  • Instead of COSMIC frequency, annotate as COSMIC mutation census with tier from CosmicMutantExportCensus.tsv
  • add RMTL (Y/N)

@logstar

logstar commented Jun 30, 2021

Thank you for the notes.

I wonder where I can get the CosmicMutantExportCensus.tsv and RMTL tables for annotation. Are they going to be available in future data releases?

@kgaonkar6
Contributor

RMTL will be (soon) provided in v6
CosmicMutantExportCensus.tsv (originally from https://cancer.sanger.ac.uk/census, or a reusable version of the file with the info) will also be provided in v6

@logstar

logstar commented Jun 30, 2021

RMTL will be (soon) provided in v6
CosmicMutantExportCensus.tsv (originally from https://cancer.sanger.ac.uk/census, or a reusable version of the file with the info) will also be provided in v6

Got it. Thank you for the quick reply.

@jharenza
Member Author

I am having problems downloading the CosmicMutantExportCensus.tsv file. I cannot access it via the Qiagen website as they suggest, so I sent an email asking for help. For now, proceed without this annotation.

logstar added a commit to logstar/OpenPedCan-analysis that referenced this issue Jun 30, 2021
Squashed commit of the following:

commit d87986b7ce1a517f4807430ce6beaac5950b50ca
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 17:21:59 2021 -0400

    Rename mutation-frequencies to snv-frequencies

    Rename module.

commit b2d2fd5c391b43214825e7b458d0edcb5ac22f1a
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 17:13:18 2021 -0400

    Annotate SNV table with mutation frequencies

    Issues addressed:

    - <d3b-center/ticket-tracker-OPC#64>
    - <d3b-center/ticket-tracker-OPC#8>.
      This issue is no longer compatible with the purpose of this module.
      This module intends to compute mutation frequencies for each variant,
      but this issue intends to compute the mutation frequencies for each gene.
      This issue is listed here for future reference.

commit 84cacf28927121037f4b9ba895e5baa5d12c7b31
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 16:23:20 2021 -0400

    [WIP] Update run-mutation-frequencies.sh

commit 29ae8ef19f2339ae08f78c26ab42e6cf75d3556e
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 16:14:50 2021 -0400

    [WIP] Generate annotated SNV frequency table

commit 2cb06741ca192f77a3043d03574649a184459b11
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 14:54:39 2021 -0400

    [WIP] Replace NA with blank string

    Also replaced HotSpot value 1 with Y and 0 with N.

commit 57776f61e576a5e3e2672370370fd1090f3aa478
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 14:02:13 2021 -0400

    [WIP] Use mygene.info to query gene IDs

    mygene.info seems to be actively maintained. The query results are more
    comprehensive than
    [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html).

    Relevant URLs:
    - <http://mygene.info/about>
    - <https://bioconductor.org/packages/release/bioc/html/mygene.html>

    mygene.info is suggested by @taylordm and @jharenza.

commit 76bb0f5236378648adce429e45d3827009735b58
Author: logstar <y.will.zhang@gmail.com>
Date:   Wed Jun 30 10:55:30 2021 -0400

    [WIP] Generate SNV frequency tables

    Issue addressed:
    d3b-center/ticket-tracker-OPC#64

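
The "[WIP] Replace NA with blank string" commit above describes two clean-up steps: blanking out NA values and recoding HotSpot 1/0 as Y/N. A rough pandas sketch of that idea is shown below. It is not the module's code: the function name, the `Gene_symbol` column, and the toy values are hypothetical, and applying the recode to a column named `HotSpotAllele` is an assumption based on the earlier discussion in this thread.

```python
import pandas as pd

# Toy output table; values are made up.
table = pd.DataFrame({
    "Gene_symbol": ["GENE1", "GENE2", None],
    "HotSpotAllele": [1, 0, 1],
})


def tidy_for_export(df):
    """Recode HotSpotAllele 1/0 as Y/N and replace missing values with ''."""
    out = df.copy()
    # Values other than 1 or 0 become missing here and are blanked below.
    out["HotSpotAllele"] = out["HotSpotAllele"].map({1: "Y", 0: "N"})
    # Cast to object first so every column can hold an empty string.
    out = out.astype(object)
    return out.where(out.notna(), "")


tidy = tidy_for_export(table)
print(tidy)
```
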
@logstar

logstar commented Jul 27, 2021

@jharenza Is the following unfinished task included in the Gene_type annotation? The "COSMIC genes" are listed as a data source for the genelistreference.txt at https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/fusion_filtering.

Instead of COSMIC frequency, annotate as COSMIC mutation census with tier from CosmicMutantExportCensus.tsv

If not, I could add the required annotation for the v7 annotator and snv-frequencies module updates.

@jharenza
Member Author

No, this was to use the COSMIC mutation evidence rather than the genes. I never heard back from them, so we can add it as a future ticket and enhancement if we hear back.

@logstar

logstar commented Jul 29, 2021

No, this was to use the COSMIC mutation evidence rather than the genes. I never heard back from them, so we can add it as a future ticket and enhancement if we hear back.

Got it. I think we could leave this ticket open as a reference.

I will also submit two tickets for adding COSMIC mutation evidence to snv-frequencies and annotator, label them with blocked, and refer to this issue.

@runjin326

Closing with PR45 merged.
