This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

#790 Part1: adding new SNV subtypes for LGAT #842

Merged
merged 28 commits into from
Jan 10, 2021

Conversation

kgaonkar6
Collaborator

@kgaonkar6 kgaonkar6 commented Nov 16, 2020

Purpose/implementation Section

LGAT subtyping is being revamped as per the issue. I'm dividing this work into staggered PRs, one per alteration type.

What scientific question is your analysis addressing?

As per issue we will be subtyping LGAT based on SNV in the following genes:

  • LGG, NF1
    somatic loss of NF1 via either a missense or a nonsense mutation

  • LGG, BRAF V600E
    contains BRAF V600E or V599 SNV or non-canonical BRAF alterations such as p.V600ins or p.D594N

  • LGG, other MAPK
    contains a KRAS, NRAS, HRAS, MAP2K1, MAP2K2, or ARAF SNV or indel

  • LGG, RTK
    harbors a MET SNV or
    harbors a KIT SNV or
    harbors a PDGFRA SNV

  • LGG, FGFR
    harbors FGFR1 p.N546K, p.K656E, p.N577, or p.K687 hotspot mutations

  • LGG, IDH
    harbors an IDH R132 mutation

  • LGG, H3.3
    harbors an H3F3A K28M or G35R/V mutation

  • LGG, H3.1
    harbors an HIST1H3B K28M mutation or
    harbors an HIST1H3C K28M mutation

What was your approach?

I used a list of genes to look for in the consensus SNV calls per subtype (with additional information about hotspots and canonical mutations).

The mutation status per subtype is saved in lgat-subset/LGAT_snv_subset.tsv with the following columns:

  Tumor_Sample_Barcode BRAF_V600E_mut FGFR_mut IDH_mut H3F3A_mut HIST1H3B_mut HIST1H3C_mut MAPK_mut RTK_mut NF1_mut
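To make the shape of that table concrete, here is a minimal sketch of how a long table of per-sample hits could be turned into one status column per group. It uses base R only so that it runs without extra packages. The barcodes, the `hits` table, and the "Yes"/"No" encoding are assumptions for illustration; the actual file may encode status differently.

```r
# Hypothetical long-format hits: one row per (sample, mutated group).
hits <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_1", "BS_3"),
  group = c("BRAF_V600E_mut", "NF1_mut", "MAPK_mut"),
  stringsAsFactors = FALSE
)

# All samples in the (toy) LGAT subset, including ones without any hit.
all_samples <- c("BS_1", "BS_2", "BS_3")
groups <- c("BRAF_V600E_mut", "MAPK_mut", "NF1_mut")

# Start with one row per sample and add one status column per group.
status <- data.frame(Tumor_Sample_Barcode = all_samples, stringsAsFactors = FALSE)
for (g in groups) {
  mutated <- hits$Tumor_Sample_Barcode[hits$group == g]
  # Assumed encoding: "Yes" if the sample has a hit in this group, else "No".
  status[[g]] <- ifelse(status$Tumor_Sample_Barcode %in% mutated, "Yes", "No")
}

print(status)
```

Keeping samples without any hit in the table (here `BS_2`) makes it easy to see later which biospecimens remain unclassified by SNV alone.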

What GitHub issue does your pull request address?

#790

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • I wanted to save the alterations as a list since there are different conditions per gene and subtype. Let me know if you prefer some other way to organize the gene/mutation lists.
  • Each gene has its own criteria for selection; please refer to the description in Updated analysis: LGAT - add additional subtypes #790 by @jharenza. In addition, non-canonical mutations in the kinase domain will also be added to the BRAF V600E subtype.
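As a rough sketch of what such a per-subtype list can look like once it is in R: the field names `gene`, `canonical`, and `hotspot` follow how the notebook indexes `snvOI`, but the contents below are assumed for illustration and may not match `snvOI_list.json`.

```r
# Hypothetical per-subtype list; the real JSON input may be organized differently.
snv_of_interest <- list(
  BRAF_V600E = list(gene = "BRAF", canonical = "p.V600E",
                    hotspot = c("p.V600", "p.V599")),
  NF1 = list(gene = "NF1"),
  MAPK = list(gene = c("KRAS", "NRAS", "HRAS", "MAP2K1", "MAP2K2", "ARAF")),
  RTK = list(gene = c("MET", "KIT", "PDGFRA"))
)

# Collapse the hotspot positions into one pattern for grepl().
# Note that "." is a regex wildcard here; that is acceptable for this toy case.
braf_hotspot <- paste(snv_of_interest$BRAF_V600E$hotspot, collapse = "|")

# Toy protein changes: a canonical hit, an insertion near the hotspot, and a
# kinase-domain change that the hotspot pattern alone does not catch.
protein_changes <- c("p.V600E", "p.V599_V600insT", "p.D594N")
has_hotspot <- grepl(braf_hotspot, protein_changes)
print(has_hotspot)
```

The last value shows why a separate rule, such as the kinase-domain check, is needed for non-canonical alterations like p.D594N.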

Is there anything that you want to discuss further?

Does the SNV need some basic filtering to keep only non-synonymous mutations?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

table

What is your summary of the results?

We have a total of 293 WGS biospecimens: 45 have a BRAF mutation, 10 have an FGFR mutation, 16 have a MAPK mutation, 19 have an RTK mutation, 5 have an NF1 mutation, and 1 has a HIST1H3B mutation.

We didn't find any IDH or H3F3A mutations.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@jharenza jharenza assigned jharenza and unassigned jharenza Nov 17, 2020
@jharenza jharenza self-requested a review November 17, 2020 18:08
@kgaonkar6 kgaonkar6 changed the title adding new SNV subtypes for LGAT #790 Part1: adding new SNV subtypes for LGAT Nov 18, 2020
@jharenza
Collaborator

jharenza commented Nov 19, 2020

@kgaonkar6 thanks for this! To answer your question:

Does the SNV need some basic filtering to keep only non-synonymous mutations?

Good question and point! Yes, let's filter out synonymous and silent mutations based on the paper.

I think that you should keep the notebook for selecting LGAT samples separate from the SNV alteration notebook. I will still comment inline for review.

Collaborator

@jharenza jharenza left a comment

Looks good in general! See my comments inline for a few changes.

analyses/molecular-subtyping-LGAT/input/snvOI_list.json Outdated Show resolved Hide resolved
@kgaonkar6
Collaborator Author

kgaonkar6 commented Nov 19, 2020

@kgaonkar6 thanks for this! To answer your question:

Does the SNV need some basic filtering to keep only non-synonymous mutations?

Good question and point! Yes, let's filter out synonymous and silent mutations based on the paper.

I used the Variant_Classification terms for synonymous SNVs as per interaction-plots/scripts/02-process_mutations.R. Does this look OK? Should I also remove Variant_Classification %in% c("Intron", "5'Flank")? Some Intron/5' flank SNVs are getting through in the output.

# Variant Classification with Low/Modifier variant consequences 
#  from maftools http://asia.ensembl.org/Help/Glossary?id=535
synonymous <- c(
  "Silent",
  "Start_Codon_Ins",
  "Start_Codon_SNP",
  "Stop_Codon_Del",
  "De_novo_Start_InFrame",
  "De_novo_Start_OutOfFrame"
)
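A minimal sketch of how such a vector could be applied, using base R and a made-up MAF-like table. The helper `drop_classes()` and the toy rows are hypothetical, and whether the `noncoding` classes should also be dropped is exactly the open question here.

```r
# Toy MAF-like data frame; barcodes and rows are made up for illustration.
maf <- data.frame(
  Tumor_Sample_Barcode = c("BS_A", "BS_A", "BS_B", "BS_C"),
  Hugo_Symbol = c("BRAF", "MAP2K1", "NF1", "KRAS"),
  Variant_Classification = c("Missense_Mutation", "Intron", "Silent", "5'Flank"),
  stringsAsFactors = FALSE
)

synonymous <- c(
  "Silent", "Start_Codon_Ins", "Start_Codon_SNP", "Stop_Codon_Del",
  "De_novo_Start_InFrame", "De_novo_Start_OutOfFrame"
)

# Classes under discussion; dropping them is optional in this sketch.
noncoding <- c("Intron", "5'Flank")

# Hypothetical helper: keep rows whose classification is not in `classes`.
drop_classes <- function(df, classes) {
  df[!df$Variant_Classification %in% classes, , drop = FALSE]
}

maf_nonsyn <- drop_classes(maf, synonymous)
maf_strict <- drop_classes(maf, c(synonymous, noncoding))

print(maf_nonsyn$Variant_Classification)
print(maf_strict$Variant_Classification)
```

With only the synonymous classes removed, the Intron and 5'Flank rows still pass through, which matches what shows up in the current output.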

@jharenza
Collaborator

Hmm. I tried to look at how they defined the SNVs in the paper, but could not find anything about it in the methods. I just emailed the corresponding author, so hopefully we can hear back from her in a reasonable time frame.

To answer your question, I think getting rid of Intron makes sense, but 5' could be promoter and have an effect. Can you check the predicted effects?

@kgaonkar6
Collaborator Author

Oh yeah, I did check the 1000 pLGG paper but didn't find anything specific either. Here is an example where both mutations in MAP2K1 in this biospecimen are 5'Flank or Intron; the impact is MODIFIER and there is no predicted impact from SIFT or PolyPhen.

# A tibble: 2 x 6
  Tumor_Sample_Barcode IMPACT   SIFT  PolyPhen Hugo_Symbol Variant_Classification
  <chr>                <chr>    <chr> <chr>    <chr>       <chr>                 
1 BS_4QFSH7C4          MODIFIER .     .        MAP2K1      5'Flank               
2 BS_4QFSH7C4          MODIFIER .     .        MAP2K1      Intron  

@jharenza
Collaborator

jharenza commented Dec 1, 2020

Hmm. I tried to look at how they defined the SNVs in the paper, but could not find anything about it in the methods. I just emailed the corresponding author, so hopefully we can hear back from her in a reasonable time frame.

I got this response, but sent a followup asking for more clarification:

Hello Jo Lynne,
I was forwarded the email that you sent Cynthia Hawkins regarding the pLGG Cancer Cell paper.
In terms of SNV pipelines, there wasn't really one used for this project. Our approach was primarily a targeted tier-based analysis where we prioritized known drivers of pLGG via specific assays (eg, ddPCR/IHC for BRAF p.V600E or NanoString for KIAA1549-BRAF fusions) (details are in supplemental figure S3 of the paper). We had a few samples that remained uncharacterized after we'd tested for the most likely alterations that we ran RNAseq on. For those, we ran a combination of fusion callers (FusionMap, Defuse, TopHat, and Ericscript) to identify novel fusions. We also ran Mutect on the RNAseq samples to identify any SNVs which were cross-referenced to COSMIC for functional relevance.
If you'd like any further clarification don't hesitate to ask.
All the best,
--
Scott

- add HIST2H3C to JSON file
- update/shorten/space out some comments for clarity
- update paste0() to paste()
@jharenza jharenza self-requested a review December 2, 2020 22:26
Collaborator

@jharenza jharenza left a comment

@kgaonkar6 thanks for this!

Re:

I used the Variant_Classification terms for synonymous SNVs as per interaction-plots/scripts/02-process_mutations.R. Does this look OK? Should I also remove Variant_Classification %in% c("Intron", "5'Flank")? Some Intron/5' flank SNVs are getting through in the output.

and

Oh yeah, I did check the 1000 pLGG paper but didn't find anything specific either. Here is an example where both mutations in MAP2K1 in this biospecimen are 5'Flank or Intron; the impact is MODIFIER and there is no predicted impact from SIFT or PolyPhen.

Since they are modifier and have no predicted impact, we should remove, so let's add that and update the output files.

I reviewed and made some minor updates here.

  • We missed HIST2H3C, which is also in the HGAT subtyping, so I added it here. H3F3B, however, is missed in HGAT, so we may have to add that when you work on that module - please make a note!
  • Updated/shortened/spaced out some comments for readability
  • Updated paste0() to paste() to shorten
  • Added an arrange step at the end so we can easily see changes in the future (we have been adding this to subtyping modules as they come through).

Otherwise, it looks good - I wanted to document that we are not adding NF1 germline here, so that still needs to be added at some point.

Thanks!

@kgaonkar6
Collaborator Author

Thanks for the review!

Code update from the last time you reviewed satisfies the following comments:

Since they are modifier and have no predicted impact, we should remove, so let's add that and update the output files.

And I re-ran the module to create the new output files from your commits:

  • We missed HIST2H3C, which is also in the HGAT subtyping, so I added it here. H3F3B, however, is missed in HGAT, so we may have to add that when you work on that module - please make a note!
  • Updated/shortened/spaced out some comments for readability
  • Updated paste0() to paste() to shorten
  • Added an arrange step at the end so we can easily see changes in the future (we have been adding this to subtyping modules as they come through).

NF1 germline will come in the 04 script, which compiles the LGAT subtyping from SNV/CNV and fusion. Thanks!

Collaborator

@jharenza jharenza left a comment

One more small change, which I think I saw you catch as well (and my bad), and we are good to go. I think the mutation selection here makes sense and looks good.

I also want to make a note that there are some samples which have mutations in multiple groups, so we will have to take this into account with subtypes later.
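One way to find those samples, sketched in base R on a made-up status table. The "Yes"/"No" values and the barcodes are assumptions for illustration, not the actual contents of the output file.

```r
# Toy status table in the same wide layout as the module output.
status <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_2", "BS_3"),
  BRAF_V600E_mut = c("Yes", "No", "No"),
  MAPK_mut = c("Yes", "No", "Yes"),
  NF1_mut = c("No", "No", "No"),
  stringsAsFactors = FALSE
)

# Pick every status column by its "_mut" suffix.
mut_cols <- grep("_mut$", names(status), value = TRUE)

# Count, per sample, how many groups are flagged as mutated.
n_groups <- rowSums(status[mut_cols] == "Yes")

# Samples that fall into more than one group need a tie-breaking rule later.
multi_group <- status$Tumor_Sample_Barcode[n_groups > 1]
print(multi_group)
```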

Nice job!

Co-authored-by: Jo Lynne Rokita <jharenza@gmail.com>
@jaclyn-taroni jaclyn-taroni self-requested a review December 8, 2020 01:50
@kgaonkar6
Collaborator Author

@jharenza @jaclyn-taroni the analysis is now ready for re-review.

I re-ran the script with the HIST2H3C typo fix on 193, but there are no changes in the results, since the canonical mutation is K28M for both HIST2H3C and HIST1H3C.

Member

@jaclyn-taroni jaclyn-taroni left a comment

Looks good!

Comment on lines +131 to +196
```{r}
# Filter consensus mutation files for LGAT subset
consensusMutationSubset <- consensusMutation %>%
  # find lgat samples
  dplyr::filter(Tumor_Sample_Barcode %in% lgat_dna_df$Kids_First_Biospecimen_ID) %>%
  # select tumor sample barcode, gene, short protein annotation, domains,
  # variant classification, and predicted impact
  dplyr::select(Tumor_Sample_Barcode,
                Hugo_Symbol,
                HGVSp_Short,
                DOMAINS,
                Variant_Classification,
                IMPACT,
                SIFT,
                PolyPhen) %>%
  # note: `&` binds tighter than `|`, so each gene-specific condition
  # is evaluated before the ORs that join the groups
  dplyr::filter(
    # get BRAF mutation status
    # canonical mutations V600E
    HGVSp_Short %in% snvOI$BRAF_V600E$canonical[!is.na(snvOI$BRAF_V600E$canonical)] &
      Hugo_Symbol == "BRAF" | # OR
      # hotspot mutations in p.600 and p.599
      grepl(BRAF_hotspot, HGVSp_Short) &
      Hugo_Symbol == "BRAF" | # OR
      # kinase domain mutations for non-canonical mutations
      # Family: PK_Tyr_Ser-Thr https://pfam.xfam.org/family/PF07714
      grepl("PF07714", DOMAINS) &
      Hugo_Symbol == "BRAF" | # OR

      # get NF1 mutation status
      Hugo_Symbol %in% snvOI$NF1$gene &
      Variant_Classification %in% c("Missense_Mutation", "Nonsense_Mutation") | # OR

      # get other MAPK mutation status
      # all mutations in MAPK genes
      Hugo_Symbol %in% snvOI$MAPK$gene | # OR

      # get RTK mutation status
      # all mutations in RTK genes
      Hugo_Symbol %in% snvOI$RTK$gene | # OR

      # get FGFR mutation status
      # canonical mutations
      HGVSp_Short %in% snvOI$FGFR$canonical[!is.na(snvOI$FGFR$canonical)] &
      Hugo_Symbol == "FGFR1" | # OR
      # hotspot mutations
      grepl(FGFR_hotspot, HGVSp_Short) &
      Hugo_Symbol == "FGFR1" | # OR

      # get IDH mutation status
      # hotspot mutations
      grepl(IDH_hotspot, HGVSp_Short) &
      Hugo_Symbol %in% snvOI$IDH$gene | # OR

      # get histone mutation status
      # H3F3A canonical mutations
      HGVSp_Short %in% snvOI$H3F3A$canonical & Hugo_Symbol %in% "H3F3A" | # OR
      # H3F3B canonical mutations
      HGVSp_Short %in% snvOI$H3F3B$canonical & Hugo_Symbol %in% "H3F3B" | # OR
      # HIST1H3B canonical mutations
      HGVSp_Short %in% snvOI$HIST1H3B$canonical & Hugo_Symbol %in% "HIST1H3B" | # OR
      # HIST1H3C canonical mutations
      HGVSp_Short %in% snvOI$HIST1H3C$canonical & Hugo_Symbol %in% "HIST1H3C" | # OR
      # HIST2H3C canonical mutations
      HGVSp_Short %in% snvOI$HIST2H3C$canonical & Hugo_Symbol %in% "HIST2H3C"
  )

consensusMutationSubset
```
Member

It seems like this is implemented correctly. I would have created a data frame for each of these steps and then bound all the rows together, for potential ease of debugging, but that is a personal preference!
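For illustration, a minimal sketch of that alternative in base R. The toy rows, the group names, and the hard-coded gene vectors are made up; the module itself uses dplyr and reads its genes from `snvOI`, where `dplyr::bind_rows()` would play the role of `rbind()` here.

```r
# Toy MAF-like table; all rows are invented for illustration.
maf <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_2", "BS_3", "BS_4"),
  Hugo_Symbol = c("BRAF", "NF1", "KRAS", "TP53"),
  HGVSp_Short = c("p.V600E", "p.R1276*", "p.G12D", "p.R175H"),
  Variant_Classification = c("Missense_Mutation", "Nonsense_Mutation",
                             "Missense_Mutation", "Missense_Mutation"),
  stringsAsFactors = FALSE
)

# One data frame per rule, so each rule can be inspected on its own.
braf_df <- maf[maf$Hugo_Symbol == "BRAF" &
                 grepl("p.V600|p.V599", maf$HGVSp_Short), ]
nf1_df <- maf[maf$Hugo_Symbol == "NF1" &
                maf$Variant_Classification %in%
                  c("Missense_Mutation", "Nonsense_Mutation"), ]
mapk_df <- maf[maf$Hugo_Symbol %in% c("KRAS", "NRAS", "HRAS"), ]

# Label each subset with the rule that selected it, then bind the rows.
subset_list <- list(BRAF_V600E = braf_df, NF1 = nf1_df, MAPK = mapk_df)
labelled <- Map(function(df, name) {
  df$group <- rep(name, nrow(df))  # also safe for zero-row subsets
  df
}, subset_list, names(subset_list))
combined <- do.call(rbind, labelled)
rownames(combined) <- NULL

print(combined[, c("Tumor_Sample_Barcode", "Hugo_Symbol", "group")])
```

The extra `group` column records which rule selected each row, which a single combined filter cannot do. A mutation matching two rules would appear twice, so duplicates would need handling downstream.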

Collaborator Author

@kgaonkar6 kgaonkar6 Dec 22, 2020

Yes, that would definitely have been good for debugging. I guess I liked the way this could be read like a plan (thanks to tidyverse magic :D ).

I'm also tempted to use MultiAssayExperiment next time we need to do something similar with multiple genes. Thoughts?

@jaclyn-taroni jaclyn-taroni merged commit 202bb59 into AlexsLemonade:master Jan 10, 2021
@jaclyn-taroni jaclyn-taroni mentioned this pull request Jan 12, 2021
21 tasks
@kgaonkar6 kgaonkar6 deleted the lgat_add_subtyping_SNV branch January 22, 2021 21:18