This repository has been archived by the owner on Jun 21, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 67
#790 Part1: adding new SNV subtypes for LGAT #842
Merged
jaclyn-taroni
merged 28 commits into
AlexsLemonade:master
from
kgaonkar6:lgat_add_subtyping_SNV
Jan 10, 2021
Merged
Changes from all commits
Commits
Show all changes
28 commits
Select commit
Hold shift + click to select a range
d37abea
adding new SNV subtypes for LGAT
kgaonkar6 dc6e743
adding additional subtypes
kgaonkar6 eb65034
re-doing subtyping so removing the last step for now
kgaonkar6 271958a
updating run_subtyping.sh
kgaonkar6 2281e95
removing subtyping list
kgaonkar6 cd1e4d2
update README
kgaonkar6 2f30d89
Merge branch 'master' into lgat_add_subtyping_SNV
kgaonkar6 32e05f7
changed to nb
kgaonkar6 93a2e44
Merge branch 'lgat_add_subtyping_SNV' of https://github.com/kgaonkar6…
kgaonkar6 05bf366
remove Rscript
kgaonkar6 408166d
Update README.md
kgaonkar6 4b972e4
Update analyses/molecular-subtyping-LGAT/01-subset-files-for-LGAT.Rmd
kgaonkar6 906bfb8
Update analyses/molecular-subtyping-LGAT/01-subset-files-for-LGAT.Rmd
kgaonkar6 5fca7ad
Update analyses/molecular-subtyping-LGAT/01-subset-files-for-LGAT.Rmd
kgaonkar6 38bab2f
adding comments remove silent snv
kgaonkar6 e12a603
adding sample_id
kgaonkar6 dee0179
adding H3F3B
kgaonkar6 27f270c
updates to files
kgaonkar6 dfad4ab
Merge branch 'master' into lgat_add_subtyping_SNV
kgaonkar6 694e8aa
add HIST2H3C; update comments
2c19d91
edits to Rmd in bash, H2.3
kgaonkar6 1b660ff
Update analyses/molecular-subtyping-LGAT/01-subset-files-for-LGAT.Rmd
jharenza 5838dae
Merge branch 'lgat_add_subtyping_SNV' of https://github.com/kgaonkar6…
kgaonkar6 58373db
remove modifider/low
kgaonkar6 da46d4e
Update analyses/molecular-subtyping-LGAT/01-subset-files-for-LGAT.Rmd
kgaonkar6 a4b57a0
re-run with HIST2H3C
kgaonkar6 86fd675
Merge remote-tracking branch 'upstream/master' into lgat_add_subtypin…
jaclyn-taroni 4898454
Merge branch 'master' into lgat_add_subtyping_SNV
jaclyn-taroni File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
81 changes: 0 additions & 81 deletions
81
analyses/molecular-subtyping-LGAT/01-subset-files-for-LGAT.R
This file was deleted.
Oops, something went wrong.
302 changes: 302 additions & 0 deletions
302
analyses/molecular-subtyping-LGAT/01-subset-files-for-LGAT.Rmd
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,302 @@ | ||
--- | ||
title: "Annotate SNV subtype status for LGAT biospecimens" | ||
output: html_notebook | ||
author: K S Gaonkar | ||
date: 2020 | ||
--- | ||
|
||
In this PR we will use identify LGAT biospecimens from pathology diagnosis and annotate subtype specific SNV status per biospecimen. | ||
|
||
As per [issue](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/790) we will be subtyping LGAT based on SNV in the following genes: | ||
|
||
- LGG, NF1 | ||
somatic loss of NF1 via either missense, nonsense mutation | ||
|
||
- LGG, BRAF V600E | ||
contains BRAF V600E or V599 SNV or non-canonical BRAF alterations such as p.V600ins or p.D594N | ||
|
||
- LGG, other MAPK | ||
contains KRAS, NRAS, HRAS, MAP2K1, MAP2K2, MAP2K1, ARAF SNV or indel | ||
|
||
- LGG, RTK | ||
harbors a MET SNV | ||
harbors a KIT SNV or | ||
harbors a PDGFRA SNV | ||
|
||
- LGG, FGFR | ||
harbors FGFR1 p.N546K, p.K656E, p.N577, or p. K687 hotspot mutations or | ||
|
||
- LGG, IDH | ||
harbors an IDH R132 mutation | ||
|
||
- LGG, H3.3 | ||
harbors an H3F3A or H3F3B K28M or G35R/V mutation | ||
|
||
- LGG, H3.1 | ||
harbors an HIST1H3B K28M | ||
harbors and HIST1H3C K28M | ||
|
||
|
||
### Setup | ||
```{r} | ||
library(tidyverse) | ||
# Look for git root folder | ||
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git")) | ||
# get subset folder | ||
subset_dir <- file.path(root_dir, | ||
"analyses", | ||
"molecular-subtyping-LGAT", | ||
"lgat-subset") | ||
# create if doesn't exist | ||
if (!dir.exists(subset_dir)) { | ||
dir.create(subset_dir) | ||
} | ||
``` | ||
|
||
### Input | ||
|
||
```{r} | ||
# File from 00-LGAT-select-pathology-dx that is used for the pathology diagnosis | ||
# inclusion/exclusion criteria | ||
path_dx_list <- jsonlite::fromJSON( | ||
file.path(subset_dir, | ||
"lgat_subtyping_path_dx_strings.json") | ||
) | ||
# clinical file | ||
clinical <- read_tsv(file.path(root_dir, | ||
"data", | ||
"pbta-histologies.tsv"), | ||
guess_max = 10000) | ||
# consensus mutation data | ||
consensusMutation <- read_tsv(file.path(root_dir, | ||
"data", | ||
"pbta-snv-consensus-mutation.maf.tsv.gz")) | ||
``` | ||
|
||
Filter to tumor samples that should be included on the basis of pathology diagnosis | ||
```{r} | ||
lgat_specimens_df <- clinical %>% | ||
dplyr::filter(str_detect(str_to_lower(pathology_diagnosis), # Inclusion criteria | ||
paste(path_dx_list$include_path_dx, collapse = "|")), | ||
# Exclusion criteria | ||
str_detect(str_to_lower(pathology_diagnosis), | ||
paste(path_dx_list$exclude_path_dx, collapse = "|"), | ||
negate = TRUE), | ||
# Tumors | ||
sample_type == "Tumor", | ||
composition == "Solid Tissue") | ||
# Write this intermediate file to the subset directory as it allows for | ||
# inspection | ||
write_tsv(lgat_specimens_df, file.path(subset_dir, "lgat_metadata.tsv")) | ||
# Filter to dna samples | ||
lgat_dna_df <- lgat_specimens_df %>% | ||
dplyr::filter(experimental_strategy != "RNA-Seq") %>% | ||
# will keep Kids_First_Biospecimen_ID | ||
# sample_id is kept to be able to match with RNA-Seq in the later step | ||
dplyr::select(Kids_First_Biospecimen_ID,sample_id) | ||
``` | ||
|
||
|
||
Gather gene(s) that define LGAT subtypes | ||
Additional information for genes with known hotspots/canonical mutation is also provided in the list. | ||
If the gene has multiple hotspots we will gather the protein hotspots site to use it in the below chunks to grep the HGVSp_Short column. | ||
|
||
```{r} | ||
# combined list for SNV of interest per subtype | ||
# | ||
snvOI <- jsonlite::fromJSON(file.path(root_dir, | ||
"analyses", | ||
"molecular-subtyping-LGAT", | ||
"input", | ||
"snvOI_list.json")) | ||
# Collapse multiple hotspots in genes with "|" so easy grep calls | ||
BRAF_hotspot <-paste(snvOI$BRAF_V600E$hotspot[!is.na( snvOI$BRAF_V600E$hotspot)],collapse = "|") | ||
FGFR_hotspot <-paste(snvOI$FGFR$hotspot[!is.na( snvOI$FGFR$hotspot)],collapse = "|") | ||
IDH_hotspot<- paste(snvOI$IDH$hotspot[!is.na( snvOI$IDH$hotspot)],collapse = "|") | ||
``` | ||
|
||
|
||
### Subset consensus maf and annotate per subtype based on SNV | ||
We will gather SNV calls that satisfy the conditions per subtype and save the subset along with the HGVSp_Short, DOMAINS and Variant_Classification | ||
|
||
```{r} | ||
# Filter consensus mutation files for LGAT subset | ||
consensusMutationSubset <- consensusMutation %>% | ||
# find lgat samples | ||
dplyr::filter(Tumor_Sample_Barcode %in% lgat_dna_df$Kids_First_Biospecimen_ID) %>% | ||
# select tumor sample barcode, gene, short protein annotation, domains, and variant classification | ||
dplyr::select(Tumor_Sample_Barcode, | ||
Hugo_Symbol, | ||
HGVSp_Short, | ||
DOMAINS, | ||
Variant_Classification, | ||
IMPACT, | ||
SIFT, | ||
PolyPhen) %>% | ||
dplyr::filter( | ||
# get BRAF mutation status | ||
# canonical mutations V600E | ||
HGVSp_Short %in% snvOI$BRAF_V600E$canonical[!is.na(snvOI$BRAF_V600E$canonical)] & | ||
Hugo_Symbol=="BRAF" | # OR | ||
# hotspot mutations in p.600 and p.599 | ||
grepl(BRAF_hotspot,HGVSp_Short) & | ||
Hugo_Symbol=="BRAF" | # OR | ||
# and kinase domain mutation for non-canonical mutation | ||
# Family: PK_Tyr_Ser-Thr https://pfam.xfam.org/family/PF07714 | ||
grepl("PF07714",DOMAINS) & | ||
Hugo_Symbol=="BRAF" | # OR | ||
# get NF1 mutation status | ||
Hugo_Symbol %in% snvOI$NF1$gene & | ||
Variant_Classification %in% c("Missense_Mutation","Nonsense_Mutation") | | ||
# get other MAPK mutation status | ||
# all mutations in MAPK genes | ||
Hugo_Symbol %in% snvOI$MAPK$gene | # OR | ||
# get RTK mutation status | ||
# all mutations in RTK genes | ||
Hugo_Symbol %in% snvOI$RTK$gene | # OR | ||
# get FGFR mutation status | ||
# canonical mutations | ||
HGVSp_Short %in% snvOI$FGFR$canonical[!is.na(snvOI$FGFR$canonical)] & | ||
Hugo_Symbol=="FGFR1" | # OR | ||
# hotspot mutations | ||
grepl(FGFR_hotspot,HGVSp_Short) & | ||
Hugo_Symbol=="FGFR1" | # OR | ||
# get IDH mutation status | ||
# hostspot mutations | ||
grepl(IDH_hotspot,HGVSp_Short) & | ||
Hugo_Symbol %in% snvOI$IDH$gene | # OR | ||
# get histone mutation status | ||
# H3F3A canonical mutations | ||
HGVSp_Short %in% snvOI$H3F3A$canonical & Hugo_Symbol %in% "H3F3A" | # OR | ||
# H3F3B canonical mutations | ||
HGVSp_Short %in% snvOI$H3F3B$canonical & Hugo_Symbol %in% "H3F3B" | # OR | ||
# HIST1H3B canonical mutations | ||
HGVSp_Short %in% snvOI$HIST1H3B$canonical & Hugo_Symbol %in% "HIST1H3B" | # OR | ||
# HIST1H3C canonical mutations | ||
HGVSp_Short %in% snvOI$HIST1H3C$canonical & Hugo_Symbol %in% "HIST1H3C" | # OR | ||
# HIST2H3C canonical mutations | ||
HGVSp_Short %in% snvOI$HIST2H3C$canonical & Hugo_Symbol %in% "HIST2H3C" | ||
) | ||
consensusMutationSubset | ||
``` | ||
|
||
What is the distribution of Variant_Classification of the SNVs captured per gene? | ||
```{r} | ||
consensusMutationSubset %>% | ||
group_by(Hugo_Symbol,Variant_Classification) %>% | ||
tally() | ||
``` | ||
We have some synonymous variants (classified as Intron, 5`Flank and Silent) captured per gene of interest. | ||
|
||
|
||
We will now remove synonymous variant calls from the list above. The Variant_Classification terms for synonymous SNV are selected as per [interaction-plots/scripts/02-process_mutations.R](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/interaction-plots/scripts/02-process_mutations.R). | ||
```{r} | ||
# Variant Classification with Low/Modifier variant consequences | ||
# from maftools http://asia.ensembl.org/Help/Glossary?id=535 | ||
synonymous <- c( | ||
"Silent", | ||
"Start_Codon_Ins", | ||
"Start_Codon_SNP", | ||
"Stop_Codon_Del", | ||
"De_novo_Start_InFrame", | ||
"De_novo_Start_OutOfFrame" | ||
) | ||
consensusMutationSubset <- consensusMutationSubset %>% | ||
dplyr::filter(!Variant_Classification %in% synonymous) | ||
``` | ||
|
||
### Review predicted impact of SNVs | ||
|
||
```{r} | ||
consensusMutationSubset %>% | ||
dplyr::select("Tumor_Sample_Barcode", "IMPACT","SIFT","PolyPhen","Hugo_Symbol","Variant_Classification") | ||
``` | ||
We have some Modifier mutations in the filtered snv subset. | ||
|
||
We are removing the LOW/MODIFIER IMPACT annotated by [VEP](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/#impact-categories) | ||
|
||
- LOW (L): Assumed to be mostly harmless or unlikely to change protein behavior | ||
- MODIFIER (MO): Usually non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact | ||
|
||
```{r} | ||
consensusMutationSubset <- consensusMutationSubset %>% | ||
dplyr::filter(!IMPACT %in% c("MODIFIER","LOW")) | ||
``` | ||
|
||
|
||
Let's check whether all LGAT DNA biospecimens have a subtype defining SNV. | ||
|
||
```{r} | ||
all(lgat_dna_df$Kids_First_Biospecimen_ID %in% consensusMutationSubset$Tumor_Sample_Barcode) | ||
``` | ||
|
||
### Gather status of all SNVs per biospecimen | ||
To gather the biospecimens, we will use all LGAT DNA biospecimen IDs and assign "Yes" or "No" to each subtype column depending on the presence or absence of SNV that define the subtype. | ||
|
||
columnname | description | ||
--- | --- | ||
NF1_mut | somatic loss of NF1 via either missense, nonsense mutation | ||
BRAF_V600E_mut | contains BRAF V600E or V599 SNV or non-canonical BRAF alterations such as p.V600ins or p.D594N | ||
MAPK_mut | contains mutation in KRAS, NRAS, HRAS, MAP2K1, MAP2K2, MAP2K1, ARAF SNV or indel | ||
RTK_mut | harbors a MET, KIT, or PDGFRA SNV | ||
FGFR_mut | harbors FGFR1 p.N546K, p.K656E, p.N577, or p. K687 hotspot mutations | ||
IDH_mut | harbors an IDH R132 mutation | ||
H3.1_mut | harbors an HIST1H3B K28M or HIST1H3C K28M | ||
H3.2_mut | harbors an HIST2H3C K28M | ||
H3.3_mut | harbors an H3F3A K28M or G35R/V mutation | ||
|
||
```{r} | ||
consensusMutationSubset <- lgat_dna_df %>% | ||
left_join(consensusMutationSubset,by=c("Kids_First_Biospecimen_ID"="Tumor_Sample_Barcode")) %>% | ||
reshape2::dcast(Kids_First_Biospecimen_ID + sample_id~ Hugo_Symbol) %>% | ||
dplyr::mutate(BRAF_V600E_mut = if_else(BRAF>=1,"Yes","No"), | ||
FGFR_mut = if_else(rowSums( | ||
dplyr::select(.,snvOI$FGFR$gene)) >=1,"Yes","No"), | ||
# No IDH HIST1H3C, HIST2H3C, or H3F3A mutations found | ||
IDH_mut = if_else(any(colnames(.) %in% snvOI$IDH$gene) & | ||
rowSums(dplyr::select(.,one_of(snvOI$IDH$gene))) >=1, | ||
"Yes","No"), | ||
H3.1_mut = if_else(rowSums( | ||
dplyr::select(.,one_of(c("HIST1H3B","HIST1H3C")))) >=1 ,"Yes","No"), | ||
H3.2_mut = if_else(any(colnames(.) %in% snvOI$HIST2H3C$gene) & | ||
rowSums(dplyr::select(.,one_of(snvOI$HIST2H3C$gene))) >=1, | ||
"Yes","No"), | ||
H3.3_mut = if_else(any(colnames(.) %in% c("H3F3A","H3F3B")) & | ||
rowSums(dplyr::select(.,one_of(c("H3F3A","H3F3B")))) >= 1 ,"Yes","No"), | ||
MAPK_mut = if_else( | ||
rowSums(dplyr::select(.,one_of(snvOI$MAPK$gene)))>=1,"Yes","No"), | ||
RTK_mut = if_else( | ||
rowSums(dplyr::select(.,one_of(snvOI$RTK$gene)))>=1,"Yes","No"), | ||
NF1_mut = if_else( | ||
rowSums(dplyr::select(.,one_of(snvOI$NF1$gene)))>=1,"Yes","No") ) %>% | ||
dplyr::select("Kids_First_Biospecimen_ID","sample_id",ends_with("mut")) %>% | ||
dplyr::arrange(Kids_First_Biospecimen_ID, sample_id) | ||
consensusMutationSubset | ||
``` | ||
|
||
```{r} | ||
# remove consensusMutation | ||
rm(consensusMutation) | ||
# save to subset folder | ||
write_tsv(consensusMutationSubset,file.path(subset_dir, "LGAT_snv_subset.tsv")) | ||
``` |
525 changes: 373 additions & 152 deletions
525
...ing-LGAT/02-make-lgat-final-table.nb.html → ...ing-LGAT/01-subset-files-for-LGAT.nb.html
Large diffs are not rendered by default.
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this is implemented correctly. My personal preference would have been to create a data frame for each of these steps that you then bind all the rows together for potential ease of debugging but that is a personal preference!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that would have definitely been good for debugging I guess I liked the way this could be read like a plan (thanks to tidyverse magic :D ).
I'm also tempted to use MultiAssayExperiment next time we need to do something similar with multiple genes. Thoughts?