PR 1 of n: Molecular Subtyping - HGG (Defining Lesions) (#352)

* Add `01-HGG-molecular-subtyping-data-prep.Rmd`
  - add this analysis to `.circleci`
* Fix command in `.circleci`
* Minor `lintr` format changes
* Log2 transform expression data
  - rerun notebook
* Use `controlfreec` cn data
  - rerun notebook
* Create a column better distinguishing specific HGG mutations
  - rerun notebook
* Change `01` nb to look only at HGG defining lesions
  - remove `results/HGG_molecular_subtypes.tsv`
  - new output file `results/HGG_defining_lesions.tsv` contains binary columns for all samples distinguishing whether or not they contain any of the four HGG defining lesions
  - rename `01` nb to better represent its purpose/content
  - rename object `tmb_df` to `snv_df`
* Edit analysis in `.circleci` to reflect nb name change
* Remove unused lines of code
* Update code to reflect V12 change
* Address @jharenza comments
* Add to modules at a glance table

Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
AlexsLemonade · Jan 4, 2020 · d26866f
1 parent 4a3bb7e
commit d26866f

Showing 5 changed files with 4,423 additions and 0 deletions.
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -99,6 +99,10 @@ jobs:
       - run:
           name: Process SV file
           command: ./scripts/run_in_ci.sh Rscript analyses/sv-analysis/01-process-sv-file.R
+
+      - run:
+          name: Molecular Subtyping - HGG
+          command: ./scripts/run_in_ci.sh Rscript -e "rmarkdown::render('analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd', clean = TRUE)"
 
       - run:
           name: Oncoprint plotting

diff --git a/analyses/README.md b/analyses/README.md
@@ -23,6 +23,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
 | [`independent-samples`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/independent-samples) | `pbta-histologies.tsv` | Generates independent specimen lists for WGS/WXS samples | `independent-specimens.wgs.primary.tsv`, `independent-specimens.wgs.primary-plus.tsv`, `independent-specimens.wgswxs.primary.tsv`, `independent-specimens.wgswxs.primary-plus.tsv` (included in data download)
 | [`interaction-plots`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/interaction-plots) | `independent-specimens.wgs.primary-plus.tsv`, `pbta-snv-consensus-mutation.maf.tsv.gz` | Creates interaction plots for mutation mutual exclusivity/co-occurrence [#13](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/13); may be updated to include other data types (e.g., fusions) | N/A
 | [`molecular-subtyping-ATRT`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-ATRT) | `analyses/gene-set-enrichment-analysis/results/gsva_scores_stranded.tsv`, `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds`, `analyses/focal-cn-file-preparation/results/controlfreec_annotated_cn_autosomes.tsv.gz`, `pbta-snv-consensus-mutation-tmb.tsv` | *In progress*; summarizing data into tabular format in order to molecularly subtype ATRT samples [#244](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/244) | N/A
+| [`molecular-subtyping-HGG`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-HGG) | `pbta-snv-consensus-mutation.maf.tsv.gz` | *In progress*; molecular subtyping of high-grade glioma samples [#249](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249) | N/A
 | [`molecular-subtyping-SHH-tp53`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-SHH-tp53) | `pbta-histologies` and `pbta-snv-consensus-mutation.maf.tsv.gz` | Identify the SHH-classified medulloblastoma samples that have TP53 mutations | N/A
 | [`mutational-signatures`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/mutational-signatures) | `pbta-snv-consensus-mutation.maf.tsv.gz` | Performs COSMIC and Alexandrov et al. mutational signature analysis using the consensus SNV data | N/A
 | [`mutect2-vs-strelka2`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/mutect2-vs-strelka2) | `pbta-snv-mutect2.vep.maf.gz`, `pbta-snv-strelka2.vep.maf.gz` | *Deprecated*; comparison of only two SNV callers, subsumed by `snv-callers` | N/A

diff --git a/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd b/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd
@@ -0,0 +1,141 @@
+---
+title: "High-Grade Glioma Molecular Subtyping - Defining Lesions"
+output: 
+  html_notebook:
+    toc: TRUE
+    toc_float: TRUE
+author: Chante Bethell for ALSF CCDL
+date: 2019
+---
+
+This notebook flags which samples carry the defining lesions used for
+molecular subtyping of high-grade glioma samples in the OpenPBTA dataset.
+
+# Usage
+
+This notebook is intended to be run via the command line from the top directory
+of the repository as follows:
+
+`Rscript -e "rmarkdown::render('analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd', clean = TRUE)"`
+
+# Set Up
+
+```{r}
+# Get `magrittr` pipe
+`%>%` <- dplyr::`%>%`
+```
+
+## Directories and Files
+
+```{r}
+# Detect the ".git" folder -- this will be in the project root directory.
+# Use this as the root directory to ensure proper sourcing of functions no
+# matter where this is called from
+root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
+
+# File path to results directory
+results_dir <-
+  file.path(root_dir, "analyses", "molecular-subtyping-HGG", "results")
+
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}
+
+# Read in metadata
+metadata <-
+  readr::read_tsv(file.path(root_dir, "data", "pbta-histologies.tsv"))
+
+# Select wanted columns in metadata for merging and assign to a new object
+select_metadata <- metadata %>%
+  dplyr::select(Kids_First_Participant_ID,
+                sample_id,
+                Kids_First_Biospecimen_ID,
+                disease_type_new)
+
+# Read in snv consensus mutation data
+snv_df <-
+  data.table::fread(file.path(root_dir,
+                              "data",
+                              "pbta-snv-consensus-mutation.maf.tsv.gz"))
+```
+
+# Prepare Data 
+
+## SNV consensus mutation data - defining lesions
+
+```{r}
+# Flag the target lesions in the snv consensus mutation data
+snv_lesions_df <- snv_df %>%
+  dplyr::select(Tumor_Sample_Barcode, Hugo_Symbol, HGVSp_Short) %>%
+  dplyr::mutate(
+    H3F3A.K28M = dplyr::case_when(Hugo_Symbol == "H3F3A" &
+                                    HGVSp_Short == "p.K28M" ~ "Yes",
+                                  TRUE ~ "No"),
+    HIST1H3B.K28M = dplyr::case_when(
+      Hugo_Symbol == "HIST1H3B" & HGVSp_Short == "p.K28M" ~ "Yes",
+      TRUE ~ "No"
+    ),
+    H3F3A.G35R = dplyr::case_when(Hugo_Symbol == "H3F3A" &
+                                    HGVSp_Short == "p.G35R" ~ "Yes",
+                                  TRUE ~ "No"),
+    H3F3A.G35V = dplyr::case_when(Hugo_Symbol == "H3F3A" &
+                                    HGVSp_Short == "p.G35V" ~ "Yes",
+                                  TRUE ~ "No")
+  ) %>%
+  dplyr::select(
+    -HGVSp_Short,
+    -Hugo_Symbol
+  )
+
+# Join the selected variables from the metadata with the snv consensus mutation
+# and defining lesions data.frame
+snv_lesions_df <- select_metadata %>%
+  dplyr::right_join(snv_lesions_df,
+                    by = c("Kids_First_Biospecimen_ID" = "Tumor_Sample_Barcode")) %>%
+  dplyr::select(
+    -disease_type_new,
+    dplyr::everything()
+  ) %>%
+  dplyr::distinct() %>%
+  dplyr::mutate(
+    disease_type_reclassified = dplyr::case_when(
+      H3F3A.K28M == "Yes" ~ "High-grade glioma, H3 K28 mutant",
+      HIST1H3B.K28M == "Yes" ~ "High-grade glioma, H3 K28 mutant",
+      H3F3A.G35R == "Yes" ~ "High-grade glioma, H3 G35 mutant",
+      H3F3A.G35V == "Yes" ~ "High-grade glioma, H3 G35 mutant",
+      TRUE ~ as.character(disease_type_new)
+    )
+  )
+
+# Display `snv_lesions_df`
+snv_lesions_df 
+```
+
+## Save final table of results
+
+```{r}
+# Save final data.frame to file
+readr::write_tsv(snv_lesions_df,
+                 file.path(results_dir, "HGG_defining_lesions.tsv"))
+```
+
+## Inconsistencies in disease classification
+
+```{r}
+# Isolate the samples with the specified mutations that were not classified
+# as HGG or DIPG
+snv_lesions_df %>%
+  dplyr::filter(
+    grepl("High-grade glioma", disease_type_reclassified) &
+      !(disease_type_new %in% c("High-grade glioma", 
+                                "Brainstem glioma- Diffuse intrinsic pontine glioma"))
+  )
+```
+
+# Session Info
+
+```{r}
+# Print the session information
+sessionInfo()
+```
+
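Note on the notebook above: the `case_when()` calls run once per mutation row of the MAF, so after `dplyr::distinct()` a sample that carries a defining lesion and other mutations may still appear on more than one row (one with "Yes", one with all "No"). If a single row per sample is wanted, the flags could be collapsed per `Tumor_Sample_Barcode`. The following is a minimal sketch of that idea, not the notebook's method; the toy data, the sample IDs (`BS_A`, `BS_B`, `BS_C`), and the helper `has_lesion()` are hypothetical and exist only for illustration.

```r
library(dplyr)

# Hypothetical toy MAF-like table; IDs and mutations are invented
toy_maf <- tibble::tribble(
  ~Tumor_Sample_Barcode, ~Hugo_Symbol, ~HGVSp_Short,
  "BS_A",                "H3F3A",      "p.K28M",
  "BS_A",                "TP53",       "p.R175H",
  "BS_B",                "HIST1H3B",   "p.K28M",
  "BS_C",                "BRAF",       "p.V600E"
)

# Hypothetical helper: "Yes" if any row in the group matches the
# gene / protein change pair, otherwise "No"
has_lesion <- function(gene, change, genes, changes) {
  if (any(genes == gene & changes == change, na.rm = TRUE)) "Yes" else "No"
}

# Collapse to one row per sample, with one binary column per lesion
lesions_per_sample <- toy_maf %>%
  group_by(Tumor_Sample_Barcode) %>%
  summarize(
    H3F3A.K28M = has_lesion("H3F3A", "p.K28M", Hugo_Symbol, HGVSp_Short),
    HIST1H3B.K28M = has_lesion("HIST1H3B", "p.K28M", Hugo_Symbol, HGVSp_Short),
    H3F3A.G35R = has_lesion("H3F3A", "p.G35R", Hugo_Symbol, HGVSp_Short),
    H3F3A.G35V = has_lesion("H3F3A", "p.G35V", Hugo_Symbol, HGVSp_Short)
  )

lesions_per_sample
```

Summarizing before the metadata join keeps the result at one row per biospecimen, so a later `case_when()` on the lesion columns would assign a single reclassified disease type to each sample.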
diff --git a/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.nb.html b/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.nb.html