Update fusion-summary to include union of biospecimen IDs in fusion callers (AlexsLemonade#478)

* Add note about putative oncogenic fusions single caller
* Add in biospecimens in either caller
  Change to notebook for documentation purposes
* Run notebook instead
  Also +x
* Remove Rscript
* Minor typo, formatting fixes
* Add in the original caller files to modules at a glance
* Add in 'missing fusions'
  To better match former behavior
* Will embryonal step pass in CI?
* Revert "Will embryonal step pass in CI?"
  This reverts commit b81379d.
* Skip ependymoma steps in CI
* Forgot the variable in CI
* Forgot to replace NA with 0
* Apply @jashapiro right_join suggestion
* Apply suggestions from code review
  Co-Authored-By: jashapiro <jashapiro@gmail.com>

Co-authored-by: jashapiro <jashapiro@gmail.com>
jaclyn-taroni · Jan 27, 2020 · 0e642ef
1 parent e6a165e
Showing 10 changed files with 4,280 additions and 1,640 deletions.
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -135,7 +135,7 @@ jobs:
 
       - run:
           name: Fusion Summary
-          command: ./scripts/run_in_ci.sh bash "analyses/fusion-summary/run-new-analysis.sh"
+          command: OPENPBTA_TESTING=1 ./scripts/run_in_ci.sh bash "analyses/fusion-summary/run-new-analysis.sh"
 
       - run:
           name: Molecular subtyping - Non-MB/Non-ATRT Embryonal tumors 
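
The CI step above sets `OPENPBTA_TESTING=1`, and the notebook added in this commit declares an `is_ci` parameter in its YAML header. The contents of `run-new-analysis.sh` are not shown here, so the following is only a sketch of one way the environment variable could be mapped onto that parameter. The helper name `ci_flag_from_env` is hypothetical, and the `rmarkdown::render()` call is an assumed invocation rather than the script's actual command.

```r
# Hypothetical helper (not part of the module): translate the
# OPENPBTA_TESTING environment variable into the notebook's `is_ci` value.
ci_flag_from_env <- function(var = "OPENPBTA_TESTING") {
  # Sys.getenv() returns a string; an unset variable is treated as "0" here.
  value <- Sys.getenv(var, unset = "0")
  if (identical(value, "1")) 1 else 0
}

# Assumed call, shown as a comment because it needs the notebook on disk:
# rmarkdown::render(
#   "01-fusion-summary.Rmd",
#   params = list(is_ci = ci_flag_from_env())
# )
```

Comparing against the string `"1"` avoids relying on numeric coercion of whatever the environment happens to contain.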

diff --git a/analyses/README.md b/analyses/README.md
@@ -20,7 +20,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
 | [`create-subset-files`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/create-subset-files) | All files | This module contains the code to create the subset files used in continuous integration | All subset files for continuous integration
 | [`focal-cn-file-preparation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` | Maps from copy number variant caller segments to gene identifiers; will eventually be updated to use consensus copy number calls ([#186](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/186))| `results/cnvkit_annotated_cn_autosomes.tsv.bz2` <br> `results/cnvkit_annotated_cn_x_and_y.tsv.bz2` <br> `results/controlfreec_annotated_cn_autosomes.tsv.bz2` <br> `results/controlfreec_annotated_cn_x_and_y.tsv.bz2`
 | [`fusion_filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | `pbta-fusion-arriba.tsv.gz` <br> `pbta-fusion-starfusion.tsv.gz` | Standardizes, filters, and prioritizes fusion calls | `results/pbta-fusion-putative-oncogenic.tsv` <br> `results/pbta-fusion-recurrent-fusion-byhistology.tsv` <br> `results/pbta-fusion-recurrent-fusion-bysample.tsv` (included in data download)
-| [`fusion-summary`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion-summary)| `pbta-histologies.tsv` <br> `pbta-fusion-putative-oncogenic.tsv` | Generate summary tables from fusion files ([#398](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/398)) | `results/fusion_summary_embryonal_foi.tsv` <br> `results/fusion_summary_ependymoma_foi.tsv`
+| [`fusion-summary`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion-summary)| `pbta-histologies.tsv` <br> `pbta-fusion-putative-oncogenic.tsv` <br> `pbta-fusion-arriba.tsv.gz` <br> `pbta-fusion-starfusion.tsv.gz` | Generate summary tables from fusion files ([#398](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/398)) | `results/fusion_summary_embryonal_foi.tsv` <br> `results/fusion_summary_ependymoma_foi.tsv`
 | [`gene-set-enrichment-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/gene-set-enrichment-analysis) | `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds`  | *In progress*. Updated gene set enrichment analysis with appropriate RNA-seq expression data | `results/gsva_scores_stranded.tsv` <br> `results/gsva_scores_polya.tsv` <br> for stranded, polya expression data respectively  
 | [`immune-deconv`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/immune-deconv) | `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` | Immune/Stroma characterization across PBTA (part of [#15](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/15)) | `results/deconv-output.RData`
 | [`independent-samples`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/independent-samples) | `pbta-histologies.tsv` | Generates independent specimen lists for WGS/WXS samples | `results/independent-specimens.wgs.primary.tsv` <br> `results/independent-specimens.wgs.primary-plus.tsv` <br> `results/independent-specimens.wgswxs.primary.tsv` <br> `results/independent-specimens.wgswxs.primary-plus.tsv` (included in data download)

diff --git a/analyses/fusion-summary/01-fusion-summary.R b/analyses/fusion-summary/01-fusion-summary.R
diff --git a/analyses/fusion-summary/01-fusion-summary.Rmd b/analyses/fusion-summary/01-fusion-summary.Rmd
@@ -0,0 +1,186 @@
+---
+title: "Generate Fusion Summary Files"
+output: html_notebook
+author: Daniel Miller (D3b) and Jaclyn Taroni (CCDL)
+date: January 2020
+params:
+  is_ci: 0
+---
+
+Generate fusion summary files specifically for consumption by the molecular subtyping analyses.
+
+## Set up
+
+```{r}
+# If running in CI, we need to skip the ependymoma (EPN) steps.
+# Any is_ci value other than 1 is treated as not running in CI.
+running_in_ci <- isTRUE(params$is_ci == 1)
+```
+
+### Libraries and functions
+
+```{r}
+library(tidyverse)
+```
+
+```{r}
+#' Generate filtered fusion frame
+#' @param df Unfiltered fusion data frame
+#' @param bioid List of biospecimen IDs
+#' @param fuses List of explicit fusion names
+#' @param genes List of gene names
+#' @return the filtered fusion data frame
+filterFusion <- function(df, bioid, fuses, genes) {
+  if (!missing(bioid)) {
+    df <- filter(df, Sample %in% bioid)
+  }
+  if (!missing(fuses) && !missing(genes)) {
+    df <- filter(df, FusionName %in% fuses |
+                   Gene1A %in% genes |
+                   Gene2A %in% genes |
+                   Gene1B %in% genes |
+                   Gene2B %in% genes)
+  } else if (!missing(fuses)) {
+    df <- filter(df, FusionName %in% fuses)
+  } else if (!missing(genes)) {
+    df <- filter(df,
+                 Gene1A %in% genes |
+                   Gene2A %in% genes |
+                   Gene1B %in% genes |
+                   Gene2B %in% genes)
+  }
+  return(df %>% select(Sample, FusionName))
+}
+
+
+#' Generate matrix with fusion counts
+#' @param fuseDF Filtered fusion data frame
+#' @param bioid Vector of biospecimen IDs that should be included in final table;
+#'   IDs without any fusion of interest get a row of zeros
+#' @return Data frame that contains fusion counts
+prepareOutput <- function(fuseDF, bioid) {
+  fuseDF %>% 
+    reshape2::dcast(Sample ~ FusionName) %>%
+    right_join(data.frame(Sample = bioid, stringsAsFactors = FALSE), by = "Sample") %>%
+    replace(is.na(.), 0) %>%
+    rename(Kids_First_Biospecimen_ID = Sample)
+}
+```
+
+### Read in data
+
+```{r}
+dataDir <- file.path("..", "..", "data")
+#' The putative oncogenic fusion file is what we'll use to check for the 
+#' presence or absence of the fusions.
+putativeOncogenicDF <- 
+  read_tsv(file.path(dataDir, "pbta-fusion-putative-oncogenic.tsv"))
+#' However, some biospecimens are not represented in this filtered, prioritized
+#' file but *are* present in the original files -- this will cause them to be
+#' "missing" in the final files for consumption which could mislead analysts.
+arribaDF <- read_tsv(file.path(dataDir, "pbta-fusion-arriba.tsv.gz"))
+starfusionDF <- read_tsv(file.path(dataDir, "pbta-fusion-starfusion.tsv.gz"))
+```
+
+### Output
+
+```{r}
+resultsDir <- "results"
+if (!dir.exists(resultsDir)) {
+  dir.create(resultsDir)
+}
+ependFile <- file.path(resultsDir, "fusion_summary_ependymoma_foi.tsv")
+embryFile <- file.path(resultsDir, "fusion_summary_embryonal_foi.tsv")
+```
+
+## Fusions and genes of interest
+
+Taken from [`AlexsLemonade/OpenPBTA-analysis#245`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/245) and [`AlexsLemonade/OpenPBTA-analysis#251`](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/251), respectively.
+
+```{r}
+#' **Filters**
+#'
+#' *Fusions Filters*
+#' 1: Exact match a list of fusions common in ependymoma tumors, as well as fusions containing RELA with any other gene
+ependFuses <- c(
+  "C11orf95--MAML2",
+  "C11orf95--RELA",
+  "C11orf95--YAP1",
+  "LTBP3--RELA",
+  "PTEN--TAS2R1",
+  "YAP1--FAM118B",
+  "YAP1--MAMLD1",
+  "YAP1--MAMLD2"
+)
+ependGenes <- c(
+  "RELA"
+)
+#' 2: Exact match a list of fusions common in Embryonal tumors
+#' as well as fusions containing a particular gene with any other gene
+embryFuses <- c(
+  "CIC--NUTM1",
+  "MN1--BEND2",
+  "MN1--CXXC5"
+)
+embryGenes <- c(
+  "FOXR2",
+  "MN1",
+  "TTYH1"
+)
+```
+
+### Filter putative oncogenic fusions list
+
+```{r}
+allFuseEpend <- filterFusion(df = putativeOncogenicDF,
+                             fuses = ependFuses,
+                             genes = ependGenes)
+allFuseEmbry <- filterFusion(df = putativeOncogenicDF,
+                             fuses = embryFuses,
+                             genes = embryGenes)
+```
+
+Get the biospecimen IDs that are present in *either* caller file (Arriba, STAR-Fusion).
+Fusions in the putative oncogenic fusion file can be retained even if they are not called by both callers: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/8fba1753608d8ac0aa3d5d7d63c480b8f00ff0e9/analyses/fusion_filtering/04-project-specific-filtering.Rmd#L242
+Because we filter the putative oncogenic file here, any sample that is in either caller file but has no fusion relevant to the subtyping tickets is not _missing_; it simply has no evidence of the relevant fusions.
+
+```{r}
+specimensUnion <- union(arribaDF$tumor_id, starfusionDF$tumor_id)
+```
+
+#### Write non-MB, non-ATRT embryonal fusions to file
+
+```{r}
+allFuseEmbry <- allFuseEmbry %>%
+  prepareOutput(specimensUnion)
+```
+
+```{r}
+# Are there any missing fusions?
+setdiff(embryFuses, colnames(allFuseEmbry))
+```
+
+```{r}
+allFuseEmbry %>%
+  mutate(
+    `CIC--NUTM1` = 0,
+    `MN1--BEND2` = 0
+  ) %>%
+  write_tsv(embryFile)
+```
+
+#### Write ependymoma fusions to file
+
+```{r}
+if (!running_in_ci) {
+  allFuseEpend %>%
+    prepareOutput(specimensUnion) %>%
+    mutate(
+      `C11orf95--YAP1` = 0,
+      `LTBP3--RELA` = 0,
+      `PTEN--TAS2R1` = 0,
+      `YAP1--MAMLD2` = 0
+    ) %>%
+    write_tsv(ependFile)
+}
+```
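
As a reading aid for the notebook above, here is a minimal sketch of the union-then-zero-fill step on toy data. The biospecimen IDs (`BS_A`, `BS_B`, `BS_C`) and the object names are made up for illustration. Counting is done explicitly with `dplyr::count()` and `tidyr::pivot_wider()` instead of `reshape2::dcast()` as in the notebook, so this illustrates the intended output shape under those assumptions and is not the module's implementation.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for a filtered fusion data frame (hypothetical IDs).
toy_fusions <- tibble(
  Sample = c("BS_A", "BS_A", "BS_B"),
  FusionName = c("MN1--CXXC5", "MN1--CXXC5", "C11orf95--RELA")
)

# Toy stand-in for the union of biospecimen IDs across both caller files.
toy_union <- c("BS_A", "BS_B", "BS_C")

toy_counts <- toy_fusions %>%
  # One row per Sample/FusionName pair with an explicit count column `n`.
  count(Sample, FusionName) %>%
  # One column per fusion; absent Sample/fusion pairs are filled with 0.
  pivot_wider(names_from = FusionName, values_from = n, values_fill = 0) %>%
  # Keep every biospecimen in the union, even those with no fusion of interest.
  right_join(tibble(Sample = toy_union), by = "Sample") %>%
  # Rows introduced by the join carry NA counts; turn them into zeros.
  mutate(across(-Sample, ~ replace_na(.x, 0))) %>%
  rename(Kids_First_Biospecimen_ID = Sample)

print(toy_counts)
```

In this toy example `BS_C` occurs only in the union, so it ends up as a row of zeros instead of being dropped, which is the behavior the notebook text describes as "no evidence of the relevant fusions".
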
diff --git a/analyses/fusion-summary/01-fusion-summary.nb.html b/analyses/fusion-summary/01-fusion-summary.nb.html
diff --git a/analyses/fusion-summary/README.md b/analyses/fusion-summary/README.md
@@ -1,12 +1,14 @@
 # Fusion Summary
 
 This module generates summary files for fusions of interest present in biospecimens taken from:
+
 1. Ependymoma tumors
 2. Embryonal tumors not from ATRT or MB
 
-To genereate the tables simply run:
+To generate the tables run:
+
 ```
-./run-new-analysis.sh
+bash run-new-analysis.sh
 ```
 
 ## General Use