CN Status Heatmap (PR 1 of 2) (#602)

* The basic set up is there. Needs more work
* Push this version of the heatmap though its gonna change
* Sort of working
* Neatened up things
* Sorted out a few error handling items
* Almost there
* It's working. Needs more documentation and tweaks
* documentation!!
* Organize and make functions in their own util folder
* Add length filter and fix error
* Refreshy the notebook
* Add to CircleCI
* Streamline the PR to functions and README
* Minor updates to READMEs
* Update some minor comments and etc.
* extra space in config file
* Forgot this should not be in CI until next PR
* Fix indices thing
* Some typo fixes
* Incorporate @cbethell 's suggestions
* Use @jaclyn-taroni 's wording suggestions
* Incorporate @cbethell 's suggestions
* Add more details about `bp_per_bin` in the README
* Incorporate @jashapiro suggestions
* Make updates to the logic and its handling of uncallables
* Forgot to take out development things.
* Use @josh logic
* Comment updates

Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
Co-authored-by: jashapiro <jashapiro@gmail.com>
AlexsLemonade · Apr 12, 2020 · 7d4b1c6
1 parent 203319b

Showing 3 changed files with 184 additions and 9 deletions.
diff --git a/analyses/README.md b/analyses/README.md
@@ -12,15 +12,15 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
 | Module | Input Files | Brief Description | Output Files Consumed by Other Analyses |
 |--------|-------|-------------------|--------------|
 | [`chromosomal-instability`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/chromosomal-instability) | `pbta-histologies.tsv` <br> `pbta-sv-manta.tsv.gz` <br> `pbta-cnv-cnvkit.seg.gz` | Evaluates chromosomal instability by calculating chromosomal breakpoint densities and by creating circular plot visuals | N/A
-| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-consensus-gistic.zip` <br> `analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group  | N/A
+| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-consensus-gistic.zip` <br> `analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg` | Plots genome-wide visualizations relating to copy number results | N/A
 | [`cnv-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-comparison) | Earlier version of SEG files | *Deprecated*; compared earlier version of the CNV methods. | N/A
 | [`collapse-rnaseq`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/collapse-rnaseq) | `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` <br> `gencode.v27.primary_assembly.annotation.gtf.gz` | Collapses RSEM FPKM matrices such that gene symbols are de-duplicated. | `results/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` (included in data download; too large for tracking via GitHub) <br> `results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` (included in data download; too large for tracking via GitHub)
 | [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds` <br> `pbta-gene-expression-rsem-tpm.stranded.rds` <br> `pbta-histologies.tsv` <br> `pbta-mend-qc-manifest.tsv` <br> `pbta-mend-qc-results.tar.gz` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
 | [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-sv-manta.tsv.gz` | Produces consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) and a set of excluded regions where CNV calls are not made | `results/cnv_consensus.tsv` <br> `results/pbta-cnv-consensus.seg.gz` (included in data download) <br> `ref/cnv_excluded_regions.bed` <br> `ref/cnv_callable.bed`
 | [`create-subset-files`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/create-subset-files) | All files | This module contains the code to create the subset files used in continuous integration | All subset files for continuous integration
 | [`focal-cn-file-preparation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg.gz` | Maps from copy number variant caller segments to gene identifiers; will be updated to take into account changes that affect entire cytobands, chromosome arms ([#186](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/186))| `results/cnvkit_annotated_cn_autosomes.tsv.gz` <br> `results/cnvkit_annotated_cn_x_and_y.tsv.gz` <br> `results/controlfreec_annotated_cn_autosomes.tsv.gz` <br> `results/controlfreec_annotated_cn_x_and_y.tsv.gz` <br> `results/consensus_seg_annotated_cn_autosomes.tsv.gz` (included in data download) <br> `results/consensus_seg_annotated_cn_x_and_y.tsv.gz` (included in data download)
 | [`fusion_filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | `pbta-fusion-arriba.tsv.gz` <br> `pbta-fusion-starfusion.tsv.gz` | Standardizes, filters, and prioritizes fusion calls | `results/pbta-fusion-putative-oncogenic.tsv`(included in data download) <br> `results/pbta-fusion-recurrent-fusion-byhistology.tsv` (included in data download) <br> `results/pbta-fusion-recurrent-fusion-bysample.tsv` (included in data download)
-| [`fusion-summary`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion-summary)| `pbta-histologies.tsv` <br> `pbta-fusion-putative-oncogenic.tsv` <br> `pbta-fusion-arriba.tsv.gz` <br> `pbta-fusion-starfusion.tsv.gz` | Generate summary tables from fusion files ([#398](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/398); [#623](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/623)) | `results/fusion_summary_embryonal_foi.tsv` (included in data download) <br> `results/fusion_summary_ependymoma_foi.tsv` (included in data download) <br> `results/fusion_summary_ewings_foi.tsv` 
+| [`fusion-summary`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion-summary)| `pbta-histologies.tsv` <br> `pbta-fusion-putative-oncogenic.tsv` <br> `pbta-fusion-arriba.tsv.gz` <br> `pbta-fusion-starfusion.tsv.gz` | Generate summary tables from fusion files ([#398](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/398); [#623](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/623)) | `results/fusion_summary_embryonal_foi.tsv` (included in data download) <br> `results/fusion_summary_ependymoma_foi.tsv` (included in data download) <br> `results/fusion_summary_ewings_foi.tsv`
 | [`gene-set-enrichment-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/gene-set-enrichment-analysis) | `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds`  | *In progress*. Updated gene set enrichment analysis with appropriate RNA-seq expression data | `results/gsva_scores_stranded.tsv` <br> `results/gsva_scores_polya.tsv` <br> for stranded, polya expression data respectively
 | [`gistic-cohort-vs-histology-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/gistic-cohort-vs-histology-comparison) | `analyses/run-gistic/results/pbta-cnv-consensus-gistic.zip` <br> `analyses/run-gistic/results/pbta-cnv-consensus-hgat-gistic.zip` <br> `analyses/run-gistic/results/pbta-cnv-consensus-lgat-gistic.zip` <br> `analyses/run-gistic/results/pbta-cnv-consensus-medulloblastoma-gistic.zip` | Comparison of the GISTIC results of the entire cohort with the GISTIC results of three individual histologies, namely, LGAT, HGAT and medulloblastoma ([#547](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/547)) | N/A
 | [`immune-deconv`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/immune-deconv) | `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` | Immune/Stroma characterization across PBTA (part of [#15](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/15)) | `results/deconv-output.RData`
@@ -31,7 +31,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
 | [`molecular-subtyping-EWS`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-EWS) | `analyses/fusion-summary/results/fusion_summary_ewings_foi.tsv`| Reclassification of tumors based on the presence of defining fusions for Ewing Sarcoma per [#623](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/623) | `results/EWS_samples.tsv`
 | [`molecular-subtyping-HGG`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-HGG) | `pbta-snv-consensus-mutation.maf.tsv.gz` <br> `analyses/focal-cn-preparation/results/cnvkit_annotated_cn_autosomes.tsv.gz` <br> `pbta-fusion-putative-oncogenic.tsv` <br> `pbta-cnv-consensus-gistic.zip` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` | Molecular subtyping of high-grade glioma samples [#249](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249) | `results/HGG_molecular_subtype.tsv`
 | [`molecular-subtyping-LGAT`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-LGAT)| `pbta-snv-consensus-mutation.maf.tsv.gz` <br> `pbta-fusion-putative-oncogenic.tsv` <br> `pbta-fusion-recurrently-fused-genes-bysample.tsv`| Molecular subtyping of Low-grade astrocytic tumor samples [#631](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/631) | `results/lgat_subtyping.tsv`
-| [`molecular-subtyping-SHH-tp53`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-SHH-tp53) | `pbta-histologies` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | *Deprecated*; Identify the SHH-classified medulloblastoma samples that have TP53 mutations [#247](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/247) | N/A 
+| [`molecular-subtyping-SHH-tp53`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-SHH-tp53) | `pbta-histologies` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | *Deprecated*; Identify the SHH-classified medulloblastoma samples that have TP53 mutations [#247](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/247) | N/A
 | [`molecular-subtyping-chordoma`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-chordoma) | `analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` | *In progress*; identifying poorly-differentiated chordoma samples per [#250](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/250) | N/A
 | [`molecular-subtyping-embryonal`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-embryonal) | `analyses/fusion-summary/fusion_summary_embryonal_foi.tsv` <br>  `pbta-histologies.tsv` <br> `pbta-sv-manta.tsv.gz` <br> `analyses/focal-cn-file-preparation/consensus_seg_annotated_cn_x_and_y.tsv.gz` <br> `analyses/focal-cn-file-preparation/cnvkit_annotated_cn_x_and_y.tsv.gz` <br> `analyses/focal-cn-file-preparation/controlfreec_annotated_cn_x_and_y.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` | Molecular subtyping of non-medulloblastoma, non-ATRT embryonal tumors [#251](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/251) | `results/embryonal_tumor_molecular_subtypes.tsv`
 | [`molecular-subtyping-pathology`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-pathology) | `analyses/molecular-subtyping-EWS/results/EWS_samples.tsv` <br> `analyses/molecular-subtyping-HGG/results/HGG_molecular_subtype.tsv` <br> `analyses/molecular-subtyping-LGAT/results/lgat_subtyping.tsv` <br> `analyses/molecular-subtyping-embryonal/results/embryonal_tumor_molecular_subtypes.tsv` <br> `pbta-fusion-putative-oncogenic.tsv` | Compile output from other molecular subtyping modules and incorporate pathology feedback [#645](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/645) | `results/compiled_molecular_subtyping_with_pathology_feedback.tsv`

diff --git a/analyses/cnv-chrom-plot/README.md b/analyses/cnv-chrom-plot/README.md
@@ -1,18 +1,33 @@
-## Plotting GISTIC results
+## Plotting Copy Number Results
 
 **Module Author:** Candace Savonen ([@cansavvy](https://www.github.com/cansavvy))
 
-The goal of this analysis is to plot GISTIC results and make CNV plots by histology groups. 
+This module plots genome-wide visualizations relating to copy number results.
 
-### Running the analysis
+### Creating the GISTIC plot
 
-This analysis consists of a single R Notebook, that can be run with the following from the top directory of the project:
+The GISTIC chromosomal plots can be re-generated by running this notebook:
 
 ```
 Rscript -e "rmarkdown::render('analyses/cnv-chrom-plot/gistic_plot.Rmd', clean = TRUE)"
 ```
 
+### Creating the CN status heatmap plot
+
+The CN status heatmap can be re-generated by running this notebook:
+
+```
+Rscript -e "rmarkdown::render('analyses/cnv-chrom-plot/cn_status_heatmap.Rmd', clean = TRUE)"
+```
+
 ### Output
 
-The output is a plot of the GISTIC scores (`plots/gistic_plot.png`) as well as
-plots of the `seg.mean` by each histology group (e.g. `plots/Chondrosarcoma_plot.png`).
+The output of these notebooks is a series of plots:
+- barplot of the GISTIC scores (`plots/gistic_plot.png`)
+- line plots of the `seg.mean` by each histology group (e.g. `plots/Chondrosarcoma_plot.png`)
+- heatmap of CN status by genome bin (`plots/cn_status_heatmap.pdf`)
+
+### Custom functions
+- `bp_per_bin` - Given a binned genome ranges object and another `GenomicRanges` object, return the number of base pairs covered per bin.
+  Can be used with any `GenomicRanges` object, but in this context it is used within `call_bin_status` to find the number of base pairs of each CN status per bin.
+- `call_bin_status` - Given a `sample_id`, copy number segment ranges, uncallable ranges, and a binned genome ranges object, call the CN status of each bin (gain, loss, neutral, uncallable, or unstable) according to which status covers more than a threshold fraction of the bin.
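
To make the idea behind `bp_per_bin` concrete, here is a minimal base R sketch of counting covered base pairs per bin on a single toy chromosome. The helper name `toy_bp_per_bin` and the interval coordinates are hypothetical and are not part of this module; the module itself works on `GenomicRanges` objects rather than plain vectors.

```r
# Hypothetical helper (not part of the module): base pairs of the segments
# that fall inside each bin, for closed integer intervals [start, end]
# on one chromosome.
toy_bp_per_bin <- function(bin_start, bin_end, seg_start, seg_end) {
  vapply(seq_along(bin_start), function(i) {
    # Width of the overlap between bin i and every segment;
    # a non-positive width means the segment does not touch the bin
    overlap <- pmin(bin_end[i], seg_end) - pmax(bin_start[i], seg_start) + 1
    sum(pmax(overlap, 0))
  }, numeric(1))
}

# Three 100 bp bins and two segments (made-up coordinates)
bin_start <- c(1, 101, 201)
bin_end <- c(100, 200, 300)
covered <- toy_bp_per_bin(
  bin_start, bin_end,
  seg_start = c(51, 250),
  seg_end = c(150, 260)
)
print(covered)
```

The first segment straddles the boundary between the first two bins, so its width is split between them. As with summing overlap widths in general, overlapping segments would be counted once each in this sketch.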
diff --git a/analyses/cnv-chrom-plot/util/bin-coverage.R b/analyses/cnv-chrom-plot/util/bin-coverage.R
@@ -0,0 +1,160 @@
+# Functions for calling CN statuses of genome bins
+#
+# C. Savonen for ALSF - CCDL
+#
+# 2020
+
+bp_per_bin <- function(bin_ranges, status_ranges) {
+  # Given a binned genome ranges object and another GenomicRanges object,
+  # return the number of bp covered per bin.
+  #
+  # Args:
+  #   bin_ranges: A binned GenomicRanges made from tileGenome.
+  #   status_ranges: A GenomicRanges object whose coverage of each bin (in bp)
+  #   will be calculated.
+  #
+  # Returns:
+  #  a data.frame with the bin index (`bin`) and the bp covered (`bp_per_bin`)
+
+  # Find the portions of each copy number segment that overlap with each bin.
+  bin_overlaps <- GenomicRanges::pintersect(
+    IRanges::findOverlapPairs(
+      bin_ranges,
+      status_ranges
+    )
+  )
+
+  # Which bins do the segs in `bin_overlaps` overlap with?
+  bin_indices <- GenomicRanges::findOverlaps(
+    bin_ranges,
+    bin_overlaps
+  )
+
+  # Get the sum of the length of all seg portions for each bin.
+  bp_per_bin <- tapply(
+    bin_overlaps@ranges@width, # Get length of each sequence within the bin
+    bin_indices@from, # Index of which bin it overlaps
+    sum
+  ) # Add up length per bin
+
+  # Format as data.frame with rows = bins
+  per_bin_df <- data.frame(
+    bin = as.numeric(names(bp_per_bin)),
+    bp_per_bin = as.numeric(bp_per_bin)
+  )
+
+  # Store dummy counts if there are no ranges that are in the bins
+  if (nrow(per_bin_df) == 0) {
+    per_bin_df <- data.frame(
+      bin = as.numeric(seq_along(bin_ranges)),
+      bp_per_bin = 0
+    )
+  }
+  return(per_bin_df)
+}
+
+call_bin_status <- function(sample_id,
+                            seg_ranges,
+                            bin_ranges,
+                            uncallable_ranges, 
+                            frac_threshold_val = .75, 
+                            frac_uncallable_val = .75) {
+
+  # Given a sample_id, CN segment ranges, and binned genome ranges object,
+  # make a call for each bin on which CN status covers enough of the bin.
+  # Uses bp_per_bin function.
+  #
+  # Args:
+  #   sample_id: A string that corresponds to a single biospecimen id
+  #   seg_ranges: A GenomicRanges object that contains a `status` and a `biospecimen` metadata column.
+  #               The `biospecimen` column will be used to split out the `sample_id`'s corresponding ranges.
+  #               The `status` column should have gain/loss/neutral.
+  #   bin_ranges: A binned GenomicRanges made from tileGenome that has been uncompressed with `unlist`.
+  #   uncallable_ranges: A GenomicRanges object of the regions considered uncallable.
+  #   frac_threshold_val: What fraction of a bin must a status cover for us to make that call?
+  #   frac_uncallable_val: What fraction of a bin must be uncallable for us to call it "uncallable"?
+  #
+  # Returns:
+  #  a small data.frame that contains the status call of the sample for each bin. 
+  #
+  # Extract the ranges for this sample
+  sample_seg_ranges <- seg_ranges[which(seg_ranges$biospecimen == sample_id)]
+
+  # Split ranges into their respective statuses
+  gain_ranges <- sample_seg_ranges[sample_seg_ranges$status == "gain"]
+  loss_ranges <- sample_seg_ranges[sample_seg_ranges$status == "loss"]
+  neutral_ranges <- sample_seg_ranges[sample_seg_ranges$status == "neutral"]
+
+  # Calculate length of each type of status per bin
+  gain_per_bin <- bp_per_bin(bin_ranges, gain_ranges)
+  loss_per_bin <- bp_per_bin(bin_ranges, loss_ranges)
+  neutral_per_bin <- bp_per_bin(bin_ranges, neutral_ranges)
+  uncallable_per_bin <- bp_per_bin(bin_ranges, uncallable_ranges)
+
+  # Format this data into one data.frame where each row is a bin
+  bin_bp_status <- data.frame(
+    bin = as.numeric(seq_along(bin_ranges)),
+    # Keep bin width
+    bin_width = bin_ranges@ranges@width
+  ) %>%
+    # Join gain coverage data
+    dplyr::left_join(gain_per_bin,
+      by = "bin"
+    ) %>%
+    # Rename as .gain
+    dplyr::rename(bp_per_bin.gain = bp_per_bin) %>%
+    # Join loss coverage data
+    dplyr::left_join(loss_per_bin,
+      by = "bin"
+    ) %>%
+    # Rename as .loss
+    dplyr::rename(bp_per_bin.loss = bp_per_bin) %>%
+    # Join neutral coverage data
+    dplyr::left_join(neutral_per_bin,
+      by = "bin"
+    ) %>%
+    # Rename as .neutral
+    dplyr::rename(bp_per_bin.neutral = bp_per_bin) %>%
+    # Join uncallable coverage data
+    dplyr::left_join(uncallable_per_bin,
+      by = "bin"
+    ) %>%
+    # Rename as .uncallable
+    dplyr::rename(bp_per_bin.uncallable = bp_per_bin) %>%
+    # If there is an NA, at this point we can assume it means 0
+    dplyr::mutate_at(
+      dplyr::vars(
+        dplyr::starts_with("bp_per_bin")
+      ),
+      ~ tidyr::replace_na(., 0)
+    ) %>%
+    # Calculate the fraction of each bin covered by each status
+    dplyr::mutate(
+      frac_gain = bp_per_bin.gain / bin_width,
+      frac_loss = bp_per_bin.loss / bin_width,
+      frac_neutral = bp_per_bin.neutral / bin_width,
+      frac_uncallable = bp_per_bin.uncallable / bin_width
+    ) %>%
+    # Use these fractions to declare the final call per bin based on
+    # frac_uncallable_val and frac_threshold_val
+    dplyr::mutate(
+      status =  dplyr::case_when(
+        frac_uncallable > frac_uncallable_val ~ "uncallable",
+        frac_gain > frac_threshold_val ~ "gain",
+        frac_loss > frac_threshold_val ~ "loss",
+        frac_neutral > frac_threshold_val ~ "neutral",
+        TRUE ~ "unstable"
+      )
+    )
+
+  # Format the status calls as a single wide row for this sample
+  status_df <- bin_bp_status %>%
+    # Only keep the bin and status columns
+    dplyr::select(bin, status) %>%
+    # Arrange the bins in order
+    dplyr::arrange(bin) %>%
+    # Spread this data so we can make it a sample x bin matrix later
+    tidyr::spread(bin, status) 
+
+  return(status_df)
+}
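
The order of the conditions in the `case_when` call in `call_bin_status` matters: the first condition that matches wins, so a bin that is mostly uncallable is labeled `uncallable` regardless of its gain, loss, or neutral coverage. The base R sketch below mimics that precedence on made-up fractions. The helper `toy_call_status` and its input values are hypothetical; only the default thresholds of .75 and the status labels are taken from the function above.

```r
# Hypothetical helper (not part of the module) that mirrors the precedence
# of the case_when() rules using plain vectors of per-bin fractions.
toy_call_status <- function(frac_gain, frac_loss, frac_neutral, frac_uncallable,
                            frac_threshold_val = 0.75,
                            frac_uncallable_val = 0.75) {
  # Start from the fallback label and overwrite in reverse priority order,
  # so the highest-priority rule (uncallable) is applied last and wins.
  status <- rep("unstable", length(frac_gain))
  status[frac_neutral > frac_threshold_val] <- "neutral"
  status[frac_loss > frac_threshold_val] <- "loss"
  status[frac_gain > frac_threshold_val] <- "gain"
  status[frac_uncallable > frac_uncallable_val] <- "uncallable"
  status
}

# Four toy bins with made-up coverage fractions
calls <- toy_call_status(
  frac_gain       = c(0.90, 0.40, 0.00, 0.10),
  frac_loss       = c(0.00, 0.40, 0.00, 0.00),
  frac_neutral    = c(0.10, 0.20, 0.20, 0.90),
  frac_uncallable = c(0.00, 0.00, 0.80, 0.00)
)
print(calls)
```

The second toy bin is split between gain and loss with neither above the threshold, so it falls through to `unstable`, which is the same fallback as the `TRUE ~ "unstable"` branch.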