This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

CN Status Heatmap (PR 1 of 2) (#602)
* The basic set up is there. Needs more work

* Push this version of the heatmap though its gonna change

* Sort of working

* Neatened up things

* Sorted out a few error handling items

* Almost there

* It's working. Needs more documentation and tweaks

* documentation!!

* Organize and make functions in their own util folder

* Add length filter and fix error

* Refreshy the notebook

* Add to CircleCI

* Streamline the PR to functions and README

* Minor updates to READMEs

* Update some minor comments and etc.

* extra space in config file

* Forgot this should not be in CI until next PR

* Fix indices thing

* Some typo fixes

* Incorporate @cbethell 's suggestions

* Use @jaclyn-taroni 's wording suggestions

* Incorporate @cbethell 's suggestions

* Add more details about `bp_per_bin` in the README

* Incorporate @jashapiro suggestions

* Make updates to the logic and its handling of uncallables

* Forgot to take out development things.

* Use @josh logic

* Comment updates

Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
Co-authored-by: jashapiro <jashapiro@gmail.com>
3 people authored Apr 12, 2020
1 parent 203319b commit 7d4b1c6
Showing 3 changed files with 184 additions and 9 deletions.
6 changes: 3 additions & 3 deletions analyses/README.md
@@ -12,15 +12,15 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| Module | Input Files | Brief Description | Output Files Consumed by Other Analyses |
|--------|-------|-------------------|--------------|
| [`chromosomal-instability`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/chromosomal-instability) | `pbta-histologies.tsv` <br> `pbta-sv-manta.tsv.gz` <br> `pbta-cnv-cnvkit.seg.gz` | Evaluates chromosomal instability by calculating chromosomal breakpoint densities and by creating circular plot visuals | N/A
| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-consensus-gistic.zip` <br> `analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg` | Makes plots from GISTIC output as well as `seg.mean` plots by histology group | N/A
| [`cnv-chrom-plot`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-chrom) | `pbta-cnv-consensus-gistic.zip` <br> `analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg` | Plots genome wide visualizations relating to copy number results | N/A
| [`cnv-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-comparison) | Earlier version of SEG files | *Deprecated*; compared earlier version of the CNV methods. | N/A
| [`collapse-rnaseq`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/collapse-rnaseq) | `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` <br> `gencode.v27.primary_assembly.annotation.gtf.gz` | Collapses RSEM FPKM matrices such that gene symbols are de-duplicated. | `results/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` (included in data download; too large for tracking via GitHub) <br> `results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` (included in data download; too large for tracking via GitHub)
| [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds` <br> `pbta-gene-expression-rsem-tpm.stranded.rds` <br> `pbta-histologies.tsv` <br> `pbta-mend-qc-manifest.tsv` <br> `pbta-mend-qc-results.tar.gz` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-sv-manta.tsv.gz` | Produces consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) and a set of excluded regions where CNV calls are not made | `results/cnv_consensus.tsv` <br> `results/pbta-cnv-consensus.seg.gz` (included in data download) <br> `ref/cnv_excluded_regions.bed` <br> `ref/cnv_callable.bed`
| [`create-subset-files`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/create-subset-files) | All files | This module contains the code to create the subset files used in continuous integration | All subset files for continuous integration
| [`focal-cn-file-preparation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `analyses/copy_number_consensus_call/results/pbta-cnv-consensus.seg.gz` | Maps from copy number variant caller segments to gene identifiers; will be updated to take into account changes that affect entire cytobands, chromosome arms ([#186](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/186))| `results/cnvkit_annotated_cn_autosomes.tsv.gz` <br> `results/cnvkit_annotated_cn_x_and_y.tsv.gz` <br> `results/controlfreec_annotated_cn_autosomes.tsv.gz` <br> `results/controlfreec_annotated_cn_x_and_y.tsv.gz` <br> `results/consensus_seg_annotated_cn_autosomes.tsv.gz` (included in data download) <br> `results/consensus_seg_annotated_cn_x_and_y.tsv.gz` (included in data download)
| [`fusion_filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | `pbta-fusion-arriba.tsv.gz` <br> `pbta-fusion-starfusion.tsv.gz` | Standardizes, filters, and prioritizes fusion calls | `results/pbta-fusion-putative-oncogenic.tsv` (included in data download) <br> `results/pbta-fusion-recurrent-fusion-byhistology.tsv` (included in data download) <br> `results/pbta-fusion-recurrent-fusion-bysample.tsv` (included in data download)
| [`fusion-summary`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion-summary)| `pbta-histologies.tsv` <br> `pbta-fusion-putative-oncogenic.tsv` <br> `pbta-fusion-arriba.tsv.gz` <br> `pbta-fusion-starfusion.tsv.gz` | Generate summary tables from fusion files ([#398](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/398); [#623](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/623)) | `results/fusion_summary_embryonal_foi.tsv` (included in data download) <br> `results/fusion_summary_ependymoma_foi.tsv` (included in data download) <br> `results/fusion_summary_ewings_foi.tsv`
| [`gene-set-enrichment-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/gene-set-enrichment-analysis) | `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` | *In progress*. Updated gene set enrichment analysis with appropriate RNA-seq expression data | `results/gsva_scores_stranded.tsv` <br> `results/gsva_scores_polya.tsv` <br> for stranded, polya expression data respectively
| [`gistic-cohort-vs-histology-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/gistic-cohort-vs-histology-comparison) | `analyses/run-gistic/results/pbta-cnv-consensus-gistic.zip` <br> `analyses/run-gistic/results/pbta-cnv-consensus-hgat-gistic.zip` <br> `analyses/run-gistic/results/pbta-cnv-consensus-lgat-gistic.zip` <br> `analyses/run-gistic/results/pbta-cnv-consensus-medulloblastoma-gistic.zip` | Comparison of the GISTIC results of the entire cohort with the GISTIC results of three individual histologies, namely, LGAT, HGAT and medulloblastoma ([#547](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/547)) | N/A
| [`immune-deconv`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/immune-deconv) | `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` | Immune/Stroma characterization across PBTA (part of [#15](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/15)) | `results/deconv-output.RData`
@@ -31,7 +31,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| [`molecular-subtyping-EWS`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-EWS) | `analyses/fusion-summary/results/fusion_summary_ewings_foi.tsv`| Reclassification of tumors based on the presence of defining fusions for Ewing Sarcoma per [#623](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/623) | `results/EWS_samples.tsv`
| [`molecular-subtyping-HGG`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-HGG) | `pbta-snv-consensus-mutation.maf.tsv.gz` <br> `analyses/focal-cn-preparation/results/cnvkit_annotated_cn_autosomes.tsv.gz` <br> `pbta-fusion-putative-oncogenic.tsv` <br> `pbta-cnv-consensus-gistic.zip` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` | Molecular subtyping of high-grade glioma samples [#249](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/249) | `results/HGG_molecular_subtype.tsv`
| [`molecular-subtyping-LGAT`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-LGAT)| `pbta-snv-consensus-mutation.maf.tsv.gz` <br> `pbta-fusion-putative-oncogenic.tsv` <br> `pbta-fusion-recurrently-fused-genes-bysample.tsv`| Molecular subtyping of Low-grade astrocytic tumor samples [#631](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/631) | `results/lgat_subtyping.tsv`
| [`molecular-subtyping-SHH-tp53`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-SHH-tp53) | `pbta-histologies` <br> `pbta-snv-consensus-mutation.maf.tsv.gz` | *Deprecated*; Identify the SHH-classified medulloblastoma samples that have TP53 mutations [#247](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/247) | N/A
| [`molecular-subtyping-chordoma`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-chordoma) | `analyses/focal-cn-file-preparation/results/consensus_seg_annotated_cn_autosomes.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` | *In progress*; identifying poorly-differentiated chordoma samples per [#250](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/250) | N/A
| [`molecular-subtyping-embryonal`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-embryonal) | `analyses/fusion-summary/fusion_summary_embryonal_foi.tsv` <br> `pbta-histologies.tsv` <br> `pbta-sv-manta.tsv.gz` <br> `analyses/focal-cn-file-preparation/consensus_seg_annotated_cn_x_and_y.tsv.gz` <br> `analyses/focal-cn-file-preparation/cnvkit_annotated_cn_x_and_y.tsv.gz` <br> `analyses/focal-cn-file-preparation/controlfreec_annotated_cn_x_and_y.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` <br> `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` | Molecular subtyping of non-medulloblastoma, non-ATRT embryonal tumors [#251](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/251) | `results/embryonal_tumor_molecular_subtypes.tsv`
| [`molecular-subtyping-pathology`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/molecular-subtyping-pathology) | `analyses/molecular-subtyping-EWS/results/EWS_samples.tsv` <br> `analyses/molecular-subtyping-HGG/results/HGG_molecular_subtype.tsv` <br> `analyses/molecular-subtyping-LGAT/results/lgat_subtyping.tsv` <br> `analyses/molecular-subtyping-embryonal/results/embryonal_tumor_molecular_subtypes.tsv` <br> `pbta-fusion-putative-oncogenic.tsv` | Compile output from other molecular subtyping modules and incorporate pathology feedback [#645](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/645) | `results/compiled_molecular_subtyping_with_pathology_feedback.tsv`
27 changes: 21 additions & 6 deletions analyses/cnv-chrom-plot/README.md
@@ -1,18 +1,33 @@
## Plotting GISTIC results
## Plotting Copy Number Results

**Module Author:** Candace Savonen ([@cansavvy](https://www.github.com/cansavvy))

The goal of this analysis is to plot GISTIC results and make CNV plots by histology groups.
This module plots genome-wide visualizations relating to copy number results.

### Running the analysis
### Creating the GISTIC plot

This analysis consists of a single R Notebook, that can be run with the following from the top directory of the project:
The GISTIC chromosomal plots can be re-generated by running this notebook:

```
Rscript -e "rmarkdown::render('analyses/cnv-chrom-plot/gistic_plot.Rmd', clean = TRUE)"
```

### Creating the CN status heatmap plot

The CN status heatmap can be re-generated by running this notebook:

```
Rscript -e "rmarkdown::render('analyses/cnv-chrom-plot/cn_status_heatmap.Rmd', clean = TRUE)"
```

### Output

The output is a plot of the GISTIC scores (`plots/gistic_plot.png`) as well as
plots of the `seg.mean` by each histology group (e.g. `plots/Chondrosarcoma_plot.png`).
The output of these notebooks is a series of plots:
- barplot of the GISTIC scores (`plots/gistic_plot.png`)
- line plots of the `seg.mean` by each histology group (e.g. `plots/Chondrosarcoma_plot.png`)
- heatmap of CN status by genome bin (`plots/cn_status_heatmap.pdf`)

### Custom functions:
- `bp_per_bin` - Given a binned genome ranges object and another `GenomicRanges` object, return the number of base pairs covered per bin.
  Can be used with any `GenomicRanges` object, but in this context is used within `call_bin_status` to find the number of base pairs of each CN status per bin.
- `call_bin_status` - Given a `sample_id`, copy number segment ranges, and a binned genome ranges object, call for each bin which CN status has the most coverage in the bin.
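
As a rough illustration of what "base pairs covered per bin" means, here is a minimal base-R sketch for a single chromosome. It is not the module's implementation, which works on `GenomicRanges` objects. The function name `bp_covered_per_bin` and its arguments are hypothetical, and it assumes 1-based closed coordinates and segments that do not overlap each other.

```r
# Hypothetical base-R illustration of per-bin coverage on one chromosome.
# `bp_covered_per_bin` is an illustrative name, not a function in this module.
bp_covered_per_bin <- function(bin_start, bin_end, seg_start, seg_end) {
  vapply(seq_along(bin_start), function(i) {
    # Overlap of every segment with bin i (closed, 1-based coordinates)
    overlap <- pmin(bin_end[i], seg_end) - pmax(bin_start[i], seg_start) + 1
    # Zero or negative values mean the segment does not touch this bin
    sum(pmax(overlap, 0))
  }, numeric(1))
}

# Two 100 bp bins and two segments; the first segment spans both bins
covered <- bp_covered_per_bin(
  bin_start = c(1, 101), bin_end = c(100, 200),
  seg_start = c(51, 151), seg_end = c(120, 160)
)
print(covered)
```

In this toy case the first bin has 50 bp covered and the second has 30 bp (20 bp from the first segment plus 10 bp from the second). Dividing these by the bin width gives the kind of per-bin fraction that a status call can be based on.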
160 changes: 160 additions & 0 deletions analyses/cnv-chrom-plot/util/bin-coverage.R
@@ -0,0 +1,160 @@
# Functions for calling CN statuses of genome bins
#
# C. Savonen for ALSF - CCDL
#
# 2020

# Magrittr pipe, used by call_bin_status below
`%>%` <- dplyr::`%>%`

bp_per_bin <- function(bin_ranges, status_ranges) {
  # Given a binned genome ranges object and another GenomicRanges object,
  # return the number of bp covered per bin.
  #
  # Args:
  #   bin_ranges: A binned GenomicRanges made from tileGenome.
  #   status_ranges: A GenomicRanges object whose coverage of each bin (in bp)
  #                  will be calculated.
  #
  # Returns:
  #   a data.frame with one row per bin and the number of bp covered in it

  # Find the portions of each copy number segment that overlap with each bin.
  bin_overlaps <- GenomicRanges::pintersect(
    IRanges::findOverlapPairs(
      bin_ranges,
      status_ranges
    )
  )

  # Which bins do the segs in `bin_overlaps` overlap with?
  bin_indices <- GenomicRanges::findOverlaps(
    bin_ranges,
    bin_overlaps
  )

  # Get the sum of the length of all seg portions for each bin.
  bp_per_bin <- tapply(
    bin_overlaps@ranges@width, # Get length of each sequence within the bin
    bin_indices@from, # Index of which bin it overlaps
    sum
  ) # Add up length per bin

  # Format as data.frame with rows = bins
  per_bin_df <- data.frame(
    bin = as.numeric(names(bp_per_bin)),
    bp_per_bin = as.numeric(bp_per_bin)
  )

  # Store dummy counts if there are no ranges that are in the bins
  if (nrow(per_bin_df) == 0) {
    per_bin_df <- data.frame(
      bin = as.numeric(1:length(bin_ranges)),
      bp_per_bin = 0
    )
  }
  return(per_bin_df)
}

call_bin_status <- function(sample_id,
                            seg_ranges,
                            bin_ranges,
                            uncallable_ranges,
                            frac_threshold_val = .75,
                            frac_uncallable_val = .75) {

  # Given a sample_id, CN segment ranges, and binned genome ranges object,
  # make a call for each bin on which CN status has the most coverage in the bin.
  # Uses bp_per_bin function.
  #
  # Args:
  #   sample_id: A string that corresponds to a single biospecimen id
  #   seg_ranges: A GenomicRanges object that contains a `status` and a `biospecimen` slot.
  #               The `biospecimen` slot will be used to split out the `sample_id`'s corresponding ranges.
  #               The `status` slot should have gain/loss/neutral.
  #   bin_ranges: A binned GenomicRanges made from tileGenome that has been uncompressed with `unlist`.
  #   uncallable_ranges: A GenomicRanges object of the regions considered uncallable.
  #   frac_threshold_val: What coverage fraction do we need to make the call?
  #   frac_uncallable_val: Above what uncallable fraction of a bin do we label
  #                        the bin "uncallable" instead of making a status call?
  #
  # Returns:
  #   a small data.frame that contains the status call of the sample for each bin.
  #
  # Extract the ranges for this sample
  sample_seg_ranges <- seg_ranges[which(seg_ranges$biospecimen == sample_id)]

  # Split ranges into their respective statuses
  gain_ranges <- sample_seg_ranges[sample_seg_ranges$status == "gain"]
  loss_ranges <- sample_seg_ranges[sample_seg_ranges$status == "loss"]
  neutral_ranges <- sample_seg_ranges[sample_seg_ranges$status == "neutral"]

  # Calculate length of each type of status per bin
  gain_per_bin <- bp_per_bin(bin_ranges, gain_ranges)
  loss_per_bin <- bp_per_bin(bin_ranges, loss_ranges)
  neutral_per_bin <- bp_per_bin(bin_ranges, neutral_ranges)
  uncallable_per_bin <- bp_per_bin(bin_ranges, uncallable_ranges)

  # Format this data into one data.frame where each row is a bin
  bin_bp_status <- data.frame(
    bin = as.numeric(1:length(bin_ranges)),
    # Keep bin width
    bin_width = bin_ranges@ranges@width
  ) %>%
    # Join gain coverage data
    dplyr::left_join(gain_per_bin,
      by = "bin"
    ) %>%
    # Rename as .gain
    dplyr::rename(bp_per_bin.gain = bp_per_bin) %>%
    # Join loss coverage data
    dplyr::left_join(loss_per_bin,
      by = "bin"
    ) %>%
    # Rename as .loss
    dplyr::rename(bp_per_bin.loss = bp_per_bin) %>%
    # Join neutral coverage data
    dplyr::left_join(neutral_per_bin,
      by = "bin"
    ) %>%
    # Rename as .neutral
    dplyr::rename(bp_per_bin.neutral = bp_per_bin) %>%
    # Join uncallable coverage data
    dplyr::left_join(uncallable_per_bin,
      by = "bin"
    ) %>%
    # Rename as .uncallable
    dplyr::rename(bp_per_bin.uncallable = bp_per_bin) %>%
    # If there is an NA, at this point we can assume it means 0
    dplyr::mutate_at(
      dplyr::vars(
        dplyr::starts_with("bp_per_bin")
      ),
      ~ tidyr::replace_na(., 0)
    ) %>%
    # Calculate the bin's fraction of each status
    dplyr::mutate(
      frac_gain = bp_per_bin.gain / bin_width,
      frac_loss = bp_per_bin.loss / bin_width,
      frac_neutral = bp_per_bin.neutral / bin_width,
      frac_uncallable = bp_per_bin.uncallable / bin_width
    ) %>%
    # Use these fractions for declaring the final call per bin based on
    # frac_uncallable_val and frac_threshold_val
    dplyr::mutate(
      status = dplyr::case_when(
        frac_uncallable > frac_uncallable_val ~ "uncallable",
        frac_gain > frac_threshold_val ~ "gain",
        frac_loss > frac_threshold_val ~ "loss",
        frac_neutral > frac_threshold_val ~ "neutral",
        TRUE ~ "unstable"
      )
    )

  # Format this data as a status
  status_df <- bin_bp_status %>%
    # Only keep the bin and status columns
    dplyr::select(bin, status) %>%
    # Arrange the bins in order
    dplyr::arrange(bin) %>%
    # Spread this data so we can make it a sample x bin matrix later
    tidyr::spread(bin, status)

  return(status_df)
}
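
The per-bin decision rule in `call_bin_status` can be sketched without `dplyr` to make the precedence of the checks explicit. This is a minimal base-R illustration, not the module's code: the helper name `call_status` is hypothetical, while the default thresholds mirror `frac_threshold_val` and `frac_uncallable_val` from the function signature above. It assumes the input fractions contain no missing values.

```r
# Hypothetical helper illustrating the per-bin decision rule; not part of the module.
call_status <- function(frac_gain, frac_loss, frac_neutral, frac_uncallable,
                        frac_threshold_val = 0.75,
                        frac_uncallable_val = 0.75) {
  # Start from the fallback label, then overwrite from lowest to highest
  # priority so that the result matches a first-match-wins rule:
  # uncallable, then gain, then loss, then neutral, otherwise unstable.
  status <- rep("unstable", length(frac_gain))
  status[frac_neutral > frac_threshold_val] <- "neutral"
  status[frac_loss > frac_threshold_val] <- "loss"
  status[frac_gain > frac_threshold_val] <- "gain"
  status[frac_uncallable > frac_uncallable_val] <- "uncallable"
  status
}

# Four toy bins: mostly gain, mixed, fully neutral, mostly uncallable
calls <- call_status(
  frac_gain       = c(0.9, 0.4, 0.0, 0.1),
  frac_loss       = c(0.0, 0.4, 0.0, 0.0),
  frac_neutral    = c(0.1, 0.2, 1.0, 0.1),
  frac_uncallable = c(0.0, 0.0, 0.0, 0.8)
)
print(calls)
```

Here the second bin is labelled `"unstable"` because no single status exceeds the threshold, and the fourth bin is labelled `"uncallable"` regardless of its other fractions because that check takes precedence.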
