Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Add cytoband to copy number files using bedtools intersect #617

Merged
Merged
Show file tree
Hide file tree
Changes from 29 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
7dffdfc
Add cytoband data using bedtools intersect with UCSC cytoband file
cbethell Mar 9, 2020
e745ad9
Update comments
cbethell Mar 9, 2020
1c4eea5
@cansavvy and @jashapiro suggested changes
cbethell Mar 11, 2020
bba0864
sort before filtering out losses and gains
cbethell Mar 12, 2020
f573daa
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 12, 2020
337a264
Add notebook to join and wrangle the cytoband bed files
cbethell Mar 12, 2020
3c827b1
Add chromosome arm field and GISTIC arm status data
cbethell Mar 13, 2020
ee4ff57
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 13, 2020
bf47620
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 13, 2020
721ab75
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 13, 2020
9fa3b85
implement @jashapiro's suggested changes
cbethell Mar 13, 2020
d215dcb
add steps for loss and gain bed files to `run-bedtools.sh`
cbethell Mar 14, 2020
5a2239d
Propagate changes to bed files to `03` nb
cbethell Mar 16, 2020
c31745d
change logic to uncompress the cytoband file once
cbethell Mar 16, 2020
05bd93a
Substitute snakemake for shell in bedtools script
jashapiro Mar 16, 2020
b60cf95
Merge remote-tracking branch 'cbethell/add-cytoband-status-with-bedto…
jashapiro Mar 16, 2020
2af767a
rerun `03` nb with updated coverage bed files
cbethell Mar 17, 2020
60147fc
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 17, 2020
69a3ee5
Update module README and start defining most focal units
cbethell Mar 18, 2020
8247e53
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 18, 2020
0c89687
Reformat final output table and rename output file
cbethell Mar 19, 2020
bcbbb09
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 19, 2020
2b6c1b2
Apply suggestions from @jashapiro code review
cbethell Mar 19, 2020
01f385b
Move addition of `chromosome_arm` step after joining original UCSC data
cbethell Mar 19, 2020
d64159b
update README to reflect addition of `band_length` column
cbethell Mar 19, 2020
807a7ad
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 19, 2020
b26cd42
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 20, 2020
e9bf7a7
remove redundant joining of original ucsc cytoband data section
cbethell Mar 20, 2020
126ea71
update usage comment and re-render html output
cbethell Mar 20, 2020
53b906d
@jashapiro's commit suggestion to include sex chromosome data
cbethell Mar 20, 2020
1b761ee
rerun module (now that UCSC file download includes sex chromosomes)
cbethell Mar 20, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
70 changes: 70 additions & 0 deletions analyses/focal-cn-file-preparation/02-add-ploidy-consensus.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,10 @@ library(tidyverse)

```{r}
scratch_dir <- file.path("..", "..", "scratch")
output_dir <- file.path(scratch_dir, "cytoband_status", "segments")
if(!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
```

```{r}
Expand Down Expand Up @@ -146,3 +150,69 @@ add_status_df %>%
output_file <- file.path("..", "..", "scratch", "consensus_seg_with_status.tsv")
write_tsv(add_status_df, output_file)
```

### Prepare separate bed files for losses/gains for bedtools coverage function

```{r}
bed_status_df <- add_status_df %>%
select(chrom, loc.start, loc.end, everything()) %>%
group_by(Kids_First_Biospecimen_ID) %>%
arrange(chrom, loc.start, loc.end)

losses_bed_status_df <- bed_status_df %>%
filter(status == "loss")

gains_bed_status_df <- bed_status_df %>%
filter(status == "gain")

# make lists of data frames by sample
bed_status_list <- bed_status_df %>%
group_split()
names(bed_status_list) <- bed_status_df %>%
group_keys() %>%
pull(Kids_First_Biospecimen_ID)

bed_loss_list <- losses_bed_status_df %>%
group_split()
names(bed_loss_list) <- losses_bed_status_df %>%
group_keys() %>%
pull(Kids_First_Biospecimen_ID)

bed_gain_list <- gains_bed_status_df %>%
group_split()
names(bed_gain_list) <- gains_bed_status_df %>%
group_keys() %>%
pull(Kids_First_Biospecimen_ID)
```

### Write to file

```{r}
temp <- purrr::imap(bed_status_list,
~ write_tsv(.x,
file.path(
output_dir, paste0("consensus_callable.", .y, ".bed")
),
col_names = FALSE))

temp_loss <- purrr::imap(bed_loss_list,
~ write_tsv(.x,
file.path(
output_dir, paste0("consensus_loss.", .y, ".bed")
),
col_names = FALSE))

temp_gain <- purrr::imap(bed_gain_list,
~ write_tsv(.x,
file.path(
output_dir, paste0("consensus_gain.", .y, ".bed")
),
col_names = FALSE))
```

## Session Info

```{r}
sessionInfo()
```

137 changes: 129 additions & 8 deletions analyses/focal-cn-file-preparation/02-add-ploidy-consensus.nb.html

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,153 @@
---
title: "Add dominant status column to consensus SEG file with cytoband field"
output:
html_notebook:
toc: TRUE
toc_float: TRUE
author: Chante Bethell for ALSF CCDL
date: 2020
---

This notebook adds dominant status information per cytoband to the consensus SEG files prepared in `run-prepare-cn.sh` using the `bedtools coverage` function (these files are stored in the project's scratch directory as bed files).

## Usage

This notebook is intended to be run from the command line with the following (assumes you are in the root directory of the repository):

```
Rscript -e "rmarkdown::render('analyses/focal-cn-file-preparation/03-add-cytoband-status-consensus.Rmd', clean = TRUE)"
```

## Set up

### Libraries and functions

```{r}
library(tidyverse)
```

### Files and directories

```{r}
scratch_dir <- file.path("..", "..", "scratch")
```

### Read in files

The files we are reading in here are the bed files prepared in the shell script using `bedtools coverage`.

```{r message = FALSE}
all_callable_cytoband_status_df <-
read_tsv(file.path(scratch_dir, "intersect_with_cytoband_callable.bed"),
col_names = FALSE)

loss_cytoband_status_df <-
read_tsv(file.path(scratch_dir, "intersect_with_cytoband_loss.bed"),
col_names = FALSE)

gain_cytoband_status_df <-
read_tsv(file.path(scratch_dir, "intersect_with_cytoband_gain.bed"),
col_names = FALSE)
```

## Wrangle and merge consensus cytoband status data

### Filter each of the cytoband status data.frames

```{r}
all_callable_cytoband_status_df <-
all_callable_cytoband_status_df %>%
select(
chr = X1,
cytoband = X4,
cbethell marked this conversation as resolved.
Show resolved Hide resolved
band_length = X8,
callable_fraction = X9,
Kids_First_Biospecimen_ID = X10
) %>%
filter(!is.na(cytoband))

gain_cytoband_status_df <- gain_cytoband_status_df %>%
select(
chr = X1,
cytoband = X4,
gain_fraction = X9,
Kids_First_Biospecimen_ID = X10
) %>%
filter(!is.na(cytoband))

loss_cytoband_status_df <- loss_cytoband_status_df %>%
select(
chr = X1,
cytoband = X4,
loss_fraction = X9,
Kids_First_Biospecimen_ID = X10
) %>%
filter(!is.na(cytoband))
```

### Join all data.frames together

```{r}
final_df <- all_callable_cytoband_status_df %>%
left_join(gain_cytoband_status_df,
by = c("chr", "cytoband", "Kids_First_Biospecimen_ID")) %>%
inner_join(loss_cytoband_status_df,
by = c("chr", "cytoband", "Kids_First_Biospecimen_ID"))
```

### Add `dominant_status` field to final data.frame

```{r}
# Create a dominant status column
final_df <- final_df %>%
replace_na(list(
gain_fraction = 0,
loss_fraction = 0
)) %>%
mutate(
dominant_status = case_when(
callable_fraction < 0.5 ~ "uncallable",
gain_fraction / callable_fraction > 0.5 ~ "gain",
loss_fraction / callable_fraction > 0.5 ~ "loss",
(gain_fraction + loss_fraction) / callable_fraction > 0.5 ~ "unstable",
TRUE ~ "neutral"
)
)
```

### Add chromosome arm column

```{r}
# Add a column that tells us the position of the p or q and then use this to
# split the cytoband column
final_df <- final_df %>%
mutate(
cytoband_with_arm = paste0(gsub("chr", "", chr), cytoband),
chromosome_arm = gsub("(p|q).*", "\\1", cytoband_with_arm)
) %>%
select(
Kids_First_Biospecimen_ID,
chr,
cytoband,
dominant_status,
band_length,
everything(),
-cytoband_with_arm,
)
```

### Display and save final table

```{r}
# Display final table with `uncallable` value added to `dominant_status` column
final_df

# Write to file
write_tsv(final_df, file.path("results", "consensus_seg_with_ucsc_cytoband_status.tsv.gz"))
```

## Session Info

```{r}
sessionInfo()
```
3,232 changes: 3,232 additions & 0 deletions analyses/focal-cn-file-preparation/03-add-cytoband-status-consensus.nb.html

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
# This script is intended to be run via the command line.
# This example assumes it is being run from the root of the repository.
#
# Rscript --vanilla analyses/oncoprint-landscape/03-prepare-cn-file.R \
# Rscript --vanilla analyses/focal-cn-file-preparation/03-prepare-cn-file.R \
# --cnv_file data/pbta-cnv-controlfreec.tsv.gz \
# --gtf_file data/gencode.v27.primary_assembly.annotation.gtf.gz \
# --metadata data/pbta-histologies.tsv \
Expand Down
51 changes: 47 additions & 4 deletions analyses/focal-cn-file-preparation/README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
## Focal copy number file preparation

**Module authors:** Chante Bethell ([@cbethell](https://github.com/cbethell)) and Jaclyn Taroni ([@jaclyn-taroni](https://github.com/jaclyn-taroni))
**Module authors:** Chante Bethell ([@cbethell](https://github.com/cbethell)), Joshua Shapiro ([@jashapiro](https://github.com/jashapiro)), and Jaclyn Taroni ([@jaclyn-taroni](https://github.com/jaclyn-taroni))

The copy number data from OpenPBTA are provided as ranges or segments.
The purpose of this module is to map from those ranges to gene identifiers for consumption by downstream analyses (e.g., OncoPrint plotting).
Expand All @@ -12,6 +12,8 @@ To run this analysis _only on consensus SEG file_, use the following (from the r
```
bash analyses/focal-cn-file-preparation/run-prepare-cn.sh
```
**Note**: The `run-bedtools.snakemake` script is implemented in `run-prepare-cn.sh` to run the bedtools coverage steps between the UCSC cytoband file and the samples in the copy number files produced in `02-add-ploidy-consensus.Rmd`.
This script currently takes a while to run, and therefore slows down the processing speed of the main shell script `run-prepare-cn.sh`.

Running the following from the root directory of the repository will run the steps for the original copy number call files (CNVkit and ControlFreeC) in addition to the consensus SEG file:

Expand All @@ -28,9 +30,14 @@ RUN_ORIGINAL=1 bash analyses/focal-cn-file-preparation/run-prepare-cn.sh

* `02-add-ploidy-consensus.Rmd` - This is very similar to the CNVkit file prep (`01-add-ploidy-cnvkit.Rmd`).
However, there are instances in the consensus SEG file where `copy.num` is `NA` which are removed.
See the notebook for more information.
See the notebook for more information. This notebook also prepares lists of copy number bed files by sample for use in the implementation of bedtools coverage in `run-bedtools.snakemake`.

* `03-prepare-cn-file.R` - This script performs the ranges to annotation mapping using the GENCODE v27 GTF included via the data download step; it takes the ControlFreeC file or a SEG (e.g., CNVkit, consensus SEG) file prepared with `01-add-ploidy-cnvkit.Rmd` and `02-add-ploidy-cnvkit.Rmd` as input.
* `03-add-cytoband-status-consensus.Rmd` - This notebook reads in the bedtools coverage output files and defines the dominant copy number status for each chromosome arm. The output of this notebook is a table with the following columns:

| `Kids_First_Biospecimen_ID` | chr | cytoband | dominant_status | band_length | callable_fraction | gain_fraction | loss_fraction | chromosome_arm |
|----------------|--------|-------------|--------|---------|----------|-------------|---------|---------------|

* `04-prepare-cn-file.R` - This script performs the ranges to annotation mapping using the GENCODE v27 GTF included via the data download step; it takes the ControlFreeC file or a SEG (e.g., CNVkit, consensus SEG) file prepared with `01-add-ploidy-cnvkit.Rmd` and `02-add-ploidy-cnvkit.Rmd` as input.
**The mapping is limited to _exons_.**
Mapping to cytobands is performed with the [`org.Hs.eg.db`](https://doi.org/doi:10.18129/B9.bioc.org.Hs.eg.db) package.
A table with the following columns is returned:
Expand Down Expand Up @@ -66,9 +73,43 @@ focal-cn-file-preparation
├── 01-add-ploidy-cnvkit.nb.html
├── 02-add-ploidy-consensus.Rmd
├── 02-add-ploidy-consensus.nb.html
├── 03-prepare-cn-file.R
├── 03-add-cytoband-status-consensus.Rmd
├── 03-add-cytoband-status-consensus.nb.html
├── 04-prepare-cn-file.R
├── README.md
├── annotation_files
│   └── txdb_from_gencode.v27.gtf.db
├── display-plots.md
├── gistic-results
│   └── pbta-cnv-cnvkit-gistic
│   ├── D.cap1.5.mat
│   ├── all_data_by_genes.txt
│   ├── all_lesions.conf_90.txt
│   ├── all_thresholded.by_genes.txt
│   ├── amp_genes.conf_90.txt
│   ├── amp_qplot.pdf
│   ├── amp_qplot.png
│   ├── broad_data_by_genes.txt
│   ├── broad_gistic_plot.pdf
│   ├── broad_significance_results.txt
│   ├── broad_values_by_arm.txt
│   ├── del_genes.conf_90.txt
│   ├── del_qplot.pdf
│   ├── del_qplot.png
│   ├── focal_dat.0.98.mat
│   ├── focal_data_by_genes.txt
│   ├── freqarms_vs_ngenes.pdf
│   ├── gistic_inputs.mat
│   ├── peak_regs.mat
│   ├── perm_ads.mat
│   ├── raw_copy_number.pdf
│   ├── raw_copy_number.png
│   ├── regions_track.conf_90.bed
│   ├── sample_cutoffs.txt
│   ├── sample_seg_counts.txt
│   ├── scores.0.98.mat
│   ├── scores.gistic
│   └── wide_peak_regs.mat
├── plots
│   ├── cnvkit_annotated_cn_autosomes_polya_loss_cor_plot.png
│   ├── cnvkit_annotated_cn_autosomes_polya_stacked_plot.png
Expand Down Expand Up @@ -111,9 +152,11 @@ focal-cn-file-preparation
│   ├── cnvkit_annotated_cn_x_and_y.tsv.gz
│   ├── consensus_seg_annotated_cn_autosomes.tsv.gz
│   ├── consensus_seg_annotated_cn_x_and_y.tsv.gz
│   ├── consensus_seg_with_ucsc_cytoband.tsv.gz
│   ├── controlfreec_annotated_cn_autosomes.tsv.gz
│   └── controlfreec_annotated_cn_x_and_y.tsv.gz
├── rna-expression-validation.R
├── run-bedtools.snakemake
├── run-prepare-cn.sh
└── util
└── rna-expression-functions.R
Expand Down
Binary file not shown.
Loading