Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Add cytoband to copy number files using bedtools intersect #617

Merged
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
7dffdfc
Add cytoband data using bedtools intersect with UCSC cytoband file
cbethell Mar 9, 2020
e745ad9
Update comments
cbethell Mar 9, 2020
1c4eea5
@cansavvy and @jashapiro suggested changes
cbethell Mar 11, 2020
bba0864
sort before filtering out losses and gains
cbethell Mar 12, 2020
f573daa
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 12, 2020
337a264
Add notebook to join and wrangle the cytoband bed files
cbethell Mar 12, 2020
3c827b1
Add chromosome arm field and GISTIC arm status data
cbethell Mar 13, 2020
ee4ff57
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 13, 2020
bf47620
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 13, 2020
721ab75
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 13, 2020
9fa3b85
implement @jashapiro's suggested changes
cbethell Mar 13, 2020
d215dcb
add steps for loss and gain bed files to `run-bedtools.sh`
cbethell Mar 14, 2020
5a2239d
Propagate changes to bed files to `03` nb
cbethell Mar 16, 2020
c31745d
change logic to uncompress the cytoband file once
cbethell Mar 16, 2020
05bd93a
Substitute snakemake for shell in bedtools script
jashapiro Mar 16, 2020
b60cf95
Merge remote-tracking branch 'cbethell/add-cytoband-status-with-bedto…
jashapiro Mar 16, 2020
2af767a
rerun `03` nb with updated coverage bed files
cbethell Mar 17, 2020
60147fc
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 17, 2020
69a3ee5
Update module README and start defining most focal units
cbethell Mar 18, 2020
8247e53
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 18, 2020
0c89687
Reformat final output table and rename output file
cbethell Mar 19, 2020
bcbbb09
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 19, 2020
2b6c1b2
Apply suggestions from @jashapiro code review
cbethell Mar 19, 2020
01f385b
Move addition of `chromosome_arm` step after joining original UCSC data
cbethell Mar 19, 2020
d64159b
update README to reflect addition of `band_length` column
cbethell Mar 19, 2020
807a7ad
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 19, 2020
b26cd42
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 20, 2020
e9bf7a7
remove redundant joining of original ucsc cytoband data section
cbethell Mar 20, 2020
126ea71
update usage comment and re-render html output
cbethell Mar 20, 2020
53b906d
@jashapiro's commit suggestion to include sex chromosome data
cbethell Mar 20, 2020
1b761ee
rerun module (now that UCSC file download includes sex chromosomes)
cbethell Mar 20, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 28 additions & 0 deletions analyses/focal-cn-file-preparation/02-add-ploidy-consensus.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -146,3 +146,31 @@ add_status_df %>%
output_file <- file.path("..", "..", "scratch", "consensus_seg_with_status.tsv")
write_tsv(add_status_df, output_file)
```

### Prepare separate bed files for losses/gains for bedtools coverage function

```{r}
bed_status_df <- add_status_df %>%
select(chrom, loc.start, loc.end, everything())

losses_bed_status_df <- add_status_df %>%
select(chrom, loc.start, loc.end, everything()) %>%
filter(status == "loss")

gains_bed_status_df <- add_status_df %>%
select(chrom, loc.start, loc.end, everything()) %>%
filter(status == "gain")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sure the bed tables are sorted before output.

Suggested change
bed_status_df <- add_status_df %>%
select(chrom, loc.start, loc.end, everything())
losses_bed_status_df <- add_status_df %>%
select(chrom, loc.start, loc.end, everything()) %>%
filter(status == "loss")
gains_bed_status_df <- add_status_df %>%
select(chrom, loc.start, loc.end, everything()) %>%
filter(status == "gain")
bed_status_df <- add_status_df %>%
select(chrom, loc.start, loc.end, everything()) %>%
arrange(chrom, loc.start, loc.end)
losses_bed_status_df <- bed_status_df %>%
filter(status == "loss")
gains_bed_status_df <- bed_status_df %>%
filter(status == "gain")

```

### Write to file

```{r}
bed_output_file <- file.path("..", "..", "scratch", "consensus_seg_with_status.bed")
loss_bed_output_file <- file.path("..", "..", "scratch", "consensus_seg_with_status_losses.bed")
gain_bed_output_file <- file.path("..", "..", "scratch", "consensus_seg_with_status_gains.bed")

write_tsv(bed_status_df, bed_output_file)
write_tsv(losses_bed_status_df, loss_bed_output_file)
write_tsv(gains_bed_status_df, gain_bed_output_file)
```

49 changes: 43 additions & 6 deletions analyses/focal-cn-file-preparation/02-add-ploidy-consensus.nb.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion analyses/focal-cn-file-preparation/03-prepare-cn-file.R
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
# This script is intended to be run via the command line.
# This example assumes it is being run from the root of the repository.
#
# Rscript --vanilla analyses/oncoprint-landscape/03-prepare-cn-file.R \
# Rscript --vanilla analyses/focal-cn-file-preparation/03-prepare-cn-file.R \
# --cnv_file data/pbta-cnv-controlfreec.tsv.gz \
# --gtf_file data/gencode.v27.primary_assembly.annotation.gtf.gz \
# --metadata data/pbta-histologies.tsv \
Expand Down
149 changes: 90 additions & 59 deletions analyses/focal-cn-file-preparation/run-prepare-cn.sh
Original file line number Diff line number Diff line change
Expand Up @@ -18,74 +18,105 @@ cd "$script_directory" || exit

scratch_dir=../../scratch
data_dir=../../data
results_dir=../../analyses/focal-cn-file-preparation/results
histologies_file=${data_dir}/pbta-histologies.tsv
gtf_file=${data_dir}/gencode.v27.primary_assembly.annotation.gtf.gz
goi_file=../../analyses/oncoprint-landscape/driver-lists/brain-goi-list-long.txt
independent_specimens_file=${data_dir}/independent-specimens.wgswxs.primary.tsv
ucsc_bed_file=${results_dir}/ucsc_cytoband.bed
consensus_bed_file=${scratch_dir}/consensus_seg_with_status.tsv
loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.tsv
gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.tsv
callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.tsv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these file names where changed elsewhere, so I am noting that here.

Suggested change
consensus_bed_file=${scratch_dir}/consensus_seg_with_status.tsv
loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.tsv
gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.tsv
callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.tsv
consensus_bed_file=${scratch_dir}/consensus_seg_with_status.bed
loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.bed
gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.bed
callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.bed


# Prep the consensus SEG file data
Rscript --vanilla -e "rmarkdown::render('02-add-ploidy-consensus.Rmd', clean = TRUE)"

# Run annotation step for consensus file
Rscript --vanilla 03-prepare-cn-file.R \
--cnv_file ${scratch_dir}/consensus_seg_with_status.tsv \
--gtf_file $gtf_file \
--metadata $histologies_file \
--filename_lead "consensus_seg_annotated_cn" \
--seg
# Download and save UCSC cytoband file as bed file
wget -O ${scratch_dir}/ucsc_cytoband.bed http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cytoBand.txt.gz

# Use bedtools intersect to find the intersection between the UCSC file with
# cytoband data and the `scratch/consensus_with_status.tsv` file prepared in
# `02-add-ploidy-consensus.Rmd`

libraryStrategies=("polya" "stranded")
chromosomesType=("autosomes" "x_and_y")
for strategy in ${libraryStrategies[@]}; do
bedtools coverage \
-a ${scratch_dir}/ucsc_cytoband.bed \
-b ${scratch_dir}/consensus_seg_with_status_losses.bed \
-f 0.75 \
Copy link
Member

@jashapiro jashapiro Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have sorted the -b file, then you can speed this up with -sorted (I don't think we need -f here).

The same thing applies to the two bedtools coverage calls below this one.

Suggested change
-f 0.75 \
-sorted \

> $loss_intersect_with_cytoband_file

for chromosome_type in ${chromosomesType[@]}; do
bedtools coverage \
-a ${scratch_dir}/ucsc_cytoband.bed \
-b ${scratch_dir}/consensus_seg_with_status_gains.bed \
-f 0.75 \
> $gain_intersect_with_cytoband_file

Rscript --vanilla rna-expression-validation.R \
--annotated_cnv_file results/consensus_seg_annotated_cn_${chromosome_type}.tsv.gz \
--expression_file ${data_dir}/pbta-gene-expression-rsem-fpkm-collapsed.${strategy}.rds \
--independent_specimens_file $independent_specimens_file \
--metadata $histologies_file \
--goi_list $goi_file \
--filename_lead "consensus_seg_annotated_cn"_${chromosome_type}_${strategy}
done
done
bedtools coverage \
-a ${scratch_dir}/ucsc_cytoband.bed \
-b ${scratch_dir}/consensus_seg_with_status.bed \
-f 0.75 \
> $callable_intersect_with_cytoband_file

# if we want to process the CNV data from the original callers
# (e.g., CNVkit, ControlFreeC)
if [ "$RUN_ORIGINAL" -gt "0" ]; then

# Prep the CNVkit data
Rscript --vanilla -e "rmarkdown::render('01-add-ploidy-cnvkit.Rmd', clean = TRUE)"

# Run annotation step for CNVkit
Rscript --vanilla 03-prepare-cn-file.R \
--cnv_file ${scratch_dir}/cnvkit_with_status.tsv \
--gtf_file $gtf_file \
--metadata $histologies_file \
--filename_lead "cnvkit_annotated_cn" \
--seg

# Run annotation step for ControlFreeC
Rscript --vanilla 03-prepare-cn-file.R \
--cnv_file ${data_dir}/pbta-cnv-controlfreec.tsv.gz \
--gtf_file $gtf_file \
--metadata $histologies_file \
--filename_lead "controlfreec_annotated_cn" \
--controlfreec

filenameLead=("cnvkit_annotated_cn" "controlfreec_annotated_cn")
for filename in ${filenameLead[@]}; do
for strategy in ${libraryStrategies[@]}; do
for chromosome_type in ${chromosomesType[@]}; do
Rscript --vanilla rna-expression-validation.R \
--annotated_cnv_file results/${filename}_${chromosome_type}.tsv.gz \
--expression_file ${data_dir}/pbta-gene-expression-rsem-fpkm-collapsed.${strategy}.rds \
--independent_specimens_file $independent_specimens_file \
--metadata $histologies_file \
--goi_list $goi_file \
--filename_lead ${filename}_${chromosome_type}_${strategy}
done
done
done

fi
# # Run annotation step for consensus file
# Rscript --vanilla 03-prepare-cn-file.R \
# --cnv_file ${scratch_dir}/consensus_seg_with_status.tsv \
# --gtf_file $gtf_file \
# --metadata $histologies_file \
# --filename_lead "consensus_seg_annotated_cn" \
# --seg
#
# libraryStrategies=("polya" "stranded")
# chromosomesType=("autosomes" "x_and_y")
# for strategy in ${libraryStrategies[@]}; do
#
# for chromosome_type in ${chromosomesType[@]}; do
#
# Rscript --vanilla rna-expression-validation.R \
# --annotated_cnv_file results/consensus_seg_annotated_cn_${chromosome_type}.tsv.gz \
# --expression_file ${data_dir}/pbta-gene-expression-rsem-fpkm-collapsed.${strategy}.rds \
# --independent_specimens_file $independent_specimens_file \
# --metadata $histologies_file \
# --goi_list $goi_file \
# --filename_lead "consensus_seg_annotated_cn"_${chromosome_type}_${strategy}
# done
# done
#
# # if we want to process the CNV data from the original callers
# # (e.g., CNVkit, ControlFreeC)
# if [ "$RUN_ORIGINAL" -gt "0" ]; then
#
# # Prep the CNVkit data
# Rscript --vanilla -e "rmarkdown::render('01-add-ploidy-cnvkit.Rmd', clean = TRUE)"
#
# # Run annotation step for CNVkit
# Rscript --vanilla 03-prepare-cn-file.R \
# --cnv_file ${scratch_dir}/cnvkit_with_status.tsv \
# --gtf_file $gtf_file \
# --metadata $histologies_file \
# --filename_lead "cnvkit_annotated_cn" \
# --seg
#
# # Run annotation step for ControlFreeC
# Rscript --vanilla 03-prepare-cn-file.R \
# --cnv_file ${data_dir}/pbta-cnv-controlfreec.tsv.gz \
# --gtf_file $gtf_file \
# --metadata $histologies_file \
# --filename_lead "controlfreec_annotated_cn" \
# --controlfreec
#
# filenameLead=("cnvkit_annotated_cn" "controlfreec_annotated_cn")
# for filename in ${filenameLead[@]}; do
# for strategy in ${libraryStrategies[@]}; do
# for chromosome_type in ${chromosomesType[@]}; do
# Rscript --vanilla rna-expression-validation.R \
# --annotated_cnv_file results/${filename}_${chromosome_type}.tsv.gz \
# --expression_file ${data_dir}/pbta-gene-expression-rsem-fpkm-collapsed.${strategy}.rds \
# --independent_specimens_file $independent_specimens_file \
# --metadata $histologies_file \
# --goi_list $goi_file \
# --filename_lead ${filename}_${chromosome_type}_${strategy}
# done
# done
# done
#
# fi