This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Add cytoband to copy number files using bedtools intersect #617

Merged
Changes from 2 commits
31 commits
7dffdfc
Add cytoband data using bedtools intersect with UCSC cytoband file
cbethell Mar 9, 2020
e745ad9
Update comments
cbethell Mar 9, 2020
1c4eea5
@cansavvy and @jashapiro suggested changes
cbethell Mar 11, 2020
bba0864
sort before filtering out losses and gains
cbethell Mar 12, 2020
f573daa
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 12, 2020
337a264
Add notebook to join and wrangle the cytoband bed files
cbethell Mar 12, 2020
3c827b1
Add chromosome arm field and GISTIC arm status data
cbethell Mar 13, 2020
ee4ff57
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 13, 2020
bf47620
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 13, 2020
721ab75
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 13, 2020
9fa3b85
implement @jashapiro's suggested changes
cbethell Mar 13, 2020
d215dcb
add steps for loss and gain bed files to `run-bedtools.sh`
cbethell Mar 14, 2020
5a2239d
Propagate changes to bed files to `03` nb
cbethell Mar 16, 2020
c31745d
change logic to uncompress the cytoband file once
cbethell Mar 16, 2020
05bd93a
Substitute snakemake for shell in bedtools script
jashapiro Mar 16, 2020
b60cf95
Merge remote-tracking branch 'cbethell/add-cytoband-status-with-bedto…
jashapiro Mar 16, 2020
2af767a
rerun `03` nb with updated coverage bed files
cbethell Mar 17, 2020
60147fc
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 17, 2020
69a3ee5
Update module README and start defining most focal units
cbethell Mar 18, 2020
8247e53
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 18, 2020
0c89687
Reformat final output table and rename output file
cbethell Mar 19, 2020
bcbbb09
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 19, 2020
2b6c1b2
Apply suggestions from @jashapiro code review
cbethell Mar 19, 2020
01f385b
Move addition of `chromosome_arm` step after joining original UCSC data
cbethell Mar 19, 2020
d64159b
update README to reflect addition of `band_length` column
cbethell Mar 19, 2020
807a7ad
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 19, 2020
b26cd42
Merge branch 'master' into add-cytoband-status-with-bedtools
cbethell Mar 20, 2020
e9bf7a7
remove redundant joining of original ucsc cytoband data section
cbethell Mar 20, 2020
126ea71
update usage comment and re-render html output
cbethell Mar 20, 2020
53b906d
@jashapiro's commit suggestion to include sex chromosome data
cbethell Mar 20, 2020
1b761ee
rerun module (now that UCSC file download includes sex chromosomes)
cbethell Mar 20, 2020
80 changes: 80 additions & 0 deletions analyses/focal-cn-file-preparation/00-prepare-bed-files.R
@@ -0,0 +1,80 @@
# This script downloads and prepares the UCSC cytoband data to be added to the
# focal CN files that result from this module.
#
#
# Chante Bethell for CCDL 2020
#
# #### Example Usage
#
# This script is intended to be run via the command line.
# This example assumes it is being run from the root of the repository.
#
# Rscript --vanilla analyses/focal-cn-file-preparation/00-prepare-bed-files.R

#### Set Up --------------------------------------------------------------------

# Install GenomicRanges
if (!("GenomicRanges" %in% installed.packages())) {
  BiocManager::install("GenomicRanges", update = FALSE)
}

# Get `magrittr` pipe
`%>%` <- dplyr::`%>%`

#### Directories and Files -----------------------------------------------------

# Detect the ".git" folder -- this will be in the project root directory.
# Use this as the root directory to ensure proper sourcing of functions no
# matter where this is called from
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))

# Set path to results directory
results_dir <-
  file.path(root_dir, "analyses", "focal-cn-file-preparation", "results")

if (!dir.exists(results_dir)) {
  dir.create(results_dir)
}

#### Download and Wrangle UCSC data -------------------------------------------

# Read in UCSC cytoband data. We chose the UCSC hg38 cytoband file after
# comparing the cytoband calls in the `org.Hs.eg.db` package with the calls
# in the UCSC file. The two sources disagreed in ~11,800 calls out of
# ~800,000, and the UCSC file contains more cytoband calls than the
# `org.Hs.eg.db` package does.
ucsc_cytoband <-
  data.table::fread(
    "http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cytoBand.txt.gz"
  )

# Select variables needed in the UCSC cytoband data -- must be in the
# required bedtools format: chr, start, end
ucsc_cytoband_bed <- ucsc_cytoband %>%
dplyr::select(chr = V1, start = V2, end = V3, cytoband = V4) %>%
Collaborator

Looks like you are dropping the Giemsa stain (gneg/gpos) calls and renaming the columns. A couple of things: 1) bedtools doesn't care about column names. 2) Column names are still nice for keeping track of things, so you can pass col.names to the data.table::fread() call on line 48 and avoid renaming them later. Then, if you don't want the stain column, you can just drop it in this line. (Though I'm not sure it hurts anything to keep it around; this file isn't that big, so keeping an extra column is not a big deal.)

Contributor Author

Gotcha, will make this change.
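
For reference, a minimal sketch of the `col.names` suggestion above. This is not the PR's final code: the three inline rows are illustrative stand-ins for the UCSC download so the example runs offline, and the column names (including `stain`) are assumptions chosen for this example.

```r
# Sketch only: illustrative rows stand in for cytoBand.txt.gz.
cytoband_text <- paste(
  "chr1\t0\t2300000\tp36.33\tgneg",
  "chr1\t2300000\t5300000\tp36.32\tgpos25",
  "chr2\t0\t4400000\tp25.3\tgneg",
  sep = "\n"
)

# Name the columns at read time instead of renaming V1..V5 afterwards.
# The names below are assumed for illustration.
ucsc_cytoband <- data.table::fread(
  text = cytoband_text,
  sep = "\t",
  header = FALSE,
  col.names = c("chr", "start", "end", "cytoband", "stain")
)

# Drop the stain column if it is not needed downstream.
ucsc_cytoband_bed <-
  ucsc_cytoband[, c("chr", "start", "end", "cytoband"), with = FALSE]

print(ucsc_cytoband_bed)
```

With the real file, the `text` argument would be replaced by the download URL; the other arguments should carry over unchanged.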

dplyr::mutate(cytoband = paste0(gsub("chr", "", chr), cytoband),
cansavvy marked this conversation as resolved.
chr = gsub("_.*","", chr)) %>%
Collaborator

Are you trying to drop the chr on the chromosome column? If yes, just use gsub("chr","", chr). But if that is not what you are doing, can you explain what your goal is here?

Contributor Author

This step, in cases like chr10_GL383545v1_alt, removes everything from the first _ character onward (inclusive). This makes the chromosome names comparable with those in the consensus_seg_with_status.tsv file.
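
A small sketch of what that substitution does, using base R only. The contig names are illustrative inputs, not a claim about which contigs appear in the file.

```r
# Illustrative chromosome names, including alternate and unplaced contigs.
chr_names <- c("chr1", "chr10_GL383545v1_alt", "chrUn_KI270302v1", "chrX")

# "_.*" matches the first underscore and everything after it, so the
# replacement keeps only the leading chromosome label.
trimmed <- gsub("_.*", "", chr_names)
print(trimmed)

# After trimming, unplaced contigs collapse to "chrUn", which is why the
# later filter on c("chrUn", "chrM") can remove them.
kept <- trimmed[!(trimmed %in% c("chrUn", "chrM"))]
print(kept)
```

Note that rows from alternate contigs keep their original coordinates but now carry the parent chromosome's name, which is worth keeping in mind when intersecting.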

Collaborator

AHH I see. Okay cool. I think in most cases in this project we've dropped _alt chromosomes, but I could be wrong.

Collaborator

@cansavvy re:

IF it doesn't throw any downstream analyses off (this is an if of unknown size) then you could just alter this dplyr::select line to move the biospecimen id to the end. And then you can use the consensus_seg_with_status.tsv file with bedtools exactly how it is. However, I do not know how much that will throw things off, so you should look into that before using this idea.

I did give this a thought before filing the PR but was worried about downstream analyses, but will look into it and let you know what I find.

A less risky alternative, then, is to save a BED-ready version of that file in the scratch directory at the end of that 02 notebook.

Contributor Author

Ah okay, will also look into this before making the change.

Collaborator @cansavvy, Mar 11, 2020

By "BED-ready version" I only mean moving that biospecimen column out of the first three column positions.
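
One way this reordering could look, sketched in base R. The toy data frame and its biospecimen IDs are hypothetical, and the assumption that the ID column comes first in the real file is inferred from this thread rather than checked against the file.

```r
# Hypothetical stand-in for consensus_seg_with_status.tsv; the IDs and
# coordinates are made up for illustration.
consensus_with_status <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_EXAMPLE1", "BS_EXAMPLE2"),
  chrom = c("chr1", "chr2"),
  loc.start = c(1000L, 5000L),
  loc.end = c(2000L, 9000L),
  status = c("gain", "loss"),
  stringsAsFactors = FALSE
)

# bedtools reads the first three columns as chromosome, start, and end, so
# put those first and keep every other column after them in original order.
bed_cols <- c("chrom", "loc.start", "loc.end")
other_cols <- setdiff(names(consensus_with_status), bed_cols)
bed_ready <- consensus_with_status[, c(bed_cols, other_cols)]

print(names(bed_ready))
```

No columns are renamed or dropped here, so anything downstream that selects columns by name should be unaffected; code that selects by position would need checking.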

dplyr::filter(!(chr %in% c("chrUn", "chrM")))

# Save as bed file
readr::write_tsv(
  ucsc_cytoband_bed,
  file.path(results_dir, "ucsc_cytoband.bed")
)

#### Prepare consensus seg file -----------------------------------------------

# Read in the consensus copy number file produced in `02-add-ploidy-consensus.Rmd`
consensus_with_status <-
  readr::read_tsv(file.path(root_dir, "scratch", "consensus_seg_with_status.tsv"))

# Select variables needed in the consensus copy number data -- must be in the
# required bedtools format: chr, start, end
consensus_with_status_bed <- consensus_with_status %>%
  dplyr::select(chr = chrom, start = loc.start, end = loc.end, status, Kids_First_Biospecimen_ID)

# Save as bed file
readr::write_tsv(
  consensus_with_status_bed,
  file.path(results_dir, "consensus_with_status.bed")
)