-
Notifications
You must be signed in to change notification settings - Fork 67
Add cytoband to copy number files using bedtools intersect #617
Changes from 2 commits
7dffdfc
e745ad9
1c4eea5
bba0864
f573daa
337a264
3c827b1
ee4ff57
bf47620
721ab75
9fa3b85
d215dcb
5a2239d
c31745d
05bd93a
b60cf95
2af767a
60147fc
69a3ee5
8247e53
0c89687
bcbbb09
2b6c1b2
01f385b
d64159b
807a7ad
b26cd42
e9bf7a7
126ea71
53b906d
1b761ee
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
# This script downloads and prepares the UCSC cytoband data to be added to the | ||
# focal CN files that result from this module. | ||
# | ||
# | ||
# Chante Bethell for CCDL 2020 | ||
# | ||
# #### Example Usage | ||
# | ||
# This script is intended to be run via the command line. | ||
# This example assumes it is being run from the root of the repository. | ||
# | ||
# Rscript --vanilla analyses/focal-cn-file-preparation/00-prepare-ucsc-cytoband-file.R | ||
|
||
#### Set Up -------------------------------------------------------------------- | ||
|
||
# Install GenomicRanges | ||
if (!("GenomicRanges" %in% installed.packages())) { | ||
BiocManager::install("GenomicRanges", update = FALSE) | ||
} | ||
|
||
# Get `magrittr` pipe | ||
`%>%` <- dplyr::`%>%` | ||
|
||
#### Directories and Files ----------------------------------------------------- | ||
|
||
# Detect the ".git" folder -- this will in the project root directory. | ||
# Use this as the root directory to ensure proper sourcing of functions no | ||
# matter where this is called from | ||
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git")) | ||
|
||
# Set path to results directory | ||
results_dir <- | ||
file.path(root_dir, "analyses", "focal-cn-file-preparation", "results") | ||
|
||
if (!dir.exists(results_dir)) { | ||
dir.create(results_dir) | ||
} | ||
|
||
#### Download and Wrangle UCSC data ------------------------------------------- | ||
|
||
# Read in UCSC cytoband data. The decision to implement the UCSC hg38 cytoband | ||
# file was made based on a comparison done between the cytoband calls in the | ||
# `org.Hs.eg.db` package and the calls in the UCSC file. We found that they | ||
# disagreed in ~11,800 calls out of ~800,000 and the `UCSC file` contains more | ||
# cytoband calls. | ||
ucsc_cytoband <- | ||
data.table::fread( | ||
"http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cytoBand.txt.gz" | ||
) | ||
|
||
# Select variables needed in the UCSC cytoband data -- must be in the | ||
# required bedtools format: chr, start, end | ||
ucsc_cytoband_bed <- ucsc_cytoband %>% | ||
dplyr::select(chr = V1, start = V2, end = V3, cytoband = V4) %>% | ||
dplyr::mutate(cytoband = paste0(gsub("chr", "", chr), cytoband), | ||
cansavvy marked this conversation as resolved.
Show resolved
Hide resolved
|
||
chr = gsub("_.*","", chr)) %>% | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are you trying to drop the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This step, in cases like: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. AHH I see. Okay cool. I think in most cases in this project we've dropped There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
An alternative that is less risky, then is to save a BED ready version of that file in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah okay, will also look into this before making the change. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. By |
||
dplyr::filter(!(chr %in% c("chrUn", "chrM"))) | ||
|
||
# Save as bed file | ||
readr::write_tsv( | ||
ucsc_cytoband_bed, | ||
file.path(results_dir, "ucsc_cytoband.bed") | ||
) | ||
|
||
#### Prepare consensus seg file ----------------------------------------------- | ||
|
||
# Read in the consensus copy number file produced in `02-add-ploid-consensus.Rmd` | ||
consensus_with_status <- | ||
readr::read_tsv(file.path(root_dir, "scratch", "consensus_seg_with_status.tsv")) | ||
|
||
# Select variables needed in the consensus copy number data -- must be in the | ||
# required bedtools format: chr, start, end | ||
consensus_with_status_bed <- consensus_with_status %>% | ||
dplyr::select(chr = chrom, start = loc.start, end = loc.end, status, Kids_First_Biospecimen_ID) | ||
|
||
# Save as bed file | ||
readr::write_tsv( | ||
consensus_with_status_bed, | ||
file.path(results_dir, "consensus_with_status.bed") | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you are dropping the gram negative/pos calls? and renaming the columns. A couple things: 1) bedtools doesn't care about column names. 2) Column names are still nice to keep track of things so you can just use a
col.names
in line 48 so that you don't have to rename them later. Then if you don't want the gram neg/pos column you can just drop it in this line. (Though I'm not sure it hurts anything to keep it around and this file isn't that big so keeping an extra column is not too big of a deal).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, will make this change.