-
Notifications
You must be signed in to change notification settings - Fork 67
Add cytoband to copy number files using bedtools intersect #617
Add cytoband to copy number files using bedtools intersect #617
Conversation
- rename script to better indicate its content - propagate change to shell script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbethell I think we can simplify this by cutting back on most of the content in 00-prepare-bed-files.R
script and going straight to the intersection bedtools bit. I would vouch for trying to get to the intersection bit quicker because some of these formatting changes you will make here will not necessarily carried over and also bedtools does not require them. I tested these items with bedtools to make sure:
- Bedtools doesn't care about your column names as long as it finds the chr, start, end in those first three columns in that order (I see your comments mentioned this part), but what you name those columns makes no difference to bedtools.
- Bedtools doesn't care if you have extra columns after
chr
start
andend
so you don't need to remove columns after it.
Bedtools documentation is pretty good, so you may want to poke around in there a bit (if you have not already): https://bedtools.readthedocs.io/en/latest/content/overview.html
But the other bit of advice I'd have is just try to run bedtools first and it gives pretty good error messages if the data is formatted wrong. For example, when I ran cytoband_with_status.tsv
without moving the biospecimen ID from the first column, it told me:
- Please ensure that your file is TAB delimited (e.g., cat -t FILE).
- Also ensure that your file has integer chromosome coordinates in the
expected columns (e.g., cols 2 and 3 for BED).
As bedtools mentions here, there is one outstanding formatting issue we need to resolve to be able to run bedtools off the bat. That pesky Kids_First_Biospecimen_ID
column needs to be moved to be after from the first three columns.
IF it doesn't throw any downstream analyses off (this is an if of unknown size) then you could just alter this dplyr::select line to move the biospecimen id to the end. And then you can use the consensus_seg_with_status.tsv
file with bedtools exactly how it is. However, I do not know how much that will throw things off, so you should look into that before using this idea.
ucsc_cytoband_bed <- ucsc_cytoband %>% | ||
dplyr::select(chr = V1, start = V2, end = V3, cytoband = V4) %>% | ||
dplyr::mutate(cytoband = paste0(gsub("chr", "", chr), cytoband), | ||
chr = gsub("_.*","", chr)) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you trying to drop the chr
on the chromosome column? If yes, just use gsub("chr","", chr)
. But if that is not what you are doing, can you explain what your goal is here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This step, in cases like: chr10_GL383545v1_alt
, is removing everything after the first _
character inclusive. This is done to make it comparable with the consensus_with_status.tsv
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AHH I see. Okay cool. I think in most cases in this project we've dropped _alt
chromosomes, but I could be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cansavvy re:
IF it doesn't throw any downstream analyses off (this is an if of unknown size) then you could just alter this dplyr::select line to move the biospecimen id to the end. And then you can use the
consensus_seg_with_status.tsv
file with bedtools exactly how it is. However, I do not know how much that will throw things off, so you should look into that before using this idea.I did give this a thought before filing the PR but was worried about downstream analyses, but will look into it and let you know what I find.
An alternative that is less risky, then is to save a BED ready version of that file in scratch
directory at the end of that 02 notebook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, will also look into this before making the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By BED ready version
, that only means moving that biospecimen column out of the first 3 column spots.
# Select variables needed in the UCSC cytoband data -- must be in the | ||
# required bedtools format: chr, start, end | ||
ucsc_cytoband_bed <- ucsc_cytoband %>% | ||
dplyr::select(chr = V1, start = V2, end = V3, cytoband = V4) %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you are dropping the gram negative/pos calls? and renaming the columns. A couple things: 1) bedtools doesn't care about column names. 2) Column names are still nice to keep track of things so you can just use a col.names
in line 48 so that you don't have to rename them later. Then if you don't want the gram neg/pos column you can just drop it in this line. (Though I'm not sure it hurts anything to keep it around and this file isn't that big so keeping an extra column is not too big of a deal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, will make this change.
@cansavvy re:
I did give this a thought before filing the PR but was worried about downstream analyses, but will look into it and let you know what I find. |
More specifically @cbethell, given your comments and findings here, here's what I'd suggest: Step 1) Change the 02-add-ploidy-consensus.Rmd to save a " Step 2) Have your already Step 3) Continue on with the intersect steps you have laid out in your bash script. Though we should discuss if -wa option is what you want here. See this doc |
|
I'm not clear on what information |
Gotcha. Okay. I think my confusion may have been over what our end file results goal here was. We want UCSC's cytoband reports and do not care about losing cytobands as they are reported by our data. |
Upon further discussion with @jashapiro @cansavvy @jaclyn-taroni, the following steps will be taken:
|
I don't think you need to annotate these files with genes, rather you want to annotate them with status: |
Okay, so the final output we want here is a table with fields similar to the following:
Where |
I think callable would be any range that is within |
I think you might want every cytoband in there, so I would have 4 possibilities for status:
|
This seems like it would be the most flexible for downstream analyses, but I believe it may require you to read an additional file in (the original cytoband file) to capture the |
An alternative worth exploring might be to use https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html This could allow you to make a table like the following by combining three separate coverage calls
This would give flexibility on choosing cutoffs later without rerunning the whole thing. |
Nice, I'll look into implementing this tool. My understanding here is that |
I think you could skip the |
- remove 00 script - generate consensus bed files in 02 nb, one for the whole consensus seg file, one filtered for losses, and one filtered for gains (these files are saved in the project's scratch directory) - implement bedtools coverage to add cytoband data from the UCSC cytoband file to the regions with status calls - comment out the rest of the shell script for development purposes
@jashapiro I tried to implement this in the last commit, but was unsuccessful thus far as the script seems to get stuck at the first |
@cbethell I am not sure what would be happening there, but I have a couple of quick thoughts. One is that you should not need the |
bed_status_df <- add_status_df %>% | ||
select(chrom, loc.start, loc.end, everything()) | ||
|
||
losses_bed_status_df <- add_status_df %>% | ||
select(chrom, loc.start, loc.end, everything()) %>% | ||
filter(status == "loss") | ||
|
||
gains_bed_status_df <- add_status_df %>% | ||
select(chrom, loc.start, loc.end, everything()) %>% | ||
filter(status == "gain") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sure the bed tables are sorted before output.
bed_status_df <- add_status_df %>% | |
select(chrom, loc.start, loc.end, everything()) | |
losses_bed_status_df <- add_status_df %>% | |
select(chrom, loc.start, loc.end, everything()) %>% | |
filter(status == "loss") | |
gains_bed_status_df <- add_status_df %>% | |
select(chrom, loc.start, loc.end, everything()) %>% | |
filter(status == "gain") | |
bed_status_df <- add_status_df %>% | |
select(chrom, loc.start, loc.end, everything()) %>% | |
arrange(chrom, loc.start, loc.end) | |
losses_bed_status_df <- bed_status_df %>% | |
filter(status == "loss") | |
gains_bed_status_df <- bed_status_df %>% | |
filter(status == "gain") |
consensus_bed_file=${scratch_dir}/consensus_seg_with_status.tsv | ||
loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.tsv | ||
gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.tsv | ||
callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.tsv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these file names where changed elsewhere, so I am noting that here.
consensus_bed_file=${scratch_dir}/consensus_seg_with_status.tsv | |
loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.tsv | |
gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.tsv | |
callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.tsv | |
consensus_bed_file=${scratch_dir}/consensus_seg_with_status.bed | |
loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.bed | |
gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.bed | |
callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.bed |
bedtools coverage \ | ||
-a ${scratch_dir}/ucsc_cytoband.bed \ | ||
-b ${scratch_dir}/consensus_seg_with_status_losses.bed \ | ||
-f 0.75 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you have sorted the -b
file, then you can speed this up with -sorted
(I don't think we need -f
here).
The same thing applies to the two bedtools coverage
calls below this one.
-f 0.75 \ | |
-sorted \ |
- remove `-f` flag from bedtools coverage and add `-sorted` flag - rerun - change `tsv` files to `bed` files
- wrangle the data in the cytoband status bed files in `run-prepare-cn.sh` - add `dominant_status` column to final table - add `uncallable` regions from original cytoband file
The notebook added in the last commit can be seen here. In this notebook so far,
|
- add nb to start to define the most focal units of recurrent CNVs (this nb crashes locally at the `right_join` step but I wanted to get an opinion on whether or not this is what we want to do here) - add step to save CN file with UCSC cytoband and dominant status data - update the module README to reflect changes and make not on the processing speed of the shell script - remove `run-bedtools.sh` which was replaced with `run-bedtools.snakemake` - rename prepare cn file R script to be `04` and propagate this change in the shell script
In the last commit, I added a notebook named Note: This notebook is beyond the scope of this PR and was added to demonstrate my thought process on how we would want to tackle the part of the analysis that focuses on finding the most focal units. The reason I added it to this PR is because it determines the final file that comes out of this PR. In other words, should this PR save a file of our CN calls with If not, what should the final file of this PR look like with the next step being defining the most focal units in mind? |
In this description, it is not clear to me how we would distinguish an arm loss from a cytoband loss. My kneejerk reaction is that you would want to generate arm status separately using a similar approach to how the cytoband calls were generated (you may want to supply a different value to |
I think if we have all of the cytoband calls, we can readily combine them to identify arm losses with some simple |
Okay sounds good 👍 |
This is what I wanted to get to the bottom of, the final output table that we would like to see from this PR. So to make sure that we are on the same page here, the final table from this PR should include the following fields:
Is that correct? Should I also break out the GISTIC comparison section of this PR into its own separate notebook and PR? |
That seems right to me. You also had a column for chromosome arm in there in the last version I looked at, and I would leave that in. I might change the I do think I would save the gistic comparison for a separate PR. |
Okay sounds good, these changes will be in the upcoming commit. |
- add sample IDs to final output table - remove GISTIC comparison section of nb and remove output file from this section - remove `05` nb and rendered output - rerun `03` nb - update README to reflect changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! Just a few small changes to suggest.
One is adding in band_length as a column we want to maintain in the output. The other is to make sure all output rows have a chromosome arm. Otherwise, this should be pretty ready to go!
### Add chromosome arm column | ||
|
||
```{r} | ||
# Add a column that tells us the position of the p or q and then use this to | ||
# split the cytoband column | ||
final_df <- final_df %>% | ||
mutate(cytoband_with_arm = paste0(gsub("chr", "", chr), cytoband), | ||
chromosome_arm = gsub("(p|q).*", "\\1", cytoband_with_arm)) %>% | ||
select(-cytoband_with_arm) | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This step should be moved to after the UCSC data is merged back in, otherwise those uncallable cytobands might not have arm assignments.
analyses/focal-cn-file-preparation/03-add-cytoband-status-consensus.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/03-add-cytoband-status-consensus.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/03-add-cytoband-status-consensus.Rmd
Outdated
Show resolved
Hide resolved
| `Kids_First_Biospecimen_ID` | chr | cytoband | dominant_status | callable_fraction | gain_fraction | loss_fraction | chromosome_arm | | ||
|----------------|--------|-------------|--------|---------|-------------|---------|---------------| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will want to update this to reflect any added columns with changes above.
- add the `band_length` field as above - remove non-canonical chromosomes (and mitochondria) Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
- rerun nb after making changes to include `band_length` - remove unnecessary `replace_na` step
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I am approving this, but I have two additional comments...
One is whether we need to add in the ucsc cytobands at all, as the bedtools coverage
results really should have all the ones we care about, and it looks like the merge is resulting in no added rows now.
I also did a bit of spot checking, and there does seem to be something a bit funny going on with sample BS_CBMAWSAR. For some reason, that one is not getting "uncallable" calls as I would expect (if one sample is uncallable in a region, all of them should be).
For example, here:
BS_JTBM5TSE chr1 p11.2 uncallable 1300000 0.2165369 0 0.2165369 1p
BS_DE26D072 chr1 p11.2 uncallable 1300000 0.2165369 0 0.2165369 1p
BS_CBMAWSAR chr1 p11.2 neutral 1300000 1 0 0 1p
BS_ZS1QRMXS chr1 p11.2 uncallable 1300000 0.2165369 0 0 1p
BS_5R0HHQ1Y chr1 p11.2 uncallable 1300000 0.2165369 0 0 1p
So something is funny there, which may warrant investigation. Most likely it isn't showing up in one of the original tables for some reason? Or it shouldn't be included at all?
### Read in and join original UCSC cytoband data | ||
|
||
This step is to define any additional uncallable cytoband regions. | ||
|
||
```{r message = FALSE} | ||
ucsc_cytoband_df <- data.table::fread("http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cytoBand.txt.gz", data.table = FALSE) | ||
|
||
ucsc_cytoband_df <- ucsc_cytoband_df %>% | ||
mutate(band_length = V3 - V2) %>% | ||
select(chr = V1, cytoband = V4, band_length) %>% | ||
filter(chr %in% paste0("chr", 1:23)) # remove nonstandard chroms. | ||
|
||
# Join the cytoband data from the original UCSC cytoband file with the data | ||
# bedtools coverage files wrangled in the steps above | ||
final_df <- final_df %>% | ||
right_join(ucsc_cytoband_df, by = c("chr", "cytoband", "band_length")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realizing this whole step may be redundant, since the bedtools coverage
is including all of the ucsc cytobands already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I will remove this section.
@jashapiro After some investigation, I found that the BS_CBMAWSAR sample, with the addition of three other samples( namely, BS_H50HR85Y, BS_FF73TT6D, and BS_80078QDG), all have a callable ratio of 1 for this region compared to the remainder of the samples with a callable ratio of 0.2165369. However, these samples appear to agree with all other samples of the cohort in other cytoband regions. Would you still recommend removing these samples? |
I don't think anything different should be done with these
No, I would leave them. I have some guesses why that is happening, but I don't think it should be a major issue in general. My guess is that it is where there is a very large segment that crosses an uncallable region. Shouldn't happen that often, but should be something we are aware of, I guess. We will probably want to add logic at some point that says if a band is mostly uncallable, then it should be uncallable for all samples. But other bands for that sample should stay as they are. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks almost all set to me!
I realized that I made a mistake in the UCSC file download and filtered out sex chromosomes, so we are missing some of the data at the moment. So the last step here should be to incorporate my one suggestion and rerun everything (which will be slow, but ah well.)
Otherwise, approved!
Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
Purpose/implementation Section
The purpose of this PR is to generate cytoband copy number status consensus files for consumption by downstream analyses.
What scientific question is your analysis addressing?
As noted in the original comment on PR #497, the current annotated output of focal-cn-file-preparation (e.g., the contents of results) contains some information about cytobands. However, the cytoband status sometimes disagree with the GISTIC arm status (GISTIC cutoff is 0.98 of arm for an event). Using the approach in this PR, we can hopefully this issue more directly and hopefully find more agreement between cytoband status and arm status.
What was your approach?
My approach was to prepare the cytoband file retrieved from the UCSC database and the
consensus_seg_with_status.tsv
file prepared in02-add-ploidy-consensus.Rmd
to be in format required by bedtools functions. I also separated theconsensus_seg_with_status.tsv
file into gains and losses and saved these as individual bed files.I then used bedtools coverage to retrieve the coverage ratio for each of this files using the UCSC cytoband bed file.
In
03-add-cytoband-status-consensus.Rmd
, I merge the data produced using bedtools coverage and denote the dominant status for each chromosome arm using this data. I then add and compare GISTIC's arm status data. Should this be broken up into 2 separate notebooks?What GitHub issue does your pull request address?
This PR addresses issue #497.
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
dominant_status
for our consensus calls should also receive some close attention.Is there anything that you want to discuss further?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes, this is ready for review.
Results
What types of results are included (e.g., table, figure)?
Currently, this PR produces a number of intermediate
bed
files stored in the project's scratch directory, but one final results file as follows:results/consensus_seg_gistic_cytoband_status.tsv
This results file contains the chromosome arm and the dominant status calls for our consensus data and for GISTIC's data.
Should I be saving this file before the addition of the GISTIC data?
What is your summary of the results?
The chromosome arm calls appear to agree for a total of 35 out of 48 instances (this total excludes the
_alt
chromosomes that I left in the final file for the purpose of being thorough -- these chromosomes are labeleduncallable
).Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.