This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Add cytoband to copy number files using bedtools intersect #617

Merged

Contributor

@cbethell cbethell commented Mar 9, 2020

Purpose/implementation Section

The purpose of this PR is to generate cytoband copy number status consensus files for consumption by downstream analyses.

What scientific question is your analysis addressing?

As noted in the original comment on issue #497, the current annotated output of focal-cn-file-preparation (e.g., the contents of results) contains some information about cytobands. However, the cytoband status sometimes disagrees with the GISTIC arm status (GISTIC's cutoff is 0.98 of the arm for an event). Using the approach in this PR, we can hopefully address this issue more directly and find more agreement between cytoband status and arm status.

What was your approach?

My approach was to prepare the cytoband file retrieved from the UCSC database and the consensus_seg_with_status.tsv file prepared in 02-add-ploidy-consensus.Rmd so that both are in the format required by bedtools functions. I also separated the consensus_seg_with_status.tsv file into gains and losses and saved these as individual bed files.

I then used bedtools coverage to retrieve the coverage ratio for each of these files using the UCSC cytoband bed file.

In 03-add-cytoband-status-consensus.Rmd, I merge the data produced using bedtools coverage and denote the dominant status for each chromosome arm using this data. I then add and compare GISTIC's arm status data. Should this be broken up into 2 separate notebooks?

What GitHub issue does your pull request address?

This PR addresses issue #497.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • The implementation of the bedtools coverage function and the file this produces should receive a particularly close look.
  • The logic determining the dominant_status for our consensus calls should also receive some close attention.

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, this is ready for review.

Results

What types of results are included (e.g., table, figure)?

Currently, this PR produces a number of intermediate bed files stored in the project's scratch directory, but one final results file as follows:

  • results/consensus_seg_gistic_cytoband_status.tsv

This results file contains the chromosome arm and the dominant status calls for our consensus data and for GISTIC's data.

Should I be saving this file before the addition of the GISTIC data?

What is your summary of the results?

The chromosome arm calls appear to agree for a total of 35 out of 48 instances (this total excludes the _alt chromosomes that I left in the final file for the purpose of being thorough -- these chromosomes are labeled uncallable).

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

- rename script to better indicate its content
- propagate change to shell script
Collaborator

@cansavvy cansavvy left a comment

@cbethell I think we can simplify this by cutting back on most of the content in the 00-prepare-bed-files.R script and going straight to the bedtools intersection bit. I would vouch for getting to the intersection bit quicker because some of the formatting changes you make here will not necessarily be carried over, and bedtools does not require them. I tested these items with bedtools to make sure:

  1. Bedtools doesn't care about your column names as long as it finds the chr, start, end in those first three columns in that order (I see your comments mentioned this part), but what you name those columns makes no difference to bedtools.
  2. Bedtools doesn't care if you have extra columns after chr start and end so you don't need to remove columns after it.

Bedtools documentation is pretty good, so you may want to poke around in there a bit (if you have not already): https://bedtools.readthedocs.io/en/latest/content/overview.html
But the other bit of advice I'd have is just try to run bedtools first and it gives pretty good error messages if the data is formatted wrong. For example, when I ran cytoband_with_status.tsv without moving the biospecimen ID from the first column, it told me:

- Please ensure that your file is TAB delimited (e.g., cat -t FILE).
- Also ensure that your file has integer chromosome coordinates in the
  expected columns (e.g., cols 2 and 3 for BED).

As bedtools mentions here, there is one outstanding formatting issue we need to resolve to be able to run bedtools off the bat. That pesky Kids_First_Biospecimen_ID column needs to be moved so that it comes after the first three columns.

IF it doesn't throw any downstream analyses off (this is an if of unknown size) then you could just alter this dplyr::select line to move the biospecimen id to the end. And then you can use the consensus_seg_with_status.tsv file with bedtools exactly how it is. However, I do not know how much that will throw things off, so you should look into that before using this idea.

ucsc_cytoband_bed <- ucsc_cytoband %>%
  dplyr::select(chr = V1, start = V2, end = V3, cytoband = V4) %>%
  dplyr::mutate(cytoband = paste0(gsub("chr", "", chr), cytoband),
                chr = gsub("_.*","", chr)) %>%
Collaborator

Are you trying to drop the chr on the chromosome column? If yes, just use gsub("chr","", chr). But if that is not what you are doing, can you explain what your goal is here?

Contributor Author

This step, in cases like chr10_GL383545v1_alt, removes the first _ character and everything after it. This is done to make it comparable with the consensus_with_status.tsv file.
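
For reference, the effect of the `gsub("_.*", "", chr)` call can be reproduced outside R. The following is a minimal shell sketch, not code from this PR; it assumes a tab-delimited file with the chromosome name in the first column, and the function name `strip_contig_suffix` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: in column 1 only, drop the first "_" and everything
# after it, mirroring gsub("_.*", "", chr) from the R snippet under review.
strip_contig_suffix() {
  awk 'BEGIN { FS = OFS = "\t" } { sub(/_.*/, "", $1); print }'
}

# Example rows (made up): one alt contig and one primary chromosome.
printf 'chr10_GL383545v1_alt\t0\t100\tp11\nchr7\t5\t9\tq11\n' | strip_contig_suffix
```

After this step an alt contig row carries the same name as its primary chromosome, so it is worth deciding whether such rows should be kept at all.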

Collaborator

AHH I see. Okay cool. I think in most cases in this project we've dropped _alt chromosomes, but I could be wrong.

Collaborator

@cansavvy re:

IF it doesn't throw any downstream analyses off (this is an if of unknown size) then you could just alter this dplyr::select line to move the biospecimen id to the end. And then you can use the consensus_seg_with_status.tsv file with bedtools exactly how it is. However, I do not know how much that will throw things off, so you should look into that before using this idea.

I did give this a thought before filing the PR but was worried about downstream analyses, but will look into it and let you know what I find.

A less risky alternative, then, is to save a BED-ready version of that file in the scratch directory at the end of the 02 notebook.

Contributor Author

Ah okay, will also look into this before making the change.

Collaborator

@cansavvy cansavvy Mar 11, 2020

By BED ready version, that only means moving that biospecimen column out of the first 3 column spots.
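
A minimal sketch of what "BED ready" could look like at the shell level, assuming the ID is the first tab-separated column and the chromosome, start, and end columns follow it directly. The function name and the example row are hypothetical; the thread itself suggests doing this with dplyr::select in the 02 notebook.

```shell
#!/bin/sh
# Hypothetical helper: move the first column (assumed to hold
# Kids_First_Biospecimen_ID) to the end of every row, so that
# chr, start, end become columns 1 to 3.
move_first_column_last() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    first = $1
    for (i = 1; i < NF; i++) $i = $(i + 1)   # shift every field one to the left
    $NF = first                              # put the ID in the last position
    print
  }'
}

# Example row (made up): ID, chrom, loc.start, loc.end, status
printf 'BS_EXAMPLE1\tchr1\t100\t200\tloss\n' | move_first_column_last
```

Any header line is shifted the same way, so the column names stay aligned with their values.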

analyses/focal-cn-file-preparation/00-prepare-bed-files.R Outdated
# Select variables needed in the UCSC cytoband data -- must be in the
# required bedtools format: chr, start, end
ucsc_cytoband_bed <- ucsc_cytoband %>%
  dplyr::select(chr = V1, start = V2, end = V3, cytoband = V4) %>%
Collaborator

Looks like you are dropping the Giemsa stain (gneg/gpos) calls and renaming the columns. A couple of things: 1) bedtools doesn't care about column names. 2) Column names are still nice for keeping track of things, so you can just use col.names in line 48 so that you don't have to rename them later. Then, if you don't want the stain column, you can just drop it in this line. (Though I'm not sure it hurts anything to keep it around, and this file isn't that big, so keeping an extra column is not too big of a deal.)

Contributor Author

Gotcha, will make this change.

@cbethell
Contributor Author

@cansavvy re:

IF it doesn't throw any downstream analyses off (this is an if of unknown size) then you could just alter this dplyr::select line to move the biospecimen id to the end. And then you can use the consensus_seg_with_status.tsv file with bedtools exactly how it is. However, I do not know how much that will throw things off, so you should look into that before using this idea.

I did give this a thought before filing the PR but was worried about downstream analyses, but will look into it and let you know what I find.

@cansavvy
Collaborator

More specifically @cbethell, given your comments and findings here, here's what I'd suggest:

Step 1) Change 02-add-ploidy-consensus.Rmd to save a "bedtools ready" version of the cytoband_with_status.tsv file in scratch, where you move the Kids_First_Biospecimen_ID column to the end of the file using this kind of dplyr::select line

Step 2) Have your existing run-prepare-cn-file.sh bash script wget the UCSC cytoband file with the URL you have in this script.

Step 3) Continue on with the intersect steps you have laid out in your bash script. Though we should discuss if -wa option is what you want here. See this doc

@jaclyn-taroni
Member

-wa was suggested by @jashapiro and is in #497, but is probably worth discussion / re-evaluating so everyone is on the same page.

@jashapiro
Member

I'm not clear on what information intersect_with_cytoband.tsv is supposed to capture? (Also, since it is still a .bed file it should probably be named that.) As it stands (bedtools intersect -wa -f 0.1), it is capturing which cytobands have at least 10% of their length present in the consensus status file, but it is not retaining any information about which status (and seems to be super redundant, if the current file is correct).

@cansavvy
Collaborator

I'm not clear on what information intersect_with_cytoband.tsv is supposed to capture? (Also, since it is still a .bed file it should probably be named that.) As it stands (bedtools intersect -wa -f 0.1), it is capturing which cytobands have at least 10% of their length present in the consensus status file, but it is not retaining any information about which status (and seems to be super redundant, if the current file is correct).

Gotcha. Okay. I think my confusion may have been over what our end file results goal here was. We want UCSC's cytoband reports and do not care about losing cytobands as they are reported by our data.

@cbethell
Contributor Author

Upon further discussion with @jashapiro @cansavvy @jaclyn-taroni, the following steps will be taken:

  1. Use bedtools subtract -A -f to filter out cytobands from the UCSC cytoband file with too much uncalled data.
  2. Save separate bed files for the losses and gains reported in consensus_seg_with_status.tsv.
  3. Use bedtools intersect -wa -f 0.75 between the filtered callable cytobands file (provided to the -a flag) and each of the consensus seg bed files (one for losses and one for gains, provided to the -b flag).
  4. Then read these files into R and annotate these files with genes.

@jaclyn-taroni
Member

  1. Then read these files into R and annotate these files with genes.

I don't think you need to annotate these files with genes, rather you want to annotate them with status: callable, loss, gain

@cbethell
Contributor Author

I don't think you need to annotate these files with genes, rather you want to annotate them with status: callable, loss, gain

Okay, so the final output we want here is a table with fields similar to the following:

chr start end cytoband status
chr1 0 2300000 p36.33 loss
chr4 106700000 113200000 q25 callable

Where callable could be a possible neutral call?

@cansavvy
Collaborator

I don't think you need to annotate these files with genes, rather you want to annotate them with status: callable, loss, gain

Okay, so the final output we want here is a table with fields similar to the following:

chr start end cytoband status
chr1 0 2300000 p36.33 loss
chr4 106700000 113200000 q25 callable
Where callable could be a possible neutral call?

I think callable would be any range that is within callable regions but not called as gain or loss.

@jashapiro
Member

I think you might want every cytoband in there, so I would have 4 possibilities for status: loss, neutral, gain, and uncalled

chr start end cytoband status
chr1 0 2300000 p36.33 loss
chr4 106700000 113200000 q25 callable

@jaclyn-taroni
Member

I think you might want every cytoband in there, so I would have 4 possibilities for status: loss, neutral, gain, and uncalled

This seems like it would be the most flexible for downstream analyses, but I believe it may require you to read an additional file in (the original cytoband file) to capture the uncalled.

@jashapiro
Member

jashapiro commented Mar 11, 2020

An alternative worth exploring might be to use bedtools coverage to capture the actual fraction of a cytoband that is covered by loss, gain, or uncalled.

https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html

This could allow you to make a table like the following by combining three separate coverage calls

chr start end cytoband loss_fraction gain_fraction uncalled_fraction status
chr1 1000 2000 1p1 0.8 0.02 0.1 loss
chr1 2000 3000 1p2 0.0 0.01 0.02 neutral

This would give flexibility on choosing cutoffs later without rerunning the whole thing.
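
One way to turn the three fractions into a status call, sketched in shell with awk. By default, bedtools coverage appends four columns to each -a interval, the last of which is the fraction of the interval covered, so the three fraction columns here are assumed to have been collected from three such runs. The 0.5 cutoff, the order in which the checks are applied, and the function name `call_status` are assumptions for illustration, not decisions made in this thread.

```shell
#!/bin/sh
# Hypothetical helper: append a status column to rows laid out as
#   chr  start  end  cytoband  loss_fraction  gain_fraction  uncalled_fraction
# The cutoff (default 0.5) and the precedence uncalled > loss > gain are
# assumptions made for this sketch.
call_status() {
  awk -v cutoff="${1:-0.5}" 'BEGIN { FS = OFS = "\t" }
  {
    loss = $5 + 0; gain = $6 + 0; uncalled = $7 + 0
    if (uncalled > cutoff + 0)   status = "uncalled"
    else if (loss > cutoff + 0)  status = "loss"
    else if (gain > cutoff + 0)  status = "gain"
    else                         status = "neutral"
    print $0, status
  }'
}

# The two example rows from the table above, without the status column.
printf 'chr1\t1000\t2000\t1p1\t0.8\t0.02\t0.1\nchr1\t2000\t3000\t1p2\t0.0\t0.01\t0.02\n' |
  call_status 0.5
```

Because the fractions are kept in the table, the cutoff can be changed and the calls regenerated without rerunning bedtools.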

@cbethell
Contributor Author

An alternative worth exploring might be to use bedtools coverage to capture the actual fraction of a cytoband that is covered by loss, gain, or uncalled.

https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html

This could allow you to make a table like the following by combining three separate coverage calls

chr start end cytoband loss_fraction gain_fraction uncalled_fraction status
chr1 1000 2000 1p1 0.8 0.02 0.1 loss
chr1 2000 3000 1p2 0.0 0.01 0.02 neutral
This would give flexibility on choosing cutoffs later without rerunning the whole thing.

Nice, I'll look into implementing this tool.

My understanding here is that bedtools subtract would still be used to retain the callable cytoband regions, then bedtools coverage would be used (in place of bedtools intersect) for each consensus seg with status file (losses, gains, and uncalled regions). Correct?

@jashapiro
Member

My understanding here is that bedtools subtract would still be used to retain the callable cytoband regions, then bedtools coverage would be used (in place of bedtools intersect) for each consensus seg with status file (losses, gains, and uncalled regions). Correct?

I think you could skip the subtract and just go to coverage for gains, losses, and uncallable. Then you would read in the three coverage results files to merge them together and make "status" calls.

- remove 00 script 
- generate consensus bed files in 02 nb, one for the whole consensus seg file, one filtered for losses, and one filtered for gains (these files are saved in the project's scratch directory)
- implement bedtools coverage to add cytoband data from the UCSC cytoband file to the regions with status calls
- comment out the rest of the shell script for development purposes
@cbethell
Contributor Author

I think you could skip the subtract and just go to coverage for gains, losses, and uncallable. Then you would read in the three coverage results files to merge them together and make "status" calls.

@jashapiro I tried to implement this in the last commit, but was unsuccessful thus far as the script seems to get stuck at the first bedtools coverage implementation in the shell script but does not throw an error. Do you have an idea of why this may be?

@jashapiro
Member

@cbethell I am not sure what would be happening there, but I have a couple of quick thoughts.

One is that you should not need the -f flag for the coverage calculation.
The other is that you may be able to speed it up by making sure the bed files are sorted and then adding -sorted throughout. I suggest doing this with an arrange() statement in your R script before printing, which I will add as a suggestion in just a sec.
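
A sketch of the sorting step done in the shell rather than in R, assuming tab-delimited BED files without a header line. `sort_bed` and `coverage_sorted` are hypothetical helper names; the bedtools call mirrors the one in the shell script under review with -f dropped and -sorted added, and both inputs need to be sorted the same way for -sorted to work.

```shell
#!/bin/sh
# Hypothetical helper: sort a BED stream by chromosome name, then
# numerically by start and end.
sort_bed() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n
}

# Hypothetical wrapper around the coverage call discussed in this thread.
# $1: sorted cytoband BED (-a), $2: sorted status BED (-b).
# It is only defined here, not run, because it needs bedtools and real files.
coverage_sorted() {
  bedtools coverage -a "$1" -b "$2" -sorted
}

# Example rows (made up) to show the resulting order.
printf 'chr2\t5\t9\nchr1\t300\t400\nchr1\t20\t30\n' | sort_bed
```

Sorting both the cytoband file and the status files with the same command keeps their chromosome order consistent, which is what the -sorted algorithm relies on.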

Comment on lines 153 to 162
bed_status_df <- add_status_df %>%
  select(chrom, loc.start, loc.end, everything())

losses_bed_status_df <- add_status_df %>%
  select(chrom, loc.start, loc.end, everything()) %>%
  filter(status == "loss")

gains_bed_status_df <- add_status_df %>%
  select(chrom, loc.start, loc.end, everything()) %>%
  filter(status == "gain")
Member

This makes sure the bed tables are sorted before output.

Suggested change
bed_status_df <- add_status_df %>%
  select(chrom, loc.start, loc.end, everything())
losses_bed_status_df <- add_status_df %>%
  select(chrom, loc.start, loc.end, everything()) %>%
  filter(status == "loss")
gains_bed_status_df <- add_status_df %>%
  select(chrom, loc.start, loc.end, everything()) %>%
  filter(status == "gain")

bed_status_df <- add_status_df %>%
  select(chrom, loc.start, loc.end, everything()) %>%
  arrange(chrom, loc.start, loc.end)
losses_bed_status_df <- bed_status_df %>%
  filter(status == "loss")
gains_bed_status_df <- bed_status_df %>%
  filter(status == "gain")

Comment on lines 27 to 30
consensus_bed_file=${scratch_dir}/consensus_seg_with_status.tsv
loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.tsv
gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.tsv
callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.tsv
Member

I think these file names were changed elsewhere, so I am noting that here.

Suggested change
consensus_bed_file=${scratch_dir}/consensus_seg_with_status.tsv
loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.tsv
gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.tsv
callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.tsv
consensus_bed_file=${scratch_dir}/consensus_seg_with_status.bed
loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.bed
gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.bed
callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.bed

bedtools coverage \
-a ${scratch_dir}/ucsc_cytoband.bed \
-b ${scratch_dir}/consensus_seg_with_status_losses.bed \
-f 0.75 \
Member

@jashapiro jashapiro Mar 11, 2020

If you have sorted the -b file, then you can speed this up with -sorted (I don't think we need -f here).

The same thing applies to the two bedtools coverage calls below this one.

Suggested change
-f 0.75 \
-sorted \

cbethell and others added 3 commits March 12, 2020 09:30
- remove `-f` flag from bedtools coverage and add `-sorted` flag
- rerun
- change `tsv` files to `bed` files
- wrangle the data in the cytoband status bed files in `run-prepare-cn.sh`
- add `dominant_status` column to final table
- add `uncallable` regions from original cytoband file
@cbethell
Contributor Author

The notebook added in the last commit can be seen here.

In this notebook so far,

  • I merged the data prepared by bedtools coverage in the shell script (also done in this PR) and created a dominant_status column using the coverage values.
  • I then read in the original cytoband file to define uncallable values in the dominant_status field. However, these rows seem to be _alt chromosomes (I believe we are dropping these but I left them in for verification that we should ignore these instances).
  • I also assigned the uncallable value to rows where the callable coverage ratio == 0. This step should probably receive a close look as I am not sure that this was the right decision made here.
  • My next steps will be adding a chromosome arm field and adding the GISTIC broad arm data for comparison.

cbethell and others added 3 commits March 17, 2020 18:26
- add nb to start to define the most focal units of recurrent CNVs (this nb crashes locally at the `right_join` step but I wanted to get an opinion on whether or not this is what we want to do here)
- add step to save CN file with UCSC cytoband and dominant status data
- update the module README to reflect changes and make a note on the processing speed of the shell script
- remove `run-bedtools.sh` which was replaced with `run-bedtools.snakemake`
- rename prepare cn file R script to be `04` and propagate this change in the shell script
@cbethell
Contributor Author

cbethell commented Mar 18, 2020

In the last commit, I added a notebook named 05-define-most-focal-units.Rmd to start to dig into defining the most focal units for recurrent CNVs using the cytoband status calls determined in this PR. (I also updated the README to reflect the changes in the PR in its current state.)

Note: This notebook is beyond the scope of this PR and was added to demonstrate my thought process on how we would want to tackle the part of the analysis that focuses on finding the most focal units.

The reason I added it to this PR is because it determines the final file that comes out of this PR.

In other words, should this PR save a file of our CN calls with chromosome_arm, cytoband, dominant_status, and Kids_First_Biospecimen_ID, for consumption in a subsequent notebook that will take this information and keep focal CN cytoband calls where possible, and denote broad chromosome arm calls where they exist? (This file was saved and added in the last commit)

If not, what should the final file of this PR look like with the next step being defining the most focal units in mind?

@jaclyn-taroni
Member

In other words, should this PR save a file of our CN calls with chromosome_arm, cytoband, dominant_status, and Kids_First_Biospecimen_ID, for consumption in a subsequent notebook that will take this information and keep focal CN cytoband calls where possible, and denote broad chromosome arm calls where they exist? (This file was saved and added in the last commit)

In this description, it is not clear to me how we would distinguish an arm loss from a cytoband loss. My kneejerk reaction is that you would want to generate arm status separately using a similar approach to how the cytoband calls were generated (you may want to supply a different value to -f for example). That being said, I have not been keeping up with this as closely as @jashapiro and may have said something contradictory.

@jashapiro
Member

In other words, should this PR save a file of our CN calls with chromosome_arm, cytoband, dominant_status, and Kids_First_Biospecimen_ID, for consumption in a subsequent notebook that will take this information and keep focal CN cytoband calls where possible, and denote broad chromosome arm calls where they exist? (This file was saved and added in the last commit)

In this description, it is not clear to me how we would distinguish an arm loss from a cytoband loss. My kneejerk reaction is that you would want to generate arm status separately using a similar approach to how the cytoband calls were generated (you may want to supply a different value to -f for example). That being said, I have not been keeping up with this as closely as @jashapiro and may have said something contradictory.

I think if we have all of the cytoband calls, we can readily combine them to identify arm losses with some simple dplyr::group_by() and mutate(). We are not using -f for the coverage calculations, as we get more raw data out (% of each cytoband covered) that we can make (and adjust) calls from within the R script, which is what makes the "dominant" call. That said, I think the arm calls should wait for the next PR. I'd like to see this one get in with the table of cytoband stats/calls.
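
To illustrate how per-cytoband fractions could later be combined into arm-level numbers, here is a rough shell sketch of a grouped, length-weighted summary. It is not part of this PR: the column layout (ID, chr, cytoband, dominant_status, band_length, callable_fraction, gain_fraction, loss_fraction, chromosome_arm) is assumed from the example rows later in this thread, weighting by band_length is an assumption, and `arm_fractions` is a made-up name. The thread proposes doing this in R with dplyr::group_by().

```shell
#!/bin/sh
# Hypothetical helper: for every (sample, arm) pair, report the
# length-weighted gain and loss fractions across its cytobands.
# Assumed input columns (tab-delimited):
#   1 ID, 2 chr, 3 cytoband, 4 dominant_status, 5 band_length,
#   6 callable_fraction, 7 gain_fraction, 8 loss_fraction, 9 chromosome_arm
arm_fractions() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    key = $1 OFS $9
    len[key]  += $5          # total band length on this arm
    gain[key] += $5 * $7     # bases covered by gains
    loss[key] += $5 * $8     # bases covered by losses
  }
  END {
    for (key in len)
      printf "%s\t%.3f\t%.3f\n", key, gain[key] / len[key], loss[key] / len[key]
  }' | sort
}

# Example rows (made up): two bands on arm 1p for one sample.
printf 'BS_A\tchr1\tp36.33\tloss\t100\t1\t0\t1\t1p\nBS_A\tchr1\tp36.32\tgain\t300\t1\t0.5\t0\t1p\n' |
  arm_fractions
```

An arm-level call could then apply a cutoff to these two columns, in the same way the cytoband calls are made from the coverage fractions.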

@jaclyn-taroni
Member

Okay sounds good 👍

@cbethell
Contributor Author

cbethell commented Mar 19, 2020

I'd like to see this one get in with the table of cytoband stats/calls.

This is what I wanted to get to the bottom of, the final output table that we would like to see from this PR.
The addition of the 05 nb was more of a "this is how we may want to use the output of this PR in the next step, so is this how I should save the output?" This will be removed in the upcoming commit.

So to make sure that we are on the same page here, the final table from this PR should include the following fields:

chr cytoband coverage_ratio_callable coverage_ratio_gains coverage_ratio_losses dominant_status

Is that correct?

Should I also break out the GISTIC comparison section of this PR into its own separate notebook and PR?

@jashapiro
Member

So to make sure that we are on the same page here, the final table from this PR should include the following fields:

chr cytoband coverage_ratio_callable coverage_ratio_gains coverage_ratio_losses dominant_status
Is that correct?

That seems right to me. You also had a column for chromosome arm in there in the last version I looked at, and I would leave that in.

I might change the coverage_ratio_* labels to callable_fraction, gain_fraction and loss_fraction too, just to make them a bit simpler.

I do think I would save the gistic comparison for a separate PR.

@cbethell
Contributor Author

That seems right to me. You also had a column for chromosome arm in there in the last version I looked at, and I would leave that in.

I might change the coverage_ratio_* labels to callable_fraction, gain_fraction and loss_fraction too, just to make them a bit simpler.

I do think I would save the gistic comparison for a separate PR.

Okay sounds good, these changes will be in the upcoming commit.

cbethell and others added 2 commits March 19, 2020 14:46
- add sample IDs to final output table
- remove GISTIC comparison section of nb and remove output file from this section
- remove `05` nb and rendered output
- rerun `03` nb
- update README to reflect changes
Member

@jashapiro jashapiro left a comment

This looks good! Just a few small changes to suggest.

One is adding in band_length as a column we want to maintain in the output. The other is to make sure all output rows have a chromosome arm. Otherwise, this should be pretty ready to go!

Comment on lines 97 to 106
### Add chromosome arm column

```{r}
# Add a column that tells us the position of the p or q and then use this to
# split the cytoband column
final_df <- final_df %>%
  mutate(cytoband_with_arm = paste0(gsub("chr", "", chr), cytoband),
         chromosome_arm = gsub("(p|q).*", "\\1", cytoband_with_arm)) %>%
  select(-cytoband_with_arm)
```
Member

This step should be moved to after the UCSC data is merged back in, otherwise those uncallable cytobands might not have arm assignments.
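
The arm assignment itself only needs the chromosome name and the cytoband, so it can run on any row that has both. A minimal shell sketch of the same string logic as the R chunk above, assuming a two-column tab-delimited input of chr and cytoband; the function name `add_chromosome_arm` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append a chromosome arm column (e.g. 1p, Xq)
# derived from column 1 (chr) and column 2 (cytoband).
add_chromosome_arm() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    arm = $1 $2                      # chr1 and p36.33 become chr1p36.33
    sub(/^chr/, "", arm)             # drop the chr prefix: 1p36.33
    if (match(arm, /[pq]/))          # position of the first p or q
      arm = substr(arm, 1, RSTART)   # keep everything up to and including it: 1p
    print $0, arm
  }'
}

# Example rows (made up).
printf 'chr1\tp36.33\nchrX\tq28\n' | add_chromosome_arm
```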

Comment on lines 37 to 38
| `Kids_First_Biospecimen_ID` | chr | cytoband | dominant_status | callable_fraction | gain_fraction | loss_fraction | chromosome_arm |
|----------------|--------|-------------|--------|---------|-------------|---------|---------------|
Member

You will want to update this to reflect any added columns with changes above.

cbethell and others added 3 commits March 19, 2020 17:01
- add the `band_length` field as above
- remove non-canonical chromosomes (and mitochondria)

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>
- rerun nb after making changes to include `band_length`
- remove unnecessary `replace_na` step
Member

@jashapiro jashapiro left a comment

Looks good! I am approving this, but I have two additional comments...

One is whether we need to add in the ucsc cytobands at all, as the bedtools coverage results really should have all the ones we care about, and it looks like the merge is resulting in no added rows now.

I also did a bit of spot checking, and there does seem to be something a bit funny going on with sample BS_CBMAWSAR. For some reason, that one is not getting "uncallable" calls as I would expect (if one sample is uncallable in a region, all of them should be).

For example, here:

BS_JTBM5TSE	chr1	p11.2	uncallable	1300000	0.2165369	0	0.2165369	1p
BS_DE26D072	chr1	p11.2	uncallable	1300000	0.2165369	0	0.2165369	1p
BS_CBMAWSAR	chr1	p11.2	neutral	1300000	1	0	0	1p
BS_ZS1QRMXS	chr1	p11.2	uncallable	1300000	0.2165369	0	0	1p
BS_5R0HHQ1Y	chr1	p11.2	uncallable	1300000	0.2165369	0	0	1p

So something is funny there, which may warrant investigation. Most likely it isn't showing up in one of the original tables for some reason? Or it shouldn't be included at all?

Comment on lines 118 to 133
### Read in and join original UCSC cytoband data

This step is to define any additional uncallable cytoband regions.

```{r message = FALSE}
ucsc_cytoband_df <- data.table::fread("http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cytoBand.txt.gz", data.table = FALSE)

ucsc_cytoband_df <- ucsc_cytoband_df %>%
  mutate(band_length = V3 - V2) %>%
  select(chr = V1, cytoband = V4, band_length) %>%
  filter(chr %in% paste0("chr", 1:23)) # remove nonstandard chroms.

# Join the cytoband data from the original UCSC cytoband file with the data
# bedtools coverage files wrangled in the steps above
final_df <- final_df %>%
  right_join(ucsc_cytoband_df, by = c("chr", "cytoband", "band_length"))
Member

Realizing this whole step may be redundant, since the bedtools coverage is including all of the ucsc cytobands already.

Contributor Author

Agreed, I will remove this section.

@cbethell
Contributor Author

cbethell commented Mar 20, 2020

Looks good! I am approving this, but I have two additional comments...

One is whether we need to add in the ucsc cytobands at all, as the bedtools coverage results really should have all the ones we care about, and it looks like the merge is resulting in no added rows now.

I also did a bit of spot checking, and there does seem to be something a bit funny going on with sample BS_CBMAWSAR. For some reason, that one is not getting "uncallable" calls as I would expect (if one sample is uncallable in a region, all of them should be).

For example, here:

BS_JTBM5TSE	chr1	p11.2	uncallable	1300000	0.2165369	0	0.2165369	1p
BS_DE26D072	chr1	p11.2	uncallable	1300000	0.2165369	0	0.2165369	1p
BS_CBMAWSAR	chr1	p11.2	neutral	1300000	1	0	0	1p
BS_ZS1QRMXS	chr1	p11.2	uncallable	1300000	0.2165369	0	0	1p
BS_5R0HHQ1Y	chr1	p11.2	uncallable	1300000	0.2165369	0	0	1p

So something is funny there, which may warrant investigation. Most likely it isn't showing up in one of the original tables for some reason? Or it shouldn't be included at all?

@jashapiro After some investigation, I found that the BS_CBMAWSAR sample, along with three other samples (namely, BS_H50HR85Y, BS_FF73TT6D, and BS_80078QDG), has a callable ratio of 1 for this region, compared to a callable ratio of 0.2165369 for the remainder of the samples. However, these samples appear to agree with all other samples of the cohort in other cytoband regions.
Note: This information was determined using the intersect_with_cytoband_callable.bed file

Would you still recommend removing these samples?

@jashapiro
Member

I don't think anything different should be done with these

Would you still recommend removing these samples?

No, I would leave them. I have some guesses why that is happening, but I don't think it should be a major issue in general. My guess is that it is where there is a very large segment that crosses an uncallable region. Shouldn't happen that often, but should be something we are aware of, I guess.

We will probably want to add logic at some point that says if a band is mostly uncallable, then it should be uncallable for all samples. But other bands for that sample should stay as they are.
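
A rough shell sketch of what such logic might look like. Nothing here was decided in this thread: the rule used (a band becomes uncallable for every sample once more than half of the samples call it uncallable) is an assumption standing in for "mostly uncallable", the input is assumed to have ID, chr, cytoband, and dominant_status as its first four tab-delimited columns, and `propagate_uncallable` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: if more than half of the rows for a band
# (chr + cytoband) are "uncallable", set that band to "uncallable" in
# every row. Other bands are left as they are.
# Assumed columns: 1 ID, 2 chr, 3 cytoband, 4 dominant_status, then any others.
propagate_uncallable() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    line[NR] = $0                      # buffer rows for a second pass
    band = $2 OFS $3
    total[band]++
    if ($4 == "uncallable") uncallable[band]++
  }
  END {
    for (i = 1; i <= NR; i++) {
      n = split(line[i], f, FS)
      band = f[2] OFS f[3]
      if (uncallable[band] / total[band] > 0.5) f[4] = "uncallable"
      out = f[1]
      for (j = 2; j <= n; j++) out = out OFS f[j]
      print out
    }
  }'
}

# Example rows (made up): the third sample disagrees on p11.2 only.
printf 'BS_1\tchr1\tp11.2\tuncallable\nBS_2\tchr1\tp11.2\tuncallable\nBS_3\tchr1\tp11.2\tneutral\nBS_3\tchr1\tp12\tneutral\n' |
  propagate_uncallable
```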

Member

@jashapiro jashapiro left a comment

This looks almost all set to me!

I realized that I made a mistake in the UCSC file download and filtered out sex chromosomes, so we are missing some of the data at the moment. So the last step here should be to incorporate my one suggestion and rerun everything (which will be slow, but ah well.)

Otherwise, approved!
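
On the sex chromosome point: the filter in the reviewed chunk keeps names built from `paste0("chr", 1:23)`, and UCSC names the sex chromosomes chrX and chrY, so they never match. Below is a hedged shell sketch of a filter that keeps chr1 to chr22 plus chrX and chrY and still drops chrM and the _alt contigs. The function name is hypothetical, and this is not necessarily the suggestion that was applied in the PR.

```shell
#!/bin/sh
# Hypothetical helper: keep only rows whose first column is one of
# chr1..chr22, chrX, or chrY.
keep_canonical_chromosomes() {
  awk -F '\t' '$1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y)$/'
}

# Example rows (made up), including names that should be dropped.
printf 'chr1\t0\t10\nchr22\t0\t10\nchr23\t0\t10\nchrX\t0\t10\nchrY\t0\t10\nchrM\t0\t10\nchr10_GL383545v1_alt\t0\t10\n' |
  keep_canonical_chromosomes
```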

analyses/focal-cn-file-preparation/run-bedtools.snakemake Outdated

4 participants