Add cytoband to copy number files using bedtools intersect #617

cbethell · 2020-03-09T21:16:34Z

Purpose/implementation Section

The purpose of this PR is to generate cytoband copy number status consensus files for consumption by downstream analyses.

What scientific question is your analysis addressing?

As noted in the original comment on PR #497, the current annotated output of focal-cn-file-preparation (e.g., the contents of results) contains some information about cytobands. However, the cytoband status sometimes disagrees with the GISTIC arm status (GISTIC cutoff is 0.98 of arm for an event). Using the approach in this PR, we can hopefully address this issue more directly and find more agreement between cytoband status and arm status.

What was your approach?

My approach was to prepare the cytoband file retrieved from the UCSC database and the consensus_seg_with_status.tsv file prepared in 02-add-ploidy-consensus.Rmd to be in the format required by bedtools functions. I also separated the consensus_seg_with_status.tsv file into gains and losses and saved these as individual bed files.

I then used bedtools coverage to retrieve the coverage ratio for each of these files using the UCSC cytoband bed file.

In 03-add-cytoband-status-consensus.Rmd, I merge the data produced using bedtools coverage and denote the dominant status for each chromosome arm using this data. I then add and compare GISTIC's arm status data. Should this be broken up into 2 separate notebooks?

What GitHub issue does your pull request address?

This PR addresses issue #497.

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The implementation of the bedtools coverage function and the file this produces should receive a particularly close look.
The logic determining the dominant_status for our consensus calls should also receive some close attention.

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, this is ready for review.

Results

What types of results are included (e.g., table, figure)?

Currently, this PR produces a number of intermediate bed files stored in the project's scratch directory and one final results file:

results/consensus_seg_gistic_cytoband_status.tsv

This results file contains the chromosome arm and the dominant status calls for our consensus data and for GISTIC's data.

Should I be saving this file before the addition of the GISTIC data?

What is your summary of the results?

The chromosome arm calls appear to agree for a total of 35 out of 48 instances (this total excludes the _alt chromosomes that I left in the final file for the purpose of being thorough -- these chromosomes are labeled uncallable).

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

- rename script to better indicate its content
- propagate change to shell script

cansavvy

@cbethell I think we can simplify this by cutting back on most of the content in the 00-prepare-bed-files.R script and going straight to the bedtools intersection bit. I would vouch for trying to get to the intersection bit quicker because some of the formatting changes you make here will not necessarily be carried over, and bedtools does not require them. I tested these items with bedtools to make sure:

Bedtools doesn't care about your column names as long as it finds the chr, start, end in those first three columns in that order (I see your comments mentioned this part), but what you name those columns makes no difference to bedtools.
Bedtools doesn't care if you have extra columns after chr start and end so you don't need to remove columns after it.

Bedtools documentation is pretty good, so you may want to poke around in there a bit (if you have not already): https://bedtools.readthedocs.io/en/latest/content/overview.html
But the other bit of advice I'd have is just try to run bedtools first and it gives pretty good error messages if the data is formatted wrong. For example, when I ran cytoband_with_status.tsv without moving the biospecimen ID from the first column, it told me:

- Please ensure that your file is TAB delimited (e.g., cat -t FILE).
- Also ensure that your file has integer chromosome coordinates in the
  expected columns (e.g., cols 2 and 3 for BED).

As bedtools mentions here, there is one outstanding formatting issue we need to resolve to be able to run bedtools off the bat. That pesky Kids_First_Biospecimen_ID column needs to be moved to after the first three columns.

IF it doesn't throw any downstream analyses off (this is an if of unknown size) then you could just alter this dplyr::select line to move the biospecimen id to the end. And then you can use the consensus_seg_with_status.tsv file with bedtools exactly how it is. However, I do not know how much that will throw things off, so you should look into that before using this idea.
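The column move described above can also be sketched outside of R. This is a minimal illustration with awk, not the approach used in the PR: the five-column layout, the `BS_EXAMPLE` ID, and the `move_id_to_end` helper are assumptions made up for the example, and dropping the header line is a choice made for this sketch only.

```shell
# Hypothetical two-line TSV: ID first, then chrom/start/end, then status.
seg_example=$(printf 'Kids_First_Biospecimen_ID\tchrom\tloc.start\tloc.end\tstatus\nBS_EXAMPLE\tchr1\t0\t2300000\tloss\n')

# Hypothetical helper: shift every field left by one and put the old
# first field (the biospecimen ID) last, so chrom, start, end end up in
# columns 1-3. NR > 1 skips the header line.
move_id_to_end() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR > 1 {
         id = $1
         for (i = 1; i < NF; i++) $i = $(i + 1)
         $NF = id
         print
       }'
}

bed_ready=$(printf '%s\n' "$seg_example" | move_id_to_end)
echo "$bed_ready"
```

Keeping the ID as a trailing column means no information is lost, which matches the observation that bedtools ignores extra columns after chr, start, and end.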

cansavvy · 2020-03-10T12:13:28Z

analyses/focal-cn-file-preparation/00-prepare-bed-files.R

+ucsc_cytoband_bed <- ucsc_cytoband %>%
+  dplyr::select(chr = V1, start = V2, end = V3, cytoband = V4) %>%
+  dplyr::mutate(cytoband = paste0(gsub("chr", "", chr), cytoband),
+                chr = gsub("_.*","", chr)) %>%


Are you trying to drop the chr on the chromosome column? If yes, just use gsub("chr","", chr). But if that is not what you are doing, can you explain what your goal is here?

This step, in cases like: chr10_GL383545v1_alt, is removing everything after the first _ character inclusive. This is done to make it comparable with the consensus_with_status.tsv file.

AHH I see. Okay cool. I think in most cases in this project we've dropped _alt chromosomes, but I could be wrong.

@cansavvy re:

IF it doesn't throw any downstream analyses off (this is an if of unknown size) then you could just alter this dplyr::select line to move the biospecimen id to the end. And then you can use the consensus_seg_with_status.tsv file with bedtools exactly how it is. However, I do not know how much that will throw things off, so you should look into that before using this idea.

I did give this a thought before filing the PR but was worried about downstream analyses, but will look into it and let you know what I find.

A less risky alternative, then, is to save a BED-ready version of that file in the scratch directory at the end of that 02 notebook.

Ah okay, will also look into this before making the change.

By BED ready version, that only means moving that biospecimen column out of the first 3 column spots.

cansavvy · 2020-03-10T12:23:52Z

analyses/focal-cn-file-preparation/00-prepare-bed-files.R

+# Select variables needed in the UCSC cytoband data -- must be in the
+# required bedtools format: chr, start, end 
+ucsc_cytoband_bed <- ucsc_cytoband %>%
+  dplyr::select(chr = V1, start = V2, end = V3, cytoband = V4) %>%


Looks like you are dropping the Giemsa stain (gneg/gpos) calls and renaming the columns. A couple of things: 1) bedtools doesn't care about column names. 2) Column names are still nice for keeping track of things, so you can just use col.names in line 48 so that you don't have to rename them later. Then if you don't want the gneg/gpos column you can just drop it in this line. (Though I'm not sure it hurts anything to keep it around, and this file isn't that big, so keeping an extra column is not too big of a deal.)

Gotcha, will make this change.

cbethell · 2020-03-11T14:23:48Z

@cansavvy re:

IF it doesn't throw any downstream analyses off (this is an if of unknown size) then you could just alter this dplyr::select line to move the biospecimen id to the end. And then you can use the consensus_seg_with_status.tsv file with bedtools exactly how it is. However, I do not know how much that will throw things off, so you should look into that before using this idea.

I did give this a thought before filing the PR but was worried about downstream analyses, but will look into it and let you know what I find.

cansavvy · 2020-03-11T15:28:23Z

More specifically @cbethell, given your comments and findings here, here's what I'd suggest:

Step 1) Change 02-add-ploidy-consensus.Rmd to save a "bedtools ready" version of cytoband_with_status.tsv in scratch, where you move the Kids_First_Biospecimen_ID column to the end of the file using this kind of dplyr::select line.

Step 2) Have your existing run-prepare-cn-file.sh bash script wget the UCSC cytoband file with the URL you have in this script.

Step 3) Continue on with the intersect steps you have laid out in your bash script. Though we should discuss whether the -wa option is what you want here. See this doc.

jaclyn-taroni · 2020-03-11T15:31:03Z

-wa was suggested by @jashapiro and is in #497, but is probably worth discussion / re-evaluating so everyone is on the same page.

jashapiro · 2020-03-11T15:46:14Z

I'm not clear on what information intersect_with_cytoband.tsv is supposed to capture? (Also, since it is still a .bed file it should probably be named that) As it stands (bedtools intersect -wa -f 0.1), it is capturing which cytobands have at least 10% of their length present the consensus status, but it is not retaining any information about which status (and seems to be super redundant, if the current file is correct)

cansavvy · 2020-03-11T15:48:55Z

I'm not clear on what information intersect_with_cytoband.tsv is supposed to capture? (Also, since it is still a .bed file it should probably be named that) As it stands (bedtools intersect -wa -f 0.1), it is capturing which cytobands have at least 10% of their length present the consensus status, but it is not retaining any information about which status (and seems to be super redundant, if the current file is correct)

Gotcha. Okay. I think my confusion may have been over what our end file results goal here was. We want UCSC's cytoband reports and do not care about losing cytobands as they are reported by our data.

cbethell · 2020-03-11T16:15:44Z

Upon further discussion with @jashapiro @cansavvy @jaclyn-taroni, the following steps will be taken:

Use bedtools subtract -A -f to filter out cytobands from the UCSC cytoband file with too much uncalled data.
Save separate bed files for the losses and gains reported in consensus_seg_with_status.tsv.
Use bedtools intersect -wa -f 0.75 between the filtered callable cytobands file (provided to the -a flag) and each of the consensus seg bed files (one for losses and one for gains, provided to the -b flag).
Then read these files into R and annotate these files with genes.

jaclyn-taroni · 2020-03-11T16:21:09Z

Then read these files into R and annotate these files with genes.

I don't think you need to annotate these files with genes, rather you want to annotate them with status: callable, loss, gain

cbethell · 2020-03-11T16:30:00Z

I don't think you need to annotate these files with genes, rather you want to annotate them with status: callable, loss, gain

Okay, so the final output we want here is a table with fields similar to the following:

chr	start	end	cytoband	status
chr1	0	2300000	p36.33	loss
chr4	106700000	113200000	q25	callable

Where callable could be a possible neutral call?

cansavvy · 2020-03-11T16:36:47Z

I don't think you need to annotate these files with genes, rather you want to annotate them with status: callable, loss, gain

Okay, so the final output we want here is a table with fields similar to the following:

chr start end cytoband status
chr1 0 2300000 p36.33 loss
chr4 106700000 113200000 q25 callable
Where callable could be a possible neutral call?

I think callable would be any range that is within callable regions but not called as gain or loss.

jashapiro · 2020-03-11T16:41:11Z

I think you might want every cytoband in there, so I would have 4 possibilities for status: loss, neutral, gain, and uncalled

chr	start	end	cytoband	status
chr1	0	2300000	p36.33	loss
chr4	106700000	113200000	q25	callable

jaclyn-taroni · 2020-03-11T16:46:16Z

I think you might want every cytoband in there, so I would have 4 possibilities for status: loss, neutral, gain, and uncalled

This seems like it would be the most flexible for downstream analyses, but I believe it may require you to read an additional file in (the original cytoband file) to capture the uncalled.

jashapiro · 2020-03-11T17:04:20Z

An alternative worth exploring might be to use bedtools coverage to capture the actual fraction of a cytoband that is covered by loss, gain, or uncalled.

https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html

This could allow you to make a table like the following by combining three separate coverage calls

chr	start	end	cytoband	loss_fraction	gain_fraction	uncalled_fraction	status
chr1	1000	2000	1p1	0.8	0.02	0.1	loss
chr1	2000	3000	1p2	0.0	0.01	0.02	neutral

This would give flexibility on choosing cutoffs later without rerunning the whole thing.
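To make the "choose cutoffs later" idea concrete, here is a minimal sketch that turns the fraction columns of the example table into a status call. The 0.5 cutoff, the precedence order (uncalled, then loss, then gain), and the `call_status` helper are all assumptions for illustration; the thread does not fix any of them.

```shell
# The two example rows from the table above, without the status column:
# chr, start, end, cytoband, loss_fraction, gain_fraction, uncalled_fraction
coverage_table=$(printf 'chr1\t1000\t2000\t1p1\t0.8\t0.02\t0.1\nchr1\t2000\t3000\t1p2\t0.0\t0.01\t0.02\n')

# Hypothetical helper: append a status column based on a single cutoff.
# The cutoff is a parameter, so it can be changed without recomputing
# the coverage fractions.
call_status() {
  awk -v cutoff="$1" 'BEGIN { FS = OFS = "\t" }
    {
      status = "neutral"
      if ($7 > cutoff)      status = "uncalled"
      else if ($5 > cutoff) status = "loss"
      else if ($6 > cutoff) status = "gain"
      print $0, status
    }'
}

calls=$(printf '%s\n' "$coverage_table" | call_status 0.5)
echo "$calls"
```

With this assumed cutoff the first band is called a loss and the second neutral, which agrees with the status column in the example table.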

cbethell · 2020-03-11T17:12:49Z

An alternative worth exploring might be to use bedtools coverage to capture the actual fraction of a cytoband that is covered by loss, gain, or uncalled.

https://bedtools.readthedocs.io/en/latest/content/tools/coverage.html

This could allow you to make a table like the following by combining three separate coverage calls

chr start end cytoband loss_fraction gain_fraction uncalled_fraction status
chr1 1000 2000 1p1 0.8 0.02 0.1 loss
chr1 2000 3000 1p2 0.0 0.01 0.02 neutral
This would give flexibility on choosing cutoffs later without rerunning the whole thing.

Nice, I'll look into implementing this tool.

My understanding here is that bedtools subtract would still be used to retain the callable cytoband regions, then bedtools coverage would be used (in place of bedtools intersect) for each consensus seg with status file (losses, gains, and uncalled regions). Correct?

jashapiro · 2020-03-11T18:10:33Z

My understanding here is that bedtools subtract would still be used to retain the callable cytoband regions, then bedtools coverage would be used (in place of bedtools intersect) for each consensus seg with status file (losses, gains, and uncalled regions). Correct?

I think you could skip the subtract and just go to coverage for gains, losses, and uncallable. Then you would read in the three coverage results files to merge them together and make "status" calls.
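A single coverage call of the kind described here might look like the following sketch. The interval files are tiny made-up examples, and the block skips the call when bedtools is not installed. For each `-a` interval, bedtools coverage appends the number of overlapping `-b` features, the number of covered bases, the interval length, and the covered fraction; the last column is the value the later merge step would use.

```shell
workdir=$(mktemp -d)

# Hypothetical cytoband intervals (-a) and loss segments (-b).
printf 'chr1\t0\t1000\tp2\nchr1\t1000\t2000\tp1\n' > "$workdir/cytoband.bed"
printf 'chr1\t500\t1000\tloss\n' > "$workdir/losses.bed"

if command -v bedtools >/dev/null 2>&1; then
  # One row per cytoband; the final column is the fraction covered by losses.
  loss_coverage=$(bedtools coverage -a "$workdir/cytoband.bed" -b "$workdir/losses.bed")
else
  loss_coverage="bedtools not installed; skipping"
fi

echo "$loss_coverage"
rm -rf "$workdir"
```

Running the same call against gains and uncallable regions would give the three result files that are then merged in R.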

- remove 00 script
- generate consensus bed files in 02 nb, one for the whole consensus seg file, one filtered for losses, and one filtered for gains (these files are saved in the project's scratch directory)
- implement bedtools coverage to add cytoband data from the UCSC cytoband file to the regions with status calls
- comment out the rest of the shell script for development purposes

cbethell · 2020-03-11T19:01:52Z

I think you could skip the subtract and just go to coverage for gains, losses, and uncallable. Then you would read in the three coverage results files to merge them together and make "status" calls.

@jashapiro I tried to implement this in the last commit, but was unsuccessful thus far as the script seems to get stuck at the first bedtools coverage implementation in the shell script but does not throw an error. Do you have an idea of why this may be?

jashapiro · 2020-03-11T19:58:14Z

@cbethell I am not sure what would be happening there, but I have a couple of quick thoughts.

One is that you should not need the -f flag for the coverage calculation.
The other is that you may be able to speed it up by making sure the bed files are sorted and then adding -sorted throughout. I suggest doing this with an arrange() statement in your R script before printing, which I will add as a suggestion in just a sec.
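As an alternative to sorting in R, a BED file can be sorted on the command line by chromosome and then numerically by start position. This is only a sketch with made-up intervals; the thread itself proposes sorting with arrange() in the notebook.

```shell
# Hypothetical unsorted intervals.
unsorted_bed=$(printf 'chr2\t500\t900\tgain\nchr1\t3000\t4000\tloss\nchr1\t200\t800\tloss\n')

# Sort by chromosome name (column 1, as text) and start (column 2, numeric).
# LC_ALL=C keeps the chromosome ordering byte-wise and reproducible.
sorted_bed=$(printf '%s\n' "$unsorted_bed" | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n)
echo "$sorted_bed"
```

Whichever tool does the sorting, both files passed to a bedtools call with `-sorted` need to use the same chromosome ordering.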

jashapiro · 2020-03-11T19:58:55Z

analyses/focal-cn-file-preparation/02-add-ploidy-consensus.Rmd

+bed_status_df <- add_status_df %>%
+  select(chrom, loc.start, loc.end, everything())
+
+losses_bed_status_df <- add_status_df %>%
+  select(chrom, loc.start, loc.end, everything()) %>%
+  filter(status == "loss")
+
+gains_bed_status_df <- add_status_df %>%
+  select(chrom, loc.start, loc.end, everything()) %>%
+  filter(status == "gain")


This makes sure the bed tables are sorted before output.

Suggested change

-bed_status_df <- add_status_df %>%
-  select(chrom, loc.start, loc.end, everything())
-
-losses_bed_status_df <- add_status_df %>%
-  select(chrom, loc.start, loc.end, everything()) %>%
-  filter(status == "loss")
-
-gains_bed_status_df <- add_status_df %>%
-  select(chrom, loc.start, loc.end, everything()) %>%
-  filter(status == "gain")
+bed_status_df <- add_status_df %>%
+  select(chrom, loc.start, loc.end, everything()) %>%
+  arrange(chrom, loc.start, loc.end)
+
+losses_bed_status_df <- bed_status_df %>%
+  filter(status == "loss")
+
+gains_bed_status_df <- bed_status_df %>%
+  filter(status == "gain")

jashapiro · 2020-03-11T20:01:00Z

analyses/focal-cn-file-preparation/run-prepare-cn.sh

+consensus_bed_file=${scratch_dir}/consensus_seg_with_status.tsv
+loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.tsv
+gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.tsv
+callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.tsv


I think these file names were changed elsewhere, so I am noting that here.

Suggested change

-consensus_bed_file=${scratch_dir}/consensus_seg_with_status.tsv
-loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.tsv
-gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.tsv
-callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.tsv
+consensus_bed_file=${scratch_dir}/consensus_seg_with_status.bed
+loss_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_losses.bed
+gain_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_gains.bed
+callable_intersect_with_cytoband_file=${scratch_dir}/intersect_with_cytoband_callable.bed

jashapiro · 2020-03-11T20:02:58Z

analyses/focal-cn-file-preparation/run-prepare-cn.sh

+bedtools coverage \
+    -a ${scratch_dir}/ucsc_cytoband.bed \
+    -b ${scratch_dir}/consensus_seg_with_status_losses.bed \
+    -f 0.75 \


If you have sorted the -b file, then you can speed this up with -sorted (I don't think we need -f here).

The same thing applies to the two bedtools coverage calls below this one.

Suggested change

-    -f 0.75 \
+    -sorted \

- remove `-f` flag from bedtools coverage and add `-sorted` flag
- rerun
- change `tsv` files to `bed` files

- wrangle the data in the cytoband status bed files in `run-prepare-cn.sh`
- add `dominant_status` column to final table
- add `uncallable` regions from original cytoband file

cbethell · 2020-03-12T21:53:10Z

The notebook added in the last commit can be seen here.

In this notebook so far,

I merged the data prepared by bedtools coverage in the shell script (also done in this PR) and created a dominant_status column using the coverage values.
I then read in the original cytoband file to define uncallable values in the dominant_status field. However, these rows seem to be _alt chromosomes (I believe we are dropping these but I left them in for verification that we should ignore these instances).
I also assigned the uncallable value to rows where the callable coverage ratio == 0. This step should probably receive a close look as I am not sure that this was the right decision made here.
My next steps will be adding a chromosome arm field and adding the GISTIC broad arm data for comparison.

- save final table to file

- add nb to start to define the most focal units of recurrent CNVs (this nb crashes locally at the `right_join` step but I wanted to get an opinion on whether or not this is what we want to do here)
- add step to save CN file with UCSC cytoband and dominant status data
- update the module README to reflect changes and make a note on the processing speed of the shell script
- remove `run-bedtools.sh` which was replaced with `run-bedtools.snakemake`
- rename prepare cn file R script to be `04` and propagate this change in the shell script

cbethell · 2020-03-18T19:44:06Z

In the last commit, I added a notebook named 05-define-most-focal-units.Rmd to start to dig into defining the most focal units for recurrent CNVs using the cytoband status calls determined in this PR. (I also updated the README to reflect the changes in the PR in its current state.)

Note: This notebook is beyond the scope of this PR and was added to demonstrate my thought process on how we would want to tackle the part of the analysis that focuses on finding the most focal units.

The reason I added it to this PR is because it determines the final file that comes out of this PR.

In other words, should this PR save a file of our CN calls with chromosome_arm, cytoband, dominant_status, and Kids_First_Biospecimen_ID, for consumption in a subsequent notebook that will take this information and keep focal CN cytoband calls where possible, and denote broad chromosome arm calls where they exist? (This file was saved and added in the last commit.)

If not, what should the final file of this PR look like with the next step being defining the most focal units in mind?

jaclyn-taroni · 2020-03-19T15:52:32Z

In other words, should this PR save a file of our CN calls with chromosome_arm, cytoband, dominant_status, and Kids_First_Biospecimen_ID, for consumption in a subsequent notebook that will take this information and keep focal CN cytoband calls where possible, and denote broad chromosome arm calls where they exist? (This file was saved and added in the last commit.)

In this description, it is not clear to me how we would distinguish an arm loss from a cytoband loss. My kneejerk reaction is that you would want to generate arm status separately using a similar approach to how the cytoband calls were generated (you may want to supply a different value to -f for example). That being said, I have not been keeping up with this as closely as @jashapiro and may have said something contradictory.

jashapiro · 2020-03-19T16:07:05Z

In other words, should this PR save a file of our CN calls with chromosome_arm, cytoband, dominant_status, and Kids_First_Biospecimen_ID, for consumption in a subsequent notebook that will take this information and keep focal CN cytoband calls where possible, and denote broad chromosome arm calls where they exist? (This file was saved and added in the last commit.)

In this description, it is not clear to me how we would distinguish an arm loss from a cytoband loss. My kneejerk reaction is that you would want to generate arm status separately using a similar approach to how the cytoband calls were generated (you may want to supply a different value to -f for example). That being said, I have not been keeping up with this as closely as @jashapiro and may have said something contradictory.

I think if we have all of the cytoband calls, we can readily combine them to identify arm losses with some simple dplyr::group_by() and mutate(). We are not using -f for the coverage calculations, as we get more raw data out (% of each cytoband covered) that we can make (and adjust) calls from within the R script, which is what makes the "dominant" call. That said, I think the arm calls should wait for the next PR. I'd like to see this one get in with the table of cytoband stats/calls.

jaclyn-taroni · 2020-03-19T16:08:36Z

Okay sounds good 👍

cbethell · 2020-03-19T16:19:22Z

I'd like to see this one get in with the table of cytoband stats/calls.

This is what I wanted to get to the bottom of, the final output table that we would like to see from this PR.
The addition of the 05 nb was more of a "this is how we may want to use the output of this PR in the next step, so is this how I should save the output?" This will be removed in the upcoming commit.

So to make sure that we are on the same page here, the final table from this PR should include the following fields:

chr	cytoband	coverage_ratio_callable	coverage_ratio_gains	coverage_ratio_losses	dominant_status

Is that correct?

Should I also break out the GISTIC comparison section of this PR into its own separate notebook and PR?

jashapiro · 2020-03-19T16:52:06Z

So to make sure that we are on the same page here, the final table from this PR should include the following fields:

chr cytoband coverage_ratio_callable coverage_ratio_gains coverage_ratio_losses dominant_status
Is that correct?

That seems right to me. You also had a column for chromosome arm in there in the last version I looked at, and I would leave that in.

I might change the coverage_ratio_* labels to callable_fraction, gain_fraction and loss_fraction too, just to make them a bit simpler.

I do think I would save the gistic comparison for a separate PR.

cbethell · 2020-03-19T16:55:00Z

That seems right to me. You also had a column for chromosome arm in there in the last version I looked at, and I would leave that in.

I might change the coverage_ratio_* labels to callable_fraction, gain_fraction and loss_fraction too, just to make them a bit simpler.

I do think I would save the gistic comparison for a separate PR.

Okay sounds good, these changes will be in the upcoming commit.

- add sample IDs to final output table
- remove GISTIC comparison section of nb and remove output file from this section
- remove `05` nb and rendered output
- rerun `03` nb
- update README to reflect changes

jashapiro

This looks good! Just a few small changes to suggest.

One is adding in band_length as a column we want to maintain in the output. The other is to make sure all output rows have a chromosome arm. Otherwise, this should be pretty ready to go!

jashapiro · 2020-03-19T20:22:32Z

analyses/focal-cn-file-preparation/03-add-cytoband-status-consensus.Rmd

+### Add chromosome arm column
+
+```{r}
+# Add a column that tells us the position of the p or q and then use this to
+# split the cytoband column
+final_df <- final_df %>%
+  mutate(cytoband_with_arm = paste0(gsub("chr", "", chr), cytoband),
+         chromosome_arm = gsub("(p|q).*", "\\1", cytoband_with_arm)) %>%
+  select(-cytoband_with_arm)
+```


This step should be moved to after the UCSC data is merged back in, otherwise those uncallable cytobands might not have arm assignments.
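For readers less familiar with the regex in the diff above, the arm derivation can be sketched in shell: strip the `chr` prefix, append the cytoband, then cut everything after the first `p` or `q`. The `chromosome_arm` function name is hypothetical and the inputs are examples only.

```shell
# Hypothetical helper mirroring the gsub() calls in the notebook diff.
# $1 = chromosome (e.g. chr1), $2 = cytoband (e.g. p36.33)
chromosome_arm() {
  # ${1#chr} removes a leading "chr"; sed keeps text up to and
  # including the first p or q and drops the rest.
  printf '%s%s\n' "${1#chr}" "$2" | sed -E 's/([pq]).*/\1/'
}

chromosome_arm chr1 p36.33
chromosome_arm chr4 q25
```

Because the arm depends only on the chr and cytoband columns, it can be computed for any row that has those two values, including rows joined in later from the UCSC file.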

jashapiro · 2020-03-19T20:33:39Z

analyses/focal-cn-file-preparation/README.md

+  | `Kids_First_Biospecimen_ID` | chr | cytoband | dominant_status | callable_fraction | gain_fraction | loss_fraction | chromosome_arm |
+  |----------------|--------|-------------|--------|---------|-------------|---------|---------------|


You will want to update this to reflect any added columns with changes above.

- add the `band_length` field as above
- remove non-canonical chromosomes (and mitochondria)

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

- rerun nb after making changes to include `band_length`
- remove unnecessary `replace_na` step

jashapiro

Looks good! I am approving this, but I have two additional comments...

One is whether we need to add in the ucsc cytobands at all, as the bedtools coverage results really should have all the ones we care about, and it looks like the merge is resulting in no added rows now.

I also did a bit of spot checking, and there does seem to be something a bit funny going on with sample BS_CBMAWSAR. For some reason, that one is not getting "uncallable" calls as I would expect (if one sample is uncallable in a region, all of them should be).

For example, here:

BS_JTBM5TSE	chr1	p11.2	uncallable	1300000	0.2165369	0	0.2165369	1p
BS_DE26D072	chr1	p11.2	uncallable	1300000	0.2165369	0	0.2165369	1p
BS_CBMAWSAR	chr1	p11.2	neutral	1300000	1	0	0	1p
BS_ZS1QRMXS	chr1	p11.2	uncallable	1300000	0.2165369	0	0	1p
BS_5R0HHQ1Y	chr1	p11.2	uncallable	1300000	0.2165369	0	0	1p

So something is funny there, which may warrant investigation. Most likely it isn't showing up in one of the original tables for some reason? Or it shouldn't be included at all?

jashapiro · 2020-03-20T14:21:58Z

analyses/focal-cn-file-preparation/03-add-cytoband-status-consensus.Rmd

+### Read in and join original UCSC cytoband data
+
+This step is to define any additional uncallable cytoband regions.
+
+```{r message = FALSE}
+ucsc_cytoband_df <- data.table::fread("http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cytoBand.txt.gz", data.table = FALSE)
+
+ucsc_cytoband_df <- ucsc_cytoband_df %>%
+  mutate(band_length = V3 - V2) %>%
+  select(chr = V1, cytoband = V4, band_length) %>%
+  filter(chr %in% paste0("chr", 1:23)) # remove nonstandard chroms.
+
+# Join the cytoband data from the original UCSC cytoband file with the data
+# bedtools coverage files wrangled in the steps above
+final_df <- final_df %>%
+  right_join(ucsc_cytoband_df, by = c("chr", "cytoband", "band_length")) 


Realizing this whole step may be redundant, since the bedtools coverage is including all of the ucsc cytobands already.

Agreed, I will remove this section.

cbethell · 2020-03-20T15:26:43Z

Looks good! I am approving this, but I have two additional comments...

One is whether we need to add in the ucsc cytobands at all, as the bedtools coverage results really should have all the ones we care about, and it looks like the merge is resulting in no added rows now.

I also did a bit of spot checking, and there does seem to be something a bit funny going on with sample BS_CBMAWSAR. For some reason, that one is not getting "uncallable" calls as I would expect (if one sample is uncallable in a region, all of them should be).

For example, here:
BS_JTBM5TSE	chr1	p11.2	uncallable	1300000	0.2165369	0	0.2165369	1p
BS_DE26D072	chr1	p11.2	uncallable	1300000	0.2165369	0	0.2165369	1p
BS_CBMAWSAR	chr1	p11.2	neutral	1300000	1	0	0	1p
BS_ZS1QRMXS	chr1	p11.2	uncallable	1300000	0.2165369	0	0	1p
BS_5R0HHQ1Y	chr1	p11.2	uncallable	1300000	0.2165369	0	0	1p
So something is funny there, which may warrant investigation. Most likely it isn't showing up in one of the original tables for some reason? Or it shouldn't be included at all?

@jashapiro After some investigation, I found that the BS_CBMAWSAR sample, along with three other samples (namely, BS_H50HR85Y, BS_FF73TT6D, and BS_80078QDG), has a callable ratio of 1 for this region, compared to the remainder of the samples with a callable ratio of 0.2165369. However, these samples appear to agree with all other samples of the cohort in other cytoband regions.
Note: This information was determined using the intersect_with_cytoband_callable.bed file

Would you still recommend removing these samples?

- rerun nb

jashapiro · 2020-03-20T18:27:04Z

I don't think anything different should be done with these

Would you still recommend removing these samples?

No, I would leave them. I have some guesses why that is happening, but I don't think it should be a major issue in general. My guess is that it is where there is a very large segment that crosses an uncallable region. Shouldn't happen that often, but should be something we are aware of, I guess.

We will probably want to add logic at some point that says if a band is mostly uncallable, then it should be uncallable for all samples. But other bands for that sample should stay as they are.
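One possible shape for that future logic is sketched below. It reads "mostly uncallable" as "more than half of the samples are uncallable for that band", which is an assumption and not something the thread defines; the sample IDs and rows are hypothetical.

```shell
calls_file=$(mktemp)

# Hypothetical per-sample calls: sample, chr, cytoband, dominant_status.
printf 'SAMPLE_A\tchr1\tp11.2\tuncallable\nSAMPLE_B\tchr1\tp11.2\tuncallable\nSAMPLE_C\tchr1\tp11.2\tneutral\nSAMPLE_C\tchr1\tp36.33\tneutral\n' > "$calls_file"

# Two passes over the same file: the first counts uncallable samples per
# band, the second overwrites the status where the majority is uncallable.
harmonized=$(awk 'BEGIN { FS = OFS = "\t" }
  NR == FNR {
    total[$2 FS $3]++
    if ($4 == "uncallable") uncallable[$2 FS $3]++
    next
  }
  {
    band = $2 FS $3
    if (uncallable[band] / total[band] > 0.5) $4 = "uncallable"
    print
  }' "$calls_file" "$calls_file")

echo "$harmonized"
rm -f "$calls_file"
```

Only the band where the majority is uncallable changes; the other band for SAMPLE_C keeps its original call, as the comment above asks.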

jashapiro

This looks almost all set to me!

I realized that I made a mistake in the UCSC file download and filtered out sex chromosomes, so we are missing some of the data at the moment. So the last step here should be to incorporate my one suggestion and rerun everything (which will be slow, but ah well.)
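The mistake mentioned here comes from matching chromosome names against chr1 through chr23, which never matches chrX or chrY. A minimal sketch of a filter that keeps the autosomes and both sex chromosomes while still dropping chrM and the `_alt` contigs follows; the `keep_canonical` helper and the coordinates in the example rows are made up for illustration, and this is not necessarily the suggestion that was applied in the PR.

```shell
# Hypothetical helper: keep rows whose first column is chr<number>, chrX, or chrY.
# The trailing $ in the pattern rejects names such as chr10_GL383545v1_alt.
keep_canonical() {
  awk 'BEGIN { FS = OFS = "\t" } $1 ~ /^chr([0-9]+|X|Y)$/'
}

# Made-up coordinates; only the chromosome names matter for this sketch.
cytobands=$(printf 'chr1\t0\t1000\nchrX\t0\t1000\nchrM\t0\t1000\nchr10_GL383545v1_alt\t0\t1000\n')

canonical=$(printf '%s\n' "$cytobands" | keep_canonical)
echo "$canonical"
```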

Otherwise, approved!

analyses/focal-cn-file-preparation/run-bedtools.snakemake

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

cbethell added 2 commits March 9, 2020 16:50

Add cytoband data using bedtools intersect with UCSC cytoband file

7dffdfc

Update comments

e745ad9

- rename script to better indicate its content
- propagate change to shell script

jaclyn-taroni requested a review from cansavvy March 9, 2020 21:44

cansavvy reviewed Mar 10, 2020

jashapiro reviewed Mar 11, 2020

cbethell and others added 3 commits March 12, 2020 09:30

sort before filtering out losses and gains

bba0864

- remove `-f` flag from bedtools coverage and add `-sorted` flag
- rerun
- change `tsv` files to `bed` files

Merge branch 'master' into add-cytoband-status-with-bedtools

f573daa

Add notebook to join and wrangle the cytoband bed files

337a264

- wrangle the data in the cytoband status bed files in `run-prepare-cn.sh`
- add `dominant_status` column to final table
- add `uncallable` regions from original cytoband file

cbethell and others added 2 commits March 13, 2020 08:43

Add chromosome arm field and GISTIC arm status data

3c827b1

- save final table to file

Merge branch 'master' into add-cytoband-status-with-bedtools

ee4ff57

cbethell and others added 3 commits March 17, 2020 18:26

Merge branch 'master' into add-cytoband-status-with-bedtools

60147fc

Merge branch 'master' into add-cytoband-status-with-bedtools

8247e53

cbethell and others added 2 commits March 19, 2020 14:46

Reformat final output table and rename output file

0c89687

- add sample IDs to final output table
- remove GISTIC comparison section of nb and remove output file from this section
- remove `05` nb and rendered output
- rerun `03` nb
- update README to reflect changes

Merge branch 'master' into add-cytoband-status-with-bedtools

bcbbb09

jashapiro reviewed Mar 19, 2020

cbethell and others added 3 commits March 19, 2020 17:01

Apply suggestions from @jashapiro code review

2b6c1b2

- add the `band_length` field as above
- remove non-canonical chromosomes (and mitochondria)

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

Move addition of chromosome_arm step after joining original UCSC data

01f385b

- rerun nb after making changes to include `band_length`
- remove unnecessary `replace_na` step

update README to reflect addition of band_length column

d64159b

cbethell mentioned this pull request Mar 19, 2020

Figure Panel: Oncoprint Landscape #641

Closed

Merge branch 'master' into add-cytoband-status-with-bedtools

807a7ad

jashapiro approved these changes Mar 20, 2020

Merge branch 'master' into add-cytoband-status-with-bedtools

b26cd42

cbethell added 2 commits March 20, 2020 11:31

remove redundant joining of original ucsc cytoband data section

e9bf7a7

-rerun nb

update usage comment and re-render html output

126ea71

jashapiro approved these changes Mar 20, 2020

cbethell and others added 2 commits March 20, 2020 14:36

@jashapiro's commit suggestion to include sex chromosome data

53b906d

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

rerun module (now that UCSC file download includes sex chromosomes)

1b761ee

jaclyn-taroni merged commit c938435 into AlexsLemonade:master Mar 23, 2020

cbethell mentioned this pull request Mar 23, 2020

Define most focal units of recurrent CNVs #644

Merged

5 tasks

jaclyn-taroni mentioned this pull request Apr 13, 2020

Updated analysis: generate cytoband copy number status file for consumption #497

Closed

| `Kids_First_Biospecimen_ID` | chr | cytoband | dominant_status | callable_fraction | gain_fraction | loss_fraction | chromosome_arm |
|----------------|--------|-------------|--------|---------|-------------|---------|---------------|
