
CNV consensus (3 of n): Filter bad segments #328

Merged: 26 commits merged into AlexsLemonade:master on Dec 17, 2019

Conversation

fingerfen
Contributor

Purpose/implementation Section

Continue copy number consensus call

What scientific question is your analysis addressing?

Merging consensus calls

What GitHub issue does your pull request address?

Issue #128

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The new "get_rid_bad_segments.py" file

Is there anything that you want to discuss further?

No

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

@fingerfen fingerfen changed the title Filter bad segments CNV consensus (3 of n): Filter bad segments Dec 12, 2019
Member

@jashapiro left a comment


Hi @fingerfen, thank you for submitting this next step! Sorry for the delay in review; I was away for a while and have been digging myself back out of a bit of a hole.

The main question I have is the provenance of the bad_chromosomal_seg_updated_merged.txt file. How was this produced? We definitely would want to know how that file came to be, and what it contains. If there was a script that created it, we would want to have that in this analysis as well.

I also had a suggestion or two on how to simplify the snakefile, one of which is here, the other of which is in a pull request I submitted to your branch.

Finally, I had some thoughts about places where it might be possible to speed up your script, so you don't need to go through the entire bad_chromosomal_seg file every time. One simple suggestion would be to store each chromosome separately (in a dict, for example), to cut the number of comparisons you need to do. Alternatively (as I suggested below), you might be able to iterate through the CNV file and bad file together, as long as they are both sorted.
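A minimal sketch of the per-chromosome dict idea, assuming the blacklist rows have already been parsed into (chrom, start, end) values; the names here are illustrative, not the script's actual variables:

```python
from collections import defaultdict

## Index bad segments by chromosome so each CNV is only compared
## against the segments on its own chromosome.
bad_by_chrom = defaultdict(list)
for chrom, start, end in bad_segments:  # bad_segments: parsed blacklist rows (assumed)
    bad_by_chrom[chrom].append((int(start), int(end)))

def overlaps_bad_segment(chrom, start, end):
    """True if the CNV [start, end) overlaps any bad segment on chrom."""
    return any(start < b_end and end > b_start
               for b_start, b_end in bad_by_chrom[chrom])
```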

I don't actually know if this is a particularly slow point in the script, but it does seem it could be. One other alternative, which I am a bit hesitant to suggest as it may require a bit of work to wrap your head around and get working as you want, is to replace some or all of the script with bedtools subtract, which should be extremely efficient. You would want to take advantage of the -f/-F and -A options to achieve similar results to your script, I think. bedtools is already in the docker container if you want to try it out.
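For reference, a hedged sketch of what that bedtools route might look like, invoked here through Python's subprocess to keep to one scripting language; the file names and the 0.5 overlap fraction are illustrative choices, not settings from this PR:

```python
import subprocess

## bedtools subtract with -A drops the whole CNV record on a qualifying
## overlap instead of trimming it; -f 0.5 requires the overlap to cover
## at least 50% of the CNV (the fraction here is only an example).
with open("cnvs.filtered.bed", "w") as out:
    subprocess.run(
        ["bedtools", "subtract", "-A", "-f", "0.5",
         "-a", "cnvs.bed", "-b", "bad_segments.bed"],
        stdout=out, check=True,
    )
```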

analyses/copy_number_consensus_call/Snakefile (outdated; conversation resolved)
@@ -0,0 +1,3693 @@
chr1 0 535988 telo,,, length,,,
Member


Where did this file come from? Is there a script that generated it? Or did it come from a different source?

If the former, we would like to have that script included in the repository. If the latter, we should indicate where the file came from in a README in the base directory of this analysis.

Contributor Author


@hongboxie provided me with the file, so I will let him explain it. I took the file he gave me and sorted it, merging any overlapping segments.

Contributor Author


@jashapiro I also would like to clarify something. At a glance, it seems that bedtools subtract can do what get_rid_bad_segments.py is doing.

Are you suggesting that we replace that script with a simple bedtools subtract line instead? It seems like bedtools subtract would do the same job, and probably much faster.

Member


It depends. I want this analysis to be merged quickly, so I don't really want to add more for you to do. But if it is a quick substitution and you can verify that your code and bedtools work the same way, then it might be a good idea.

I would suggest that we get this merged in as is since we know it works, and you can think about updating to bedtools in a future PR if you have time.

For this PR, I am more concerned with making sure we have all the steps required to generate the bad_chromosomal_seg file. That would include the origin of the files that @hongboxie gave you, as well as the scripts for liftOver, sorting, and merging, as required.

Contributor Author


I see. So I think right now I will stick with this and make a future PR to update to bedtools if time permits.

As for the bad_chromosomal_seg file: how do you want me to document my process of making it? Do I describe my process in a certain file somewhere?
I didn't generate this file with a script. I did it manually, but I could write a script if you think it is necessary.

Member


A script would be very much preferred, even if it is just a simple shell script with a series of steps.

Contributor Author


Since the purpose of the script is to document my process, does it have to be integrated into the pipeline? Or can it just be a standalone file?

Member


It does not have to be integrated into the Snakemake file, but we might ultimately want to add it to the shell script; that doesn't need to be done right now, though. I can also add it to the testing once we have it documented in a first phase.

Contributor Author


I have made the changes and pushed them to this PR. The changes were:

  1. Implemented re.split in the .py code.
  2. Added the script to generate the blacklist.


################# ASSUMPTION ###############
# It is assumed that the reference file DOES NOT have overlapping telomeric/centromeric segments
# The provided reference file "bad_chromosomal_seg_updated_merged.txt" DOES NOT have overlapping segments
Member


Where is this provided from?

if start_cnv > end_cnv:
    start_cnv, end_cnv = end_cnv, start_cnv

## For each CNV, loop through the entire reference file
Member


This step could make things slow. Are the CNVs sorted? If so, we can probably keep track of where we are in the reference file. If the CNV is after our current reference position, we can search forward in the reference, but we won't need to go all the way to the end, and we won't need to go back to the start for the next CNV. I don't know how long this takes on the full data, but if it is slow, this might be a place to look for a speed boost.
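A minimal sketch of that merge-style scan, assuming both inputs are pre-sorted by (chromosome, start) with the same chromosome ordering, are parsed into (chrom, start, end) tuples, and that the goal is to drop any CNV overlapping a bad segment; the names are illustrative, not the script's actual code:

```python
## Single forward pass: because both inputs are sorted, the reference
## pointer only moves forward, so each list is effectively read once.
def filter_sorted_cnvs(cnvs, bad_segs):
    kept = []
    ref_idx = 0  # current position in the sorted bad-segment list
    for chrom, start, end in cnvs:
        ## Skip bad segments that end before this CNV starts; sorted
        ## input guarantees they cannot overlap any later CNV either.
        while ref_idx < len(bad_segs) and (
            bad_segs[ref_idx][0] < chrom
            or (bad_segs[ref_idx][0] == chrom and bad_segs[ref_idx][2] <= start)
        ):
            ref_idx += 1
        ## Keep the CNV unless it overlaps the current bad segment.
        if ref_idx == len(bad_segs):
            kept.append((chrom, start, end))
        else:
            b_chrom, b_start, b_end = bad_segs[ref_idx]
            if b_chrom != chrom or b_start >= end:
                kept.append((chrom, start, end))
    return kept
```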

Comment on lines 49 to 54
if stripped_content[0].find('\t') != -1:
    final_content = [i.split('\t') for i in stripped_content]

## If the file is space delimited, split the file up by the spaces
else:
    final_content = [i.split() for i in stripped_content]
Member


For these purposes, are we likely to get both tab and space separated files? I would have assumed that if these are bed files they should all be separated the same way.

Contributor Author


For the purposes of this pipeline, the input file for this step is ALWAYS going to be space-separated. The reason this is here is that when I wrote this script, I wanted to add a little flexibility in case someone wants to take the script out and use it on their own tab-separated data. I could take it out if you think it is not necessary.

Collaborator


Would it be safe to use re.split and split on any whitespace of one or more characters with '\s+'?
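If so, the tab/space branching above could collapse to one line; a sketch, assuming stripped_content holds the already-stripped input lines:

```python
import re

## Split each line on any run of whitespace (tabs or spaces alike).
final_content = [re.split(r'\s+', line) for line in stripped_content]
```

(For what it's worth, Python's bare str.split() with no argument also splits on any whitespace run, so the existing space branch would already handle tabs on its own.)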

Collaborator


Also: don't bed files require tab spacing? Are there tools for bed files that output space-separated ones?

Contributor Author


Yes, it does seem that re.split would simplify this into a few lines, if not one. I can implement this if everyone agrees.

Bed files do require tab spacing. However, this step (step 3) doesn't use bedtools. The step before this (step 2) outputs a space-separated file, which then gets read by the .py script in step 3. By the next time bedtools is needed, the input will have been converted into a tab-separated file.

I think it might be best that I go back and:

  1. Implement the re.split.
  2. For consistency, change step 2 to output a tab-separated file.

What does everyone think?

Collaborator


@fingerfen: if re.split would let you simplify many lines to one line and reduce the potential for an error, I'm a fan of that.

@hongboxie

On how the blacklist is created:

  1. The blacklist primarily includes IGLL regions and centromeric and telomeric regions. We have published a few CNV-centered papers, and this is our standard practice: PMID: 25066379; PMID: 28398664; PMID: 31222980; PMID: 26742502; PMID: 25892112.
  2. This practice is also described by my colleague Kai Wang on his PennCNV website: http://penncnv.openbioinformatics.org/en/latest/misc/faq/
  3. Segmental duplication regions: it is also common to remove segmental duplication regions from the final CNV detection; for instance, PMID: 24098321. This also comes from our observation that those regions present the noisiest results and cloud our conclusions. After removing them, we have much cleaner results, especially for NGS (WGS/WES). For genotyping arrays, due to how probes are selected, the problem may not be as obvious; for WGS/WES, we do see the difference.
  4. We actually want to remove one more class of regions, low-mappability regions, but want to leave that open for now.

@jashapiro
Member

Thank you @hongboxie, this is useful information!

Is there a script or link that we could include to document these decisions? I also just want to confirm that the regions are for hg38 (only because the PennCNV link seems only to include hg19).

@hongboxie

I will use PennCNV as the web page link.

Everything can be converted from the hg19 build to the hg38 build with the liftOver program from the UCSC utility tool set.
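A hedged sketch of that conversion step, again through Python's subprocess; the file names are illustrative, and the chain file would come from UCSC's downloads:

```python
import subprocess

## UCSC liftOver usage: liftOver oldFile map.chain newFile unMapped
subprocess.run(
    ["liftOver", "bad_segments_hg19.bed",
     "hg19ToHg38.over.chain.gz",
     "bad_segments_hg38.bed", "bad_segments_unmapped.bed"],
    check=True,
)
```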

Nhat should provide a file of the blacklist, which can be posted somewhere as a linked file.

I think that should be adequate.

Member

@jashapiro left a comment


Thank you for adding the script to generate the blacklist. That is very helpful!

Can you add a link to the UCSC track that you used? Was it perhaps this one? https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=783211655_qnHWS1w9VWkUebtWb7482jKHTMF8&c=chr1&g=genomicSuperDups
or this? https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=783211655_qnHWS1w9VWkUebtWb7482jKHTMF8&c=chr1&g=rmsk

Otherwise, I think this looks good, but please also see the pull request I submitted to your branch: fingerfen#2 for a few other file format-related changes that would be nice to merge in here.

input:
    ## Define the location of the input file and take the path/extension from the config file
    script=os.path.join(config["scripts"], "get_rid_bad_segments.py"),
    bad_list=os.path.join(config["scripts"], "bad_chromosomal_seg_updated_merged.txt"),
Member


As per my other suggestion, should this be renamed?

Contributor Author


Yes, I will make that change.

# There are two components to this file the "IGLL regions, centromeric and telomeric regions" and the "segmental duplication regions"
# 1) The IGLL regions, centromeric and telomeric regions are generated from the practice described by Kai Wang at his PennCNV website http://penncnv.openbioinformatics.org/en/latest/misc/faq/
# 2) The segmental duplication are downloaded from UCSC genome browser. The segmental duplication with 95% identity was downloaded and merged.
The tracked used for this is here https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=783211655_qnHWS1w9VWkUebtWb7482jKHTMF8&c=chr1&g=genomicSuperDups
Member


Suggested change
The tracked used for this is here https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=783211655_qnHWS1w9VWkUebtWb7482jKHTMF8&c=chr1&g=genomicSuperDups
# The track used for this is here https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=783211655_qnHWS1w9VWkUebtWb7482jKHTMF8&c=chr1&g=genomicSuperDups

Member

@jashapiro left a comment


Looks good! Let's see the next steps!

@jaclyn-taroni jaclyn-taroni merged commit db0d74e into AlexsLemonade:master Dec 17, 2019