
CNV consensus (3 of n): Filter bad segments #328

Merged: 26 commits merged into AlexsLemonade:master on Dec 17, 2019

Conversation

fingerfen
Contributor

Purpose/implementation Section

Continue copy number consensus call

What scientific question is your analysis addressing?

Merging consensus calls

What GitHub issue does your pull request address?

Issue #128

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The new "get_rid_bad_segments.py" file

Is there anything that you want to discuss further?

No

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

@fingerfen fingerfen changed the title Filter bad segments CNV consensus (3 of n): Filter bad segments Dec 12, 2019
Member

@jashapiro left a comment


Hi @fingerfen, thank you for submitting this next step! Sorry for the delay in review; I was away for a while and have been digging myself back out of a bit of a hole.

The main question I have is the provenance of the bad_chromosomal_seg_updated_merged.txt file. How was this produced? We definitely would want to know how that file came to be, and what it contains. If there was a script that created it, we would want to have that in this analysis as well.

I also had a suggestion or two on how to simplify the snakefile, one of which is here, the other of which is in a pull request I submitted to your branch.

Finally, I had some thoughts about places where it might be possible to speed up your script, so you don't need to go through the entire bad_chromosomal_seg file every time. One simple suggestion would be to store each chromosome separately (in a dict, for example), to cut the number of comparisons you need to do. Alternatively (as I suggested below), you might be able to iterate through the CNV file and bad file together, as long as they are both sorted.
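A minimal sketch of the per-chromosome dict idea, assuming the blacklist rows have already been parsed into (chrom, start, end) values; the names here are illustrative, not the script's actual variables:

```python
from collections import defaultdict

## Index bad segments by chromosome so each CNV is only compared
## against the segments on its own chromosome.
bad_by_chrom = defaultdict(list)
for chrom, start, end in bad_segments:  # bad_segments: parsed blacklist rows (assumed)
    bad_by_chrom[chrom].append((int(start), int(end)))

def overlaps_bad_segment(chrom, start, end):
    """True if the CNV [start, end) overlaps any bad segment on chrom."""
    return any(start < b_end and end > b_start
               for b_start, b_end in bad_by_chrom[chrom])
```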

I don't actually know if this is a particularly slow point in the script, but it does seem it could be. One other alternative, which I am a bit hesitant to suggest as it may require a bit of work to wrap your head around and get working as you want, is to replace some or all of the script with bedtools subtract, which should be extremely efficient. You would want to take advantage of the -f/-F and -A options to achieve similar results to your script, I think. bedtools is already in the docker container if you want to try it out.
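For reference, a hedged sketch of what that bedtools route might look like, invoked here through Python's subprocess to keep to one scripting language; the file names and the 0.5 overlap fraction are illustrative choices, not settings from this PR:

```python
import subprocess

## bedtools subtract with -A drops the whole CNV record on a qualifying
## overlap instead of trimming it; -f 0.5 requires the overlap to cover
## at least 50% of the CNV (the fraction here is only an example).
with open("cnvs.filtered.bed", "w") as out:
    subprocess.run(
        ["bedtools", "subtract", "-A", "-f", "0.5",
         "-a", "cnvs.bed", "-b", "bad_segments.bed"],
        stdout=out, check=True,
    )
```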

analyses/copy_number_consensus_call/Snakefile (outdated; conversation resolved)
@@ -0,0 +1,3693 @@
chr1 0 535988 telo,,, length,,,
Member


Where did this file come from? Is there a script that generated it? Or did it come from a different source?

If the former, we would like to have that script included in the repository. If the latter, we should indicate where the file came from in a README in the base directory of this analysis.

Contributor Author


@hongboxie provided me with the file, so I will let him explain it. I took the file he gave me and sorted it, merging any overlapping segments.

Contributor Author


@jashapiro I also would like to clarify something. At a glance, it seems that bedtools subtract can do what get_rid_bad_segments.py is doing.

Are you suggesting that we replace that script with a simple bedtools subtract line instead? It seems like bedtools subtract would do the same job, and probably much faster.

Member


It depends. I want this analysis to be merged quickly, so I don't really want to add more for you to do. But if it is a quick substitution and you can verify that your code and bedtools work the same way, then it might be a good idea.

I would suggest that we get this merged in as is since we know it works, and you can think about updating to bedtools in a future PR if you have time.

For this PR, I am more concerned with making sure we have all the steps required to generate the bad_chromosomal_seg file. That would include the origin of the files that @hongboxie gave you, as well as the scripts for liftOver, sorting, and merging, as required.

Contributor Author


I see. So I think right now I will stick with this and make a future PR to update to bedtools if time permits.

As for the bad_chromosomal_seg file: how do you want me to document my process of making it? Do I describe my process in a certain file somewhere?
I didn't generate this file with a script. I did it manually, but I could write a script if you think it is necessary.

Member


A script would be very much preferred, even if it is just a simple shell script with a series of steps.

Contributor Author


Since the purpose of the script is to document my process, does it have to be integrated into the pipeline? Or can it just be a standalone file?

Member


It does not have to be integrated into the Snakemake file, but we might ultimately want to add it to the shell script; that doesn't need to be done right now, though. I can also add it to the testing once we have it documented in a first phase.

Contributor Author


I have made the changes and pushed them to this PR. The changes were:

  1. Implemented re.split in the .py code.
  2. Added the script to generate the blacklist.


################# ASSUMPTION ###############
# It is assumed that the reference file DOES NOT have overlapping telomeric/centromeric segments
# The provided reference file "bad_chromosomal_seg_updated_merged.txt" DOES NOT have overlapping segments
Member


Where is this provided from?

if start_cnv > end_cnv:
    start_cnv, end_cnv = end_cnv, start_cnv

## For each CNV, loop through the entire reference file
Member


This step could make things slow. Are the CNVs sorted? If so, we can probably keep track of where we are in the reference file. If the CNV is after our current reference position, we can search forward in the reference, but we won't need to go all the way to the end, and we won't need to go back to the start for the next CNV. I don't know how long this takes on the full data, but if it is slow, this might be a place to look for a speed boost.
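A minimal sketch of that merge-style scan, assuming both inputs are pre-sorted by (chromosome, start) with the same chromosome ordering, are parsed into (chrom, start, end) tuples, and that the goal is to drop any CNV overlapping a bad segment; the names are illustrative, not the script's actual code:

```python
## Single forward pass: because both inputs are sorted, the reference
## pointer only moves forward, so each list is effectively read once.
def filter_sorted_cnvs(cnvs, bad_segs):
    kept = []
    ref_idx = 0  # current position in the sorted bad-segment list
    for chrom, start, end in cnvs:
        ## Skip bad segments that end before this CNV starts; sorted
        ## input guarantees they cannot overlap any later CNV either.
        while ref_idx < len(bad_segs) and (
            bad_segs[ref_idx][0] < chrom
            or (bad_segs[ref_idx][0] == chrom and bad_segs[ref_idx][2] <= start)
        ):
            ref_idx += 1
        ## Keep the CNV unless it overlaps the current bad segment.
        if ref_idx == len(bad_segs):
            kept.append((chrom, start, end))
        else:
            b_chrom, b_start, b_end = bad_segs[ref_idx]
            if b_chrom != chrom or b_start >= end:
                kept.append((chrom, start, end))
    return kept
```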

Comment on lines 49 to 54
if stripped_content[0].find('\t') != -1:
    final_content = [i.split('\t') for i in stripped_content]

## If the file is space delimited, split the file up by the spaces
else:
    final_content = [i.split() for i in stripped_content]
Member


For these purposes, are we likely to get both tab and space separated files? I would have assumed that if these are bed files they should all be separated the same way.

Contributor Author


For the purposes of this pipeline, the input file for this step is ALWAYS going to be space-separated. The reason this is here is that when I wrote this script, I wanted to add a little flexibility in case someone wants to take the script out and use it on their own tab-separated data. I could take it out if you think it is not necessary.

Collaborator


Would it be safe to use re.split and split on any whitespace of one or more characters with '\s+'?
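If so, the tab/space branching above could collapse to one line; a sketch, assuming stripped_content holds the already-stripped input lines:

```python
import re

## Split each line on any run of whitespace (tabs or spaces alike).
final_content = [re.split(r'\s+', line) for line in stripped_content]
```

(For what it's worth, Python's bare str.split() with no argument also splits on any whitespace run, so the existing space branch would already handle tabs on its own.)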

Collaborator


Also: don't bed files require tab spacing? Are there tools for bed files that output space-separated ones?

Contributor Author


Yes, it does seem that re.split would simplify this into a few lines, if not one. I can implement this if everyone agrees.

Bed files do require tab spacing. However, this step (step 3) doesn't use bedtools. The step before this (step 2) outputs a space-separated file, which then gets read by the .py script in step 3. By the next time bedtools is needed, the input will have been converted into a tab-separated file.

I think it might be best that I go back and:

  1. Implement the re.split.
  2. For consistency, change step 2 to output a tab-separated file.

What does everyone think?

Collaborator


@fingerfen: if re.split would let you simplify many lines to one line and reduce the potential for an error, I'm a fan of that.

@hongboxie

On how the blacklist is created:

  1. The blacklist primarily includes IGLL regions and centromeric and telomeric regions. We have published a few CNV-centered papers, and this is our standard practice: PMID: 25066379; PMID: 28398664; PMID: 31222980; PMID: 26742502; PMID: 25892112.
  2. This practice is also described by my colleague Kai Wang on his PennCNV website: http://penncnv.openbioinformatics.org/en/latest/misc/faq/
  3. Segmental duplication regions: it is also common to remove segmental duplication regions from the final CNV detection; for instance, PMID: 24098321. This also comes from our observation that those regions present the noisiest results and cloud our conclusions. After removing them, we have much cleaner results, especially for NGS (WGS/WES). For genotyping arrays, due to how probes are selected, the problem may not be as obvious; for WGS/WES, we do see the difference.
  4. We actually want to remove one more class of regions, low-mappability regions, but want to leave that open for now.

@jashapiro
Member

Thank you @hongboxie, this is useful information!

Is there a script or link that we could include to document these decisions? I also just want to confirm that the regions are for hg38 (only because the PennCNV link seems only to include hg19).

@hongboxie

I will use PennCNV as the web page link.

Everything can be converted from the hg19 build to the hg38 build with the liftOver program from the UCSC utility tool set.
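A hedged sketch of that conversion step, again through Python's subprocess; the file names are illustrative, and the chain file would come from UCSC's downloads:

```python
import subprocess

## UCSC liftOver usage: liftOver oldFile map.chain newFile unMapped
subprocess.run(
    ["liftOver", "bad_segments_hg19.bed",
     "hg19ToHg38.over.chain.gz",
     "bad_segments_hg38.bed", "bad_segments_unmapped.bed"],
    check=True,
)
```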

Nhat should provide a file of the blacklist, which can be posted somewhere as a linked file.

I think that should be adequate.

Member

@jashapiro left a comment


Thank you for adding the script to generate the blacklist. That is very helpful!

Can you add a link to the UCSC track that you used? Was it perhaps this one? https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=783211655_qnHWS1w9VWkUebtWb7482jKHTMF8&c=chr1&g=genomicSuperDups
or this? https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=783211655_qnHWS1w9VWkUebtWb7482jKHTMF8&c=chr1&g=rmsk

Otherwise, I think this looks good, but please also see the pull request I submitted to your branch: fingerfen#2 for a few other file format-related changes that would be nice to merge in here.

input:
    ## Define the location of the input file and take the path/extension from the config file
    script=os.path.join(config["scripts"], "get_rid_bad_segments.py"),
    bad_list=os.path.join(config["scripts"], "bad_chromosomal_seg_updated_merged.txt"),
Member


As per my other suggestion, should this be renamed?

Contributor Author


Yes, I will make that change.

# There are two components to this file the "IGLL regions, centromeric and telomeric regions" and the "segmental duplication regions"
# 1) The IGLL regions, centromeric and telomeric regions are generated from the practice described by Kai Wang at his PennCNV website http://penncnv.openbioinformatics.org/en/latest/misc/faq/
# 2) The segmental duplication are downloaded from UCSC genome browser. The segmental duplication with 95% identity was downloaded and merged.
The tracked used for this is here https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=783211655_qnHWS1w9VWkUebtWb7482jKHTMF8&c=chr1&g=genomicSuperDups
Member


Suggested change
The tracked used for this is here https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=783211655_qnHWS1w9VWkUebtWb7482jKHTMF8&c=chr1&g=genomicSuperDups
# The track used for this is here https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=783211655_qnHWS1w9VWkUebtWb7482jKHTMF8&c=chr1&g=genomicSuperDups

Member

@jashapiro left a comment


Looks good! Let's see the next steps!

@jaclyn-taroni jaclyn-taroni merged commit db0d74e into AlexsLemonade:master Dec 17, 2019