Generate CNV exclusion list #467
Removing `src` directory to unnest `scripts` and adding `ref` directory for genomic info files.
Link and script to process downloaded file for segmental duplications.
These regions are the ones defined by @hongboxie here: AlexsLemonade#438 (comment). Converted from hg18 to hg38.
Note that ordering has changed, but the actual differences between these files should be relatively small other than that. There are changes to the cnv_consensus.tsv file where segments that are not contained within the defined CNV are discarded but might have been retained before.
…erlap_entries.py Co-Authored-By: Candace Savonen <cansav09@gmail.com>
…erlap_entries.py Co-Authored-By: Candace Savonen <cansav09@gmail.com>
It took me a while to figure out that the origin of these files was described in comments in the shell script. So my vote would be yes, let's programmatically generate these.
I can do the telomeres... I have no idea how to get the IG regions, unfortunately.
In my work on #476, I have substantially updated the README to include more information on the creation of exclusion regions, hopefully making it easier to find. If those updates are a good start, it might make sense to make further changes in that branch?
LGTM - I agree that the documentation changes on #476 look good and any changes can be continued over there.
Purpose/implementation Section
What scientific question is your analysis addressing?
The analysis of CNV consensus files takes advantage of a file of regions that are to be excluded due to expected (and previously observed) high levels of false positives. These include regions such as telomeres, centromeres, and known segmental duplications.
What was your approach?
I split up the various categories of excluded regions into separate files for ease of identification and modification. These are:
ref/centromeres.bed
ref/heterochromatin.bed
ref/immunoglobulin_regions.bed
ref/segmental_dups.bed
ref/telomeres.bed
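As a rough illustration of how per-category BED files like these could be combined into a single exclusion list, here is a minimal Python sketch. It is not the rule used in this PR's `Snakefile`; the function names `read_bed` and `merge_intervals` are hypothetical, and the coordinates in the usage example are made up. It assumes standard BED columns (chromosome, start, end) and merges overlapping or book-ended intervals per chromosome.

```python
from collections import defaultdict


def read_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples.

    Blank lines and comment/track/browser header lines are skipped.
    Only the first three columns are used.
    """
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def merge_intervals(intervals):
    """Merge overlapping or book-ended intervals on each chromosome.

    Chromosomes are emitted in plain string order (so "chr10" sorts
    before "chr2"); intervals within a chromosome are sorted by start.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps or touches the current block: extend it.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        if current_start is not None:
            merged.append((chrom, current_start, current_end))
    return merged


if __name__ == "__main__":
    # Toy inputs standing in for two category files (made-up coordinates).
    centromeres = ["chr1\t100\t200", "chr2\t50\t80"]
    seg_dups = ["# toy data", "chr1\t150\t300", "chr1\t400\t500"]
    for chrom, start, end in merge_intervals(
        read_bed(centromeres) + read_bed(seg_dups)
    ):
        print(f"{chrom}\t{start}\t{end}")
```

Keeping each category in its own file and merging at the end means a single category can be regenerated or dropped without touching the others.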
The origins of these files, with code where appropriate, are in `scripts/prepare_blacklist_files.sh`. The files are then merged with a new rule in the `Snakefile` to generate `ref/cnv_excluded.bed`, which can be used downstream.
What GitHub issue does your pull request address?
closes #438
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
This file differs slightly from the previously included file, provided by @fingerfen, mostly, I think, in the handling of segmental duplication regions. The previous file seems to exclude some broader regions, but I could not find references for why those had been excluded.
Note, however, that there do not appear to be major changes in the final CNV regions, though there are some effects at the margins.
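One way a reviewer might quantify how much two exclusion files differ is to count the bases covered by both versus by only one of them. The sketch below is an assumption-laden illustration, not something run for this PR: `total_bases` and `shared_bases` are hypothetical helpers, the intervals are made up, and each input list is assumed to be already merged (no overlaps within a list).

```python
from collections import defaultdict


def total_bases(intervals):
    """Sum interval lengths for (chrom, start, end) tuples."""
    return sum(end - start for _, start, end in intervals)


def shared_bases(a, b):
    """Count bases covered by both interval lists.

    Assumes each list is non-overlapping within itself. Uses a simple
    pairwise comparison per chromosome, which is fine for a sketch but
    slow for very large files.
    """
    b_by_chrom = defaultdict(list)
    for chrom, start, end in b:
        b_by_chrom[chrom].append((start, end))

    shared = 0
    for chrom, start, end in a:
        for b_start, b_end in b_by_chrom.get(chrom, []):
            overlap = min(end, b_end) - max(start, b_start)
            if overlap > 0:
                shared += overlap
    return shared


if __name__ == "__main__":
    # Made-up intervals standing in for the previous and new files.
    previous = [("chr1", 100, 300), ("chr2", 0, 50)]
    new = [("chr1", 150, 250), ("chr1", 280, 400)]
    both = shared_bases(previous, new)
    print("shared:", both)
    print("only in previous:", total_bases(previous) - both)
    print("only in new:", total_bases(new) - both)
```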
Is there anything that you want to discuss further?
Do we need to programmatically generate every region, or is it okay that the telomeres and IG regions are simply included as their own files?
How concerned should we be about changes in the final set of calls resulting from this change? Can we adjust the generation to better match the previous set (included here as `ref/bad_chromosomal_seg_merged.bed`)?
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Results
What types of results are included (e.g., table, figure)?
What is your summary of the results?
Documentation Checklist
`README` and it is up to date.
`analyses/README.md` and the entry is up to date.