This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Generate CNV exclusion list #467

Merged

Conversation

jashapiro
Member

@jashapiro jashapiro commented Jan 22, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

The analysis of CNV consensus files takes advantage of a file of regions that are to be excluded due to expected (and previously observed) high levels of false positives. These include regions such as telomeres, centromeres, and known segmental duplications.

What was your approach?

I split up the various categories of excluded regions into separate files for ease of identification and modification. These are:

  • ref/centromeres.bed
  • ref/heterochromatin.bed
  • ref/immunoglobulin_regions.bed
  • ref/segmental_dups.bed
  • ref/telomeres.bed

The origins of these files, with code where appropriate, are documented in scripts/prepare_blacklist_files.sh.

The files are then merged with a new rule in the Snakefile to generate ref/cnv_excluded.bed, which can be used downstream.
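As a rough illustration of what the merge step does conceptually, here is a minimal pure-Python sketch. It is not the actual Snakefile rule, which may rely on a different tool; the function names `read_bed` and `merge_bed_intervals` are hypothetical, and the sketch assumes plain tab-separated BED files where only the first three columns matter.

```python
from collections import defaultdict


def read_bed(path):
    """Yield (chrom, start, end) from a tab-separated BED file.

    Assumes only the first three columns are needed; header-like
    lines (comments, track, browser) are skipped.
    """
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.rstrip("\n").split("\t")
            yield fields[0], int(fields[1]), int(fields[2])


def merge_bed_intervals(intervals):
    """Merge overlapping or directly adjacent intervals per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    # Note: chromosomes are sorted lexicographically here (chr10 before chr2),
    # not in karyotype order.
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Overlapping or touching: extend the current interval.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        if cur_start is not None:
            merged.append((chrom, cur_start, cur_end))
    return merged
```

With this sketch, the five category files could be combined by chaining `read_bed` over each path and passing the result to `merge_bed_intervals` before writing the output as tab-separated lines.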

What GitHub issue does your pull request address?

closes #438

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

This file differs slightly from the previously included file, provided by @fingerfen, mostly, I think, in the handling of segmental duplication regions. There seem to be some broader regions that are excluded, but I could not find references for why those had been excluded.

Note, however, that there do not appear to be major changes in the final CNV regions, though there are some effects at the margins.

Is there anything that you want to discuss further?

Do we need to programmatically generate every region, or is it okay that the telomeres and IG regions are simply included as their own files?

How concerned should we be about changes in the final set of calls resulting from this change? Can we adjust the generation to better match the previous set (included here as ref/bad_chromosomal_seg_merged.bed)?
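One way to put a number on how far the two exclusion sets diverge is to count the bases covered by only one set versus both. The following is a minimal sketch under stated assumptions, not part of this pull request: `compare_exclusions` is a hypothetical helper that takes two lists of (chrom, start, end) tuples, for example parsed from ref/cnv_excluded.bed and ref/bad_chromosomal_seg_merged.bed.

```python
from collections import defaultdict


def _merge_by_chrom(intervals):
    """Return {chrom: [[start, end], ...]} with overlapping intervals merged."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, spans in by_chrom.items():
        out = []
        for start, end in sorted(spans):
            if out and start <= out[-1][1]:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = out
    return merged


def compare_exclusions(new, old):
    """Count bases covered only by `new`, only by `old`, and by both."""
    new_m, old_m = _merge_by_chrom(new), _merge_by_chrom(old)
    total_new = sum(e - s for spans in new_m.values() for s, e in spans)
    total_old = sum(e - s for spans in old_m.values() for s, e in spans)

    shared = 0
    for chrom in new_m.keys() & old_m.keys():
        a, b = new_m[chrom], old_m[chrom]
        i = j = 0
        # Two-pointer sweep over the sorted, merged intervals.
        while i < len(a) and j < len(b):
            lo = max(a[i][0], b[j][0])
            hi = min(a[i][1], b[j][1])
            if lo < hi:
                shared += hi - lo
            # Advance whichever interval ends first.
            if a[i][1] < b[j][1]:
                i += 1
            else:
                j += 1

    return {
        "new_only": total_new - shared,
        "old_only": total_old - shared,
        "shared": shared,
    }
```

Large `new_only` or `old_only` counts on particular chromosomes would point to the regions worth inspecting by hand before deciding whether the generation needs adjusting.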

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

Duong and others added 30 commits December 18, 2019 03:51
Removing `src` directory to unnest `scripts` and adding `ref` directory for genomic info files.
Link and script to process downloaded file for segmental duplications.
These regions are the ones defined by @hongboxie here: AlexsLemonade#438 (comment)
Converted from hg18 to hg38
Note that ordering has changed, but the actual differences between these files should be relatively small other than that.

There are changes to the cnv_consensus.tsv file where segments that are not contained within the defined CNV are discarded but might have been retained before.
@jashapiro jashapiro marked this pull request as ready for review January 22, 2020 19:35
@jaclyn-taroni
Member

Do we need to programmatically generate every region, or is it okay that the telomeres and IG regions are simply included as their own files?

It took me a while to figure out that the origin of these files was described as comments in the shell script. So my vote would be yes, let's programmatically generate these.

@jashapiro
Member Author

Do we need to programmatically generate every region, or is it okay that the telomeres and IG regions are simply included as their own files?

It took me a while to figure out that the origin of these files was described as comments in the shell script. So my vote would be yes, let's programmatically generate these.

I can do the telomeres... I have no idea how to get the IG regions, unfortunately.

@jashapiro
Member Author

In my work on #476, I have substantially updated the README to include more information on the creation of exclusion regions, hopefully making it easier to find. If those updates are a good start, it might make sense to make further changes in that branch?

Member

@jaclyn-taroni jaclyn-taroni left a comment

LGTM :shipit: - I agree that the documentation changes on #476 look good and any changes can be continued over there.

@jaclyn-taroni jaclyn-taroni merged commit 844a9e4 into AlexsLemonade:master Jan 25, 2020
@jashapiro jashapiro deleted the jashapiro/generate-cnv-blacklist branch April 11, 2021 18:53
Development

Successfully merging this pull request may close these issues.

Reproducing copy number excluded regions
2 participants