This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Reproducing copy number excluded regions #438

Closed
jashapiro opened this issue Jan 15, 2020 · 1 comment · Fixed by #467
Labels
cnv Related to or requires CNV data data in progress Someone is working on this issue, but feel free to propose an alternative approach! updated analysis

Comments

@jashapiro
Member

What data file(s) does this issue pertain to?

https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/copy_number_consensus_call/src/scripts/IGLL_telo_centromeric_region.txt

Put your question or report your issue here.

I am trying to reproduce the generation of the IGLL_telo_centromeric_region.txt file that is used for Copy number consensus generation #128. This file lists regions that are excluded from analysis because of high error rates. The telomeres and centromeres can be reproduced from UCSC data files, but I am confused by the immunoglobulin regions. The documentation points to http://penncnv.openbioinformatics.org/en/latest/misc/faq/, but it is not clear where the list of those regions comes from. Moreover, the regions defined there are for hg18:

chr22:20715572-21595082
chr14:105065301-106352275
chr2:88937989-89411302
chr14:21159897-22090937

However, the regions in the IGLL_telo_centromeric_region.txt file do not seem to correspond to a liftOver of those regions to hg38.

Applying liftOver hg18->hg38, I get the following regions:

chr22:22031174-22922910
chr14:105527919-106873021
chr2:88857361-89330430
chr14:21621904-22552154

The nearest equivalent regions in IGLL_telo_centromeric_region.txt seem to be these:

chr22:21990603-22947816
chr2:88854372-89330679
chr14:21676543-22556766

(There are only three here, presumably because the other chr14 region falls near a telomere and is excluded that way?)
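One way to quantify the mismatch is to compare each lifted region with the overlapping region in the file. The following is a minimal Python sketch that uses only the coordinates quoted above; the helper names `parse_region` and `compare` are illustrative and are not part of the repository.

```python
import re

def parse_region(text):
    """Parse 'chr:start-end' into (chrom, start, end)."""
    match = re.fullmatch(r"(chr\w+):(\d+)-(\d+)", text.strip())
    if match is None:
        raise ValueError(f"Unrecognized region: {text!r}")
    chrom, start, end = match.groups()
    return chrom, int(start), int(end)

# Coordinates copied from this issue (hg38).
lifted = ["chr22:22031174-22922910", "chr14:105527919-106873021",
          "chr2:88857361-89330430", "chr14:21621904-22552154"]
in_file = ["chr22:21990603-22947816", "chr2:88854372-89330679",
           "chr14:21676543-22556766"]

def compare(lifted_regions, file_regions):
    """For each lifted region, report offsets to the overlapping file region."""
    parsed_file = [parse_region(r) for r in file_regions]
    results = []
    for region in lifted_regions:
        chrom, start, end = parse_region(region)
        hits = [(fs, fe) for fc, fs, fe in parsed_file
                if fc == chrom and fs <= end and start <= fe]
        if hits:
            fs, fe = hits[0]
            # Positive offsets mean the file region extends beyond the lifted one.
            results.append((region, start - fs, fe - end))
        else:
            # No overlapping region on the same chromosome.
            results.append((region, None, None))
    return results

for region, start_offset, end_offset in compare(lifted, in_file):
    print(region, start_offset, end_offset)
```

With these inputs, the chr22 and chr2 file regions fully contain the lifted ones, while the chr14 file region starts after the lifted start. That pattern looks more like independently defined boundaries than a direct liftOver of the PennCNV list.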

In addition, IGLL_telo_centromeric_region.txt includes the region chr21:3100000-7000000, which is listed as a stalk (acrocentric arm) in the UCSC cytoband file. No other stalk regions are excluded, so I was not sure why this one was.
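To see which stalk regions the cytoband file lists, one could filter on the stain column. This is a sketch that assumes the usual five-column UCSC cytoBand layout (chrom, start, end, band name, stain). Apart from the chr21 interval quoted above, the inline rows and band names are placeholders for illustration, and `stalk_regions` is a hypothetical helper.

```python
import csv
import io

# Illustrative rows only: the chr21 interval is the one quoted in this issue;
# band names and the chrN rows are placeholders, not real annotation.
SAMPLE_CYTOBAND = (
    "chr21\t3100000\t7000000\tbandA\tstalk\n"
    "chrN\t0\t1000000\tbandB\tgneg\n"
    "chrN\t1000000\t2000000\tbandC\tstalk\n"
)

def stalk_regions(handle):
    """Return (chrom, start, end) for rows whose stain column is 'stalk'."""
    reader = csv.reader(handle, delimiter="\t")
    return [(row[0], int(row[1]), int(row[2]))
            for row in reader
            if len(row) >= 5 and row[4] == "stalk"]

stalks = stalk_regions(io.StringIO(SAMPLE_CYTOBAND))
for chrom, start, end in stalks:
    print(f"{chrom}:{start}-{end}")
```

To run this against the real file, pass an open text handle for the downloaded cytoband file instead of the sample string (decompressing first if the download is gzipped).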

Can @fingerfen or @xiehongbo provide some additional information?

@jashapiro jashapiro added the data label Jan 15, 2020
@jashapiro
Member Author

From @xiehongbo via email:

Kai used transcription start and end sites as his coordinates. I actually inspected the repeat elements and extended the regions to cover more low-complexity regions. I also have more regions that are known to have complex genomic features. It is all about estimation. Also, those regions matter with SNP arrays.

In the hg18 build, here are the 6 regions we used:

chr2:88935000-89418000 IgKappa
chr6:29775000-33225000 HLA*
chr7:141636000-142225000 TCRbeta
chr14:21214600-22095500 TCRalpha
chr14:105046000-106368585 IgHeavy
chr22:20675000-21620000 IgLambda

Using hg18->hg38 liftover, these correspond to:

chr2:88854372-89330679
chr6:29699244-33149245
chr14:21676543-22556766
chr14:105508618-106881350
chr22:21990603-22947816

These are the regions found in the provided file.
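The hg18 list can be turned into liftOver input, and the lifted list can then be checked against it. Below is a rough Python sketch of that bookkeeping. It assumes the coordinates are 1-based and inclusive (so the BED start is shifted down by one), which is not stated in this thread, and `parse_line` and `to_bed_line` are illustrative helpers. Running liftOver itself also requires a chain file, which is not shown here.

```python
from collections import Counter

# Region lists copied from this thread.
HG18_REGIONS = """\
chr2:88935000-89418000 IgKappa
chr6:29775000-33225000 HLA*
chr7:141636000-142225000 TCRbeta
chr14:21214600-22095500 TCRalpha
chr14:105046000-106368585 IgHeavy
chr22:20675000-21620000 IgLambda
"""

HG38_LIFTED = """\
chr2:88854372-89330679
chr6:29699244-33149245
chr14:21676543-22556766
chr14:105508618-106881350
chr22:21990603-22947816
"""

def parse_line(line):
    """Split 'chr:start-end [label]' into (chrom, start, end, label)."""
    region, _, label = line.strip().partition(" ")
    chrom, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end), label

def to_bed_line(chrom, start, end, label, one_based=True):
    """Format a BED line; one_based=True is an assumption about the input."""
    bed_start = start - 1 if one_based else start
    return f"{chrom}\t{bed_start}\t{end}\t{label or '.'}"

hg18 = [parse_line(l) for l in HG18_REGIONS.splitlines() if l.strip()]
hg38 = [parse_line(l) for l in HG38_LIFTED.splitlines() if l.strip()]

# BED-formatted input that could be written to a file for liftOver.
bed_lines = [to_bed_line(*r) for r in hg18]

# Count regions per chromosome to spot inputs without a lifted counterpart.
before = Counter(r[0] for r in hg18)
after = Counter(r[0] for r in hg38)
missing = {c: n - after[c] for c, n in before.items() if n > after[c]}

print("\n".join(bed_lines))
print("Chromosomes with fewer lifted regions:", missing)
```

With the values in this thread, the only chromosome with fewer lifted regions than inputs is chr7 (TCRbeta), so that region may be worth checking separately.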

I will update the analysis to include this as a starting point.

@jaclyn-taroni jaclyn-taroni added snv Related to or requires SNV data cnv Related to or requires CNV data updated analysis in progress Someone is working on this issue, but feel free to propose an alternative approach! and removed snv Related to or requires SNV data labels Jan 18, 2020
jashapiro added a commit to jashapiro/OpenPBTA-analysis that referenced this issue Jan 22, 2020
These regions are the ones defined by @hongboxie here: AlexsLemonade#438 (comment)
Converted from hg18 to hg38
jaclyn-taroni added a commit that referenced this issue Jan 25, 2020
* add to Snakefile

* updating fork

* changed output path and name

* implement segmean

* implement segmean

* add result file

* add result files

* add trailing line

* fix .py

* change Snakefile comment

* change README.md

* change README.md

* Updates to file organization

Removing `src` directory to unnest `scripts` and adding `ref` directory for genomic info files.

* add alternative segdup generation

Link and script to process downloaded file for segmental duplications.

* Updates to blacklist generation

* Add IG regions

These regions are the ones defined by @hongboxie here: #438 (comment)
Converted from hg18 to hg38

* Add step to potentially fix overlapping dup del segments.

* Notebook to look at consensus calls for overlaps

* Add overlap pruning

* Update output files

Note that ordering has changed, but the actual differences between these files should be relatively small other than that.

There are changes to the cnv_consensus.tsv file where segments that are not contained within the defined CNV are discarded but might have been retained before.

* update readme

* Add telomere definition file

* Update blacklist generation script

* Remove accidentally included notebook

* Tried to clarify complicated bedtools step.

* Update analyses/copy_number_consensus_call/scripts/remove_dup_NULL_overlap_entries.py

Co-Authored-By: Candace Savonen <cansav09@gmail.com>

* Update analyses/copy_number_consensus_call/scripts/remove_dup_NULL_overlap_entries.py

Co-Authored-By: Candace Savonen <cansav09@gmail.com>

* Add more clarifying comments

* Add full exclusion list and remove outdated files

* Update readmes

* Updated output files.

* Re-add previous blacklist

* More descriptive excluded file name

* Update filename

Co-authored-by: Candace Savonen <cansav09@gmail.com>
Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>
jaclyn-taroni added a commit that referenced this issue Jan 27, 2020
* add to Snakefile

* updating fork

* changed output path and name

* implement segmean

* implement segmean

* add result file

* add result files

* add trailing line

* fix .py

* change Snakefile comment

* change README.md

* change README.md

* Updates to file organization

Removing `src` directory to unnest `scripts` and adding `ref` directory for genomic info files.

* add alternative segdup generation

Link and script to process downloaded file for segmental duplications.

* Updates to blacklist generation

* Add IG regions

These regions are the ones defined by @hongboxie here: #438 (comment)
Converted from hg18 to hg38

* Add step to potentially fix overlapping dup del segments.

* Notebook to look at consensus calls for overlaps

* Add overlap pruning

* Update output files

Note that ordering has changed, but the actual differences between these files should be relatively small other than that.

There are changes to the cnv_consensus.tsv file where segments that are not contained within the defined CNV are discarded but might have been retained before.

* update readme

* Add telomere definition file

* Update blacklist generation script

* Remove accidentally included notebook

* Tried to clarify complicated bedtools step.

* Update analyses/copy_number_consensus_call/scripts/remove_dup_NULL_overlap_entries.py

Co-Authored-By: Candace Savonen <cansav09@gmail.com>

* Update analyses/copy_number_consensus_call/scripts/remove_dup_NULL_overlap_entries.py

Co-Authored-By: Candace Savonen <cansav09@gmail.com>

* Add more clarifying comments

* Add full exclusion list and remove outdated files

* Update readmes

* Updated output files.

* Re-add previous blacklist

* Add chromosome lengths file

* Create file of neutral regions

* Use hg.38.chrom.sizes

* More descriptive excluded file name

* Update filename

* Sort chromosomes and remove alt from callable.

* Fix sed command

* Finish the rule to combine neutral regions.

* Add output of bad callers

* Bad caller summary notebook

* Add output of neutral segments to the seg file

Neutral segments (copy number 2) are included if they fall within a "callable region" which is one not covered by a large excluded region.

When we add these back, we still exclude specimens where more than two callers 'failed' with high numbers of segments.

* remove working notebooks

* Bug fixes

* Unset X and Y copy number calls

* Update README

* Add callable regions to analyses/README.md

* Simplify output file description in readme

* Simplify file reading

We don't need data types here, so keeping everything as strings simplifies the code and removes potential errors from unexpected conversions from int to float.

* comment out status message

* Move segfile step into snakemake

* Fix filename in snakemake

* Update results.

* Update scratch dir handling

Put all intermediate files in a defined scratch sub directory.

* Update analyses/copy_number_consensus_call/scripts/bed_to_segfile.R

Co-Authored-By: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>

* remove unused option.

Co-authored-by: Candace Savonen <cansav09@gmail.com>
Co-authored-by: Jaclyn Taroni <jaclyn.n.taroni@gmail.com>