This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Include neutral changes in cnv consensus .seg file #476

Merged
69 commits
3f55855
add to Snakefile
Dec 18, 2019
9a923a3
resolve conflict
Dec 19, 2019
d38289c
Merge remote-tracking branch 'upstream/master'
Jan 4, 2020
3a20aa0
updating fork
Jan 4, 2020
d3d6431
Merge remote-tracking branch 'upstream/master'
Jan 6, 2020
305bbbf
changed output path and name
Jan 6, 2020
eec2ffc
update Snakefile to master
Jan 10, 2020
5c98fda
implement segmean
Jan 13, 2020
cc4eff6
implement segmean
Jan 14, 2020
fa23995
add result file
Jan 14, 2020
6fc6b7a
resolve
Jan 14, 2020
902abbb
add result files
Jan 14, 2020
8bf98d3
add trailing line
Jan 14, 2020
264798c
fix .py
Jan 14, 2020
96490f3
change Snakefile comment
Jan 14, 2020
b242642
change README.md
Jan 14, 2020
9df166b
change README.md
Jan 14, 2020
5d2fd04
Updates to file organization
jashapiro Jan 14, 2020
d5d2a72
Merge branch 'jashapiro/reorg-cnv-consensus' into jashapiro/generate-…
jashapiro Jan 14, 2020
e4c66b4
add alternative segdup generation
jashapiro Jan 14, 2020
44eb15f
Updates to blacklist generation
jashapiro Jan 15, 2020
253ff4b
Add IG regions
jashapiro Jan 16, 2020
88385f5
Add step to potentially fix overlapping dup del segments.
jashapiro Jan 21, 2020
92a08e8
Notebook to look at consensus calls for overlaps
jashapiro Jan 21, 2020
1bb834c
Add overlap pruning
jashapiro Jan 21, 2020
8ad9ef8
Update output files
jashapiro Jan 21, 2020
86072fe
update readme
jashapiro Jan 21, 2020
3047346
Merge branch 'master' into jashapiro/fix_cnv_overlaps
jashapiro Jan 21, 2020
4f3f1ef
Add telomere definition file
jashapiro Jan 21, 2020
acee89e
Update blacklist generation script
jashapiro Jan 22, 2020
3cd5d56
Remove accidentally included notebook
jashapiro Jan 22, 2020
9a06e5c
Tried to clarify complicated bedtools step.
jashapiro Jan 22, 2020
9acf2c5
Update analyses/copy_number_consensus_call/scripts/remove_dup_NULL_ov…
jashapiro Jan 22, 2020
9a38ace
Update analyses/copy_number_consensus_call/scripts/remove_dup_NULL_ov…
jashapiro Jan 22, 2020
19e30da
Add more clarifying comments
jashapiro Jan 22, 2020
4030c83
Merge jashapiro/fix_cnv_overlaps
jashapiro Jan 22, 2020
8eb447e
Merge remote-tracking branch 'upstream/master' into jashapiro/generat…
jashapiro Jan 22, 2020
8c71b0b
Add full exclusion list and remove outdated files
jashapiro Jan 22, 2020
e7a350e
Update readmes
jashapiro Jan 22, 2020
82b869e
Updated output files.
jashapiro Jan 22, 2020
673947f
Re-add previous blacklist
jashapiro Jan 22, 2020
7cfeec1
Add chromosome lengths file
jashapiro Jan 22, 2020
5208b9e
Create file of neutral regions
jashapiro Jan 22, 2020
a577076
Use hg.38.chrom.sizes
jashapiro Jan 22, 2020
d433218
More descriptive excluded file name
jashapiro Jan 22, 2020
31518c6
Merge branch 'jashapiro/generate-cnv-blacklist' into jashapiro/fill-s…
jashapiro Jan 22, 2020
737e165
Update filename
jashapiro Jan 22, 2020
4ba776c
Sort chromosomes and remove alt from callable.
jashapiro Jan 22, 2020
7c8a26a
Fix sed command
jashapiro Jan 22, 2020
496ba0b
Finish the rule to combine neutral regions.
jashapiro Jan 22, 2020
9ed7088
Add output of bad callers
jashapiro Jan 23, 2020
73806ff
Bad caller summary notebook
jashapiro Jan 23, 2020
6393c44
Add output of neutral segments to the seg file
jashapiro Jan 23, 2020
da9723f
remove working notebooks
jashapiro Jan 23, 2020
82b3c37
Bug fixes
jashapiro Jan 23, 2020
ad69e70
Unset X and Y copy number calls
jashapiro Jan 23, 2020
b9be61f
Update README
jashapiro Jan 24, 2020
458dcbc
Add callable regions to analyses/README.md
jashapiro Jan 24, 2020
6b20cf2
Merge remote-tracking branch 'upstream/master' into jashapiro/fill-se…
jashapiro Jan 24, 2020
b4fca42
Simplify output file description in readme
jashapiro Jan 24, 2020
c7c0a95
Simplify file reading
jashapiro Jan 24, 2020
74f9be8
comment out status message
jashapiro Jan 24, 2020
e44262d
Move segfile step into snakemake
jashapiro Jan 24, 2020
710f51a
Fix filename in snakemake
jashapiro Jan 24, 2020
b693549
Merge upstream/master
jashapiro Jan 26, 2020
79e10a7
Update results.
jashapiro Jan 26, 2020
36cd4f5
Update scratch dir handling
jashapiro Jan 26, 2020
1d2987c
Update analyses/copy_number_consensus_call/scripts/bed_to_segfile.R
jashapiro Jan 27, 2020
a1ffa03
remove unused option.
jashapiro Jan 27, 2020
2 changes: 1 addition & 1 deletion analyses/README.md
@@ -16,7 +16,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| [`cnv-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-comparison) | Earlier version of SEG files | *Deprecated*; compared earlier version of the CNV methods. | N/A
| [`collapse-rnaseq`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/collapse-rnaseq) | `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` <br> `gencode.v27.primary_assembly.annotation.gtf.gz` | Collapses RSEM FPKM matrices such that gene symbols are de-duplicated. | `results/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` (included in data download; too large for tracking via GitHub)
| [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds` <br> `pbta-gene-expression-rsem-tpm.stranded.rds` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-sv-manta.tsv.gz` | Produces consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) and a set of excluded regions where CNV calls are not made | `results/cnv_consensus.tsv` <br> `results/pbta-cnv-consensus.seg` <br> `ref/cnv_excluded_regions.bed`
| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-sv-manta.tsv.gz` | Produces consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) and a set of excluded regions where CNV calls are not made | `results/cnv_consensus.tsv` <br> `results/pbta-cnv-consensus.seg` <br> `ref/cnv_excluded_regions.bed` <br> `ref/cnv_callable.bed`
| [`create-subset-files`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/create-subset-files) | All files | This module contains the code to create the subset files used in continuous integration | All subset files for continuous integration
| [`focal-cn-file-preparation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` | Maps from copy number variant caller segments to gene identifiers; will eventually be updated to use consensus copy number calls ([#186](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/186))| `results/cnvkit_annotated_cn_autosomes.tsv.bz2` <br> `results/cnvkit_annotated_cn_x_and_y.tsv.bz2` <br> `results/controlfreec_annotated_cn_autosomes.tsv.bz2` <br> `results/controlfreec_annotated_cn_x_and_y.tsv.bz2`
| [`fusion_filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | `pbta-fusion-arriba.tsv.gz` <br> `pbta-fusion-starfusion.tsv.gz` | Standardizes, filters, and prioritizes fusion calls | `results/pbta-fusion-putative-oncogenic.tsv` <br> `results/pbta-fusion-recurrent-fusion-byhistology.tsv` <br> `results/pbta-fusion-recurrent-fusion-bysample.tsv` (included in data download)
45 changes: 30 additions & 15 deletions analyses/copy_number_consensus_call/README.md
@@ -10,17 +10,12 @@ This analysis uses information from the following files generated from the 3 cal
* `pbta-cnv-controlfreec.tsv.gz`
* `pbta-sv-manta.tsv.gz`

The analysis produces an output file that includes the original calls used for each consensus call:
The analysis produces the following output files:

* `results/cnv_consensus.tsv`

A segfile for downstream processing:

* `results/pbta-cnv-consensus.seg`

And a bed file of regions that were excluded from calls (see step 7)

* `ref/cnv_excluded_regions.bed`
* `results/cnv_consensus.tsv`: A tab-separated file of consensus copy number variants, including the original calls used for each consensus call
* `results/pbta-cnv-consensus.seg`: A `.seg` formatted file for downstream processing
* `ref/cnv_excluded_regions.bed`: A `.bed` file of error-prone regions that were filtered from copy number calls
* `ref/cnv_callable.bed`: A `.bed` file of regions considered "callable" by the analysis pipeline

## Running the pipeline

@@ -29,7 +24,24 @@ Go to OpenPBTA-analysis/analyses/copy_number_consensus_call and run `bash run_co

## Methods

This pipeline revolves around the use of Snakemake to run analysis for each patient sample. The overview of the steps are as followed:
### Assayed Regions

Regions of the genome with a high potential for error are first defined by merging the set of telomeric, centromeric, and heterochromatic regions with regions around immunoglobulins and segmental duplications.
The input files for this step are described in `scripts/prepare_blacklist_files.sh` and include:

* `ref/centromeres.bed`
* `ref/heterochromatin.bed`
* `ref/immunoglobulin_regions.bed`
* `ref/segmental_dups.bed`
* `ref/telomeres.bed`

The final set of merged excluded regions is placed in the file `ref/cnv_excluded_regions.bed`.

In addition, a file of the genomic regions that we deem "callable" is created at `ref/cnv_callable.bed` as the complement of the excluded regions, after removing exclusions smaller than 200kb.
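
The complement step described above can be sketched in a few lines of Python. This is an illustration only, not the pipeline's own code (the module's scripts do this work with shell tools): the function name `callable_regions`, the in-memory interval format, and the reading of "200kb" as 200,000 bp with exclusions of exactly that size kept are all assumptions made for the example.

```python
from collections import defaultdict

# Assumption: "200kb" is read as 200,000 bp; exclusions of exactly this size are kept.
MIN_EXCLUSION = 200_000


def callable_regions(chrom_sizes, excluded, min_exclusion=MIN_EXCLUSION):
    """Return (chrom, start, end) intervals not covered by large exclusions.

    chrom_sizes: dict mapping chromosome name -> length in bp
    excluded: iterable of (chrom, start, end), 0-based half-open (BED style)
    Exclusions shorter than min_exclusion are ignored before taking the complement.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in excluded:
        if end - start >= min_exclusion:
            by_chrom[chrom].append((start, end))

    result = []
    for chrom, length in chrom_sizes.items():
        cursor = 0  # first position not yet assigned to any interval
        for start, end in sorted(by_chrom[chrom]):
            if start > cursor:
                result.append((chrom, cursor, start))
            cursor = max(cursor, end)
        if cursor < length:
            result.append((chrom, cursor, length))
    return result
```

With a hypothetical 1 Mb chromosome and exclusions at 0-250,000, 500,000-550,000, and 700,000-1,000,000, the 50 kb exclusion is dropped and the single callable interval is 250,000-700,000.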

### Consensus CNV creation

The per-sample pipeline uses Snakemake to run the analysis for each patient sample. An overview of the steps follows:

1) Parse through the 3 input files and put CNVs of the **same caller and sample** in the same files.
2) Remove any sample/caller combination files with **more than 2500** CNVs called.
@@ -43,15 +55,18 @@ This pipeline revolves around the use of Snakemake to run analysis for each pati
9) Reformat the columns of the files (so the information is easier to read)
10) **Call consensus** by comparing CNVs from 2 call methods at a time.

Since there are 3 callers, there were 3 comparisons: `manta-cnvkit`, `manta-freec`, and `cnvkit-freec`. If a CNV from 1 caller **overlaps 50% or more** with at least 1 CNV from another caller, the common region of the overlapping CNV would be the new CONSENSUS CNV.
Since there are 3 callers, there were 3 comparisons: `manta-cnvkit`, `manta-freec`, and `cnvkit-freec`. If a CNV from 1 caller **overlaps 50% or more** with at least 1 CNV from another caller, the common region of the overlapping CNV would be the new CONSENSUS CNV.

11) **Sort and merge** the CNVs from the comparison pairs (`manta-cnvkit`, `manta-freec`, and `cnvkit-freec`) into 1 file
12) Resolve overlapping segments where duplications are embedded within larger deletion segments, or deletions within duplications.
13) After every sample's consensus CNVs have been called, **combine all merged files** from step 10 and output to `results/cnv_consensus.tsv`
14) The `results/cnv_consensus.tsv` is translated into a `pbta-cnv-consensus.seg` file in the same format as `pbta-cnv-cnvkit.seg.gz`.
When a consensus CNV contains from multiple source CNV segments, we take the mean of the CNVkit `seg.mean` values from the source segments, weighted by segment length.
If no CNVkit CNV was included, the value for this column is NA.
14) The `results/cnv_consensus.tsv` is translated into a `results/pbta-cnv-consensus.seg` file in the same format as `pbta-cnv-cnvkit.seg.gz`, including all samples where at least two callers passed quality filtering.
When a consensus segment is derived from multiple source segments, we take the mean of the CNVkit `seg.mean` values from the source segments, weighted by segment length.
If no CNVkit variant was included, the value for this column is NA.
The `copy.num` column is the weighted median of CNVkit segment values where they exist, or Control-FREEC values in the absence of CNVkit data.
Because some software (notably GISTIC) requires all samples to have the same regions called, the copy number variants from `cnv_consensus.tsv` are supplemented with "neutral" segments where no call was made.
These include all non-variant regions present in `ref/cnv_callable.bed`.
The neutral regions are assigned copy.num 2, except on chrX and chrY, where the copy number is left NA.
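
Two of the steps above, the 50% overlap rule in step 10 and the length-weighted `seg.mean` in step 14, can be illustrated with a minimal Python sketch. The function names and tuple layouts are hypothetical and are not taken from the pipeline's scripts. The sketch also assumes the 50% threshold is measured against the first call's length only; the text does not say whether the real rule is one-sided or reciprocal.

```python
def consensus_region(a, b, min_fraction=0.5):
    """Return the shared (start, end) of two same-chromosome calls, or None.

    a, b: (start, end) tuples, half-open coordinates.
    Assumption: the overlap fraction is computed relative to call `a` only.
    """
    start, end = max(a[0], b[0]), min(a[1], b[1])
    if end <= start:
        return None  # the calls do not overlap at all
    if (end - start) / (a[1] - a[0]) >= min_fraction:
        return (start, end)
    return None


def weighted_seg_mean(segments):
    """Length-weighted mean of seg.mean over (start, end, seg_mean) tuples.

    Returns None (standing in for NA) when no CNVkit segment contributed.
    """
    total = sum(end - start for start, end, _ in segments)
    if total == 0:
        return None
    return sum((end - start) * mean for start, end, mean in segments) / total
```

For example, `consensus_region((100, 200), (150, 300))` keeps the shared region `(150, 200)` because half of the first call is covered, and `weighted_seg_mean([(0, 100, 1.0), (100, 400, -1.0)])` gives `-0.5` because the longer segment carries three times the weight.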

## Example Output File
