This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

CNV consensus (6 of 6): Merge consensus files and name columns (#403)
* add to Snakefile

* updating fork

* add final consensus merging step

* add column naming

* add column naming

* add column naming

* change so that output files doesn't have extra indentation at the end

* change so that output files doesn't have extra indentation at the end

* changed output path and name

* changed output path and name

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* changed Snakemake, reduced redundancy

* changed Snakemake, reduced redundancy

* changed minor error

* update Snakefile

* add README.md

* Update README.md

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update README.md

* Update README.md

* Update README.md

* add result consensus file

* make changes to analyses/README.md

* Update analyses/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* cut the last column

* add the result file with out the last column

Co-authored-by: jashapiro <jashapiro@gmail.com>
2 people authored and jaclyn-taroni committed Jan 8, 2020
1 parent 21e0edb commit 7917a7f
Showing 5 changed files with 7,174 additions and 9 deletions.
2 changes: 1 addition & 1 deletion analyses/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| [`cnv-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-comparison) | Earlier version of SEG files | *Deprecated*; compared earlier version of the CNV methods. | N/A
| [`collapse-rnaseq`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/collapse-rnaseq) | `pbta-gene-expression-rsem-fpkm.polya.rds`, `pbta-gene-expression-rsem-fpkm.stranded.rds`, `gencode.v27.primary_assembly.annotation.gtf.gz` | Collapses RSEM FPKM matrices such that gene symbols are de-duplicated. | `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds`, `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` (included in data download)
| [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds`, `pbta-gene-expression-rsem-tpm.stranded.rds` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz`, `pbta-cnv-controlfreec.tsv.gz`, `pbta-sv-manta.tsv.gz` | *In progress*; will produce consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) | N/A
| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz`, `pbta-cnv-controlfreec.tsv.gz`, `pbta-sv-manta.tsv.gz` | Produces consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) | `cnv_consensus.tsv`
| [`create-subset-files`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/create-subset-files) | All files | This module contains the code to create the subset files used in continuous integration | All subset files for continuous integration
| [`focal-cn-file-preparation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation) | `pbta-cnv-cnvkit.seg.gz`, `pbta-cnv-controlfreec.tsv.gz`, `pbta-gene-expression-rsem-fpkm.polya.rds`, `pbta-gene-expression-rsem-fpkm.stranded.rds` | Maps from copy number variant caller segments to gene identifiers; will eventually be updated to use consensus copy number calls ([#186](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/186))| `cnvkit_annotated_cn_autosomes.tsv.gz`, `cnvkit_annotated_cn_x_and_y.tsv.gz`, `controlfreec_annotated_cn_autosomes.tsv.gz`, `controlfreec_annotated_cn_x_and_y.tsv.gz`
| [`fusion_filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | `pbta-fusion-arriba.tsv.gz`, `pbta-fusion-starfusion.tsv.gz` | Standardizes, filters, and prioritizes fusion calls | `pbta-fusion-putative-oncogenic.tsv`, `pbta-fusion-recurrent-fusion-byhistology.tsv`, `pbta-fusion-recurrent-fusion-bysample.tsv` (included in data download)
Expand Down
61 changes: 61 additions & 0 deletions analyses/copy_number_consensus_call/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
# Copy Number Consensus Call

## Overview

The PBTA data set contains CNVs called by three different callers: Manta, CNVkit, and Freec.
The goal is to combine these callers to reduce false positives and produce a final consensus list of CNVs.
This analysis uses information from the following files generated by the 3 callers:

* `pbta-cnv-cnvkit.seg.gz`
* `pbta-cnv-controlfreec.tsv.gz`
* `pbta-sv-manta.tsv.gz`

## Running the pipeline

To run the entire pipeline, make sure you have the latest release of the three input files mentioned in the Overview section.
Then go to `OpenPBTA-analysis/analyses/copy_number_consensus_call` and run `bash run_consensus_call.sh`.

## Methods

This pipeline revolves around the use of Snakemake to run the analysis for each patient sample. An overview of the steps is as follows:

1) Parse the 3 input files and put CNVs from the **same caller and sample** into the same file.
2) Remove any sample/caller combination files with **more than 2500** CNVs called.
We believe these to be noisy/poor-quality samples (this cutoff comes from what GISTIC uses to flag noisy samples).
3) Create a `config_snakemake.yaml` that contains all of the sample names to run the Snakemake pipeline.
4) Run the Snakemake pipeline to perform the analysis **per sample**.
5) Retain only CNVs that are over a certain **SIZE_CUTOFF** (default 3000 bp).
6) Retain only **significant** CNVs called by Freec (default p-value = 0.01).
7) Filter out any CNVs that overlap 50% or more with **IGLL, telomeric, centromeric, or seg_dup regions**.
8) Merge any CNVs of the same sample and call method if they **overlap or lie within 10,000 bp** of each other (we consider CNV calls within 10,000 bp to be the same CNV).
9) Reformat the columns of the files (so the information is easier to read).
10) **Call consensus** by comparing CNVs from 2 call methods at a time.

Since there are 3 callers, there are 3 comparisons: `manta-cnvkit`, `manta-freec`, and `cnvkit-freec`. If a CNV from one caller **overlaps 50% or more** with at least one CNV from another caller, the common region of the overlapping CNVs becomes the new CONSENSUS CNV.

11) **Sort and merge** the CNVs from the comparison pairs (`manta-cnvkit`, `manta-freec`, `cnvkit-freec`) into 1 file.
12) After every sample's consensus CNVs are called, **combine all merged files** from step 11 and output to `results/cnv_consensus.tsv`.
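As a rough illustration of the 50% overlap rule in step 10, the sketch below computes the common region of two CNV calls. This is a minimal, hypothetical sketch only; the pipeline's actual comparison is performed by bedtools and `compare_variant_calling_updated.py`, and the function name here is not part of the module.

```python
def consensus_interval(a, b, min_frac=0.5):
    """Return the common region of two CNV intervals (start, end) on the
    same chromosome if one call is covered >= min_frac by the overlap,
    else None. Illustrative sketch of the two-caller consensus rule."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    overlap = end - start
    if overlap <= 0:
        return None  # no overlap at all
    # the CNV from one caller must overlap 50% or more of its length
    if overlap / (a[1] - a[0]) >= min_frac or overlap / (b[1] - b[0]) >= min_frac:
        return (start, end)
    return None
```

For example, calls `(100, 200)` and `(150, 400)` share a 50 bp overlap, which covers half of the first call, so the common region `(150, 200)` becomes the consensus CNV.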

## Example Output File

```
chrom start end manta_CNVs cnvkit_CNVs freec_CNVs CNV_type Biospecimen file_names
chr11 771036 866778 NULL 770516:866778:3 771036:871536:3 DUP BS_007JTNB8 BS_007JTNB8.cnvkit_freec.dup.bed
chr13 99966948 99991872 NULL 99954829:99994557:3 99966948:99991872:3 DUP BS_007JTNB8 BS_007JTNB8.cnvkit_freec.dup.bed
chr14 103515996 103563240 NULL 103515996:103563363:3 103511784:103541532:3,103543140:103563240:3 DUP BS_007JTNB8 BS_007JTNB8.cnvkit_freec.dup.bed
```

* The 1st line of the file is the header, which contains the column names. There are 9 columns in total.
* The 2nd line is the first CNV of the file.
* Column 1 is the **consensus** CNV chromosome
* Column 2 is the **consensus** CNV start location
* Column 3 is the **consensus** CNV end location
* Columns 4, 5, and 6 contain the calls from Manta, CNVkit, and Freec that make up the **consensus** CNV described in columns 1, 2, and 3.
* i.e., if there is info in column 4, one or more CNVs called by Manta made up the current **consensus** CNV described in columns 1, 2, and 3.
* Columns 4, 5, and 6 have the following format: `START:END:COPY_NUMBER,START:END:COPY_NUMBER`
* Note that if there is more than one original CNV call corresponding to a given consensus CNV from a given caller, the information for each of the CNV calls will be comma separated.
* In the example output above, column 6 of line 4 contains `103511784:103541532:3,103543140:103563240:3`, which means 2 CNVs called by Freec helped make up the **consensus** CNV on line 4.
One has the start and end coordinates `103511784:103541532` **on the same chromosome** with a copy number of `3`, and the other has the coordinates `103543140:103563240` with a copy number of `3`.
* Column 7 is the CNV type. This will be either DUP or DEL, corresponding to duplications or deletions, respectively. Note that this does not describe the number of copies, only the direction of the copy number change.
* Column 8 is the sample name.
* Column 9 contains the names of the two-caller consensus files (`manta-cnvkit`, `manta-freec`, `cnvkit-freec`) that made up the final **consensus** CNV.
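The `START:END:COPY_NUMBER` fields described above can be parsed programmatically. A minimal sketch, assuming the field format shown in the example output (the function name is hypothetical, not part of this module):

```python
def parse_caller_calls(field):
    """Parse a manta_CNVs/cnvkit_CNVs/freec_CNVs field from
    cnv_consensus.tsv. Each call is START:END:COPY_NUMBER and multiple
    calls are comma separated; NULL means the caller made no call."""
    if field == "NULL":
        return []
    calls = []
    for call in field.split(","):
        start, end, copies = call.split(":")
        calls.append((int(start), int(end), int(copies)))
    return calls
```

For the Freec field on line 4 of the example, this would yield two calls, `(103511784, 103541532, 3)` and `(103543140, 103563240, 3)`.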
44 changes: 39 additions & 5 deletions analyses/copy_number_consensus_call/Snakefile
Original file line number Diff line number Diff line change
@@ -1,8 +1,5 @@
## Define the ending file(s) that we want
OUTPUT= expand("../../scratch/interim/{sample}.{combined_caller}.{dupdel}.bed",
sample=config["samples"],
combined_caller=["manta_cnvkit", "manta_freec", "cnvkit_freec"],
dupdel=["dup", "del"]),
OUTPUT = "results/cnv_consensus.tsv",

## Define the wildcards to use in the file names.
wildcard_constraints:
Expand Down Expand Up @@ -161,4 +158,41 @@ rule compare_cnv_methods:
"--sample {params.sample_name}"



rule combine_merge_paired_cnv:
input:
## Define the location of the input file
manta_cnvkit="../../scratch/interim/{sample}.manta_cnvkit.{dupdel}.bed",
manta_freec="../../scratch/interim/{sample}.manta_freec.{dupdel}.bed",
cnvkit_freec="../../scratch/interim/{sample}.cnvkit_freec.{dupdel}.bed"
output:
## Define the output files' names
merged="../../scratch/endpoints/{sample}.{dupdel}.merged.final.bed"
threads: 1
shell:
## Combine the input file, sort and output to one file
## Columns 4, 5, and 6 hold the original CNV calls from Manta, CNVkit, and Freec, respectively.
## We want to retain info in these columns when merging these files so we use COLLAPSE to keep the information in these columns
## Columns 7 and 8 are the CNVtype (DEL, DUP) and Sample_name, respectively.
## AT THIS POINT, these columns of the input files hold the same values, so we use DISTINCT to keep only the unique values of columns 7 and 8.
## As for column 9, this column holds the files that were merged to get a specific CNV. We want to keep all information here so we COLLAPSE it.
"cat {input.manta_cnvkit} {input.manta_freec} {input.cnvkit_freec} "
"| sort -k1,1 -k2,2n "
"| bedtools merge -c 4,5,6,7,8,9 -o collapse,collapse,collapse,distinct,distinct,collapse "
"> {output.merged}"
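To see what the `collapse` and `distinct` operators in the rule above do, here is a toy Python emulation of this `bedtools merge` invocation on coordinate-sorted rows (illustrative only; the pipeline calls bedtools itself, and the function name is hypothetical):

```python
def merge_rows(rows):
    """Toy emulation of `bedtools merge -c 4,5,6,7,8,9
    -o collapse,collapse,collapse,distinct,distinct,collapse`
    on coordinate-sorted 9-column rows from a single file."""
    merged = []
    for row in rows:
        chrom, start, end = row[0], row[1], row[2]
        # bedtools merges intervals that overlap (or are book-ended)
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            prev[2] = max(prev[2], end)
            for i in (3, 4, 5, 8):   # collapse: comma-join every value
                prev[i] += "," + row[i]
            for i in (6, 7):         # distinct: keep unique values only
                if row[i] not in prev[i].split(","):
                    prev[i] += "," + row[i]
        else:
            merged.append(list(row))
    return merged
```

Two overlapping rows of the same DUP for one sample would thus collapse into a single row whose caller columns carry all original calls, while the CNV type and sample name columns keep a single value.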


rule merge_all:
input:
## Take all of the del and dup files of ALL samples as input. If there are 200 samples, there will be 400 files total
bedfiles = expand("../../scratch/endpoints/{sample}.{dupdel}.merged.final.bed",
sample = config["samples"],
dupdel = ["dup", "del"])
output:
## Define the final output file
"results/cnv_consensus.tsv"
shell:
## Add a header to the file and combine all of the files using cat.
"echo -e 'chrom\tstart\tend\tmanta_CNVs\tcnvkit_CNVs\tfreec_CNVs\tCNV_type\tBiospecimen\tfile_names' "
" | cat - {input.bedfiles} "
" | cut -f 1-8 "
" > {output}"