This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

CNV consensus (6 of 6): Merge consensus files and name columns (#403)
* add to Snakefile

* updating fork

* add final consensus merging step

* add column naming

* add column naming

* add column naming

* change so that output files doesn't have extra indentation at the end

* change so that output files doesn't have extra indentation at the end

* changed output path and name

* changed output path and name

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/src/scripts/compare_variant_calling_updated.py

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* changed Snakemake, reduced redundancy

* changed Snakemake, reduced redundancy

* changed minor error

* update Snakefile

* add README.md

* Update README.md

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update README.md

* Update README.md

* Update README.md

* add result consensus file

* make changes to analyses/README.md

* Update analyses/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/Snakefile

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* Update analyses/copy_number_consensus_call/README.md

Co-Authored-By: jashapiro <josh.shapiro@ccdatalab.org>

* cut the last column

* add the result file with out the last column

Co-authored-by: jashapiro <jashapiro@gmail.com>
2 people authored and jaclyn-taroni committed Jan 8, 2020
1 parent 21e0edb commit 7917a7f
Showing 5 changed files with 7,174 additions and 9 deletions.
2 changes: 1 addition & 1 deletion analyses/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
| [`cnv-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-comparison) | Earlier version of SEG files | *Deprecated*; compared earlier version of the CNV methods. | N/A
| [`collapse-rnaseq`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/collapse-rnaseq) | `pbta-gene-expression-rsem-fpkm.polya.rds`, `pbta-gene-expression-rsem-fpkm.stranded.rds`, `gencode.v27.primary_assembly.annotation.gtf.gz` | Collapses RSEM FPKM matrices such that gene symbols are de-duplicated. | `pbta-gene-expression-rsem-fpkm-collapsed.polya.rds`, `pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` (included in data download)
| [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds`, `pbta-gene-expression-rsem-tpm.stranded.rds` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz`, `pbta-cnv-controlfreec.tsv.gz`, `pbta-sv-manta.tsv.gz` | *In progress*; will produce consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) | N/A
| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz`, `pbta-cnv-controlfreec.tsv.gz`, `pbta-sv-manta.tsv.gz` | Produces consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) | `cnv_consensus.tsv`
| [`create-subset-files`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/create-subset-files) | All files | This module contains the code to create the subset files used in continuous integration | All subset files for continuous integration
| [`focal-cn-file-preparation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation) | `pbta-cnv-cnvkit.seg.gz`, `pbta-cnv-controlfreec.tsv.gz`, `pbta-gene-expression-rsem-fpkm.polya.rds`, `pbta-gene-expression-rsem-fpkm.stranded.rds` | Maps from copy number variant caller segments to gene identifiers; will eventually be updated to use consensus copy number calls ([#186](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/186))| `cnvkit_annotated_cn_autosomes.tsv.gz`, `cnvkit_annotated_cn_x_and_y.tsv.gz`, `controlfreec_annotated_cn_autosomes.tsv.gz`, `controlfreec_annotated_cn_x_and_y.tsv.gz`
| [`fusion_filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | `pbta-fusion-arriba.tsv.gz`, `pbta-fusion-starfusion.tsv.gz` | Standardizes, filters, and prioritizes fusion calls | `pbta-fusion-putative-oncogenic.tsv`, `pbta-fusion-recurrent-fusion-byhistology.tsv`, `pbta-fusion-recurrent-fusion-bysample.tsv` (included in data download)
Expand Down
61 changes: 61 additions & 0 deletions analyses/copy_number_consensus_call/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
# Copy Number Consensus Call

## Overview

The PBTA data set contains CNVs called by three different callers: Manta, CNVkit, and Freec.
The goal is to combine these callers to reduce false positives and produce a final consensus list of CNVs.
This analysis uses information from the following files generated by the 3 callers:

* `pbta-cnv-cnvkit.seg.gz`
* `pbta-cnv-controlfreec.tsv.gz`
* `pbta-sv-manta.tsv.gz`

## Running the pipeline

To run the entire pipeline, make sure you have the latest release of the three input files mentioned in the Overview section.
Then go to `OpenPBTA-analysis/analyses/copy_number_consensus_call` and run `bash run_consensus_call.sh`.

## Methods

This pipeline revolves around the use of Snakemake to run the analysis for each patient sample. An overview of the steps is as follows:

1) Parse the 3 input files and put CNVs from the **same caller and sample** into the same file.
2) Remove any sample/caller combination files with **more than 2500** CNVs called.
We believe these to be noisy/poor-quality samples (this cutoff comes from what GISTIC uses to flag noisy samples).
3) Create a `config_snakemake.yaml` that contains all of the sample names to run the Snakemake pipeline.
4) Run the Snakemake pipeline to perform the analysis **per sample**.
5) Retain only CNVs that are over a certain **SIZE_CUTOFF** (default 3000 bp).
6) Retain only **significant** CNVs called by Freec (default p-value = 0.01).
7) Filter out any CNVs that overlap 50% or more with **IGLL, telomeric, centromeric, or seg_dup regions**.
8) Merge any CNVs of the same sample and call method if they **overlap or lie within 10,000 bp** of each other (we consider CNV calls within 10,000 bp to be the same CNV).
9) Reformat the columns of the files (so the information is easier to read).
10) **Call consensus** by comparing CNVs from 2 call methods at a time.

Since there are 3 callers, there are 3 comparisons: `manta-cnvkit`, `manta-freec`, and `cnvkit-freec`. If a CNV from one caller **overlaps 50% or more** with at least one CNV from another caller, the common region of the overlapping CNVs becomes the new CONSENSUS CNV.

11) **Sort and merge** the CNVs from the comparison pairs (`manta-cnvkit`, `manta-freec`, `cnvkit-freec`) into 1 file.
12) After every sample's consensus CNVs are called, **combine all merged files** from step 11 and output to `results/cnv_consensus.tsv`.
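As a rough illustration of the 50% overlap rule in step 10, the sketch below computes the common region of two CNV calls. This is a minimal, hypothetical sketch only; the pipeline's actual comparison is performed by bedtools and `compare_variant_calling_updated.py`, and the function name here is not part of the module.

```python
def consensus_interval(a, b, min_frac=0.5):
    """Return the common region of two CNV intervals (start, end) on the
    same chromosome if one call is covered >= min_frac by the overlap,
    else None. Illustrative sketch of the two-caller consensus rule."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    overlap = end - start
    if overlap <= 0:
        return None  # no overlap at all
    # the CNV from one caller must overlap 50% or more of its length
    if overlap / (a[1] - a[0]) >= min_frac or overlap / (b[1] - b[0]) >= min_frac:
        return (start, end)
    return None
```

For example, calls `(100, 200)` and `(150, 400)` share a 50 bp overlap, which covers half of the first call, so the common region `(150, 200)` becomes the consensus CNV.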

## Example Output File

```
chrom start end manta_CNVs cnvkit_CNVs freec_CNVs CNV_type Biospecimen file_names
chr11 771036 866778 NULL 770516:866778:3 771036:871536:3 DUP BS_007JTNB8 BS_007JTNB8.cnvkit_freec.dup.bed
chr13 99966948 99991872 NULL 99954829:99994557:3 99966948:99991872:3 DUP BS_007JTNB8 BS_007JTNB8.cnvkit_freec.dup.bed
chr14 103515996 103563240 NULL 103515996:103563363:3 103511784:103541532:3,103543140:103563240:3 DUP BS_007JTNB8 BS_007JTNB8.cnvkit_freec.dup.bed
```

* The 1st line of the file is the header, which contains the column names. There are 9 columns in total.
* The 2nd line is the first CNV of the file.
* Column 1 is the **consensus** CNV chromosome
* Column 2 is the **consensus** CNV start location
* Column 3 is the **consensus** CNV end location
* Columns 4, 5, and 6 contain the calls from Manta, CNVkit, and Freec that make up the **consensus** CNV described in columns 1, 2, and 3.
* i.e., if there is info in column 4, one or more CNVs called by Manta made up the current **consensus** CNV described in columns 1, 2, and 3.
* Columns 4, 5, and 6 have the following format: `START:END:COPY_NUMBER,START:END:COPY_NUMBER`
* Note that if there is more than one original CNV call corresponding to a given consensus CNV from a given caller, the information for each of the CNV calls will be comma separated.
* In the example output above, column 6 of line 4 contains `103511784:103541532:3,103543140:103563240:3`, which means 2 CNVs called by Freec helped make up the **consensus** CNV on line 4.
One has the start and end coordinates `103511784:103541532` **on the same chromosome** with a copy number of `3`, and the other has the coordinates `103543140:103563240` with a copy number of `3`.
* Column 7 is the CNV type. This will be either DUP or DEL, corresponding to duplications or deletions, respectively. Note that this does not describe the number of copies, only the direction of the copy number change.
* Column 8 is the sample name.
* Column 9 contains the names of the two-caller consensus files (`manta-cnvkit`, `manta-freec`, `cnvkit-freec`) that made up the final **consensus** CNV.
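The `START:END:COPY_NUMBER` fields described above can be parsed programmatically. A minimal sketch, assuming the field format shown in the example output (the function name is hypothetical, not part of this module):

```python
def parse_caller_calls(field):
    """Parse a manta_CNVs/cnvkit_CNVs/freec_CNVs field from
    cnv_consensus.tsv. Each call is START:END:COPY_NUMBER and multiple
    calls are comma separated; NULL means the caller made no call."""
    if field == "NULL":
        return []
    calls = []
    for call in field.split(","):
        start, end, copies = call.split(":")
        calls.append((int(start), int(end), int(copies)))
    return calls
```

For the Freec field on line 4 of the example, this would yield two calls, `(103511784, 103541532, 3)` and `(103543140, 103563240, 3)`.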
44 changes: 39 additions & 5 deletions analyses/copy_number_consensus_call/Snakefile
Original file line number Diff line number Diff line change
@@ -1,8 +1,5 @@
## Define the ending file(s) that we want
OUTPUT= expand("../../scratch/interim/{sample}.{combined_caller}.{dupdel}.bed",
sample=config["samples"],
combined_caller=["manta_cnvkit", "manta_freec", "cnvkit_freec"],
dupdel=["dup", "del"]),
OUTPUT = "results/cnv_consensus.tsv",

## Define the wildcards to use in the file names.
wildcard_constraints:
Expand Down Expand Up @@ -161,4 +158,41 @@ rule compare_cnv_methods:
"--sample {params.sample_name}"



rule combine_merge_paired_cnv:
input:
## Define the location of the input file
manta_cnvkit="../../scratch/interim/{sample}.manta_cnvkit.{dupdel}.bed",
manta_freec="../../scratch/interim/{sample}.manta_freec.{dupdel}.bed",
cnvkit_freec="../../scratch/interim/{sample}.cnvkit_freec.{dupdel}.bed"
output:
## Define the output files' names
merged="../../scratch/endpoints/{sample}.{dupdel}.merged.final.bed"
threads: 1
shell:
## Combine the input file, sort and output to one file
## Columns 4, 5, and 6 hold the original CNV calls from Manta, CNVkit, and Freec, respectively.
## We want to retain info in these columns when merging these files so we use COLLAPSE to keep the information in these columns
## Columns 7 and 8 are the CNVtype (DEL, DUP) and Sample_name, respectively.
## AT THIS POINT, these columns of the input files hold the same values, so we use DISTINCT to keep only the unique values of columns 7 and 8.
## As for column 9, this column holds the files that were merged to get a specific CNV. We want to keep all information here so we COLLAPSE it.
"cat {input.manta_cnvkit} {input.manta_freec} {input.cnvkit_freec} "
"| sort -k1,1 -k2,2n "
"| bedtools merge -c 4,5,6,7,8,9 -o collapse,collapse,collapse,distinct,distinct,collapse "
"> {output.merged}"
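To see what the `collapse` and `distinct` operators in the rule above do, here is a toy Python emulation of this `bedtools merge` invocation on coordinate-sorted rows (illustrative only; the pipeline calls bedtools itself, and the function name is hypothetical):

```python
def merge_rows(rows):
    """Toy emulation of `bedtools merge -c 4,5,6,7,8,9
    -o collapse,collapse,collapse,distinct,distinct,collapse`
    on coordinate-sorted 9-column rows from a single file."""
    merged = []
    for row in rows:
        chrom, start, end = row[0], row[1], row[2]
        # bedtools merges intervals that overlap (or are book-ended)
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            prev[2] = max(prev[2], end)
            for i in (3, 4, 5, 8):   # collapse: comma-join every value
                prev[i] += "," + row[i]
            for i in (6, 7):         # distinct: keep unique values only
                if row[i] not in prev[i].split(","):
                    prev[i] += "," + row[i]
        else:
            merged.append(list(row))
    return merged
```

Two overlapping rows of the same DUP for one sample would thus collapse into a single row whose caller columns carry all original calls, while the CNV type and sample name columns keep a single value.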


rule merge_all:
input:
## Take all of the del and dup files of ALL samples as input. If there are 200 samples, there will be 400 files total
bedfiles = expand("../../scratch/endpoints/{sample}.{dupdel}.merged.final.bed",
sample = config["samples"],
dupdel = ["dup", "del"])
output:
## Define the final output file
"results/cnv_consensus.tsv"
shell:
## Add a header to the file and combine all of the files using cat.
"echo -e 'chrom\tstart\tend\tmanta_CNVs\tcnvkit_CNVs\tfreec_CNVs\tCNV_type\tBiospecimen\tfile_names' "
" | cat - {input.bedfiles} "
" | cut -f 1-8 "
" > {output}"