AlexsLemonade · jaclyn-taroni · Jan 27, 2020 · Dec 18, 2019 · Dec 19, 2019 · Jan 4, 2020
diff --git a/analyses/README.md b/analyses/README.md
@@ -16,7 +16,7 @@ Note that _nearly all_ modules use the harmonized clinical data file (`pbta-hist
 | [`cnv-comparison`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/cnv-comparison) | Earlier version of SEG files | *Deprecated*; compared earlier version of the CNV methods. | N/A
 | [`collapse-rnaseq`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/collapse-rnaseq) | `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` <br> `gencode.v27.primary_assembly.annotation.gtf.gz` | Collapses RSEM FPKM matrices such that gene symbols are de-duplicated. | `results/pbta-gene-expression-rsem-fpkm-collapsed.polya.rds` <br> `results/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds` (included in data download; too large for tracking via GitHub)
 | [`comparative-RNASeq-analysis`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/comparative-RNASeq-analysis) | `pbta-gene-expression-rsem-tpm.polya.rds` <br> `pbta-gene-expression-rsem-tpm.stranded.rds` | *In progress*; will produce expression outlier profiles per [#229](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/229) | N/A |
-| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-sv-manta.tsv.gz` | Produces consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) and a set of excluded regions where CNV calls are not made | `results/cnv_consensus.tsv` <br> `results/pbta-cnv-consensus.seg` <br> `ref/cnv_excluded_regions.bed`
+| [`copy_number_consensus_call`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/copy_number_consensus_call) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-sv-manta.tsv.gz` | Produces consensus copy number calls per [#128](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/128) and a set of excluded regions where CNV calls are not made | `results/cnv_consensus.tsv` <br> `results/pbta-cnv-consensus.seg` <br> `ref/cnv_excluded_regions.bed` <br> `ref/cnv_callable.bed`
 | [`create-subset-files`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/create-subset-files) | All files | This module contains the code to create the subset files used in continuous integration | All subset files for continuous integration
 | [`focal-cn-file-preparation`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/focal-cn-file-preparation) | `pbta-cnv-cnvkit.seg.gz` <br> `pbta-cnv-controlfreec.tsv.gz` <br> `pbta-gene-expression-rsem-fpkm.polya.rds` <br> `pbta-gene-expression-rsem-fpkm.stranded.rds` | Maps from copy number variant caller segments to gene identifiers; will eventually be updated to use consensus copy number calls ([#186](https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/186))| `results/cnvkit_annotated_cn_autosomes.tsv.bz2` <br> `results/cnvkit_annotated_cn_x_and_y.tsv.bz2` <br> `results/controlfreec_annotated_cn_autosomes.tsv.bz2` <br> `results/controlfreec_annotated_cn_x_and_y.tsv.bz2`
 | [`fusion_filtering`](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/fusion_filtering) | `pbta-fusion-arriba.tsv.gz` <br> `pbta-fusion-starfusion.tsv.gz` | Standardizes, filters, and prioritizes fusion calls | `results/pbta-fusion-putative-oncogenic.tsv` <br> `results/pbta-fusion-recurrent-fusion-byhistology.tsv` <br> `results/pbta-fusion-recurrent-fusion-bysample.tsv` (included in data download)

diff --git a/analyses/copy_number_consensus_call/README.md b/analyses/copy_number_consensus_call/README.md
@@ -10,17 +10,12 @@ This analysis uses information from the following files generated from the 3 cal
 * `pbta-cnv-controlfreec.tsv.gz`
 * `pbta-sv-manta.tsv.gz`
 
-The analysis produces an output file that includes the original calls used for each consensus call:
+The analysis produces the following output files:
 
-* `results/cnv_consensus.tsv`
-
-A segfile for downstream processing:
-
-* `results/pbta-cnv-consensus.seg`
-
-And a bed file of regions that were excluded from calls (see step 7)
-
-* `ref/cnv_excluded_regions.bed`
+* `results/cnv_consensus.tsv`: A tab-separated file of consensus copy number variants, including the original calls used for each consensus call
+* `results/pbta-cnv-consensus.seg`: A `.seg` formatted file for downstream processing
+* `ref/cnv_excluded_regions.bed`: A `.bed` file of error-prone regions that were filtered from copy number calls
+* `ref/cnv_callable.bed`: A `.bed` file of regions considered "callable" by the analysis pipeline
 
 ## Running the pipeline
 
@@ -29,7 +24,24 @@ Go to OpenPBTA-analysis/analyses/copy_number_consensus_call and run `bash run_co
 
 ## Methods
 
-This pipeline revolves around the use of Snakemake to run analysis for each patient sample. The overview of the steps are as followed:
+### Assayed Regions
+
+Regions of the genome with a high potential for error are first defined by merging the set of telomeric, centromeric, and heterochromatic regions with regions around immunoglobulins and segmental duplications.
+The input files for this step are described in `scripts/prepare_blacklist_files.sh` and include:
+
+* `ref/centromeres.bed`
+* `ref/heterochromatin.bed`
+* `ref/immunoglobulin_regions.bed`
+* `ref/segmental_dups.bed`
+* `ref/telomeres.bed`
+
+The final set of merged excluded regions is placed in the file `ref/cnv_excluded_regions.bed`.
+
+In addition, a file of the genomic regions that we deem "callable" is created at `ref/cnv_callable.bed` as the complement of the excluded regions, after removing exclusions smaller than 200kb.
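Review note: the complement step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the module's implementation. The function name `callable_regions`, the `chrom_sizes` mapping, the half-open BED-style coordinates, and the choice to keep exclusions of exactly 200,000 bp are all assumptions made for the example.

```python
from collections import defaultdict

# Assumed reading of "200kb"; the actual cutoff and comparison may differ.
MIN_EXCLUSION = 200_000


def callable_regions(excluded, chrom_sizes, min_exclusion=MIN_EXCLUSION):
    """Complement of the excluded intervals, ignoring small exclusions.

    excluded: iterable of (chrom, start, end), half-open BED-style intervals
    chrom_sizes: dict mapping chromosome name to its length (hypothetical input)
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in excluded:
        # Exclusions shorter than the cutoff are dropped, so they stay callable.
        if end - start >= min_exclusion:
            by_chrom[chrom].append((start, end))

    result = []
    for chrom, size in chrom_sizes.items():
        cursor = 0  # start of the next candidate callable interval
        for start, end in sorted(by_chrom[chrom]):
            if start > cursor:
                result.append((chrom, cursor, start))
            cursor = max(cursor, end)
        if cursor < size:
            result.append((chrom, cursor, size))
    return result
```

With a toy 2 Mb chromosome, a 50 kb exclusion is ignored while the two larger exclusions split the chromosome into two callable intervals.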
+
+### Consensus CNV creation
+
+The per-sample pipeline uses Snakemake to run the analysis for each patient sample. An overview of the steps follows:
 
 1) Parse through the 3 input files and put CNVs of the **same caller and sample** in the same files.
 2) Remove any sample/caller combination files with **more than 2500** CNVs called.
@@ -43,15 +55,18 @@ This pipeline revolves around the use of Snakemake to run analysis for each pati
 9) Reformat the columns of the files (So the info are easier to read)
 10) **Call consensus** by comparing CNVs from 2 call methods at a time. 
 
-Since there are 3 callers, there were 3 comparisons: `manta-cnvkit`, `manta-freec`, and `cnvkit-freec`. If a CNV from 1 caller **overlaps 50% or more** with at least 1 CNV from another caller, the common region of the overlapping CNV would be the new CONSENSUS CNV. 
+Since there are 3 callers, there were 3 comparisons: `manta-cnvkit`, `manta-freec`, and `cnvkit-freec`. If a CNV from 1 caller **overlaps 50% or more** with at least 1 CNV from another caller, the common region of the overlapping CNV would be the new CONSENSUS CNV.
 
 11) **Sort and merge** the CNVs from the comparison pairs ,`manta-cnvkit` `manta-freec` `cnvkit-freec`, together into 1 file
 12) Resolve overlapping segments where duplications are embedded within larger deletion segments, or deletions within duplications.
 13) After every samples' consensus CNVs were called, **combine all merged files** from step 10 and output to `results/cnv_consensus.tsv`
-14) The `results/cnv_consensus.tsv` is translated into a `pbta-cnv-consensus.seg` file in the same format as `pbta-cnv-cnvkit.seg.gz`.
-When a consensus CNV contains from multiple source CNV segments, we take the mean of the CNVkit `seg.mean` values from the source segments, weighted by segment length.
-If no CNVkit CNV was included, the value for this column is NA.
+14) The `results/cnv_consensus.tsv` is translated into a `results/pbta-cnv-consensus.seg` file in the same format as `pbta-cnv-cnvkit.seg.gz`, including all samples where at least two callers passed quality filtering.
+When a consensus segment is derived from multiple source segments, we take the mean of the CNVkit `seg.mean` values from the source segments, weighted by segment length.
+If no CNVkit variant was included, the value for this column is NA.
 The `copy.num` column is the weighted median of CNVkit segment values where they exist, or Control-FREEC values in the absence of CNVkit data.
+Because some software (notably GISTIC) requires all samples to have the same regions called, the copy number variants from `cnv_consensus.tsv` are supplemented with "neutral" segments where no call was made.
+These include all non-variant regions present in `ref/cnv_callable.bed`.
+The neutral regions are assigned `copy.num` 2, except on chrX and chrY, where the copy number is left NA.
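Review note: the length-weighted mean used for `seg.mean` in step 14 can be illustrated with a small Python sketch. It is not taken from the module's code. The function name `weighted_seg_mean` and the tuple layout are hypothetical, and the sketch assumes the weight is the full length of each CNVkit source segment (rather than, say, its overlap with the consensus segment).

```python
import math


def weighted_seg_mean(segments):
    """Length-weighted mean of CNVkit seg.mean values.

    segments: list of (start, end, seg_mean) for the CNVkit source segments
    behind one consensus segment. Returns NaN (standing in for NA) when no
    CNVkit segment contributed.
    """
    total_len = sum(end - start for start, end, _ in segments)
    if total_len == 0:
        return math.nan
    weighted_sum = sum((end - start) * mean for start, end, mean in segments)
    return weighted_sum / total_len
```

For example, a 100 bp segment with `seg.mean` 0.5 and a 300 bp segment with `seg.mean` 1.0 give (100 × 0.5 + 300 × 1.0) / 400 = 0.875.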
 
 ## Example Output File