From 2f5971de3d7c2e43a4201618fc0a7cdb09733a53 Mon Sep 17 00:00:00 2001
From: Candace Savonen <savonenc@msu.edu>
Date: Mon, 2 Dec 2019 08:18:26 -0500
Subject: [PATCH 01/11] initial restructure

---
 analyses/snv-callers/README.md | 101 ++-------------------------------
 1 file changed, 5 insertions(+), 96 deletions(-)

diff --git a/analyses/snv-callers/README.md b/analyses/snv-callers/README.md
index 1442704196..c9ebee2308 100644
--- a/analyses/snv-callers/README.md
+++ b/analyses/snv-callers/README.md
@@ -8,10 +8,6 @@ This consensus mutation file is [MAF-like](#consensus-mutation-call) meaning it
 * [General usage of scripts](#general-usage-of-scripts)
 * [Individual caller evaluation](#individual-caller-evaluation)  
   * [Base change analysis](#base-change-analysis)  
-  * [Variant allele fraction calculation](#variant-allele-fraction-calculation)  
-  * [Genomic regional analyses](#genomic-regional-analyses)  
-  * [Tumor mutation burden](#tumor-mutation-burden-calculation)
-  * [COSMIC mutation overlap](#cosmic-mutation-overlap)  
 * [Comparison analysis](#comparison-of-callers)
  * [Mutation IDs](#mutation-ids)
  * [Consensus mutation calls](#consensus-mutation-calls)
@@ -60,7 +56,11 @@ This set up script only needs to be run once and its three options are all relat
                   GitHub. Only coordinates are needed.
 ```
 
-### 01-calculate_vaf_tmb.R
+### 01-setup_db.py
+
+### 02-merge_callers.R
+
+### 03-calculate_tmb.R
 
 This script sets up the given MAF file and outputs three files ([VAF](#variant-allele-fraction-calculation),
 [TMB](#tumor-mutation-burden-calculation), and [Regional analysis](#genomic-regional-analyses))
@@ -81,97 +81,6 @@ that are used to make an overall evaluation report in `02-run_eval.R`.
  --overwrite : If specified, will overwrite any files of the same name. Default is FALSE.
  --no_region : If used, regional analysis will not be done.
 ```
-
-### 02-run_eval.R
-
-This script takes the files output by `01-calculate_vaf_tmb.R` and makes five
-plots ([base_change](#base-change-analysis), [depth_vs_vaf](#variant-allele-fraction-calculation), [cosmic_plot](#cosmic-mutation-overlap), [snv_region](#genomic-regional-analyses), and [tmb_plot](#tumor-mutation-burden-calculation)) that are put into an overall report.
-
-**Option descriptions**
-```
- --label : Label to be used for folder and all output. eg. 'strelka2'. Optional.
-           Default is 'maf'
- --plot_type : Specify what kind of plots you want printed out. Must be
-               compatible with ggsave. eg pdf. Default is png
- --vaf : Folder from 01-calculate_vaf_tmb.R following files:
-                                             <caller_name>_vaf.<file_format>
-                                             <caller_name>_region.<file_format>
-                                             <caller_name>_tmb.<file_format>
- --file_format: What type of file format were the vaf and tmb files saved as? Options are
-                "rds" or "tsv". Default is "rds".
- --output : Where you would like the output from this script to be stored.
- --strategy : Specify whether you would like WXS and WGS separated for the plots.
-              Analysis is still done on all data in the MAF file regardless.
-              Acceptable options are 'wgs', 'wxs' or 'both', both for if you
-              don't want to separate them. Default is both.
- --cosmic : Relative file path to COSMIC file to be analyzed.
- --overwrite : If TRUE, will overwrite any reports of the same name. Default is
-              FALSE
-  --no_region : If used, regional analysis will not be done.
-```
-
-### 03-merge_callers.R
-
-This script merges the callers' TMB and VAF files into total files with a column
-`caller` to designate their origin.
-This script output 4 files:
-
-- `all_callers_vaf.rds` - contains all the VAF file information for all callers.
-- `all_callers_tmb.rds` - contains all the TMB file information for all callers.
-- `mutation_id_list.rds` - a full list of the mutations that can be used for an UpSetR graph
-- `callers_per_mutation.rds` - contains a breakdown for each mutation of what callers called it. Will be used to identify the consensus mutations.
-
-**Option descriptions**
-```
- --vaf : Parent folder containing the vaf and tmb files for each folder.
-                                             <caller_name>_vaf.<file_format>
-                                             <caller_name>_tmb.<file_format>
- --file_format: What type of file format were the vaf and tmb files saved as? Options are
-               "rds" or "tsv". Default is "rds".
- --output : Where you would like the output from this script to be stored.
- --overwrite : If TRUE, will overwrite any reports of the same name. Default is
-               FALSE
-```
-
-### 04-create_consensus_mut_files.R
-
-This script takes the merged output of 03-merge_callers.R and saves the final consensus mutation calls to a MAF-like file.
-This script outputs two main files:  
-- `consensus_mutation_vaf.tsv` - contains the consensus mutations and their associated variables.
-- `consensus_mutation_tmb.tsv` - contains the re-calculated TMBs based on the conensus mutations only.
-
-**Option descriptions**
-```
- --merged_files : File path to where the 03-merged_callers.R output can be found.
-                  Files required: "all_callers_vaf.<file_format>"
-                                  "all_callers_tmb.<file_format>"
-                                  "mutation_id_list.<file_format>"
- --vaf : What VAF should be used when combining callers? Options are 'median' or
-         one of the caller names."
- --combo : What combination of callers need to detect a mutation for it to be
-           considered real and placed in the consensus file? List the callers
-           that need to be considered in alphabetical order with '-'
-           in between. eg. 'lancet-mutect2-strelka2'
- --file_format: What type of file format were the vaf and tmb files saved as? Options are
-               "rds" or "tsv". Default is "rds".
- --output : Where you would like the output from this script to be stored.
- --bed_wgs : File path that specifies the caller-specific BED regions file.
-             Assumes from top directory, 'OpenPBTA-analysis'.
- --bed_wxs : File path that specifies the WXS BED regions file. Assumes file path
-             is given from top directory of 'OpenPBTA-analysis'
- --overwrite : If TRUE, will overwrite any reports of the same name. Default is
-             FALSE
-```
-
-# Individual Caller Evaluation
-
-The first step in this analysis is an individual evaluation of each MAF from each caller.
-This analysis prints out a report that is further subdivided by the WGS and WXS samples.
-
-Overall reports for each caller and strategy can be found:
-* `results/<caller_name>/<caller_name>_wgs_report.html`
-* `results/<caller_name>/<caller_name>_ws_report.html`
-
 ### Base change analysis
 
 To evaluate base changes I summarized standard MAF fields as two new variables:

From 5bc143a409ddaa7aa1138f8192ee8ed6c353ef8a Mon Sep 17 00:00:00 2001
From: Candace Savonen <savonenc@msu.edu>
Date: Mon, 2 Dec 2019 08:30:55 -0500
Subject: [PATCH 02/11] Add in info from SNV consensus README

---
 analyses/snv-callers/README.md | 191 ++++++++-------------------------
 1 file changed, 45 insertions(+), 146 deletions(-)

diff --git a/analyses/snv-callers/README.md b/analyses/snv-callers/README.md
index c9ebee2308..16628614e5 100644
--- a/analyses/snv-callers/README.md
+++ b/analyses/snv-callers/README.md
@@ -38,28 +38,37 @@ These are the mutations dubbed reliable enough to move forward with.
 - By default, the scripts will not overwrite existing files of the same name. However,
 this can be overridden with `--overwrite` option.
 
-### 00-set_up.R
-
-00-set_up.R creates the [annotation RDS file](#genomic-regional-analyses) and [COSMIC mutations file](#cosmic-mutation-overlap) that are used by
-the subsequent scripts.
-This set up script only needs to be run once and its three options are all relating to where the reference files should be stored.
+### 01-setup_db.py
 
-**Option descriptions**
+**Argument descriptions**
 ```
- --annot_rds : File path to where you would like the annotation_rds file to be
-               stored
- --cosmic_og : Path to original COSMIC file. Can be .gz compressed. Will need to
-               download this from COSMIC at https://cancer.sanger.ac.uk/cosmic/download
-               These data are available if you register.
- --cosmic_clean : File path specifying where you would like the cleaned brain-related
-                  COSMIC mutations file to be stored. This file is provided to you in
-                  GitHub. Only coordinates are needed.
+  -d DB_FILE, --db-file DB_FILE
+     Path of the database file to use or create. Defaults to `data.sqlite`.
+   --strelka-file STRELKA_FILE
+     Path of the MAF formatted data file from the strelka2 caller(TSV).
+   --mutect-file MUTECT_FILE
+     Path of the MAF formatted data file from the mutect2 caller(TSV).
+   --lancet-file LANCET_FILE
+     Path of the MAF formatted data file from the lancet caller(TSV).
+   --vardict-file VARDICT_FILE
+     Path of the MAF formatted data file from the vardict caller(TSV).
+   --meta-file META_FILE, --hist-file META_FILE
+     Path of the metadata/histology data file(TSV).
+   --overwrite           Overwrite tables that may already exist.
 ```
 
-### 01-setup_db.py
-
 ### 02-merge_callers.R
 
+```
+ --db_file : Path to sqlite database file made from 01-setup_db.py
+ --output_file : File path and file name of where you would like the MAF-like
+                 output from this script to be stored.
+ --vaf_filter: Optional Variant Allele Fraction filter. Specify a number; any
+               mutations with a VAF that are NA or below this number will be
+               removed from the vaf data.frame before it is saved to a TSV file.
+ --overwrite : If TRUE, will overwrite any reports of the same name. Default is
+              FALSE
+```
 ### 03-calculate_tmb.R
 
 This script sets up the given MAF file and outputs three files ([VAF](#variant-allele-fraction-calculation),
@@ -68,18 +77,14 @@ that are used to make an overall evaluation report in `02-run_eval.R`.
 
 **Option descriptions**
 ```
- -label : Label to be used for folder and all output. eg. 'strelka2'. Default is 'maf'.
- -output : File path that specifies the folder where the output should go.
-           New folder will be created if it doesn't exist.
- --file_format: What type of file format would you like the output as? Options are
-               "rds" or "tsv". Default is "rds".
- --maf :  Relative file path to MAF file to be analyzed. Can be .gz compressed.
- --metadata : Relative file path to original metadata file.
- --annot_rds : Relative file path to annotation object RDS file to be analyzed.
+ --consensus : File path to the MAF-like file.
+ --metadata : Relative file path to MAF file to be analyzed. Can be .gz compressed.
+              Assumes file path is given from top directory of 'OpenPBTA-analysis'.
  --bed_wgs : File path that specifies the caller-specific BED regions file.
- --bed_wxs : File path that specifies the WXS BED regions file.
+             Assumes from top directory, 'OpenPBTA-analysis'.
+ --bed_wxs : File path that specifies the WXS BED regions file. Assumes file path
+             is given from top directory of 'OpenPBTA-analysis'
  --overwrite : If specified, will overwrite any files of the same name. Default is FALSE.
- --no_region : If used, regional analysis will not be done.
 ```
 ### Base change analysis
 
@@ -89,13 +94,11 @@ concatenating `Reference_Allele`, `>`, and `Allele`.
 The `change` variable is made from the `base_change` variable but groups
 together deletions, insertions, and long (more than a SNV) as their own groups.
 
-*Output for this analysis*
-* `results/<caller_name>/<caller_name>_vaf.rds`
-* `plots/<caller_name>/<caller_name>_<strategy>_base_change.png`
-
 ### Variant Allele Fraction Calculation
 
 Calculate variant allele fraction (VAF) for each variant.
+This is done in `03-calculate_tmb.R`.
+
 ```
 vaf = (t_alt_count) / (t_ref_count + t_alt_count)
 ```
@@ -103,21 +106,6 @@ This is following the [code used in
 `maftools`](https://github.com/PoisonAlien/maftools/blob/1d0270e35c2e0f49309eba08b62343ac0db10560/R/plot_vaf.R#L39).
 The VAF calculations and other special variables are added to the MAF fields and written to a file ending in `_vaf` in the caller's results folder.
 
- *Output for this analysis*
- * `results/<caller_name>/<caller_name>_vaf.rds`
- * `plots/<caller_name>/<caller_name>_<strategy>_depth_vs_vaf.png`
-
-### Genomic Regional Analyses
-
-To analyze what genomic regions the variants are from, I used [Annotatr
-package](https://bioconductor.org/packages/release/bioc/vignettes/annotatr/inst/doc/annotatr-vignette.html) to obtain hg38 genome annotations.
-This Annotatr object is stored as an RDS file: `hg38_genomic_region_annotations.rds` in the `scratch` directory.
-Mutations are assigned all annotations that they overlap (using `GenomicRanges::overlap`).
-
-*Output for this analysis*
-* `results/<caller_name>/<caller_name>_regions.rds`
-* `plots/<caller_name>/<caller_name>_<strategy>_snv_regions.png`
-
 ### Tumor Mutation Burden Calculation
 
 To calculate TMB, the sum of the bases included in the WXS or WGS BED regions are used as the denominator, depending on the sample's processing strategy.
@@ -131,31 +119,6 @@ Where genome size is calculated from the respective BED file as:
 ```
 genome_size = sum(End_Position - Start_Position)
 ```
-
-BED regions for WXS samples can be [found here](https://raw.githubusercontent.com/AstraZeneca-NGS/reference_data/master/hg38/bed/Exome-AZ_V2.bed).
-BED regions used for WGS samples are caller specific are from <unknown as of now>
-The sample-wise TMB calculations written to a file ending in `_tmb` in the caller's results folder.
-
-*Output for this analysis*
-* `results/<caller_name>/<caller_name>_tmb.rds`
-* `plots/<caller_name>/<caller_name>_<strategy>_tmb_plot.png`
-
-### COSMIC Mutation Overlap
-
-The COSMIC mutation data were obtained from https://cancer.sanger.ac.uk/cosmic/download
-*To run this analysis, you need to obtain these data.*
-The full, unfiltered somatic mutations file `CosmicMutantExport.tsv` for grch38 is used here and the genomic coordinates is arranged to be in BED format.
-The COSMIC set is filtered down to only mutations detected in brain-related
-samples using the `Site subtype 1` field.
-COSMIC mutations are overlapped with the present data's mutations using `GenomicRanges`.
-The outcome of this overlap is added to the VAF data.frame with two `TRUE/FALSE` columns:
-`overlap_w_cosmic` is TRUE for mutations that overlap with COSMIC mutations, while `same_as_cosmic` is TRUE when the base change summary is also identical.
-The VAF for mutations that are or are not overlapping with COSMIC mutations are then plotted in a violin plot.
-
-*Output for this analysis*
-* `results/<caller_name>/<caller_name>_vaf.rds`
-* `plots/<caller_name>/<caller_name>_<strategy>_cosmic_plot.png`
-
 ## Comparison of Callers
 
 After running an initial evaluation and set up of each of the callers' MAF files,
@@ -164,89 +127,25 @@ mutation calls.
 
 ### Mutation IDs  
 
-In order to compare mutations across callers, I created a `mutation_id` from combining information from standard MAF fields.
-This was done in the `01-calculate_vaf_tmb.R` script using the `set_up_maf `
-function.
-
-`mutation_id` is a concatenation of:  
-* `Hugo_Symbol`  
-* [`change`](#base-change-analysis)  
-* `Start_Position`  
-* `Tumor_Sample_Barcode` (the sample ID)  
+In order to compare mutations across callers, the data tables for each caller
+were indexed by: `Chromosome` `Start_Position` `Reference_Allele` and `Allele`.
+Meaning that across callers, if all these fields were identical, they were considered to be the same mutation.
+This was done in the `01-setup_db.py` script using the function.
 
-If mutation_id's are identical among MAF files, they are considered the same.
+### Summary of consensus files:
 
-### Consensus mutation call
+- `consensus_mutation.maf.tsv` - Mutations that were called by all three of these callers for a given sample are saved to this file.
+This file is [MAF-like](#consensus-mutation-call) meaning it is TSV file that contains many of the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but also some added calculations like [Variant Allele Fraction](#variant-allele-fraction-calculation) and some sample metadata information.
+These files combine the [MAF file data](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from 3 different SNV callers: [Mutect2](https://software.broadinstitute.org/cancer/cga/mutect), [Strelka2](https://github.com/Illumina/strelka), and [Lancet](https://github.com/nygenome/lancet).
+See the methods on the callers' settings [here](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-single-nucleotide-variant-calling) and see the methods of this caller analysis and comparison [here](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers).
 
-After the comparisons amongst the callers, VarDict proved to be too unreliable and called low VAF mutations.
-Moving forward, mutations that were identified by Lancet, Mutect2 and Strelka2 were included in the final list of mutations for each sample.
-The consensus mutations themselves are saved to a MAF-like file `consensus_mutation.maf.tsv.zip` to the `consensus`.
-It is being called a "MAF-like" file because it has many of the same fields as a MAF file but..  
+It is "MAF-like" file because it has many of the same fields as a MAF file but..  
   - Does not contain the version string in the first row   
   - Has extraneous annotation data has been removed (columns with all `NA`s)  
   - Has VAF calculations and other variables that are calculated by the [`set_up_maf` function](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/util/wrangle_functions.R#L11).
 
-## Overall file structure
-```
-OpenPBTA-analysis
-├── analyses
-│   └── snv-callers
-│       ├── run_caller_analysis.sh
-│       ├── compare_snv_callers.Rmd
-│       ├── scripts
-│       │   ├── 00-set-up.R
-│       │   ├── 01-calculate_vaf_tmb.R
-│       │   ├── 02-run_eval.R
-│       │   ├── 03-merge_callers.R
-│       │   └── 04-create_consensus_mut_files.R
-│       ├── util
-│       │    ├── plot_functions.R
-│       │    └── wrangle_functions.R
-│       ├── results
-│       │   ├── consensus
-│       │   │   ├── consensus_mutation_tmb.tsv
-│       │   │   └── consensus_mutation.maf.tsv
-│       │   ├── lancet
-│       │   │   ├── lancet_vaf.rds
-│       │   │   ├── lancet_tmb.rds
-│       │   │   ├── lancet_region.rds
-│       │   │   ├── lancet_wxs_report.html
-│       │   │   ├── lancet_wxs_report.Rmd
-│       │   │   ├── lancet_wgs_report.html
-│       │   │   ├── lancet_wgs_report.Rmd
-│       │   │   └── lancet_metadata_filtered.rds
-│       │   ├── mutect2
-│       │   │   └── ...
-│       │   ├── strelka2
-│       │   │   └── ...
-│       │   └── vardict
-│       │       └── ...
-│       ├── plots
-│       │   ├── lancet
-│       │   │   ├── lancet_wgs_base_change.png
-│       │   │   ├── lancet_wgs_cosmic_plot.png
-│       │   │   ├── lancet_wgs_depth_vs_vaf.png
-│       │   │   ├── lancet_wgs_snv_region.png
-│       │   │   ├── lancet_wgs_tmb_plot.png
-│       │   │   ├── lancet_wxs_base_change.png
-│       │   │   ├── lancet_wxs_cosmic_plot.png
-│       │   │   ├── lancet_wxs_depth_vs_vaf.png
-│       │   │   ├── lancet_wxs_snv_region.png
-│       │   │   └── lancet_wxs_tmb_plot.png
-│       │   ├── mutect2
-│       │   │   └── ...
-│       │   ├── strelka2
-│       │   │   └── ...
-│       │   └── vardict
-│       │       └── ...
-│       ├── ref_files
-│       │   ├── hg38_genomic_region_annotation.rds
-│       │   └── brain_cosmic_variants_coordinates.tsv
-│       └── template
-│           ├── variant_caller_report_no_region_template.Rmd
-│           └── variant_caller_report_template.Rmd
-├── data
-```
+- `consensus_mutation_tmb.tsv` - After the consensus mutations were identified, Tumor mutation burden was recalculated for each sample from this mutation set.
+See this [analysis' folder](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers#tumor-mutation-burden-calculation) for details on these methods.
 
 ## Summary of custom functions
 

From b6c0fc608f7b0244dd10885d3866eab04d59d0a8 Mon Sep 17 00:00:00 2001
From: Candace Savonen <savonenc@msu.edu>
Date: Mon, 2 Dec 2019 09:40:16 -0500
Subject: [PATCH 03/11] Update outdated text

---
 analyses/snv-callers/README.md | 177 ++++++++++++++-------------------
 1 file changed, 77 insertions(+), 100 deletions(-)

diff --git a/analyses/snv-callers/README.md b/analyses/snv-callers/README.md
index 16628614e5..c40ed9d0d0 100644
--- a/analyses/snv-callers/README.md
+++ b/analyses/snv-callers/README.md
@@ -3,20 +3,27 @@
 This analysis evaluates [MAF files](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from different SNV callers, compares their output, and creates a [consensus mutation file](./results/consensus/consensus_mutation.maf.tsv.zip).
 This consensus mutation file is [MAF-like](#consensus-mutation-call) meaning it is TSV file that contains many of the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but also some added calculations like [Variant Allele Fraction](#variant-allele-fraction-calculation) and some sample metadata information.
 
-**Table of Contents**
-* [How to run this pipeline](#how-to-run-this-pipeline)
-* [General usage of scripts](#general-usage-of-scripts)
-* [Individual caller evaluation](#individual-caller-evaluation)  
-  * [Base change analysis](#base-change-analysis)  
-* [Comparison analysis](#comparison-of-callers)
- * [Mutation IDs](#mutation-ids)
- * [Consensus mutation calls](#consensus-mutation-calls)
-* [Overall file structure](#overall-file-structure)
-* [Summary of functions](#summary-of-custom-functions)
+<!-- START doctoc generated TOC please keep comment here to allow auto update -->
+<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
+**Table of Contents**  *generated with [DocToc](https://github.com/thlorenz/doctoc)*
+
+- [How to run the caller consensus analysis](#how-to-run-the-caller-consensus-analysis)
+  - [Summary of consensus files:](#summary-of-consensus-files)
+- [Summary of Methods](#summary-of-methods)
+  - [Mutation Comparisons](#mutation-comparisons)
+  - [Variant Allele Fraction Calculation](#variant-allele-fraction-calculation)
+  - [Tumor Mutation Burden Calculation](#tumor-mutation-burden-calculation)
+- [General usage of scripts](#general-usage-of-scripts)
+  - [01-setup_db.py](#01-setup_dbpy)
+  - [02-merge_callers.R](#02-merge_callersr)
+  - [03-calculate_tmb.R](#03-calculate_tmbr)
+
+<!-- END doctoc generated TOC please keep comment here to allow auto update -->
 
 ## How to run the caller consensus analysis
 
 To run the evaluations and comparisons of all the SNV callers, call the bash script:
+
 ```
 bash run_caller_analysis.sh
 ```
@@ -30,6 +37,56 @@ The consensus mutations themselves are saved to a [MAF-like file](#consensus-mut
 results folder.
 These are the mutations dubbed reliable enough to move forward with.
 
+### Summary of consensus files:
+
+- `consensus_mutation.maf.tsv` - Mutations that were called by all three of these callers for a given sample are saved to this file.
+This file is [MAF-like](#consensus-mutation-call) meaning it is TSV file that contains many of the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but also some added calculations like [Variant Allele Fraction](#variant-allele-fraction-calculation) and some sample metadata information.
+These files combine the [MAF file data](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from 3 different SNV callers: [Mutect2](https://software.broadinstitute.org/cancer/cga/mutect), [Strelka2](https://github.com/Illumina/strelka), and [Lancet](https://github.com/nygenome/lancet).
+See the methods on the callers' settings [here](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-single-nucleotide-variant-calling) and see the methods of this caller analysis and comparison [here](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers).
+
+It is "MAF-like" file because it has many of the same fields as a MAF file but..  
+  - Does not contain the version string in the first row   
+  - Has extraneous annotation data has been removed (columns with all `NA`s)  
+  - Has VAF calculations and other variables that are calculated by the [`set_up_maf` function](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/util/wrangle_functions.R#L11).
+
+- `consensus_mutation_tmb.tsv` - After the consensus mutations were identified, Tumor mutation burden was recalculated for each sample from this mutation set.
+See this [analysis' folder](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers#tumor-mutation-burden-calculation) for details on these methods.
+
+## Summary of Methods
+
+### Mutation Comparisons
+
+In order to compare mutations across callers, the data tables for each caller
+were indexed by: `Chromosome` `Start_Position` `Reference_Allele` and `Allele`.
+Meaning that across callers, if all these fields were identical, they were considered to be the same mutation.
+This was done in the `01-setup_db.py` script using the function.
+
+### Variant Allele Fraction Calculation
+
+Calculate variant allele fraction (VAF) for each variant.
+This is done in `03-calculate_tmb.R`.
+
+```
+vaf = (t_alt_count) / (t_ref_count + t_alt_count)
+```
+This is following the [code used in
+`maftools`](https://github.com/PoisonAlien/maftools/blob/1d0270e35c2e0f49309eba08b62343ac0db10560/R/plot_vaf.R#L39).
+The VAF calculations and other special variables are added to the MAF fields and written to a file ending in `_vaf` in the caller's results folder.
+
+### Tumor Mutation Burden Calculation
+
+To calculate TMB, the sum of the bases included in the WXS or WGS BED regions are used as the denominator, depending on the sample's processing strategy.
+
+```
+TMBwxs = sum(mutation_w-in_bedwxs)/(wxs_genome_size/1000000)
+TMBwgs = sum(all_mutations)/(wgs_genome_size/1000000)
+```
+
+Where genome size is calculated from the respective BED file as:
+```
+genome_size = sum(End_Position - Start_Position)
+```
+
 ## General usage of scripts
 
 **Overall notes about these scripts:**
@@ -40,6 +97,10 @@ this can be overridden with `--overwrite` option.
 
 ### 01-setup_db.py
 
+Creates and/or fills a database of variant calls that will be used by subsequent calls.
+Note: requires `pandas` to be installed, and expects python3
+All arguments are optional; only the included tables will be affected.
+
 **Argument descriptions**
 ```
   -d DB_FILE, --db-file DB_FILE
@@ -59,6 +120,9 @@ this can be overridden with `--overwrite` option.
 
 ### 02-merge_callers.R
 
+Using the database created by `01-setup_db.py`, merge callers' data files into consensus MAF-like file.
+
+**Argument descriptions**
 ```
  --db_file : Path to sqlite database file made from 01-setup_db.py
  --output_file : File path and file name of where you would like the MAF-like
@@ -71,11 +135,10 @@ this can be overridden with `--overwrite` option.
 ```
 ### 03-calculate_tmb.R
 
-This script sets up the given MAF file and outputs three files ([VAF](#variant-allele-fraction-calculation),
-[TMB](#tumor-mutation-burden-calculation), and [Regional analysis](#genomic-regional-analyses))
-that are used to make an overall evaluation report in `02-run_eval.R`.
+Using the consensus file created in `02-merge_callers.R`, calculate TMB for all
+WGS and WXS samples.
 
-**Option descriptions**
+**Argument descriptions**
 ```
  --consensus : File path to the MAF-like file.
  --metadata : Relative file path to MAF file to be analyzed. Can be .gz compressed.
@@ -86,89 +149,3 @@ that are used to make an overall evaluation report in `02-run_eval.R`.
              is given from top directory of 'OpenPBTA-analysis'
  --overwrite : If specified, will overwrite any files of the same name. Default is FALSE.
 ```
-### Base change analysis
-
-To evaluate base changes I summarized standard MAF fields as two new variables:
-The `base_change` variable that indicates the exact change in bases from
-concatenating `Reference_Allele`, `>`, and `Allele`.
-The `change` variable is made from the `base_change` variable but groups
-together deletions, insertions, and long (more than a SNV) as their own groups.
-
-### Variant Allele Fraction Calculation
-
-Calculate variant allele fraction (VAF) for each variant.
-This is done in `03-calculate_tmb.R`.
-
-```
-vaf = (t_alt_count) / (t_ref_count + t_alt_count)
-```
-This is following the [code used in
-`maftools`](https://github.com/PoisonAlien/maftools/blob/1d0270e35c2e0f49309eba08b62343ac0db10560/R/plot_vaf.R#L39).
-The VAF calculations and other special variables are added to the MAF fields and written to a file ending in `_vaf` in the caller's results folder.
-
-### Tumor Mutation Burden Calculation
-
-To calculate TMB, the sum of the bases included in the WXS or WGS BED regions are used as the denominator, depending on the sample's processing strategy.
-
-```
-TMBwxs = sum(mutation_w-in_bedwxs)/(wxs_genome_size/1000000)
-TMBwgs = sum(all_mutations)/(wgs_genome_size/1000000)
-```
-
-Where genome size is calculated from the respective BED file as:
-```
-genome_size = sum(End_Position - Start_Position)
-```
-## Comparison of Callers
-
-After running an initial evaluation and set up of each of the callers' MAF files,
-the `compare_snv_callers.Rmd` notebook can be run to get final results on consensus
-mutation calls.  
-
-### Mutation IDs  
-
-In order to compare mutations across callers, the data tables for each caller
-were indexed by: `Chromosome` `Start_Position` `Reference_Allele` and `Allele`.
-Meaning that across callers, if all these fields were identical, they were considered to be the same mutation.
-This was done in the `01-setup_db.py` script using the function.
-
-### Summary of consensus files:
-
-- `consensus_mutation.maf.tsv` - Mutations that were called by all three of these callers for a given sample are saved to this file.
-This file is [MAF-like](#consensus-mutation-call) meaning it is TSV file that contains many of the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but also some added calculations like [Variant Allele Fraction](#variant-allele-fraction-calculation) and some sample metadata information.
-These files combine the [MAF file data](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from 3 different SNV callers: [Mutect2](https://software.broadinstitute.org/cancer/cga/mutect), [Strelka2](https://github.com/Illumina/strelka), and [Lancet](https://github.com/nygenome/lancet).
-See the methods on the callers' settings [here](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-single-nucleotide-variant-calling) and see the methods of this caller analysis and comparison [here](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers).
-
-It is "MAF-like" file because it has many of the same fields as a MAF file but..  
-  - Does not contain the version string in the first row   
-  - Has extraneous annotation data has been removed (columns with all `NA`s)  
-  - Has VAF calculations and other variables that are calculated by the [`set_up_maf` function](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/util/wrangle_functions.R#L11).
-
-- `consensus_mutation_tmb.tsv` - After the consensus mutations were identified, Tumor mutation burden was recalculated for each sample from this mutation set.
-See this [analysis' folder](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers#tumor-mutation-burden-calculation) for details on these methods.
-
-## Summary of custom functions
-
-### Wrangling functions
-|Function Name|Output created|Main Arguments|
-|-------------|--------------|---------|
-|`set_up_maf`|Columns: VAF, mutation_id, base_change, change| A MAF formatted data.frame|
-|`maf_to_granges`|A `GenomicRanges` format object|A MAF formatted data.frame|
-|`wxs_bed_filter`|A filtered MAF df with only mutations within the provided BED regions|A MAF formatted data.frame and a BED formatted data.frame|
-|`calculate_tmb`|A data.frame with TMB stat per sample |A `wxs_bed_filter`ed MAF formatted data.frame, a WGS and WXS genome sizes|
-|`annotr_maf`|A data.frame with genomic annotations for the provided MAF data.frame|A MAF formatted data.frame and a [built AnnotatR annotation object](https://rdrr.io/bioc/annotatr/man/build_annotations.html)|
-|`find_cosmic_overlap`|Find overlap with [COSMIC](https://cancer.sanger.ac.uk/cosmic) mutations|A MAF formatted data.frame with `change` and variable from the `set_up_maf` function|
-
-### Plotting functions
-
-All plotting functions use `ggplot2` and all use a argument: `exp_strategy` that
-determines whether to plot WGS or WXS samples only or both. Default is to plot
-both.
-
-|Function Name|Plot output|Main Arguments|
-|-------------|-----------|---------|
-|`base_change_plot`|Base change ggplot barplot|A MAF formatted data.frame with the `change` column from `set_up_maf` function|
-|`depth_vs_vaf_plot`|A scatterplot of depth vs VAF|A MAF formatted data.frame with the `change` column from `set_up_maf` function|
-|`snv_region_plot`|Genomic region ggplot barplot|An genomic region annotated MAF data.frame and has from `annotr_maf`|
-|`cosmic_plot`|A violin plot of overlapping COSMIC vs non-COSMIC mutations|A MAF formatted data.frame with the `vaf` column from `set_up_maf` function|
-|`tmb_plot`|Plot Tumor Mutational Burden as a jitter plot|A data.frame with the tmb stats calculated from `calculate_tmb` function, `x_axis` to specify what variable to plot on the x-axis|

From 514922cebbeeb503d59b222f70661892f803980f Mon Sep 17 00:00:00 2001
From: Candace Savonen <savonenc@msu.edu>
Date: Mon, 2 Dec 2019 09:48:16 -0500
Subject: [PATCH 04/11] Get rid of unnecessary 00-setup.R script

---
 analyses/snv-callers/README.md           |   5 +-
 analyses/snv-callers/scripts/00-set_up.R | 152 -----------------------
 2 files changed, 2 insertions(+), 155 deletions(-)
 delete mode 100644 analyses/snv-callers/scripts/00-set_up.R

diff --git a/analyses/snv-callers/README.md b/analyses/snv-callers/README.md
index c40ed9d0d0..db9b941254 100644
--- a/analyses/snv-callers/README.md
+++ b/analyses/snv-callers/README.md
@@ -1,7 +1,7 @@
 # SNV caller comparison analysis
 
 This analysis evaluates [MAF files](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from different SNV callers, compares their output, and creates a [consensus mutation file](./results/consensus/consensus_mutation.maf.tsv.zip).
-This consensus mutation file is [MAF-like](#consensus-mutation-call) meaning it is TSV file that contains many of the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but also some added calculations like [Variant Allele Fraction](#variant-allele-fraction-calculation) and some sample metadata information.
+This consensus mutation file is [MAF-like](#consensus-mutation-call) meaning it is TSV file that contains many of the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but also has [VAF](#variant-allele-fraction-calculation)
 
 <!-- START doctoc generated TOC please keep comment here to allow auto update -->
 <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
@@ -64,14 +64,13 @@ This was done in the `01-setup_db.py` script using the function.
 ### Variant Allele Fraction Calculation
 
 Calculate variant allele fraction (VAF) for each variant.
-This is done in `03-calculate_tmb.R`.
+This is done in `01-setup_db.py`.
 
 ```
 vaf = (t_alt_count) / (t_ref_count + t_alt_count)
 ```
 This is following the [code used in
 `maftools`](https://github.com/PoisonAlien/maftools/blob/1d0270e35c2e0f49309eba08b62343ac0db10560/R/plot_vaf.R#L39).
-The VAF calculations and other special variables are added to the MAF fields and written to a file ending in `_vaf` in the caller's results folder.
 
 ### Tumor Mutation Burden Calculation
 
diff --git a/analyses/snv-callers/scripts/00-set_up.R b/analyses/snv-callers/scripts/00-set_up.R
deleted file mode 100644
index e1711fb8e8..0000000000
--- a/analyses/snv-callers/scripts/00-set_up.R
+++ /dev/null
@@ -1,152 +0,0 @@
-# SNV-caller set up
-# 2019
-# C. Savonen for ALSF - CCDL
-#
-# Purpose: Set up for reference files that are used by 01-calculate_vaf_tmb.R
-#          script.
-#
-# Option descriptions
-# --annot_rds : File path to where you would like the annotation_rds file to be
-#               stored
-# --cosmic_og : Path to original COSMIC file. Can be .gz compressed. Give path relative
-#               to top directory, 'OpenPBTA-analysis'. Will need to download this from
-#               COSMIC at https://cancer.sanger.ac.uk/cosmic/download.
-#               These data are available if you register.
-# --cosmic_clean : File path specifying where you would like the cleaned brain-related
-#                  COSMIC mutations file to be stored.
-#
-#################################### Set Up ####################################
-# Establish base dir
-root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))
-
-# Magrittr pipe
-`%>%` <- dplyr::`%>%`
-
-# Load the optparse library
-library(optparse)
-
-# Will need optparse for collecting options
-if (!("optparse" %in% installed.packages())) {
-  install.packages("optparse", repos = "http://cran.us.r-project.org")
-}
-# Will need R.utils for zipping up the results file
-if (!("R.utils" %in% installed.packages())) {
-  install.packages("R.utils", repos = "http://cran.us.r-project.org")
-}
-# Install package if not installed
-if (!("annotatr" %in% installed.packages())) {
-  BiocManager::install("annotatr")
-}
-# Install package if not installed
-if (!("data.table" %in% installed.packages())) {
-  install.packages("data.table", repos = "http://cran.us.r-project.org")
-}
-
-################################ Set up options ################################
-# Set up optparse options
-option_list <- list(
-  make_option(
-    opt_str = "--cosmic_og", type = "character", default = "none",
-    help = "Relative file path (assuming from top directory of 
-              'OpenPBTA-analysis') to where the COSMIC mutation file is stored. 
-              Will need to download this from COSMIC at 
-              https://cancer.sanger.ac.uk/cosmic/download
-              These data are available if you register.",
-    metavar = "character"
-  ),
-  make_option(
-    opt_str = "--cosmic_clean", type = "character", default = "none",
-    help = "Relative file path (assuming from top directory of 
-              'OpenPBTA-analysis') where you would like the cleaned COSMIC file
-              to be stored.",
-    metavar = "character"
-  ),
-  make_option(
-    opt_str = "--annot_rds", type = "character",
-    default = "none", help = "Relative file path (assuming from top 
-              directory of 'OpenPBTA-analysis') that specifies where you would
-              like the annotation_rds file to be stored.",
-    metavar = "character"
-  )
-)
-
-# Parse options
-opt <- parse_args(OptionParser(option_list = option_list))
-
-##################### Set base directories common file paths ###################
-# Make all base names relative to root_dir
-opt$cosmic_og <- file.path(root_dir, opt$cosmic_og)
-opt$cosmic_clean <- file.path(root_dir, opt$cosmic_clean)
-opt$annot_rds <- file.path(root_dir, opt$annot_rds)
-
-################################ Build Annotation ##############################
-# We will only run this if it hasn't been run before
-if (opt$annot_rds != "none" && !file.exists(opt$annot_rds)) {
-  # Print progress message
-  message("Setting up genomic region annotation file. Only need to do this once.")
-
-  # Here we are specifying what types of annotation we would like
-  annots <- c(
-    "hg38_basicgenes",
-    "hg38_genes_intergenic",
-    "hg38_genes_intronexonboundaries",
-    "hg38_genes_exonintronboundaries",
-    "hg38_genes_5UTRs",
-    "hg38_genes_exons",
-    "hg38_genes_introns",
-    "hg38_genes_3UTRs"
-  )
-
-  # Build all the annotations into GRanges object
-  annotations <- annotatr::build_annotations(genome = "hg38", annotations = annots)
-
-  # Write this object so we don't have to write it again
-  readr::write_rds(annotations,
-    opt$annot_rds,
-    compress = "gz"
-  )
-}
-######################## Set up COSMIC Mutations file ###########################
-# This set up will not be run unless you obtain the original data file
-# from https://cancer.sanger.ac.uk/cosmic/download , these data are available
-# if you register. The full, unfiltered somatic mutations file
-# CosmicMutantExport.tsv for grch38 is used here but only mutations from brain
-# related samples are kept.
-if (opt$cosmic_clean != "none" && !file.exists(opt$cosmic_clean)) {
-
-  # Print progress message
-  message("Setting up COSMIC mutation file. Only need to do this once.")
-
-  # Read in original file
-  cosmic_variants <- data.table::fread(opt$cosmic_og,
-    data.table = FALSE
-  ) %>%
-    # Keep only brain mutations so the file is smaller
-    dplyr::filter(`Site subtype 1` == "brain") %>%
-    # Get rid of spaces in column names
-    dplyr::rename_all(dplyr::funs(stringr::str_replace_all(., " ", "_"))) %>%
-    # Separate the genome coordinates into their own BED like variables
-    dplyr::mutate(
-      Chromosome = paste0(
-        "chr",
-        stringr::word(Mutation_genome_position, sep = ":|-", 1) %>%
-          stringr::str_replace_all(c("23" = "X", "24" = "Y"))
-      ),
-      Start_Position = stringr::word(Mutation_genome_position, sep = ":|-", 2),
-      End_Position = stringr::word(Mutation_genome_position, sep = ":|-", 3),
-      # Make a base_change variable so we can compare to our set up for PBTA data
-      base_change = substr(Mutation_CDS, nchar(Mutation_CDS) - 2, nchar(Mutation_CDS))
-    ) %>%
-    # Carry over the strand info, but rename to match our PBTA set up
-    dplyr::rename(Strand = Mutation_strand) %>%
-    # Narrow down to just the needed columns
-    dplyr::select(
-      "Chromosome", "Start_Position", "End_Position", "Strand",
-      "base_change"
-    ) %>%
-    # Write to a TSV file
-    readr::write_tsv(opt$cosmic_clean)
-} else {
-  warning("A cleaned COSMIC Mutation file was already found with this name. Delete this if you 
-         wanted to re-run this step.")
-}

From f8b0722c0dbe828438e149714975b9bd1ee77b3f Mon Sep 17 00:00:00 2001
From: Candace Savonen <savonenc@msu.edu>
Date: Mon, 2 Dec 2019 09:50:10 -0500
Subject: [PATCH 05/11] Take out 00-setup call from bash script

---
 analyses/snv-callers/run_caller_consensus_analysis.sh | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/analyses/snv-callers/run_caller_consensus_analysis.sh b/analyses/snv-callers/run_caller_consensus_analysis.sh
index 085817b3f2..7294eec5d7 100644
--- a/analyses/snv-callers/run_caller_consensus_analysis.sh
+++ b/analyses/snv-callers/run_caller_consensus_analysis.sh
@@ -25,15 +25,6 @@ vaf_cutoff=${OPENPBTA_VAF_CUTOFF:-0}
 # To run plots, set OPENPBTA_PLOTS to 1 or more
 run_plots_nb=${OPENPBTA_PLOTS:-0}
 
-############################ Set Up Reference Files ############################
-# The original COSMIC file is obtained from: https://cancer.sanger.ac.uk/cosmic/download
-# These data are available if you register. The full, unfiltered somatic mutations
-# file CosmicMutantExport.tsv.gz for grch38 is used here.
-Rscript analyses/snv-callers/scripts/00-set_up.R \
-  --annot_rds $annot_rds \
-  --cosmic_og scratch/CosmicMutantExport.tsv.gz \
-  --cosmic_clean $cosmic
-
 ################################ Set Up Database ################################
 python3 analyses/snv-callers/scripts/01-setup_db.py \
   --db-file $dbfile \

From 58eadaa368374ec841dac575895405454da658ff Mon Sep 17 00:00:00 2001
From: Candace Savonen <savonenc@msu.edu>
Date: Mon, 2 Dec 2019 11:29:14 -0500
Subject: [PATCH 06/11] Add output summary, streamline wording

---
 analyses/snv-callers/README.md | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/analyses/snv-callers/README.md b/analyses/snv-callers/README.md
index db9b941254..d47a9e1ca7 100644
--- a/analyses/snv-callers/README.md
+++ b/analyses/snv-callers/README.md
@@ -3,6 +3,8 @@
 This analysis evaluates [MAF files](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from different SNV callers, compares their output, and creates a [consensus mutation file](./results/consensus/consensus_mutation.maf.tsv.zip).
 This consensus mutation file is [MAF-like](#consensus-mutation-call) meaning it is TSV file that contains many of the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but also has [VAF](#variant-allele-fraction-calculation)
 
+See the comparison results plots [here](https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/compare_snv_callers_plots.nb.html).
+
 <!-- START doctoc generated TOC please keep comment here to allow auto update -->
 <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
 **Table of Contents**  *generated with [DocToc](https://github.com/thlorenz/doctoc)*
@@ -27,30 +29,27 @@ To run the evaluations and comparisons of all the SNV callers, call the bash scr
 ```
 bash run_caller_analysis.sh
 ```
-This script will return results for each caller in the `plots` and `results` folder.
-To see an overall summary report, look in the `results` folder for that caller.
-(See [Overall File Structure](#overall-file-structure) for more details on
-everything that is returned.)
+This bash script will return:
 
-The final results for the caller consensus are in the form of a notebook, `compare_snv_callers.nb.html`.
-The consensus mutations themselves are saved to a [MAF-like file](#consensus-mutation-call) `consensus_mutation.maf.tsv.zip` to the `results/consensus`
-results folder.
-These are the mutations dubbed reliable enough to move forward with.
+- Comparison plots in a notebook: [`compare_snv_callers_plots.nb.html`](https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/compare_snv_callers_plots.nb.html).
+- A zip file containing:
+  - The consensus mutations themselves, saved to [MAF-like file](#consensus-mutation-call) `consensus_mutation.maf.tsv`.
+  - Tumor Mutation burden calculations in `consensus_mutation_tmb.tsv`.
 
 ### Summary of consensus files:
 
 - `consensus_mutation.maf.tsv` - Mutations that were called by all three of these callers for a given sample are saved to this file.
-This file is [MAF-like](#consensus-mutation-call) meaning it is TSV file that contains many of the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but also some added calculations like [Variant Allele Fraction](#variant-allele-fraction-calculation) and some sample metadata information.
+This file is MAF-like meaning it is TSV file that contains many of the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but also some added calculations like [Variant Allele Fraction](#variant-allele-fraction-calculation) and some sample metadata information.
 These files combine the [MAF file data](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from 3 different SNV callers: [Mutect2](https://software.broadinstitute.org/cancer/cga/mutect), [Strelka2](https://github.com/Illumina/strelka), and [Lancet](https://github.com/nygenome/lancet).
-See the methods on the callers' settings [here](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-single-nucleotide-variant-calling) and see the methods of this caller analysis and comparison [here](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers).
+See the methods on the callers' settings [here](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-single-nucleotide-variant-calling) and see [the methods of this caller analysis and comparison below](#summary-of-methods).
 
 It is "MAF-like" file because it has many of the same fields as a MAF file but..  
   - Does not contain the version string in the first row   
-  - Has extraneous annotation data has been removed (columns with all `NA`s)  
-  - Has VAF calculations and other variables that are calculated by the [`set_up_maf` function](https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/analyses/snv-callers/util/wrangle_functions.R#L11).
+  - Extraneous annotation data has been removed (columns with all `NA`s)  
+  - Has VAF calculations
 
 - `consensus_mutation_tmb.tsv` - After the consensus mutations were identified, Tumor mutation burden was recalculated for each sample from this mutation set.
-See this [analysis' folder](https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers#tumor-mutation-burden-calculation) for details on these methods.
+See this [analysis' folder](#tumor-mutation-burden-calculation) for details on these methods.
 
 ## Summary of Methods
 

From a1c12e94484aa60748dd06e55f1f6a2cf0a525be Mon Sep 17 00:00:00 2001
From: Candace Savonen <savonenc@msu.edu>
Date: Mon, 2 Dec 2019 18:22:37 -0500
Subject: [PATCH 07/11] Make @jashapiro suggestions except MNV description

---
 analyses/snv-callers/README.md | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/analyses/snv-callers/README.md b/analyses/snv-callers/README.md
index d47a9e1ca7..ceed491a85 100644
--- a/analyses/snv-callers/README.md
+++ b/analyses/snv-callers/README.md
@@ -1,7 +1,7 @@
 # SNV caller comparison analysis
 
 This analysis evaluates [MAF files](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from different SNV callers, compares their output, and creates a [consensus mutation file](./results/consensus/consensus_mutation.maf.tsv.zip).
-This consensus mutation file is [MAF-like](#consensus-mutation-call) meaning it is TSV file that contains many of the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but also has [VAF](#variant-allele-fraction-calculation)
+This consensus mutation file is [MAF-like](#consensus-mutation-call) meaning it is TSV file that contains the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but adds [VAF](#variant-allele-fraction-calculation), but does not contain a starting comment line with a version number.
 
 See the comparison results plots [here](https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/compare_snv_callers_plots.nb.html).
 
@@ -38,16 +38,10 @@ This bash script will return:
 
 ### Summary of consensus files:
 
-- `consensus_mutation.maf.tsv` - Mutations that were called by all three of these callers for a given sample are saved to this file.
-This file is MAF-like meaning it is TSV file that contains many of the fields of a [MAF file](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) but also some added calculations like [Variant Allele Fraction](#variant-allele-fraction-calculation) and some sample metadata information.
+- `consensus_mutation.maf.tsv` - is MAF-like file that contains the mutations that were called by all three of these callers for a given sample are saved to this file.
 These files combine the [MAF file data](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from 3 different SNV callers: [Mutect2](https://software.broadinstitute.org/cancer/cga/mutect), [Strelka2](https://github.com/Illumina/strelka), and [Lancet](https://github.com/nygenome/lancet).
 See the methods on the callers' settings [here](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-single-nucleotide-variant-calling) and see [the methods of this caller analysis and comparison below](#summary-of-methods).
 
-It is "MAF-like" file because it has many of the same fields as a MAF file but..  
-  - Does not contain the version string in the first row   
-  - Extraneous annotation data has been removed (columns with all `NA`s)  
-  - Has VAF calculations
-
 - `consensus_mutation_tmb.tsv` - After the consensus mutations were identified, Tumor mutation burden was recalculated for each sample from this mutation set.
 See this [analysis' folder](#tumor-mutation-burden-calculation) for details on these methods.
 
@@ -95,7 +89,7 @@ this can be overridden with `--overwrite` option.
 
 ### 01-setup_db.py
 
-Creates and/or fills a database of variant calls that will be used by subsequent calls.
+Creates and/or fills an SQLite database of variant calls that will be used by subsequent steps to find consensus mutations.
 Note: requires `pandas` to be installed, and expects python3
 All arguments are optional; only the included tables will be affected.
 
@@ -118,7 +112,7 @@ All arguments are optional; only the included tables will be affected.
 
 ### 02-merge_callers.R
 
-Using the database created by `01-setup_db.py`, merge callers' data files into consensus MAF-like file.
+Using the database created by `01-setup_db.py`, merge callers' data files into consensus [MAF-like file](#snv-caller-comparison-analysis).
 
 **Argument descriptions**
 ```

From c6e556b1248e75651109ee1de7b7a2b1f0b5ebae Mon Sep 17 00:00:00 2001
From: Candace Savonen <savonenc@msu.edu>
Date: Tue, 3 Dec 2019 10:47:57 -0500
Subject: [PATCH 08/11] Updates to README to reflect #307 s changes

---
 analyses/snv-callers/README.md | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/analyses/snv-callers/README.md b/analyses/snv-callers/README.md
index ceed491a85..ce0d0fc332 100644
--- a/analyses/snv-callers/README.md
+++ b/analyses/snv-callers/README.md
@@ -33,17 +33,11 @@ This bash script will return:
 
 - Comparison plots in a notebook: [`compare_snv_callers_plots.nb.html`](https://cansavvy.github.io/openpbta-notebook-concept/snv-callers/compare_snv_callers_plots.nb.html).
 - A zip file containing:
-  - The consensus mutations themselves, saved to [MAF-like file](#consensus-mutation-call) `consensus_mutation.maf.tsv`.
-  - Tumor Mutation burden calculations in `consensus_mutation_tmb.tsv`.
-
-### Summary of consensus files:
-
-- `consensus_mutation.maf.tsv` - is MAF-like file that contains the mutations that were called by all three of these callers for a given sample are saved to this file.
-These files combine the [MAF file data](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from 3 different SNV callers: [Mutect2](https://software.broadinstitute.org/cancer/cga/mutect), [Strelka2](https://github.com/Illumina/strelka), and [Lancet](https://github.com/nygenome/lancet).
-See the methods on the callers' settings [here](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-single-nucleotide-variant-calling) and see [the methods of this caller analysis and comparison below](#summary-of-methods).
-
-- `consensus_mutation_tmb.tsv` - After the consensus mutations were identified, Tumor mutation burden was recalculated for each sample from this mutation set.
-See this [analysis' folder](#tumor-mutation-burden-calculation) for details on these methods.
+  - `consensus_snv.maf.tsv` - is  [MAF-like file](#consensus-mutation-call) that contains the snvs that were called by all three of these callers for a given sample are saved to this file.
+  These files combine the [MAF file data](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from 3 different SNV callers: [Mutect2](https://software.broadinstitute.org/cancer/cga/mutect), [Strelka2](https://github.com/Illumina/strelka), and [Lancet](https://github.com/nygenome/lancet).
+  See the methods on the callers' settings [here](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-single-nucleotide-variant-calling) and see [the methods of this caller analysis and comparison below](#summary-of-methods).  
+  - `consensus_snv_tmb_coding_only.tsv` - Tumor Mutation burden calculations using *coding only* mutations use the consensus of Lancet, Mutect2, and Strelka2. 
+  - `consensus_snv_tmb_all.tsv` - Tumor Mutation burden calculations using *all* mutations use the consensus of Mutect2, and Strelka2. (Lancet was excluded because it has a coding region bias).
 
 ## Summary of Methods
 
@@ -129,10 +123,12 @@ Using the database created by `01-setup_db.py`, merge callers' data files into c
 
 Using the consensus file created in `02-merge_callers.R`, calculate TMB for all
 WGS and WXS samples.
+Two TMB files are created, one including *all snv* called by Strelka2 and Mutect2 (Lancet is excluded from this TMB calculation consensus because of a coding region bias), and a *coding snvs only* TMB calculation. 
 
 **Argument descriptions**
 ```
  --consensus : File path to the MAF-like file.
+ --db_file : Path to sqlite database file made from 01-setup_db.py
  --metadata : Relative file path to MAF file to be analyzed. Can be .gz compressed.
               Assumes file path is given from top directory of 'OpenPBTA-analysis'.
  --bed_wgs : File path that specifies the caller-specific BED regions file.

From 9d344c83cf69590589be6431dedf838623b2f7b8 Mon Sep 17 00:00:00 2001
From: Candace Savonen <savonenc@msu.edu>
Date: Tue, 3 Dec 2019 11:19:44 -0500
Subject: [PATCH 09/11] clarify Lancet statement and put link

---
 analyses/snv-callers/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/analyses/snv-callers/README.md b/analyses/snv-callers/README.md
index ce0d0fc332..6fff7c7ea8 100644
--- a/analyses/snv-callers/README.md
+++ b/analyses/snv-callers/README.md
@@ -37,7 +37,7 @@ This bash script will return:
   These files combine the [MAF file data](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from 3 different SNV callers: [Mutect2](https://software.broadinstitute.org/cancer/cga/mutect), [Strelka2](https://github.com/Illumina/strelka), and [Lancet](https://github.com/nygenome/lancet).
   See the methods on the callers' settings [here](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-single-nucleotide-variant-calling) and see [the methods of this caller analysis and comparison below](#summary-of-methods).  
   - `consensus_snv_tmb_coding_only.tsv` - Tumor Mutation burden calculations using *coding only* mutations use the consensus of Lancet, Mutect2, and Strelka2. 
-  - `consensus_snv_tmb_all.tsv` - Tumor Mutation burden calculations using *all* mutations use the consensus of Mutect2, and Strelka2. (Lancet was excluded because it has a coding region bias).
+  - `consensus_snv_tmb_all.tsv` - Tumor Mutation burden calculations using *all* mutations use the consensus of Mutect2, and Strelka2. (Lancet was excluded because it has a [coding region bias in the way it was ran](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling)).
 
 ## Summary of Methods
 
@@ -123,7 +123,7 @@ Using the database created by `01-setup_db.py`, merge callers' data files into c
 
 Using the consensus file created in `02-merge_callers.R`, calculate TMB for all
 WGS and WXS samples.
-Two TMB files are created, one including *all snv* called by Strelka2 and Mutect2 (Lancet is excluded from this TMB calculation consensus because of a coding region bias), and a *coding snvs only* TMB calculation. 
+Two TMB files are created, one including *all snv* called by Strelka2 and Mutect2 (Lancet is excluded from this TMB calculation consensus because of a [coding region bias in the way it was ran](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling)), and a *coding snvs only* TMB calculation. 
 
 **Argument descriptions**
 ```

From 999a41a979a4fe3fa69ad1b400b5c75f76498103 Mon Sep 17 00:00:00 2001
From: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Date: Tue, 3 Dec 2019 11:23:08 -0500
Subject: [PATCH 10/11] Update mutation comparison explanation

---
 analyses/snv-callers/README.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/analyses/snv-callers/README.md b/analyses/snv-callers/README.md
index ce0d0fc332..93d993dbb8 100644
--- a/analyses/snv-callers/README.md
+++ b/analyses/snv-callers/README.md
@@ -41,13 +41,6 @@ This bash script will return:
 
 ## Summary of Methods
 
-### Mutation Comparisons
-
-In order to compare mutations across callers, the data tables for each caller
-were indexed by: `Chromosome` `Start_Position` `Reference_Allele` and `Allele`.
-Meaning that across callers, if all these fields were identical, they were considered to be the same mutation.
-This was done in the `01-setup_db.py` script using the function.
-
 ### Variant Allele Fraction Calculation
 
 Calculate variant allele fraction (VAF) for each variant.
@@ -59,6 +52,13 @@ vaf = (t_alt_count) / (t_ref_count + t_alt_count)
 This is following the [code used in
 `maftools`](https://github.com/PoisonAlien/maftools/blob/1d0270e35c2e0f49309eba08b62343ac0db10560/R/plot_vaf.R#L39).
 
+### Mutation Comparisons
+
+The default consensus mutations called are those that are shared among all of Strelka2, Mutect2, and Lancet.
+Mutations were considered to be the same if they were identical in the following field: `Chromosome`, `Start_Position`, `Reference_Allele`,  `Allele`, and `Tumor_Sample_Barcode`.
+As Strelka2 does not call multinucleotide variants (MNV), but instead calls each component SNV as a separate mutation, MNV calls from Mutect2 and Lancet were separated into consecutive SNVs before comparison with Strelka2.
+
+
 ### Tumor Mutation Burden Calculation
 
 To calculate TMB, the sum of the bases included in the WXS or WGS BED regions are used as the denominator, depending on the sample's processing strategy.

From 9543a625e8c40f62c439d99e0b913d92659bcf8a Mon Sep 17 00:00:00 2001
From: Candace Savonen <cansav09@gmail.com>
Date: Tue, 3 Dec 2019 11:39:54 -0500
Subject: [PATCH 11/11] Update analyses/snv-callers/README.md

Co-Authored-By: jashapiro <jashapiro@gmail.com>
---
 analyses/snv-callers/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/analyses/snv-callers/README.md b/analyses/snv-callers/README.md
index 62a8d27314..34671c467e 100644
--- a/analyses/snv-callers/README.md
+++ b/analyses/snv-callers/README.md
@@ -37,7 +37,7 @@ This bash script will return:
   These files combine the [MAF file data](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) from 3 different SNV callers: [Mutect2](https://software.broadinstitute.org/cancer/cga/mutect), [Strelka2](https://github.com/Illumina/strelka), and [Lancet](https://github.com/nygenome/lancet).
   See the methods on the callers' settings [here](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#somatic-single-nucleotide-variant-calling) and see [the methods of this caller analysis and comparison below](#summary-of-methods).  
   - `consensus_snv_tmb_coding_only.tsv` - Tumor Mutation burden calculations using *coding only* mutations use the consensus of Lancet, Mutect2, and Strelka2. 
-  - `consensus_snv_tmb_all.tsv` - Tumor Mutation burden calculations using *all* mutations use the consensus of Mutect2, and Strelka2. (Lancet was excluded because it has a [coding region bias in the way it was ran](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling)).
+  - `consensus_snv_tmb_all.tsv` - Tumor Mutation burden calculations using *all* mutations use the consensus of Mutect2, and Strelka2. (Lancet was excluded because it has a [coding region bias in the way it was run](https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#snv-and-indel-calling)).
 
 ## Summary of Methods