Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Commit

Permalink
Part 5 of n: SNV caller initial calculations (#134)
Browse files Browse the repository at this point in the history
* Set up the set up

* Add circle CI test and Docker config

* Add some more comments

* Set up Rprojroot for circle CI test to work better

* Fixing Circle CI file.

* Change read_tsv to data.table::fread for big file

* read in the .gz file

* push plot function changes

* Fix an error

* Add missing package to Dockerfile

* Reduce cosmic file to only the brain sample mutations

* Update README with changes to cosmic file

* re-updated Dockerfile

* Ran a linter on set up script

* Comment out of date

* Get rid of old WGX/WXS bed file set up

* Incorporate initial PR suggestions from @jashapiro and @cbethell

* Push a working bash script

* Add bash script to circle CI

* Add usage section in README and change name of script

* Add some more comments

* Correct a couple things in the README

* Get rid of remnant comment

* Fix a typo!

* Add Usage to the TOC

* Add more documentation to the README

* Update Circle CI

* Push more exact bash script

* Fix a couple issues with handling metadata file path

* Get rid of dev remnants

* Couple changes for readability

* Add some things to README and get rid of part of bash that isn't there

* Found dumb mistake

* Changed [*] to [+]

* I mean [*] to a [@] which makes more sense.

* Fix some of the overwrite handling

* Update comments

* Temporarily remove VarDict while #135 is unresolved

* Switch out WGS file for lancet

* Add an if statement for no WXS samples

* Get rid of development remnant

* Fix an error with that last commit

* One more change
  • Loading branch information
cansavvy authored and jaclyn-taroni committed Oct 4, 2019
1 parent d4e8c5c commit 542e3a6
Show file tree
Hide file tree
Showing 6 changed files with 526 additions and 94 deletions.
8 changes: 4 additions & 4 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -28,14 +28,14 @@ jobs:
- run:
name: fusion filtering
command: ./scripts/run_in_ci.sh Rscript analyses/fusion_filtering/01-fusion-standardization.R -f "data/pbta-fusion-starfusion.tsv.gz" -c "STARfusion" -o "scratch/star.RDS"

- run:
name: Transcriptome dimensionality reduction
command: ./scripts/run_in_ci.sh ./analyses/transcriptomic-dimension-reduction/ci-dimension-reduction-plots.sh

- run:
name: SNV Caller Set Up
command: ./scripts/run_in_ci.sh Rscript analyses/snv-callers/scripts/00-set_up.R
name: SNV Caller Analysis
command: ./scripts/run_in_ci.sh bash analyses/snv-callers/run_caller_evals.sh

#### Add your analysis here ####

Expand Down
96 changes: 93 additions & 3 deletions analyses/snv-callers/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@ This analysis evaluates [MAF files](https://docs.gdc.cancer.gov/Data/File_Format
The GDC has [good documentation on the fields](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) contained in a standard MAF file.

**Table of Contents**
* [How to run this pipeline](#how-to-run-this-pipeline)
* [General usage of scripts](#general-usage-of-scripts)
* [Individual caller evaluation](#individual-caller-evaluation)
* [Base change analysis](#base-change-analysis)
* [Variant allele fraction calculation](#variant-allele-fraction-calculation)
Expand All @@ -14,7 +16,95 @@ The GDC has [good documentation on the fields](https://docs.gdc.cancer.gov/Data/
* [Mutation IDs](#mutation-ids)
* [Overall file structure](#overall-file-structure)
* [Summary of functions](#summary-of-custom-functions)
## Individual Caller Evaluation

## How to run this pipeline

**Run evaluations of each MAF file**

To run the initial evaluations of all the SNV callers, call the bash script:
```
bash run_caller_evals.sh
```
This script will return results for each caller in the `plots` and `results` folder.
To see an overall summary report, look in the `results` folder for that caller.
(See [Overall File Structure](#overall-file-structure) for more details on
everything that is returned.)

**Run comparison analysis of the callers**

*Coming soon*

## General usage of scripts

**Overall notes about these scripts:**
- The scripts are sequential as noted by their number.
- All file path-related options assume the file path given is relative to `OpenPBTA-analysis`.
- By default, the scripts will not overwrite existing files of the same name. However,
this can be overridden with `--overwrite` option.

### 00-set_up.R

00-set_up.R creates the [annotation RDS file](#genomic-regional-analyses) and [COSMIC mutations file](#cosmic-mutation-overlap) that are used by
the subsequent scripts.
This set up script only needs to be run once and its three options are all relating to where the reference files should be stored.

**Option descriptions**
```
--annot_rds : File path to where you would like the annotation_rds file to be
stored
--cosmic_og : Path to original COSMIC file. Can be .gz compressed. Will need to
download this from COSMIC at https://cancer.sanger.ac.uk/cosmic/download
These data are available if you register.
--cosmic_clean : File path specifying where you would like the cleaned brain-related
COSMIC mutations file to be stored. This file is provided to you in
GitHub. Only coordinates are needed.
```

### 01-calculate_vaf_tmb.R

This script sets up the given MAF file and outputs three files ([VAF](#variant-allele-fraction-calculation),
[TMB](#tumor-mutation-burden-calculation), and [Regional analysis](#genomic-regional-analyses))
that are used to make an overall evaluation report in `02-run_eval.R`.

**Option descriptions**
```
-label : Label to be used for folder and all output. eg. 'strelka2'. Default is 'maf'.
-output : File path that specifies the folder where the output should go.
New folder will be created if it doesn't exist.
--maf : Relative file path to MAF file to be analyzed. Can be .gz compressed.
--metadata : Relative file path to original metadata file.
--annot_rds : Relative file path to annotation object RDS file to be analyzed.
--bed_wgs : File path that specifies the caller-specific BED regions file.
--bed_wxs : File path that specifies the WXS BED regions file.
--overwrite : If specified, will overwrite any files of the same name. Default is FALSE.
```

### 02-run_eval.R

This script takes the files output by `01-calculate_vaf_tmb.R` and makes five
plots ([base_change](#base-change-analysis), [depth_vs_vaf](#variant-allele-fraction-calculation), [cosmic_plot](#cosmic-mutation-overlap), [snv_region](#genomic-regional-analyses), and [tmb_plot](#tumor-mutation-burden-calculation)) that are put into an overall report.

**Option descriptions**
```
# --label : Label to be used for folder and all output. eg. 'strelka2'. Optional.
# Default is 'maf'
# --plot_type : Specify what kind of plots you want printed out. Must be
# compatible with ggsave. eg pdf. Default is png
# --vaf : Folder from 01-calculate_vaf_tmb.R following files:
# <caller_name>_vaf.tsv
# <caller_name>_region.tsv
# <caller_name>_tmb.tsv
# --output : Where you would like the output from this script to be stored.
# --strategy : Specify whether you would like WXS and WGS separated for the plots.
# Analysis is still done on all data in the MAF file regardless.
# Acceptable options are 'wgs', 'wxs' or 'both', both for if you
# don't want to separate them. Default is both.
# --cosmic : Relative file path to COSMIC file to be analyzed.
# --overwrite : If TRUE, will overwrite any reports of the same name. Default is
# FALSE
```

# Individual Caller Evaluation

The first step in this analysis is an individual evaluation of each MAF from each caller.
This analysis prints out a report that is further subdivided by the WGS and WXS samples.
Expand Down Expand Up @@ -45,7 +135,7 @@ This is following the [code used in
`maftools`](https://github.com/PoisonAlien/maftools/blob/1d0270e35c2e0f49309eba08b62343ac0db10560/R/plot_vaf.R#L39).
The VAF calculations and other special variables are added to the MAF fields and written to a TSV ending in `_vaf.tsv` in the caller's results folder.

*Output for this analysiss*
*Output for this analysis*
* `results/<caller_name>/<caller_name>_vaf.tsv`
* `plots/<caller_name>/<caller_name>_<strategy>_depth_vs_vaf.png`

Expand Down Expand Up @@ -87,7 +177,7 @@ The sample-wise TMB calculations written to a TSV ending in `_tmb.tsv` in the ca
The COSMIC mutation data were obtained from https://cancer.sanger.ac.uk/cosmic/download
*To run this analysis, you need to obtain these data.*
The full, unfiltered somatic mutations file `CosmicMutantExport.tsv` for grch38 is used here and the genomic coordinates is arranged to be in BED format.
The COSMIC set is unfiltered down to only mutations detected in brain-related
The COSMIC set is filtered down to only mutations detected in brain-related
samples using the `Site subtype 1` field.
COSMIC mutations are overlapped with the present data's mutations using `GenomicRanges`.
The outcome of this overlap is added to the VAF data.frame with two `TRUE/FALSE` columns:
Expand Down
41 changes: 41 additions & 0 deletions analyses/snv-callers/run_caller_evals.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
#!/bin/bash
# C. Savonen
# CCDL for ALSF 2019

# Purpose:Run an intial evaluation of each variant caller's MAF file

# Change directory
cd kitematic

# The files named in these arrays will be ran in the analysis.
datasets=("strelka2" "mutect2" "lancet")
wgs_files=("WGS.hg38.strelka2.unpadded.bed" "WGS.hg38.mutect2.unpadded.bed" "WGS.hg38.lancet.300bp_padded.bed")

# Reference file paths
cosmic=analyses/snv-callers/brain_cosmic_variants_coordinates.tsv
annot_rds=scratch/hg38_genomic_region_annotation.rds

############################ Set Up Reference Files ############################
# The original COSMIC file is obtained from: https://cancer.sanger.ac.uk/cosmic/download
# These data are available if you register. The full, unfiltered somatic mutations
# file CosmicMutantExport.tsv.gz for grch38 is used here.
Rscript analyses/snv-callers/scripts/00-set_up.R \
--annot_rds $annot_rds \
--cosmic_og scratch/CosmicMutantExport.tsv.gz \
--cosmic_clean $cosmic

########################## Calculate and Set Up Data ##########################
# Create files that contain calculated VAF, TMB, and regional analyses.
for ((i=0;i<${#datasets[@]};i++));
do
echo "Processing dataset: ${datasets[$i]}"
Rscript analyses/snv-callers/scripts/01-calculate_vaf_tmb.R \
--label ${datasets[$i]} \
--output analyses/snv-callers/results/${datasets[$i]} \
--maf data/pbta-snv-${datasets[$i]}.vep.maf.gz \
--metadata data/pbta-histologies.tsv \
--bed_wgs data/${wgs_files[$i]} \
--bed_wxs data/WXS.hg38.100bp_padded.bed \
--annot_rds $annot_rds \
--overwrite
done
170 changes: 89 additions & 81 deletions analyses/snv-callers/scripts/00-set_up.R
Original file line number Diff line number Diff line change
@@ -1,22 +1,30 @@
# SNV Caller set up and functions
#
# SNV-caller set up
# 2019
# C. Savonen for ALSF - CCDL
#
# Purpose: Set up for analyzing MAF files
# Purpose: Set up for reference files that are used by 01-calculate_vaf_tmb.R
# script.
#
# Option descriptions
# --annot_rds : File path to where you would like the annotation_rds file to be
# stored
# --cosmic_og : Path to original COSMIC file. Can be .gz compressed. Give path relative
# to top directory, 'OpenPBTA-analysis'. Will need to download this from
# COSMIC at https://cancer.sanger.ac.uk/cosmic/download.
# These data are available if you register.
# --cosmic_clean : File path specifying where you would like the cleaned brain-related
# COSMIC mutations file to be stored.
#
#################################### Set Up ####################################
# Establish base dir
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))

# Import special functions
source(file.path(root_dir, "analyses", "snv-callers", "util", "wrangle_functions.R"))
source(file.path(root_dir, "analyses", "snv-callers", "util", "plot_functions.R"))

#################################### Set Up ####################################

# Magrittr pipe
`%>%` <- dplyr::`%>%`

# Load the optparse library
library(optparse)

# Will need optparse for collecting options
if (!("optparse" %in% installed.packages())) {
install.packages("optparse", repos = "http://cran.us.r-project.org")
Expand All @@ -34,50 +42,46 @@ if (!("data.table" %in% installed.packages())) {
install.packages("data.table", repos = "http://cran.us.r-project.org")
}

##################### Set base directories common file paths ###################
# Declare base directory names
scratch_dir <- file.path(root_dir, "scratch")
data_dir <- file.path(root_dir, "data")
snv_dir <- file.path(root_dir, "analyses", "snv-callers")

# Directories that we will need
base_results_dir <- file.path(snv_dir, "results")
base_plots_dir <- file.path(snv_dir, "plots")
cosmic_dir <- file.path(snv_dir, "cosmic")

# Create these folders if they haven't been created yet
if (!dir.exists(base_results_dir)) {
dir.create(base_results_dir)
}
if (!dir.exists(base_plots_dir)) {
dir.create(base_plots_dir)
}
if (!dir.exists(cosmic_dir)) {
dir.create(cosmic_dir)
}

# Declare reference file names
original_metadata <- file.path(data_dir, "pbta-histologies.tsv")

# These data were obtained from https://cancer.sanger.ac.uk/cosmic/download
# These data are available if you register.
# The full, unfiltered somatic mutations file CosmicMutantExport.tsv.gz for grch38
# is used here.
cosmic_file <- file.path(snv_dir, "CosmicMutantExport.tsv.gz")

# This is the cleaned version that only contains the genomic coordinates and
# base changes from the original file for brain sample mutations only.
cosmic_clean_file <- file.path(cosmic_dir, "brain_cosmic_variants_coordinates.tsv")
################################ Set up options ################################
# Set up optparse options
option_list <- list(
make_option(
opt_str = "--cosmic_og", type = "character", default = "none",
help = "Relative file path (assuming from top directory of
'OpenPBTA-analysis') to where the COSMIC mutation file is stored.
Will need to download this from COSMIC at
https://cancer.sanger.ac.uk/cosmic/download
These data are available if you register.",
metavar = "character"
),
make_option(
opt_str = "--cosmic_clean", type = "character", default = "none",
help = "Relative file path (assuming from top directory of
'OpenPBTA-analysis') where you would like the cleaned COSMIC file
to be stored.",
metavar = "character"
),
make_option(
opt_str = "--annot_rds", type = "character",
default = "none", help = "Relative file path (assuming from top
directory of 'OpenPBTA-analysis') that specifies where you would
like the annotation_rds file to be stored.",
metavar = "character"
)
)

# Will set up this annotation file below
annot_rds <- file.path(scratch_dir, "hg38_genomic_region_annotation.rds")
# Parse options
opt <- parse_args(OptionParser(option_list = option_list))

# This will be used for all WXS samples, but WGS bed regions are caller-specific
wxs_bed_file <- file.path(data_dir, "WXS.hg38.100bp_padded.bed")
##################### Set base directories common file paths ###################
# Make all base names relative to root_dir
opt$cosmic_og <- file.path(root_dir, opt$cosmic_og)
opt$cosmic_clean <- file.path(root_dir, opt$cosmic_clean)
opt$annot_rds <- file.path(root_dir, opt$annot_rds)

################################ Build Annotation ##############################
# We will only run this if it hasn't been run before
if (!file.exists(annot_rds)) {
if (opt$annot_rds != "none") {
# Print progress message
message("Setting up genomic region annotation file. Only need to do this once.")

Expand All @@ -98,7 +102,7 @@ if (!file.exists(annot_rds)) {

# Write this object so we don't have to write it again
readr::write_rds(annotations,
annot_rds,
opt$annot_rds,
compress = "gz"
)
}
Expand All @@ -108,37 +112,41 @@ if (!file.exists(annot_rds)) {
# if you register. The full, unfiltered somatic mutations file
# CosmicMutantExport.tsv for grch38 is used here but only mutations from brain
# related samples are kept.
if (!file.exists(cosmic_clean_file)) {
# Print progress message
message("Setting up COSMIC mutation file. Only need to do this once.")

# Read in original file
cosmic_variants <- data.table::fread(cosmic_file,
data.table = FALSE
) %>%
# Keep only brain mutations so the file is smaller
dplyr::filter(`Site subtype 1` == "brain")
# Get rid of spaces in column names
cosmic_variants <- cosmic_variants %>%
dplyr::rename_all(dplyr::funs(stringr::str_replace_all(., " ", "_"))) %>%
# Separate the genome coordinates into their own BED like variables
dplyr::mutate(
Chromosome = paste0(
"chr",
stringr::word(Mutation_genome_position, sep = ":", 1)
),
Start_Position = stringr::word(Mutation_genome_position, sep = ":|-", 1),
End_Position = stringr::word(Mutation_genome_position, sep = "-", 2),
# Make a base_change variable so we can compare to our set up for PBTA data
base_change = substr(Mutation_CDS, nchar(Mutation_CDS) - 2, nchar(Mutation_CDS))
) %>%
# Carry over the strand info, but rename to match our PBTA set up
dplyr::rename(Strand = Mutation_strand) %>%
# Narrow down to just the needed columns
dplyr::select(
"Chromosome", "Start_Position", "End_Position", "Strand",
"base_change"
if (opt$cosmic_clean != "none") {
if (file.exists(opt$cosmic_og)) {
# Print progress message
message("Setting up COSMIC mutation file. Only need to do this once.")

# Read in original file
cosmic_variants <- data.table::fread(opt$cosmic_og,
data.table = FALSE
) %>%
# Write to a TSV file
readr::write_tsv(cosmic_clean_file)
# Keep only brain mutations so the file is smaller
dplyr::filter(`Site subtype 1` == "brain")
# Get rid of spaces in column names
cosmic_variants <- cosmic_variants %>%
dplyr::rename_all(dplyr::funs(stringr::str_replace_all(., " ", "_"))) %>%
# Separate the genome coordinates into their own BED like variables
dplyr::mutate(
Chromosome = paste0(
"chr",
stringr::word(Mutation_genome_position, sep = ":", 1)
),
Start_Position = stringr::word(Mutation_genome_position, sep = ":|-", 1),
End_Position = stringr::word(Mutation_genome_position, sep = "-", 2),
# Make a base_change variable so we can compare to our set up for PBTA data
base_change = substr(Mutation_CDS, nchar(Mutation_CDS) - 2, nchar(Mutation_CDS))
) %>%
# Carry over the strand info, but rename to match our PBTA set up
dplyr::rename(Strand = Mutation_strand) %>%
# Narrow down to just the needed columns
dplyr::select(
"Chromosome", "Start_Position", "End_Position", "Strand",
"base_change"
) %>%
# Write to a TSV file
readr::write_tsv(opt$cosmic_clean)
} else {
stop("Original COSMIC Mutation file not found. Specify with --cosmic_og")
}
}
Loading

0 comments on commit 542e3a6

Please sign in to comment.