Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Part 5 of n: SNV caller initial calculations #134

Merged
Merged
Show file tree
Hide file tree
Changes from 49 commits
Commits
Show all changes
55 commits
Select commit Hold shift + click to select a range
37a78fb
Set up the set up
Sep 23, 2019
a25be13
Add circle CI test and Docker config
Sep 23, 2019
4a40580
Add some more comments
Sep 23, 2019
8a4d33b
Merge remote-tracking branch 'upstream/master' into cansav09/snv-call…
Sep 23, 2019
e704c0b
Set up Rprojroot for circle CI test to work better
Sep 23, 2019
56014dd
Fixing Circle CI file.
Sep 23, 2019
03a0770
Change read_tsv to data.table::fread for big file
Sep 23, 2019
391fb52
read in the .gz file
Sep 24, 2019
93cb818
push plot function changes
Sep 24, 2019
908aff9
Fix an error
Sep 25, 2019
ee08152
Merge branch 'master' into cansav09/snv-caller_set_up
cansavvy Sep 26, 2019
6f18dff
Add missing package to Dockerfile
Sep 26, 2019
38cd490
Reduce cosmic file to only the brain sample mutations
Sep 26, 2019
b9d8333
Update README with changes to cosmic file
Sep 26, 2019
a7c5495
re-updated Dockerfile
Sep 26, 2019
efde7ef
Ran a linter on set up script
Sep 26, 2019
e2b43a8
Merge branch 'master' into cansav09/snv-caller_set_up
cansavvy Sep 26, 2019
f4c7534
Comment out of date
cansavvy Sep 26, 2019
83d001a
Get rid of old WGX/WXS bed file set up
Sep 26, 2019
30664ec
Merge branch 'master' into cansav09/snv-caller_set_up
cansavvy Sep 27, 2019
374c6a1
Incorporate initial PR suggestions from @jashapiro and @cbethell
Sep 27, 2019
e99d0b4
Push a working bash script
Sep 27, 2019
5cb619e
Add bash script to circle CI
Sep 27, 2019
5e79092
Add usage section in README and change name of script
Sep 30, 2019
0b9f4e6
Merge branch 'master' into cansav09/snv_calculations
cansavvy Sep 30, 2019
83cad86
Add some more comments
Sep 30, 2019
f383869
Merge remote-tracking branch 'origin/cansav09/snv_calculations' into …
Sep 30, 2019
2e908a1
Correct a couple things in the README
Sep 30, 2019
1146afc
Get rid of remnant comment
Sep 30, 2019
abb7dda
Fix a typo!
Sep 30, 2019
76ec476
Merge branch 'master' into cansav09/snv_calculations
cansavvy Sep 30, 2019
20f5d4c
Add Usage to the TOC
Sep 30, 2019
b63c69f
Merge branch 'master' into cansav09/snv_calculations
cansavvy Sep 30, 2019
cf9999c
Add more documentation to the README
Oct 1, 2019
068b08c
Update Circle CI
Oct 1, 2019
e9a5bde
Push more exact bash script
Oct 1, 2019
efd2264
Fix a couple issues with handling metadata file path
Oct 1, 2019
e732516
Get rid of dev remnants
Oct 1, 2019
bab4319
Couple changes for readability
Oct 1, 2019
655bc00
Merge branch 'master' into cansav09/snv_calculations
cansavvy Oct 1, 2019
748e677
Add some things to README and get rid of part of bash that isn't there
Oct 1, 2019
9511997
Merge remote-tracking branch 'origin/cansav09/snv_calculations' into …
Oct 1, 2019
d25cc45
Found dumb mistake
Oct 1, 2019
ff50ff9
Changed [*] to [+]
Oct 1, 2019
26cc61d
I mean [*] to a [@] which makes more sense.
Oct 1, 2019
f0916eb
Merge branch 'master' into cansav09/snv_calculations
cansavvy Oct 2, 2019
0c0a023
Fix some of the overwrite handling
Oct 3, 2019
4666b00
Merge remote-tracking branch 'origin/cansav09/snv_calculations' into …
Oct 3, 2019
60e3fba
Update comments
Oct 3, 2019
9f3f661
Temporarily remove VarDict while #135 is unresolved
Oct 3, 2019
fdba57a
Switch out WGS file for lancet
Oct 3, 2019
cdc1557
Add an if statement for no WXS samples
Oct 3, 2019
5379c05
Get rid of development remnant
Oct 3, 2019
0de0269
Fix an error with that last commit
Oct 3, 2019
cdd7d2b
One more change
Oct 3, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -28,14 +28,14 @@ jobs:
- run:
name: fusion filtering
command: ./scripts/run_in_ci.sh Rscript analyses/fusion_filtering/01-fusion-standardization.R -f "data/pbta-fusion-starfusion.tsv.gz" -c "STARfusion" -o "scratch/star.RDS"

- run:
name: Transcriptome dimensionality reduction
command: ./scripts/run_in_ci.sh ./analyses/transcriptomic-dimension-reduction/ci-dimension-reduction-plots.sh

- run:
name: SNV Caller Set Up
command: ./scripts/run_in_ci.sh Rscript analyses/snv-callers/scripts/00-set_up.R
name: SNV Caller Analysis
command: ./scripts/run_in_ci.sh bash analyses/snv-callers/run_caller_evals.sh

#### Add your analysis here ####

Expand Down
96 changes: 93 additions & 3 deletions analyses/snv-callers/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@ This analysis evaluates [MAF files](https://docs.gdc.cancer.gov/Data/File_Format
The GDC has [good documentation on the fields](https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) contained in a standard MAF file.

**Table of Contents**
* [How to run this pipeline](#how-to-run-this-pipeline)
* [General usage of scripts](#general-usage-of-scripts)
* [Individual caller evaluation](#individual-caller-evaluation)
* [Base change analysis](#base-change-analysis)
* [Variant allele fraction calculation](#variant-allele-fraction-calculation)
Expand All @@ -14,7 +16,95 @@ The GDC has [good documentation on the fields](https://docs.gdc.cancer.gov/Data/
* [Mutation IDs](#mutation-ids)
* [Overall file structure](#overall-file-structure)
* [Summary of functions](#summary-of-custom-functions)
## Individual Caller Evaluation

## How to run this pipeline

**Run evaluations of each MAF file**

To run the initial evaluations of all the SNV callers, call the bash script:
```
bash run_caller_evals.sh
```
This script will return results for each caller in the `plots` and `results` folder.
To see an overall summary report, look in the `results` folder for that caller.
(See [Overall File Structure](#overall-file-structure) for more details on
everything that is returned.)

**Run comparison analysis of the callers**

*Coming soon*

## General usage of scripts

**Overall notes about these scripts:**
- The scripts are sequential as noted by their number.
- All file path-related options assume the file path given is relative to `OpenPBTA-analysis`.
- By default, the scripts will not overwrite existing files of the same name. However,
this can be overridden with `--overwrite` option.

### 00-set_up.R

00-set_up.R creates the [annotation RDS file](#genomic-regional-analyses) and [COSMIC mutations file](#cosmic-mutation-overlap) that are used by
the subsequent scripts.
This set up script only needs to be run once and its three options are all relating to where the reference files should be stored.

**Option descriptions**
```
--annot_rds : File path to where you would like the annotation_rds file to be
stored
--cosmic_og : Path to original COSMIC file. Can be .gz compressed. Will need to
download this from COSMIC at https://cancer.sanger.ac.uk/cosmic/download
These data are available if you register.
--cosmic_clean : File path specifying where you would like the cleaned brain-related
COSMIC mutations file to be stored. This file is provided to you in
GitHub. Only coordinates are needed.
```

### 01-calculate_vaf_tmb.R

This script sets up the given MAF file and outputs three files ([VAF](#variant-allele-fraction-calculation),
[TMB](#tumor-mutation-burden-calculation), and [Regional analysis](#genomic-regional-analyses))
that are used to make an overall evaluation report in `02-run_eval.R`.

**Option descriptions**
```
-label : Label to be used for folder and all output. eg. 'strelka2'. Default is 'maf'.
-output : File path that specifies the folder where the output should go.
New folder will be created if it doesn't exist.
--maf : Relative file path to MAF file to be analyzed. Can be .gz compressed.
--metadata : Relative file path to original metadata file.
--annot_rds : Relative file path to annotation object RDS file to be analyzed.
--bed_wgs : File path that specifies the caller-specific BED regions file.
--bed_wxs : File path that specifies the WXS BED regions file.
--overwrite : If specified, will overwrite any files of the same name. Default is FALSE.
```

### 02-run_eval.R

This script takes the files output by `01-calculate_vaf_tmb.R` and makes five
plots ([base_change](#base-change-analysis), [depth_vs_vaf](#variant-allele-fraction-calculation), [cosmic_plot](#cosmic-mutation-overlap), [snv_region](#genomic-regional-analyses), and [tmb_plot](#tumor-mutation-burden-calculation)) that are put into an overall report.

**Option descriptions**
```
# --label : Label to be used for folder and all output. eg. 'strelka2'. Optional.
# Default is 'maf'
# --plot_type : Specify what kind of plots you want printed out. Must be
# compatible with ggsave. eg pdf. Default is png
# --vaf : Folder from 01-calculate_vaf_tmb.R following files:
# <caller_name>_vaf.tsv
# <caller_name>_region.tsv
# <caller_name>_tmb.tsv
# --output : Where you would like the output from this script to be stored.
# --strategy : Specify whether you would like WXS and WGS separated for the plots.
# Analysis is still done on all data in the MAF file regardless.
# Acceptable options are 'wgs', 'wxs' or 'both', both for if you
# don't want to separate them. Default is both.
# --cosmic : Relative file path to COSMIC file to be analyzed.
# --overwrite : If TRUE, will overwrite any reports of the same name. Default is
# FALSE
```

# Individual Caller Evaluation

The first step in this analysis is an individual evaluation of each MAF from each caller.
This analysis prints out a report that is further subdivided by the WGS and WXS samples.
Expand Down Expand Up @@ -45,7 +135,7 @@ This is following the [code used in
`maftools`](https://github.com/PoisonAlien/maftools/blob/1d0270e35c2e0f49309eba08b62343ac0db10560/R/plot_vaf.R#L39).
The VAF calculations and other special variables are added to the MAF fields and written to a TSV ending in `_vaf.tsv` in the caller's results folder.

*Output for this analysiss*
*Output for this analysis*
* `results/<caller_name>/<caller_name>_vaf.tsv`
* `plots/<caller_name>/<caller_name>_<strategy>_depth_vs_vaf.png`

Expand Down Expand Up @@ -87,7 +177,7 @@ The sample-wise TMB calculations written to a TSV ending in `_tmb.tsv` in the ca
The COSMIC mutation data were obtained from https://cancer.sanger.ac.uk/cosmic/download
*To run this analysis, you need to obtain these data.*
The full, unfiltered somatic mutations file `CosmicMutantExport.tsv` for grch38 is used here and the genomic coordinates is arranged to be in BED format.
The COSMIC set is unfiltered down to only mutations detected in brain-related
The COSMIC set is filtered down to only mutations detected in brain-related
samples using the `Site subtype 1` field.
COSMIC mutations are overlapped with the present data's mutations using `GenomicRanges`.
The outcome of this overlap is added to the VAF data.frame with two `TRUE/FALSE` columns:
Expand Down
41 changes: 41 additions & 0 deletions analyses/snv-callers/run_caller_evals.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
#!/bin/bash
# C. Savonen
# CCDL for ALSF 2019

# Purpose:Run an intial evaluation of each variant caller's MAF file

# Change directory
cd kitematic

# The files named in these arrays will be ran in the analysis.
datasets=("strelka2" "mutect2" "lancet" "vardict")
wgs_files=("WGS.hg38.strelka2.unpadded.bed" "WGS.hg38.mutect2.unpadded.bed" "WGS.hg38.lancet.unpadded.bed" "WGS.hg38.vardict.100bp_padded.bed")

# Reference file paths
cosmic=analyses/snv-callers/brain_cosmic_variants_coordinates.tsv
annot_rds=scratch/hg38_genomic_region_annotation.rds

############################ Set Up Reference Files ############################
# The original COSMIC file is obtained from: https://cancer.sanger.ac.uk/cosmic/download
# These data are available if you register. The full, unfiltered somatic mutations
# file CosmicMutantExport.tsv.gz for grch38 is used here.
Rscript analyses/snv-callers/scripts/00-set_up.R \
--annot_rds $annot_rds \
--cosmic_og scratch/CosmicMutantExport.tsv.gz \
--cosmic_clean $cosmic

########################## Calculate and Set Up Data ##########################
# Create files that contain calculated VAF, TMB, and regional analyses.
for ((i=0;i<${#datasets[@]};i++));
do
echo "Processing dataset: ${datasets[$i]}"
Rscript analyses/snv-callers/scripts/01-calculate_vaf_tmb.R \
--label ${datasets[$i]} \
--output analyses/snv-callers/results/${datasets[$i]} \
--maf data/pbta-snv-${datasets[$i]}.vep.maf.gz \
--metadata data/pbta-histologies.tsv \
--bed_wgs data/${wgs_files[$i]} \
--bed_wxs data/WXS.hg38.100bp_padded.bed \
--annot_rds $annot_rds \
--overwrite
done
170 changes: 89 additions & 81 deletions analyses/snv-callers/scripts/00-set_up.R
Original file line number Diff line number Diff line change
@@ -1,22 +1,30 @@
# SNV Caller set up and functions
#
# SNV-caller set up
# 2019
# C. Savonen for ALSF - CCDL
#
# Purpose: Set up for analyzing MAF files
# Purpose: Set up for reference files that are used by 01-calculate_vaf_tmb.R
# script.
#
# Option descriptions
# --annot_rds : File path to where you would like the annotation_rds file to be
# stored
# --cosmic_og : Path to original COSMIC file. Can be .gz compressed. Give path relative
# to top directory, 'OpenPBTA-analysis'. Will need to download this from
# COSMIC at https://cancer.sanger.ac.uk/cosmic/download.
# These data are available if you register.
# --cosmic_clean : File path specifying where you would like the cleaned brain-related
# COSMIC mutations file to be stored.
#
#################################### Set Up ####################################
# Establish base dir
root_dir <- rprojroot::find_root(rprojroot::has_dir(".git"))

# Import special functions
source(file.path(root_dir, "analyses", "snv-callers", "util", "wrangle_functions.R"))
source(file.path(root_dir, "analyses", "snv-callers", "util", "plot_functions.R"))

#################################### Set Up ####################################

# Magrittr pipe
`%>%` <- dplyr::`%>%`

# Load the optparse library
library(optparse)

# Will need optparse for collecting options
if (!("optparse" %in% installed.packages())) {
install.packages("optparse", repos = "http://cran.us.r-project.org")
Expand All @@ -34,50 +42,46 @@ if (!("data.table" %in% installed.packages())) {
install.packages("data.table", repos = "http://cran.us.r-project.org")
}

##################### Set base directories common file paths ###################
# Declare base directory names
scratch_dir <- file.path(root_dir, "scratch")
data_dir <- file.path(root_dir, "data")
snv_dir <- file.path(root_dir, "analyses", "snv-callers")

# Directories that we will need
base_results_dir <- file.path(snv_dir, "results")
base_plots_dir <- file.path(snv_dir, "plots")
cosmic_dir <- file.path(snv_dir, "cosmic")

# Create these folders if they haven't been created yet
if (!dir.exists(base_results_dir)) {
dir.create(base_results_dir)
}
if (!dir.exists(base_plots_dir)) {
dir.create(base_plots_dir)
}
if (!dir.exists(cosmic_dir)) {
dir.create(cosmic_dir)
}

# Declare reference file names
original_metadata <- file.path(data_dir, "pbta-histologies.tsv")

# These data were obtained from https://cancer.sanger.ac.uk/cosmic/download
# These data are available if you register.
# The full, unfiltered somatic mutations file CosmicMutantExport.tsv.gz for grch38
# is used here.
cosmic_file <- file.path(snv_dir, "CosmicMutantExport.tsv.gz")

# This is the cleaned version that only contains the genomic coordinates and
# base changes from the original file for brain sample mutations only.
cosmic_clean_file <- file.path(cosmic_dir, "brain_cosmic_variants_coordinates.tsv")
################################ Set up options ################################
# Set up optparse options
option_list <- list(
make_option(
opt_str = "--cosmic_og", type = "character", default = "none",
cansavvy marked this conversation as resolved.
Show resolved Hide resolved
help = "Relative file path (assuming from top directory of
'OpenPBTA-analysis') to where the COSMIC mutation file is stored.
Will need to download this from COSMIC at
https://cancer.sanger.ac.uk/cosmic/download
These data are available if you register.",
metavar = "character"
),
make_option(
opt_str = "--cosmic_clean", type = "character", default = "none",
help = "Relative file path (assuming from top directory of
'OpenPBTA-analysis') where you would like the cleaned COSMIC file
to be stored.",
metavar = "character"
),
make_option(
opt_str = "--annot_rds", type = "character",
default = "none", help = "Relative file path (assuming from top
directory of 'OpenPBTA-analysis') that specifies where you would
like the annotation_rds file to be stored.",
metavar = "character"
)
)

# Will set up this annotation file below
annot_rds <- file.path(scratch_dir, "hg38_genomic_region_annotation.rds")
# Parse options
opt <- parse_args(OptionParser(option_list = option_list))

# This will be used for all WXS samples, but WGS bed regions are caller-specific
wxs_bed_file <- file.path(data_dir, "WXS.hg38.100bp_padded.bed")
##################### Set base directories common file paths ###################
# Make all base names relative to root_dir
opt$cosmic_og <- file.path(root_dir, opt$cosmic_og)
opt$cosmic_clean <- file.path(root_dir, opt$cosmic_clean)
opt$annot_rds <- file.path(root_dir, opt$annot_rds)

################################ Build Annotation ##############################
# We will only run this if it hasn't been run before
if (!file.exists(annot_rds)) {
if (opt$annot_rds != "none") {
# Print progress message
message("Setting up genomic region annotation file. Only need to do this once.")

Expand All @@ -98,7 +102,7 @@ if (!file.exists(annot_rds)) {

# Write this object so we don't have to write it again
readr::write_rds(annotations,
annot_rds,
opt$annot_rds,
compress = "gz"
)
}
Expand All @@ -108,37 +112,41 @@ if (!file.exists(annot_rds)) {
# if you register. The full, unfiltered somatic mutations file
# CosmicMutantExport.tsv for grch38 is used here but only mutations from brain
# related samples are kept.
if (!file.exists(cosmic_clean_file)) {
# Print progress message
message("Setting up COSMIC mutation file. Only need to do this once.")

# Read in original file
cosmic_variants <- data.table::fread(cosmic_file,
data.table = FALSE
) %>%
# Keep only brain mutations so the file is smaller
dplyr::filter(`Site subtype 1` == "brain")
# Get rid of spaces in column names
cosmic_variants <- cosmic_variants %>%
dplyr::rename_all(dplyr::funs(stringr::str_replace_all(., " ", "_"))) %>%
# Separate the genome coordinates into their own BED like variables
dplyr::mutate(
Chromosome = paste0(
"chr",
stringr::word(Mutation_genome_position, sep = ":", 1)
),
Start_Position = stringr::word(Mutation_genome_position, sep = ":|-", 1),
End_Position = stringr::word(Mutation_genome_position, sep = "-", 2),
# Make a base_change variable so we can compare to our set up for PBTA data
base_change = substr(Mutation_CDS, nchar(Mutation_CDS) - 2, nchar(Mutation_CDS))
) %>%
# Carry over the strand info, but rename to match our PBTA set up
dplyr::rename(Strand = Mutation_strand) %>%
# Narrow down to just the needed columns
dplyr::select(
"Chromosome", "Start_Position", "End_Position", "Strand",
"base_change"
if (opt$cosmic_clean != "none") {
if (file.exists(opt$cosmic_og)) {
# Print progress message
message("Setting up COSMIC mutation file. Only need to do this once.")

# Read in original file
cosmic_variants <- data.table::fread(opt$cosmic_og,
data.table = FALSE
) %>%
# Write to a TSV file
readr::write_tsv(cosmic_clean_file)
# Keep only brain mutations so the file is smaller
dplyr::filter(`Site subtype 1` == "brain")
# Get rid of spaces in column names
cosmic_variants <- cosmic_variants %>%
dplyr::rename_all(dplyr::funs(stringr::str_replace_all(., " ", "_"))) %>%
# Separate the genome coordinates into their own BED like variables
dplyr::mutate(
Chromosome = paste0(
"chr",
stringr::word(Mutation_genome_position, sep = ":", 1)
),
Start_Position = stringr::word(Mutation_genome_position, sep = ":|-", 1),
End_Position = stringr::word(Mutation_genome_position, sep = "-", 2),
# Make a base_change variable so we can compare to our set up for PBTA data
base_change = substr(Mutation_CDS, nchar(Mutation_CDS) - 2, nchar(Mutation_CDS))
) %>%
# Carry over the strand info, but rename to match our PBTA set up
dplyr::rename(Strand = Mutation_strand) %>%
# Narrow down to just the needed columns
dplyr::select(
"Chromosome", "Start_Position", "End_Position", "Strand",
"base_change"
) %>%
# Write to a TSV file
readr::write_tsv(opt$cosmic_clean)
} else {
stop("Original COSMIC Mutation file not found. Specify with --cosmic_og")
}
}
Loading