Learning pipeline to identify somatic SNVs under positive selection.

bbglab/boostdm-pipeline

boostDM pipeline

Aim

BoostDM is a method to score single base substitutions in cancer genes for their potential to drive tumorigenesis, which has been described in this study:

In silico saturation mutagenesis of cancer genes
Ferran Muiños, Francisco Martinez-Jimenez, Oriol Pich, Abel Gonzalez-Perez, Nuria Lopez-Bigas
URL: https://www.nature.com/articles/s41586-021-03771-1

The method relies heavily on the Intogen pipeline, which identifies cancer driver genes and infers the mutational features that signal positive selection. The Intogen pipeline has been described in this study:

A compendium of mutational cancer driver genes
Francisco Martínez-Jiménez, Ferran Muiños, Inés Sentís, Jordi Deu-Pons, Iker Reyes-Salazar, Claudia Arnedo-Pac, Loris Mularoni, Oriol Pich, Jose Bonet, Hanna Kranas, Abel Gonzalez-Perez, Nuria Lopez-Bigas
URL: https://www.nature.com/articles/s41568-020-0290-x

Current version

https://github.com/bbglab/boostdm-pipeline/releases/tag/2024.07.15-cancer

Resources

There are several public resources related to the boostDM framework:

boostDM website

Intended for exploration of the predictions and explanations resulting from the boostDM pipeline for a collection of models meeting minimum quality criteria. The website is searchable by cancer gene, tumor type, and mutation coordinates.

URL: https://www.intogen.org/boostdm

Intogen website

Intended for exploration of the landscape of mutations and signals of positive selection in driver genes upon analysis of 33,000+ tumor samples (release v2024.06.21). Intogen is instrumental for boostDM as it provides processed data that is used for the training of boostDM models.

URL: https://www.intogen.org

Cancer Genome Interpreter

Computational framework to interpret cancer genome variants intended to guide clinicians towards optimal decision making regarding the treatment of cancer, in particular resolving the implication of variants of unknown significance.

URL: https://www.cancergenomeinterpreter.org

Other resources

Content

This repo contains the source code to reproduce the training, prediction and post-hoc analysis steps of the boostDM pipeline, starting from the output of the Intogen pipeline.

Prerequisites

HPC environment

It is strongly recommended to run this pipeline in an HPC environment.

Singularity

Two Singularity containers are needed:

  • boostdm.simg
  • ensembl-vep_111.0.sif

Both must be specified in the nextflow.config file, as in the following example:

singularity {
    enabled = true
    cacheDir = "./singularity_images/"
    runOptions = "-B " + env.PIPELINE + "/containers_build:/boostdm"
}

process {
    cpus = 1
    executor = 'slurm'
    queue = 'normal,bigrun'
    errorStrategy = 'ignore'
    withLabel: boostdm {container = "file:///${singularity.cacheDir}/boostdm.simg"}
    withLabel: vep {container = "file:///${singularity.cacheDir}/ensembl-vep_111.0.sif"}
}

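boostdm.simg is built from a recipe provided in the boostDM repo, https://github.com/bbglab/boostdm-pipeline/blob/master/containers_build/boostdm/Singularity. Run the following command line from the folder that contains the recipe, keeping the output file name consistent with the one referenced in nextflow.config:

singularity build boostdm.simg Singularity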

ensembl-vep_111.0.sif can be pulled from https://hub.docker.com/r/ensemblorg using the following command line, naming the image as it is referenced in nextflow.config:

singularity pull --name ensembl-vep_111.0.sif docker://ensemblorg/ensembl-vep:release_111.0

Nextflow

The pipeline runs with Nextflow and has been tested with Nextflow version 20.07.1, which can be installed with conda using the following command line:

conda install -c bioconda nextflow=20.07.1

Running the pipeline

Input

The current release requires the output of Intogen release v2024.06.21. There are two main folders, referred to as INTOGEN_DATASETS and BOOSTDM_DATASETS, which are generated by the Intogen pipeline. Check out the Intogen documentation: https://intogen-plus.readthedocs.io/en/latest/index.html.

Config

To run the pipeline it is necessary to specify the paths of the data dependencies in the config file nextflow.config:

env {
    GENOME_BUILD = "hg38"
    INTOGEN_DATASETS = "./intogen_datasets/"
    BOOSTDM_DATASETS = "./boostdm_datasets/"
    VEP_SATURATION = env.INTOGEN_DATASETS + "/steps/boostDM/saturation/"
    PIPELINE = "./boostdm-pipeline-2024/"
    OUTPUT = "./boostdm-output/"
    MAVE_DATA = "./mave_data/"
}

The only dependency that is not provided by Intogen is the MAVE_DATA folder. Check out the full documentation for more details: https://github.com/bbglab/boostdm-pipeline/blob/master/boostDM-cancer-2024-full-docs.pdf
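
Before launching Nextflow, it can help to confirm that the folders named in the env block actually exist. The following is a minimal sketch, not part of the pipeline: the helper name check_dirs is hypothetical, and the paths simply mirror the example values above, so adjust them to match your own nextflow.config.

```shell
#!/usr/bin/env bash
# Sketch: report which data-dependency folders are present or missing.
# check_dirs is a hypothetical helper, not something shipped with boostDM.

check_dirs() {
    # Prints one status line per folder; returns the number of missing folders.
    local missing=0
    local dir
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            echo "ok       $dir"
        else
            echo "missing  $dir"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Example values copied from the env block; replace with your own paths.
INTOGEN_DATASETS="./intogen_datasets/"
BOOSTDM_DATASETS="./boostdm_datasets/"
MAVE_DATA="./mave_data/"
VEP_SATURATION="${INTOGEN_DATASETS}/steps/boostDM/saturation/"

check_dirs "$INTOGEN_DATASETS" "$BOOSTDM_DATASETS" "$VEP_SATURATION" "$MAVE_DATA" \
    || echo "Some data dependencies are missing; check the paths in nextflow.config."
```

A non-zero return value from check_dirs means at least one folder was not found, so the script can also be used as a guard in front of a nextflow run command.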

Nextflow run

The pipeline is run with Nextflow DSL=1 and is divided into six steps that are run separately with the following Nextflow scripts:

01_training.nf
02_discovery.nf
03_model-selection.nf
04_prediction.nf
05_output_plots.nf
06_benchmarks.nf

To run each Nextflow script, use the following command line:

nextflow run <nextflow_script>.nf -resume -profile <profile>

Check out the full documentation for more details about the steps of the pipeline: https://github.com/bbglab/boostdm-pipeline/blob/master/boostDM-cancer-2024-full-docs.pdf
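
Since the six scripts are launched one at a time, a small wrapper can save some typing. The sketch below assumes that the numeric prefixes reflect the intended execution order. The names run_steps, DRY_RUN and PROFILE are introduced here for illustration only. By default the script only prints the commands; set DRY_RUN=0 to actually invoke Nextflow. The fallback profile name, standard, is an assumption; replace it with a profile defined in your nextflow.config.

```shell
#!/usr/bin/env bash
# Sketch: run the six boostDM Nextflow scripts in order.
# run_steps, DRY_RUN and PROFILE are hypothetical names used by this wrapper only.

STEPS="01_training.nf 02_discovery.nf 03_model-selection.nf 04_prediction.nf 05_output_plots.nf 06_benchmarks.nf"

run_steps() {
    # $1: name of the Nextflow profile to use.
    local profile="$1"
    local step
    for step in $STEPS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (default): only show what would be executed.
            echo "nextflow run ${step} -resume -profile ${profile}"
        else
            # Stop at the first nextflow invocation that exits with an error.
            nextflow run "${step}" -resume -profile "${profile}" || return 1
        fi
    done
}

run_steps "${PROFILE:-standard}"
```

For example, DRY_RUN=0 PROFILE=<profile> bash run_all.sh would execute all six steps, where run_all.sh is whatever file name the sketch is saved under.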

Output

The pipeline output is delivered in the following folder tree:

├── benchmarks
│   ├── cv_tables
│   ├── cv_tables_annotated
│   ├── cv_tables_annotated_chasmplus
│   ├── pr_plots
│   ├── saturation_dbNSFP
│   ├── saturation_mave
│   ├── vep_input
│   └── vep_output_dbNSFP
├── create_datasets
│   └── <intogen cohort>.regression_data.tsv
├── discovery
│   └── discovery.tsv.gz
├── evaluation
│   └── <tumor types>
│       └── <gene>.eval.pickle.gz
├── features_group
├── model_selection
├── output_plots
│   ├── blueprints
│   ├── clustered_blueprints
│   └── discovery_bending
├── saturation
│   ├── annotation
│   │   └── <gene>.<tumor type>.annotated.tsv.gz
│   └── prediction
│       └── <gene>.model.<ttype model>.features.<ttype features>.prediction.tsv.gz
├── splitcv
│   └── <intogen cohort>.cvdata.pickle.gz
├── splitcv_meta
│   └── <tumor types>
│       └── <gene>.cvdata.pickle.gz
└── training_meta
    └── <tumor types>
        └── <gene>.models.pickle.gz

Check out the full documentation for a description of the main output formats: https://github.com/bbglab/boostdm-pipeline/blob/master/boostDM-cancer-2024-full-docs.pdf
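
As a quick sanity check after a run, the output tree can be inspected from the shell. The sketch below counts the saturation prediction tables and prints the first line of the discovery table. The function names count_predictions and show_header are hypothetical, the default OUTPUT path mirrors the example config, and the sketch assumes that discovery.tsv.gz starts with a header row, which this README does not confirm.

```shell
#!/usr/bin/env bash
# Sketch: summarise a boostDM output folder.
# count_predictions and show_header are hypothetical helpers.

OUTPUT="${OUTPUT:-./boostdm-output}"

count_predictions() {
    # Count the prediction tables under <output>/saturation/prediction.
    find "$1/saturation/prediction" -name '*.prediction.tsv.gz' 2>/dev/null | wc -l | tr -d ' '
}

show_header() {
    # Print the first line of a gzipped TSV (assumed to hold the column names).
    gzip -cd "$1" | head -n 1
}

echo "prediction tables: $(count_predictions "$OUTPUT")"
if [ -f "$OUTPUT/discovery/discovery.tsv.gz" ]; then
    show_header "$OUTPUT/discovery/discovery.tsv.gz"
fi
```

A count of zero usually means either that the prediction step has not been run yet or that OUTPUT points to a different folder than the one set in nextflow.config.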