BoostDM is a method to score single base substitutions in cancer genes for their potential to drive tumorigenesis, which has been described in this study:
In silico saturation mutagenesis of cancer genes
Ferran Muiños, Francisco Martinez-Jimenez, Oriol Pich, Abel Gonzalez-Perez, Nuria Lopez-Bigas
URL: https://www.nature.com/articles/s41586-021-03771-1
The method relies heavily on the Intogen pipeline, which identifies cancer driver genes and infers the mutational features that signal positive selection. The Intogen pipeline has been described in this study:
A compendium of mutational cancer driver genes
Francisco Martínez-Jiménez, Ferran Muiños, Inés Sentís, Jordi Deu-Pons, Iker Reyes-Salazar, Claudia Arnedo-Pac, Loris Mularoni, Oriol Pich, Jose Bonet, Hanna Kranas, Abel Gonzalez-Perez, Nuria Lopez-Bigas
URL: https://www.nature.com/articles/s41568-020-0290-x
https://github.com/bbglab/boostdm-pipeline/releases/tag/2024.07.15-cancer
There are several public resources related to the boostDM framework:
boostDM website
Intended for exploration of the predictions and explanations resulting from the boostDM pipeline for a collection of models meeting minimum quality criteria. The website is searchable by cancer gene, tumor type, and mutation coordinates.
URL: https://www.intogen.org/boostdm
Intogen
Intended for exploration of the landscape of mutations and signals of positive selection in driver genes upon analysis of 33,000+ tumor samples (release v2024.06.21). Intogen is instrumental for boostDM as it provides the processed data used to train boostDM models.
URL: https://www.intogen.org
Cancer Genome Interpreter
Computational framework to interpret cancer genome variants, intended to guide clinicians towards optimal treatment decisions, in particular by resolving the implications of variants of unknown significance.
URL: https://www.cancergenomeinterpreter.org
- GitHub repo containing a collection of scripts and notebooks to generate analyses and figures of the main paper: https://github.com/bbglab/boostdm-analyses
- Zenodo repository providing data items generated and used in the main paper and figures: https://zenodo.org/record/4813082
This repo contains the source code to reproduce the training, prediction and post-hoc analysis steps of the boostDM pipeline, starting from the output of the Intogen pipeline.
It is strongly recommended to run this pipeline in an HPC environment.
Two Singularity containers are needed:
- boostdm.simg
- ensembl-vep_111.0.sif
which must be specified in the nextflow.config file as in the following example:
singularity {
    enabled = true
    cacheDir = "./singularity_images/"
    runOptions = "-B " + env.PIPELINE + "/containers_build:/boostdm"
}
process {
    cpus = 1
    executor = 'slurm'
    queue = 'normal,bigrun'
    errorStrategy = 'ignore'
    withLabel: boostdm {container = "file:///${singularity.cacheDir}/boostdm.simg"}
    withLabel: vep {container = "file:///${singularity.cacheDir}/ensembl-vep_111.0.sif"}
}
boostdm.simg is built from a recipe provided in the boostDM repo https://github.com/bbglab/boostdm-pipeline/blob/master/containers_build/boostdm/Singularity, using the following command line (the output name matches the file name referenced in nextflow.config):
singularity build boostdm.simg Singularity
ensembl-vep_111.0.sif can be pulled from https://hub.docker.com/r/ensemblorg, using the following command line (the output name matches the file name referenced in nextflow.config):
singularity pull --name ensembl-vep_111.0.sif docker://ensemblorg/ensembl-vep:release_111.0
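The nextflow.config example above looks for the images as boostdm.simg and ensembl-vep_111.0.sif inside cacheDir, so the files produced by the build and pull commands need to end up there under those names. A minimal sketch of a pre-flight check follows; the function name check_images is hypothetical and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Report which container images expected by nextflow.config are missing.
# check_images is a hypothetical helper, not part of the boostDM repo.
# Usage: check_images <cacheDir>
check_images() {
    local cache_dir="$1"
    local missing=0
    local image
    # File names taken from the withLabel entries of the config example.
    for image in boostdm.simg ensembl-vep_111.0.sif; do
        if [ ! -f "${cache_dir}/${image}" ]; then
            echo "missing: ${cache_dir}/${image}"
            missing=1
        fi
    done
    # Exit status 0 when both images are present, 1 otherwise.
    return "$missing"
}
```

For example, `check_images ./singularity_images` prints one line per missing image and returns a non-zero status if any is absent.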
The pipeline runs with Nextflow and has been tested with Nextflow version 20.07.1, which can be installed with conda using the following command line:
conda install -c bioconda nextflow=20.07.1
The current release requires the output of Intogen release v2024.06.21. There are two main folders, referred to as INTOGEN_DATASETS and BOOSTDM_DATASETS, which are generated by the Intogen pipeline. Check out the Intogen documentation: https://intogen-plus.readthedocs.io/en/latest/index.html.
To run the pipeline it is necessary to specify the paths of the data dependencies in the config file nextflow.config:
env {
    GENOME_BUILD = "hg38"
    INTOGEN_DATASETS = "./intogen_datasets/"
    BOOSTDM_DATASETS = "./boostdm_datasets/"
    VEP_SATURATION = env.INTOGEN_DATASETS + "/steps/boostDM/saturation/"
    PIPELINE = "./boostdm-pipeline-2024/"
    OUTPUT = "./boostdm-output/"
    MAVE_DATA = "./mave_data/"
}
The only dependency that is not provided by Intogen is the MAVE_DATA folder. Check out the full documentation for more details: https://github.com/bbglab/boostdm-pipeline/blob/master/boostDM-cancer-2024-full-docs.pdf
The pipeline is run with Nextflow DSL=1 and is divided into six steps, which are run separately with the following Nextflow scripts:
01_training.nf
02_discovery.nf
03_model-selection.nf
04_prediction.nf
05_output_plots.nf
06_benchmarks.nf
To run each Nextflow script, use the following command line:
nextflow run <nextflow_script>.nf -resume -profile <profile>
Check out the full documentation for more details about the steps of the pipeline: https://github.com/bbglab/boostdm-pipeline/blob/master/boostDM-cancer-2024-full-docs.pdf
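Since the six scripts are run separately and in order, a small wrapper can launch them one after another and stop at the first non-zero exit status. The sketch below is an illustration only: the function run_steps and the NEXTFLOW_CMD override are hypothetical names, and the profile must be one defined in your own nextflow.config. Note that with errorStrategy = 'ignore', as in the config example above, failed tasks may not produce a non-zero exit status, so a clean run of this wrapper does not by itself guarantee that every task succeeded.

```shell
#!/usr/bin/env bash
# Run the six pipeline scripts in order, stopping at the first failure.
# run_steps and NEXTFLOW_CMD are hypothetical names used for illustration.
# Usage: run_steps <profile>
# Setting NEXTFLOW_CMD="echo nextflow" prints the commands instead of
# running them (a dry run).
run_steps() {
    local profile="$1"
    local step
    for step in 01_training 02_discovery 03_model-selection \
                04_prediction 05_output_plots 06_benchmarks; do
        # Intentionally unquoted so that "echo nextflow" splits into two words.
        ${NEXTFLOW_CMD:-nextflow} run "${step}.nf" -resume -profile "$profile" || {
            echo "step ${step} failed; later steps were not started" >&2
            return 1
        }
    done
}
```

A dry run such as `NEXTFLOW_CMD="echo nextflow" run_steps <profile>` lists the six commands without launching anything, which is a cheap way to check the order before submitting to the cluster.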
The pipeline output is delivered in the following folder tree:
├── benchmarks
│   ├── cv_tables
│   ├── cv_tables_annotated
│   ├── cv_tables_annotated_chasmplus
│   ├── pr_plots
│   ├── saturation_dbNSFP
│   ├── saturation_mave
│   ├── vep_input
│   └── vep_output_dbNSFP
├── create_datasets
│   └── <intogen cohort>.regression_data.tsv
├── discovery
│   └── discovery.tsv.gz
├── evaluation
│   └── <tumor types>
│       └── <gene>.eval.pickle.gz
├── features_group
├── model_selection
├── output_plots
│   ├── blueprints
│   ├── clustered_blueprints
│   └── discovery_bending
├── saturation
│   ├── annotation
│   │   └── <gene>.<tumor type>.annotated.tsv.gz
│   └── prediction
│       └── <gene>.model.<ttype model>.features.<ttype features>.prediction.tsv.gz
├── splitcv
│   └── <intogen cohort>.cvdata.pickle.gz
├── splitcv_meta
│   └── <tumor types>
│       └── <gene>.cvdata.pickle.gz
└── training_meta
    └── <tumor types>
        └── <gene>.models.pickle.gz
Check out the full documentation for a description of the main output formats: https://github.com/bbglab/boostdm-pipeline/blob/master/boostDM-cancer-2024-full-docs.pdf
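The prediction files under saturation/prediction encode three fields in their names: the gene, the tumor type of the model, and the tumor type of the features. A minimal sketch of how these could be recovered from a file name is shown below. The helper parse_prediction_name is hypothetical, the gene and tumor type values in the usage example are illustrative only, and the parsing assumes that none of the three fields contains the literal strings ".model." or ".features.".

```shell
#!/usr/bin/env bash
# Split a prediction file name of the form
#   <gene>.model.<ttype model>.features.<ttype features>.prediction.tsv.gz
# into its three fields. parse_prediction_name is a hypothetical helper.
# Usage: parse_prediction_name <path>
# Prints: <gene> <ttype model> <ttype features>
parse_prediction_name() {
    local name="${1##*/}"                 # drop any leading directories
    name="${name%.prediction.tsv.gz}"     # drop the fixed suffix
    local gene="${name%%.model.*}"        # text before ".model."
    local rest="${name#*.model.}"         # text after ".model."
    local ttype_model="${rest%%.features.*}"
    local ttype_features="${rest#*.features.}"
    echo "$gene $ttype_model $ttype_features"
}
```

For instance, `parse_prediction_name saturation/prediction/TP53.model.BRCA.features.BRCA.prediction.tsv.gz` prints `TP53 BRCA BRCA`.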