beacon
: A script to transform genomic variations data (VCF) to queryable data (MongoDB)
beacon <mode> [-arguments] [-options]
Mode:
vcf
-i|input Requires a VCF.gz file
(May require a parameters file)
mongodb
(May require a parameters file)
full (vcf + mongodb)
-i|input Requires a VCF.gz file
(May require a parameters file)
Options:
-h|help Brief help message
-man Full documentation
-v Version
-c Requires a configuration file
-p Requires a parameters file
-n Number of cpus/cores/threads
-debug Print debugging (from 1 to 5, being 5 max)
-verbose Verbosity on
-nc|-no-color Don't print colors to STDOUT
(For convenience, specifiers may have a leading - or --)
beacon
is a script to transform genomic variations data (VCF) to queryable data (MongoDB). See extended documentation here.
The VCF data are annotated and then serialized to a JSON text file (genomicVariationsVcf.json.gz
). The script also enables data load of the BFF to MongoDB.
beacon
is part of the beacon2-ri-tools package, which also contains other utilities to help with data ingestion.
beacon2-ri-tools repository is part of the ELIXIR-CRG Beacon v2 Reference Implementation (B2RI).
* Beacon v2 Reference Implementation *
___________
| |
XLSX | Metadata | (incl. Phenotypic data)
|__________|
|
|
| Validation (bff-validator)
|
_________ ____v____ __________ ______
| | | | | | | | <---- Request
| VCF | ----> | BFF | ----> | Database | <----> | API |
|_______| |_ _____| |__________| |_____| ----> Response
| MongoDB
beacon | beacon
|
|
Optional
|
______v_______
| |
| BFF |
| Genomic | Visualization
| Variations |
| Browser |
|____________|
------------------------------------------------|||------------------------
beacon2-ri-tools beacon2-ri-api
We provide two installation options, one containerized (recommended) and another non-containerized.
Download the Dockerfile
from Github by typing:
wget https://raw.githubusercontent.com/EGA-archive/beacon2-ri-tools/main/Dockerfile
Then execute the following commands:
docker build -t crg/beacon2_ri:latest . # build the container (~1.1G)
IMPORTANT: Docker containers are fully isolated. If you think you'll have to mount a volume to the container please read the section Mounting Volumes before proceeding further.
docker run -tid --name beacon2-ri-tools crg/beacon2_ri:latest # run the image detached
docker ps # list your containers, beacon2-ri-tools should be there
docker exec -ti beacon2-ri-tools bash # connect to the container interactively
After the docker exec
command, you will land at /usr/share/beacon-ri/
, then execute:
nohup beacon2-ri-tools/BEACON/bin/deploy_external_tools.sh &
...that will inject the external tools and DBs into the image and modify the configuration files. It will also run a test to check that the installation was succesful. Note that running deploy_external_tools.sh
will take some time (and disk space!!!). You can check the status by using:
tail -f nohup.out
The last step is to deploy MongoDB. We will deploy it outside the beacon2-ri-tools
container so please exit
from it.
Please download the docker-compose.yml
file:
wget https://raw.githubusercontent.com/EGA-archive/beacon2-ri-tools/main/docker-compose.yml
And then execute:
docker-compose up -d
It's simpler to mount a volume when starting a container than to add it to an existing one. If you need the mount a volume to the container please use the following syntax (-v host:container
). Find an example below (note that you need to change the paths to match yours):
docker run -tid --volume /media/mrueda/4TBT/workdir:/workdir --name beacon2-ri-tools crg/beacon2_ri:latest
Now you'll need to execute:
docker exec -ti beacon2-ri-tools bash # connect to the container interactively
After the docker exec
command, you will land at /usr/share/beacon-ri/
, then execute:
nohup beacon2-ri-tools/BEACON/bin/deploy_external_tools.sh & # see above why
Then, you can run commands inside the container, like this;
# We connect to the container interactively
docker exec -ti beacon2-ri-tools bash
# We go to the mounting point
cd /workdir
# We run the executable
/usr/share/beacon-ri/beacon2-ri-tools/beacon vcf -i example.vcf.gz -p param.in
Alternatively, you can run commands from the host, like this:
# First we create an alias to simplify invocation
alias beacon='docker exec -ti beacon2-ri-tools /usr/share/beacon-ri/beacon2-ri-tools/beacon'
# Now we use a text editor to edit the file <params.in> to include the parameter 'projectdir'
projectdir /workdir/my_fav_job_id
# Finally we use the alias to run the command
beacon vcf -i /workdir/my_vcf.gz -p /workdir/param.in
Download the latest version from Github:
tar -xvf beacon2-ri-tools-2.0.0.tar.gz # Note that naming may be different
Alternatively, you can use git clone to get the latest (stable) version
git clone https://github.com/EGA-archive/beacon2-ri-tools.git
beacon
is a Perl script (no compilation needed) that runs on Linux command-line. Internally, it submits multiple pipelines via customizable Bash scripts (see example here). Note that Perl and Bash are installed by default in Linux.
Perl 5 is installed by default on Linux, but we will need to install a few CPAN modules.
For simplicity, we're going to install the modules (they're harmless) with sudo privileges.
(For Debian and its derivatives, Ubuntu, Mint, etc.)
First we install cpmanminus
utility:
sudo apt-get install cpanminus
Second we use cpanm
to install the CPAN modules. Change directory into the beacon2-ri-tools
folder and run:
cpanm --sudo --installdeps .
Also, to read the documentation you'll need perldoc
that may or may not be installed in your Linux distribution:
sudo apt-get install perl-doc
If you prefer to have the dependencies in a "virtual environment" (i.e., install the CPAN modules in the directory of the application) we recommend using the module Carton
.
cpanm --sudo Carton
Then, we can install our dependencies:
carton install
beacon
also needs that bcftools, SnpEff and MongoDB are installed. See external software for more info.
Before running anything you need to set-up the configuration file:
The configuration file is a YAML text file with locations for executables and files needed for the job (e.g., SnpEff jar files, dbSNSFP database).
You have two options here:
* RECOMMENDED: You set the config file ONCE. This file will serve for all your jobs.
To set it up go to the installation directory and modify the file 'config.yaml' with your paths.
* You provide the config file with the argument -c when you run a
job. This is useful if you want to override the "main" file (see above).
Below are parameters that can be modified by the user along with their default values. Please remember to leave a blank space between the parameter and the value.
Configuration file (YAML)
---
# Reference assemblies (genomes)
hs37fasta: /path
hg19fasta: /path
hg38fasta: /path
# ClinVar
hg19clinvar: /path
hg38clinvar: /path
# Cosmic
hg19cosmic: /path
hg38cosmic: /path
# dbSNSFP Academic
hg19dbnsfp: /path
hg38dbnsfp: /path
# Miscellanea software
snpeff: /path
snpsift: /path
bcftools: /path
# Max RAM memory for snpeff (optional)
mem: 8G
# MongoDB
mongoimport: /path
mongostat: /path
mongosh: /path
mongodburi: string
# Temporary directory (optional)
tmpdir: /path
Please find below a detailed description of all parameters (alphabetical order):
-
bcftools
Location of the bcftools exe (e.g., /home/foo/bcftools_1.11/bcftools).
-
dbnsfpset
The set of fields to be taken from dbNSFP database.
Values: <all> or <ega>
-
genome
Your reference genome.
Accepted values: hg19, hg38 and hs37.
If you used GATKs GRCh37/b37 set it to hg19.
Not supported : NCBI36/hg18, NCBI35/hg17, NCBI34/hg16, hg15 and older.
-
hg19{clinvar,cosmic,dbnsfp,fasta}
Path for each of these files. COSMIC annotations are added but not used (v2.0.0).
-
hg38{clinvar,cosmic,dbnsfp,fasta}
Path for each of these files. COSMIC annotations are added but not used (v2.0.0).
-
hs37
Path for ithe reference genome hs37.
-
mem
RAM memory for the Java processes (e.g., 8G).
-
mongoXYZ
Parameters needed for MongoDB.
-
paneldir
A directory where you can store text files (consisting of a column with a lists of genes) to be displayed by the BFF Genomic Variations Browser.
-
snpeff
Location of the java archive dir (e.g., /home/foo/snpEff/snpEff.jar).
-
snpsift
Location of the java archive dir (e.g., /home/foo/snpEff/snpSift.jar).
-
tmpdir
Use if you a have a preferred tmpdir.
* Ideally a Debian-based distribution (Ubuntu or Mint), but any other (e.g., CentOs, OpenSuse) should do as well (untested).
* Perl 5 (>= 5.10 core; installed by default in most Linux distributions). Check the version with "perl -v"
* 4GB of RAM (ideally 16GB).
* >= 1 cores (ideally i7 or Xeon).
* At least 200GB HDD.
* bcftools, SnpEff and MongoDB
The Perl itself does not need a lot of RAM (max load will reach 400MB) but external tools do (e.g., process mongod
[MongoDB's daemon]).
I am not using any CPAN's module to perform unit tests. When I modify the code my "integration tests" are done by comparing to reference files. You can validate the installation using the files included in the test directory.
We recommend following this tutorial.
This script has three modes: vcf, mongodb
and full
* Mode vcf
Converting a VCF file into a BFF file for genomic variations.
* Mode mongodb
Loading BFF data into MongoDB.
* Mode full
Ingesting VCF and loading metadata and genomic variations (all in one step) into MongoDB.
To perform all these taks you'll need:
-
A gzipped VCF
Note that it does not need to be bgzipped.
-
(Optional) A parameters file
A parameters text file that will contain specific values needed for the job.
-
BFF files (only for modes: mongodb and full)
(see explanation of BFF format here)
-
(Optional) Specify the number of cores (only for VCF processing!)
The number of threads/cores you want to use for the job. In this regard (since SnpEff does not deal well with parallelization) we recommend using
-n 1
and running multiple simultaneous jobs with GNUparallel
or the included queue system). The software scales linearly {O}(n) with the number of variations present in the input file. The easiest way is to run one job per chromosome, but if you are in a hurry and have many cores you can split each chromosome into smaller vcfs.
beacon
will create an independent project directory projectdir
and store all needed information needed there. Thus, many concurrent calculations are supported.
Note that beacon
will treat your data as read-only (i.e., will not modify your original files).
Annex: Parameters file (YAML)
--
bff:
metadatadir: .
analyses: analyses.json
biosamples: biosamples.json
cohorts: cohorts.json
datasets: datasets.json
individuals: individuals.json
runs: runs.json
# Note that genomicVariationsVcf is not affected by <metadatadir>
genomicVariationsVcf: beacon_XXXX/vcf/genomicVariationsVcf.json.gz
datasetid: crg_beacon_test
genome: hs37
bff2html: true
projectdir: my_project
Please find below a detailed description of all parameters (alphabetical order):
-
bff
Location for the Beacon Friendly Format JSON files.
-
center
Experimental feature. Not used for now.
-
datasetid
An unique identifier for the dataset present in the input VCF. Default value is 'id_1'
-
ega
(For EGA internal use only)
egac: EGA DAC Accession ID.
egad: EGA Dataset Accession ID.
egas: EGA Study Accession ID.
-
genome
Your reference genome.
Accepted values: hg19, hg38 and hs37.
If you used GATKs GRCh37/b37 set it to hg19.
Not supported: NCBI36/hg18, NCBI35/hg17, NCBI34/hg16, hg15 and older.
-
organism
Experimental feature. Not used for now.
-
bff2html
Set bff2html to
true
to activate BFF Genomic Variations Browser. -
projectdir
The prefix for dir name (e.g., 'cancer_sample_001'). Note that it can also contain a path (e.g., /workdir/cancer_sample_001). The script will automatically add an unique identifier to each job.
-
technology
Experimental feature. Not used for now.
Optional: The user has the option of turning on the BFF Genomic Variatons Browser. With this option enabled, an HTML file will be created to be used with a web browser. The purpose of such HTML file is to provide a preliminary exploration of the genomic variations data. See the full documentation here.
Examples:
$ ./beacon vcf -i input.vcf.gz
$ ./beacon mongodb -p param_file # MongoDB load only
$ ./beacon full -n 1 --i input.vcf.gz -p param_file > log 2>&1
$ ./beacon full -n 1 --i input.vcf.gz -p param_file -c config_file > log 2>&1
$ nohup $path_to_beacon/beacon full -i input.vcf.gz -verbose
$ parallel "./beacon vcf -n 1 -i chr{}.vcf.gz > chr{}.log 2>&1" ::: {1..22} X Y
$ carton exec -- ./beacon vcf -i input.vcf.gz # If using Carton
NB: If you don't want colors in the output use the flag --no-color
. If you did not use the flag and want to get rid off the colors in your printed log file use this command to parse ANSI colors:
perl -pe 's/\x1b\[[0-9;]*[mG]//g'
Beacon Friendly Format is a set of 7 JSON files (JSON arrays consisting of multiple documents) that match the 7 schemas from Beacon v2 Models.
Six files correspond to Metadata (analyses.json,biosamples.json,cohorts.json,datasets.json,individuals.json,runs.json
) and one corresponds to variations (genomicVariations.json
).
Normally, beacon
script is used to create genomicVariations
JSON file. The other 6 files are created with this utility (part of the distribution). See instructions here.
Once we have all seven files, then we can proceed to load the data into MongoDB.
* Dockerfile:
* DNS errors
- Error: Temporary failure resolving 'foo'
Solution: https://askubuntu.com/questions/91543/apt-get-update-fails-to-fetch-files-temporary-failure-resolving-error
* Perl:
* Compilation errors:
- Error: Unknown PerlIO layer "gzip" at (eval 10) line XXX
Solution: cpanm --sudo PerlIO::gzip
... or ...
sudo apt-get install libperlio-gzip-perl
* Execution errors:
- Error with YAML::XS
Solution: Make sure the YAML (config.yaml or parameters file) is well formatted (e.g., space after param:' ').
* Bash:
(Possible errors that can happen when the embeded Bash scripts are executed)
* bcftools errors: bcftools is nit-picky about VCF fields and nomenclature of contigs/chromosomes in reference genome
=> Failed to execute: beacon_161855926405757/run_vcf2bff.sh
Please check this file beacon_161855926405757/run_vcf2bff.log
- Error:
# Running bcftools
[E::faidx_adjust_position] The sequence "22" was not found
Solution: Make sure you have set the correct genome (e.g., hg19, hg38 or hs37) in parameters_file.
In this case bcftools was expecting to find 22 in the <*.fa.gz> file from reference genome, but found 'chr22' instead.
Tips:
- hg{19,38} use 'chr' in chromosome naming (e.g., chr1)
- hs37 does not use 'chr' in chromosome naming (e.g., 1)
- Error
# Running bcftools
INFO field IDREP only contains 1 field, expecting 2
Solution: Please Fix VCF info field manually (or get rid of problematic fields with bcftools)
e.g., bcftools annotate -x INFO/IDREP input.vcf.gz | gzip > output.vcf.gz
bcftools annotate -x INFO/MLEAC,INFO/MLEAF,FMT/AD,FMT/PL input.vcf.gz | gzip > output.vcf.gz
NB: The bash scripts can be executed "manually" in the beacon_XXX dir. You must provide the
input vcf as an argument. This is a good option for debugging.
* Some Linux distributions do not include perldoc and thus Perl's library Pod::Usage will complain.
Please, install it (sudo apt-get install perl-doc) if needed.
The author requests that any published work that utilizes B2RI includes a cite to the the following reference:
Rueda, M, Ariosa R. "Beacon v2 Reference Implementation: a toolkit to enable federated sharing of genomic and phenotypic data". Bioinformatics, btac568, https://doi.org/10.1093/bioinformatics/btac568
Written by Manuel Rueda, PhD. Info about CRG can be found at https://www.crg.eu
Credits:
* Sabela De La Torre (SDLT) created a Bash script for Beacon v1 to parse vcf files L<https://github.com/ga4gh-beacon/beacon-elixir>.
* Toshiaki Katayamai re-implemented the Beacon v1 script in Ruby.
* Later Dietmar Fernandez-Orth (DFO) modified the Ruby for Beacon v2 L<https://github.com/ktym/vcftobeacon and added post-processing with R, from which I borrowed ideas to implement vcf2bff.pl.
* DFO for usability suggestions and for creating bcftools/snpEff commands.
* Roberto Ariosa for help with MongoDB implementation.
* Mauricio Moldes helped with the containerization.
For Beacon problems, questions, or suggestions, send an e-mail to manuel.rueda@crg.eu.
This PERL file is copyrighted, (C) 2021-2022 Manuel Rueda. See the LICENSE file included in this distribution.