The purpose of this repository is to collect data from various public sources that are eventually used in VarFish for annotation and display to the user. The repository contains a Snakemake workflow with supporting code for downloading the data and preparing it for use with VarFish.
Quick Facts
- License: MIT
- Programming Language: Python / Snakemake
Use the utility rule help to get a list of all available rules:
# snakemake --cores=1 help
Run all of them with the rule all:
# snakemake --cores=1 all
Note that this will take a long time, use a lot of disk space, and download a lot of data.
To run on a Slurm cluster, you can use the Snakemake --slurm option.
See run-slurm.sh for an example.
Install conda, ideally via miniforge. A quickstart:
# wget -O /tmp/Mambaforge-Linux-x86_64.sh \
https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
# bash /tmp/Mambaforge-Linux-x86_64.sh -b -p ~/mambaforge3 -s
# source ~/mambaforge3/bin/activate
# git clone git@github.com:bihealth/varfish-db-downloader.git
# cd varfish-db-downloader
This will set up the conda environment:
# mamba env create --file environment.yml
# conda activate varfish-db-downloader
This will install the varfish-db-downloader tools:
# pip install -e .
We use only wget and aria2c for downloads, not curl.
The rationale is that in test mode, these two executables are overridden with helper commands.
Besides the full output, we also build a subset of the data suitable for development. At the time of writing, the subset is limited to the BRCA1 gene. The rationale is that this gene and its variants are heavily annotated: breast cancer predisposition screening is a common task, so both users and data are plentiful.
The download of files can be disabled to enable a test mode.
Instead, the files in excerpt-data are used when CI=true is set in the environment.
This is done by overriding the download executables wget and aria2c in the Snakemake file when CI=true has been set.
The override itself works by changing the PATH environment variable.
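The PATH override described above can be illustrated with a minimal sketch. It is not the repository's actual Snakemake code: the function name `build_env` and the helper directory `test-helpers/bin` are hypothetical, and the sketch only assumes that the helper commands live in a directory that is placed in front of the regular PATH when CI=true.

```python
import os


def build_env(environ, helper_dir="test-helpers/bin"):
    """Return a copy of `environ` with `helper_dir` prepended to PATH if CI=true.

    `helper_dir` is a hypothetical directory holding replacement `wget` and
    `aria2c` commands; the real location in the repository may differ.
    """
    env = dict(environ)  # never mutate the caller's mapping
    if env.get("CI") == "true":
        old_path = env.get("PATH", "")
        # Entries at the front of PATH win, so the helpers shadow the real tools.
        env["PATH"] = (helper_dir + os.pathsep + old_path) if old_path else helper_dir
    return env
```

The returned mapping could then be passed as the `env` argument of `subprocess.run`, so that any call to wget or aria2c resolves to the helper commands first while the real executables stay untouched.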
The files can be updated by calling:
# varfish-db-downloader wget urls-download
The known URLs are managed in download_urls.yml.
# export GITHUB_OWNER=bihealth
# export GITHUB_TOKEN=ghp_<thetoken>
# cd utils/terraform
# terraform init
# terraform import github_repository.varfish-db-downloader varfish-db-downloader
# terraform validate
# terraform fmt
# terraform plan
# terraform apply
For example, as follows:
# s5cmd --dry-run --profile ext-varfish-public \
--endpoint-url https://ceph-s3-ext.cubi.bihealth.org \
sync \
'output/full/mehari/genes-txs-grch3*' \
s3://varfish-public/public/
Generally, follow Semantic Commits v1; also see the examples.
Here is a list of the commit message prefixes that we use:
| prefix | description |
|---|---|
| feat | Features |
| fix | Bug Fixes |
| perf | Performance Improvements |
| deps | Dependencies |
| revert | Reverts |
| docs | Documentation |
| style | Styles |
| chore | Miscellaneous Chores |
| refactor | Code Refactoring |
| test | Tests |
| build | Build System |
| ci | Continuous Integration |
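
As an illustration of how the prefixes in the table could be checked, here is a small sketch. It is not part of the repository: the function name `has_valid_prefix` is hypothetical, and the sketch assumes subject lines of the form `prefix: description`, optionally with a scope in parentheses and a `!` marker before the colon.

```python
import re

# Prefixes taken from the table above.
ALLOWED_PREFIXES = (
    "feat", "fix", "perf", "deps", "revert", "docs",
    "style", "chore", "refactor", "test", "build", "ci",
)

# Assumed shape: prefix, optional "(scope)", optional "!", then ": " and text.
_SUBJECT = re.compile(r"^(?P<prefix>[a-z]+)(\([^)]+\))?!?: \S")


def has_valid_prefix(subject):
    """Return True if the commit subject starts with one of the allowed prefixes."""
    match = _SUBJECT.match(subject)
    return bool(match) and match.group("prefix") in ALLOWED_PREFIXES
```

A check like this could run in a commit hook or a CI job, for example on the first line of each commit message in a pull request.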