feat: Convert build scripts to snakemake workflow (#38)
tedil authored Aug 16, 2024
1 parent f1623a7 commit 45c7021
Showing 42 changed files with 2,223 additions and 3,345 deletions.
61 changes: 37 additions & 24 deletions .github/workflows/build.yml
@@ -14,43 +14,56 @@ defaults:
shell: bash -l {0}

env:
MEHARI_VERSION: "0.25.4"
MEHARI_VERSION: "0.26.1"
SNAKEMAKE_OUTPUT_CACHE: "/github/workspace/snakemake_cache"

jobs:
Build:
build_data_release:
runs-on: ubuntu-latest
strategy:
matrix:
genome_release:
- grch37
- grch38
runs-on: ubuntu-latest
- GRCh37
- GRCh38
source:
- refseq
- ensembl
steps:
- uses: actions/checkout@v4

- name: Get current date
id: date
run: echo "::set-output name=date::$(date +'%Y-%m-%d')"
run: echo "date=$(date +'%Y-%m-%d')" >> $GITHUB_OUTPUT

- name: Cache data directory
id: cache-primes
uses: actions/cache@v3
id: cache-mehari-data-tx
uses: actions/cache@v4
with:
path: mehari-data-tx
key: ${{ steps.date.outputs.date }}-${{ env.MEHARI_VERSION }}-${{ matrix.genome_release }}-${{ hashFiles('src/download.sh') }}
path: ~/work/mehari-data-tx/mehari-data-tx/mehari-data-tx-workflow
key: ${{ steps.date.outputs.date }}-${{ env.MEHARI_VERSION }}-${{ matrix.genome_release }}-${{ matrix.source }}-${{ hashFiles('config/config.yaml') }}

- name: Install Conda environment
uses: mamba-org/provision-with-micromamba@main
- name: Cache snakemake cache directory
id: cache-snakemake
uses: actions/cache@v4
with:
environment-file: false
environment-name: mehari-data-tx
channels: conda-forge,bioconda,defaults
extra-specs: |
python =3.8
entrez-direct =16.2
biocommons.seqrepo =0.6.5
htslib =1.17
- name: Run the data build
path: |
/github/workspace/snakemake_cache
~/snakemake_cache
${{ github.workspace }}/snakemake_cache
key: ${{ matrix.genome_release }}-${{ matrix.source }}

- name: Run data build workflow
uses: snakemake/snakemake-github-action@v1
with:
directory: mehari-data-tx-workflow
snakefile: workflow/Snakefile
stagein: |
mkdir -p /github/workspace/snakemake_cache
mkdir -p ~/snakemake_cache
mkdir -p ${{ github.workspace }}/snakemake_cache
args: "--configfile config/config.yaml --sdm conda --show-failed-logs --cores 4 --jobs 4 results/${{ matrix.genome_release }}-${{ matrix.source }}/mehari/seqrepo/report/mehari_db_check.txt"
show-disk-usage-on-error: true

- name: List files
run: |
export GENOME_RELEASE=${{ matrix.genome_release }}
bash src/run.sh
tree -a -L 5 ~/work/mehari-data-tx/mehari-data-tx
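
Reviewer note: the new `strategy.matrix` expands to four jobs (two genome releases times two sources), and each job asks Snakemake for exactly one target file. The sketch below reproduces that expansion in Python. The function names `snakemake_target` and `matrix_targets` are hypothetical and do not exist in the repository; the path template is copied from the `args` line above.

```python
from itertools import product

# Matrix values as listed in the workflow file.
GENOME_RELEASES = ["GRCh37", "GRCh38"]
SOURCES = ["refseq", "ensembl"]


def snakemake_target(genome_release: str, source: str) -> str:
    """Return the target path one matrix job passes to Snakemake (hypothetical helper)."""
    return (
        f"results/{genome_release}-{source}"
        "/mehari/seqrepo/report/mehari_db_check.txt"
    )


def matrix_targets() -> list[str]:
    """Expand the matrix the way GitHub Actions does: one job per combination."""
    return [snakemake_target(g, s) for g, s in product(GENOME_RELEASES, SOURCES)]


if __name__ == "__main__":
    for target in matrix_targets():
        print(target)
```

Requesting the final check report as the only target lets Snakemake work out which upstream rules need to run for that combination.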
67 changes: 37 additions & 30 deletions .github/workflows/release-please.yml
@@ -11,7 +11,8 @@ defaults:
shell: bash -l {0}

env:
MEHARI_VERSION: "0.25.4"
MEHARI_VERSION: "0.26.1"
SNAKEMAKE_OUTPUT_CACHE: "/github/workspace/snakemake_cache"

jobs:
release-please:
@@ -34,65 +35,71 @@ jobs:
Build-Release-Assets:
if: github.repository_owner == 'varfish-org'
needs: release-please
runs-on: ubuntu-latest
strategy:
matrix:
genome_release:
- grch37
- grch38
runs-on: ubuntu-latest
- GRCh37
- GRCh38
source:
- refseq
- ensembl
steps:
- uses: actions/checkout@v4
if: ${{ needs.release-please.outputs.release_created }}

- name: Get current date
if: ${{ needs.release-please.outputs.release_created }}
id: date
run: echo "::set-output name=date::$(date +'%Y-%m-%d')"
run: echo "date=$(date +'%Y-%m-%d')" >> $GITHUB_OUTPUT

- name: Cache data directory
if: ${{ needs.release-please.outputs.release_created }}
id: cache-primes
uses: actions/cache@v3
id: cache-mehari-data-tx
uses: actions/cache@v4
with:
path: mehari-data-tx
key: ${{ steps.date.outputs.date }}-${{ env.MEHARI_VERSION }}-${{ matrix.genome_release }}-${{ hashFiles('src/download.sh') }}
path: ~/work/mehari-data-tx/mehari-data-tx/mehari-data-tx-workflow
key: ${{ steps.date.outputs.date }}-${{ env.MEHARI_VERSION }}-${{ matrix.genome_release }}-${{ matrix.source }}-${{ hashFiles('config/config.yaml') }}

- name: Install Conda environment
if: ${{ needs.release-please.outputs.release_created }}
uses: mamba-org/provision-with-micromamba@main
- name: Cache snakemake cache directory
id: cache-snakemake
uses: actions/cache@v4
with:
environment-file: false
environment-name: mehari-data-tx
channels: conda-forge,bioconda,defaults
extra-specs: |
python =3.8
entrez-direct =16.2
biocommons.seqrepo =0.6.5
htslib =1.17
path: |
/github/workspace/snakemake_cache
~/snakemake_cache
${{ github.workspace }}/snakemake_cache
key: ${{ matrix.genome_release }}-${{ matrix.source }}

- name: Run the data build
- name: Run data build workflow
if: ${{ needs.release-please.outputs.release_created }}
run: |
export GENOME_RELEASE=${{ matrix.genome_release }}
bash src/run.sh
uses: snakemake/snakemake-github-action@v1
with:
directory: mehari-data-tx-workflow
snakefile: workflow/Snakefile
stagein: |
mkdir -p /github/workspace/snakemake_cache
mkdir -p ~/snakemake_cache
mkdir -p ${{ github.workspace }}/snakemake_cache
args: "--configfile config/config.yaml --sdm conda --show-failed-logs --cores 4 --jobs 4 results/${{ matrix.genome_release }}-${{ matrix.source }}/mehari/seqrepo/report/mehari_db_check.txt"
show-disk-usage-on-error: true

- name: upload release assets
if: ${{ needs.release-please.outputs.release_created }}
id: upload-release-assets
run: |
set -x
dir=/home/runner/mehari-data-tx/pass-2
dir=/home/runner/mehari-data-tx/mehari-data-tx-workflow/results/mehari/${{ matrix.genome_release }}-${{ matrix.source }}/seqrepo
src_prefix=$dir/txs.bin
dst_prefix=$dir/mehari-data-txs-${{ matrix.genome_release }}-${{ needs.release-please.outputs.major }}.${{ needs.release-please.outputs.minor }}.${{ needs.release-please.outputs.patch }}.bin
for ext in .zst .zst.sha256 .zst.report .zst.report.sha256; do
dst_prefix=$dir/mehari-data-txs-${{ matrix.genome_release }}-${{ matrix.source }}-${{ needs.release-please.outputs.major }}.${{ needs.release-please.outputs.minor }}.${{ needs.release-please.outputs.patch }}.bin
for ext in .zst .zst.sha256 .zst.report.jsonl .zst.report.jsonl.sha256; do
mv $src_prefix$ext $dst_prefix$ext
done
gh release upload \
${{ needs.release-please.outputs.tag_name }} \
$dst_prefix.zst \
$dst_prefix.zst.sha256 \
$dst_prefix.zst.report \
$dst_prefix.zst.report.sha256
$dst_prefix.zst.report.jsonl \
$dst_prefix.zst.report.jsonl.sha256
env:
# GitHub provides this variable in the CI env. You don't
# need to add anything to the secrets vault.
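
Reviewer note: the renaming done by the `upload release assets` step can be sketched in Python as follows. This is only an illustration of the new shell loop, not code from the repository. The function name `asset_renames` is hypothetical, and the directory and version are passed in as plain strings instead of being read from the `release-please` outputs.

```python
# Extensions handled by the new version of the shell loop.
EXTENSIONS = [
    ".zst",
    ".zst.sha256",
    ".zst.report.jsonl",
    ".zst.report.jsonl.sha256",
]


def asset_renames(directory: str, genome_release: str, source: str, version: str):
    """Return (source_path, destination_path) pairs mirroring the `mv` loop.

    Hypothetical helper: it only computes the paths and does not touch the disk.
    """
    src_prefix = f"{directory}/txs.bin"
    dst_prefix = (
        f"{directory}/mehari-data-txs-{genome_release}-{source}-{version}.bin"
    )
    return [(src_prefix + ext, dst_prefix + ext) for ext in EXTENSIONS]


if __name__ == "__main__":
    # Example values only; the real ones come from the job matrix and release tag.
    for src, dst in asset_renames("/tmp/seqrepo", "GRCh37", "refseq", "0.7.0"):
        print(f"{src} -> {dst}")
```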
66 changes: 29 additions & 37 deletions README.md
@@ -2,50 +2,42 @@

Reproducible data builds for [mehari](https://github.com/bihealth/mehari).

This repository contains scripts to build the transcript protobuf files for meharib based on [SACGF/cdot](https://github.com/SACGF/cdot).
This repository contains a snakemake workflow to build mehari transcript databases based on [SACGF/cdot](https://github.com/SACGF/cdot).

## Resulting Files

The following explains the content and compatibility.
Databases are built for each combination of genome release (GRCh37, GRCh38) and reference source (ensembl, refseq):

- mehari-data `v0.7.0`
- uses:
- mehari `v0.26.1`
- cdot: `v0.2.26`
- `GRCh37-refseq`
- genome release: GRCh37.p13
- VEP/ENSEMBL equivalent: `105`
- RefSeq assembly: [GCF\_000001405.25](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25/)
- resulting transcript database: `mehari-0.26.1.GRCh37-refseq.txs.bin.zst`
- `GRCh37-ensembl`
- genome release: GRCh37.p13
- VEP/ENSEMBL release: `105`
- RefSeq equivalent: [GCF\_000001405.25](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25/)
- resulting transcript database: `mehari-0.26.1.GRCh37-ensembl.txs.bin.zst`
- `GRCh38-refseq`
- genome release: GRCh38.p13
- VEP/ENSEMBL equivalent: `112`
- RefSeq assembly: [GCF\_000001405.39](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/)
- resulting transcript database: `mehari-0.26.1.GRCh38-refseq.txs.bin.zst`
- `GRCh38-ensembl`
- genome release: GRCh38.p13
- VEP/ENSEMBL release: `112`
- RefSeq equivalent: [GCF\_000001405.39](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/)
- resulting transcript database: `mehari-0.26.1.GRCh38-ensembl.txs.bin.zst`
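
As an illustration, the four builds listed above can be written down as a small lookup table. This is a sketch only: the names `BUILDS` and `database_filename` are hypothetical and not part of the workflow, while the values are taken from the list above.

```python
# Genome release and source -> metadata, as listed in this README.
BUILDS = {
    ("GRCh37", "refseq"): {"ensembl_release": "105", "refseq_assembly": "GCF_000001405.25"},
    ("GRCh37", "ensembl"): {"ensembl_release": "105", "refseq_assembly": "GCF_000001405.25"},
    ("GRCh38", "refseq"): {"ensembl_release": "112", "refseq_assembly": "GCF_000001405.39"},
    ("GRCh38", "ensembl"): {"ensembl_release": "112", "refseq_assembly": "GCF_000001405.39"},
}


def database_filename(mehari_version: str, genome_release: str, source: str) -> str:
    """Build the transcript database filename following the pattern shown above."""
    if (genome_release, source) not in BUILDS:
        raise KeyError(f"unknown build: {genome_release}-{source}")
    return f"mehari-{mehari_version}.{genome_release}-{source}.txs.bin.zst"


if __name__ == "__main__":
    for genome_release, source in BUILDS:
        print(database_filename("0.26.1", genome_release, source))
```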

- mehari-data `v0.2.1`
- compatible to: mehari `v0.4.1..`
- `grch37` data file: `mehari-data-txs-grch37-0.2.1.bin.zst`
- cdot: `v0.2.14`
- genome release: GRCh37.p13
- VEP/ENSEMBL equivalent: `r105`
- RefSeq assembly: [GCF\_000001405.25](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.25/)
- `grch38` data: `mehari-data-txs-grch38-0.2.1.bin.zst`
- cdot: `v0.2.14`
- genome release: GRCh38.p13
- VEP/ENSEMBL equivalent: `r109`
- RefSeq assembly: [GCF\_000001405.39](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/)

So, for example:

- `mehari-data-txs-grch37-0.2.1.bin.zst` is compatible to the mehari software `v0.4.1` and above;
- it was created from the cdot transcripts `v0.2.14`;
- these were built for GRCh37.p13 based on the VEP/ENSEMBL Release r105
- and the transcripts from the RefSeq assembly GCF\_000001405.25.

New builds of mehari-data will be considered when

- the new mehari software version changes protobuf schema, OR
- a new cdot has a new release corresponding to a VEP release, OR
- a new mehari version is released, OR
- a new cdot version is released, OR
- bugs are found and make a new release necessary.

Generally, only the data for the latest mehari protobuf schema is created.

## Utility Files

As RefSeq does not contain transcripts for mitochondrial genes, we graft the ENSEMBL transcripts over.

```
# python src/cdot_extract_chrmt.py \
/tmp/cdot-0.2.24.ensembl.Homo_sapiens.GRCh37.87.gff3.json.gz \
> data/cdot-0.2.24.ensembl.chrMT.grch37.gff3.json
# python src/cdot_extract_chrmt.py \
/tmp/cdot-0.2.24.ensembl.Homo_sapiens.GRCh38.111.gff3.json.gz \
> data/cdot-0.2.24.ensembl.chrMT.grch38.gff3.json
```
14 changes: 0 additions & 14 deletions config.json

This file was deleted.
