molgenis/data-transform-vkgl

vkgl

Variant annotator for vkgl variant sharing. Consists of two parts:

HGVS translator

The active HGVS translator service code lives in another repository. The service uses the biocommons/hgvs library to convert HGVS to VCF. To start the translator service:

> cd docker
> docker-compose up

To see it in action, post an array of HGVS strings to the h2v endpoint:

curl -H 'Content-Type: application/json' -d '["NM_000088.3:c.589G>T", "NC_000017.10:g.48275363C>A"]' 'http://localhost:1234/h2v?keep_left_anchor=False'
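
The same request can be sketched in Python with only the standard library. This is a minimal sketch, not part of the repository: the function names `build_h2v_request` and `translate` are hypothetical, and the host, port, and query parameter are copied from the curl example above. The shape of the response is not assumed here; it is simply decoded as JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_h2v_request(hgvs_strings, base_url="http://localhost:1234",
                      keep_left_anchor=False):
    """Build (but do not send) a POST request for the h2v endpoint.

    base_url and the query parameter mirror the curl example; the
    function itself is a hypothetical helper.
    """
    query = urlencode({"keep_left_anchor": keep_left_anchor})
    body = json.dumps(list(hgvs_strings)).encode("utf-8")
    return Request(
        f"{base_url}/h2v?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(hgvs_strings, **kwargs):
    """Send the request and return the decoded JSON response."""
    with urlopen(build_h2v_request(hgvs_strings, **kwargs)) as response:
        return json.load(response)
```

With the translator service running, `translate(["NM_000088.3:c.589G>T", "NC_000017.10:g.48275363C>A"])` sends the same payload as the curl command.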

These strings can be either genomic HGVS (NC_000017.10:g.48275363C>A) or transcript HGVS (NM_000088.3:c.589G>T). Two flags can be set for this service:

| Flag | Usage | Example |
| --- | --- | --- |
| keep_left_anchor | The service can return variants either with the last unchanged nucleotide (the left anchor) or without it. Note that this applies to all variants, including insertions, for which an anchor is preferred. | True: {"ref": "CT", "alt": "CC", "chrom":"19", "pos":"49473113", "type":"sub"}; False: {"ref": "T", "alt": "C", "chrom":"19", "pos":"49473114", "type":"sub"} |
| strict | When strict is set to True, the strict mode of the HGVS library is applied. Validation warnings then result in errors rather than just warnings. | |
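
The relationship between the two keep_left_anchor example outputs can be illustrated with a short Python sketch. This is not the service's implementation: `drop_left_anchor` is a hypothetical helper that only reproduces the substitution example, by removing the shared leading nucleotide and shifting the position by one. Variants whose alleles would become empty (for example insertions) are returned unchanged, which is an assumption of this sketch.

```python
def drop_left_anchor(variant):
    """Return a copy of `variant` without its shared leading nucleotide.

    Hypothetical helper: it only illustrates how the True and False
    example outputs relate for a substitution.
    """
    ref, alt = variant["ref"], variant["alt"]
    # Strip only when both alleles share their first base and stay non-empty.
    if len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        return {
            **variant,
            "ref": ref[1:],
            "alt": alt[1:],
            # Positions are strings in the example output, so keep them as strings.
            "pos": str(int(variant["pos"]) + 1),
        }
    return dict(variant)


anchored = {"ref": "CT", "alt": "CC", "chrom": "19", "pos": "49473113", "type": "sub"}
unanchored = drop_left_anchor(anchored)
```

Here `unanchored` equals the False example from the table.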

file processing pipeline

Spring Boot application that uses Apache Camel to read input files from src/test/inbox and annotate each line with VCF information retrieved from the HGVS translator.

Once processing has finished, results are stored in the results directory. Results with an error message are routed to an error file.

Requires JDK 11. To run the pipeline:

> mvn spring-boot:run -Dspring-boot.run.arguments=--hgnc.genes="location/of/your/hgnc/genes/file"

Add --exitOnError=true if you want the service to exit when an exception occurs.

Scroll down for more information about the HGNC genes file.

Pipeline overview

Gene validation (HGNC genes file)

Validation and correction of provided HGNC symbols is done using an export downloaded from the biomart website. The selected attributes are:

  • HGNC ID
  • Status
  • Approved symbol
  • Approved name
  • Alias symbol
  • Previous symbol
  • Chromosome
  • Chromosome location
  • Locus group
  • NCBI gene ID
  • Ensembl gene ID
  • UCSC gene ID

The downloaded file is named hgnc_genes.tsv and is stored in src/main/resources. To have the most accurate validation, it is recommended to update this file before running the pipeline.
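
To make the idea of symbol validation and correction concrete, here is a minimal Python sketch. It is not the pipeline's Java code. It assumes that the TSV header uses the attribute names listed above and that each row carries at most one alias and one previous symbol; the functions `build_symbol_lookup` and `correct_symbol` and the sample rows (GENE1, OLD1, ALIAS1, ALIAS2) are hypothetical.

```python
import csv
import io


def build_symbol_lookup(tsv_file):
    """Map approved, previous, and alias symbols to the approved symbol."""
    approved, fallback = {}, {}
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        symbol = row["Approved symbol"]
        approved[symbol] = symbol
        for column in ("Previous symbol", "Alias symbol"):
            other = (row.get(column) or "").strip()
            if other:
                # Keep the first approved symbol seen for an old or alias name.
                fallback.setdefault(other, symbol)
    # Approved symbols take precedence over previous and alias symbols.
    return {**fallback, **approved}


def correct_symbol(symbol, lookup):
    """Return the approved symbol, or None if the symbol is unknown."""
    return lookup.get(symbol)


# Hypothetical sample rows, not real HGNC data.
SAMPLE_TSV = (
    "HGNC ID\tApproved symbol\tAlias symbol\tPrevious symbol\n"
    "HGNC:0\tGENE1\tALIAS1\tOLD1\n"
    "HGNC:0\tGENE1\tALIAS2\t\n"
)
sample_lookup = build_symbol_lookup(io.StringIO(SAMPLE_TSV))
```

For the real file, the lookup could be built with `build_symbol_lookup(open("src/main/resources/hgnc_genes.tsv", newline=""))`, provided the header assumption holds.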

Release

Releasing this repository is somewhat unusual because it is a collection of scripts. First do a simple GitHub release. Then check out master, increase the version number in run.sh, and commit it. Tag the new version using the data-release-v prefix, then push it to master. Finalise the release by releasing the tag you just pushed.