Transforms a VCF (Variant Call Format) file into a tab-separated values (.tsv) file.
Its compilation and functionality have been verified on the following operating systems:
- macOS 🍏
- Linux 🐧
>>> git clone https://github.com/alexcoppe/vcf_to_tsv
>>> cd vcf_to_tsv
>>> make
After compilation, move the generated executable vcf_to_tsv to a directory listed in the $PATH variable. You can list these directories with the echo $PATH command.
This software transforms an uncompressed VCF file to a tab-separated values (tsv) file. It also works with VCFs generated by SnpEff and ANNOVAR.
To run it, you need two arguments: the VCF file and a text file specifying the desired fields. Refer to the table below for guidance on creating this file.
For a SnpEff-annotated VCF, the tool currently writes each transcript reported by SnpEff on a separate row.
Starting character | What you get |
---|---|
None | get the fields from the VCF |
: | get a subfield from the INFO field added by SnpEff |
; | get a specific subfield from the INFO field |
\| | get a specific subfield from the Genotype fields |
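
The prefix rules in the table can be read as a small lookup on the first character of each line. The Python sketch below is only an illustration of that reading, not the tool's actual implementation; the function name `classify` and the category labels are hypothetical.

```python
# Hypothetical labels for the four kinds of request described in the table.
PREFIXES = {
    ":": "snpeff_subfield",    # subfield of the INFO annotation added by SnpEff
    ";": "info_subfield",      # specific subfield of the INFO field
    "|": "genotype_subfield",  # specific subfield of the genotype fields
}


def classify(spec):
    """Return (kind, name) for one line of the fields file."""
    spec = spec.strip()
    kind = PREFIXES.get(spec[:1])
    if kind is None:
        # No recognised starting character: a plain VCF field.
        return ("vcf_field", spec)
    return (kind, spec[1:])


for line in [":hgvs_c", "position", ";gnomAD_genome_AMR", "|AD"]:
    print(classify(line))
```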
Example of a text file specifying the desired fields and subfields:
:hgvs_c
position
;gnomAD_genome_AMR
|AD
Launching the program with the above text file:
vcf_to_tsv a_vcf_file_path.vcf wanted_fields.txt
Output:
n.-3702C>T 157370625 0.0020 14,1 31,5
n.*1931C>T 157370625 0.0020 14,1 31,5
n.-3707C>T 157370630 0 15,1 33,4
...
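
Because the output is tab-separated, it can be loaded with any TSV reader. The sketch below uses Python's standard `csv` module on the first two rows shown above. It assumes the columns are separated by tabs, that there is no header row (none appears in the example), and that columns follow the order of the fields file; `read_rows` is a hypothetical helper name.

```python
import csv
import io

# First two output rows from this README, with tabs assumed between columns.
sample_output = (
    "n.-3702C>T\t157370625\t0.0020\t14,1\t31,5\n"
    "n.*1931C>T\t157370625\t0.0020\t14,1\t31,5\n"
)


def read_rows(handle):
    """Return the rows of a tab-separated file as lists of strings."""
    return list(csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


rows = read_rows(io.StringIO(sample_output))
# The |AD request yields one column per genotype field, collected here with *.
for hgvs_c, position, gnomad_amr, *allelic_depths in rows:
    print(hgvs_c, int(position), float(gnomad_amr), allelic_depths)
```

To read a real result, pass an open file instead of the `io.StringIO` object, for example `open("output.tsv", newline="")`, where the file name is a placeholder.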
Currently, the software works only with VCF files that contain 1 or 2 genotype fields.
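
For context, a genotype subfield such as AD is located through the FORMAT column of the VCF, which lists the keys in the same colon-separated order as the values in each genotype column. The Python sketch below illustrates that lookup for two genotype columns. It is not the tool's code: the function name `genotype_subfield` is hypothetical, and the FORMAT string and the GT and DP values are made up (only the AD values come from the example output above).

```python
def genotype_subfield(format_col, sample_cols, key):
    """Return the value of `key` for each genotype column, or "." if absent."""
    keys = format_col.split(":")
    if key not in keys:
        return ["."] * len(sample_cols)
    index = keys.index(key)
    values = []
    for sample in sample_cols:
        parts = sample.split(":")
        # VCF allows trailing subfields to be dropped from a genotype column.
        values.append(parts[index] if index < len(parts) else ".")
    return values


# Made-up FORMAT and genotype columns for a VCF with two samples.
print(genotype_subfield("GT:AD:DP", ["0/0:14,1:15", "0/1:31,5:36"], "AD"))
```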
The table below lists all the subfields added by SnpEff, along with the corresponding subfield names used in vcf_to_tsv (first column).
Subfield by vcf_to_tsv | Subfield by SnpEff | Explanation |
---|---|---|
:allele | Allele (or ALT) | The alternative allele |
:annotation | Annotation (a.k.a. effect) | Annotated using Sequence Ontology terms |
:putative_impact | Putative_impact | A simple estimation of putative impact / deleteriousness: {HIGH, MODERATE, LOW, MODIFIER} |
:gene_name | Gene Name | Common gene name (HGNC) |
:gene_id | Gene ID | Gene ID |
:feature_type | Feature type | Which type of feature is in the next field |
:feature_id | Feature ID | Depends on the annotation |
:transcript_biotype | Transcript biotype | The bare minimum is at least a description on whether the transcript is {"Coding", "Noncoding"}. Whenever possible, use ENSEMBL biotypes |
:rank | Rank / total | Exon or Intron rank / total number of exons or introns |
:hgvs_c | HGVS.c | Variant using HGVS notation (DNA level) |
:hgvs_p | HGVS.p | If variant is coding, this field describes the variant using HGVS notation (Protein level) |
:cdna_position | cDNA_position / cDNA_len | Position in cDNA and transcript's cDNA length (one based) |
:cds_position | CDS_position / CDS_len | Position and number of coding bases (one based, includes START and STOP codons) |
:protein_position | Protein_position / Protein_len | Position and number of AA (one based, including START, but not STOP) |
:distance_to_feature | Distance to feature | All items in this field are optional; see the SnpEff page for details |
:errors | Errors, Warnings or Information messages | Errors, warnings or informative messages that can affect annotation accuracy |
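
To make the table concrete, the Python sketch below shows how these names could map onto a SnpEff ANN entry, where annotations for different transcripts are separated by commas and subfields by `|`. It assumes the rows of the table follow the positional order of the ANN subfields, and it is an illustration rather than the tool's implementation. The names `ANN_SUBFIELDS` and `ann_rows` are hypothetical, and the INFO string is made up apart from the two HGVS.c values taken from the example output.

```python
# Subfield names from the first column of the table, in table order
# (assumed here to match the positional order inside each ANN annotation).
ANN_SUBFIELDS = [
    "allele", "annotation", "putative_impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank",
    "hgvs_c", "hgvs_p", "cdna_position", "cds_position",
    "protein_position", "distance_to_feature", "errors",
]


def ann_rows(info, wanted):
    """Return one list of requested values per transcript annotation in ANN."""
    ann = next(
        (item[len("ANN="):] for item in info.split(";") if item.startswith("ANN=")),
        "",
    )
    rows = []
    for annotation in filter(None, ann.split(",")):
        values = dict(zip(ANN_SUBFIELDS, annotation.split("|")))
        rows.append([values.get(name, "") for name in wanted])
    return rows


# Made-up INFO column with two transcript annotations for the same variant.
info = (
    "DP=50;ANN="
    "T|upstream_gene_variant|MODIFIER|GENE1|ID1|transcript|TX1|lncRNA||n.-3702C>T|||||3702|,"
    "T|downstream_gene_variant|MODIFIER|GENE1|ID1|transcript|TX2|lncRNA||n.*1931C>T|||||1931|"
)
for row in ann_rows(info, ["hgvs_c", "putative_impact"]):
    print("\t".join(row))
```

Each annotation becomes its own row, which mirrors the one-row-per-transcript behaviour described earlier.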