VCF Tools tutorial

Biologists receive so much biological data that they have to spend a lot of time cleaning it up in order to get just the data they are interested in. HPG VCF Tools is a set of tools for preprocessing, filtering and manipulating VCF files. It aims to avoid excessive time consumption in tedious preprocessing tasks.

Supported input formats

Variant Call Format (VCF) 4.0: Variants and genotypical information

Splitting a VCF file

A set of VCF files can be created by splitting one by a criterion. Each one of the output files is a fully valid VCF file.

The most basic command-line for invoking this tool is:

hpg-var-vcf split -v your_vcf_file.vcf --criterion chromosome

Currently available criteria are:

By chromosome
By coverage intervals

By chromosome

Can be invoked using the --criterion chromosome option, like in:

hpg-var-vcf split -v your_vcf_file.vcf --criterion chromosome

Each output file will be named chromosome_N_your_vcf_file.vcf (N being the chromosome name) and will contain the entries from a single chromosome.

By coverage intervals

Can be invoked using the --criterion coverage option along with --intervals, like in:

hpg-var-vcf split -v your_vcf_file.vcf --criterion coverage --intervals 1000,2000,5000,12000

Each output file will be named as follows:

First file: coverage_0_X_your_vcf_file.vcf, being X the first coverage interval. In the example, coverage_0_1000_your_vcf_file.vcf.
Second to penultimate: coverage_X_Y_your_vcf_file.vcf, being X and Y limits of a coverage interval. In the example, coverage_1000_2000_your_vcf_file.vcf, coverage_2000_5000_your_vcf_file.vcf and coverage_5000_12000_your_vcf_file.vcf.
Last: coverage_Y_N_your_vcf_file.vcf, being Y the last coverage interval (N represents infinite). In the example, coverage_12000_N_your_vcf_file.vcf.

Filtering entries from a VCF file

If you are only interested in the entries of a VCF file that satisfy certain criteria, you can apply a collection of filters to the input. Currently available filters are:

By number of alleles (--alleles)
By coverage (--coverage)
By gene (--gene, comma-separated list)
By correspondence to an indel (--indel, values are "include/exclude")
By percentage of samples following dominant or recessive inheritance patterns (--inh-dom or --inh-rec, decimal like 0.1)
By minimum allele frequency (MAF) (--maf, decimal like 0.01)
By percentage of missing values (--missing, decimal like 0.1)
By quality (--quality)
By region (--region, comma-separated list of regions; --region-file, path to GFF or BED file; --region-type, by feature type of the region, used along --region-file)
By correspondence to a SNP (--snp, values are "include/exclude")

The most basic command-line for invoking this tool is:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40

Several filters can be applied at the same time:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40 --maf 0.02 --snp include

By default, only the entries that pass the filters are written to an output file, named your_vcf_file.vcf.filtered.

If you also want to save the entries that failed the tests, add the --save-rejected flag to your command-line. They will be written to a file named your_vcf_file.vcf.rejected and flagged with the first filter they failed. For example, if you requested a minimum quality of 40, the value inserted in the FILTER column will be "q40".

By number of alleles

Variants may be bi-allelic or multi-allelic. In case you want to choose the variants with a certain number of alleles, the command-line would be something like:

hpg-var-vcf filter -v your_vcf_file.vcf --alleles 3

By coverage / quality

Minimum accepted quality and coverage can be specified via the --quality and --coverage options, like in:

hpg-var-vcf filter -v your_vcf_file.vcf --quality 40 --coverage 30

By gene

Regions can be specified directly via the command-line argument --gene, followed by a comma-separated list of genes.

hpg-var-vcf filter -v your_vcf_file.vcf --gene bcl2,brca1

By inheritance pattern

The minimum percentage of samples following a dominant or recessive inheritance pattern can be specified using the --inh-dom and --inh-rec options, respectively, followed by a decimal number:

hpg-var-vcf filter -v your_vcf_file.vcf --inh-dom 0.80 --inh-rec 0.60

These options can be used separately or combined.

By correspondence to an indel

Whether you want to include (or exclude) indels from the input file, you must add the --indel include (or --indel exclude) command-line option, like in:

hpg-var-vcf filter -v your_vcf_file.vcf --indel include

By minimum-allele frequency (MAF)

Maximum accepted MAF can be specified via the --maf option, like in:

hpg-var-vcf filter -v your_vcf_file.vcf --maf 0.02

This way, only positions with a MAF < 0.02 will pass the filter.

By missing values

Maximum percentage of accepted missing values can be specified via the --missing option, like in:

hpg-var-vcf filter -v your_vcf_file.vcf --missing 0.20

This way, only positions with less than 20% of missing values will pass the filter.

By region

Regions can be specified directly via the command-line argument --region, followed by a comma-separated list of regions. Each region must be described as chromosome:start-end, such as 1:12345-67890.

If the list of regions is stored in a GFF file, it can be referenced using the --region-file option, followed by the name of the file, such as:

hpg-var-vcf filter -v your_vcf_file.vcf --region-file your_regions_file.gff

By correspondence to a SNP

Whether you want to include (or exclude) SNPs from the input file, you must add the --snp include (or --snp exclude) command-line option, like in:

hpg-var-vcf filter -v your_vcf_file.vcf --snp exclude

Getting statistics from a VCF file

This tool reports statistics for each variant and sample in the VCF file, as well as statistics referred to the whole file. A command-line for invoking this tool is:

hpg-var-vcf stats -v your_vcf_file.vcf

By default, only statistics per variant are calculated. Not giving a PED file as argument will limit the number of available statistics (the ones related to allele inheritance, such as mendelian errors).

If you want to run tool at full power, just type:

hpg-var-vcf stats -v your_vcf_file.vcf -p your_ped_file.ped --variants --samples

And if you only want to retrieve statistics per sample:

hpg-var-vcf stats -v your_vcf_file.vcf -p your_ped_file.ped --samples

If you want to save the resulting statistics to a SQLite database in order to query them in the future faster than from a plain text file, yo must use the --db option:

hpg-var-vcf stats -v your_vcf_file.vcf --db

Global statistics

Number of variants
Number of samples
Number of bi-allelic sites
Number of multi-allelic sites
Number of SNP
Number of indels
Number of transitions
Number of transversions
Ti/TV ratio
Percentage of PASS
Average quality in the VCF

Variant statistics

Allele counts and frequencies
Genotype counts and frequencies
Number of missing alleles and genotypes
Mendelian errors
Being an indel or not
Samples following dominant and recessive inheritance models

Sample statistics

Missing genotypes
Mendelian errors

Merging multiple VCF files

This tool allows to merge multiple VCF files, for example, containing different samples of the same experiment. The simplest command-line you may use is:

hpg-var-vcf merge --vcf-list your_vcf_file_1.vcf,your_vcf_file_2.vcf,your_vcf_file_N.vcf

A new file named merge_from_N_files.vcf (N being the number of files merged) will be created in the output directory. The headers from all files are written to the new one.

Regarding variants (the most interesting part of the file), multi-sample and multi-allelic variants are accepted. In order to preserve coherence, several operations must be performed. For a better understanding of these operations, see the VCF specification.

When a multi-allelic variant is found, the different alleles are merged and the numerical indices that identify them recalculated.
The FILTER column is the union of the filters applied to a position.
Many of the standard INFO fields (such as AF, AN and DP) can be recalculated based on the merged data.
The FORMAT column is the union of all the FORMATs from the source files.
Samples are the most complex part. Their fields are structured according to the new FORMAT. Since the FILTER and INFO columns from the corresponding source file will be lost when recalculated, they can be copied to the samples on user demand.

Reference allele strictness

It could be the case that not all your files had the same reference allele. By default, the merging tool will accept this situation, as long as the first nucleotide in all of them is the same. For example, it will accept the set (A, AG, ACG), but not (A, AG, __G__A).

However, if you want to be more strict and accept only variants with exactly the same reference allele, you can use the --strict-ref option:

hpg-var-vcf merge --vcf-list your_vcf_file_1.vcf,your_vcf_file_2.vcf,your_vcf_file_N.vcf --strict-ref

Chromosome list

By default, HPG Variant will look for the list of chromosomes for a species on CellBase. In case a CellBase host is not reachable or the species not supported, the argument --chrom-list can be used to provide a file with this content.

Chromosomes must be listed one per line, in the same order as in the input VCF files to merge. This means the chromosomes in all the VCF files must follow this order, although not necessarily be the same.

Filling missing data

As stated before, sample fields in the output file are structured according to the union of all FORMATs. If one of these fields was absent in the source file, it is filled with the missing symbol.

If a sample is absent in all files for a certain position (e.g., does not have a variant in that position), the user can choose whether to mark them as equal to the reference genome or missing. This can be done with the --missing-mode option, which can get the values "missing" or "reference". The "missing" value will fill them with ./., whereas "reference" will be the same as 0/0. By default, "missing" is marked.

hpg-var-vcf merge --vcf-list your_vcf_file_1.vcf,your_vcf_file_2.vcf,your_vcf_file_N.vcf --missing-mode reference

Recalculating the INFO field

Many of the standard INFO fields can be recalculated based on the merged data. For this purpose, the --info-fields options must be used:

hpg-var-vcf merge --vcf-list your_vcf_file_1.vcf,your_vcf_file_2.vcf,your_vcf_file_N.vcf --info-fields DP,H2,NS

At the moment, the following fields can be recalculated:

AC: Allele count in genotypes, for each ALT allele, in the same order as listed
AF: Allele frequency for each ALT allele in the same order as listed
AN: Total number of alleles in called genotypes
DB: dbSNP membership
DP: Combined depth across samples
H2: Membership in HapMap2
H3: Membership in HapMap3
MQ: RMS mapping quality
MQ0: Number of MAPQ == 0 reads covering this record
NS: Number of samples with data
QD: Quality by depth (based on GATK)
SOMATIC: Somatic mutation, for cancer genomics
VALIDATED: Validated by follow-up experiment

Copying old FILTER and INFO fields

Since the FILTER and INFO columns from the corresponding source file will be lost when recalculated, they can be copied to the samples on user demand. In order to do that, the user may use one or both options, like illustrated in the following example:

hpg-var-vcf merge --vcf-list your_vcf_file_1.vcf,your_vcf_file_2.vcf,your_vcf_file_N.vcf --copy-filter --copy-info

⚠️ The size of the output file will grow exponentially when old FILTER and INFO values are inserted!

Redirecting the output

By default, HPG Variant saves its output in the /tmp/variant/ folder. If you want to use a different one, you can set the --outdir flag, or edit the global/outdir property of the file $(HOME)/.hpg-variant/hpg-variant.conf. This file will be created the first time you run HPG Variant, so you can run it once with the --outdir flag set, then edit the global/outdir entry of the configuration file.

⚠️ If you run HPG Variant several times in a row, don't forget to save the results in different folders! The contents of the output folder are overwritten with each execution.

Advanced configuration

There are some other options that may improve the performance of HPG Variant but are not recommended to tweak without basic knowledge of the machine architecture where the application runs. These are:

max-batches: The maximum number of groups of lines that can be loaded at the same time in main memory.
batch-lines: Maximum number of lines each group will contained.
num-threads: Number of threads that will be running the statistical tests at the same time. It should be the number of cores in your machine.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly