- Hap.py User's Manual
- Introduction
- Getting started
- Full list of command line options
- Minimal options
- Running and debugging issues
- Restricting to subsets of the genome
- Additional input options
- Working with genome VCFs
- Additional outputs
- Comparison engines
- Stratification via BED regions
- Internal variant normalisation and haplotype comparison
- ROC curves
- Input preprocessing using bcftools
- Haplotype comparison parameters
- Using RTG-Tools - VCFEval as the comparison engine
- Full list of output list of output columns
Hap.py is a tool to compare diploid genotypes at haplotype level. Rather than comparing VCF records row by row, hap.py will generate and match alternate sequences in a superlocus. A superlocus is a small region of the genome (sized between 1 and around 1000 bp) that contains one or more variants.
Matching haplotype sequences rather than VCF records is more accurate. It allows us to do the following things:
- We can match up variant records that represent the same alt sequences in a different form (see ../example/GiaB).
- We can also more accurately merge variant call sets (see ls_example.md).
The inputs to hap.py are two VCF files (a "truth" and a "query" file), and an optional "confident call region" bed file (NOTE: bed files with track information are not supported, all input bed or bed.gz files must only contain bed records).
Note: If you are running hap.py with genome VCFs that include symbolic alleles please see the section Working with genome VCFs
Hap.py will report counts of
- true-positives (TP) : variants/genotypes that match in truth and query.
- false-positives (FP) : variants that have mismatching genotypes or alt alleles, as well as query variant calls in regions a truth set would call confident hom-ref regions.
- false-negatives (FN) : variants present in the truth set, but missed in the query.
- non-assessed calls (UNK) : variants outside the truth set regions
From these counts, we are able to calculate
Recall = TP/(TP+FN)
Precision = TP/(TP+FP)
Frac_NA = UNK/total(query)
F1_Score = 2 * Precision * Recall / (Precision + Recall)
When comparing two VCFs, genotype and allele mismatches are counted both as false-positives and as false-negatives. Truth sets like Platinum Genomes or NIST/Genome in a Bottle also include "confident call regions", which show places where the truth dataset does not expect variant calls, or selects a subset of truth variants that are high-confidence. Hap.py uses these regions to count truth or query variant calls outside these regions as unknown (UNK), query variant calls inside these regions either as TP or FP.
Truthset confident regions are passed to hap.py using the -f
command line
option. Note that this is different from restricting comparison to a subset
of the genome using -T
or -R
: variants outside the confident regions
do not get removed from the input files, they will be used during comparison
but then counted as UNK during quantification.
Note that some special rules for variant capture by confident regions apply. In particular, we trim variant alleles to determine their exact reference locations, and we use these for computing confident region overlap. For more information, see Stratification via Bed Regions.
Counts and statistics will be calculated for the following subsets of variants:
Type | Description |
---|---|
SNP |
SNP or MNP variants. We count single nucleotides that have changed |
INDEL |
Indels and complex variants |
Hap.py (and qfy.py, which is the part of the hap.py that counts variants) also computes counts for subtypes and observed genotypes in the above two categories.
Subtype | Description |
---|---|
* |
Aggregate numbers for all subtypes |
ti or tv |
Transitions and transversions for SNPs |
I1_5 |
Insertions of length 1-5 |
I6_15 |
Insertions of length 6-15 |
I16_PLUS |
Insertions of length 16 or more |
D1_5 |
Deletions of length 1-5 |
D6_15 |
Deletions of length 6-15 |
D16_PLUS |
Deletions of length 16 or more |
C1_5 |
Complex variants of length 1-5 |
C6_15 |
Complex variants of length 6-15 |
C16_PLUS |
Complex variants of length 16 or more |
Genotype | Description |
---|---|
* |
Aggregate numbers for all genotypes |
het |
only heterozygous variant calls (0/1 or similar genotypes) |
homalt |
only homozygous alternative variant calls (1/1 or similar genotypes) |
hetalt |
only heterozygous alternative variant calls (1/2 or similar genotypes) |
Note that currently the granularity of counting is on a per-VCF-record level. In complex/high-variability regions, the classifications above might become inaccurate due to nearby variants (e.g. an insertion with a close-by SNP would be more accurately classified as "complex" if the SNP and the insertion occur on the same haplotype). Future versions will address this by counting superloci rather than VCF records.
Hap.py also supports stratification regions for more detailed output stratification (one example use case is computing precision / recall on exoms as well as on the whole genome, or using the stratification regions from the GA4GH benchmarking repository at https://github.com/ga4gh/benchmarking-tools.
Below, we assume that the code has been installed to the directory ${HAPPY}
.
$ ${HAPPY}/bin/hap.py \
example/happy/PG_NA12878_chr21.vcf.gz \
example/happy/NA12878_chr21.vcf.gz \
-f example/happy/PG_Conf_chr21.bed.gz \
-o test
$ ls test.*
test.extended.csv test.roc.all.csv test.roc.Locations.INDEL.PASS.csv test.roc.Locations.SNP.PASS.csv test.vcf.gz
test.metrics.json test.roc.Locations.INDEL.csv test.roc.Locations.SNP.csv test.summary.csv test.vcf.gz.tbi
This example compares an example run of GATK 1.6 on NA12878 against the Platinum Genomes reference dataset (Note: this is a fairly old version of GATK, so don't rely on these particular numbers for competitive comparisons!).
The summary CSV file contains all computed metrics:
Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
INDEL | ALL | 9124 | 8100 | 1024 | 11731 | 228 | 3460 | 71 | 0.887769 | 0.972434 | 0.294945 | NaN | NaN | 1.463852 | 1.423305 |
INDEL | PASS | 9124 | 7869 | 1255 | 9910 | 176 | 1922 | 48 | 0.862451 | 0.977967 | 0.193946 | NaN | NaN | 1.463852 | 1.209239 |
SNP | ALL | 52520 | 52202 | 318 | 90092 | 476 | 37377 | 109 | 0.993945 | 0.990970 | 0.414876 | 2.081246 | 1.745874 | 1.595621 | 3.132586 |
SNP | PASS | 52520 | 46965 | 5555 | 48083 | 83 | 1013 | 8 | 0.894231 | 0.998237 | 0.021068 | 2.081246 | 2.089405 | 1.595621 | 1.487858 |
Note that the e.g. Ti/Tv and het/hom ratios for different subsets of the genome can differ (e.g. the expected Ti/Tv ratio for WGS and exome is normally different).
The extended table gives metrics in a more stratified format:
- it contains counts stratified over the different genotypes and variant subtypes
- it also contains Ti/Tv and het/hom ratios for each variant classification (TP/FN/...)
- if stratification regions were specified, it will contain a breakdown into these regions.
You can run hap.py with the -h switch to get help.
The first two positional arguments are used as the input VCF files. The output
file prefix is specified using -o
(this should be something in the form
of directory/prefix):
$ ${HAPPY}/bin/hap.py truth.vcf.gz query.vcf.gz \
-o output-prefix --force-interactive
--force-interactive
Force running interactively (i.e. when JOB_ID is not in the environment) This is only available if Hap.py is set up to run inside SGE. Unless forced, it will not run interactively (it detects this by looking for the environment variable SGE_JOB_ID).
The reason for this is that parallelism is implemented using the multiprocessing module in Python, which spawns processes that can take a lot of memory and also may be difficult to kill interactively.
This feature must be enabled when installing / compiling hap.py. See the installation instructions in ../README.md.
--threads THREADS
The number of threads to use. This is detected automatically by default using Python's multiprocessing module (we recommend around 1GB of RAM per thread).
--logfile LOGFILE
Write logging information into file rather than to stderr.
--scratch-prefix SCRATCH_PREFIX
--keep-scratch
All temporary files go into a scratch folder, which normally defaults to a
subdirectory of /tmp
. This can be customised (e.g. when fast local storage is
available).
--location LOCATIONS, -l LOCATIONS
Add a location to the compare list (when not given, hap.py will use chr1-22, chrX, chrY).
--pass-only
Use to include only variants in the comparison that pass all filters.
-R REGIONS_BEDFILE, --restrict-regions REGIONS_BEDFILE
Restrict analysis to given (sparse) regions (similar to using -R in bcftools). Sparse regions should be used when there are not many regions to look at (the corresponding calls are retrieved via tabix lookup, so this will be slow for many regions). Also, the regions must not overlap (otherwise, hap.py will fail).
-T TARGETS_BEDFILE, --target-regions TARGETS_BEDFILE
Restrict analysis to given (dense) regions (similar to using -T in bcftools). One example use for this is to restrict the analysis to exome-only data.
-f FP_BEDFILE, --false-positives FP_BEDFILE
False positive / confident call regions (.bed or .bed.gz).
-r REF, --reference REF
Specify the reference FASTA file to use. Hap.py detects a default reference sequence file at the following locations:
- at
/opt/hap.py-data/hg19.fa
(see the Dockerfile) - at the location of the HGREF or the HG19 environment variable
To specify a default reference file location, you can run
export HGREF=path-to-your-reference.fa
before running hap.py.
The presence of the <NON_REF> symbolic allele in genome VCFs can cause problems for hap.py, especially if it is part of a genotype. As a workaround, we provide several options. Since variants genotyped as <NON_REF> cannot be sensibly scored, the we provide the following option, which is safe to use on both genome VCFs and standard VCFs:
--filter-nonref Remove any variants genotyped as <NON_REF>.
If hap.py still crashes when processing a genome VCF, we provide separate options to perform on-the-fly conversion of a genome VCF to a standard VCF by removing all <NON_REF> alleles and non-variant blocks. Note that this also removes some fields from the INFO column. These options should only be used on genome VCFs since attempting to convert a standard VCF will cause all biallelic variants to be filtered out (most of them).
--convert-gvcf-truth Convert the truth genome VCF to a standard VCF.
--convert-gvcf-query Convert the query genome VCF to a standard VCF.
-V, --write-vcf
Write an annotated VCF. This file follows the GA4GH specifications at https://github.com/ga4gh/benchmarking-tools/blob/master/doc/ref-impl/README.md
It shows the merged and normalised truth and query calls, together with annotation that shows how they were counted. There are two sample columns ("TRUTH" and "QUERY"), showing the genotypes for each variant in truth and query, along with information on the decision for truth and query calls (TP/FP/FN/N/UNK). See the GA4GH page above for more details.
Hap.py can produce benchmarking results and ROCs using different comparison methods which implement the GA4GH intermediate format.
These methods are:
- xcmp (hap.py's default comparison engine): this method will assume that both input samples are diploid / human samples. Matching is performed on a haplotype level: xcmp enumerates all possible haplotypes that may be described by truth and query within a small superlocus. If matching pairs of haplotype sequences are found, xcmp will convert all variants within the surrounding superlocus into TPs. Xcmp also recognizes direct genotype or allele matches / mismatches (i.e. the resulting FP.GT and FP.AL columns after running quantify will be meaningful). Global phasing information is used to restrict haplotype enumeration, but PS phasing is not supported.
- Rtgtools vcfeval (see below for more detailed instructions): this is a comparison that is simular to xcmp, but which uses an optimization method to determine TP/FP status on a per-variant level (rather than xcmp's per-superlocus level). This method does not use phasing information when it is present in the input VCF file, and input files must not contain symbolic records (hap.py's preprocess tool can mitigate this when). Vcfeval will not report 'local matches', i.e. FP.AL will be zero when using this method. Vcfeval support must be enabled at build/install time since it requires Java to be available.
- scmp-distancebased: this will match variants by location only. Any variant in the
query that has a truth variant nearby (the distance can be set using the
--scmp-distance
parameter). - scmp-somatic: this mode is intended to be used with Tumor/Normal VCF files. Variants
in the query will be matched to truth variants if the same normalized alleles are found nearby.
Genotypes are ignored, and multi-sample VCF files will be collapsed into a single column
with a pre-specified genotype (this can be customized using the
--set-gt
command line option, which also works with all the other comparison engines).
There are a quite a few differences between these comparison modes which are reflected in the ROC outputs. Some examples for this are shown in microbench.md.
Hap.py can compute stratified counts using bed regions of interest. One set of such regions can be found here:
https://github.com/ga4gh/benchmarking-tools/tree/master/resources/stratification-bed-files
hap.py example/happy/PG_NA12878_hg38-chr21.vcf.gz example/happy/NA12878-GATK3-chr21.vcf.gz \
-f example/happy/PG_Conf_hg38-chr21.bed.gz -o gatk-stratified --force-interactive
--stratification example/happy/stratification.tsv
The file stratification.tsv contains a tab-separated list of region names and files (the paths can be relative to the location of the TSV file).
The extended csv file will then contain additional rows like this which give counts, precision and recall only for variants that match a particular region:
...
INDEL,*,exons,ALL,*,QUAL,*,0.96729,0.962791,0.12601600000000002,0.48251700000000003,214,207,7,246,207,8,31,0,5,,3.5106379999999997,,5.0,,3.4782610000000003,,0.75,,,3.172414,3.4782610000000003,,2.625
...
By default, only aggregate results will be returned for stratification regions.
We can also obtain full ROC data for selected regions using the --roc-regions
command line option.
--roc-regions ROC_REGIONS
Select a list of regions to compute ROCs in. By
default, only the '*' region will produce ROC output
(aggregate variant counts).
Note that variant calls are captured by their effective reference position, which is
determined by trimming all alleles to their shortest representation. When running
hap.py with the --preserve-info
command line option will add an INFO field to each
record detailing these locations. The insertion below is captured by its
padding base, as indicated by the RegionsExtent=110-110
INFO field.
chrQ 110 . A ATT 0 . BS=100;HapMatch;IQQ=0;ctype=hap:match;gtt1=gt_het;kind=missing;type=FN;RegionsExtent=110-110 GT:BD:BK:BI:BVT:BLT:QQ 0/1:TP:gm:i1_5:INDEL:het:0 ./.:.:.:.:NOCALL:nocall:0
Note that matching insertions to confident or other regions requires special treatment. Insertions normally include a reference padding base.
chrQ 110 tp A ATT . PASS . GT 0/1
In order to fully capture the insertion, both surrounding bases should be included in the bed file (i.e. capture insertions only if the two
Hap.py will capture the insertion above in a region if the region contains both of the two following positions:
chrQ 109 111 ins-ref-base-and-following-base
This behavior will apply to all stratification regions, and to confident regions (supplied by the
-f
command line option). Note that it will not apply to region the subsetting operations
implemented in -T
and -R
, which use the default behavior of bcftools (REF position
match).
If the confident regions only contain the padding base, we can fix them using the
--adjust-conf-regions
command line option. This option will correctly pad all confident
insertions.
-L, --leftshift Left-shift variants in their unary representation. This is off by default.
-D, --no-decompose Switch off variant primitive decomposition
--unhappy Disable all clever matching in xcmp (does not do anything with vcfeval).
--bcftools-norm Enable preprocessing through bcftools norm -c x -D
(requires external preprocessing to be switched on).
--fixchr Add/remove chr prefix (default: auto, attempt to match
reference).
--no-fixchr Add chr prefix to query file (default: auto, attempt
to match reference).
--bcf Use BCF internally. This is the default when the input
file is in BCF format already. Using BCF can speed up
temp file access, but may fail for VCF files that have
broken headers or records that don't comply with the
header.
--preprocess-truth Preprocess truth file with same settings as query
(default is to accept truth in original format).
--usefiltered-truth Preprocess truth file with same settings as query
(default is to accept truth in original format).
These switches control VCF preprocessing.
Left-shifting attempts to safely left-align indel and complex variant calls without shifting variant records past each other.
Variant decompositions splits complex variants into its primitive variants, leaving only SNPs, insertions and deletions. This type of preprocessing is useful when comparing different variant calling methods, for example where one method might output an MNP block, and another method multiple SNPs. If these calls are FPs, we should count the same number for each method, and this can be achieved in many cases by this type of decomposition. The micro-benchmark example in microbench.md shows the effect of different pre-processing switches.
Hap.py can create data for ROC-style curves. Normally, it is preferable to calculate such curves based on the input variant representations, and not to perform any variant splitting or preprocessing.
Here are the options which need to be added to a hap.py command line to create a ROC curve based on the query GQX field:
...
--roc GQX \
...
The --roc
switch specifies the feature to filter on. Hap.py translates the
truth and query GQ(X) fields into the INFO fields T_GQ and Q_GQ, it tries to
use GQX first, if this is not present, it will use GQ. When run without
internal preprocessing any other input INFO field can be used (e.g.
--roc INFO.VQSLOD for GATK).
The --roc-filter
switch may be used to specify the particular VCF filter
which implements a threshold on the quality score. When calculating filtered
TP/FP counts, this filter will be removed, and replaced with a threshold filter
on the feature specified by --roc
. By default, a PASS and an ALL ROC will
be computed corresponding to the variant counts with all filters enabled (PASS)
and no filters (ALL). When --roc-filter
is specified, a third ROC curve
is computed named "SEL", which shows the performance for selectively-filtered
variants.
When computing precision/recall curves, we assume that higher quality scores are better, variants with scores higher than the variable threshold will "pass", all others will "fail".
The output file will be comma-separated value files giving tp / fp / fn counts, as well as precision and recall for different thresholds of the ROC feature. Here is a full example (assuming the folder hap.py contains a clone of the hap.py repository, and that hap.py can be run through PATH):
hap.py hap.py/example/happy/PG_NA12878_hg38-chr21.vcf.gz \
hap.py/example/happy/NA12878-GATK3-chr21.vcf.gz \
-f hap.py/example/happy/PG_Conf_hg38-chr21.bed.gz \
-r hap.py/example/happy/hg38.chr21.fa \
-o gatk-all \
--roc QUAL --roc-filter LowQual
After running, this will produce a set of outputs as follows.
Output File | Contents |
---|---|
gatk-all.summary.csv | Summary statistics |
gatk-all.extended.csv | Extended statistics |
gatk-all.roc.all.csv | All precision / recall data points that were calculated |
gatk-all.vcf.gz | Annotated VCF according to https://github.com/ga4gh/benchmarking-tools/tree/master/doc/ref-impl |
gatk-all.vcf.gz.tbi | VCF Tabix Index |
gatk-all.metrics.json | JSON file containing all computed metrics and tables. |
gatk-all.roc.Locations.INDEL.csv | ROC for ALL indels only. |
gatk-all.roc.Locations.SNP.csv | ROC for ALL SNPs only. |
gatk-all.roc.Locations.INDEL.PASS.csv | ROC for PASSing indels only. |
gatk-all.roc.Locations.SNP.PASS.csv | ROC for PASSing SNPs only. |
gatk-all.roc.Locations.INDEL.SEL.csv | ROC for selectively-filtered indels only. |
gatk-all.roc.Locations.SNP.SEL.csv | ROC for selectively-filtered SNPs only. |
The ROC files give the same columns as the summary and extended statistics output files for a range of thresholds on QUAL (or the feature that was passed to --roc).
Hap.py has a range of options to control pre-processing separately for truth and query. Hap.py supports the same options as pre.py, which is described in normalisation.md.
-w WINDOW, --window-size WINDOW
Window size for haplotype block finder. We use a sliding window, this parameter determines the maximum distance between two variants that ensures that they end up in the same haplotype block. For larger values, the haplotype blocks get larger and might capture more het variants that will cause the enumeration threshold (below) to be reached. Also, longer haplotype blocks take more time to compare. The default value here is 30.
--enumeration-threshold MAX_ENUM
Enumeration threshold / maximum number of sequences to enumerate per block. Basically, each unphased heterozygous variant in a block doubles the number of possible alternate sequences that could be created from the set of variants within a haplotype block (10 hets in a row would result in 1024 different alternate sequences). The default setting emphasizes accuracy over compute time. In some cases, it might be beneficial to reduce this value to 256 (which reduces the number of het calls that can be matched in one superlocus to 8 in truth and query) to get an approximate result more quickly. Another solution is to use vcfeval as a comparison engine instead.
-e HB_EXPAND, --expand-hapblocks HB_EXPAND
Reference-pad and expand the sequences generate haplotype blocks by this many basepairs left and right. This is useful for approximate block matching.
RTG-Tools (see https://github.com/RealTimeGenomics/rtg-tools
provides a feature called "vcfeval" which performs complex variant comparisons. Hap.py
can use this tool as a comparison engine instead of its built-in xcmp
tool.
This feature requires a version of rtg-tools which supports the GA4GH benchmarking intermediate file formats (see https://github.com/ga4gh/benchmarking-tools). Currently, such a version is available in this branch: https://github.com/realtimegenomics/rtg-tools/tree/ga4gh-test
When installing hap.py, we can attempt to build a compatible version of VCFeval.
The installer has the --with-rtgtools
and --rtgtools-wrapper RTGTOOLS_WRAPPER
command line options to do this.
Before using RTG tools, it is necessary to translate the reference Fasta file into the SDF format (this only needs to be done once for each reference sequence). Hap.py can do this on the fly, but when using a single reference with multiple VCFs, it is better to do this beforehand.
rtg format -o hg19.sdf hg19.fa
This creates a folder hg19.sdf
, which must be passed to hap.py using the --engine-vcfeval-template hg19.sdf
arguments.
The hap.py command line arguments for using vcfeval are as follows (assuming that RTG tools is installed
in ~/rtg-install
, and that the folder hg19.sdf
created by the previous command line is in the current
directory):
hap.py truth.vcf.gz query.vcf.gz -f conf.bed.gz -o ./test -V --engine=vcfeval --engine-vcfeval-template hg19
Most other command line arguments and outputs will work as before.
Happy outputs a set of stratification columns, followed by metrics columns. Stratification columns may contain the placeholder "*" value, which indicates that counts in this row are aggregated over all possible values of the column. In case of a subtype, this means that the counts have been summed across all subtypes. In case of QQ, it means that the counts correspond to all variants rather than only ones below a QQ threshold.
Stratification Column | Description |
---|---|
Type | Variant type (SNP / INDEL) |
Subtype | Variant Subtype (ti/tv/indel length, see above |
Subset | Subset of the genome/stratification region |
Filter | Variant filters: PASS, SEL, ALL, or a particular filter from the query VCF |
Genotype | Genotype of benchmarked variants (het / homalt / hetalt) |
QQ.Field | Which field from the original VCF was used to produce QQ values in truth and query |
QQ threshold for ROC values |
Metric Column | Description |
---|---|
METRIC.Recall | Recall for truth variant representation = TRUTH.TP / (TRUTH.TP + TRUTH.FN) |
METRIC.Precision | Precision of query variants = QUERY.TP / (QUERY.TP + QUERY.FP) |
METRIC.Frac_NA | Fraction of non-assessed query calls = QUERY.UNK / QUERY.TOTAL |
METRIC.F1_Score | Harmonic mean of precision and recall = 2METRIC.RecallMetric.Precision/(METRIC.Recall + METRIC.Precision) |
TRUTH.TOTAL | Total number of truth variants |
TRUTH.TP | Number of true-positive calls in truth representation (counted via the truth sample column) |
TRUTH.FN | Number of false-negative calls = calls in truth without matching query call |
QUERY.TOTAL | Total number of query calls |
QUERY.TP | Number of true positive calls in query representation (counted via the query sample column) |
QUERY.FP | Number of false-positive calls in the query file (mismatched query calls within the confident regions) |
QUERY.UNK | Number of query calls outside the confident regions |
FP.gt | Number of genotype mismatches (alleles match, but different zygosity) |
FP.al | Number of allele mismatches (variants matched by position and not by haplotype) |
TRUTH.TOTAL.TiTv_ratio | Transition / Transversion ratio for all truth variants |
TRUTH.TOTAL.het_hom_ratio | Het/Hom ratio for all truth variants |
TRUTH.FN.TiTv_ratio | Transition / Transversion ratio for false-negative variants |
TRUTH.FN.het_hom_ratio | Het/Hom ratio for false-negative variants |
TRUTH.TP.TiTv_ratio | Transition / Transversion ratio for true positive variants |
TRUTH.TP.het_hom_ratio | Het/Hom ratio for true positive variants |
QUERY.FP.TiTv_ratio | Transition / Transversion ratio for false positive variants |
QUERY.FP.het_hom_ratio | Het/Hom ratio for false-positive variants |
QUERY.TOTAL.TiTv_ratio | Transition / Transversion ratio for all query variants |
QUERY.TOTAL.het_hom_ratio | Het/Hom ratio for all query variants |
QUERY.TP.TiTv_ratio | Transition / Transversion ratio for true positive variants (query representation) |
QUERY.TP.het_hom_ratio | Het/Hom ratio for true positive variants (query representation) |
QUERY.UNK.TiTv_ratio | Transition / Transversion ratio for unknown variants |
QUERY.UNK.het_hom_ratio | Het/Hom ratio for unknown variants |
Subset.Size | When using stratification regions, this gives the number of nucleotides contained in the current subset |
Subset.IS_CONF.Size | This gives the number of confident bases (-f regions) in the current subset |