This section presents several experiments on scalability and the cost of the system.
In the following experiment, we tested AnnotationHive and Annovar for one sample (HG00096) of the 1000 Genomes Project with over 4.2M variants, and for all 1000 samples with over 85.2M variants against the following five annotation datasets:
In total, over 16B annotation records were processed. The y-axis is logarithmic and shows the execution time in minutes; the x-axis shows the number of variants. In both cases, AnnotationHive ran roughly two orders of magnitude faster than Annovar. For this experiment, we used n1-highmem-16 instances for both Annovar and AnnotationHive's Dataflow sort function.
We compared the annotated VCF files for the BRCA1 region. The records were identical except for three records with genotype values of 0, which Annovar included in its output. We filtered out variants for which every genotype value was less than or equal to 0.
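The genotype filter described above can be illustrated with a minimal sketch. It is not AnnotationHive's actual implementation: the function names (`parse_gt`, `has_alt_call`, `filter_variants`) and the record layout are hypothetical, and the sketch assumes each sample's genotype is given as a plain VCF GT string such as `0|1` or `./.`, with a missing allele (`.`) mapped to -1.

```python
import re
from typing import Dict, List


def parse_gt(gt: str) -> List[int]:
    """Split a VCF GT string (e.g. '0|1', './.') into allele indices.

    Assumption: a missing allele ('.') is encoded as -1.
    """
    return [-1 if allele == "." else int(allele) for allele in re.split(r"[|/]", gt)]


def has_alt_call(sample_gts: List[str]) -> bool:
    """Return True if at least one allele in any sample is greater than 0."""
    return any(allele > 0 for gt in sample_gts for allele in parse_gt(gt))


def filter_variants(records: List[Dict]) -> List[Dict]:
    """Drop variants for which every genotype value is <= 0."""
    return [rec for rec in records if has_alt_call(rec["genotypes"])]


# Illustrative records only; ids and genotypes are made up.
records = [
    {"id": "var_a", "genotypes": ["0|0", "0|1"]},  # kept: one sample carries allele 1
    {"id": "var_b", "genotypes": ["0|0", "0|0"]},  # dropped: all genotype values are 0
    {"id": "var_c", "genotypes": ["./.", "0|0"]},  # dropped: missing (-1) and 0 only
]
kept = filter_variants(records)
```

Under this rule, a variant whose samples are all homozygous reference or missing does not appear in the filtered output, which is the kind of record on which the two tools' outputs differed.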