Skip to content

Latest commit

 

History

History
234 lines (185 loc) · 8.62 KB

GATK_comparison.md

File metadata and controls

234 lines (185 loc) · 8.62 KB

Comparing two large GATK files

This is the exercise to explore the differences in the full BWA-GATK pipeline vs. a chunking-scatter-gather approach that is magnitudes faster. Using my tools bio-vcf, bio-table and bio-locus interesting edge effects were found.

The GATK output variant files are similar in size:

-rw-r--r-- 1 pjotr users 992725762 Aug 29 11:22 HiSeqX_R1.fastq.sorted_dedup_realigned.bam.realigned.raw_variants.vcf
-rw-r--r-- 1 pjotr users 987147441 Aug 29 11:26 HiSeqX_R1.fastq_dedup_realigned.bam.realigned.raw_variants.vcf

Naming suggests the second one (scatter) is unsorted but it is actually sorted.

Install bio-vcf, add it to the path if required, and you should see

gem env
export PATH=$GEM_HOME/bin:PATH
gem install bio-vcf
bio-vcf
bio-vcf 0.9.0 (biogem Ruby 2.1.2 with pcows) by Pjotr Prins 2015

Create simple position files with calls

/usr/bin/time -v bio-vcf -e '[r.chrom,r.pos,r.alt]' < scatter/HiSeqX_R1.fastq_dedup_realigned.bam.realigned.raw_variants.vcf > scatter_calls.vcf
/usr/bin/time -v bio-vcf -e '[r.chrom,r.pos,r.alt]' < full/HiSeqX_R1.fastq.sorted_dedup_realigned.bam.realigned.raw_variants.vcf > full_calls.vcf

Count the calls (we are ignoring the limited header info)

wc -l scatter_calls.vcf 
  4773423 scatter_calls.vcf

wc -l full_calls.vcf 
  4795998 full_calls.vcf

(4795998-4773423)/4795998*100 or 0.5%. Do a diff and count de diffs

egrep -c '^>' calls.diff 
  30401
egrep -c '^<' calls.diff 
  52976

52976/4795998*100 or 1.1% of different calls, hmmm. Remove the GL contig

grep -v GL00 calls.diff > calls_wo_GL00.diff
egrep -c '^<' calls_wo_GL00.diff 
  48797

Now install bio-table and bio-locus

 gem install bio-table
 bio-table
   bio-table 1.0.0 Copyright (C) 2012-2014 Pjotr Prins <pjotr.prins@thebird.nl>
 gem install bio-locus
 bio-locus
   bio-locus 0.0.6 (biogem Ruby 2.1.2) by Pjotr Prins 2014

Create a new VCF file using the diff information. Find all the calls

 egrep '^(<|>)' calls_wo_GL00.diff |grep -v GATKCommandLine| grep -v CHROM > all_diff.txt
 bio-table --columns 0,1 < all_diff.txt > chrom_pos_diff.txt

with vi remove first > and < and make sure there is a tab:

 %s/^..//g
 %s/$/^INA/g

Now use bio-locus to create full VCF files containing only these entries

 bio-locus --store --alt exclude < chrom_pos_diff.txt 
 bio-locus 0.0.6 (biogem Ruby 2.1.2) by Pjotr Prins 2014
   Stored 73644 positions out of 75414 in locus.db (1770 duplicate hits)
 bio-locus --match --alt exclude < ../full/HiSeqX_R1.fastq.sorted_dedup_realigned.bam.realigned.raw_variants.vcf > diff_full.vcf
 bio-locus --match --verbose -d < ../scatter/HiSeqX_R1.fastq_dedup_realigned.bam.realigned.raw_variants.vcf > diff_scatter.vcf
 (note: bio-locus is - still - a slow tool)

Interestingly there is very little overlap between the call positions (only 1770 are shared)...

 wc -l diff*.vcf
   48903 diff_full.vcf
   26731 diff_scatter.vcf

So, arguably the difference is (48903+26731)/4795998*100 or 1.6% of calls.

Now we can compare the call contents

 bio-vcf -i -e '[r.chrom,r.pos,r.info.af]' --seval 's.dp' < diff_scatter.vcf > diff_scatter_af_sdp.txt

For those that do match we see a difference in sample read depth pointing out the two methods differ in placing reads. So, let's see if we can find significant differences in frequency and read depth.

First I am reducing the data to one chromosome to be able to work a bit faster

 bio-vcf --filter 'r.chrom=="3"' < ../full/HiSeqX_R1.fastq.sorted_dedup_realigned.bam.realigned.raw_variants.vcf > full.vcf
 bio-vcf --filter 'r.chrom=="3"' < ../scatter/HiSeqX_R1.fastq_dedup_realigned.bam.realigned.raw_variants.vcf > scatter.vcf
 wc -l *.vcf
   301922 full.vcf
   300326 scatter.vcf

 bio-vcf -i -e '[r.pos,r.alt,r.info.af]' --seval 's.dp' < full.vcf > full_af_dp.txt
 bio-vcf -i -e '[r.pos,r.alt, r.info.af]' --seval 's.dp' < scatter.vcf > scatter_af_dp.txt
 wc -l *.txt
   40001 full_af_dp.txt
   40001 scatter_af_dp.txt
 diff scatter_af_dp.txt full_af_dp.txt |grep -c '>'
   2189

Differences typically look like

  < 402884        G       0.5     26
  < 403020        A       0.5     21
  ---
  > 402884        G       0.5     29
  > 403020        A       0.5     25

i.e. the scatter approach has a different read depth 26 instead of 29 and 21 instead of 25 - reads end up in other places. Frequency-wise we don't see much differentce. More intriguing, a difference would be

1525d1529
< 663038        CATATGTTATATGTGTATGTATTGTATACAT 1.0     4

where an extra call for an insertion was made in the scatter approach with a DP of 4.

Now let's quantify how much DP differs

Combine the tables

  bio-table --columns 0,3 < full_af_dp.txt > f_dp.txt 
  bio-table --columns 0,3 < scatter_af_dp.txt > s_dp.txt
  bio-table --merge f_dp.txt s_dp.txt > merged.txt

After editing the table header to show 'pos\tfull\tscatter'

  bio-table --num-filter 'value[1]!=value[0]' < merged.txt |wc -l
    2416

it shows that for chromosome 3, 2416 calls out of 301922 (0.8%) have a different read depth.

  bio-table --verbose --debug --num-filter '(value[1]-value[0]).abs > 3' < merged_no_NA.txt|wc -l
    356

356 (0.1% of total calls) showed a read depth difference larger than 4 reads. And 61 showed a read depth difference larger than 10 reads:

  fedor13:~/bcosc/gvcf/chr3$ bio-table --verbose --debug --num-filter '(value[1]-value[0]).abs > 10' < merged_no_NA.txt 
  bio-table 1.0.0 Copyright (C) 2012-2014 Pjotr Prins <pjotr.prins@thebird.nl>
   INFO bio-table: Array: [{:show_help=>false, :write_header=>true, :skip=>0, :debug=>true, :num_filter=>"(value[1]-value[0]).abs > 10"}]
  DEBUG bio-table: Filtering on (value[1]-value[0]).abs > 10
   INFO bio-table: Array: ["pos", "full", "scatter"]
  pos             full    scatter
  855606          24      9
  855609          24      10
  855610          24      10
  855617          22      9
  1432738         17      3
  1434184         35      22
  3173353         37      7
  3713421         42      11
  3713601         31      13
  3713646         39      24
  3713647         39      24
  3713669         43      29
  3714214         33      19
  3762444         18      3
  3764753         29      15
  3764808         29      15
  3764830         26      15
  3764904         22      11
  3764918         24      13
  3937379         26      13
  3937468         19      7
  4005161         17      4
  4005568         35      22
  4297382         27      13
  4297644         34      20
  4958959         31      10
  4958960         31      10
  4959272         35      11
  4959609         27      9
  4960175         27      5
  6159995         17      0.5
  8432601         34      23
  8432973         24      13
  8432986         26      15
  11414301        12      28
  11414307        12      28
  11414313        12      28
  11414323        12      28
  11528593        21      6
  12070044        39      23
  12070272        34      18
  12071772        34      22
  15010075        27      16
  17740441        43      27
  17740468        52      33
  18367833        26      13
  18367863        29      16
  18367950        32      19
  18973094        39      27
  18976962        30      16
  18977299        20      8
  19000198        42      29
  19000214        40      26
  19898931        15      2
  19898945        13      2
  19898997        16      5
  19899206        37      24
  21293536        40      25
  21293836        26      15
  21294685        36      24

Based on this information (you can see clustering at certain 'hot spots') I suspect that these are border effects in the regions where chunking took place. Note also that scatter more often has less reads which means we are missing reads because of edges (it is a BWA thing).

Even so, even with these downsides, the method may work for rapid diagnostics, provided the chunking affected FN/FP calls do not fall in the regions of interest. For this setup, where time to diagnostic counts (including cancer), it may prove a valuable approach. I also suggest we find a way of setting chunking in regions of little interest (outside coding genes and or regions of low variation as Brad suggested).