Fritz Sedlazeck edited this page Feb 20, 2017 · 13 revisions

# What is SURVIVOR?

SURVIVOR incorporates a set of tools to:

* Simulate SVs and evaluate existing callers
* Merge and compare SVs within a sample and among populations/samples
* Convert different formats to VCF files
* Summarize the results within VCF files or results from SURVIVOR

The methods are independent of the sequencing technology or variant caller.

# Download

The newest release is available from the GitHub releases page as a `.zip` or `.tar.gz` archive.

## How to build SURVIVOR

```
wget https://github.com/fritzsedlazeck/SURVIVOR/archive/master.tar.gz -O SURVIVOR.tar.gz
tar xzvf SURVIVOR.tar.gz
cd SURVIVOR-master/Debug/
make
./SURVIVOR
```

# Quick start

Here we just list the three top methods implemented in SURVIVOR. See Methods & Parameter for a list of all methods and parameters.

## Simulate and Evaluate SVs

The simulation of SVs is explained in more detail here: Methods & Parameter

First, we generate a parameter file:

```
./SURVIVOR 1 parameter_file
```

You can edit the parameter file like any other text file. Note that the position of each line is critical, so do not add or remove lines; only alter the values to what you need.

Second, we simulate the SVs on a given reference sequence (e.g. ref.fa), which has to be a FASTA file.

```
./SURVIVOR 1 ref.fa parameter_file 0 simulated
```

Note that you can choose whether to simulate reads based on the generated sequence and map them to the original reference genome (option 0), or to use e.g. real reads and map them to the simulated genome (option 1). The modified FASTA file will be named simulated.fasta. Furthermore, we report the simulated events (simulated.bed) and the insertion sequences (simulated.insertions.fa).

We have been using e.g. Mason as a read simulator, but you can use any simulator you want.

To evaluate the call set:

```
./SURVIVOR 3 caller.vcf simulated.bed 10 eval_res
```

This evaluates the calls in caller.vcf against our previously simulated data set, allowing a maximum wobble of 10 bp around the predicted breakpoints. The results are stored in eval_res*.

At the end of the run a summary line is printed:

```
Overall: 20 11/0/0/0/9 0/0/0/0/0 0/0/0/0/0 1 0
```

This shows that 20 SVs were simulated, followed by the true positives (simulated and found), the false negatives (simulated but not found) and the false positives (not simulated but found). Each of these is reported by type (DEL/DUP/INV/TRA/INS). The second to last number is the sensitivity and the last number is the FDR (false discovery rate). In this particular case we simulated 20 SVs, of which 11 deletions and 9 insertions were found again, and no SVs were missed or additionally reported.
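The per-type counts can be aggregated to re-derive both summary statistics (sensitivity = TP/(TP+FN), FDR = FP/(TP+FP)). As a sketch, with the field layout inferred from the example line above, an awk one-liner can recompute them:

```shell
# Parse SURVIVOR's "Overall:" line and recompute sensitivity and FDR.
# Field layout (inferred from the example above): total, TP per type,
# FN per type, FP per type, sensitivity, FDR.
echo "Overall: 20 11/0/0/0/9 0/0/0/0/0 0/0/0/0/0 1 0" |
awk '{
  split($3, tp, "/"); split($4, fn, "/"); split($5, fp, "/")
  TP = FN = FP = 0
  for (i = 1; i <= 5; i++) { TP += tp[i]; FN += fn[i]; FP += fp[i] }
  printf "TP=%d FN=%d FP=%d sensitivity=%.2f FDR=%.2f\n",
         TP, FN, FP, TP / (TP + FN), (TP + FP) ? FP / (TP + FP) : 0
}'
# prints: TP=20 FN=0 FP=0 sensitivity=1.00 FDR=0.00
```

In a real evaluation you would pipe the SURVIVOR output through `grep '^Overall:'` into the same awk program.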

## Comparing SVs within a sample / Consensus calling

To reduce the false positive rate of individual callers while retaining a high sensitivity, we suggest using multiple callers for one sample. For example, we often run Manta, Delly and Lumpy on our short-read data. All these methods provide a VCF file as a result.

SURVIVOR can be used to compare these VCF files and obtain a consensus call set from them. Note that in general SURVIVOR can incorporate any technology or caller, as long as the SVs are represented in a VCF file.

First, we collect all the VCF files that we want to merge in a list file:

```
ls *vcf > sample_files
```

Next, we use SURVIVOR to obtain a call set.

```
./SURVIVOR 5 sample_files 1000 2 1 1 30 sample_merged.vcf
```

This merges all the VCF files specified in sample_files, using a maximum allowed distance of 1 kb between breakpoints. Furthermore, we ask SURVIVOR to only report calls supported by at least 2 callers that agree on the type (1) and on the strand (1) of the SV. You can change this behavior by altering these numbers from 1 to e.g. 0. In addition, we told SURVIVOR to only compare SVs that are at least 30 bp long and to write the output to sample_merged.vcf.

Note that SURVIVOR will additionally produce a Venn diagram file, which is not filtered by the minimum number of supporting callers and can be used to visualize the results.
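To see how many callers support each merged call, you can tally an INFO tag. The sketch below assumes the merged VCF annotates every record with a `SUPP=<n>` tag (as recent SURVIVOR releases do); the single VCF record piped in here is a made-up minimal example, and in practice you would run the awk command on sample_merged.vcf directly:

```shell
# Tally merged SV calls by their number of supporting callers.
# Assumption: the INFO column (field 8) carries SUPP=<n>, as in recent
# SURVIVOR versions. The input record is fabricated for illustration.
printf 'chr1\t100\tDEL00001\tN\t<DEL>\t.\tPASS\tSUPP=2;SVTYPE=DEL\n' |
awk -F'\t' '!/^#/ && match($8, /SUPP=[0-9]+/) {
  counts[substr($8, RSTART + 5, RLENGTH - 5)]++
}
END { for (n in counts) print "supported by " n ": " counts[n] " call(s)" }'
# prints: supported by 2: 1 call(s)
```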

## Identify regions of potentially falsely called SVs

In our experiments we often want to filter out regions that are tricky to align. This is often done by requiring a higher mapping quality (MQ, e.g. 20) when filtering reads. This filters out less reliable reads, but other reads may still map to the same region with a mapping quality above e.g. 20, which can result in artificial SV calls. We developed the following, more conservative approach.

First, we extract the low mapping quality reads from our sorted BAM file (e.g. our.sort.bam):

```
samtools view -H our.sort.bam > lowMQ.sam
samtools view our.sort.bam | awk '$5<5 {print $0}' >>  lowMQ.sam
samtools view -S -b -h lowMQ.sam > lowMQ.bam
```

Now we should have a BAM file of all the reads that have an MQ < 5 (the awk statement in the second line). Next we compute the per-base coverage:

```
samtools depth lowMQ.bam > lowMQ.cov
```

Next we use SURVIVOR to cluster the coverage track into a bed file for filtering:

```
./SURVIVOR 12 lowMQ.cov 10 2 > lowMQ.bed
```

Here we cluster regions together that are at most 10 bp apart and only consider positions whose coverage is larger than 2. The resulting BED file can be used to filter SV calls or read files using e.g. bedtools.
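For illustration, this clustering step can be approximated in plain awk: keep positions whose depth exceeds the threshold and merge positions on the same chromosome that lie within the maximum distance into BED intervals. This is a simplified sketch on made-up depth records, not a replacement for `SURVIVOR 12`:

```shell
# Merge depth positions (chrom, pos, depth) with depth > 2 into BED intervals,
# joining positions at most 10 bp apart. Input rows are made up for the demo.
printf 'chr1\t100\t5\nchr1\t105\t7\nchr1\t200\t9\nchr1\t201\t1\n' |
awk -v maxdist=10 -v mincov=2 '
  $3 > mincov {
    if ($1 == chr && $2 - end <= maxdist) { end = $2; next }
    if (chr != "") print chr "\t" (start - 1) "\t" end   # BED is 0-based, half-open
    chr = $1; start = $2; end = $2
  }
  END { if (chr != "") print chr "\t" (start - 1) "\t" end }'
# prints two intervals: chr1 99-105 and chr1 199-200
```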
