🔴 ⚠️ SEDEF has been deprecated. Please use BISER (SEDEF's successor) instead. ⚠️ 🔴
SEDEF is a tool for quick detection of segmental duplications in a genome.
SEDEF was presented at ECCB 2018; the paper is available at DOI 10.1093/bioinformatics/bty586.
| 👨 Human (hg38) | 👨 Human (hg19) | 🐭 Mouse (mm8) |
|---|---|---|
| Final calls | Final calls | Final calls |
The experiment pipeline from the paper is described in this Jupyter notebook.
Simple! Do this:
git clone https://github.com/vpc-ccg/sedef
cd sedef
make -j release
By default, SEDEF uses the Intel C++ compiler. If you are using g++, build with:
make -j release CXX=g++
If you are using Clang on macOS, compile as follows:
brew install libomp
make -j release OPENMP="-Xpreprocessor -fopenmp" CXX=clang++
You need at least g++ 5.1.0 (C++14) to compile SEDEF. Clang should work fine as well.
SEDEF requires the Boost libraries to compile. If Boost is installed in a non-standard directory, you can still compile as follows:
CPATH={path_to_boost} make -j release
The genome assembly must be soft-masked (i.e., all common and tandem repeats converted to lower-case letters) and indexed. Suppose that our genome is hg19.fa (we use the UCSC hg19 genome with the 24 standard chromosomes, containing neither patches (unGl) nor random strains (chrXX_random)).
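If your assembly is not yet soft-masked, one way to prepare it is sketched below. This is a suggestion of ours rather than part of SEDEF itself: RepeatMasker's -xsmall flag writes repeats in lower case (you may also want to mask tandem repeats, e.g. with TRF, depending on your assembly), and samtools faidx builds the index.

```bash
# Sketch: soft-mask and index an assembly (assumes RepeatMasker and samtools are installed)
RepeatMasker -species human -pa 8 -xsmall hg19.fa   # -xsmall = lower-case (soft) masking
mv hg19.fa.masked hg19.fa                           # continue with the soft-masked sequence
samtools faidx hg19.fa                              # build the .fai index used below
```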
Just go to the sedef directory and run:
./sedef.sh -o <output> -j <jobs> <genome>
For example, to run hg19.fa on 80 cores type:
./sedef.sh -o sedef_hg19 -j 80 hg19.fa
You can add -f if sedef_hg19 already exists (note that this will overwrite its existing content). The final results will be located in sedef_hg19/final.bed.
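A couple of quick checks can be done on the output directly from the shell; this is a sketch on our part, relying only on the standard BEDPE columns described below:

```bash
wc -l sedef_hg19/final.bed                  # number of reported SD pairs
cut -f1-6 sedef_hg19/final.bed | head -n 3  # coordinates of the first few SD mates
```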
Please note that sedef.sh depends on Samtools and GNU Parallel. If you want to experiment with different parameters, run sedef help for parameter documentation.
Output will be located in <out_dir>/final.bed.
The fields of the BEDPE file are as follows. The first six fields are the standard BEDPE fields describing the coordinates of the SD mates: chr1, start1, end1, chr2, start2, and end2.
The other fields (in order of appearance) are the alignment statistics described below, such as score, fracMatch, fracMatchIndel, filter_score, and comment. All errors are expressed as ratios (0.0--1.0) of the alignment length unless otherwise noted.
Warning: as per WGAC, when calculating the similarity and error rates (fields score, fracMatch, fracMatchIndel and filter_score), SEDEF counts a gap as a single error (so the hypothetical alignment of A-----GC and AT-----C will have error 4 and NOT 8). This might lead to SDs with rather large gap content. For further filtering, consult the comment field, which provides the percentage of match/mismatch and gap bases.
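As an illustration of such extra filtering, the sketch below keeps only calls above a similarity threshold. The column index of fracMatch here is an assumption on our part (check the layout of your final.bed before using it):

```bash
# Hypothetical: assumes fracMatch is column 10; verify against your SEDEF output first
awk -F'\t' '$10 >= 0.90' sedef_hg19/final.bed > sedef_hg19/final.sim90.bed
```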
First make sure to index the genome:
samtools faidx hg19.fa
Then run sedef search in parallel (in this example, we will use GNU Parallel) to get the initial seeds:
mkdir -p out      # For the output
mkdir -p out/log  # For the logs

for i in `seq 1 22` X Y; do
  for j in `seq 1 22` X Y; do
    SI=`awk '$1=="chr'$i'" {print $2}' hg19.fa.fai`;
    SJ=`awk '$1=="chr'$j'" {print $2}' hg19.fa.fai`;
    # Process each chromosome pair only once (shorter chromosome first)
    if [ "$SI" -le "$SJ" ]; then
      # Run each pair twice: m=y adds -r for the reverse-complement search
      for m in y n; do
        [ "$m" == "y" ] && rc="-r" || rc="";
        echo "sedef search $rc hg19.fa chr$i chr$j >out/${i}_${j}_${m}.bed 2>out/log/${i}_${j}_${m}.log"
      done;
    fi
  done
done | time parallel --will-cite -j 80 --eta
# Now make sure that all runs completed successfully
grep Total out/log/*.log | wc -l
# You should see 600 here (or n(n+1) if your file has n chromosomes)
# Get the single-core running time
grep Wall out/log/*.log | tr -d '(' | awk '{s+=$4}END{print s}'
# Get the maximum memory usage as well
grep Memory out/log/*.log | awk '{if($3>m)m=$3}END{print m}'
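If the count comes up short, the following sketch (our addition, based on the same "Total" marker used above) lists the logs of jobs that did not finish, so you can re-run just those pairs:

```bash
# List search jobs whose log lacks the "Total" line (i.e., did not complete)
for f in out/log/*.log; do
  grep -q Total "$f" || echo "incomplete: $f"
done
```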
Then use sedef align to bucket the files for optimal parallel alignment. Afterwards, start the whole alignment:
# First bucket the reads into 1000 bins
mkdir -p out/bins
mkdir -p out/log/bins
time sedef align bucket -n 1000 out out/bins
# Now run the alignment
for j in out/bins/bucket_????; do
  k=$(basename $j);
  echo "sedef align generate -k 11 hg19.fa $j >${j}.bed 2>out/log/bins/${k}.log"
done | time parallel --will-cite -j 80 --eta
# Make sure that all runs finished nicely
grep Finished out/log/bins/*.log | wc -l
# Should be number of bins (in our case, 1000)
# Get again the total running time
grep Wall out/log/bins/*.log | tr -d '(' | awk '{s+=$4}END{print s}'
# And the memory
grep Memory out/log/bins/*.log | awk '{if($3>m)m=$3}END{print m}'
Finally, run sedef stats to produce the final output:
# Concatenate the files
cat out/*.bed > out.bed # seed SDs
cat out/bins/bucket_???? > out.init.bed # potential SD regions
cat out/bins/*.bed | sort -k1,1V -k9,9r -k10,10r -k4,4V -k2,2n -k3,3n -k5,5n -k6,6n |\
uniq > out.final.bed # final chains
# Now get the final calls
sedef stats generate hg19.fa out.final.bed |\
sort -k1,1V -k9,9r -k10,10r -k4,4V -k2,2n -k3,3n -k5,5n -k6,6n |\
uniq > out.hg19.bed
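As a final sanity check (our suggestion, not part of the pipeline itself), compare the sizes of the intermediate and final outputs; out.hg19.bed holds the final calls:

```bash
# Line counts will vary by assembly and parameters
wc -l out.bed out.init.bed out.final.bed out.hg19.bed
```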
SEDEF uses {fmt}, argh, and a modified version of Heng Li's ksw2.
Questions, bugs? Open a GitHub issue or drop me an e-mail at inumanag at mit dot edu.