Skip to content
/ biser Public

A fast tool for detecting and decomposing segmental duplications in genome assemblies

License

Notifications You must be signed in to change notification settings

0xTCG/biser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BISER

BISER (🦪🔮; Brisk Inference of Segmental duplication Evolutionary stRucture) is a fast tool for detecting and decomposing segmental duplications (SDs) in a single genome or multiple genomes. BISER is SEDEF's successor.

Instalation

BISER needs Python 3.7+ and Samtools to run.

To install BISER, just run:

pip install biser

If you wish to build BISER from source, you will also need Codon programming language with the Seq plugin. To install BISER from source, run:

pip install git+https://github.com/0xTCG/biser.git

See Dockerfile for detailed instructions to build BISER from source.

Usage

Single genome

To find SDs in a single genome, just run:

biser -o <output> -t <threads> <genome.fa>

BISER will also produce a file called output.elem that will contain the elementary SD decomposition of the found SDs.

All genomes should be indexed beforehand with samtools faidx genome.fa.

⚠️: BISER requires a soft-masked or a hard-masked genome assemblies for the optimal performance. Check for the presence of lowercase bases in your genome; if you have them, you are good to go.

⚠️: If you are experiences crashes on Linux machines (especially in cluster environments), try setting --gc-heap 1G (or higher).

Multiple genomes

To find SDs in multiple genomes, just run:

biser -o <output> -t <threads> <genome1.fa> <genome2.fa> ...

Other options

Usage: biser [-h] [--temp TEMP] [--threads THREADS] --output OUTPUT [--hard]
             [--keep-contigs] [--keep-temp] [--no-decomposition]
             genomes [genomes ...]

Positional arguments:
  genomes               Indexed genomes in FASTA format.

Optional arguments:
  -h, --help            show this help message and exit
  --temp TEMP, -T TEMP  Temporary directory location
  --threads THREADS, -t THREADS
                        Number of threads
  --output OUTPUT, -o OUTPUT
                        Indexed genomes in FASTA format.
  --hard, -H            Are input genomes already hard-masked?
  --keep-contigs        Do not ignore contigs, unplaced sequences, alternate
                        alleles, patch chromosomes and mitochondrion sequences
                        (i.e., chrM and chromosomes whose name contains
                        underscore). Enable this when running BISER on
                        scaffolds and custom assemblies.
  --keep-temp, -k       Keep temporary directory after the execution. Useful
                        for debugging.
  --resume RESUME       Resume the previously interrupted run (that was run
                        with --keep-temp; needs the temp directory for
                        resume).
  --no-decomposition    Skip SD decomposition step.
  --max-error MAX_ERROR
                        Maximum SD error (large gaps includes).
  --max-edit-error MAX_EDIT_ERROR
                        Maximum SD edit error (large gaps NOT included).
  --max-chromosome-size MAX_CHROMOSOME_SIZE
                        Maximum chromosome size.
  --kmer-size KMER_SIZE
                        Search k-mer size.
  --winnow-size WINNOW_SIZE
                        Search winnow size.
  --version, -v         show program's version number and exit
  --gc-heap GC_HEAP     Set GC_INITIAL_HEAP_SIZE.

Output format

The output follows the BEDPE file format.

The first six (6) fields are the standard BEDPE fields describing the coordinates of SD mates:

  • chr1, start1 and end1
  • chr2, start2 and end2 (both intervals are semi-open and 0-indexed).

Other fields are as follows:

Field Description
reference Reference genome names of the first and the second mate, separated by :.
score Total alignment error (0--100%): the number of mismatches and indels divided by the total alignment span.
strand1 Strand (+ or -) of the first SD mate.
strand2 Strand (+ or -) of the second SD mate.
max_len Length of the longer mate.
aln_len Alignment span (mate length with gaps included)
cigar CIGAR string that describes the alignment
optional Optional fields in the format NAME=VALUE;.... Currently contains the mismatch rate (starts with X=) and the gap rate (starts with ID=).

In addition to BEDPE output, BISER might also output the decomposition file (with the .elem extension) as well. This file contains the list of core SD regions in the analyzed reference genomes. The format of decomposition file is as follows:

Field Description
reference Reference genome name.
start Start position of the core region (0-indexed).
end End position of the core region.
id Core region. Note that many regions share the same core ID because core regions are duplicated across the genome(s).
len Length of the core region.
score Core region score (internal use only).
strand Strand (+ or -) of the core region.

Paper & Simulations

BISER was published in the Algorithms for Molecular Biology and was presented at the WABI 2021.

Please cite:

Išerić, H., Alkan, C., Hach, F. et al. Fast characterization of segmental duplication structure in multiple genome assemblies. Algorithms Mol Biol 17, 4 (2022). https://doi.org/10.1186/s13015-022-00210-2

BibTeX entry:

@article{ivseric2022fast,
  title={Fast characterization of segmental duplication structure in multiple genome assemblies},
  author={I{\v{s}}eri{\'c}, Hamza and Alkan, Can and Hach, Faraz and Numanagi{\'c}, Ibrahim},
  journal={Algorithms for Molecular Biology},
  volume={17},
  number={1},
  pages={1--15},
  year={2022},
  publisher={Springer}
}

Paper simulations are available in paper directory.

Changelog

  • BISER v1.4 (Mar 2023):
    • Change of alignment refinement heuristics (should be faster now)
      • Note: SDs generated with v1.4 might be slightly different than those generated by the earlier version
    • Switch to Codon
    • Minor bugfixes

Contact

Please reach out to Ibrahim Numanagić.