Minimizer based sparse de Bruijn Graph constructor. Homopolymer compress input sequences, pick syncmers from hpc-compressed sequences, connect syncmers with an edge if they are adjacent in a read, unitigify and homopolymer decompress. Suggested input is PacBio HiFi/CCS reads, or ONT duplex reads. May or may not work with Illumina reads. Not suggested for PacBio CLR or regular ONT reads. Algorithmic details and citation: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab004/6104877
Bioconda: conda install -c bioconda mbg
git clone https://github.com/maickrau/MBG.git
cd MBG
git submodule update --init --recursive
make bin/MBG
MBG -i input_reads.fa -o output_graph.gfa -k kmer_size -w window_size -a kmer_min_abundance -u unitig_min_abundance
eg MBG -i reads.fa -o graph.gfa -k 1501 -w 1450 -a 1 -u 3
Multiple read files can be inputted with "-i file1.fa -i file2.fa" etc. Input read type can be .fa / .fq / .fa.gz / .fq.gz.
-k
: k-mer size. Must be odd and at least 31-w
: window size. Cannot be greater than k-30. Default k-30-a
: minimum k-mer abundance. Discard k-mers whose coverage is less than this. Default 1-u
: minimum unitig abundance. Discard unitigs whose average coverage is less than this. Default 2-t
: number of threads. Default 1-i
: input read files. Can use multiple times to input multiple files. Format .fa/.fasta/.fa.gz/.fasta.gz/.fq/.fastq/.fq.gz/.fastq.gz-o
: output graph file
Other options:
-r
: assemble using a multiplex DBG and increase k-mer size up tor
. See Bankevich et al. 2020 for a description of the multiplex DBG algorithm.-R
: allow the multiplex DBG resolution to remove low coverage k-mers whenk<R
. Often low coverage short k-mers are sequencing errors so a lowR
(around 1000-4000) can improve the assembly by removing sequencing errors.-h
: print help-v
: print version--blunt
: output a graph without edge overlaps. Cannot be combined with-r
or--output-sequence-paths
--error-masking
: mask errors in the input reads. Options arehpc
(default) to mask homopolymer errors and call consensus on homopolymer run lengths,no
to disable error masking,collapse
to collapse homopolymer runs to one base pair (not recommended outside testing),collapse-dinuc
to collapse homopolymer runs and call consensus on dinucleotide runs,collapse-msat
to collapse homopolymer runs and call consensus on microsatellite where the repeat motif is up to 6bp long,dinuc
to call consensus on dinucleotide runs,msat
to call consensus on microsatellites where the repeat motif is up to 6bp long.--include-end-kmers
: force k-mers at the ends of input sequences to be picked. Recommended if building from a reference of a linear genome, not recommended when building from reads or from a reference of a circular genome. Uses significantly more time and memory but prevents the last up tow
base pairs at chromosome ends from being clipped.--output-sequence-paths
: output the paths of the input sequences as alignments in GAF format to the given file. The alignments are exact in homopolymer compressed space but might differ in homopolymer run lengths.--do-unsafe-guesswork-resolutions
: use extra heuristics during multiplex DBG resolution. Typically leads to slightly more resolved assemblies but might introduce misassemblies.--hpc-variant-onecopy-coverage
: separate k-mers based on their homopolymer (with--error-masking=hpc
) or microsatellite (with--error-masking=msat
orcollapse-msat
) variation.
k and w can be arbitrarily large but at some point the error rate and limited read length will cause the graph to be fragmented. Runtime stays approximately the same if the ratio k/w is kept constant. All repeats shorter than k are separated, all repeats longer than k+w are collapsed, and repeats in between may be separated or collapsed depending on if a k-mer was selected from within the repeat. When using --blunt
, you should clean the graph afterwards with vg. --blunt
uses an extension of an algorithm invented by Hassan Nikaein (personal communication).