.ooooo. .ooooo. .oooo.o ooo. .oo. .oo. .ooooo.
d88' `"Y8 d88' `88b d88( "8 `888P"Y88bP"Y88b d88' `88b
888 888 888 `"Y88b. 888 888 888 888 888
888 .o8 888 888 o. )88b 888 888 888 888 888
`Y8bod8P' `Y8bod8P' 8""888P' o888o o888o o888o `Y8bod8P'
Version: 0.5.1
Cosmo is a fast, low-memory DNA assembler that uses a succinct de Bruijn graph.
After compiling, you can run Cosmo like so:
$ pack-edges <input_file> # this adds reverse complements and dummy edges, and packs them
$ cosmo-build <input_file>.packed # compresses and builds indices
$ cosmo-assemble <input_file>.packed.dbg # output: <input_file>.packed.dbg.fasta # NOT IMPLEMENTED YET
Here, input_file is the binary output of a DSK run. Each program has a --help option that describes its usage in more detail.
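The three steps above can be chained from a small driver script. The following is only a sketch: it assumes each tool writes its output next to its input using the suffixes shown in the usage lines (.packed, then .packed.dbg), the helper names plan_commands and run_pipeline are hypothetical, and the example file name is made up. No flags are passed, since only the positional form appears above.

```python
import subprocess

def plan_commands(input_file):
    """Return the three commands from the usage above, in order."""
    packed = input_file + ".packed"  # assumed to be written by pack-edges
    dbg = packed + ".dbg"            # assumed to be written by cosmo-build
    return [
        ["pack-edges", input_file],
        ["cosmo-build", packed],
        ["cosmo-assemble", dbg],     # marked NOT IMPLEMENTED YET in the usage above
    ]

def run_pipeline(input_file, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan_commands(input_file):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raise if a step exits non-zero

if __name__ == "__main__":
    run_pipeline("example.solid_kmers_binary")  # hypothetical DSK output name
```

With dry_run left at its default the script only prints the commands, which is a cheap way to check the derived file names before running anything.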
Here are some things that you don't want to let surprise you:
Currently Cosmo only supports DSK files with k <= 64 (that is, blocks of 128 bits or fewer). Support is planned for DSK files with larger k, and possibly for output from other k-mer counters.
Note that since our graph is edge-based, k defines the length of our edges, hence our nodes are only k-1 symbols long.
If you want to construct a succinct de Bruijn graph where the nodes are k-mers, you will need to run DSK with k set to k+1. E.g. using output from $ dsk <input_file> 27 will actually build a 26-dimension de Bruijn graph.
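To make the off-by-one concrete, here is a small illustration (not Cosmo's code) of how one k-mer, treated as an edge, splits into two (k-1)-symbol nodes. The function name edge_to_nodes and the sample sequence are made up for the example.

```python
def edge_to_nodes(kmer):
    """Split a k-mer (an edge) into its source and target (k-1)-mer nodes."""
    return kmer[:-1], kmer[1:]

# A 27-symbol k-mer, matching the `dsk <input_file> 27` example.
kmer = ("ACGT" * 7)[:27]
source, target = edge_to_nodes(kmer)

# Edge length is k; both node labels are k-1 symbols long.
print(len(kmer), len(source), len(target))  # prints: 27 26 26
```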
Note: Both even and odd k values should work with this assembler due to our loop-immune traversal.
Furthermore, most de Bruijn graph based assemblers add edges between all nodes that overlap. Instead, we take the k-mers as our edges (each joining two nodes of length k-1), so we only have edges that were directly represented in the read set. This makes more sense to us, as it reduces unnecessary branching. I may add support for the standard way in the future if anyone wants it (it would be similar to the dummy edge adding code).
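The difference between the two conventions can be seen on a toy input. The sketch below is illustrative only and is not how Cosmo stores its graph: it uses plain Python sets, the function names are hypothetical, and for the sake of comparison both variants use the same (k-1)-mer nodes.

```python
from itertools import product

def kmers_of(reads, k):
    """All distinct k-mers occurring in the reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def observed_edges(reads, k):
    """Edge-based: each observed k-mer is one edge between two (k-1)-mers."""
    return {(km[:-1], km[1:]) for km in kmers_of(reads, k)}

def overlap_edges(reads, k):
    """Overlap-based: join any two observed nodes whose suffix and prefix match."""
    nodes = {n for edge in observed_edges(reads, k) for n in edge}
    return {(u, v) for u, v in product(nodes, repeat=2) if u[1:] == v[:-1]}

reads = ["ACGT", "CGA"]
extra = overlap_edges(reads, 3) - observed_edges(reads, 3)
print(sorted(extra))  # prints: [('GA', 'AC')]
```

The extra edge GA -> AC corresponds to the k-mer GAC, which never appears in the reads; it exists only because the two node labels happen to overlap.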
We currently only output the unitigs (paths between branching nodes).
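As a rough picture of what a unitig is, the following sketch spells the non-branching paths of a small edge set. It is a plain-dictionary illustration under simplifying assumptions, not Cosmo's succinct traversal: edges are given as distinct (source, target) node pairs, and cycles with no branching node are ignored.

```python
from collections import defaultdict

def unitigs(edges):
    """Spell maximal non-branching paths from (source, target) node pairs."""
    out, indeg = defaultdict(list), defaultdict(int)
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1
    nodes = set(out) | set(indeg)

    def internal(n):
        # Exactly one way in and one way out: the path passes straight through.
        return indeg.get(n, 0) == 1 and len(out.get(n, ())) == 1

    result = []
    for start in sorted(nodes):
        if internal(start):
            continue  # unitigs start only at branching or end nodes
        for v in sorted(out.get(start, ())):
            path = start + v[-1]
            while internal(v):
                v = out[v][0]
                path += v[-1]  # each step adds one symbol
            result.append(path)
    return result

edges = {("AC", "CG"), ("CG", "GT"), ("GT", "TA"), ("GT", "TC")}
print(unitigs(edges))  # prints: ['ACGT', 'GTA', 'GTC']
```

Here CG is the only pass-through node, so AC, CG and GT merge into the single unitig ACGT, while the branch at GT yields two short unitigs.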
There is an included Makefile - just type make to build it (assuming you have the dependencies listed below).
To build with "Variable order mode", use the varord=1 flag.
Note: it has only been tested on Mac OS X. Changes to work on any *NIX system should be minor.
- A compiler that supports C++11,
- Boost - ranges and range algorithms, zip iterator, tuple comparison, lots of good stuff,
- SDSL-lite - low level succinct data structures (for now you will have to use my branch if you want to use variable order graphs: clone it and check out the develop branch before compiling),
- TClap - command line parsing,
- DSK - k-mer counting (we need this for input),
- Optionally (for developers): Python and NumPy - rebuilding the lookup tables,
- STXXL - external merging (not actually required yet though)
Many of these can be installed with a package manager (e.g. (apt-get | yum | brew) install boost libstxxl tclap).
However, you will have to download and build these manually: DSK and SDSL-lite.
Implemented by Alex Bowe. Original concept and prototype by Kunihiko Sadakane.
These people also proved incredibly helpful: Rayan Chikhi, Simon Puglisi, Travis Gagie, Christina Boucher, Simon Gog, Dominik Kempa.
Your help is more than welcome! Please fork and send a pull request, or contact me directly :)
Cosmos /ˈkɑz.moʊs/ (n): "An ordered, harmonious whole."
If that doesn't suit an assembly program then I don't know what does. The last s was dropped because it was nicer to say. Furthermore, it is a reference to the Seinfeld character Cosmo Kramer (whose last name I'm often reminded of while working on this stuff).
This software is copyright (c) Alex Bowe 2014 (bowe dot alexander at gmail dot com). It is released under the GNU General Public License (GPL) version 3.