GitHub

PartSortBWT

Low memory BWT computation. Usage:

std::string input;
input.push_back(1); // N
input.push_back(2); // A
input.push_back(3); // C
input.push_back(4); // G
input.push_back(5); // T
input.push_back(0); // $
std::string output;
PartSortBWT(input, output); // computes BWT, stores in output
PartSortBWT(input, input); // computes BWT, stores in the same string and overwrites

The input and output can be the same string. For a real example see src/main.cpp. The input string must end with 0, and all other characters must be between 1-5 representing the alphabet $NACGT.

Benchmark

Measured with /usr/bin/time -v bin/main input_file > /dev/null using commit 489c112 on a laptop.

Dataset	Size	Time (h:mm:ss)	Memory
Human assembly	6.0 Gbp	1:15:51	9.0 Gb
GRCh38+alts	3.3 Gbp	0:40:01	4.9 Gb
Assembly graph hpc	3.8 Gbp	0:31:11	6.5 Gb
CHM13	3.1 Gbp	0:23:45	4.6 Gb
CHM13 hpc	2.1 Gbp	0:12:02	3.6 Gb
CHM13 chr1	248 Mbp	0:01:40	375 Mb
CHM13 chr1 hpc	175 Mbp	0:00:49	291 Mb
ACACACAC...	1 Mbp	0:05:25	11 Mbp
ACACACAC...	100 kbp	0:00:02	3 Mbp

Method

Classify suffixes according to their first 4 characters. First collect and sort all suffixes starting with AAAA to get the first x characters of the BWT, then suffixes with AAAC to get the next y characters, etc. Repeat for all 4-character prefixes. Special case for homopolymer runs of >=4 chars: sort the suffixes starting at the ends of the homopolymer runs, then do induced sorting to sort the rest of the homopolymer runs.

Expected memory use for a random DNA string is input + output + ~0.875n bytes. If input == output then total memory use is ~1.875n bytes. Less random strings have higher memory use and worst case is dinucleotide repeating (eg ACACACAC...) using ~8n bytes. Expected runtime for a random string is O(n (log n)^2) and worst case is O(n^2 log n).

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
src		src
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
makefile		makefile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PartSortBWT

Benchmark

Method

About

Releases

Packages

Contributors 2

Languages

License

maickrau/PartSortBWT

Folders and files

Latest commit

History

Repository files navigation

PartSortBWT

Benchmark

Method

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages