
Converting to the fmlrc RLE BWT format

Matt Holt edited this page Feb 9, 2019 · 1 revision


The Run-Length Encoded (RLE) format primarily reduces the disk space required to store the MSBWT. As a side effect, it also reduces computation, especially for high-coverage genomic datasets.
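
To illustrate why run-length encoding helps, the sketch below collapses each run of identical symbols into the symbol followed by its run length. This is only a conceptual illustration: `runs_of` is a hypothetical helper, and its text output is not the on-disk format that fmlrc uses.

```shell
# Hypothetical helper (not part of fmlrc): print each run in $1 as <symbol><length>.
runs_of() {
    printf '%s\n' "$1" | awk '{
        n = length($0); out = ""; i = 1
        while (i <= n) {
            c = substr($0, i, 1); j = i
            # Advance j to the first position that no longer matches c.
            while (j <= n && substr($0, j, 1) == c) j++
            out = out c (j - i)
            i = j
        }
        print out
    }'
}

# A BWT with long runs, as high coverage tends to produce, collapses to a few pairs.
runs_of 'AAAACCCT$$'
```

The longer the runs, the fewer (symbol, length) pairs are needed, which is why the saving grows with coverage.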

Many other tools for constructing the MSBWT exist that are not part of this package. As of fmlrc v1.0.0, we provide a "fmlrc-convert" pre-processing program that takes an MSBWT string on standard input and saves it in the RLE format we use. For example, the MSBWT of the string "ACAT$" is "T$CAA", so the following command converts that string to the RLE format and stores it on disk:

echo -e "T\$CAA" | fmlrc-convert /path/to/output/comp_msbwt.npy
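
The "T$CAA" value in the example can be checked with standard tools. The sketch below builds every rotation of a single "$"-terminated string, sorts the rotations bytewise, and keeps the last character of each. `bwt_of` is a hypothetical helper for this one-string case only; it is not how fmlrc or any MSBWT builder works internally, and it would not scale to real read sets.

```shell
# Hypothetical helper (not part of fmlrc): naive BWT of one '$'-terminated string.
bwt_of() {
    printf '%s\n' "$1" |
    # Emit every rotation of the input, one per line.
    awk '{ n = length($0)
           for (i = 1; i <= n; i++) print substr($0, i) substr($0, 1, i - 1) }' |
    # Bytewise sort so that "$" orders before A, C, G, N and T.
    LC_ALL=C sort |
    # The BWT is the last column of the sorted rotations.
    awk '{ printf "%s", substr($0, length($0), 1) } END { print "" }'
}

bwt_of 'ACAT$'
```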

Note that this tool does not perform any sanity checks on its input; it only performs the compression and conversion. Giving the conversion tool an input that does not follow our definition of an MSBWT may have unexpected consequences in downstream queries.

Recommended build method

We recommend using ropebwt2 to build the MSBWT:

Because of differences in the BWT encoding, the data going into and coming out of ropebwt2 must be manipulated. The following commands first create a sorted, plain-text read file, then run that data through ropebwt2 and fmlrc-convert to create the RLE-BWT that fmlrc expects.

gunzip -c reads.fq.gz | awk 'NR % 4 == 2' | sort | gzip > reads.sorted.txt.gz
gunzip -c reads.sorted.txt.gz | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc-convert /path/to/output/comp_msbwt.npy

If an intermediate file is unnecessary, then these two lines can be combined into one command:

gunzip -c reads.fq.gz | awk 'NR % 4 == 2' | sort | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc-convert /path/to/output/comp_msbwt.npy
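
The text-processing stages of the pipeline can be tried on a toy FASTQ without ropebwt2 or fmlrc-convert installed. In the sketch below, `extract_sorted_reads` and `swap_n_t` are hypothetical wrapper names around the same awk, sort, and tr calls used above, and the two reads are made up. Our understanding, stated here as an assumption, is that the N/T swap exists because ropebwt2 orders N after T while the fmlrc format orders N before T; swapping before and after ropebwt2 compensates for that.

```shell
# Hypothetical wrappers around the stages used in the commands above.
extract_sorted_reads() {
    # Line 2 of every 4-line FASTQ record is the sequence; keep it, then sort.
    awk 'NR % 4 == 2' | sort
}
swap_n_t() {
    # Exchange N and T; applying this twice restores the original text.
    tr NT TN
}

# Two made-up reads in FASTQ form.
toy_fastq() {
    printf '@r1\nTTGA\n+\nIIII\n@r2\nACNT\n+\nIIII\n'
}

# What ropebwt2 would receive as input.
toy_fastq | extract_sorted_reads | swap_n_t
```

Because the swap is its own inverse, the second `tr NT TN` after ropebwt2 maps the symbols in the BWT back to their original letters.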

For more details on what these commands do, or on the RLE-BWT format specification, please refer to "Converting to Run-Length Encoded (RLE) format" from the msbwt python package.
