LPHash is a minimal perfect hash function (or MPHF) designed for k-mer sets where overlaps of k-1 bases between consecutive k-mers are exploited to:
-
Reduce the space of the data structure. For example, with the right combination of parameters, LPHash is able to achive < 0.9 bits/k-mer in practice whereas the known theoretical lower-bound of a classic MPHF is 1.44 bits/k-mer and practical constructions take 2-3 bits/k-mer.
-
Boost the evaluation speed of queries performed for consecutive k-mers.
-
Preserve the locality of k-mers: consecutive k-mers are likely to receive consecutive hash codes.
The data structure and its construction/query algorithms are described in the paper Locality-Preserving Minimal Perfect Hashing of k-mers [1].
Please, cite this paper if you use LPHash.
Datasets are available on Zenodo here: https://zenodo.org/record/7239205.
The code is tested on Linux with gcc
and on Mac OS with clang
(Intel and ARM processors supported, like Apple M1).
To build the code, CMake
is required.
Clone the repository with
git clone --recursive https://github.com/jermp/lphash.git
If you have cloned the repository without --recursive
, be sure you pull the dependencies with the following command before compiling:
git submodule update --init --recursive
K-mers can have maximal lengths of both 32 or 64, depending on their definition found in the file include/compile_constants.tpd
(either uint64_t
or __uint128_t
). Although larger 128-bit k-mers also work for k <= 32, we recommend to use the right type whenever possible.
To compile the code for a release environment (see file CMakeLists.txt
for the used compilation flags), it is sufficient to do the following:
mkdir build
cd build
cmake ..
make -j
For a testing environment, use the following instead:
mkdir debug_build
cd debug_build
cmake .. -D CMAKE_BUILD_TYPE=Debug -D LPHASH_USE_SANITIZERS=On
make -j
The repository has minimal dependencies: it only uses the PTHash library (for minimal perfect hashing [2,3]), and zlib
to read gzip-compressed streams.
To automatically pull the PTHash dependency, just clone the repo with
--recursive
as explained in Compiling the Code.
If you do not have zlib
installed, you can do
sudo apt-get install zlib1g
if you are on Linux/Ubuntu, or
brew install zlib
if you have a Mac (and Homebrew installed: https://brew.sh/).
There is one main executable called lphash
after the compilation, which can be used to run a tool.
Run ./lphash
as follows to see a list of available tools.
LP-Hash: (L)ocality (P)reserving Minimal Perfect (Hash)ing of k-mers
Usage: ./lphash <tool> ...
Available tools:
build-p build a partitioned LP-MPHF
build-u build an unpartitioned LP-MPHF
query-p query a partitioned LP-MPHF
query-u query an unpartitioned LP-MPHF
The driver program called lphash
can be used to build and query Locality-Preserving MPHFs.
The subcommand build-p
uses the partitioning strategy, while build-u
does not (check out the paper for more details).
In the following, we will focus on build-p
alone since it is the most memory efficient between the two.
From within the directory where the code was compiled (see the section Compiling the Code), run the command:
./lphash build-p --help
to show the usage of the subcommand (reported below for convenience).
Usage: build-p [-h,--help] [-i input_filename] [-k k] [-m m] [-s seed] [-t threads] [-o output_filename] [-d tmp_dirname] [-c c] [--max-memory max-memory] [--check] [--verbose]
[-i input_filename]
REQUIRED: Must be a FASTA file (.fa/fasta extension) compressed with gzip (.gz) or not:
- without duplicate nor invalid kmers
- one DNA sequence per line.
For example, it could be the de Bruijn graph topology output by BCALM.
[-k k]
REQUIRED: K-mer length (must be <= 63).
[-m m]
REQUIRED: Minimizer length (must be < k and <= 32).
[-s seed]
Seed for minimizer computation (default is 42).
[-t threads]
Number of threads for pthash (default is 1).
[-o output_filename]
Output file name where the data structure will be serialized (no files generated by default).
[-d tmp_dirname]
Temporary directory used for construction in external memory (default is directory '.').
[-c c]
A (floating point) constant that trades construction speed for space effectiveness of minimal perfect hashing.
A reasonable value lies between 3.0 and 10.0 (default is 3.00).
[--max-memory max-memory]
Maximum internal memory [GB] for building (8GB by default). Use external memory if needed.
[--check]
Check correctness after construction (disabled by default).
[--verbose]
Verbose output during construction (disabled by default).
[-h,--help]
Print this help text and silently exits.
./lphash build-p
prints useful statistics about the constructed MPHF on stdout as single comma-separated values whose fields are:
- The input filename used to build the MPHF,
- k-mer length k,
- minimizer length m,
- the fraction of non-unique minimizer,
- the theoretical epsilon (number of super-k-mers / number of k-mers),
- the true epsilon,
- alpha (fragmentation factor = number of contigs / number of k-mers),
- the space of the MPHF in bits/k-mer.
For the examples, we are going to use some collections of stitched unitigs from the directory data/unitigs_stitched
.
These collections were built for k = 31, 47 and 63, so MPHFs should be built with the right k to ensure correctness.
In the section Input Files, we explain how such collections of stitched unitigs can be obtained from raw FASTA files.
This example
./lphash build-p -i ../data/unitigs_stitched/se.ust.k31.fa.gz -k 31 -m 15 --check -o se_k31_m15.lph --verbose
builds a dictionary for the k-mers read from the file data/unitigs_stitched/se.ust.k31.fa.gz
, with k = 31 and m = 15.
It also check the correctness of the MPHF (--check
option, for both minimality and collisions), and serializes it on disk to the file se_k31_m15.lph
.
Another example is
./lphash build-p -i ../data/unitigs_stitched/se.ust.k63.fa.gz -k 63 -m 17 --check -o se_k63_m17.lph --verbose
which uses k = 63 and m = 17.
Similarly to build-p
, the query-p
command (or query-u
if the MPHF was built by build-u
) outputs single CSV lines on stdout.
Queries can be performed by recomputing each hash value from scratch (random query) of by re-using the fact that successive k-mers overlap by k-1 bases (streaming).
The fields are:
input file,lphash file,total k-mers,barebone streaming time,barebone random time,full streaming time,full random time
- Input (fasta/fastq) file,
- the queried MPHF file,
- the total number of k-mers in the input,
- average streaming query time in ns/k-mer (without saving the hash values),
- average random query time in ns/k-mer (without saving the hash values),
- average streaming query time in ns/k-mer (by saving the hash values inside an in-memory vector),
- average random query time in ns/k-mer (by saving the hash values inside an in-memory vector).
./lphash query-p -i se_k31_m15.lph -q ../data/queries/salmonella_enterica.fasta.gz
LPHash is meant to index k-mers from collections that do not contain duplicates. These collections can be obtained, for example, by extracting the maximal unitigs of a de Bruijn graph.
To do so, we can use the tool BCALM2 [4]. This tool builds a compacted de Bruijn graph and outputs its maximal unitigs. From the output of BCALM2, we can then stitch (i.e., glue) some unitigs to reduce the number of nucleotides. The stitiching process is carried out using the UST tool.
Below we provide a complete example (assuming both BCALM2 and UST are installed correctly) that downloads the Human (GRCh38) Chromosome 13 and extracts the maximal stitiched unitigs for k = 31.
mkdir DNA_datasets
wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz -O DNA_datasets/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
~/bcalm/build/bcalm -in ~/DNA_datasets/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz -kmer-size 31 -abundance-min 1 -nb-cores 8
~/UST/ust -k 31 -i ~/Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa
gzip Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa.ust.fa
rm ~/Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa
We uploaded a copy of all the datasets used in the experiments of the paper, processed as we described above, on Zenodo: https://zenodo.org/record/7239205.
- [1] Giulio Ermanno Pibiri, Yoshihiro Shibuya, and Antoine Limasset. Locality-Preserving Minimal Perfect Hashing of k-mers. Bioinformatics, Volume 39, Pages i534–i543, 2023.
- [2] Giulio Ermanno Pibiri and Roberto Trani. PTHash: Revisiting FCH Minimal Perfect Hashing. In The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1339–1348, 2021.
- [3] Giulio Ermanno Pibiri and Roberto Trani. Parallel and external-memory construction of minimal perfect hash functions with PTHash. CoRR, abs/2106.02350, 2021.
- [4] Rayan Chikhi, Antoine Limasset, and Paul Medvedev. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics, 32(12):i201–i208, 2016.