This is the code for the current state-of-the-art billion-scale nearest neighbor search system presented in the paper:
Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors,
Dmitry Baranchuk, Artem Babenko, Yury Malkov
The code is built on top of the FAISS library.
Currently we provide only the C++ CPU implementation, which requires a BLAS library.
The code requires a C++ compiler that understands:
- the Intel intrinsics for SSE instructions
- the GCC intrinsic for the popcount instruction
- basic OpenMP
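If you are unsure whether your toolchain qualifies, a quick compile probe such as the sketch below checks all three requirements at once (the file name, compiler, and flags are illustrative; adjust them to your system):

cat > /tmp/probe.cpp <<'EOF'
#include <immintrin.h>  // Intel SSE intrinsics
#include <omp.h>        // OpenMP runtime
int main() {
    __m128 v = _mm_set1_ps(1.0f);              // SSE intrinsic
    int bits = __builtin_popcountll(0xFFULL);  // GCC popcount intrinsic
    int ok = 0;
    #pragma omp parallel                       // basic OpenMP
    #pragma omp single
    ok = (omp_get_num_threads() > 0);
    return (_mm_cvtss_f32(v) == 1.0f && bits == 8 && ok) ? 0 : 1;
}
EOF
g++ -std=c++11 -msse4 -fopenmp /tmp/probe.cpp -o /tmp/probe && /tmp/probe && echo "toolchain OK"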
- Clone repository
git clone https://github.com/dbaranchuk/ivf-hnsw --recursive
- Configure FAISS
There are a few templates for makefile.inc in the faiss/example_makefiles/ subdirectory. Copy the one relevant to your system to faiss/ and adjust it to your needs. In particular, for the ivf-hnsw project you need to set proper BLAS library paths. There are also hints for specific configurations in the troubleshooting section of the FAISS wiki.
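For example, on a typical Linux setup (the specific template name here is an assumption; pick whichever file in example_makefiles/ matches your platform, then edit the BLAS paths inside it):

cp faiss/example_makefiles/makefile.inc.Linux faiss/makefile.inc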
- Replace the FAISS CMakeLists.txt
Replace faiss/CMakeLists.txt with CMakeLists.txt.faiss in order to skip building the unnecessary tests and the GPU version.
mv CMakeLists.txt.faiss faiss/CMakeLists.txt
- Build project
cmake . && make
The proposed methods were tested on two billion-scale datasets: SIFT1B and DEEP1B. To use the provided examples, all data files have to be placed in data/SIFT1B/ and data/DEEP1B/.
- SIFT1B:
cd data/SIFT1B && bash load_sift1b.sh
- learned 993127 centroids: GoogleDrive
- precomputed indices of assigned base points: GoogleDrive
- DEEP1B:
- dataset: YandexDrive
cd data/DEEP1B && python load_deep1b.py
- learned 999973 centroids: GoogleDrive
- precomputed indices of assigned base points: GoogleDrive
Note: the precomputed indices are optional; they just let you skip the assignment step, which takes about 2-3 days for 2^20 centroids.
tests/ provides two tests for each dataset:
- IVFADC
- IVFADC + Grouping (+ Pruning)
Each test requires many options, so we provide bash scripts in examples/ that run these tests. The scripts are commented, and the Parser class provides a short description of each option.
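For instance, before running a script you can simply read it to see which options it passes to its test (the script name is taken from the example run below):

cat examples/run_deep1b_grouping.sh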
Make sure that:
- models/SIFT1B/ and models/DEEP1B/ exist
mkdir -p models/SIFT1B models/DEEP1B
- the data is placed in data/SIFT1B/ and data/DEEP1B/ respectively (or just make symbolic links)
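For example, run from the repository root (the source paths are placeholders for wherever your copies of the datasets actually live, and the target directories are assumed not to exist yet):

ln -s /path/to/SIFT1B data/SIFT1B
ln -s /path/to/DEEP1B data/DEEP1B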
- Run, for example:
bash examples/run_deep1b_grouping.sh
The doxygen documentation gives per-class information.
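To build it locally, assuming a Doxyfile is present in the repository root (an assumption; doxygen -g generates a default one otherwise):

doxygen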