-
Notifications
You must be signed in to change notification settings - Fork 194
How to Use FALCONN
This is a guide to using FALCONN. As a complement to this introduction, we also have a Doxygen-generated API reference for C++ and a pdoc-generated API reference for Python.
As of now, FALCONN supports static datasets. We plan to add support for dynamic datasets shortly. Also, please be aware that FALCONN is currently not thread-safe. We are planning to release a multi-threaded version next.
We start with summarizing the C++ API. For the Python API, see below.
To use FALCONN, include the main header file falconn/lsh_nn_table.h
.
All classes provided by FALCONN are in the namespace falconn
.
The main wrapper class encapsulating an LSH data structure is LSHNearestNeighborTable
.
Instances of this class should be created with the function construct_table
.
The function construct_table
has one required template parameter PointType
, which is the type of an individual point to be handled by the data structure. PointType can be one of the following options:
DenseVector<float>
DenseVector<double>
SparseVector<float>
SparseVector<double>
DenseVector
is simply a typedef for the dynamically-sized vector class defined by the Eigen library, which offers convenient vector operations (see Eigen's documention for a brief introduction). SparseVector
is a typedef for a std::vector
of pairs, where the first component is the index of a coordinate, and the second component is the value (elements must be sorted by the first component).
The function construct_table
has two parameters:
- A dataset
points
, whose type isstd::vector<PointType>
. - The LSH table parameters
params
, whose type isLSHConstructionParameters
.
The function construct_table
returns a unique_ptr
to the actual LSH table object.
The struct LSHConstructionParameters
has several fields.
Some of the fields must be set in every case, while other fields are only necessary for certain hash families.
A good starting point is to use the function get_default_parameters
that sets all parameters to reasonable default values, and then to fine-tune the returned LSH parameters if necessary.
The following five parameters are always required. If the meaning of some parameters is not clear, see our LSH Primer.
-
dimension
is the dimension of the dataset. -
lsh_family
can either beLSHFamily::Hyperplane
orLSHFamily::Crosspolytope
. -
distance_function
is the distance function used for selecting the near neighbors among the filtered points. Currently it can be eitherDistanceFunction::NegativeInnerProduct
, which corresponds to the negated dot product, orDistanceFunction::EuclideanSquared
, which corresponds to the squared Euclidean distance. -
storage_hash_table
: how the low-level hash tables are stored. Set it toStorageHashTable::BitPackedFlatHashTable
if the number of bins is not much larger than the number of points and toStorageHashTable::LinearProbingHashTable
, otherwise. -
num_setup_threads
: number of threads used to set up the hash table. Zero indicates that FALCONN should use the maximum number of available hardware threads (or1
if this number cannot be determined). The number of threads used is always at most the number of tablesl
. -
k
is the number of hash functions per hash table. -
l
is the number of hash tables.
The following parameters are required for the cross-polytope hash:
-
last_cp_dimension
is the dimension of the last of thek
cross-polytopes. This parameter allows more granular control over the effective number of bits in the cross-polytope hash. The parameterlast_cp_dimension
must be between1
and the smallest power of two that is not less thandimension
.
To simplify setting the parameters k
and
last_cp_dimension
for the cross-polytope hash, we provide a helper function
compute_number_of_hash_functions
with the following interface: it takes a parameter number_of_hash_bits
and the partially set construction parameters (in particular, the fields dimension
and lsh_family
must be set).
The function then sets the fields k
and last_cp_dimension
automatically such that the number of buckets
is equal to 2number_of_hash_bits
(i.e., the number of effective bits in the hash is number_of_hash_bits
).
We recommend using this subroutine (for more details, see the example we provide).
-
num_rotations
is the number of pseudo-random rotations (see LSH Families for details). For sufficiently dense data, we recommend a value of 1. For sparse data, we recommend a value of 2.
Optional parameters for all hash functions:
-
seed
is the randomness seed which is used in the hash functions.
Optional parameters specific to the cross-polytope hash:
-
feature_hashing_dimension
is the intermediate feature hashing dimension in the sparse version of the cross-polytope hash. The value offeature_hashing_dimension
must be a power of two. In a nutshell, the meaning of this parameter is as follows: We first perform feature hashing to reduce the dimension of the sparse vector tofeature_hashing_dimension
and then apply the dense cross-polytope hash in this lower-dimensional space. Smaller values offeature_hashing_dimension
lead to a faster hash computation, but the quality of the hash function also decreases. For many high-dimensional and sparse data sets (such as bag of words, etc.), a feature hashing dimension of 512, 1024, or 2048 works well.
After setting up an LSH table, the next important step is setting the number of probes per query (see our LSH Primer for a discussion). This can be done using the member function set_num_probes
. By default, we query one bucket
per hash table. This conservative choice corresponds to "vanilla" (i.e., non-multiprobe) LSH and is often suboptimal.
Setting the right number of probes for the desired query accuracy is crucial to get good performance with FALCONN.
Note that set_num_probes
sets the number of probes we make for all hash tables together. Our GloVe example shows how to set the number of probes automatically.
In contrast to the other LSH parameters such as k
and l
, the number of probes can be changed after the table has been
constructed. Changing the number of probes involves little to no overhead.
Finally, we are ready to answer queries. On a very high level, an LSH table answers queries in two steps. First, the table retrieves a list of candidate data points using the multiprobe LSH procedure. Second, it scans this candidate list and chooses the best points according to the desired distance metric.
FALCONN provides low-level routines for generating the list of candidates, as well as more high-level functions that filter this list according to the specified distance metric.
- High-level methods:
-
find_nearest_neighbor
returns the closest data point among the candidates generated by the LSH look-up -
find_k_nearest_neighbors
returns the k closest data points among the candidates generated by the LSH look-up -
find_near_neighbors
returns all the candidates within a given distance threshold.
-
- Low-level methods:
-
get_candidates_with_duplicates
returns all candidates from the tables with possible duplicates. This is the lowest level query method. -
get_unique_candidates
also returns all candidates as above, but eliminates duplicates from the returned candidate list. -
get_unique_sorted_candidates
: the same as above, but the candidates are sorted according to their index / key (not their distance!). This can be helpful for memory locality when making a pass over all candidates.
-
We provide a thin Python wrapper around the C++ API described above, so most of the above discussion applies also to Python. For details, see our pdoc-generated documentation. As of now, the Python wrapper supports only dense data. Datasets are given as two-dimensional NumPy arrays, where rows are data points.
An important point is that the NumPy array holding a dataset must stay valid while using the LSH data structure. Otherwise, the code might crash silently.