Skip to content

How to Use FALCONN

Ilya Razenshteyn edited this page Oct 12, 2016 · 53 revisions

How to Use FALCONN

This is a guide to using FALCONN. As a complement to this introduction, we also have a Doxygen-generated API reference for C++ and a pdoc-generated API reference for Python.

As of now, FALCONN supports static datasets. We plan to add support for dynamic datasets shortly. Also, please be aware that FALCONN is currently not thread-safe. We are planning to release a multi-threaded version next.

We start with summarizing the C++ API. For the Python API, see below.

The Main Class

To use FALCONN, include the main header file falconn/lsh_nn_table.h. All classes provided by FALCONN are in the namespace falconn. The main wrapper class encapsulating an LSH data structure is LSHNearestNeighborTable. Instances of this class should be created with the function construct_table.

Constructing an LSH Table

The function construct_table has one required template parameter PointType, which is the type of an individual point to be handled by the data structure. PointType can be one of the following options:

  • DenseVector<float>
  • DenseVector<double>
  • SparseVector<float>
  • SparseVector<double>

DenseVector is simply a typedef for the dynamically-sized vector class defined by the Eigen library, which offers convenient vector operations (see Eigen's documention for a brief introduction). SparseVector is a typedef for a std::vector of pairs, where the first component is the index of a coordinate, and the second component is the value (elements must be sorted by the first component).

The function construct_table has two parameters:

  • A dataset points, whose type is std::vector<PointType>.
  • The LSH table parameters params, whose type is LSHConstructionParameters.

The function construct_table returns a unique_ptr to the actual LSH table object.

Construction Parameters

The struct LSHConstructionParameters has several fields. Some of the fields must be set in every case, while other fields are only necessary for certain hash families. A good starting point is to use the function get_default_parameters that sets all parameters to reasonable default values, and then to fine-tune the returned LSH parameters if necessary.

The following five parameters are always required. If the meaning of some parameters is not clear, see our LSH Primer.

  • dimension is the dimension of the dataset.
  • lsh_family can either be LSHFamily::Hyperplane or LSHFamily::Crosspolytope.
  • distance_function is the distance function used for selecting the near neighbors among the filtered points. Currently it can be either DistanceFunction::NegativeInnerProduct, which corresponds to the negated dot product, or DistanceFunction::EuclideanSquared, which corresponds to the squared Euclidean distance.
  • storage_hash_table: how the low-level hash tables are stored. Set it to StorageHashTable::BitPackedFlatHashTable if the number of bins is not much larger than the number of points and to StorageHashTable::LinearProbingHashTable, otherwise.
  • num_setup_threads: number of threads used to set up the hash table. Zero indicates that FALCONN should use the maximum number of available hardware threads (or 1 if this number cannot be determined). The number of threads used is always at most the number of tables l.
  • k is the number of hash functions per hash table.
  • l is the number of hash tables.

The following parameters are required for the cross-polytope hash:

  • last_cp_dimension is the dimension of the last of the k cross-polytopes. This parameter allows more granular control over the effective number of bits in the cross-polytope hash. The parameter last_cp_dimension must be between 1 and the smallest power of two that is not less than dimension.

To simplify setting the parameters k and last_cp_dimension for the cross-polytope hash, we provide a helper function compute_number_of_hash_functions with the following interface: it takes a parameter number_of_hash_bits and the partially set construction parameters (in particular, the fields dimension and lsh_family must be set). The function then sets the fields k and last_cp_dimension automatically such that the number of buckets is equal to 2number_of_hash_bits (i.e., the number of effective bits in the hash is number_of_hash_bits). We recommend using this subroutine (for more details, see the example we provide).

  • num_rotations is the number of pseudo-random rotations (see LSH Families for details). For sufficiently dense data, we recommend a value of 1. For sparse data, we recommend a value of 2.

Optional parameters for all hash functions:

  • seed is the randomness seed which is used in the hash functions.

Optional parameters specific to the cross-polytope hash:

  • feature_hashing_dimension is the intermediate feature hashing dimension in the sparse version of the cross-polytope hash. The value of feature_hashing_dimension must be a power of two. In a nutshell, the meaning of this parameter is as follows: We first perform feature hashing to reduce the dimension of the sparse vector to feature_hashing_dimension and then apply the dense cross-polytope hash in this lower-dimensional space. Smaller values of feature_hashing_dimension lead to a faster hash computation, but the quality of the hash function also decreases. For many high-dimensional and sparse data sets (such as bag of words, etc.), a feature hashing dimension of 512, 1024, or 2048 works well.

Using the LSH Table

Setting the number of probes

After setting up an LSH table, the next important step is setting the number of probes per query (see our LSH Primer for a discussion). This can be done using the member function set_num_probes. By default, we query one bucket per hash table. This conservative choice corresponds to "vanilla" (i.e., non-multiprobe) LSH and is often suboptimal. Setting the right number of probes for the desired query accuracy is crucial to get good performance with FALCONN.

Note that set_num_probes sets the number of probes we make for all hash tables together. Our GloVe example shows how to set the number of probes automatically.

In contrast to the other LSH parameters such as k and l, the number of probes can be changed after the table has been constructed. Changing the number of probes involves little to no overhead.

Making queries

Finally, we are ready to answer queries. On a very high level, an LSH table answers queries in two steps. First, the table retrieves a list of candidate data points using the multiprobe LSH procedure. Second, it scans this candidate list and chooses the best points according to the desired distance metric.

FALCONN provides low-level routines for generating the list of candidates, as well as more high-level functions that filter this list according to the specified distance metric.

  • High-level methods:
    • find_nearest_neighbor returns the closest data point among the candidates generated by the LSH look-up
    • find_k_nearest_neighbors returns the k closest data points among the candidates generated by the LSH look-up
    • find_near_neighbors returns all the candidates within a given distance threshold.
  • Low-level methods:
    • get_candidates_with_duplicates returns all candidates from the tables with possible duplicates. This is the lowest level query method.
    • get_unique_candidates also returns all candidates as above, but eliminates duplicates from the returned candidate list.
    • get_unique_sorted_candidates: the same as above, but the candidates are sorted according to their index / key (not their distance!). This can be helpful for memory locality when making a pass over all candidates.

Python API

We provide a thin Python wrapper around the C++ API described above, so most of the above discussion applies also to Python. For details, see our pdoc-generated documentation. As of now, the Python wrapper supports only dense data. Datasets are given as two-dimensional NumPy arrays, where rows are data points.

An important point is that the NumPy array holding a dataset must stay valid while using the LSH data structure. Otherwise, the code might crash silently.