Skip to content

Two-stage routing with Optimized Guided search and Greedy algorithm

License

Notifications You must be signed in to change notification settings

whenever5225/TOGG

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

23 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TOGG: Two-Stage Routing with Optimized Guided Search and Greedy Algorithm On Proximity Graph

Introduction

Two-Stage Routing with Optimized Guided Search and Greedy Algorithm is more efficient and effective routing strategy than state-of-the-art method, including SIGKDD'2011, CVPR'2016, PR'2019, and TPAMI'2021. At present, TOGG has been deployed in several state-of-the-art proximity graph algorithms (such as HNSW, NSG, and NSSG), and experiments on multiple real world datasets have verified the efficiency and effectiveness of TOGG.

Routing Strategies

We evaluate four advanced routing strategies applied to the proximity graph algorithms, and their papers are given in the following table. In addition, we propose TOGG, including TOGG that uses OGS-KDT + OGA (TOGG-KDT) and OGS-KMC + OGA (TOGG-KMC); they achieve better performance of efficiency and effectiveness than state-of-the-art work.

OGS-KDT: optimized Guided Search based on modified KD-Tree
OGS-KMC: optimized Guided Search based on k-means clustering
OGA: optimized greedy algorithm
Strategy Paper Source Year
SIGKDD'2011 SIGKDD 2011
CVPR'2016 CVPR 2016
PR'2019 PR 2019
TPAMI'2021 TPAMI 2021

Proximity Graph Algorithms

We applied both TOGG to three state-of-the-art proximity graphs, i.e., HNSW, NSG, and NSSG, and then their scalability were tested on four real-world datasets. Note that greedy algorithm (GA) is the routing strategy currently used by these proximity graph algorithms.

Algorithm Paper Source Year
HNSW TPAMI 2018
NSG VLDB 2019
NSSG TPAMI 2021

Datasets

Our experiment involves four real-world datasets popularly deployed by existing works. All datasets are pre-split into base data and query data and come with groundtruth data in the form of the top 100 neighbors.

dataset base_num base_dim query_num query_dim groundtruth_num/query download
SIFT1M 1000000 128 10000 128 100 sift.tar.gz(161MB)
GIST1M 1000000 960 1000 960 100 gist.tar.gz(2.6GB)
GloVe-100 1183514 100 10000 100 100 glove-100.tar.gz(424MB)
MNIST 60000 784 10000 100 100 mnist.tar.gz(19.8MB)

Parameters

b: interval factor (TOGG-KDT), it controls the size of the partition interval, thereby affecting the number of neighbors in each subtree.
m: the number of clusters (TOGG-KMC), it controls the number of neighbor clusters, which affects the number of neighbors in each neighbor cluster.

To obtain the optimal value of b (0 ≤ b < infinity) or m (1 ≤ m ≤ c, c is maximum outdegree), we first traverse the values in a large interval within their value range, and then reduce the interval and traverse the value within a small range until an optimal value is obtained.

Usage

Prerequisites

  • GCC 4.9+ with OpenMP
  • CMake 2.8+
  • Boost 1.55+

Compile on Linux

Download:

$ git clone https://github.com/whenever5225/TOGG.git

Locate directory:

$ cd TOGG/routing_evalutaion	#for routing strategies evaluation
$ cd TOGG/algorithms/A/B/	#A is GA or TOGG-KDT or TOGG-KMC, B is hnsw or nsg or nssg, for scalability evaluation
$ mkdir build && cd build/

Compile:

$ cmake ..
$ make -j

Comparing with Prior Routing Strategies

Before building index, you should set the root directory for the dataset in TOGG/routing_evaluation/test/evaluation.cpp first. Then, you can run the following instructions for build graph index.

$ cd TOGG/routing_evaluation/build/test/
$ ./evaluation strategy_name dataset_name build [other optional information]

After the build is completed, the graph index will be written in the current folder in binary format (for index size). The index construction time can be viewed from the output log information. With the index built, you can run the following commands to perform the search. Related information about the search such as distance evaluation times, candidate set size, memory load can be obtained or calculated according to the output log information.

$ cd TOGG/routing_evaluation/build/test/
$ ./evaluation strategy_name dataset_name search [other optional information]

Scalability Evaluation

Before building index, you should set the root directory for the dataset in TOGG/algorithms/A/B/test/evaluation.cpp first. Then, you can run the following instructions for build graph index.

$ cd TOGG/algorithms/A/B/build/test/
$ ./evaluation dataset_name build [other optional information]	#see evaluation.cpp file for more detail

After the build is completed, the graph index will be written in the current folder in binary format (for index size). The index construction time can be viewed from the output log information. With the index built, you can run the following commands to perform the search. Related information about the search such as distance evaluation times, candidate set size, memory load can be obtained or calculated according to the output log information.

$ cd TOGG/algorithms/A/B/build/test/
$ ./evaluation dataset_name search [other optional information]	#see evaluation.cpp file for more detail

Results

image-20210508170359383

Acknowledgements

Thanks to everyone who provided references for this project. Special thanks to n2, nsg, and ssg for their assistance in the necessary implementation of this project.

About

Two-stage routing with Optimized Guided search and Greedy algorithm

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published