patelyash/index factory (#340)
* # This is a combination of 2 commits.

remove _u, _s typedefs

* added some seed files

* add seed files

* New distance metric hierarchy

* Refactoring changes

* Fixing compile errors in refactored code

* Fixing compile errors

* DiskANN Builds with initial refactoring changes

* Saving changes for Ravi

* More refactoring

* Refactor

* Fixed most of the bugs related to _data

* add seed files

* # This is a combination of 2 commits.

remove _u, _s typedefs

* added some seed files

* New distance metric hierarchy

* Refactoring changes

* Fixing compile errors in refactored code

* Fixing compile errors

* DiskANN Builds with initial refactoring changes

* Saving changes for Ravi

* More refactoring

* Refactor

* Fixed most of the bugs related to _data

* Post merge with main

* Refactored version which compiles on Windows

* now compiles on linux

* minor clean-up

* minor bug fix

* minor bug

* clang format fix + build error fix

* clang format fix

* minor changes

* added back the fast_l2 feature

* added back set_start_points in index.cpp

* Version for review

* Incorporating Harsha's comments - 2

* move implementation of abstract data store methods to a cpp file

* clang format

* clang format

* Added slot manager file (empty) and fixed compile errors

* fixed a linux compile error

* clang

* debugging workflow failure

* clang

* more debug

* more debug

* debug for workflow

* remove slot manager

* Removed the #ifdef WINDOWS directive from class definitions

* Refactoring alignment factor into distance hierarchy

* Fixing cosine distance

* Ensuring we call preprocess_query always

* Fixed distance invocations

* fixed cosine bug, clang-formatted

* cleaned up and added comments

* clang-formatted

* more clang-format

* clang-format 3

* remove deleted code in scratch.cpp

* reverted clang to Microsoft

* small change

* Removed slot_manager from this PR

* newline at EOF in_mem_Graph_store.cpp

* rename distance_metric to distance_fn

* resolving PR comments

* minor bug fix for initialization

* creating index_factory

* using index factory to build inmem index

* clang format fix

* minor bug fix

* fixing build error

* replacing mem_store with abstract_mem_store + injecting data_store to Index

* minor fix

* clang format fix

* commenting data_store injection to prevent double invocation and mem leak (for now)

* fixing the build for filters

* moving abstract index to abstract_index.h

* IndexBuildParamsBuilder to build IndexBuildParams properly with error checking

* fixing build errors

* fixing minor error

* refactoring index search to be simple

* clang format fix

* refactoring search_mem_index to use index factory

* clang fix

* minor fix

* minor fix for build

* optimize for fast l2 restore

* removing comments

* removing comments

* adding templating to IndexFactory (can't avoid it anymore)

* fixing build error

* fixing ubuntu build error

* ubuntu build exception fix

* passing num_pq_bytes

* giving one more shot to config-driven arch with boost::any (type erasure)

* clang fix

* modifying search to use boost::any

* fixing ubuntu build errors/warning

* created indexconfigbuilder and fixed a typo

* fixing error in pq build

* some comments + lazy_delete impl

* bumping to std c++17 & replacing boost::any with std::any

* clang fix

* c++ std 17 for ubuntu

* minor fix

* converting search to batch_search + A vector wrapper using std::any to store vector as a shared ptr

* adding AnyVector to encapsulate vector in std::any + adding basic yaml parser(WIP)

* adding wrapper code for vector and set, checked with Andrija

* fixing ubuntu build error

* trying to resolve ubuntu build error

* testing test streaming index with IndexFactory

* fixing ubuntu build error

* fixing search for test insert delete consolidate

* refactored test_streaming_scenario

* refactored test_insert_delete_consolidate to use AbstractIndex and IndexFactory

* fixing ubuntu build error

* making build method in abstract index consistent

* some code cleanup + abstract_cpp to add implementation

* removing comments and code cleanup

* build error fix

* fixing -Wreorder warning

* separating build structs to their header + refactor search and remove batch search

* fixing ubuntu build errors

* resolving segfault error from search_mem_index

* fixing query_result_tag allocation

* minor update

* search fix

* trying to fix windows latest build for dynamic index

* adding temp logging to debug windows latest build issue

* removing logging for debug

* fixing windows latest build error for dynamic index search

* moving any wrappers to separate file + organizing code

* fixing check error

* updating private var naming convention

* minor update

* unraveling search methods in abstract index. Iteration 1

* minor fix

* unused vars remove

* returning a unique_ptr to Abstract Index from index factory

* adding implementation from abstract_index.h to abstract_index.cpp

* making abstract index api to be more explicit (experiment)

* some code cleanup

* removing detected memory leaks (free up index)

* separating enums for data and graph strategy

* Index ctor(config) now uses injected datastore from IndexFactory

* distance in index population in new config ctor

* resolving some comments from Andrija

* Resolving some restructuring comments by Andrija

* minor fix

* fixing ubuntu build error

* warning fix

* simplified get() in anywrappers

* making index config a unique ptr and owned by IndexFactory

* removing complex if/else calling recursively + added unimplemented TagT to AbsIdx

* renaming get_instance to create_instance

* clang format fix

* removing const_cast from any_wrapper

* fixing andrija's comments

* removing warnings

---------

Co-authored-by: harsha vardhan simhadri <harsha.v.simhadri@gmail.com>
Co-authored-by: Gopal Srinivasa <gopalsr@microsoft.com>
Co-authored-by: ravishankar <rakri@microsoft.com>
Co-authored-by: Harsha Vardhan Simhadri <harsha-simhadri@users.noreply.github.com>
5 people authored and varat73 committed Jun 26, 2023
1 parent b4050de commit b2e4a24
Showing 160 changed files with 18,532 additions and 175 deletions.
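
Reviewer's sketch (not part of the commit): several entries in the message above mention moving from boost::any to std::any and wrapping vectors so that they can cross a non-templated interface. The block below is a minimal illustration of that idea under my own assumptions. `AnyVectorRef` and `fill_tags` are hypothetical names, and the wrappers actually added by this commit may be shaped differently. It needs C++17.

```cpp
#include <any>
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical wrapper (an assumption, not DiskANN's class): keeps a reference
// to a caller-owned std::vector<T> inside std::any, so a non-templated function
// can accept it without knowing T at compile time.
class AnyVectorRef
{
  public:
    template <typename T> explicit AnyVectorRef(std::vector<T> &vec) : _ref(std::ref(vec))
    {
    }

    // Recover the typed vector. Throws if T is not the stored element type.
    template <typename T> std::vector<T> &get() const
    {
        const auto *wrapped = std::any_cast<std::reference_wrapper<std::vector<T>>>(&_ref);
        if (wrapped == nullptr)
            throw std::invalid_argument("AnyVectorRef: element type mismatch");
        return wrapped->get();
    }

  private:
    std::any _ref;
};

// Hypothetical non-templated consumer: writes `count` tags into the wrapped
// vector, which it assumes holds uint32_t elements.
inline void fill_tags(AnyVectorRef out, uint32_t count)
{
    auto &tags = out.get<uint32_t>();
    tags.clear();
    for (uint32_t i = 0; i < count; ++i)
        tags.push_back(i * 10);
    assert(tags.size() == count);
}
```

The caller keeps ownership of the vector; the wrapper only carries a typed reference, so a type mismatch surfaces as an exception at the point of use rather than as undefined behavior.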
77 changes: 37 additions & 40 deletions apps/build_memory_index.cpp
@@ -17,6 +17,7 @@

 #include "memory_mapper.h"
 #include "ann_exception.h"
+#include "index_factory.h"

 namespace po = boost::program_options;

@@ -155,46 +156,42 @@ int main(int argc, char **argv)
     {
         diskann::cout << "Starting index build with R: " << R << " Lbuild: " << L << " alpha: " << alpha
                       << " #threads: " << num_threads << std::endl;
-        if (label_file != "" && label_type == "ushort")
-        {
-            if (data_type == std::string("int8"))
-                return build_in_memory_index<int8_t, uint32_t, uint16_t>(
-                    metric, data_path, R, L, alpha, index_path_prefix, num_threads, use_pq_build, build_PQ_bytes,
-                    use_opq, label_file, universal_label, Lf);
-            else if (data_type == std::string("uint8"))
-                return build_in_memory_index<uint8_t, uint32_t, uint16_t>(
-                    metric, data_path, R, L, alpha, index_path_prefix, num_threads, use_pq_build, build_PQ_bytes,
-                    use_opq, label_file, universal_label, Lf);
-            else if (data_type == std::string("float"))
-                return build_in_memory_index<float, uint32_t, uint16_t>(
-                    metric, data_path, R, L, alpha, index_path_prefix, num_threads, use_pq_build, build_PQ_bytes,
-                    use_opq, label_file, universal_label, Lf);
-            else
-            {
-                std::cout << "Unsupported type. Use one of int8, uint8 or float." << std::endl;
-                return -1;
-            }
-        }
-        else
-        {
-            if (data_type == std::string("int8"))
-                return build_in_memory_index<int8_t>(metric, data_path, R, L, alpha, index_path_prefix, num_threads,
-                                                     use_pq_build, build_PQ_bytes, use_opq, label_file, universal_label,
-                                                     Lf);
-            else if (data_type == std::string("uint8"))
-                return build_in_memory_index<uint8_t>(metric, data_path, R, L, alpha, index_path_prefix, num_threads,
-                                                      use_pq_build, build_PQ_bytes, use_opq, label_file,
-                                                      universal_label, Lf);
-            else if (data_type == std::string("float"))
-                return build_in_memory_index<float>(metric, data_path, R, L, alpha, index_path_prefix, num_threads,
-                                                    use_pq_build, build_PQ_bytes, use_opq, label_file, universal_label,
-                                                    Lf);
-            else
-            {
-                std::cout << "Unsupported type. Use one of int8, uint8 or float." << std::endl;
-                return -1;
-            }
-        }
+
+        size_t data_num, data_dim;
+        diskann::get_bin_metadata(data_path, data_num, data_dim);
+
+        auto config = diskann::IndexConfigBuilder()
+                          .with_metric(metric)
+                          .with_dimension(data_dim)
+                          .with_max_points(data_num)
+                          .with_data_load_store_strategy(diskann::MEMORY)
+                          .with_data_type(data_type)
+                          .with_label_type(label_type)
+                          .is_dynamic_index(false)
+                          .is_enable_tags(false)
+                          .is_use_opq(use_opq)
+                          .is_pq_dist_build(use_pq_build)
+                          .with_num_pq_chunks(build_PQ_bytes)
+                          .build();
+
+        auto index_build_params = diskann::IndexWriteParametersBuilder(L, R)
+                                      .with_filter_list_size(Lf)
+                                      .with_alpha(alpha)
+                                      .with_saturate_graph(false)
+                                      .with_num_threads(num_threads)
+                                      .build();
+
+        auto build_params = diskann::IndexBuildParamsBuilder(index_build_params)
+                                .with_universal_label(universal_label)
+                                .with_label_file(label_file)
+                                .with_save_path_prefix(index_path_prefix)
+                                .build();
+        auto index_factory = diskann::IndexFactory(config);
+        auto index = index_factory.create_instance();
+        index->build(data_path, data_num, build_params);
+        index->save(index_path_prefix.c_str());
+        index.reset();
+        return 0;
     }
     catch (const std::exception &e)
     {
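
Reviewer's sketch (not part of the commit): the apps/build_memory_index.cpp hunk above replaces a per-type `if`/`else` ladder in `main` with a config builder plus a factory that hands back a pointer to an abstract index. The block below shows one way that pattern can be arranged. Every name in it (`ToyConfig`, `ToyConfigBuilder`, `ToyAbstractIndex`, `ToyIndex`, `ToyIndexFactory`, `bytes_per_point`) is a hypothetical stand-in; this is not the code behind `diskann::IndexConfigBuilder` or `diskann::IndexFactory`, whose internals are not shown in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical config carried from the builder to the factory.
struct ToyConfig
{
    std::string data_type;
    size_t dimension = 0;
};

// Fluent builder that validates once, in build(), instead of at every call site.
class ToyConfigBuilder
{
  public:
    ToyConfigBuilder &with_data_type(const std::string &data_type)
    {
        _config.data_type = data_type;
        return *this;
    }

    ToyConfigBuilder &with_dimension(size_t dimension)
    {
        _config.dimension = dimension;
        return *this;
    }

    ToyConfig build() const
    {
        if (_config.data_type.empty())
            throw std::invalid_argument("data_type must be set");
        if (_config.dimension == 0)
            throw std::invalid_argument("dimension must be non-zero");
        return _config;
    }

  private:
    ToyConfig _config;
};

// Non-templated interface the application code programs against.
class ToyAbstractIndex
{
  public:
    virtual ~ToyAbstractIndex() = default;
    virtual size_t bytes_per_point() const = 0;
};

// Templated implementation; the element type is chosen by the factory.
template <typename T> class ToyIndex : public ToyAbstractIndex
{
  public:
    explicit ToyIndex(size_t dimension) : _dimension(dimension)
    {
        assert(_dimension > 0);
    }

    size_t bytes_per_point() const override
    {
        return _dimension * sizeof(T);
    }

  private:
    size_t _dimension;
};

// The string-to-template dispatch lives here, in one place.
class ToyIndexFactory
{
  public:
    explicit ToyIndexFactory(ToyConfig config) : _config(std::move(config))
    {
    }

    std::unique_ptr<ToyAbstractIndex> create_instance() const
    {
        if (_config.data_type == "int8")
            return std::make_unique<ToyIndex<int8_t>>(_config.dimension);
        if (_config.data_type == "uint8")
            return std::make_unique<ToyIndex<uint8_t>>(_config.dimension);
        if (_config.data_type == "float")
            return std::make_unique<ToyIndex<float>>(_config.dimension);
        throw std::invalid_argument("Unsupported type. Use one of int8, uint8 or float.");
    }

  private:
    ToyConfig _config;
};
```

With this arrangement, adding a new element type touches only the factory, and an unsupported type becomes an exception that the existing `catch (const std::exception &e)` in `main` could report, rather than a `return -1` repeated in each branch.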
67 changes: 37 additions & 30 deletions apps/search_memory_index.cpp
@@ -20,6 +20,7 @@
 #include "index.h"
 #include "memory_mapper.h"
 #include "utils.h"
+#include "index_factory.h"

 namespace po = boost::program_options;

@@ -30,14 +31,14 @@ int search_memory_index(diskann::Metric &metric, const std::string &index_path,
                         const bool dynamic, const bool tags, const bool show_qps_per_thread,
                         const std::vector<std::string> &query_filters, const float fail_if_recall_below)
 {
+    using TagT = uint32_t;
     // Load the query file
     T *query = nullptr;
     uint32_t *gt_ids = nullptr;
     float *gt_dists = nullptr;
     size_t query_num, query_dim, query_aligned_dim, gt_num, gt_dim;
     diskann::load_aligned_bin<T>(query_file, query, query_num, query_dim, query_aligned_dim);

-    // Check for ground truth
     bool calc_recall_flag = false;
     if (truthset_file != std::string("null") && file_exists(truthset_file))
     {
@@ -66,18 +67,32 @@ int search_memory_index(diskann::Metric &metric, const std::string &index_path,
         }
     }

-    using TagT = uint32_t;
-    const bool concurrent = false, pq_dist_build = false, use_opq = false;
-    const size_t num_pq_chunks = 0;
-    using IndexType = diskann::Index<T, TagT, LabelT>;
-    const size_t num_frozen_pts = IndexType::get_graph_num_frozen_points(index_path);
-    IndexType index(metric, query_dim, 0, dynamic, tags, concurrent, pq_dist_build, num_pq_chunks, use_opq,
-                    num_frozen_pts);
-    std::cout << "Index class instantiated" << std::endl;
-    index.load(index_path.c_str(), num_threads, *(std::max_element(Lvec.begin(), Lvec.end())));
+    const size_t num_frozen_pts = diskann::get_graph_num_frozen_points(index_path);
+
+    auto config = diskann::IndexConfigBuilder()
+                      .with_metric(metric)
+                      .with_dimension(query_dim)
+                      .with_max_points(0)
+                      .with_data_load_store_strategy(diskann::MEMORY)
+                      .with_data_type(diskann_type_to_name<T>())
+                      .with_label_type(diskann_type_to_name<LabelT>())
+                      .with_tag_type(diskann_type_to_name<TagT>())
+                      .is_dynamic_index(dynamic)
+                      .is_enable_tags(tags)
+                      .is_concurrent_consolidate(false)
+                      .is_pq_dist_build(false)
+                      .is_use_opq(false)
+                      .with_num_pq_chunks(0)
+                      .with_num_frozen_pts(num_frozen_pts)
+                      .build();
+
+    auto index_factory = diskann::IndexFactory(config);
+    auto index = index_factory.create_instance();
+    index->load(index_path.c_str(), num_threads, *(std::max_element(Lvec.begin(), Lvec.end())));
     std::cout << "Index loaded" << std::endl;
+
     if (metric == diskann::FAST_L2)
-        index.optimize_index_layout();
+        index->optimize_index_layout();

     std::cout << "Using " << num_threads << " threads to search" << std::endl;
     std::cout.setf(std::ios_base::fixed, std::ios_base::floatfield);
@@ -148,29 +163,22 @@ int search_memory_index(diskann::Metric &metric, const std::string &index_path,
             auto qs = std::chrono::high_resolution_clock::now();
             if (filtered_search)
             {
-                LabelT filter_label_as_num;
-                if (query_filters.size() == 1)
-                {
-                    filter_label_as_num = index.get_converted_label(query_filters[0]);
-                }
-                else
-                {
-                    filter_label_as_num = index.get_converted_label(query_filters[i]);
-                }
-                auto retval = index.search_with_filters(query + i * query_aligned_dim, filter_label_as_num, recall_at,
-                                                        L, query_result_ids[test_id].data() + i * recall_at,
-                                                        query_result_dists[test_id].data() + i * recall_at);
+                std::string raw_filter = query_filters.size() == 1 ? query_filters[0] : query_filters[i];
+
+                auto retval = index->search_with_filters(query + i * query_aligned_dim, raw_filter, recall_at, L,
+                                                         query_result_ids[test_id].data() + i * recall_at,
+                                                         query_result_dists[test_id].data() + i * recall_at);
                 cmp_stats[i] = retval.second;
             }
             else if (metric == diskann::FAST_L2)
             {
-                index.search_with_optimized_layout(query + i * query_aligned_dim, recall_at, L,
-                                                   query_result_ids[test_id].data() + i * recall_at);
+                index->search_with_optimized_layout(query + i * query_aligned_dim, recall_at, L,
+                                                    query_result_ids[test_id].data() + i * recall_at);
             }
             else if (tags)
             {
-                index.search_with_tags(query + i * query_aligned_dim, recall_at, L,
-                                       query_result_tags.data() + i * recall_at, nullptr, res);
+                index->search_with_tags(query + i * query_aligned_dim, recall_at, L,
+                                        query_result_tags.data() + i * recall_at, nullptr, res);
                 for (int64_t r = 0; r < (int64_t)recall_at; r++)
                 {
                     query_result_ids[test_id][recall_at * i + r] = query_result_tags[recall_at * i + r];
@@ -179,8 +187,8 @@ int search_memory_index(diskann::Metric &metric, const std::string &index_path,
             else
             {
                 cmp_stats[i] = index
-                                   .search(query + i * query_aligned_dim, recall_at, L,
-                                           query_result_ids[test_id].data() + i * recall_at)
+                                   ->search(query + i * query_aligned_dim, recall_at, L,
+                                            query_result_ids[test_id].data() + i * recall_at)
                                    .second;
             }
             auto qe = std::chrono::high_resolution_clock::now();
@@ -245,7 +253,6 @@ int search_memory_index(diskann::Metric &metric, const std::string &index_path,
     }

     diskann::aligned_free(query);
-
     return best_recall >= fail_if_recall_below ? 0 : -1;
 }

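
Reviewer's sketch (not part of the commit): in the apps/search_memory_index.cpp hunks above, the caller stops converting filter strings to `LabelT` itself and passes the raw string to `search_with_filters` through a pointer to the index. The block below illustrates how a non-templated interface can accept a type-erased query and a raw label, then recover the concrete types inside the templated implementation. All names (`ToySearchable`, `ToyLabeledIndex`, `nearest_with_filter`) are hypothetical, the "search" is a brute-force scan over 1-D points, and nothing here is taken from DiskANN's actual `AbstractIndex`.

```cpp
#include <any>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical non-templated interface: the query pointer travels as std::any
// and the filter as its raw string form.
class ToySearchable
{
  public:
    virtual ~ToySearchable() = default;
    virtual uint32_t nearest_with_filter(const std::any &query, const std::string &raw_label) const = 0;
};

// Templated implementation over 1-D points, each carrying exactly one label.
template <typename T, typename LabelT> class ToyLabeledIndex : public ToySearchable
{
  public:
    ToyLabeledIndex(std::unordered_map<std::string, LabelT> label_map, std::vector<T> points,
                    std::vector<LabelT> labels)
        : _label_map(std::move(label_map)), _points(std::move(points)), _labels(std::move(labels))
    {
        assert(_points.size() == _labels.size());
    }

    // Returns the id of the closest point that carries the requested label.
    uint32_t nearest_with_filter(const std::any &query, const std::string &raw_label) const override
    {
        // Throws std::bad_any_cast if the caller stored anything other than `const T *`.
        const T *typed_query = std::any_cast<const T *>(query);
        if (typed_query == nullptr)
            throw std::invalid_argument("query pointer is null");

        // The string-to-LabelT conversion now happens inside the index.
        const LabelT label = get_converted_label(raw_label);

        bool found = false;
        uint32_t best_id = 0;
        double best_dist = 0.0;
        for (uint32_t id = 0; id < _points.size(); ++id)
        {
            if (_labels[id] != label)
                continue;
            const double dist = std::abs(static_cast<double>(_points[id]) - static_cast<double>(*typed_query));
            if (!found || dist < best_dist)
            {
                found = true;
                best_id = id;
                best_dist = dist;
            }
        }
        if (!found)
            throw std::runtime_error("no point carries the requested label");
        return best_id;
    }

  private:
    LabelT get_converted_label(const std::string &raw_label) const
    {
        const auto it = _label_map.find(raw_label);
        if (it == _label_map.end())
            throw std::out_of_range("unknown label: " + raw_label);
        return it->second;
    }

    std::unordered_map<std::string, LabelT> _label_map;
    std::vector<T> _points;
    std::vector<LabelT> _labels;
};
```

The trade-off is that type errors move from compile time to run time: a caller that stores the wrong pointer type in the `std::any` gets an exception instead of a compiler diagnostic, which is the price of keeping templates out of the application code.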