Make doc and query count configurable in benchmark #270
Makes the document and query counts configurable in the benchmarking tool. With this change, users can choose to index or search only a subset of the vectors in the data set. This is useful for indices that require training, since training may need only a subset of the data set. Signed-off-by: John Mazanec <jmazane@amazon.com>
Codecov Report
@@ Coverage Diff @@
## main #270 +/- ##
=========================================
Coverage 83.38% 83.38%
Complexity 884 884
=========================================
Files 127 127
Lines 3833 3833
Branches 361 361
=========================================
Hits 3196 3196
Misses 475 475
Partials 162 162
What are the default values for the new params? Another question: does it make sense to reflect the values of the new params in the test results, maybe with an average?
@martin-gaievski The default is the entire dataset. So, if you don't specify anything, it will index every vector in the dataset.
All metrics take the dataset size into account. For instance, querying produces the p50, p90 and p99 metrics. For ingest, it is the total time. In the future we could add an ingest metric in docs/sec, but I think that's outside the scope of this change for now.
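To make the default and the query metrics concrete, here is a minimal Python sketch. It is not the code from this PR: the helper names `resolve_count` and `latency_percentiles` are hypothetical, clamping an oversized count to the dataset size is an assumption, and the percentiles use the nearest-rank method, which may differ from what the benchmarking tool actually does.

```python
from typing import Dict, Optional, Sequence


def resolve_count(requested: Optional[int], dataset_size: int) -> int:
    """Return how many vectors to index or query (hypothetical helper).

    Assumption: None means "use the entire dataset", and a request larger
    than the dataset is clamped to the dataset size.
    """
    if requested is None:
        return dataset_size
    if requested < 0:
        raise ValueError("count must be non-negative")
    return min(requested, dataset_size)


def latency_percentiles(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize query latencies as p50/p90/p99 (nearest-rank method)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    n = len(ordered)
    summary = {}
    for p in (50, 90, 99):
        # Integer ceiling of p * n / 100 gives the 1-based nearest rank
        # without floating-point rounding surprises.
        rank = max(1, (p * n + 99) // 100)
        summary[f"p{p}"] = ordered[rank - 1]
    return summary
```

With this sketch, `resolve_count(None, 1000)` falls back to the full 1000 vectors, while `resolve_count(100, 1000)` limits the step to the first 100.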
      Returns:
          Recall at R
      """
      correct = 0.0
-     query = 0
-     while True:
+     for query in range(query_count):
          true_neighbors = neighbor_dataset.read(1)
          if true_neighbors is None:
              break
          true_neighbors_set = set(true_neighbors[0][:k])
          for j in range(r):
FAR: maybe change r to limit or something more meaningful.
r is a technical term in recall@r. I currently calculate it as (number of the top r results returned by the query that are in the ground-truth k set) / r.
However, I think I may have this mixed up a little, and I will need to refactor it to follow how faiss computes it: https://github.com/facebookresearch/faiss/blob/main/faiss/AutoTune.cpp#L60-L97. I will open a separate issue for this.
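For reference, the definition as stated in this comment can be sketched as follows. This is an illustration only: it takes plain in-memory lists instead of the dataset reader used in the diff above (the parameter name `true_neighbors` is hypothetical), it averages over `query_count` queries, and it does not attempt to reproduce the faiss computation mentioned above.

```python
from typing import Sequence


def recall_at_r(
    results: Sequence[Sequence[int]],
    true_neighbors: Sequence[Sequence[int]],
    r: int,
    k: int,
    query_count: int,
) -> float:
    """Fraction of top-r results that appear in the ground-truth top-k set,
    averaged over the first query_count queries."""
    if r <= 0 or query_count <= 0:
        raise ValueError("r and query_count must be positive")
    correct = 0
    for query in range(query_count):
        # Ground-truth ids for this query, limited to the k nearest.
        true_neighbors_set = set(true_neighbors[query][:k])
        # Count how many of the top r returned ids are true neighbors.
        correct += sum(
            1 for doc_id in results[query][:r] if doc_id in true_neighbors_set
        )
    return correct / (r * query_count)
```

Passing a `query_count` smaller than the number of available queries evaluates only that subset, which mirrors the configurable query count introduced here.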
LGTM. Thanks.
…t#270) Makes the document and query count configurable in the benchmarking tool. With this functionality, users can now specify to only index or search a subset of the vectors in the data set. This is useful for indices that require training that may only need a subset of the data set for training. Signed-off-by: John Mazanec <jmazane@amazon.com> Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Description
Makes the document and query counts configurable in the benchmarking tool. With this change, users can choose to index or search only a subset of the vectors in the data set. This is useful for indices that require training, since training may need only a subset of the data set.
A query or ingest step might look like this now:
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.