Add aggregation bucket limit #1363
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main    #1363      +/-  ##
==========================================
+ Coverage   94.29%   94.31%    +0.01%
==========================================
  Files         234      236        +2
  Lines       42769    43524      +755
==========================================
+ Hits        40331    41050      +719
- Misses       2438     2474       +36
Validation happens in different phases depending on the aggregation:
- Term: during segment collection (see the sketch below)
- Histogram: at the end, when converting into intermediate buckets (we preallocate empty buckets for the range); revisit after #1370
- Range: when validating the request

update CHANGELOG
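For the Term case, a minimal sketch of what enforcing the limit during segment collection could look like; `MAX_BUCKET_COUNT`, `TermBuckets`, and the error message are illustrative assumptions, not tantivy's actual code:

```rust
use std::collections::HashMap;

// Illustrative limit; the real value and how it is configured are assumptions.
const MAX_BUCKET_COUNT: usize = 65_000;

#[derive(Default)]
struct TermBuckets {
    // term ordinal -> doc count
    counts: HashMap<u64, u64>,
}

impl TermBuckets {
    // Called from the per-document collect path: incrementing may create a
    // new bucket, so the limit is checked right here, aborting the query by
    // bubbling an Err up through the now-fallible collect.
    fn increment(&mut self, term_ord: u64) -> Result<(), String> {
        *self.counts.entry(term_ord).or_insert(0) += 1;
        if self.counts.len() > MAX_BUCKET_COUNT {
            return Err(format!(
                "aggregation exceeded bucket limit of {}",
                MAX_BUCKET_COUNT
            ));
        }
        Ok(())
    }
}
```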
examples/custom_collector.rs
let value = self.fast_field_reader.get(doc) as f64;
self.stats.count += 1;
self.stats.sum += value;
self.stats.squared_sum += value * value;
Ok(())
Can you run the bench before and after this change to check it does not impact collector performance?
(Top + Count on the "the" term query)
The MultiCollector is likely to be the trickiest because of the dynamic dispatch. Unfortunately, it is not part of the benchmark.
Yep, but in my benches so far adding Result didn't add overhead, except for very tight and simple loops.
E.g. here, `let value = self.fast_field_reader.get(doc) as f64;` probably considerably outweighs the Result overhead.
I'm scared of the tantivy API change (Collector::collect returning a Result).
If I understand correctly, the goal is to detect that we have reached the collection limit and abort the query.
An alternative could be to detect the excess of buckets, keep on collecting while avoiding the creation of extra buckets, and return an error when harvesting the segment, as in the sketch below.
Would that be unreasonable?
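A minimal sketch of that alternative, with hypothetical `CappedBuckets`/`harvest` names: collection stays infallible, overflow just sets a flag, and the error only surfaces at harvest time:

```rust
use std::collections::HashMap;

struct CappedBuckets {
    buckets: HashMap<u64, u64>,
    limit: usize,
    overflowed: bool,
}

impl CappedBuckets {
    // Hot path stays infallible: past the limit, new buckets are simply
    // not created and the overflow is recorded.
    fn collect(&mut self, key: u64) {
        if self.buckets.len() >= self.limit && !self.buckets.contains_key(&key) {
            self.overflowed = true;
            return;
        }
        *self.buckets.entry(key).or_insert(0) += 1;
    }

    // The error is only reported when the segment results are harvested.
    fn harvest(self) -> Result<HashMap<u64, u64>, String> {
        if self.overflowed {
            Err(format!("aggregation exceeded bucket limit of {}", self.limit))
        } else {
            Ok(self.buckets)
        }
    }
}
```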
Scared because of performance? I checked the performance and the "the" TOP10_COUNT query is indeed slower (TOP_10 and COUNT are the same speed, though). I introduced a change which I had been considering for some time now, which is to collect hits and pass them on as a block. I still need to add a benchmark for this. In the current change, collect_block is implemented quite simply, just forwarding to the underlying function, but with manual unrolling we could probably gain some more performance (see the sketch below).

Something currently untested which may help: currently we always pass the score, but having some mechanism to only request docs without the score could make for some gains in some scenarios.
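A minimal sketch of the collect_block shape described above, using a stand-in trait rather than tantivy's actual SegmentCollector: the default implementation just forwards to the per-document method, and a concrete collector can override it to process a whole block at once (and, e.g., unroll manually):

```rust
type DocId = u32;
type Score = f32;

// Stand-in for the SegmentCollector trait; names and signatures here are
// illustrative assumptions, not tantivy's exact API.
trait BlockCollector {
    fn collect(&mut self, doc: DocId, score: Score) -> Result<(), String>;

    // Default implementation: simply forward each hit to collect, so
    // existing collectors keep working unchanged.
    fn collect_block(&mut self, block: &[(DocId, Score)]) -> Result<(), String> {
        for &(doc, score) in block {
            self.collect(doc, score)?;
        }
        Ok(())
    }
}
```

Overriding collect_block lets a collector hoist the per-document Result handling out of the innermost loop, which is exactly where the overhead showed up in the TOP10_COUNT bench.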
I thought about that, but returning a Result in collect is the right approach imo, since we could also want to read from sources other than infallible in-memory sources during segment collection.
Performance and misuse, yes. Maybe a bool would do the trick? COUNT and TOP_10 have a special code path; they are irrelevant here.
Misuse how? Every tantivy user should know how to handle a Result.
Like a C-style API, where on false you collect the error somewhere? It's likely faster, but also pretty ugly. I added a bench with the MultiCollector (Count + Top10). It is faster with the collect_block approach.
add collect_block in segment_collector to handle groups of documents as a performance optimization
add collect_block for MultiCollector
@@ -153,7 +154,7 @@ impl SegmentRangeCollector {
 ) -> crate::Result<IntermediateBucketResult> {
     let field_type = self.field_type;

-    let buckets = self
+    let buckets: FnvHashMap<SerializedKey, IntermediateRangeBucketEntry> = self
that's super helpful! thanks
Change SegmentCollector.collect to return a Result.

Validation happens in different phases depending on the aggregation:
- Term: during segment collection
- Histogram: at the end, when converting into intermediate buckets (we preallocate empty buckets for the range); revisit after #1370
- Range: when validating the request

Closes #1331