Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Concurrent Segment Search] suppport Diversified Sampler and Sampler aggs #11075

Open
jed326 opened this issue Nov 2, 2023 · 3 comments
Open

Comments

@jed326
Copy link
Collaborator

jed326 commented Nov 2, 2023

The max_docs_per_value parameter is being evaluated at the slice level for concurrent segment search instead of at the shard level, which is resulting in a higher than expected document count being returned in the aggregations.

See: #10046

Also, the docs for this agg type does not even mention this parameter.

Related flaky test issues:

@jed326 jed326 self-assigned this Nov 2, 2023
@jed326 jed326 removed the untriaged label Nov 2, 2023
@jed326 jed326 changed the title [Concurrent Segment Search] max_docs_per_value for Diversified Sample Aggregations need to be evaluated at the shard level [Concurrent Segment Search] max_docs_per_value for Diversified Sample Aggregations is not evaluated at the shard level Nov 3, 2023
@jed326 jed326 changed the title [Concurrent Segment Search] max_docs_per_value for Diversified Sample Aggregations is not evaluated at the shard level [Concurrent Segment Search] agg parameters for Diversified Sample and Sample aggs are not evaluated at the shard level Nov 3, 2023
@jed326
Copy link
Collaborator Author

jed326 commented Nov 3, 2023

The problem

The shard_size and max_docs_per_value (just the former for sampler agg) are being evaluated at the slice level instead of at the shard level for concurrent segment search and this is leading to a higher number than expected docs being collected in certain cases.

Background

The Sampler agg uses a BestDocsDeferringCollector which which will collect all matches and then later on during post collection will replay these docs for the sub-aggregator. The Diversified Sampler agg does the same thing, however with a DiverseDocsDeferringCollector instead which will apply the max_docs_per_value limit on the specified doc field.

This means that both shard_size and max_docs_per_value parameters are applied during the collection phase, which happens on the index_searcher threadpool for each slice. There was a similar issue with MultiBucketAggs (for example InternalTerms) which was addressed in #8860 (comment), however this is different and we cannot handle it in the slice-level reduce phase in the same way for a few reasons:

  1. Today the collection itself is limited by the shard_size and max_docs_per_value parameters so these values are not passed to the Aggregation sub-classes used in the reduce phase.
  2. The top docs we collect here are limited by the score and the sub-aggregator will have no concept of score to apply during the reduce phase of the sub-aggregators.
  3. The sampler agg itself doesn't have any intermediate state per se, it basically just calls reduce on the sub-aggs during it's own reduce phase, so there's no existing concept for reducing the number of docs collected during reduce.

Ultimately it looks like we need some sort of mechanism to be able to call reduce on the slices after the deferred collector applies these limits but before the collector replays the collected docs back to the sub-collectors. This doesn't seem straightforward with the existing semantics since collect and post-process are called right after each other on the concurrent search threadpool so a robust solution will require some more exploration.

Solutions & Follow-ups

In the short term I think it's best to disable concurrent search for both of these aggregation types, which I will open a PR to do so shortly.

@jed326
Copy link
Collaborator Author

jed326 commented Nov 3, 2023

@reta @sohami would like to hear your thoughts on this!

@sohami
Copy link
Collaborator

sohami commented Nov 3, 2023

@jed326 I agree, lets come back to this at a later time. As you said, we will need some mechanism to delay the postCollection to replay on sugAggs. Then in reduce phase both the operation gets done considering the scores correctly for reducing the documents collected by Deferred aggs and then replaying for sub aggs. Just from a quick look seems like Parent join Agg does something similar for which as well we have disabled the concurrent search and can be addressed together as a follow-up.

@jed326 jed326 removed their assignment Nov 20, 2023
@jed326 jed326 changed the title [Concurrent Segment Search] agg parameters for Diversified Sample and Sample aggs are not evaluated at the shard level [Concurrent Segment Search] suppport Diversified Sampler and Sampler aggs Mar 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

No branches or pull requests

3 participants