Adding new rewrite override parameter to terms query #15012

Closed
wants to merge 29 commits into from

Conversation

harshavamsi
Contributor

@harshavamsi harshavamsi commented Jul 29, 2024

Description

Some users are reporting slowdowns for MultiTermQueries on keyword fields, as described in #14755. This PR adds a new Rewrite_Override parameter that users can set on a query to force a particular execution behavior. Default falls back to the existing approach, INDEX_ONLY uses only the index structure, and DOC_VALUES_ONLY uses only the doc_values.
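
To make the three modes concrete, here is a minimal sketch of how such an override could map to an execution strategy. The type and method names (`RewriteOverride`, `ExecutionStrategy`, `choose`) are hypothetical and are not taken from this PR's diff. Treating the default mode as "let the cost-based choice decide" is an assumption based on the discussion in #14755.

```java
// Hypothetical sketch: these names are illustrative, not the PR's actual API.
enum RewriteOverride {
    DEFAULT,          // fall back to the existing behavior
    INDEX_ONLY,       // use only the index structure
    DOC_VALUES_ONLY   // use only doc_values
}

enum ExecutionStrategy { INDEX, DOC_VALUES, INDEX_OR_DOC_VALUES }

final class RewriteOverrideSketch {
    // Maps the user-supplied override to the strategy used to run the query.
    static ExecutionStrategy choose(RewriteOverride override) {
        switch (override) {
            case INDEX_ONLY:
                return ExecutionStrategy.INDEX;
            case DOC_VALUES_ONLY:
                return ExecutionStrategy.DOC_VALUES;
            default:
                // Assumption: the default mode leaves the choice to the cost-based rewrite.
                return ExecutionStrategy.INDEX_OR_DOC_VALUES;
        }
    }

    public static void main(String[] args) {
        for (RewriteOverride override : RewriteOverride.values()) {
            System.out.println(override + " -> " + choose(override));
        }
    }
}
```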

Related Issues

Resolves #14755

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Jul 29, 2024
@harshavamsi harshavamsi added backport 2.x Backport to 2.x branch v2.17.0 labels Jul 29, 2024
Contributor

❌ Gradle check result for 6617c87: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
Contributor

❌ Gradle check result for 2ce46c4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

❌ Gradle check result for b3b17a6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
Contributor

❌ Gradle check result for 9829e9c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
Contributor

❌ Gradle check result for c978ea3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
Contributor

github-actions bot commented Aug 1, 2024

❌ Gradle check result for 31e4964: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
Contributor

github-actions bot commented Aug 5, 2024

❌ Gradle check result for 67425be: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Aug 5, 2024

❌ Gradle check result for 79ae3ef: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
@harshavamsi harshavamsi closed this Aug 7, 2024
@harshavamsi harshavamsi reopened this Aug 7, 2024
Contributor

github-actions bot commented Aug 7, 2024

❌ Gradle check result for 79ae3ef: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Aug 7, 2024

❌ Gradle check result for de9e944: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Aug 8, 2024

❌ Gradle check result for ce8a043: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Aug 8, 2024

❌ Gradle check result for bb260f4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Aug 8, 2024

❌ Gradle check result for 815de3d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@reta reta self-requested a review August 22, 2024 17:07
Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
Contributor

✅ Gradle check result for 4b012e9: SUCCESS

Contributor

✅ Gradle check result for 20de476: SUCCESS


codecov bot commented Aug 22, 2024

Codecov Report

Attention: Patch coverage is 70.12987% with 92 lines in your changes missing coverage. Please review.

Project coverage is 71.93%. Comparing base (738cdd3) to head (8d7c3c9).
Report is 27 commits behind head on main.

Files with missing lines                                   Patch %   Lines
...rg/opensearch/index/mapper/KeywordFieldMapper.java      65.80%    31 Missing and 22 partials ⚠️
...g/opensearch/index/query/support/QueryParsers.java      33.33%    7 Missing and 1 partial ⚠️
.../org/opensearch/index/query/RangeQueryBuilder.java      66.66%    2 Missing and 4 partials ⚠️
...org/opensearch/index/query/RegexpQueryBuilder.java      66.66%    2 Missing and 4 partials ⚠️
...g/opensearch/index/query/WildcardQueryBuilder.java      71.42%    2 Missing and 4 partials ⚠️
.../org/opensearch/index/query/FuzzyQueryBuilder.java      80.00%    0 Missing and 4 partials ⚠️
...org/opensearch/index/query/PrefixQueryBuilder.java      86.95%    0 Missing and 3 partials ⚠️
.../org/opensearch/index/query/TermsQueryBuilder.java      87.50%    0 Missing and 3 partials ⚠️
...opensearch/index/mapper/SimpleMappedFieldType.java      50.00%    1 Missing and 1 partial ⚠️
...a/org/opensearch/index/mapper/MappedFieldType.java      83.33%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #15012      +/-   ##
============================================
+ Coverage     71.89%   71.93%   +0.03%     
- Complexity    63791    63845      +54     
============================================
  Files          5249     5249              
  Lines        298149   298362     +213     
  Branches      43084    43134      +50     
============================================
+ Hits         214368   214631     +263     
+ Misses        66126    66091      -35     
+ Partials      17655    17640      -15     


@alexmm-amzn

So, normally, using IndexOrDocValuesQuery is a good idea, since it will intelligently decide whether to use the index (essentially doing a disjunction over matching terms) or doc values (doing a look-up to see if a candidate doc has the given field value). It makes this decision based on cost estimates for the index query and the cost estimate for another possible "lead" iterator.

It doesn't 'intelligently' decide. IndexOrDocValuesQuery estimates the cost of the indexed query (which, according to "Get better cost estimate on MultiTermQuery over few terms", is already not working as expected) and assumes an 'arbitrary 8x penalty to doc values' (quote from the Lucene source code) to decide between the indexed query and the doc values query. That's a very naive assumption and it will backfire if the doc values field is expensive to process. For example, a keyword field that is mapped with doc_values and that contains large vectors of data (>1000 values), e.g. tags, category IDs, ZIP codes etc. In this case the query may need to iterate over all values in the document.

At the very minimum the heuristic should take into account how large a doc values field is in terms of values per document instead of making a static '8x penalty' decision.

I'm concerned about the proposal in this pull request to add an additional set of parameters to the request. The issue was introduced by an internal optimization attempt that regresses in some scenarios. This is an implementation detail and shouldn't bother the client. Adding an opt-out parameter exposes this complexity to the client and outsources the solution to them. It's a long-term liability and makes the API even more complex than it already is. Furthermore, it adds effort on the Java client side and in the documentation, and it breaks compatibility with ES for something that isn't a 'feature' but only a workaround.

OpenSearch should solve this internally, and should opt out of this optimization unless it has solid data that doc values filtering is actually faster for a given request.

Considering that the 'optimization' was introduced in 2.12, this issue should be treated as a regression and receive bugfixes for all OS versions starting with 2.12, to unblock users migrating from 2.11 or older without requiring extra work on the client side. Some users may be stuck at this point or, if they did upgrade, may suffer from performance degradation and be unable to easily roll back.

@reta
Collaborator

reta commented Aug 26, 2024

That's a very naive assumption and it will backfire if the doc values field is expensive to process. For example, a keyword field that is mapped with doc_values and that contains large vectors of data (>1000 values), e.g. tags, category IDs, ZIP codes etc. In this case the query may need to iterate over all values in the document.

Thanks @alexmm-amzn , I suspect it deserves an issue on Apache Lucene side? Why do you think OpenSearch should solve this internally only?

@harshavamsi
Contributor Author

Thanks for your comments @alexmm-amzn. I want to shed some light on how IndexOrDocValuesQuery works. Assume we have a boolean conjunction with two clauses, say a terms query and a range query. The boolean query uses one of the clauses as the lead iterator, based on cost. Say a few terms match from the index, but the range query matches a lot of documents: the boolean query will use the terms query as the lead iterator, since it matches only a few documents. The cost of iterating the terms query is passed to the second query as leadCost. Now, if the range query's estimated cost divided by 8 (the penalty applied to doc values) is <= leadCost, we use the indexed points; otherwise we use the doc values.
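
The decision described above can be sketched roughly as follows. This is a simplified model based on my reading of Lucene's IndexOrDocValuesQuery, not a copy of it: the class and method names (`LeadCostSketch`, `choose`) are hypothetical, and the real code works with scorer suppliers rather than plain numbers.

```java
// Simplified model of the index-vs-doc-values choice; names are hypothetical.
final class LeadCostSketch {
    enum Choice { INDEX, DOC_VALUES }

    // indexCost: estimated number of documents the index-based query would visit.
    // leadCost: cost of the cheapest (lead) clause in the conjunction.
    static Choice choose(long indexCost, long leadCost) {
        // Doc values carry an arbitrary 8x penalty, so the index cost is divided by 8.
        long threshold = indexCost >>> 3;
        return threshold <= leadCost ? Choice.INDEX : Choice.DOC_VALUES;
    }

    public static void main(String[] args) {
        // A sparse terms clause leads (cost 100) and the range clause matches 1,000,000 docs:
        // verifying the 100 candidates against doc values is the cheaper plan.
        System.out.println(choose(1_000_000L, 100L)); // DOC_VALUES
        // The second clause matches only 500 docs, so the index structure is used.
        System.out.println(choose(500L, 100L));       // INDEX
    }
}
```

Note that the outcome depends entirely on how accurate `indexCost` is: an overestimated cost for a clause that actually matches few documents pushes the choice toward doc values.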

This cost model has been working well in Lucene and OpenSearch for a while: numeric ranges have long had IndexOrDocValuesQuery and have seen significant speedups. This is partly because most queries with two clauses, e.g. a term and a range on a numeric field, usually have sparser term queries and big ranges that match lots of documents. While the 8x heuristic is simply a heuristic, it works well in most cases. We cannot confidently say it works well in all cases, simply because query planning is hard.

In this particular case, adding IndexOrDocValuesQuery to termsQuery on keyword fields helps when the termsQuery is the second iterator and matches more documents than the lead iterator. But when the termsQuery is the second iterator and still matches fewer documents, we should be using the index structure. Instead, the cost estimate for termsQuery came out much too high, so the doc_values iterator was chosen, which is significantly slower. Since OpenSearch simply delegates to Lucene for query planning, we have to rely on Lucene's cost being accurate. This patch simply gives "power users" more control over how queries are executed; the real fix should come from Lucene. wdyt?

@alexmm-amzn

alexmm-amzn commented Aug 26, 2024

That's a very naive assumption and it will backfire if the doc values field is expensive to process. For example, a keyword field that is mapped with doc_values and that contains large vectors of data (>1000 values), e.g. tags, category IDs, ZIP codes etc. In this case the query may need to iterate over all values in the document.

Thanks @alexmm-amzn , I suspect it deserves an issue on Apache Lucene side? Why do you think OpenSearch should solve this internally only?

I'm fine with opening a ticket with Lucene to improve the heuristics as well. The decision to allow this optimization is currently made in OpenSearch; it was introduced for keyword fields in 2.12 and has relied on the problematic heuristics ever since. This is why I'm commenting here.

This cost model has been working well in Lucene and OpenSearch for a while: numeric ranges have long had IndexOrDocValuesQuery and have seen significant speedups. This is partly because most queries with two clauses, e.g. a term and a range on a numeric field, usually have sparser term queries and big ranges that match lots of documents. While the 8x heuristic is simply a heuristic, it works well in most cases.

Yes, but keyword fields are particularly prone to having multiple values, especially compared to numeric range fields. Even if using doc values provides performance improvements in most cases, the concerning part is the worst case rather than the average. With large multi-value keyword fields, the behavior may degenerate to O(n*k) for the filtering part. A heuristic that decides whether to apply optimizations to the query plan should be conservative and apply them only if they really improve performance. The current heuristic, however, does not consider the actual cost of iterating over doc values. I've measured a 25% performance degradation (requests per second) with 2.13 vs. 2.11 related to that.
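
To illustrate the O(n*k) worst case, here is a toy model that counts value comparisons when n candidate documents, each with k values, are checked against the query terms. It is only a sketch: real doc values work on ordinals rather than strings, and the names (`DocValuesScanSketch`, `countComparisons`) are made up for this example.

```java
import java.util.Set;

// Toy model of doc-values verification; not how Lucene stores or reads doc values.
final class DocValuesScanSketch {
    // docs[i] holds the values of candidate document i.
    // Returns how many values were inspected across all candidates.
    static long countComparisons(String[][] docs, Set<String> queryTerms) {
        long comparisons = 0;
        for (String[] values : docs) {
            for (String value : values) {
                comparisons++;
                if (queryTerms.contains(value)) {
                    break; // the document matches, so stop scanning its values
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        int n = 1_000; // candidate documents
        int k = 1_000; // values per document
        String[][] docs = new String[n][k];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < k; j++) {
                docs[i][j] = "tag-" + j;
            }
        }
        // No document contains the query term, so all n * k values are inspected.
        System.out.println(countComparisons(docs, Set.of("missing-tag"))); // 1000000
    }
}
```

A fixed 8x penalty does not change with k, which is why the cost of this loop is invisible to the current heuristic.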

This patch simply gives "power users" more control over how queries are executed; the real fix should come from Lucene.

If the heuristic is fixed, this control shouldn't be necessary, yet it adds complexity and cannot easily be removed later without breaking backward compatibility of the API. And until the heuristic is fixed, having this control means that clients are forced to modify their applications to work around this performance regression, which is costly and outsources the fix to the client. Instead of investing in such a temporary workaround, I'd rather suggest working with the Lucene team to improve the heuristic and get it backported to OS 2.12 and later.

Signed-off-by: Harsha Vamsi Kalluri <harshavamsi096@gmail.com>
Contributor

github-actions bot commented Sep 2, 2024

✅ Gradle check result for 8d7c3c9: SUCCESS

@msfroh
Collaborator

msfroh commented Sep 3, 2024

I'd rather suggest working with the Lucene team to improve the heuristic and get it backported to OS 2.12 and later.

There isn't really a "Lucene team". There's the Lucene community, which can accept changes as they see fit. I did submit a proposed improvement to the cost estimate for multi-term keyword queries back in March, and it was looked at in August. It will come out in Lucene 9.12, which will be released at the end of September, in time to get picked up by OpenSearch 2.18.

We could disable the IndexOrDocValuesQuery rewrite for keyword fields in 2.17, essentially going back to the 2.11 behavior when a keyword field is indexed. Then we could take the time to develop a better heuristic before turning it back on. In particular, your callout around multi-valued fields is worth incorporating into the estimate. Even without IndexOrDocValuesQuery, we would still have the incremental improvement where folks can choose not to index keyword fields, and query the field based on doc values (where previously those would just throw an exception).

My main concern with that approach is that we may roll back the performance improvement that some folks have seen in 2.12 and later. (It's hard to tell how many people have seen performance improve, since nobody reaches out to say things have become faster.)

@msfroh
Collaborator

msfroh commented Sep 3, 2024

Chatted with @harshavamsi and we're thinking that a cluster setting that goes back to the 2.11 behavior could be an option.

It's simpler than the per-clause "expert" setting -- you don't need to modify your application code. That provides an immediate solution for the folks who saw better performance on 2.11. Going forward, we can improve the cost estimation logic to do the right thing.

@reta
Collaborator

reta commented Sep 3, 2024

Chatted with @harshavamsi and we're thinking that a cluster setting that goes back to the 2.11 behavior could be an option.

So how would this setting work: would it disable IndexOrDocValuesQuery for keyword fields? Would it be on or off by default? I think the regression warrants IndexOrDocValuesQuery being disabled by default, even if there are some improvements for certain types of queries ("do no harm to existing users" would be the reasoning for me here).

@msfroh
Collaborator

msfroh commented Sep 3, 2024

So how would this setting work: would it disable IndexOrDocValuesQuery for keyword fields? Would it be on or off by default? I think the regression warrants IndexOrDocValuesQuery being disabled by default, even if there are some improvements for certain types of queries ("do no harm to existing users" would be the reasoning for me here).

Yes, I think it should disable it (when the field is indexed and has doc values, as is the default). The indexed query would be returned instead. The other branches (indexed-only/doc values-only) can still kick in for a field that only supports one or the other.
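
Roughly, the branching I have in mind looks like the sketch below. The names (`KeywordQuerySelectionSketch`, `Plan`, `select`) are placeholders, not the eventual implementation, and the `UNSUPPORTED` case for a field with neither structure is an assumption added for completeness.

```java
// Sketch of the proposed selection logic; all names are hypothetical.
final class KeywordQuerySelectionSketch {
    enum Plan { INDEXED_QUERY, DOC_VALUES_QUERY, INDEX_OR_DOC_VALUES_QUERY, UNSUPPORTED }

    // indexed / hasDocValues come from the field mapping;
    // indexOrDocValuesEnabled stands in for the proposed cluster setting.
    static Plan select(boolean indexed, boolean hasDocValues, boolean indexOrDocValuesEnabled) {
        if (indexed && hasDocValues) {
            // The setting only matters when both structures are available.
            return indexOrDocValuesEnabled ? Plan.INDEX_OR_DOC_VALUES_QUERY : Plan.INDEXED_QUERY;
        }
        if (indexed) {
            return Plan.INDEXED_QUERY;
        }
        if (hasDocValues) {
            return Plan.DOC_VALUES_QUERY;
        }
        // Assumption: a field with neither structure cannot be queried.
        return Plan.UNSUPPORTED;
    }

    public static void main(String[] args) {
        // Default mapping (indexed with doc values) and the setting disabled: 2.11-style behavior.
        System.out.println(select(true, true, false)); // INDEXED_QUERY
        // A field that is not indexed can still be queried through doc values.
        System.out.println(select(false, true, false)); // DOC_VALUES_QUERY
    }
}
```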

I think disabled by default might be prudent. For now, we have evidence that folks upgrading from 2.11 are seeing regressions. We have not heard from anyone who likes the new behavior (though, again, there could be people who aren't reaching out to say things are better). If someone upgrades from 2.12-2.16 and wants the behavior from that range, they can enable it.

Going forward, once we have a better heuristic, we could consider making "enabled" be the default, but leave the setting in place just in case.

@reta
Collaborator

reta commented Sep 3, 2024

Going forward, once we have a better heuristic, we could consider making "enabled" be the default, but leave the setting in place just in case.

👍 , thanks @msfroh !

@harshavamsi harshavamsi closed this Sep 4, 2024
@harshavamsi
Contributor Author

Closing in favor of #15637

@alexmm-amzn

Chatted with @harshavamsi and we're thinking that a cluster setting that goes back to the 2.11 behavior could be an option.

Thanks, this makes sense to me. It will unblock upgrades from 2.11 to a later version and also provide a mitigation option for clients that already upgraded to 2.12 or later and only then noticed the regression.

Labels
backport 2.x Backport to 2.x branch Search Search query, autocomplete ...etc v2.17.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Keyword field slowdown while using IndexOrDocValues
6 participants