Coordinator can return partial results after the timeout when allow_partial_search_results is true #16681

kkewwei · 2024-11-19T09:08:29Z

Description

In the query phase, the coordinator searches all shards concurrently. If any shard is blocked or responds very slowly, the coordinating node stays stuck waiting for it, even if a timeout is set.

This PR makes the coordinator's wait respect the timeout: if the timeout is exceeded, the coordinator treats the shard as failed and goes on to the fetch phase.
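To illustrate the behavior described above, here is a minimal, self-contained sketch. It is not the code in this PR and does not use OpenSearch classes: `PartialResultsSketch`, `QueryPhaseResult`, and `queryAllShards` are hypothetical names, and shard queries are modeled as plain `Callable<String>` tasks. It only shows the general idea of waiting on concurrent shard requests against a single deadline and marking late shards as failed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: all names here are hypothetical, not OpenSearch classes.
final class PartialResultsSketch {

    /** Hits from the shards that answered in time, plus the shards that were given up on. */
    static final class QueryPhaseResult {
        final Map<Integer, String> hitsByShard = new LinkedHashMap<>();
        final List<Integer> failedShards = new ArrayList<>();
    }

    static QueryPhaseResult queryAllShards(List<Callable<String>> shardQueries,
                                           long timeoutMillis,
                                           ExecutorService pool) throws InterruptedException {
        // One deadline shared by every shard, so the coordinator waits
        // roughly timeoutMillis in total rather than per shard.
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);

        // Submit every shard query up front so they run concurrently.
        List<Future<String>> futures = new ArrayList<>();
        for (Callable<String> query : shardQueries) {
            futures.add(pool.submit(query));
        }

        QueryPhaseResult result = new QueryPhaseResult();
        for (int shardId = 0; shardId < futures.size(); shardId++) {
            Future<String> future = futures.get(shardId);
            long remainingNanos = Math.max(0L, deadlineNanos - System.nanoTime());
            try {
                result.hitsByShard.put(shardId, future.get(remainingNanos, TimeUnit.NANOSECONDS));
            } catch (TimeoutException | ExecutionException e) {
                // A late or failing shard is recorded as failed; the caller
                // can still move on with the hits it already has.
                future.cancel(true);
                result.failedShards.add(shardId);
            }
        }
        return result;
    }
}
```

In this sketch, a caller that allows partial results would proceed to the next phase with `hitsByShard` and report `failedShards` alongside the response, instead of blocking until the slowest shard answers.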

Related Issues

Resolves #817 (comment)

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2024-11-19T09:37:54Z

❌ Gradle check result for 61d84d1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

jed326 · 2024-11-19T17:55:49Z

server/src/main/java/org/opensearch/action/search/SearchTransportService.java

+            long leftTimeMills;
+            if (queryPhase) {
+                // it's costly in query phase.
+                leftTimeMills = task.queryPhaseTimeout() - (System.currentTimeMillis() - task.startTimeMills());


What's the motivation behind the queryPhaseTimeoutPercentage concept? Whether the query phase or the fetch phase takes longer is going to depend on the query and the setup, and it doesn't seem very intuitive for a user to understand how to use this. For example, a query that matches a lot of sparse documents using searchable snapshots might spend much longer in the fetch phase, while a query that performs complex aggregations might spend a lot longer in the query phase.

kkewwei · 2024-11-21

I want to reserve some time for the subsequent phase as a safeguard, so that each phase is guaranteed a certain share of the timeout. If the earlier phase finishes quickly, the reservation does not reduce the time available to the later phases.

Without such a reservation, a shard that is blocked in the query phase can use up the entire timeout. Even if it returns once the timeout fires, no time is left for the subsequent phases, and the timeout becomes meaningless.
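The arithmetic behind this reservation can be sketched as follows. This is an illustration, not code from the PR: `PhaseBudgetSketch` and its methods are hypothetical, the percentage is modeled as an integer from 0 to 100 (the PR's actual type and default are not shown in this thread), and `leftTimeMillis` mirrors the `budget - (now - start)` shape of the `leftTimeMills` line in the diff above.

```java
// Illustrative only: hypothetical helper, not part of the PR.
final class PhaseBudgetSketch {

    /** Share of the overall timeout that the query phase is allowed to consume. */
    static long queryPhaseTimeoutMillis(long totalTimeoutMillis, int queryPhasePercent) {
        if (queryPhasePercent < 0 || queryPhasePercent > 100) {
            throw new IllegalArgumentException("percent must be within [0, 100]");
        }
        return totalTimeoutMillis * queryPhasePercent / 100;
    }

    /** Time left in a budget measured from the start of the request; never negative. */
    static long leftTimeMillis(long budgetMillis, long startMillis, long nowMillis) {
        return Math.max(0L, budgetMillis - (nowMillis - startMillis));
    }
}
```

As an assumed example, take a 10,000 ms timeout with 80% assigned to the query phase. A shard that blocks during the query phase is abandoned at 8,000 ms, which still leaves 2,000 ms for the fetch phase. If the query phase instead finishes after 3,000 ms, the fetch phase has the full remaining 7,000 ms.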

jed326 · 2024-11-22

Hmm, should we have separate timeouts for the coordinator and the shard-level search tasks then? I still think it's pretty unintuitive to use a % like this.

kkewwei · 2024-11-24

@jed326, thanks for your reply. How about introducing a new setting, coordinator_timeout?

github-actions · 2024-11-21T13:18:29Z

❌ Gradle check result for f2cb9f7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-11-24T04:56:42Z

❌ Gradle check result for 3e56548: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-11-25T04:52:14Z

❌ Gradle check result for c55f45e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-11-25T06:39:12Z

❌ Gradle check result for e2305c7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-11-25T08:15:53Z

❌ Gradle check result for b6802ae: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

kkewwei · 2024-11-28T13:58:06Z

❌ Gradle check result for b6802ae: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

org.opensearch.search.basic.SearchWithRandomExceptionsIT.testRandomExceptions #15828

github-actions · 2024-11-30T00:39:10Z

❌ Gradle check result for d32aac8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-12-05T13:16:17Z

❌ Gradle check result for 0bb55bc: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2024-12-06T04:10:57Z

❕ Gradle check result for 7e5b243: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov · 2024-12-06T04:11:26Z

Codecov Report

Attention: Patch coverage is 75.00000% with 10 lines in your changes missing coverage. Please review.

Project coverage is 72.10%. Comparing base (42dc22e) to head (7e5b243).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...va/org/opensearch/action/search/SearchRequest.java | 75.00% | 0 Missing and 4 partials ⚠️ |
| ...ensearch/action/search/SearchTransportService.java | 72.72% | 2 Missing and 1 partial ⚠️ |
| ...opensearch/action/search/SearchRequestBuilder.java | 0.00% | 2 Missing ⚠️ |
| ...arch/rest/action/search/RestMultiSearchAction.java | 66.66% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #16681      +/-   ##
============================================
+ Coverage     72.05%   72.10%   +0.05%     
- Complexity    65183    65201      +18     
============================================
  Files          5318     5318              
  Lines        303993   304030      +37     
  Branches      43990    43997       +7     
============================================
+ Hits         219028   219214     +186     
+ Misses        67046    66889     -157     
- Partials      17919    17927       +8

github-actions · 2024-12-12T13:14:20Z

❌ Gradle check result for 3a7fd6e: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot added the enhancement label (Enhancement or improvement to existing feature or request) Nov 19, 2024

kkewwei force-pushed the return_partial_result branch from 3b9454f to 27eed03 on November 19, 2024 09:13

kkewwei changed the title ~~opensearch should returns partial results after the timeout in coordinate node when allow_partial_search_results is true~~ Coordinator can return partial results after the timeout when allow_partial_search_results is true Nov 19, 2024

kkewwei force-pushed the return_partial_result branch from 27eed03 to 61d84d1 on November 19, 2024 09:14

kkewwei mentioned this pull request Nov 19, 2024

Support timeout based search request cancellation #817

Open

jed326 reviewed Nov 19, 2024

opensearch-ci-bot mentioned this pull request Nov 21, 2024

[AUTOCUT] Gradle Check Flaky Test Report for SearchTimeoutIT #16056

Closed

kkewwei force-pushed the return_partial_result branch from e3dda9a to f2cb9f7 on November 21, 2024 12:37

opensearch-ci-bot mentioned this pull request Nov 21, 2024

[AUTOCUT] Gradle Check Flaky Test Report for AllocationConstraintsTests #15831

Open

kkewwei force-pushed the return_partial_result branch from f2cb9f7 to 3e56548 on November 24, 2024 04:46

kkewwei force-pushed the return_partial_result branch from 3e56548 to c55f45e on November 25, 2024 04:41

kkewwei force-pushed the return_partial_result branch from c55f45e to e2305c7 on November 25, 2024 06:03

kkewwei force-pushed the return_partial_result branch from e2305c7 to b6802ae on November 25, 2024 06:54

opensearch-ci-bot mentioned this pull request Nov 25, 2024

[AUTOCUT] Gradle Check Flaky Test Report for SearchWithRandomExceptionsIT #15828

Closed

kkewwei force-pushed the return_partial_result branch 2 times, most recently from 8a2e7cd to d32aac8, on November 29, 2024 23:43

opensearch-ci-bot mentioned this pull request Nov 30, 2024

[AUTOCUT] Gradle Check Flaky Test Report for RemoteStoreMultipartIT #15819

Open

kkewwei force-pushed the return_partial_result branch from d32aac8 to 0bb55bc on December 5, 2024 12:39

Coordinator can return partial results after the timeout when allow_p…

7e5b243

…artial_search_results is true Signed-off-by: kkewwei <kewei.11@bytedance.com> Signed-off-by: kkewwei <kkewwei@163.com>

kkewwei force-pushed the return_partial_result branch from 0bb55bc to 7e5b243 on December 6, 2024 03:10

This was referenced Dec 10, 2024

[AUTOCUT] Gradle Check Flaky Test Report for RemoteStoreIT #16145

Open

[AUTOCUT] Gradle Check Flaky Test Report for SharedClusterSnapshotRestoreIT #15845

Open

Merge branch 'main' into return_partial_result

3a7fd6e

Signed-off-by: kkewwei <kewei.11@bytedance.com>

This was referenced Dec 12, 2024

[AUTOCUT] Gradle Check Flaky Test Report for MinimumClusterManagerNodesIT #14289

Open

[AUTOCUT] Gradle Check Flaky Test Report for RemoteStoreReplicationSourceTests #16683

Open
