
[Remote Store] Change remote purge threadpool to fixed instead of scaling to limit i… #12247

Closed
wants to merge 4 commits

Conversation

gbbafna (Collaborator) commented Feb 8, 2024

…t to a bounded size

Description

  1. Change remote purge threadpool to fixed threadpool
  2. Handle Runtime Exceptions in unhandled Translog deletion paths

Related Issues

Resolves [#12253]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot (Contributor) commented Feb 8, 2024

❌ Gradle check result for 9882c0c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot (Contributor) commented Feb 8, 2024

Compatibility status:

Checks if related components are compatible with change 699b571

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git]

…t to a bounded size

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
github-actions bot (Contributor) commented Feb 8, 2024

❌ Gradle check result for 7fd8581: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
github-actions bot (Contributor) commented Feb 8, 2024

❌ Gradle check result for 3d08b4c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

peternied (Member) left a comment

Making this pool bounded creates different problems, namely that these tasks can fail: how will they be retried, and what happens if they aren't retried?

If something is generating this large number of tasks, can it regulate itself by inspecting the pool, since the two seem specifically associated?

Comment on lines +435 to +437
} catch (Exception e) {
    logger.error("Exception occurred while scheduling listing primary terms from remote store", e);
}
Member

This isn't right; other issues could be masked by blindly try/catching exceptions. Why isn't the action listener's onFailure called?

Collaborator Author

The action listener comes into play only after the list call is successful. In this case, the list call itself fails.
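
To illustrate the distinction: a minimal sketch with hypothetical stand-in types (not the actual remote-store listing API), showing why a failure thrown while scheduling the async call never reaches the listener's onFailure and has to be caught by the caller.

```java
import java.util.Set;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-ins, for illustration only.
interface PrimaryTermListener {
    void onResponse(Set<Long> primaryTerms);
    void onFailure(Exception e);
}

class RemoteLister {
    void listPrimaryTermsAsync(PrimaryTermListener listener) {
        // If this throws before the async work is ever dispatched (e.g. the
        // executor rejects the task), listener.onFailure is never invoked.
        throw new RejectedExecutionException("remote_purge queue is full");
    }
}

class PurgeCaller {
    void purge(RemoteLister lister) {
        try {
            lister.listPrimaryTermsAsync(new PrimaryTermListener() {
                public void onResponse(Set<Long> terms) { /* delete stale term data */ }
                public void onFailure(Exception e) { /* only reached after a successful dispatch */ }
            });
        } catch (Exception e) {
            // The synchronous scheduling failure surfaces here, which is why the
            // listing call is wrapped in a try/catch rather than relying on onFailure.
        }
    }
}
```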

Member

OK, maybe onFailure isn't the best signal; what would be a better signal? Exceptions are mysterious (undocumented) and expensive.

@@ -183,7 +183,7 @@ public static ThreadPoolType fromType(String type) {
         map.put(Names.SYSTEM_WRITE, ThreadPoolType.FIXED);
         map.put(Names.TRANSLOG_TRANSFER, ThreadPoolType.SCALING);
         map.put(Names.TRANSLOG_SYNC, ThreadPoolType.FIXED);
-        map.put(Names.REMOTE_PURGE, ThreadPoolType.SCALING);
+        map.put(Names.REMOTE_PURGE, ThreadPoolType.FIXED);
Member

Is this creating overhead that will be unused on some cluster configurations?

Collaborator Author

Yes, but the overhead is quite minimal, given the size of the pool.

Member

Can you help me understand how this pool is used? I don't have a sense for what 'minimal' means, as some customers operate OpenSearch on very resource-constrained systems such as EC2's T3 instance types.

Collaborator Author

This pool is used for async deletions related to the remote store.

I don't have a sense for what 'minimal' means, as some customers operate OpenSearch on very resource-constrained systems such as EC2's T3 instance types.

Actually, on a domain with no remote store there is not going to be any overhead. A fixed-size threadpool creates a SizeBlockingQueue with a fixed capacity. With no remote purges happening, it will not create any overhead, as the queue will not have any items in the first place.

For remote-store-enabled clusters, this will help limit the memory impact of this queue by bounding its size.
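
As a rough sketch of the difference being discussed, using plain JDK executors rather than OpenSearch's thread pool builders (pool sizes and queue capacity below are illustrative, not the actual REMOTE_PURGE settings): an unbounded queue can accumulate an arbitrary backlog of purge tasks, while a fixed pool with a bounded queue rejects excess submissions upfront.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PurgePoolSketch {

    // Unbounded work queue, analogous to the scaling pool's queue: a burst of
    // purge submissions keeps growing the queue (and heap usage) without limit.
    static ThreadPoolExecutor unboundedQueuePool() {
        return new ThreadPoolExecutor(1, 4, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    }

    // Bounded queue, analogous to the fixed pool's SizeBlockingQueue: once the
    // capacity is reached, further submissions are rejected instead of queued.
    static ThreadPoolExecutor boundedQueuePool() {
        return new ThreadPoolExecutor(4, 4, 0, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100_000), new ThreadPoolExecutor.AbortPolicy());
    }
}
```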

gbbafna (Collaborator, Author) commented Feb 9, 2024

Making this pool bounded creates different problems, namely that these tasks can fail: how will they be retried, and what happens if they aren't retried?

If something is generating this large number of tasks, can it regulate itself by inspecting the pool, since the two seem specifically associated?

Good point. It doesn't create a different problem; that problem already exists today and will be solved via #8469. Even if we submit the task and it fails, there is no retry mechanism today.

gbbafna requested a review from peternied February 9, 2024 04:24
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
github-actions bot (Contributor) commented Feb 9, 2024

❌ Gradle check result for 50dcbf2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
github-actions bot (Contributor) commented Feb 9, 2024

❌ Gradle check result for 699b571: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

peternied (Member)

I'm not convinced we should move forward with this pull request; it looks like a small band-aid on a widely impacting scenario. Please help me better understand why this change shouldn't be built upon to better handle the root cause.

Key areas of concern:

gbbafna (Collaborator, Author) commented Feb 14, 2024

I'm not convinced we should move forward with this pull request; it looks like a small band-aid on a widely impacting scenario. Please help me better understand why this change shouldn't be built upon to better handle the root cause.

Key areas of concern:

* Not signaling task-queue rejection of tasks - exceptions as flow control are expensive, and undocumented exceptional behavior is going to lead to inconsistent behavior around task management. [[Remote Store] Change remote purge threadpool to fixed instead of scaling to limit i… #12247 (comment)](https://github.com/opensearch-project/OpenSearch/pull/12247#discussion_r1484563970)

* Doesn't address the root cause of generating millions of tasks that become unactionable.

We do want to solve this problem in the right way. However, that needs to be designed and implemented and is not a quick fix. In the meantime, this solution prevents the queue from becoming unmanageable in the first place. We need this life-saving first aid for now, until we do the surgery.

peternied (Member) left a comment

@gbbafna Thanks for your thoughts. In the issue, @harishbhakuni mentioned an approach to execute the deletes in batches rather than as individual tasks. This approach sounds easier to deliver in the short term and would not introduce risk around exception handling. What do you think about pivoting in that direction?

gbbafna (Collaborator, Author) commented Feb 15, 2024

@gbbafna Thanks for your thoughts. In the issue, @harishbhakuni mentioned an approach to execute the deletes in batches rather than as individual tasks. This approach sounds easier to deliver in the short term and would not introduce risk around exception handling. What do you think about pivoting in that direction?

Yes, we would need that change as well, as a first level of defense. But this change will be a second level of defense, as the problem can still happen: even the batch deletes can take a lot of time to execute. The chance of the threadpool's queue growing very large would still be there, though it would be reduced. Making it bounded will help us get upfront rejections, which we can use to reject the snapshot deletion as well. @harishbhakuni will be making that change too, which relies on using a FixedThreadPool.
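
A minimal sketch of how an upfront rejection could be surfaced to the caller (plain JDK types; the executor, task shape, and return convention are assumptions for illustration, not the actual snapshot-deletion code path):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

public class PurgeSubmitSketch {

    // With a bounded queue, submit() throws RejectedExecutionException as soon as
    // the pool is saturated, so the triggering operation (e.g. a snapshot deletion)
    // can be failed fast instead of silently enqueueing ever more work.
    static boolean trySchedulePurge(ExecutorService remotePurgeExecutor, List<String> stalePaths) {
        try {
            remotePurgeExecutor.submit(() -> stalePaths.forEach(PurgeSubmitSketch::deleteBlob));
            return true;
        } catch (RejectedExecutionException e) {
            // Propagate the rejection to the caller rather than dropping it.
            return false;
        }
    }

    private static void deleteBlob(String path) {
        // Placeholder for the actual remote-store delete call.
    }
}
```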

opensearch-trigger-bot (Contributor)

This PR is stalled because it has been open for 30 days with no activity.

opensearch-trigger-bot added the stalled label Mar 16, 2024
gbbafna (Collaborator, Author) commented Apr 23, 2024

Closing this PR as we are relying on @harishbhakuni's fix #12319. Will revisit later if we see any issues.

gbbafna closed this Apr 23, 2024
Labels
skip-changelog, stalled