[Remote Store] Change remote purge threadpool to fixed instead of scaling to limit it to a bounded size #12247
Conversation
❌ Gradle check result for 9882c0c: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Compatibility status: checks whether related components are compatible with change 699b571.
Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git]
[Remote Store] Change remote purge threadpool to fixed instead of scaling to limit it to a bounded size
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
❌ Gradle check result for 7fd8581: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
❌ Gradle check result for 3d08b4c: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Making this pool bounded creates different problems: namely, these tasks can now fail. How will they be retried, and what happens if they aren't retried?
If there is something generating this large number of tasks, can it regulate itself by inspecting the pool, since it seems specifically associated with it?
```java
} catch (Exception e) {
    logger.error("Exception occurred while scheduling listing primary terms from remote store", e);
}
```
This isn't right; other issues could be masked by blindly try/catching exceptions. Why isn't the action listener's onFailure called?
The action listener comes into play only after the list call is successful. In this case the list itself fails.
OK, maybe onFailure isn't the best signal. What would be a better one? Exceptions are mysterious (undocumented) and expensive.
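For illustration, here is a minimal sketch of the alternative being discussed, using a hypothetical stand-in listener interface rather than OpenSearch's actual ActionListener: the scheduling failure is propagated to the caller via onFailure instead of being only logged.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch with hypothetical types; not the code under review.
interface ActionListener<T> {
    void onResponse(T response);
    void onFailure(Exception e);
}

class PurgeScheduler {
    private final ExecutorService remotePurge = Executors.newFixedThreadPool(2);

    void scheduleListing(ActionListener<Void> listener) {
        try {
            // Stand-in for the async "list primary terms" call.
            remotePurge.execute(() -> listener.onResponse(null));
        } catch (Exception e) {
            // Instead of only logging, surface the scheduling failure to the caller,
            // so a full or shut-down pool is not silently masked.
            listener.onFailure(e);
        }
    }
}
```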
```diff
@@ -183,7 +183,7 @@ public static ThreadPoolType fromType(String type) {
     map.put(Names.SYSTEM_WRITE, ThreadPoolType.FIXED);
     map.put(Names.TRANSLOG_TRANSFER, ThreadPoolType.SCALING);
     map.put(Names.TRANSLOG_SYNC, ThreadPoolType.FIXED);
-    map.put(Names.REMOTE_PURGE, ThreadPoolType.SCALING);
+    map.put(Names.REMOTE_PURGE, ThreadPoolType.FIXED);
```
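As a rough plain-Java analogy of what this switch changes (this uses java.util.concurrent, not OpenSearch's actual executor builders): a fixed pool pairs a set number of threads with a bounded queue, so once the backlog hits the cap, further submissions are rejected up front rather than queued indefinitely.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPoolDemo {
    public static void main(String[] args) throws InterruptedException {
        // Fixed-style pool: 2 threads, at most 4 pending tasks.
        ThreadPoolExecutor fixed = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>(4));
        int rejected = 0;
        for (int i = 0; i < 20; i++) {
            try {
                fixed.execute(BoundedPoolDemo::slowTask); // simulate a slow purge task
            } catch (RejectedExecutionException e) {
                rejected++; // upfront rejection instead of unbounded queue growth
            }
        }
        System.out.println("rejected: " + rejected); // > 0 once 2 running + 4 queued
        fixed.shutdown();
        fixed.awaitTermination(10, TimeUnit.SECONDS);
    }

    private static void slowTask() {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```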
Is this creating overhead that will be unused on some cluster configurations?
Yes, but the overhead is quite minimal, given the size of the pool.
Can you help me understand how this pool is used? I don't have a sense for what 'minimal' means, as some customers operate OpenSearch on very resource-constrained systems such as EC2's T3 instance types.
This pool is used for async deletions related to remote store.

> I don't have a sense for what 'minimal' means, as some customers operate OpenSearch on very resource-constrained systems such as EC2's T3 instance types.

Actually, on a domain with no remote store there is not going to be any overhead. A fixed-size threadpool creates a SizeBlockingQueue with some fixed size. With no remote purges happening, it will not create any overhead, as the queue will not have any items in the first place. For remote store enabled clusters, this will help limit the memory impact of this queue by limiting its size.
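To make the memory argument concrete, here is a small sketch, with a plain LinkedBlockingQueue standing in for OpenSearch's SizeBlockingQueue: the capacity is a cap, not a preallocation, so an idle queue holds nothing, while a saturated one simply stops accepting new items.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BoundedQueueSketch {
    public static void main(String[] args) {
        // Capacity is an upper bound, not preallocated storage.
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>(1000);
        System.out.println(queue.size());          // 0: an idle queue costs essentially nothing
        System.out.println(queue.offer(() -> {})); // true until 1000 tasks are pending
        // Once full, offer(...) returns false, bounding the queue's memory footprint.
    }
}
```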
Good point. It doesn't create a different problem; that problem already exists today and will be solved via #8469. Even if we submit the task and it fails, today there is no retry mechanism.
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
❌ Gradle check result for 50dcbf2: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
❌ Gradle check result for 699b571: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
I'm not convinced we should move forward with this pull request; it looks like a small bandaid on a widely impacting scenario. Please help me better understand why this change shouldn't be built upon to better handle the root cause. Key areas of concern:
We do want to solve this problem in the right way. However, that needs to be designed and implemented and is not a quick-fix kind of work. In the meantime, this solution prevents the queue from becoming unmanageable in the first place. We need this life-saving first aid for now, until we can do the surgery.
@gbbafna Thanks for your thoughts. In the issue @harishbhakuni mentioned an approach to execute the deletes in batches rather than individual tasks. This approach sounds easier to deliver in the short term and would not introduce risk around exception handling. What do you think about pivoting in that direction?
Yes, we would need that change as well, as a first level of defense. But this change will be a second level of defense, as the problem can still happen: even the batch deletes can take a lot of time to execute. The chances of the threadpool size going very high would still be there, though reduced. Making it bounded will help us get upfront rejections, which we can use to reject the snapshot deletion as well. @harishbhakuni will be making that change as well, which relies on using a …
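A rough sketch of the "upfront rejection as a second level of defense" idea, with hypothetical names (purgePool, deleteSnapshotFiles) rather than actual OpenSearch APIs: when the bounded pool is saturated, the snapshot deletion fails fast instead of silently piling up work.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PurgeBackpressure {
    // Hypothetical bounded purge pool: 4 threads, at most 500 queued deletions.
    private final ThreadPoolExecutor purgePool = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>(500));

    void deleteSnapshotFiles(Runnable purgeTask) {
        try {
            purgePool.execute(purgeTask);
        } catch (RejectedExecutionException e) {
            // The bounded queue is full: reject the snapshot deletion up front so the
            // caller can retry later, instead of letting work accumulate unbounded.
            throw new IllegalStateException("purge queue full, rejecting snapshot deletion", e);
        }
    }
}
```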
This PR is stalled because it has been open for 30 days with no activity. |
Closing this issue as we are relying on @harishbhakuni's fix #12319. Will revisit later if we see any issues.
[Remote Store] Change remote purge threadpool to fixed instead of scaling to limit it to a bounded size
Description
Related Issues
Resolves #12253
Check List
- Commit changes are listed out in CHANGELOG.md file (See: Changelog)
- Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.