
[Remote Store] Change remote purge threadpool to fixed instead of scaling to limit i… #12247

Closed
wants to merge 4 commits

Conversation

gbbafna (Collaborator) commented Feb 8, 2024

…t to a bounded size

Description

  1. Change remote purge threadpool to fixed threadpool
  2. Handle Runtime Exceptions in unhandled Translog deletion paths

Related Issues

Resolves [#12253]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot (Contributor) commented Feb 8, 2024

❌ Gradle check result for 9882c0c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot (Contributor) commented Feb 8, 2024

Compatibility status:

Checks if related components are compatible with change 699b571

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/performance-analyzer.git]

…t to a bounded size

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
github-actions bot (Contributor) commented Feb 8, 2024

❌ Gradle check result for 7fd8581: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?


Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
github-actions bot (Contributor) commented Feb 8, 2024

❌ Gradle check result for 3d08b4c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

peternied (Member) left a comment

Making this pool bounded creates different problems, namely that these tasks can fail: how will they be retried, and what happens if they aren't retried?

If something is generating this large number of tasks, can it regulate itself by inspecting the pool, since the two seem specifically associated?

Comment on lines +435 to +437
} catch (Exception e) {
    logger.error("Exception occurred while scheduling listing primary terms from remote store", e);
}
Member

This isn't right; other issues could be masked by blindly try/catching exceptions. Why isn't the action listener's onFailure called?

Collaborator Author

The action listener comes into play only after the list call is successful. In this case, the list call itself fails.
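
To illustrate the distinction: a minimal sketch with hypothetical stand-in types (not the actual remote-store listing API), showing why a failure thrown while scheduling the async call never reaches the listener's onFailure and has to be caught by the caller.

```java
import java.util.Set;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical stand-ins, for illustration only.
interface PrimaryTermListener {
    void onResponse(Set<Long> primaryTerms);
    void onFailure(Exception e);
}

class RemoteLister {
    void listPrimaryTermsAsync(PrimaryTermListener listener) {
        // If this throws before the async work is ever dispatched (e.g. the
        // executor rejects the task), listener.onFailure is never invoked.
        throw new RejectedExecutionException("remote_purge queue is full");
    }
}

class PurgeCaller {
    void purge(RemoteLister lister) {
        try {
            lister.listPrimaryTermsAsync(new PrimaryTermListener() {
                public void onResponse(Set<Long> terms) { /* delete stale term data */ }
                public void onFailure(Exception e) { /* only reached after a successful dispatch */ }
            });
        } catch (Exception e) {
            // The synchronous scheduling failure surfaces here, which is why the
            // listing call is wrapped in a try/catch rather than relying on onFailure.
        }
    }
}
```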

Member

OK, maybe onFailure isn't the best signal; what would be a better signal? Exceptions are mysterious (undocumented) and expensive.

@@ -183,7 +183,7 @@ public static ThreadPoolType fromType(String type) {
         map.put(Names.SYSTEM_WRITE, ThreadPoolType.FIXED);
         map.put(Names.TRANSLOG_TRANSFER, ThreadPoolType.SCALING);
         map.put(Names.TRANSLOG_SYNC, ThreadPoolType.FIXED);
-        map.put(Names.REMOTE_PURGE, ThreadPoolType.SCALING);
+        map.put(Names.REMOTE_PURGE, ThreadPoolType.FIXED);
Member

Is this creating overhead that will be unused on some cluster configurations?

Collaborator Author

Yes, but the overhead is quite minimal, given the size of the pool.

Member

Can you help me understand how this pool is used? I don't have a sense for what 'minimal' means, as some customers operate OpenSearch on very resource-constrained systems such as EC2's T3 instance types.

Collaborator Author

This pool is used for async deletions related to the remote store.

I don't have a sense for what 'minimal' means, as some customers operate OpenSearch on very resource-constrained systems such as EC2's T3 instance types.

Actually, on a domain with no remote store there is not going to be any overhead. A fixed-size threadpool creates a SizeBlockingQueue with a fixed capacity. With no remote purges happening, it will not create any overhead, as the queue will not have any items in the first place.

For remote-store-enabled clusters, this will help limit the memory impact of this queue by bounding its size.
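
As a rough sketch of the difference being discussed, using plain JDK executors rather than OpenSearch's thread pool builders (pool sizes and queue capacity below are illustrative, not the actual REMOTE_PURGE settings): an unbounded queue can accumulate an arbitrary backlog of purge tasks, while a fixed pool with a bounded queue rejects excess submissions upfront.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PurgePoolSketch {

    // Unbounded work queue, analogous to the scaling pool's queue: a burst of
    // purge submissions keeps growing the queue (and heap usage) without limit.
    static ThreadPoolExecutor unboundedQueuePool() {
        return new ThreadPoolExecutor(1, 4, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
    }

    // Bounded queue, analogous to the fixed pool's SizeBlockingQueue: once the
    // capacity is reached, further submissions are rejected instead of queued.
    static ThreadPoolExecutor boundedQueuePool() {
        return new ThreadPoolExecutor(4, 4, 0, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100_000), new ThreadPoolExecutor.AbortPolicy());
    }
}
```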

gbbafna (Collaborator, Author) commented Feb 9, 2024

Making this pool bounded creates different problems, namely that these tasks can fail: how will they be retried, and what happens if they aren't retried?

If something is generating this large number of tasks, can it regulate itself by inspecting the pool, since the two seem specifically associated?

Good point. It doesn't create a different problem; that problem already exists today and will be solved via #8469. Even if we submit the task and it fails, there is no retry mechanism today.

gbbafna requested a review from peternied February 9, 2024 04:24
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
github-actions bot (Contributor) commented Feb 9, 2024

❌ Gradle check result for 50dcbf2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
github-actions bot (Contributor) commented Feb 9, 2024

❌ Gradle check result for 699b571: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

peternied (Member)

I'm not convinced we should move forward with this pull request; it looks like a small band-aid on a widely impacting scenario. Please help me better understand why this change shouldn't be built upon to better handle the root cause.

Key areas of concern:

gbbafna (Collaborator, Author) commented Feb 14, 2024

I'm not convinced we should move forward with this pull request; it looks like a small band-aid on a widely impacting scenario. Please help me better understand why this change shouldn't be built upon to better handle the root cause.

Key areas of concern:

* Not signaling task-queue rejection of tasks - exceptions as flow control are expensive, and undocumented exceptional behavior is going to lead to inconsistent behavior around task management. [[Remote Store] Change remote purge threadpool to fixed instead of scaling to limit i… #12247 (comment)](https://github.com/opensearch-project/OpenSearch/pull/12247#discussion_r1484563970)

* Doesn't address the root cause of generating millions of tasks that become unactionable.

We do want to solve this problem in the right way. However, that needs to be designed and implemented and is not a quick fix. In the meantime, this solution prevents the queue from becoming unmanageable in the first place. We need this life-saving first aid for now, until we do the surgery.

peternied (Member) left a comment

@gbbafna Thanks for your thoughts. In the issue, @harishbhakuni mentioned an approach to execute the deletes in batches rather than as individual tasks. This approach sounds easier to deliver in the short term and would not introduce risk around exception handling. What do you think about pivoting in that direction?

gbbafna (Collaborator, Author) commented Feb 15, 2024

@gbbafna Thanks for your thoughts. In the issue, @harishbhakuni mentioned an approach to execute the deletes in batches rather than as individual tasks. This approach sounds easier to deliver in the short term and would not introduce risk around exception handling. What do you think about pivoting in that direction?

Yes, we would need that change as well, as a first level of defense. But this change will be a second level of defense, as the problem can still happen: even the batch deletes can take a lot of time to execute. The chance of the threadpool's queue growing very large would still be there, though it would be reduced. Making it bounded will help us get upfront rejections, which we can use to reject the snapshot deletion as well. @harishbhakuni will be making that change too, which relies on using a FixedThreadPool.
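
A minimal sketch of how an upfront rejection could be surfaced to the caller (plain JDK types; the executor, task shape, and return convention are assumptions for illustration, not the actual snapshot-deletion code path):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

public class PurgeSubmitSketch {

    // With a bounded queue, submit() throws RejectedExecutionException as soon as
    // the pool is saturated, so the triggering operation (e.g. a snapshot deletion)
    // can be failed fast instead of silently enqueueing ever more work.
    static boolean trySchedulePurge(ExecutorService remotePurgeExecutor, List<String> stalePaths) {
        try {
            remotePurgeExecutor.submit(() -> stalePaths.forEach(PurgeSubmitSketch::deleteBlob));
            return true;
        } catch (RejectedExecutionException e) {
            // Propagate the rejection to the caller rather than dropping it.
            return false;
        }
    }

    private static void deleteBlob(String path) {
        // Placeholder for the actual remote-store delete call.
    }
}
```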

opensearch-trigger-bot (Contributor)

This PR is stalled because it has been open for 30 days with no activity.

opensearch-trigger-bot added the stalled label Mar 16, 2024
gbbafna (Collaborator, Author) commented Apr 23, 2024

Closing this PR as we are relying on @harishbhakuni's fix #12319. Will revisit later if we see any issues.

gbbafna closed this Apr 23, 2024
Labels
skip-changelog, stalled