[Feature Proposal] Add Admission Control - Workload management to improve cluster stability #1144
Comments
Hi @mitalawachat, could you please elaborate? What do you mean by "early layer"? What mechanism is currently used to restrict incoming requests, and what do you want to change?
Closing this since we haven't received a response for a while. @mitalawachat, please reopen if needed.
@anasalkouz Please help reopen the issue. I could not find time to work on this actively earlier; apologies for that.
I'll reopen. Thanks @mitalawachat
This feels like a more upstream and generic version of #1329, or at least something that sits ahead of it. My question is: if back-pressure is implemented well in all paths, do we still want admission control? How do you see these two features coexisting? Can admission control be imagined as a cluster-wide, eventually consistent feature that improves load balancing? In an asynchronous world I would like a feature where I can generate a request ID on the client, make a request to any node (or make the same request to multiple nodes, including the generated ID), get the job queued and executed at least once on the most available node, and then come back for the results later by supplying my ID to any node.
Yes, I was proposing that admission control be executed before a request reaches the OpenSearch thread pool for execution. Backpressure and circuit breakers are triggered later. To achieve this, I was thinking we could place a Netty handler at Netty4HttpServerTransport's
If a node is already under stress, or there is a sudden burst of requests, admission control aims to prevent the node from doing any work (auth, etc.) that would turn out to be wasted when the requests are later rejected by other mechanisms (backpressure or circuit breakers).
I was scoping admission control for throttling, but load balancing seems like a nice feature for a future extension too.
@mitalawachat I understand why admission control is better in theory than the existing backpressure and circuit-breaker methods, but you still have to show at least anecdotal examples where those are net worse than a whole new admission control feature. Note that having multiple ways to prevent the cluster from overloading can make the system harder for users to reason about.
Hi @dblock, we've actually implemented a version of admission control in the Amazon Managed OpenSearch Service, and it has been available for around two years. It has helped a great deal with cluster stability, especially on smaller instance types with limited resources. On t2/t3 EC2 instance types it has shown a 75%-85% reduction in node drops across various regions, and a ~70% reduction in node drops on other EC2 instance types. We've seen it help when a user had downgraded a cluster to an underscaled configuration and JVM usage hovered at 90-95% continuously for a few hours: admission control performed selective load shedding, allowing the cluster to do some useful management work that would otherwise have ended in node drops or out-of-memory issues. We've also seen it prevent node drops and out-of-memory issues on a user's cluster during a spike in search traffic that caused sharp JVM spikes on a few data nodes. With admission control in place, 429 statuses were proactively sent back to clients from the affected nodes, preventing further JVM spikes and keeping those nodes from running into issues. We are planning to enhance the framework with community feedback while open-sourcing it as a core component.
Thanks @mitalawachat! I now understand where you're coming from. These are some very strong numbers. Great to see this feature open-sourced. Make some PRs!
Closing this issue, as this feature is now tracked in the RFC for AdmissionController: #8910
Feature Proposal: Add Admission Control - Workload management to improve cluster stability
Overview:
Admission control is a workload-management knob that restricts new incoming requests early, as soon as a node begins to come under stress. It would be resource-aware: it accounts for the cost of each new incoming request (memory occupancy) and tracks the point-in-time state of the node (overall JVM memory pressure, JVMMP). This allows real-time, state-based admission control on the node and helps prevent clusters from being overloaded by incoming traffic, whether from a steady increase or a sudden surge.
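A minimal sketch of the resource-aware check described in the overview, written in Python for brevity rather than in the Java that OpenSearch uses. The function name, the 90% threshold, and the byte figures are illustrative assumptions and are not taken from the proposal.

```python
def should_admit(heap_used_bytes, heap_max_bytes, request_cost_bytes, threshold=0.90):
    """Decide whether a request may be admitted.

    Combines the point-in-time node state (heap currently in use) with the
    estimated memory cost of the incoming request. The default threshold is
    a hypothetical value chosen only for this example.
    """
    if heap_max_bytes <= 0:
        raise ValueError("heap_max_bytes must be positive")
    # Projected heap occupancy if this request were allowed to run.
    projected = (heap_used_bytes + request_cost_bytes) / heap_max_bytes
    return projected < threshold


# Example: a node with a 1000-byte heap (toy numbers for readability).
admitted = should_admit(heap_used_bytes=800, heap_max_bytes=1000, request_cost_bytes=50)
rejected = should_admit(heap_used_bytes=880, heap_max_bytes=1000, request_cost_bytes=50)
```

In this sketch the first request is admitted (projected occupancy 85%) and the second is rejected (93%); a real implementation would obtain heap usage from the JVM and estimate request cost from the request payload.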
Requirements:
Problem Statement: How do we throttle requests dynamically based upon the Request URI pattern?
The idea is to limit the number of requests per node that reach the OpenSearch thread pool for execution. The limit can be based on the number of requests of a particular type that are already in flight, i.e. currently being executed. For example, each request would acquire tokens from a bucket before executing and release them after execution completes. Once all tokens are acquired, any additional requests on the node are throttled (with a too-many-requests exception) until tokens become available again. This primarily protects the node's resources from brownouts while ensuring they remain available to other request types without contention.
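The token scheme above could be sketched roughly as follows, again in Python for brevity. The route patterns, the request type names, the per-type limits, and every class and function name here are hypothetical; the proposal does not specify them.

```python
import re
import threading

# Hypothetical mapping from request URI pattern to request type.
ROUTES = [
    (re.compile(r"/_search$"), "search"),
    (re.compile(r"/_bulk$"), "bulk"),
]


def classify(uri):
    """Return the request type for a URI, or None if it is not throttled."""
    path = uri.split("?", 1)[0]  # ignore the query string
    for pattern, request_type in ROUTES:
        if pattern.search(path):
            return request_type
    return None


class TooManyRequests(Exception):
    """Raised when no token is free; a REST layer would map this to HTTP 429."""


class TokenBuckets:
    def __init__(self, limits):
        # limits: request type -> maximum number of in-flight requests
        self._limits = dict(limits)
        self._in_flight = {name: 0 for name in self._limits}
        self._lock = threading.Lock()

    def acquire(self, request_type):
        with self._lock:
            if self._in_flight[request_type] >= self._limits[request_type]:
                raise TooManyRequests(request_type)
            self._in_flight[request_type] += 1

    def release(self, request_type):
        with self._lock:
            self._in_flight[request_type] -= 1


def handle(buckets, uri, execute):
    """Acquire a token before executing and release it when execution ends."""
    request_type = classify(uri)
    if request_type is None:
        return execute()  # untracked request types pass straight through
    buckets.acquire(request_type)
    try:
        return execute()
    finally:
        buckets.release(request_type)
```

Because each request type has its own bucket, exhausting the search tokens in this sketch does not block bulk requests, which is the isolation between request types that the paragraph above describes.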
Describe alternatives you've considered
Circuit breakers:
Although the circuit breaker can safely be assumed to be the last line of defense that protects nodes from browning out, we still need a mechanism to regulate the number of requests reaching the execution phase. Admission control would understand the workload a node can take and prevent the node from being overwhelmed by cutting off any additional workload.
Also, circuit breakers cannot enforce limits per REST endpoint.
Proposed Solution:
Dynamic Cluster Settings exposed by Admission Control: