[Design Proposal] Admission Control - Workload management to improve cluster stability #3400

Open
mitalawachat opened this issue May 19, 2022 · 0 comments
Labels
distributed framework enhancement Enhancement or improvement to existing feature or request

GOAL

This document proposes a design for admission control to improve cluster stability. For an overview, please refer to the Feature Proposal (#1144).

REQUIREMENTS

Limit the number of requests per node that reach the OpenSearch thread pool for execution.

TENETS

  • Selectiveness : The solution should throttle selectively based on the type of request. For example, always permit critical requests such as health checks.
  • Non-intrusive : The solution should not interfere with current OpenSearch functionality or performance.
  • Scalability : The solution should scale organically with the number and type of nodes in the cluster, as well as with the workload.
  • Extensibility : The admission control framework should be pluggable, so that new local/global views of node resources (such as EBS Volume IOPS) can be incorporated in the future.

APPROACH

The proposed solution will maintain and track the following two views on each node to make throttling decisions:

  1. Request Size:

    1. For every incoming request, tokens equal to its “content-length” (in bytes) will be acquired from the token bucket and held for as long as the request is in flight.
    2. Once the response is ready to be dispatched, the tokens acquired initially will be released back to the token bucket.
    3. All token-bucket operations will be atomic, to maintain consistency under high concurrency.
    4. Any new request that cannot acquire the required number of tokens (equal to its content-length) because of the current in-flight requests will be throttled.
  2. Global JVM Memory Pressure:

    1. This will allow requests to fail fast (and avoid wasted work) whenever the overall JVM memory pressure exceeds the predefined threshold.
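
The request-size accounting described in item 1 above can be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the class name `RequestSizeTokenBucket`, its method names, and the fixed byte capacity are assumptions made for the example, and `AdmissionControlThrottlingException` is only a stand-in for the exception named later in this proposal.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Stand-in for the exception named in this proposal; the real class may differ. */
class AdmissionControlThrottlingException extends RuntimeException {
    AdmissionControlThrottlingException(String message) {
        super(message);
    }
}

/** Hypothetical byte-sized token bucket: tokens are held while a request is in flight. */
class RequestSizeTokenBucket {
    private final long capacityBytes;        // assumed to come from a node setting
    private final AtomicLong availableTokens;

    RequestSizeTokenBucket(long capacityBytes) {
        if (capacityBytes <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacityBytes = capacityBytes;
        this.availableTokens = new AtomicLong(capacityBytes);
    }

    /** Reserves contentLength tokens, or throws if in-flight requests hold too many. */
    void acquire(long contentLength) {
        if (contentLength < 0) {
            throw new IllegalArgumentException("content length must be non-negative");
        }
        while (true) {
            long current = availableTokens.get();
            if (current < contentLength) {
                throw new AdmissionControlThrottlingException(
                    "need " + contentLength + " bytes, only " + current + " available");
            }
            // CAS: succeeds only if no other thread changed the count in between.
            if (availableTokens.compareAndSet(current, current - contentLength)) {
                return;
            }
            // Lost the race; loop and retry against the fresh value.
        }
    }

    /** Returns tokens once the response is dispatched; never exceeds capacity. */
    void release(long contentLength) {
        availableTokens.updateAndGet(v -> Math.min(capacityBytes, v + contentLength));
    }

    long available() {
        return availableTokens.get();
    }
}
```

With a capacity of 100 bytes, acquiring 60 leaves 40 available, so a further request for 50 is throttled until the first request releases its tokens. The cap in `release` is a defensive choice for the sketch, so that an accidental double release cannot grow the bucket beyond its configured size.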

COMPONENTS

  1. AdmissionControlService :

    1. Acts as a container for multiple child-level admission controllers, such as RequestSizeAdmissionController and GlobalJVMMPAdmissionController.
    2. Responsible for bootstrapping and providing access to the child controllers.
    3. Provides the OpenSearch setting definitions (name, field type, default values, etc.) for the child controllers.
  2. RequestSizeAdmissionController :
    Child-level controller that tracks local memory allocation based on request size.

    1. Performs semaphore-based, byte-sized accounting for in-flight requests.
    2. The token count is maintained using performant, thread-safe concurrency constructs (such as compare-and-swap, CAS).
    3. Provides an Acquire and Release interface to reserve tokens from, and return tokens to, the token bucket for the requested size.
    4. Throws AdmissionControlThrottlingException if the Acquire call fails because insufficient tokens are available.
  3. GlobalJVMMPAdmissionController :
    Child-level controller that tracks node-level JVM memory pressure (JVMMP).

    1. Performs a real-time heap memory usage check using the MemoryMXBean interface.
    2. MemoryMXBean is the management interface for the memory system of the Java virtual machine, and is used in a similar way by the OpenSearch real memory circuit breaker.
    3. Provides a check-limit interface that throws AdmissionControlThrottlingException if the JVMMP at the time of the check exceeds the predefined threshold.
  4. AdmissionControlHandler :

    1. We can add a new Netty handler to the handler chain in Netty4HttpServerTransport's initChannel.
    2. AdmissionControlHandler will interact with AdmissionControlService to decide whether a request should proceed or be throttled.
  5. Stats API :

    1. We will track rejection stats on every node.
    2. These stats could be retrieved using:
      a. /_admission_control/stats
      b. /_admission_control/_nodes/{nodeId}/stats
    3. To support the stats API, we would create AdmissionControlStatsAction, RestAdmissionControlStatsAction, TransportAdmissionControlStatsAction, AdmissionControlStatsRequest (extends BaseNodesRequest), AdmissionControlStatsResponse (extends BaseNodesResponse), AdmissionControlStats, etc.
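
To make the relationship between the service and a child controller more concrete, here is a minimal sketch using only JDK classes. The `AdmissionController` interface, the `admit()` method, the injectable `DoubleSupplier`, the percentage-based threshold, and the single rejection counter are all assumptions made for illustration; none of them are defined by this proposal, and the actual OpenSearch settings and stats plumbing are omitted.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.DoubleSupplier;

/** Stand-in for the exception named in this proposal; the real class may differ. */
class AdmissionControlThrottlingException extends RuntimeException {
    AdmissionControlThrottlingException(String message) {
        super(message);
    }
}

/** Hypothetical common interface for child controllers (an assumption of this sketch). */
interface AdmissionController {
    void checkLimit();
}

/** Checks heap usage against a threshold at the moment a request arrives. */
class GlobalJvmmpAdmissionController implements AdmissionController {
    private final double thresholdPercent;          // assumed to come from a setting
    private final DoubleSupplier heapUsedPercent;   // injectable so it can be tested

    GlobalJvmmpAdmissionController(double thresholdPercent) {
        this(thresholdPercent, GlobalJvmmpAdmissionController::currentHeapUsedPercent);
    }

    GlobalJvmmpAdmissionController(double thresholdPercent, DoubleSupplier heapUsedPercent) {
        this.thresholdPercent = thresholdPercent;
        this.heapUsedPercent = heapUsedPercent;
    }

    /** Reads current heap usage through the JDK's MemoryMXBean. */
    static double currentHeapUsedPercent() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        // getMax() returns -1 when the maximum is undefined; fall back to committed.
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return 100.0 * heap.getUsed() / limit;
    }

    @Override
    public void checkLimit() {
        double used = heapUsedPercent.getAsDouble();
        if (used > thresholdPercent) {
            throw new AdmissionControlThrottlingException(
                "heap usage " + used + "% exceeds threshold " + thresholdPercent + "%");
        }
    }
}

/** Container that consults every child controller and counts rejections. */
class AdmissionControlService {
    private final List<AdmissionController> controllers;
    private final AtomicLong rejectionCount = new AtomicLong();

    AdmissionControlService(List<AdmissionController> controllers) {
        this.controllers = List.copyOf(controllers);
    }

    /** Returns true if every controller admits the request, false if any throttles it. */
    boolean admit() {
        try {
            for (AdmissionController controller : controllers) {
                controller.checkLimit();
            }
            return true;
        } catch (AdmissionControlThrottlingException e) {
            rejectionCount.incrementAndGet();
            return false;
        }
    }

    long rejectionCount() {
        return rejectionCount.get();
    }
}
```

In this sketch a handler would call `admit()` before passing the request further down the chain, and the rejection counter is the kind of per-node value the stats endpoints could expose. Injecting the heap reading as a `DoubleSupplier` is purely a testing convenience here.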

[Figure: AdmissionControl]

[Figure: AdmissionControl - Class]

@mitalawachat mitalawachat added enhancement Enhancement or improvement to existing feature or request untriaged labels May 19, 2022
@mitalawachat mitalawachat changed the title [Design Proposal] Admission Control - Workload management to improve cluster stability [WIP] [Design Proposal] Admission Control - Workload management to improve cluster stability May 31, 2022