-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix timeout exception and Add Integ test for Master task throttling #4588
Fix timeout exception and Add Integ test for Master task throttling #4588
Conversation
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Gradle Check (Jenkins) Run Completed with:
|
...c/internalClusterTest/java/org/opensearch/clustermanager/ClusterManagerTaskThrottlingIT.java
Show resolved
Hide resolved
* Here we will assert the client behaviour that client's request is not | ||
* failed with throttling exception but timeout exception. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you pls help me understand why this is the right behavior ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This throttling is introduced for internal task submission, cx don't have to be aware of it.
Without throttling, such tasks will wait in master's pending task queue. If tasks gets timeout in queue itself, then master return ClusterProcessEventTimeoutException
.
With throttling we are making those tasks wait on data node itself instead of master node. So let's say tasks was throttled throughout the time and gets timedout, it is the same as task is waiting in master queue without throttling.
From cx perspective, for both case master node is busy and not able to process the event in given timeout. For that we should throw timeout exception only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer Throttling
exception as the task didn't wait in master's pending task queue , but was outright rejected . The customers which are submitting the task should be aware of the difference between two . This would help us in the future if we want to take different actions on each of those. What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of now we can keep it as internal, we will go incremental and expose this to cx later. As we will need to expose various retries configuration as well for tuning. Till then cx can treat it as timeout exception only.
I will create followup issue for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 lets create a different issue to expose ThrottlingException to user. Also need to cross check if this would require changes in the clients
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created followup issue for it: #4724
/** | ||
* If tasks gets timed out in retrying on throttling, | ||
* it should send cluster event timeout exception. | ||
*/ | ||
@Override | ||
public Exception getTimeoutException(Exception e) { | ||
return new ProcessClusterEventTimeoutException(request.masterNodeTimeout, actionName); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are pretty much discarding the underlying exception which is a serious impairment to debuggability. Can we wrap exception causes instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For debuggability, RetrayableAction
will add the underlying exception in suppressed exception list and which will be logged on data node. So in data node logs, throttling exception will be logged. Master will also logs on throttling exceptions.
We just want to keep user experience consistent, so to user it will be thrown as ProcessClusterEventTimeoutException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add another overloaded constructor for and keep the underlying cause as is.
public ProcessClusterEventTimeoutException(TimeValue timeValue, String source, Exception ex) {
super("failed to process cluster event (" + source + ") within " + timeValue, ex);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually when tasks is getting timed out due to throttling, those throttling exception is returned as suppressed exception in process cluster event timeout exception. RetryableAction
does this for us, it stores upto last 4 exceptions and add it in suppressed exception.
From that user will know that task is getting timed out due to throttling till we expose the throttling exception to user.
Do we still need to add underlying cause in event timeout exception?
sample exception which is returned to user:
{
"error" : {
"root_cause" : [
{
"type" : "process_cluster_event_timeout_exception",
"reason" : "failed to process cluster event (indices:admin/mapping/put) within 30s",
"suppressed" : [
{
"type" : "cluster_manager_throttling_exception",
"reason" : "Throttling Exception : Limit exceeded for put-mapping"
},
{
"type" : "cluster_manager_throttling_exception",
"reason" : "Throttling Exception : Limit exceeded for put-mapping"
},
{
"type" : "cluster_manager_throttling_exception",
"reason" : "Throttling Exception : Limit exceeded for put-mapping"
}
]
}
],
"type" : "process_cluster_event_timeout_exception",
"reason" : "failed to process cluster event (indices:admin/mapping/put) within 30s",
"suppressed" : [
{
"type" : "cluster_manager_throttling_exception",
"reason" : "Throttling Exception : Limit exceeded for put-mapping"
},
{
"type" : "cluster_manager_throttling_exception",
"reason" : "Throttling Exception : Limit exceeded for put-mapping"
},
{
"type" : "cluster_manager_throttling_exception",
"reason" : "Throttling Exception : Limit exceeded for put-mapping"
}
]
},
"status" : 503
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice should this be 429 instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have created followup issue (#4724) for exposing throttling exception to user and we will make it as 429.
Gradle Check (Jenkins) Run Completed with:
|
…' into throttling-IT Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
8dd8d10
to
389d1e8
Compare
Gradle Check (Jenkins) Run Completed with:
|
Codecov Report
@@ Coverage Diff @@
## feature/master-task-throttling #4588 +/- ##
=================================================================
Coverage 70.71% 70.72%
+ Complexity 57644 57621 -23
=================================================================
Files 4665 4666 +1
Lines 276614 276704 +90
Branches 40297 40300 +3
=================================================================
+ Hits 195621 195701 +80
Misses 64737 64737
- Partials 16256 16266 +10
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create a follow up issue for changing the final exception type . Apart from that, LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comments otherwise looks good. Thanks @dhwanilpatel
* Throttling exception and data node is performing retries on it. | ||
* | ||
*/ | ||
public void testThrottlingForRemoteMaster() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[minor]: replace Master with "ClusterManager" at all places applicable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. Updated.
|
||
@Override | ||
public void onResponseSent(long requestId, String action, Exception error) { | ||
if (action.contains("mapping")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here it is expecting all mapping requests to be throttled, which isnt the case. I am missing something here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all requests will be throttled, it will be on random basis.
Here we have overridden TransportServiceMessageListener of cluster manager node to get count of throttled requests. This function will be called whenever exception is returned from cluster manager and we will calculate the throttled mapping requests over here. Based on throttled count we will assert master is getting (totalRequest + throttledRequests).
Gradle Check (Jenkins) Run Completed with:
|
a99bfaf
to
785468b
Compare
Gradle Check (Jenkins) Run Completed with:
|
785468b
to
22f2a97
Compare
Gradle Check (Jenkins) Run Completed with:
|
6a6f9ad
to
55ed984
Compare
Gradle Check (Jenkins) Run Completed with:
|
Gradle Check (Jenkins) Run Completed with:
|
Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
55ed984
to
b80a4cc
Compare
Gradle Check (Jenkins) Run Completed with:
|
We need a changelog for the build to succeed |
…pensearch-project#4588) * Fix timeout exception and Add Integ test for Master task throttling Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Basic Throttler Framework / Exponential Basic back off policy. Add basic thorttler/exponential backoff policy for retry/Defination o… #3527 Changes required in Master node to perform throttling. Master node changes for master task throttling #3882 Changes required in Data node to perform retry on throttling. Data node changes for master task throttling #4204 Provide support for all task type in throttling framework. Onboarding of few task types to throttling #4542 Integration Tests (Fix timeout exception and Add Integ test for Master task throttling #4588 Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
…t#4986) Basic Throttler Framework / Exponential Basic back off policy. Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527 Changes required in Master node to perform throttling. Master node changes for master task throttling opensearch-project#3882 Changes required in Data node to perform retry on throttling. Data node changes for master task throttling opensearch-project#4204 Provide support for all task type in throttling framework. Onboarding of few task types to throttling opensearch-project#4542 Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588 Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
…t#4986) Basic Throttler Framework / Exponential Basic back off policy. Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527 Changes required in Master node to perform throttling. Master node changes for master task throttling opensearch-project#3882 Changes required in Data node to perform retry on throttling. Data node changes for master task throttling opensearch-project#4204 Provide support for all task type in throttling framework. Onboarding of few task types to throttling opensearch-project#4542 Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588 Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
…t#4986) Basic Throttler Framework / Exponential Basic back off policy. Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527 Changes required in Master node to perform throttling. Master node changes for master task throttling opensearch-project#3882 Changes required in Data node to perform retry on throttling. Data node changes for master task throttling opensearch-project#4204 Provide support for all task type in throttling framework. Onboarding of few task types to throttling opensearch-project#4542 Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588 Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
…t#4986) Basic Throttler Framework / Exponential Basic back off policy. Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527 Changes required in Master node to perform throttling. Master node changes for master task throttling opensearch-project#3882 Changes required in Data node to perform retry on throttling. Data node changes for master task throttling opensearch-project#4204 Provide support for all task type in throttling framework. Onboarding of few task types to throttling opensearch-project#4542 Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588 Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
…t#4986) Basic Throttler Framework / Exponential Basic back off policy. Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527 Changes required in Master node to perform throttling. Master node changes for master task throttling opensearch-project#3882 Changes required in Data node to perform retry on throttling. Data node changes for master task throttling opensearch-project#4204 Provide support for all task type in throttling framework. Onboarding of few task types to throttling opensearch-project#4542 Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588 Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
…t#4986) Basic Throttler Framework / Exponential Basic back off policy. Add basic thorttler/exponential backoff policy for retry/Defination o… opensearch-project#3527 Changes required in Master node to perform throttling. Master node changes for master task throttling opensearch-project#3882 Changes required in Data node to perform retry on throttling. Data node changes for master task throttling opensearch-project#4204 Provide support for all task type in throttling framework. Onboarding of few task types to throttling opensearch-project#4542 Integration Tests (Fix timeout exception and Add Integ test for Master task throttling opensearch-project#4588 Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
* Add basic thorttler/exponential backoff policy for retry/Defination of throttling exception (#3856) * Corrected Java doc for Throttler * Changed the default behaviour of Throttler to return Optional * Removed generics from Throttler and used String as key * Ignore backport / autocut / dependabot branches for gradle checks on push * Master node changes for master task throttling (#3882) * Data node changes for master task throttling (#4204) * Onboarding of few task types to throttling (#4542) * Fix timeout exception and Add Integ test for Master task throttling (#4588) * Complete TODO for version change and removed unused classes(Throttler and Semaphore) (#4846) * Remove V1 version from throttling testcase Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Basic Throttler Framework / Exponential Basic back off policy. Add basic thorttler/exponential backoff policy for retry/Defination o… #3527 Changes required in Master node to perform throttling. Master node changes for master task throttling #3882 Changes required in Data node to perform retry on throttling. Data node changes for master task throttling #4204 Provide support for all task type in throttling framework. Onboarding of few task types to throttling #4542 Integration Tests (Fix timeout exception and Add Integ test for Master task throttling #4588 Signed-off-by: Dhwanil Patel <dhwanip@amazon.com>
Signed-off-by: Dhwanil Patel dhwanip@amazon.com
Description
Issues Resolved
[#479 ]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.