-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add task resource tracking service to cluster service #14681
add task resource tracking service to cluster service #14681
Conversation
❌ Gradle check result for 31a955e: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Chenyang Ji <cyji@amazon.com>
31a955e
to
d1a6038
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #14681 +/- ##
============================================
- Coverage 71.68% 71.58% -0.10%
+ Complexity 62365 62327 -38
============================================
Files 5148 5148
Lines 293297 293301 +4
Branches 42384 42384
============================================
- Hits 210239 209960 -279
- Misses 65717 65923 +206
- Partials 17341 17418 +77 ☔ View full report in Codecov by Sentry. |
this.attributes = in.readMap(Attribute::readFromStream, StreamInput::readGenericValue); | ||
this.attributes = Attribute.readAttributeMap(in); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has something changed in the ser/deser logic? If yes, can that cause issue while upgrading from 2.15 to 2.16?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @jainankitk ! Actually this is a bug we didn't caught. readGenericValue
won't be able to read List<CustomObject>
, we need to write our own reader in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I took this from Siddhant's PR when moving categorization code to the plugin: https://github.com/opensearch-project/OpenSearch/pull/14528/files#r1669900434
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jainankitk This was also discussed in detail in the PR to move categorization to plugin : https://github.com/opensearch-project/OpenSearch/pull/14528/files#r1669900434
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks for the details! Given the serialization was failing already, should not result into compatibility issues.
out.writeMap( | ||
attributes, | ||
(stream, attribute) -> Attribute.writeTo(out, attribute), | ||
(stream, attributeValue) -> Attribute.writeValueTo(out, attributeValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same goes for the write part as well
server/src/main/java/org/opensearch/tasks/TaskResourceTrackingService.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Chenyang Ji <cyji@amazon.com>
bfccfa5
to
d00b2df
Compare
Curious why the serialize test was passing earlier when we made changes to main and why it was failing only in my branch? @ansjcy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we change to attributes.put(Attribute.SOURCE, request.source());
here itself so we can use this in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe let's do it as part of your PR? to ensure this PR only contains the necessary urgent fixes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ansjcy - The changes look good. Can we just add a test to demonstrate the failure earlier and now it is passing?
❕ Gradle check result for d00b2df: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Signed-off-by: Chenyang Ji <cyji@amazon.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you adding test. LGTM!
Signed-off-by: Chenyang Ji <cyji@amazon.com>
❕ Gradle check result for 14ff852: UNSTABLE
Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
@@ -51,6 +52,7 @@ | |||
/** | |||
* Service that helps track resource usage of tasks running on a node. | |||
*/ | |||
@PublicApi(since = "2.16.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why you are marking this public in that it is now exposed publicly from ClusterService. Is this the right level of visibility for this service? Can we instead expose only the required functionality from ClusterService instead the whole of TaskResourceTrackingService
? and/or thinking this maybe should be a separate interface for plugins not exposed through ClusterService?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do a wrapper of refreshResourceStats
in clusterservice
, but one argument could be, does it belong to cluster service? In the right level of encapsulation, task level resource usages related operations should only belong to TaskResourceTrackingService
. But I agree making the whole service public is also risky. I'm open to suggestions. cc @reta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ansjcy the TaskResourceTrackingService
is internal to OpenSearch (and has no relation to the ClusterService
either), should not be exposed to the plugins. Regarding to the issue itself:
Currently we are not refreshing task level resource usages on coordinator node for searchTasks, which means all coordinator node resource usage will be 0.
That seem to be the problem that core implementation has to fix, why the task level resources are not refreshed (on coordinator node)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @reta ! thanks for the input. Currently the TaskResourceTrackingService
only refreshes task usages when a task ends. But in our case we want to get the resource usage in a SearchOperationsListener, which will be triggered before a task finishes.
Let me think about this more. Instead of exposing the TaskResourceTrackingSevice
, does it make sense to you If we can do a usage refresh in core before the listeners are called (in this function: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/action/search/AbstractSearchAsyncAction.java#L469)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ansjcy
thanks for the input. Currently the TaskResourceTrackingService only refreshes task usages when a task ends.
It does not sound right, the task is still ongoing so its usage won't be correct.
Let me think about this more. Instead of exposing the TaskResourceTrackingSevice, does it make sense to you If we can do a usage refresh in core before the listeners are called (in this function:
The logical point (at least to me) of capturing task usages seems to be the moment task ends. It looks to me you are trying to chime in somewhere in between (while task is still executing), that does not look like the correct way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we add per-request listener instance in the TransportSearchAction::executeRequest. The query insights plugin has nothing to do with it at this moment of time, but we capture the search task resource usage upon request completion, so the tracking data becomes available to everyone (including the query insights plugin). Does it make sense?
@reta We are currently working on a PR based on the above discussion.
For the last point you made, what other justification and work is required to make the API public? We are trying to get all the query insights changes in 2.16 and this is the only PR that is dangling currently. Want to make sure we reach a path forward. Please let us know your suggestions. In the meantime will finalize the above draft PR if there are not concerns with this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @deshsidd
For the last point you made, what other justification and work is required to make the API public?
I did not design the original APIs, you may ask the contributor if he has any concerns. On the second point, if you need to make it public, apply the @PublicApi
annotation accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @reta!
cc @buddharajusahil @sgup432 @dzane17 @jainankitk Please let us know your thoughts since you all have contributed to the file.
@reta Looks like you had initially reduced the visibility of the API.
For now I am going to continue with the approach that Reta and Chenyang had discussed above and work on the following PR. Will also make the SearchRequestOperationsListener @publicapi as part of these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@reta Looks like you had initially reduced the visibility of the API.
@deshsidd yes, you will understand why once try to apply @PublicApi
to it :-) : it pulls a pile of dependencies with it .... (I am very doubtful it was designed as being public in the first place).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -1109,6 +1109,7 @@ protected Node( | |||
clusterService.getClusterSettings(), | |||
threadPool | |||
); | |||
clusterService.setTaskResourceTrackingService(taskResourceTrackingService); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just initialize TaskResourceTrackingService
in ClusterService
then as we're creating a dependency here anyways.
Also should SetOnce
be used?
Closing in favor of : #14832 |
Description
This is mainly used in query insights plugin.
Related PR: opensearch-project/query-insights#13
Currently we are not refreshing task level resource usages on coordinator node for
searchTask
s, which means all coordinator node resource usage will be 0. for example, below node level resource usages in top n queries results:This pr is a prerequisite of the fix to refresh task level resource usage before reading it in query insights - basically expose TaskResourceTrackingService for plugins to use.
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.