Add Bulk Item Failure Count #4562
This will need support from both the framework/server as well as the client libraries.
Hi @mgodwan,
@anasalkouz While the API response contains the error as true/false, it is up to the clients to record the metric. As an operator, I don't have visibility into statistics of how the cluster or specific nodes are behaving with regard to failures for specific bulk items, e.g. if there is a bad node which fails to perform indexing actions, or is behind the current cluster state for certain shards, it may keep returning errors for bulk requests targeting that shard. Having such stats aggregated at the node level would help in detecting issues within the cluster. Today, the node stats API returns stats for nodes such as HTTP connections, circuit breakers, indexing pressure, etc., and adding this stat would provide more node-level information about indexing item failures to detect issues, and would also allow operators to build metrics on top of these stats.
@mgodwan I want to understand the requirements here. Currently the
@adnapibar IndexingStats only accounts for the actual indexing flow, i.e. indexing into Lucene and the translog addition. We can have failures during parsing, backpressure, shard failures, etc. which don't get accounted for there. This misses a lot of cases, and hence tracking this at the REST layer/transport bulk layer may give the best insight into the error returned to the client for each document which has reached the OpenSearch process.
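To make that gap concrete, here is a minimal, self-contained Java sketch (the `ItemResult` type is a hypothetical stand-in, not an OpenSearch class) of tallying failed items by rest status at the coordinating layer, where parse errors and rejections are visible even though they never hit the shard-level `index_failed` counter:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for a per-item bulk result; in OpenSearch each bulk
// item response carries an equivalent failed flag and rest status.
record ItemResult(boolean failed, int httpStatus) {}

public class BulkItemTally {

    // Tally failed items by HTTP status for one bulk response.
    static Map<Integer, Long> tallyFailures(List<ItemResult> items) {
        Map<Integer, Long> counts = new HashMap<>();
        for (ItemResult item : items) {
            if (item.failed()) {
                counts.merge(item.httpStatus(), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A 400 (parse/mapping error) and a 429 (rejection) never reach the
        // shard-level indexing path, so index_failed alone would miss them.
        List<ItemResult> items = List.of(
            new ItemResult(false, 201),
            new ItemResult(true, 400),
            new ItemResult(true, 429)
        );
        System.out.println(tallyFailures(items)); // counts for 400 and 429
    }
}
```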
Hi, all. This looks like a good feature. Let me summarize and spark the discussion.

Introduction

Currently, OpenSearch returns a

Proposal

Introduce an object at the coordinating node for tracking the different rest status codes returned during index action calls.

Earlier:

```json
{
  "indexing": {
    "index_total": 12,
    "index_time_in_millis": 55,
    "index_current": 0,
    "index_failed": 0,
    "delete_total": 0,
    "delete_time_in_millis": 0,
    "delete_current": 0,
    "noop_update_total": 0,
    "is_throttled": false,
    "throttle_time_in_millis": 0
  }
}
```

Now:

```json
{
  "indexing": {
    "index_total": 12,
    "index_time_in_millis": 55,
    "index_current": 0,
    "index_failed": 0,
    "delete_total": 0,
    "delete_time_in_millis": 0,
    "delete_current": 0,
    "noop_update_total": 0,
    "is_throttled": false,
    "throttle_time_in_millis": 0,
    "doc_status": {
      "200": 2,
      "400": 1,
      "404": 1,
      "429": 1
    }
  }
}
```

A possible way to achieve this is in the transport layer through TransportBulkAction. While the rest status code counter can reside within the IndexingStats, the ownership of DocStatusCounter will be with the oldShardsStats object within the IndicesService. Since we're capturing status codes from each response, the approach aggregates them across all types of indexing actions/APIs (_bulk, _doc, _update, etc.).

FAQ

1. Why introduce a new mechanism while we already have metrics like

Metrics like

---

I was able to do a PoC. RFC, cf: https://github.com/r1walz/OpenSearch/tree/ra/idx-axn-cntr
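As a rough sketch of the counter this proposal describes (the class name follows the comment above, but the code is illustrative and not the actual OpenSearch implementation), a node-level structure could look like:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

// Sketch of the per-node counter described above: one long-lived instance
// (e.g. held alongside oldShardsStats in IndicesService, per the proposal)
// incremented once per bulk item response on the coordinating node.
public class DocStatusCounter {
    private final Map<Integer, LongAdder> counters = new ConcurrentHashMap<>();

    // Would be called from the transport bulk path for every item response.
    public void inc(int httpStatus) {
        counters.computeIfAbsent(httpStatus, s -> new LongAdder()).increment();
    }

    // Snapshot used when building the "doc_status" section of the stats response.
    public Map<String, Long> snapshot() {
        return counters.entrySet().stream()
            .collect(Collectors.toMap(e -> String.valueOf(e.getKey()), e -> e.getValue().sum()));
    }
}
```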
Thanks @r1walz for picking this up and adding the needed details to this proposal. The overall idea of enhancing the stats API with data collected at the transport layer sounds good.
@shwetathareja Thoughts on this?
Thanks @r1walz for the proposal to enhance indexing stats with doc-level metrics. It would be useful to track doc-level status at the OpenSearch layer.
Status codes should provide some level of aggregation over the multiple scenarios/exceptions we return, e.g.
We can expose a setting which allows a certain set of status codes, with default values covering common points of interest (e.g. 200, 201, 400, 409, 413, 429, 500, 503, 504), and operators can update this dynamically. This may prove easier for enabling/disabling certain tracking on demand when compared to custom buckets (which may require code changes to add new keys for translating internal exceptions to keys). @r1walz @shwetathareja Thoughts?
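A minimal sketch of the gating such a setting would drive (the setting name in the comment and the wiring are assumptions; only the allow-list check itself is shown):

```java
import java.util.Set;

// Sketch of counting only an operator-configured set of status codes.
// The set would be sourced from a (hypothetical) node setting such as
// "indices.stats.doc_status.tracked_codes"; here it is just hard-coded.
public class TrackedStatusFilter {
    private volatile Set<Integer> trackedCodes = Set.of(200, 201, 400, 409, 413, 429, 500, 503, 504);

    // Consulted before incrementing any counter.
    public boolean shouldTrack(int httpStatus) {
        return trackedCodes.contains(httpStatus);
    }

    // Would be invoked by the settings framework when the setting changes.
    public void updateTrackedCodes(Set<Integer> newCodes) {
        this.trackedCodes = Set.copyOf(newCodes);
    }
}
```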
@mgodwan I am fine with restricting status codes instead of custom buckets. Let's add the list of status codes we are proposing to track. Also, make it a node-level setting and not dynamic, as the stats API response could otherwise be confusing: certain status code counts would remain unchanged if they are no longer collected, and when a new status code gets added to the list it will have counts only from that point onwards, not since the process came up on that node.
@shwetathareja @mgodwan: Hi. Another way to approach this is to account for all rest codes since the node started and return only the requested codes to the user (filter). Does it make sense to restrict these through request params? We could also have a dynamic list setting as a filter. Currently, we support these rest status codes, so maintaining an in-memory Map to hold a counter for each of them shouldn't hurt*.

*: assuming we don't overflow the counter for any doc status
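A minimal sketch of this "count everything, filter on read" alternative (the request parameter mentioned in the comment is hypothetical):

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of the "count everything, filter on read" alternative: the full
// counter map is kept in memory and a caller-supplied filter (e.g. a
// hypothetical ?doc_status_codes=429,503 request parameter) trims the response.
public class DocStatusView {
    public static Map<Integer, Long> filter(Map<Integer, Long> allCounts, Set<Integer> requestedCodes) {
        if (requestedCodes == null || requestedCodes.isEmpty()) {
            return allCounts; // no filter requested: return everything
        }
        return allCounts.entrySet().stream()
            .filter(e -> requestedCodes.contains(e.getKey()))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```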
Hey @r1walz, I'm inclined towards the approach recommended by @shwetathareja where we group the failures by their reasons instead of status codes.
I thought about this again. +1 @ankitkala. There are 50+ rest statuses we track, and returning them all would be too granular. Ideally, a request tracing framework should provide this level of granularity. The purpose of exposing these stats to customers is to help them understand a trend over time in the category of errors they are observing in production.
I agree that counting at the cause level gives additional data about the nature of the failure. Perhaps we can do both and provide a "view" filter to cater to the needs of both kinds of users: ones who just want to know whether there are failures, and others who care about the cause.

@ankitkala wrote:

Perhaps we can utilize

Now, I'm thinking whether it should be a separate stats class instead, "NodeCoordinatingStats", a part of NodeIndicesStats > CommonStats, instead of IndexingStats. @shwetathareja @mgodwan @ankitkala: Let me formalize it into the proposal.
@r1walz I would prefer we expose either the category of errors or the actual status codes, not both. Let's come to a consensus on which one provides more meaningful insights and go with that.
We would want to fulfill the following requirements with these changes:

For a sample bulk response like:

I see 3 possibilities for the categorisation at this point:
Thanks @mgodwan. Can you please list down all the HTTP status codes which are returned today at the doc level for the bulk API? I agree there is always overhead in maintaining something custom if we introduce new custom buckets for doc-level errors.

For Option 3 (creating a new bucket aggregating over different types): I am not sure there will be alerting set up on the stats output directly.

My concern is that these could be many status codes, and the stats we are exposing are meant to provide an overall trend, not a mechanism to debug point-in-time errors. Unless we provide some categorization, this list can grow; e.g. a security application could come back and say they need the Unauthorized (401) status code to be tracked separately, which is not mentioned in the common list of errors.
Makes sense. 50+ exception types definitely seems like a lot and might not really help the operator.

Regarding the concern about a growing list of codes, I suspect that the number of status codes actually returned should be relatively low (400, 401, 403, 404, 409, 424, 429, 500, 502, 503, 504), but I can't really back that with any data.
@ankitkala: We shall bucket these rest status codes as per their category class. @shwetathareja called out earlier that the granularity of the previous approach isn't what the stats API usually provides, and the new bucketization is also aligned with the stats API design philosophy. cc: @mgodwan
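A minimal sketch of bucketing by status class as described above (the bucket labels and array layout are illustrative assumptions, not the final implementation):

```java
import java.util.concurrent.atomic.LongAdder;

// Sketch of bucketing rest status codes by their class (1xx..5xx), as agreed
// above, instead of tracking each individual code.
public class DocStatusClassCounter {
    // Index 0 -> 1xx, 1 -> 2xx, ... 4 -> 5xx.
    private final LongAdder[] buckets = new LongAdder[5];

    public DocStatusClassCounter() {
        for (int i = 0; i < buckets.length; i++) {
            buckets[i] = new LongAdder();
        }
    }

    public void inc(int httpStatus) {
        int family = httpStatus / 100;          // e.g. 429 -> 4
        if (family >= 1 && family <= 5) {
            buckets[family - 1].increment();
        }
    }

    // e.g. label(3) == "4xx"
    public static String label(int bucketIndex) {
        return (bucketIndex + 1) + "xx";
    }
}
```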