Add Bulk Item Failure Count #4562

Closed
mgodwan opened this issue Sep 20, 2022 · 20 comments · Fixed by #8716
Labels
Clients · distributed framework · enhancement

Comments

@mgodwan
Member

mgodwan commented Sep 20, 2022

Is your feature request related to a problem? Please describe.

  • OpenSearch returns a 200 response for a Bulk call even though there can be partial failures within the request.
  • Today, when an operator wants to gather data on the number of bulk item failures seen on a node, there is no way to obtain that data directly from the node/cluster.
  • We have seen cases where clients may not fully track stats for partial failures in the bulk API.

Describe the solution you'd like

  • Add a stat within the indexing actions (at the coordinating node) which counts the number of errors of each type.
  • These actions can publish a counter regularly for each error type (400/429/500, etc.) while handling indexing requests.
  • The counter will be exposed through an API (e.g. the node stats API) which clients can query for the number of item-level failures. It will also serve as a building block for the PA plugin or any additional plugins/components to consume.
  • Since OpenSearch publishes all cumulative stats, most clients already set up a regular poller and derive insights over a desired time period from the difference between the last value and the current value seen for a node; this change will add value to all those use cases (a minimal sketch of that polling pattern follows below).
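
As an illustration of how a client-side poller could consume the proposed cumulative counter (the snapshot source and stat names here are hypothetical, not part of this proposal), a minimal Java sketch:

import java.util.HashMap;
import java.util.Map;

// Sketch only: derive per-interval counts from successive cumulative snapshots.
// How the snapshot is obtained (e.g. parsed from node stats) is assumed, not defined here.
class DocStatusPoller {
    private final Map<String, Long> lastSeen = new HashMap<>();

    Map<String, Long> delta(Map<String, Long> latest) {
        Map<String, Long> deltas = new HashMap<>();
        for (Map.Entry<String, Long> e : latest.entrySet()) {
            deltas.put(e.getKey(), e.getValue() - lastSeen.getOrDefault(e.getKey(), 0L));
        }
        lastSeen.clear();
        lastSeen.putAll(latest);
        return deltas;
    }
}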

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@mgodwan mgodwan added the enhancement and untriaged labels Sep 20, 2022
@mgodwan mgodwan changed the title from Bulk Item Failure Count to Add Bulk Item Failure Count Sep 20, 2022
@kotwanikunal kotwanikunal added the Clients and distributed framework labels Sep 26, 2022
@kotwanikunal
Member

This will need support from both the framework/server as well as the client libraries.

@anasalkouz
Member

Hi @mgodwan,
Can you elaborate on the use case? I know the bulk API provides the count of failures and successes in the API response, but why do we need the failure count at the cluster/node level? Are there any other similar stats metrics at the cluster/node level?

@mgodwan
Member Author

mgodwan commented Oct 3, 2022

@anasalkouz While the API response contains the error as true/false, it is up to the clients to record the metric. As an operator, I don't have visibility into how the cluster or specific nodes are behaving with regard to failures for specific bulk items. E.g. if there is a bad node which fails to perform indexing actions, or is behind the current cluster state for certain shards, it may keep returning errors for bulk items targeting the affected shard. Having such stats aggregated at node level would help in detecting issues within the cluster.

Today, we have the node stats API which returns stats for nodes such as HTTP connections, circuit breakers, indexing pressure, etc. Adding this stat can contribute further towards providing more node-level information about indexing item failures to detect issues, and also allow operators to build metrics on top of these stats.

@adnapibar
Contributor

Today, we have the node stats API which returns stats for nodes such as HTTP connections, circuit breakers, indexing pressure, etc. Adding this stat can contribute further towards providing more node-level information about indexing item failures to detect issues, and also allow operators to build metrics on top of these stats.

@mgodwan I want to understand the requirements here. Currently, _nodes/<node_id>/stats returns the number of failed indexing operations (indexing.index_failed), which I believe also includes failures during the bulk call. Are you saying we should have a more granular level of stats, such as a count by type of failure?

@mgodwan
Member Author

mgodwan commented Jun 13, 2023

@mgodwan I want to understand the requirements here. Currently, _nodes/<node_id>/stats returns the number of failed indexing operations (indexing.index_failed), which I believe also includes failures during the bulk call. Are you saying we should have a more granular level of stats, such as a count by type of failure?

@adnapibar IndexingStats only accounts for the actual indexing flow, i.e. indexing into Lucene and adding to the translog. We can have failures during parsing, backpressure, shard failures, etc. which don't get accounted for there. This misses a lot of cases, and hence tracking this at the REST layer/transport bulk layer may give the best insight into the error returned to the client for each document that reaches the OpenSearch process.

@r1walz
Contributor

r1walz commented Jul 6, 2023

Hi, All. This looks like a good feature. Let me summarize and spark the discussion.

Introduction

Currently, OpenSearch returns a 200 OK response code for a Bulk API call even though there can be partial or complete failures within the request end to end. Tracking these failures requires clients to parse the response on their side and make sense of it. But a general idea of the trend in different HTTP codes at the item level can provide insight into how the indexing engine is performing.

Proposal

Introduce an object at the coordinating node for tracking the different REST status codes returned during index action calls.

Earlier:

{
    "indexing": {
        "index_total": 12,
        "index_time_in_millis": 55,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0
    }
}

Now:

{
    "indexing": {
        "index_total": 12,
        "index_time_in_millis": 55,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0,
        "doc_status": {
            "200": 2,
            "400": 1,
            "404": 1,
            "429": 1
        }
    }
}
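
For reference, a doc_status snapshot like the one above would presumably be read from the existing node stats endpoint, e.g.:

GET /_nodes/stats/indices/indexing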

One possible way to achieve this is in the transport layer through TransportBulkAction. While the REST status code counter can reside within IndexingStats, ownership of the DocStatusCounter will be with the oldShardsStats object within IndicesService.
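
A minimal sketch of what such a counter could look like, assuming it is incremented once per item response from TransportBulkAction (the class shape and keying on the raw integer status code are assumptions; the actual change may key on RestStatus and be wired differently):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch: a thread-safe per-status counter, incremented once per bulk item response.
// Ownership (e.g. under oldShardsStats in IndicesService) is per the proposal above.
class DocStatusCounter {
    private final Map<Integer, LongAdder> counters = new ConcurrentHashMap<>();

    void inc(int restStatusCode) {
        counters.computeIfAbsent(restStatusCode, s -> new LongAdder()).increment();
    }

    Map<Integer, Long> snapshot() {
        Map<Integer, Long> copy = new ConcurrentHashMap<>();
        counters.forEach((status, adder) -> copy.put(status, adder.sum()));
        return copy;
    }
}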

Since we're capturing status codes from each response, the approach aggregates them across all types of indexing actions/APIs (_bulk, _doc, _update, etc.).

FAQ

1. Why introduce a new mechanism when we already have metrics like index_total, index_failed, etc.?

Metrics like index_total or index_failed only count docs which are processed via the actual indexing flow, i.e., indexed into Lucene and added to the translog. But we can have failures during the parsing phase, or due to shard failures, or rejections due to backpressure, etc. They miss a lot of cases, and hence tracking at the transport layer should give the best insight into the returned statuses.

---

I was able to do a PoC. RFC, cf: https://github.com/r1walz/OpenSearch/tree/ra/idx-axn-cntr

@mgodwan
Member Author

mgodwan commented Jul 7, 2023

Thanks @r1walz for picking this up and adding the needed details to this proposal. Enhancing the stats API itself with data collected at the transport layer is a good approach.

@mgodwan
Member Author

mgodwan commented Jul 7, 2023

@shwetathareja Thoughts on this?

@shwetathareja
Member

shwetathareja commented Jul 12, 2023

Thanks @r1walz for the proposal to enhance indexing stats with doc-level metrics. It would be useful to track doc-level status at the OpenSearch layer.
A couple of points:

  1. It would also be useful to track updates as a separate operation, as doc updates are more expensive than regular indexing, and this metric would help identify whether a cluster has an update-heavy workload as opposed to only creates.
  2. Also, tracking doc-level status by status code would be a lot. Either we restrict the number of status codes here, or we do grouping like success, throttled, mapping_parser_error, painless_script_error, server_error, version_conflict, etc.

@mgodwan
Member Author

mgodwan commented Jul 12, 2023

Also, tracking doc-level status by status code would be a lot. Either we restrict the number of status codes here, or we do grouping like success, throttled, mapping_parser_error, painless_script_error, server_error, version_conflict, etc.

Status codes should provide some level of aggregation over the multiple scenarios/exceptions we return, e.g.

  • MapperParsing, IllegalArgument translate to 400,
  • ThreadPoolRejections/Backpressure Rejections/CB Rejections/Script compilation limits, etc. translate to 429

We can expose a setting which allows tracking a certain set of status codes, with default values covering common points of interest (e.g. 200, 201, 400, 409, 413, 429, 500, 503, 504), and which operators can update dynamically. This may prove easier for enabling/disabling certain tracking on demand compared to custom buckets (which may require code changes to add new keys for translating internal exceptions into buckets). A rough sketch of such a setting is below.
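
As a rough sketch of how such a dynamic setting might be declared, assuming the standard Setting.listSetting helper; the setting key and defaults here are illustrative only, not final names:

import java.util.Arrays;
import java.util.List;

import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Illustrative only: a dynamic list setting holding the status codes to track.
public final class DocStatusSettings {
    public static final Setting<List<Integer>> TRACKED_DOC_STATUS_CODES = Setting.listSetting(
        "indices.indexing.doc_status.tracked_codes",   // hypothetical key
        Arrays.asList("200", "201", "400", "409", "413", "429", "500", "503", "504"),
        Integer::parseInt,
        Property.Dynamic,
        Property.NodeScope
    );
}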

@r1walz @shwetathareja Thoughts?

@shwetathareja
Member

@mgodwan I am fine with restricting status codes instead of custom buckets - let's add the list of status codes we are proposing to track. Also, make it a node-level setting and not dynamic, as the stats API response could be confusing: a certain status code's count would remain unchanged if it is no longer collected, and when a new status code gets added to the list it will have a count from that point onwards, not since the process came up on that node.

@r1walz
Contributor

r1walz commented Jul 18, 2023

@shwetathareja @mgodwan: Hi. Another way to approach this is to account for all REST codes since the node started and return only the requested codes to the user (a filter). Does it make sense to restrict these through request params? We could also have a dynamic list setting as a filter.

Currently, we support these REST status codes, so maintaining an in-memory Map holding a counter for each of them shouldn't hurt*.

*: assuming we don't overflow the counter for any doc status

@ankitkala
Member

Hey @r1walz, I lean towards the approach recommended by @shwetathareja where we group the failures by their reasons instead of status codes.
One benefit is that error messages would be more user friendly than status codes.
Another reason is that multiple failure reasons might fall into the same status code; this will help us segregate those as well. There'd be additional effort required to label and categorise all the issues, though.

@shwetathareja
Member

shwetathareja commented Jul 19, 2023

I thought about this again. +1 @ankitkala. There are 50+ REST statuses we track, and it would be too granular to return them all. Ideally, the request tracing framework should provide this level of granularity. The purpose of exposing these stats to customers is to understand a trend over time in the category of errors they are observing in production.

@r1walz
Contributor

r1walz commented Jul 19, 2023

I agree that counting at the cause level gives additional data about the nature of a failure. Perhaps we can do both and provide a "view" filter to cater to the needs of both kinds of users: those who just want to know whether there are failures, and those who care about the cause.

GET /_nodes/stats/indices/indexing?code_view
{
    "indexing": {
        "index_total": 9,
        "index_time_in_millis": 77,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 1,
        "delete_time_in_millis": 23,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0,
        "doc_status": {
            "200": 7,
            "201": 4,
            "400": 1,
            "404": 1
        }
    }
}
GET /_nodes/stats/indices/indexing?reason_view
{
    "indexing": {
        "index_total": 9,
        "index_time_in_millis": 77,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 1,
        "delete_time_in_millis": 23,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0,
        "doc_status": {
            "mapper_parsing_exception": 1,
            "noop": 1,
            "deleted": 1,
            "document_missing_exception": 1,
            "created": 4,
            "updated": 5
        }
    }
}

@ankitkala wrote:

There'd be additional effort required to label and categorise all the issues, though.

Perhaps we can utilize OpenSearchException.getExceptionName(...) to extract bucket labels from the failure cause captured within the bulk item response (rough sketch below).
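
A rough sketch of that labeling idea (the exact signatures of the OpenSearch classes used here are as I understand them from the current code and should be treated as assumptions):

import org.opensearch.OpenSearchException;
import org.opensearch.action.bulk.BulkItemResponse;

// Sketch: derive a bucket label for a bulk item, e.g. "mapper_parsing_exception".
final class DocStatusLabels {
    static String bucketLabel(BulkItemResponse item) {
        if (item.isFailed()) {
            // Failure cause -> snake_case exception name.
            return OpenSearchException.getExceptionName(item.getFailure().getCause());
        }
        // Successful items could be bucketed by their operation result, e.g. created/updated.
        return item.getResponse().getResult().getLowercase();
    }
}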

Now I'm wondering if it should be a separate stats class instead, "NodeCoordinatingStats", part of NodeIndicesStats > CommonStats, instead of IndexingStats.

@shwetathareja @mgodwan @ankitkala : Let me formalize it into the proposal.

@shwetathareja
Member

@r1walz I would prefer we expose either the category of errors or the actual status code, not both. Let's come to a consensus on which one provides more meaningful insights and go with that.

@mgodwan
Member Author

mgodwan commented Jul 20, 2023

We would want to fulfill the following requirements with these changes:

  1. Provide OpenSearch operators a way to track item-level failures during indexing, irrespective of the APIs they use, as the top-level status code is not a good indicator during bulk operations.
  2. We don't wish to add a lot of overhead to the JVM/on-the-wire payload for the returned stats.
  3. OpenSearch operators should be able to easily identify the type of failures and also write programmatic applications which consume these stats for alerting if needed.
  4. In line with 3, alerting is usually segregated by successes, client-side errors, and server-side errors. We should be able to achieve the same with the returned stat.

For a sample bulk response like:

{
  "error": {
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [double] of type [float] in document ...",
  },
  "status": 400
}

I see 3 possibilities for the categorisation at this point:

  1. Using the type returned, i.e. mapper_parsing_exception
    a. The overhead will be very high as we have 50+ different exception types possible for indexing, and no aggregation is achieved here, which goes against the suggestion laid out by @shwetathareja
    b. The view provided to an operator manually checking these stats is exhaustive
    c. Alerting is tricky, as the overhead is on operators to determine which errors are client side/server side. If we add any new exception types, these alerting configurations might not include the new ones.

  2. Using the returned status code, i.e. 400
    a. Provides aggregation over a well-defined set of status codes across different exception types, along with a default understanding of client/server-side errors (e.g. mapper parsing and illegal JSON in a doc map to 400; shard unavailable and node unavailable map to 500). While we maintain >50 REST statuses in OS, only a few in the 2xx, 4xx, and 5xx series will be applicable to indexing.
    b. It is easy to set up alerts, as we usually do on status codes (e.g. aggregate into 2xx, 4xx, 5xx, etc. and emit metrics)
    c. For operators looking to understand the causes, enabling "Add logging to 4xx responses from API at reduced rate" (#4770) should help dive deep into the actual causes.

  3. Creating a new bucket aggregating over different types
    a. Based on the number of buckets we declare, the JVM/wire overhead should be less.
    b. We need to be very clear on how something like others gets modelled, as a plain others bucket does not allow segregating client/server-side errors when emitting metrics. We can choose to create hierarchies, but that adds extra computation overhead as well.
    c. Alerting is tricky: for any new bucket added, alerting needs to be updated first in the operator's systems so as to not miss error accounting for the new ones.
    d. Exception type is already a bucket. Creating a new bucket and maintaining exception mappings adds extra code overhead as more exceptions are added. Status code already serves as one such logical bucket.

@shwetathareja
Member

shwetathareja commented Jul 20, 2023

Thanks @mgodwan. Can you please list all the HTTP status codes which are returned today at doc level for the bulk API? I agree there is always overhead in maintaining something custom if we introduce new custom buckets for doc-level errors.

For Option 3 (creating a new bucket aggregating over different types):

Alerting is tricky: for any new bucket added, alerting needs to be updated first in the operator's systems so as to not miss error accounting for the new ones.

I am not sure there will be alerting set up on the stats output directly.

Status code already serves as one such logical bucket.

My concern is that there could be many status codes, and the stats we are exposing are meant to provide an overall trend, not a mechanism to debug point-in-time errors. Unless we provide some categorization, this list can grow, e.g. the security application could come back and say they need the Unauthorized (401) status code to be tracked separately, which is not mentioned in the common list of errors.

@ankitkala
Member

The overhead will be very high as we have 50+ different exception types possible for indexing, and no aggregation is achieved here

Makes sense. 50+ exception types definitely seems like a lot and might not really help the operator.

Status code already serves as one such logical bucket.

My concern is that there could be many status codes, and the stats we are exposing are meant to provide an overall trend, not a mechanism to debug point-in-time errors. Unless we provide some categorization, this list can grow.

Regarding the concern about a growing list of codes, I suspect the number of actual status codes returned would be relatively low (400, 401, 403, 404, 409, 424, 429, 500, 502, 503, 504), but I can't really back that with any data.

@r1walz
Contributor

r1walz commented Jul 20, 2023

@ankitkala: We shall bucket these REST status codes by their category class (2xx, 4xx, etc.), as sketched below. That will be indicative enough to show what's happening with the docs without adding a lot of overhead.
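
As a tiny sketch of that bucketing (an illustration of the idea, not the merged implementation):

// Collapse an HTTP status code into its class bucket, e.g. 429 -> "4xx".
final class DocStatusBuckets {
    static String statusClassBucket(int statusCode) {
        return (statusCode / 100) + "xx";
    }
}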

@shwetathareja called out earlier that the granularity of the previous approach isn't what the stats API usually provides. The new bucketization is also aligned with the stats API's design philosophy.

cc: @mgodwan
