Add Bulk Item Failure Count #4562

Closed
mgodwan opened this issue Sep 20, 2022 · 20 comments · Fixed by #8716
Labels
Clients · distributed framework · enhancement

Comments

@mgodwan
Member

mgodwan commented Sep 20, 2022

Is your feature request related to a problem? Please describe.

  • OpenSearch returns a 200 response for a Bulk call even though there can be partial failures within the request.
  • Today, when an operator wants to gather data on the number of bulk item failures seen on a node, there is no way to obtain that data directly from the node/cluster.
  • We have seen cases where clients may not fully track stats for partial failures in the bulk API.

Describe the solution you'd like

  • Add a stat within the indexing actions (at the coordinating node) which counts the number of errors of each type.
  • These actions can publish a counter regularly for each error type (400/429/500, etc.) while handling indexing requests.
  • The counter will be exposed through an API (e.g. the node stats API) which clients can query for the number of item-level failures. It will also serve as a building block for the PA plugin or any additional plugins/components to consume.
  • Since OpenSearch publishes all cumulative stats, most clients already set up a regular poller and derive insights over a desired time period from the difference between the last value and the current value seen for a node; this change will add value to all those use cases (a minimal sketch of that polling pattern follows below).
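
As an illustration of how a client-side poller could consume the proposed cumulative counter (the snapshot source and stat names here are hypothetical, not part of this proposal), a minimal Java sketch:

import java.util.HashMap;
import java.util.Map;

// Sketch only: derive per-interval counts from successive cumulative snapshots.
// How the snapshot is obtained (e.g. parsed from node stats) is assumed, not defined here.
class DocStatusPoller {
    private final Map<String, Long> lastSeen = new HashMap<>();

    Map<String, Long> delta(Map<String, Long> latest) {
        Map<String, Long> deltas = new HashMap<>();
        for (Map.Entry<String, Long> e : latest.entrySet()) {
            deltas.put(e.getKey(), e.getValue() - lastSeen.getOrDefault(e.getKey(), 0L));
        }
        lastSeen.clear();
        lastSeen.putAll(latest);
        return deltas;
    }
}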

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@mgodwan mgodwan added the enhancement and untriaged labels Sep 20, 2022
@mgodwan mgodwan changed the title from Bulk Item Failure Count to Add Bulk Item Failure Count Sep 20, 2022
@kotwanikunal kotwanikunal added the Clients and distributed framework labels Sep 26, 2022
@kotwanikunal
Member

This will need support from both the framework/server as well as the client libraries.

@anasalkouz
Member

Hi @mgodwan,
Can you elaborate on the use case? I know the bulk API provides the count of failures and successes in the API response, but why do we need the failure count at the cluster/node level? Are there any other similar stats metrics at the cluster/node level?

@mgodwan
Member Author

mgodwan commented Oct 3, 2022

@anasalkouz While the API response contains the error as true/false, it is up to the clients to record the metric. As an operator, I don't have visibility into how the cluster or specific nodes are behaving with regard to failures for specific bulk items. E.g. if there is a bad node which fails to perform indexing actions, or is behind the current cluster state for certain shards, it may keep returning errors for bulk items targeting the affected shard. Having such stats aggregated at node level would help in detecting issues within the cluster.

Today, we have the node stats API which returns stats for nodes such as HTTP connections, circuit breakers, indexing pressure, etc. Adding this stat can contribute further towards providing more node-level information about indexing item failures to detect issues, and also allow operators to build metrics on top of these stats.

@adnapibar
Contributor

Today, we have the node stats API which returns stats for nodes such as HTTP connections, circuit breakers, indexing pressure, etc. Adding this stat can contribute further towards providing more node-level information about indexing item failures to detect issues, and also allow operators to build metrics on top of these stats.

@mgodwan I want to understand the requirements here. Currently, _nodes/<node_id>/stats returns the number of failed indexing operations (indexing.index_failed), which I believe also includes failures during the bulk call. Are you saying we should have a more granular level of stats, such as a count by type of failure?

@mgodwan
Member Author

mgodwan commented Jun 13, 2023

@mgodwan I want to understand the requirements here. Currently, _nodes/<node_id>/stats returns the number of failed indexing operations (indexing.index_failed), which I believe also includes failures during the bulk call. Are you saying we should have a more granular level of stats, such as a count by type of failure?

@adnapibar IndexingStats only accounts for the actual indexing flow, i.e. indexing into Lucene and adding to the translog. We can have failures during parsing, backpressure, shard failures, etc. which don't get accounted for there. This misses a lot of cases, and hence tracking this at the REST layer/transport bulk layer may give the best insight into the error returned to the client for each document that reaches the OpenSearch process.

@r1walz
Contributor

r1walz commented Jul 6, 2023

Hi, All. This looks like a good feature. Let me summarize and spark the discussion.

Introduction

Currently, OpenSearch returns a 200 OK response code for a Bulk API call even though there can be partial or complete failures within the request end to end. Tracking these failures requires clients to parse the response on their side and make sense of it. But a general idea of the trend in different HTTP codes at the item level can provide insight into how the indexing engine is performing.

Proposal

Introduce an object at the coordinating node for tracking the different REST status codes returned during index action calls.

Earlier:

{
    "indexing": {
        "index_total": 12,
        "index_time_in_millis": 55,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0
    }
}

Now:

{
    "indexing": {
        "index_total": 12,
        "index_time_in_millis": 55,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 0,
        "delete_time_in_millis": 0,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0,
        "doc_status": {
            "200": 2,
            "400": 1,
            "404": 1,
            "429": 1
        }
    }
}
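
For reference, a doc_status snapshot like the one above would presumably be read from the existing node stats endpoint, e.g.:

GET /_nodes/stats/indices/indexing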

One possible way to achieve this is in the transport layer through TransportBulkAction. While the REST status code counter can reside within IndexingStats, ownership of the DocStatusCounter will be with the oldShardsStats object within IndicesService.
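
A minimal sketch of what such a counter could look like, assuming it is incremented once per item response from TransportBulkAction (the class shape and keying on the raw integer status code are assumptions; the actual change may key on RestStatus and be wired differently):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch: a thread-safe per-status counter, incremented once per bulk item response.
// Ownership (e.g. under oldShardsStats in IndicesService) is per the proposal above.
class DocStatusCounter {
    private final Map<Integer, LongAdder> counters = new ConcurrentHashMap<>();

    void inc(int restStatusCode) {
        counters.computeIfAbsent(restStatusCode, s -> new LongAdder()).increment();
    }

    Map<Integer, Long> snapshot() {
        Map<Integer, Long> copy = new ConcurrentHashMap<>();
        counters.forEach((status, adder) -> copy.put(status, adder.sum()));
        return copy;
    }
}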

Since we're capturing status codes from each response, the approach aggregates them across all types of indexing actions/APIs (_bulk, _doc, _update, etc.).

FAQ

1. Why introduce a new mechanism when we already have metrics like index_total, index_failed, etc.?

Metrics like index_total or index_failed only count docs which are processed via the actual indexing flow, i.e., indexed into Lucene and added to the translog. But we can have failures during the parsing phase, or due to shard failures, or rejections due to backpressure, etc. They miss a lot of cases, and hence tracking at the transport layer should give the best insight into the returned statuses.

---

I was able to do a PoC. RFC, cf: https://github.com/r1walz/OpenSearch/tree/ra/idx-axn-cntr

@mgodwan
Member Author

mgodwan commented Jul 7, 2023

Thanks @r1walz for picking this up and adding the needed details to this proposal. Enhancing the stats API itself with data collected at the transport layer is a good approach.

@mgodwan
Member Author

mgodwan commented Jul 7, 2023

@shwetathareja Thoughts on this?

@shwetathareja
Member

shwetathareja commented Jul 12, 2023

Thanks @r1walz for the proposal to enhance indexing stats with doc-level metrics. It would be useful to track doc-level status at the OpenSearch layer.
A couple of points:

  1. It would also be useful to track updates as a separate operation, as doc updates are more expensive than regular indexing, and this metric would help identify whether a cluster has an update-heavy workload as opposed to only creates.
  2. Also, tracking doc-level status by status code would be a lot. Either we restrict the number of status codes here, or we do grouping like success, throttled, mapping_parser_error, painless_script_error, server_error, version_conflict, etc.

@mgodwan
Member Author

mgodwan commented Jul 12, 2023

Also, tracking doc-level status by status code would be a lot. Either we restrict the number of status codes here, or we do grouping like success, throttled, mapping_parser_error, painless_script_error, server_error, version_conflict, etc.

Status codes should provide some level of aggregation over the multiple scenarios/exceptions we return, e.g.

  • MapperParsing, IllegalArgument translate to 400,
  • ThreadPoolRejections/Backpressure Rejections/CB Rejections/Script compilation limits, etc. translate to 429

We can expose a setting which allows tracking a certain set of status codes, with default values covering common points of interest (e.g. 200, 201, 400, 409, 413, 429, 500, 503, 504), and which operators can update dynamically. This may prove easier for enabling/disabling certain tracking on demand compared to custom buckets (which may require code changes to add new keys for translating internal exceptions into buckets). A rough sketch of such a setting is below.
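
As a rough sketch of how such a dynamic setting might be declared, assuming the standard Setting.listSetting helper; the setting key and defaults here are illustrative only, not final names:

import java.util.Arrays;
import java.util.List;

import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Illustrative only: a dynamic list setting holding the status codes to track.
public final class DocStatusSettings {
    public static final Setting<List<Integer>> TRACKED_DOC_STATUS_CODES = Setting.listSetting(
        "indices.indexing.doc_status.tracked_codes",   // hypothetical key
        Arrays.asList("200", "201", "400", "409", "413", "429", "500", "503", "504"),
        Integer::parseInt,
        Property.Dynamic,
        Property.NodeScope
    );
}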

@r1walz @shwetathareja Thoughts?

@shwetathareja
Member

@mgodwan I am fine with restricting status codes instead of custom buckets - let's add the list of status codes we are proposing to track. Also, make it a node-level setting and not dynamic, as the stats API response could be confusing: a certain status code's count would remain unchanged if it is no longer collected, and when a new status code gets added to the list it will have a count from that point onwards, not since the process came up on that node.

@r1walz
Contributor

r1walz commented Jul 18, 2023

@shwetathareja @mgodwan: Hi. Another way to approach this is to account for all REST codes since the node started and return only the requested codes to the user (a filter). Does it make sense to restrict these through request params? We could also have a dynamic list setting as a filter.

Currently, we support these REST status codes, so maintaining an in-memory Map holding a counter for each of them shouldn't hurt*.

*: assuming we don't overflow the counter for any doc status

@ankitkala
Member

Hey @r1walz, I lean towards the approach recommended by @shwetathareja where we group the failures by their reasons instead of status codes.
One benefit is that error messages would be more user friendly than status codes.
Another reason is that multiple failure reasons might fall into the same status code; this will help us segregate those as well. There'd be additional effort required to label and categorise all the issues, though.

@shwetathareja
Member

shwetathareja commented Jul 19, 2023

I thought about this again. +1 @ankitkala. There are 50+ REST statuses we track, and it would be too granular to return them all. Ideally, the request tracing framework should provide this level of granularity. The purpose of exposing these stats to customers is to understand a trend over time in the category of errors they are observing in production.

@r1walz
Contributor

r1walz commented Jul 19, 2023

I agree that counting at the cause level gives additional data about the nature of a failure. Perhaps we can do both and provide a "view" filter to cater to the needs of both kinds of users: those who just want to know whether there are failures, and those who care about the cause.

GET /_nodes/stats/indices/indexing?code_view
{
    "indexing": {
        "index_total": 9,
        "index_time_in_millis": 77,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 1,
        "delete_time_in_millis": 23,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0,
        "doc_status": {
            "200": 7,
            "201": 4,
            "400": 1,
            "404": 1
        }
    }
}
GET /_nodes/stats/indices/indexing?reason_view
{
    "indexing": {
        "index_total": 9,
        "index_time_in_millis": 77,
        "index_current": 0,
        "index_failed": 0,
        "delete_total": 1,
        "delete_time_in_millis": 23,
        "delete_current": 0,
        "noop_update_total": 0,
        "is_throttled": false,
        "throttle_time_in_millis": 0,
        "doc_status": {
            "mapper_parsing_exception": 1,
            "noop": 1,
            "deleted": 1,
            "document_missing_exception": 1,
            "created": 4,
            "updated": 5
        }
    }
}

@ankitkala wrote:

There'd be additional effort required to label and categorise all the issues, though.

Perhaps we can utilize OpenSearchException.getExceptionName(...) to extract bucket labels from the failure cause captured within the bulk item response (rough sketch below).
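
A rough sketch of that labeling idea (the exact signatures of the OpenSearch classes used here are as I understand them from the current code and should be treated as assumptions):

import org.opensearch.OpenSearchException;
import org.opensearch.action.bulk.BulkItemResponse;

// Sketch: derive a bucket label for a bulk item, e.g. "mapper_parsing_exception".
final class DocStatusLabels {
    static String bucketLabel(BulkItemResponse item) {
        if (item.isFailed()) {
            // Failure cause -> snake_case exception name.
            return OpenSearchException.getExceptionName(item.getFailure().getCause());
        }
        // Successful items could be bucketed by their operation result, e.g. created/updated.
        return item.getResponse().getResult().getLowercase();
    }
}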

Now I'm wondering if it should be a separate stats class instead, "NodeCoordinatingStats", part of NodeIndicesStats > CommonStats, instead of IndexingStats.

@shwetathareja @mgodwan @ankitkala : Let me formalize it into the proposal.

@shwetathareja
Member

@r1walz I would prefer we expose either the category of errors or the actual status code, not both. Let's come to a consensus on which one provides more meaningful insights and go with that.

@mgodwan
Member Author

mgodwan commented Jul 20, 2023

We would want to fulfill the following requirements with these changes:

  1. Provide OpenSearch operators a way to track item-level failures during indexing, irrespective of the APIs they use, as the top-level status code is not a good indicator during bulk operations.
  2. We don't wish to add a lot of overhead to the JVM/on-the-wire payload for the returned stats.
  3. OpenSearch operators should be able to easily identify the type of failures and also write programmatic applications which consume these stats for alerting if needed.
  4. In line with 3, alerting is usually segregated by successes, client-side errors, and server-side errors. We should be able to achieve the same with the returned stat.

For a sample bulk response like:

{
  "error": {
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [double] of type [float] in document ...",
  },
  "status": 400
}

I see 3 possibilities for the categorisation at this point:

  1. Using the type returned, i.e. mapper_parsing_exception
    a. The overhead will be very high as we have 50+ different exception types possible for indexing, and no aggregation is achieved here, which goes against the suggestion laid out by @shwetathareja
    b. The view provided to an operator manually checking these stats is exhaustive
    c. Alerting is tricky, as the overhead is on operators to determine which errors are client side/server side. If we add any new exception types, these alerting configurations might not include the new ones.

  2. Using the returned status code, i.e. 400
    a. Provides aggregation over a well-defined set of status codes across different exception types, along with a default understanding of client/server-side errors (e.g. mapper parsing and illegal JSON in a doc map to 400; shard unavailable and node unavailable map to 500). While we maintain >50 REST statuses in OS, only a few in the 2xx, 4xx, and 5xx series will be applicable to indexing.
    b. It is easy to set up alerts, as we usually do on status codes (e.g. aggregate into 2xx, 4xx, 5xx, etc. and emit metrics)
    c. For operators looking to understand the causes, enabling "Add logging to 4xx responses from API at reduced rate" (#4770) should help dive deep into the actual causes.

  3. Creating a new bucket aggregating over different types
    a. Based on the number of buckets we declare, the JVM/wire overhead should be less.
    b. We need to be very clear on how something like others gets modelled, as a plain others bucket does not allow segregating client/server-side errors when emitting metrics. We can choose to create hierarchies, but that adds extra computation overhead as well.
    c. Alerting is tricky: for any new bucket added, alerting needs to be updated first in the operator's systems so as to not miss error accounting for the new ones.
    d. Exception type is already a bucket. Creating a new bucket and maintaining exception mappings adds extra code overhead as more exceptions are added. Status code already serves as one such logical bucket.

@shwetathareja
Member

shwetathareja commented Jul 20, 2023

Thanks @mgodwan. Can you please list all the HTTP status codes which are returned today at doc level for the bulk API? I agree there is always overhead in maintaining something custom if we introduce new custom buckets for doc-level errors.

For Option 3 (creating a new bucket aggregating over different types):

Alerting is tricky: for any new bucket added, alerting needs to be updated first in the operator's systems so as to not miss error accounting for the new ones.

I am not sure there will be alerting set up on the stats output directly.

Status code already serves as one such logical bucket.

My concern is that there could be many status codes, and the stats we are exposing are meant to provide an overall trend, not a mechanism to debug point-in-time errors. Unless we provide some categorization, this list can grow, e.g. the security application could come back and say they need the Unauthorized (401) status code to be tracked separately, which is not mentioned in the common list of errors.

@ankitkala
Member

The overhead will be very high as we have 50+ different exception types possible for indexing, and no aggregation is achieved here

Makes sense. 50+ exception types definitely seems like a lot and might not really help the operator.

Status code already serves as one such logical bucket.

My concern is that there could be many status codes, and the stats we are exposing are meant to provide an overall trend, not a mechanism to debug point-in-time errors. Unless we provide some categorization, this list can grow.

Regarding the concern about a growing list of codes, I suspect the number of actual status codes returned would be relatively low (400, 401, 403, 404, 409, 424, 429, 500, 502, 503, 504), but I can't really back that with any data.

@r1walz
Contributor

r1walz commented Jul 20, 2023

@ankitkala: We shall bucket these REST status codes by their category class (2xx, 4xx, etc.), as sketched below. That will be indicative enough to show what's happening with the docs without adding a lot of overhead.
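
As a tiny sketch of that bucketing (an illustration of the idea, not the merged implementation):

// Collapse an HTTP status code into its class bucket, e.g. 429 -> "4xx".
final class DocStatusBuckets {
    static String statusClassBucket(int statusCode) {
        return (statusCode / 100) + "xx";
    }
}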

@shwetathareja called out earlier that the granularity of the previous approach isn't what the stats API usually provides. The new bucketization is also aligned with the stats API's design philosophy.

cc: @mgodwan
