
Revisit connector action retry back-off #172518

Closed
kobelb opened this issue Dec 4, 2023 · 1 comment · Fixed by #173779
Assignees: doakalexi
Labels: Feature:Alerting/RuleActions (Issues related to the Actions attached to Rules on the Alerting Framework), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments


kobelb commented Dec 4, 2023

Feature Description

When xpack.actions.run.maxAttempts is set to a high number, failed actions can be retried a significant duration after the initial attempt, because the delay grows exponentially without any ceiling:

export function calculateDelay(attempts: number) {
  if (attempts === 1) {
    return 30 * 1000; // 30s
  } else {
    // get multiples of 5 min
    const defaultBackoffPerFailure = 5 * 60 * 1000;
    return defaultBackoffPerFailure * Math.pow(2, attempts - 2);
  }
}

For example, if max attempts is set to 10, the 10th attempt will occur 21.333 hours after the first attempt. This occurs even if the alerting rule that triggered the connector action is subsequently deleted.
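For illustration, a minimal sketch that replays the quoted calculateDelay for ten attempts, assuming the delay applied before attempt n+1 is calculateDelay(n) (how the task manager maps attempt counts to this function is an assumption here); the cumulative wait works out to roughly 21 hours:

// Illustrative sketch only: tabulate the schedule the quoted calculateDelay produces.
function calculateDelay(attempts: number) {
  if (attempts === 1) {
    return 30 * 1000; // 30s
  }
  const defaultBackoffPerFailure = 5 * 60 * 1000;
  return defaultBackoffPerFailure * Math.pow(2, attempts - 2);
}

let cumulativeMs = 0;
for (let attempt = 2; attempt <= 10; attempt++) {
  // Assumed mapping: the delay before attempt n is calculateDelay(n - 1).
  const delayMs = calculateDelay(attempt - 1);
  cumulativeMs += delayMs;
  console.log(
    `attempt ${attempt}: +${delayMs / 60000} min ` +
      `(~${(cumulativeMs / 3600000).toFixed(2)} h after attempt 1)`
  );
}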

This creates a very confusing experience for operators who are monitoring connector action failures and alerting based on them. cough Serverless cough. Additionally, if a connector action does finally succeed on the 10th attempt, that is confusing in its own right, because so much time has elapsed between when it was supposed to be sent and when it was actually delivered.

We should revisit the backoff calculation and, at a minimum, impose a ceiling past which the backoff no longer grows exponentially. @cnasikas shared a good article about the way that AWS does backoff with jitter, which we should learn from.
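For reference, a minimal sketch of the "full jitter" variant described in the AWS article, assuming an illustrative 30s base and 1h cap (neither value is a proposal for the actual settings):

// Capped exponential backoff with full jitter: the delay is a uniform random
// value between 0 and min(cap, base * 2^attempts). Base and cap are assumptions.
function fullJitterDelay(attempts: number, baseMs = 30 * 1000, capMs = 60 * 60 * 1000) {
  const exponentialMs = baseMs * Math.pow(2, attempts);
  return Math.random() * Math.min(capMs, exponentialMs);
}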

Business Value

Reduce confusion for operators monitoring connector action success rates. Improve the timeliness of connector action delivery when there are transient failures.

Definition of Done

  • Connector actions no longer retried on a purely exponential basis
  • Unit tests
@kobelb added the Team:ResponseOps and Feature:Alerting/RuleActions labels Dec 4, 2023
@elasticmachine

Pinging @elastic/response-ops (Team:ResponseOps)

@doakalexi doakalexi moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Dec 7, 2023
@doakalexi doakalexi self-assigned this Dec 19, 2023
@doakalexi doakalexi moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Dec 19, 2023
@doakalexi doakalexi moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Jan 2, 2024
doakalexi added a commit that referenced this issue Jan 3, 2024
Resolves #172518

## Summary

Updates the retry delay calculation to cap the delay at 1hr and
introduces jitter.

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios


### To verify
- Create a rule and then force a retry failure
- Verify that the retry follows the pattern below:

  Attempt 1: now
  Attempt 2: 30s after the first attempt
  Attempt 3: 0 - 5m after the second attempt
  Attempt 4: 0 - 10m after the third attempt
  Attempt 5: 0 - 20m after the fourth attempt
  Attempt 6: 0 - 40m after the fifth attempt
  Attempt n: 0 - 1hr for all other attempts
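
For reference, a minimal sketch of a calculateDelay that would produce the pattern above (a fixed 30s before the second attempt, then full jitter over a doubling window capped at 1h); the actual implementation in #173779 may differ:

function calculateDelay(attempts: number) {
  if (attempts === 1) {
    return 30 * 1000; // attempt 2 happens a fixed 30s after the first failure
  }
  // The window doubles from 5m upward but is capped at 1h; full jitter means the
  // actual delay is a uniform random value between 0 and the window.
  const defaultBackoffPerFailure = 5 * 60 * 1000;
  const maxDelayMs = 60 * 60 * 1000; // 1h ceiling
  const windowMs = Math.min(maxDelayMs, defaultBackoffPerFailure * Math.pow(2, attempts - 2));
  return Math.random() * windowMs;
}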