
Revisit connector action retry back-off #172518

Closed
kobelb opened this issue Dec 4, 2023 · 1 comment · Fixed by #173779
Assignees: doakalexi
Labels: Feature:Alerting/RuleActions (Issues related to the Actions attached to Rules on the Alerting Framework), Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments


kobelb commented Dec 4, 2023

Feature Description

When xpack.actions.run.maxAttempts is set to a high number, failed actions can be retried a significant duration after the initial attempt, because the delay grows exponentially without any ceiling:

export function calculateDelay(attempts: number) {
  if (attempts === 1) {
    return 30 * 1000; // 30s
  } else {
    // get multiples of 5 min
    const defaultBackoffPerFailure = 5 * 60 * 1000;
    return defaultBackoffPerFailure * Math.pow(2, attempts - 2);
  }
}

For example, if max attempts is set to 10, the 10th attempt will occur 21.333 hours after the first attempt. This occurs even if the alerting rule that triggered the connector action is subsequently deleted.
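For illustration, a minimal sketch that replays the quoted calculateDelay for ten attempts, assuming the delay applied before attempt n+1 is calculateDelay(n) (how the task manager maps attempt counts to this function is an assumption here); the cumulative wait works out to roughly 21 hours:

// Illustrative sketch only: tabulate the schedule the quoted calculateDelay produces.
function calculateDelay(attempts: number) {
  if (attempts === 1) {
    return 30 * 1000; // 30s
  }
  const defaultBackoffPerFailure = 5 * 60 * 1000;
  return defaultBackoffPerFailure * Math.pow(2, attempts - 2);
}

let cumulativeMs = 0;
for (let attempt = 2; attempt <= 10; attempt++) {
  // Assumed mapping: the delay before attempt n is calculateDelay(n - 1).
  const delayMs = calculateDelay(attempt - 1);
  cumulativeMs += delayMs;
  console.log(
    `attempt ${attempt}: +${delayMs / 60000} min ` +
      `(~${(cumulativeMs / 3600000).toFixed(2)} h after attempt 1)`
  );
}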

This creates a very confusing experience for operators who are monitoring connector action failures and alerting based on them. cough Serverless cough. Additionally, if a connector action does finally succeed on the 10th attempt, that is confusing in its own right, because so much time has elapsed between when it was supposed to be sent and when it was actually delivered.

We should revisit the backoff calculation and, at a minimum, impose a ceiling past which the backoff no longer grows exponentially. @cnasikas shared a good article about the way that AWS does backoff with jitter, which we should learn from.
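For reference, a minimal sketch of the "full jitter" variant described in the AWS article, assuming an illustrative 30s base and 1h cap (neither value is a proposal for the actual settings):

// Capped exponential backoff with full jitter: the delay is a uniform random
// value between 0 and min(cap, base * 2^attempts). Base and cap are assumptions.
function fullJitterDelay(attempts: number, baseMs = 30 * 1000, capMs = 60 * 60 * 1000) {
  const exponentialMs = baseMs * Math.pow(2, attempts);
  return Math.random() * Math.min(capMs, exponentialMs);
}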

Business Value

Reduce confusion for operators monitoring connector action success rates. Improve the timeliness of connector action delivery when there are transient failures.

Definition of Done

  • Connector actions no longer retried on a purely exponential basis
  • Unit tests
@kobelb added the Team:ResponseOps and Feature:Alerting/RuleActions labels Dec 4, 2023
@elasticmachine

Pinging @elastic/response-ops (Team:ResponseOps)

@doakalexi doakalexi moved this from Awaiting Triage to Todo in AppEx: ResponseOps - Execution & Connectors Dec 7, 2023
@doakalexi doakalexi self-assigned this Dec 19, 2023
@doakalexi doakalexi moved this from Todo to In Progress in AppEx: ResponseOps - Execution & Connectors Dec 19, 2023
@doakalexi doakalexi moved this from In Progress to In Review in AppEx: ResponseOps - Execution & Connectors Jan 2, 2024
doakalexi added a commit that referenced this issue Jan 3, 2024
Resolves #172518

## Summary

Updates the retry delay calculation to cap the delay at 1hr and
introduces jitter.

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios


### To verify
- Create a rule and then force a retry failure
- Verify that the retry follows the pattern below:

  Attempt 1: now
  Attempt 2: 30s after the first attempt
  Attempt 3: 0 - 5m after the second attempt
  Attempt 4: 0 - 10m after the third attempt
  Attempt 5: 0 - 20m after the fourth attempt
  Attempt 6: 0 - 40m after the fifth attempt
  Attempt n: 0 - 1hr for all other attempts
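
For reference, a minimal sketch of a calculateDelay that would produce the pattern above (a fixed 30s before the second attempt, then full jitter over a doubling window capped at 1h); the actual implementation in #173779 may differ:

function calculateDelay(attempts: number) {
  if (attempts === 1) {
    return 30 * 1000; // attempt 2 happens a fixed 30s after the first failure
  }
  // The window doubles from 5m upward but is capped at 1h; full jitter means the
  // actual delay is a uniform random value between 0 and the window.
  const defaultBackoffPerFailure = 5 * 60 * 1000;
  const maxDelayMs = 60 * 60 * 1000; // 1h ceiling
  const windowMs = Math.min(maxDelayMs, defaultBackoffPerFailure * Math.pow(2, attempts - 2));
  return Math.random() * windowMs;
}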