Revisit connector action retry back-off #172518
Labels
Feature:Alerting/RuleActions
Issues related to the Actions attached to Rules on the Alerting Framework
Team:ResponseOps
Label for the ResponseOps team (formerly the Cases and Alerting teams)
Feature Description
When `xpack.actions.run.maxAttempts` is set to a high number, failed actions can be retried long after the initial attempt, because the retry delay is exponential without any ceiling. For example, if max attempts is set to `10`, the 10th attempt will occur 21.333 hours after the first attempt. This occurs even if the alerting rule that triggered the connector action is subsequently deleted.
This creates a very confusing experience for operators who monitor connector action failures and alert on them. *cough* Serverless *cough*. Additionally, if these connector actions succeeded on the 10th attempt, that would be rather confusing too, because so much time would have elapsed between when they were supposed to be sent and when they finally were.
We should revisit the backoff calculation and, at a minimum, impose a "ceiling" beyond which the backoff no longer increases exponentially. @cnasikas shared a good article about the way AWS does backoffs with jitter that we should learn from.
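As a starting point for discussion, here is a minimal sketch of a capped exponential backoff with "full jitter", the general technique described in the AWS article. It is not the current Kibana implementation: the function name `cappedBackoffMs`, the 30-second base, and the 1-hour cap are all assumptions made up for illustration, and the real values would need to be agreed on.

```typescript
// Hypothetical helper, not Kibana's actual retry code.
// baseMs and capMs are illustrative defaults, not values taken from this issue.
function cappedBackoffMs(
  attempt: number, // 1-based number of the attempt that just failed
  baseMs: number = 30 * 1000, // assumed base delay: 30 seconds
  capMs: number = 60 * 60 * 1000, // assumed ceiling: 1 hour
  random: () => number = Math.random // injectable so the result can be tested
): number {
  // Exponential growth: baseMs, 2 * baseMs, 4 * baseMs, ...
  const exponential = baseMs * Math.pow(2, attempt - 1);
  // The ceiling: once the exponential term passes capMs, stop growing.
  const ceiling = Math.min(capMs, exponential);
  // Full jitter: pick a delay uniformly in [0, ceiling) so that many
  // failing actions do not all retry at the same instant.
  return Math.floor(random() * ceiling);
}
```

With a ceiling like this, no single delay can exceed `capMs` no matter how high `xpack.actions.run.maxAttempts` is set, so the total time spent retrying grows at most linearly with the number of attempts instead of doubling each time.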
Business Value
Reduce confusion for operators monitoring connector action success rates. Improve timeliness of connector actions being delivered when there are transient failures.
Definition of Done