schedule retry based on schedule on recurring tasks #83682
Conversation
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
```diff
@@ -260,7 +260,7 @@ export class TaskManagerRunner implements TaskRunner {
       startedAt: now,
       attempts,
       retryAt: this.instance.schedule
-        ? intervalFromNow(this.definition.timeout)!
+        ? intervalFromNow(this.instance.schedule!.interval)!
```
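For context, `intervalFromNow` takes an interval string from the schedule (e.g. `10s` or `5m`) and produces a `Date` that far in the future. The following is a minimal, self-contained sketch of that kind of calculation — an illustrative approximation for this discussion, not the actual Kibana helper:

```typescript
// A rough, standalone approximation of deriving `retryAt` from an interval
// string like "10s" or "5m" — illustrative only, not the real Kibana helper.
function intervalFromNowSketch(interval: string): Date {
  const match = interval.trim().match(/^(\d+)([smhd])$/);
  if (!match) {
    throw new Error(`Invalid interval "${interval}"`);
  }
  const unitMs: Record<string, number> = { s: 1_000, m: 60_000, h: 3_600_000, d: 86_400_000 };
  return new Date(Date.now() + Number(match[1]) * unitMs[match[2]]);
}

// Before this change: retryAt was based on the task definition's timeout.
// After this change: a recurring task's retryAt follows its own schedule.
const retryAt = intervalFromNowSketch('10s'); // e.g. this.instance.schedule.interval === '10s'
```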
I'm curious what your thoughts are on my comment here: #83490 (comment).
It seems like the `retryAt` should be the next scheduled run or the timeout window (whichever is greater), but that may break the timeout logic..
The only concern I had with it is when an alert runs every 5 seconds or less and may get retried too early (in case it takes longer than the interval to run).
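For illustration, here's a minimal sketch of the alternative being suggested — take whichever is later, the next scheduled run or the end of the timeout window. The function and parameter names are invented for this example; they are not Task Manager APIs:

```typescript
// Hypothetical "whichever is greater" rule — names are illustrative only.
function retryAtFromScheduleOrTimeout(
  now: Date,
  scheduleIntervalMs: number,
  timeoutMs: number
): Date {
  const nextScheduledRun = now.getTime() + scheduleIntervalMs;
  const timeoutWindowEnd = now.getTime() + timeoutMs;
  return new Date(Math.max(nextScheduledRun, timeoutWindowEnd));
}

// A task on a 5s schedule with the default 5m timeout would not become
// retryable until the 5m window closes — it can't be retried "too early",
// but its retry can land well after the next scheduled run.
const example = retryAtFromScheduleOrTimeout(new Date(), 5_000, 5 * 60_000);
```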
If someone is running a task that takes longer than X seconds, and it's scheduled to run on an X second interval ... I dunno, all bets are off?
I'd guess we'll eventually make the retry stuff more complicated anyway (eg, add exponential backoff?), so we'll be back in this ... later, with even more interesting cases!
For now, the fix in this PR seems right to me.
> If someone is running a task that takes longer than X seconds

My scenario would be the task usually running fast but taking longer when Elasticsearch is under pressure. I don't have a strong preference either way, but wanted to make sure we discuss the possibility.
Yeah, I went back and forth on this...
It wouldn't break the timeout logic as far as I can tell, but it has a potential pitfall - like Patrick noted, if a timeout is longer than the schedule, it will always trump the schedule. It also occurred to me that if we ever introduce a default timeout and that is higher than a schedule specified by the user, then it will win.
I decided to leave it an open question for now rather than introduce a new behaviour that I'm unsure about.
That said, having read your scenario @mikecote, I'm less sure about it... 🤔
So, here's the thing: if a timeout is specified, then we should allow the task to run that long before retrying, so we should use the higher of the two. But it's worth noting - alerting tasks do not have a timeout, and we don't allow implementers to specify one, so alerting tasks will use the default timeout of `5m`, which means their `retryAt` will never be their schedule...
Perhaps we need to change the default timeout behaviour to equal the schedule when it's specified, otherwise it will be `5m`.
The end result would be:

- A task type with no timeout or schedule will have a `retryAt` of `5m` (the default timeout).
- A task type with a timeout will have a `retryAt` of the specified `timeout` (plus the added `attempts * 5m`).
- A task type with a schedule will have a `retryAt` of the specified `schedule`, so an overrunning `10s` task will short circuit at `10s`. But it also means a `1h` task that stalled and just never completes will retry on its next scheduled time.
- A task with both a timeout and a schedule will have a `retryAt` of `Math.max(task.timeout, task.schedule)`, so a fast task that's got a long timeout (such as `schedule: 10s` and `timeout: 30s`) will be allowed to run for `30s` before retrying, but a `schedule: 5m` task with a shorter `timeout: 30s` will only retry after `5m` (see the sketch after this list).
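To make those four cases concrete, here's a minimal sketch of the proposed rules, assuming the default timeout falls back to the schedule when one is specified (as suggested above). `TaskSketch`, `proposedRetryAt`, and the millisecond fields are invented for this example, and it omits the per-attempt `attempts * 5m` padding:

```typescript
// Illustrative only — not Task Manager's types or implementation.
const DEFAULT_TIMEOUT_MS = 5 * 60_000; // the 5m default discussed above

interface TaskSketch {
  timeoutMs?: number;  // from the task definition, if the task type specifies one
  scheduleMs?: number; // from the task instance's recurring schedule, if any
}

function proposedRetryAt(now: Date, task: TaskSketch): Date {
  // "No timeout = default timeout", but defaulting to the schedule when one exists.
  const timeout = task.timeoutMs ?? task.scheduleMs ?? DEFAULT_TIMEOUT_MS;
  // With a schedule, retry after whichever is longer (options 3 and 4);
  // without one, retry after the timeout (options 1 and 2).
  const delay =
    task.scheduleMs !== undefined ? Math.max(timeout, task.scheduleMs) : timeout;
  return new Date(now.getTime() + delay);
}

// schedule: 10s, timeout: 30s -> retry after 30s
// schedule: 5m,  timeout: 30s -> retry after 5m
// neither                     -> retry after 5m (default timeout)
```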
Perhaps we should also add support for specified `timeout` in alert definitions?
Does that sound right?
I'm thinking "no timeout = default timeout" in the scenarios above. So any task with a schedule would fall into option 4; otherwise it falls into option 2.
The reason I'm thinking this path is because "schedules" are a recommended next run (`startAt + schedule`), and if the task took longer to run than the schedule, it gets re-scheduled to run right away. Similar to how we don't do true cron-like scheduling.
Note / question: this may want to do some sort of `Math.max(now, intervalFromDate(...))`, otherwise it could re-schedule itself at a higher level in the queue to run again if the task took longer than the interval to run.
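As a rough sketch of that clamping idea, with `intervalFromDateSketch` standing in for the real helper (the actual signature may differ):

```typescript
// Illustrative stand-in for computing the next run from the previous start time.
function intervalFromDateSketch(startedAt: Date, intervalMs: number): Date {
  return new Date(startedAt.getTime() + intervalMs);
}

// Clamp to "now" so an overrunning task doesn't re-schedule itself in the
// past, which would put it ahead of other work waiting in the queue.
function clampedNextRun(startedAt: Date, intervalMs: number, now = new Date()): Date {
  const nextRun = intervalFromDateSketch(startedAt, intervalMs);
  return new Date(Math.max(now.getTime(), nextRun.getTime()));
}
```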
> Perhaps we should also add support for specified `timeout` in alert definitions?
This is definitely something we should support at some point. If it's easy to add at the same time, I'm +1 on it; otherwise an issue is perfectly fine.
👍
LGTM
Finished my review and changes LGTM!
@elasticmachine merge upstream
* master: (60 commits)
  - Forward any registry cache-control header for files (elastic#83680)
  - Revert "[Alerting] Add `alert.updatedAt` field to represent date of last user edit (elastic#83578)"
  - [Security Solution][Detections] Fix adding an action to detection rules (elastic#83722)
  - Make expectSnapshot available in all functional test runs (elastic#82932)
  - Skip failing cypress test
  - Increase bulk request timeout during esArchiver load (elastic#83657)
  - [data.search] Server-side background session service (elastic#81099)
  - [maps] convert VectorStyleEditor to TS (elastic#83582)
  - Revert "[App Search] Engine overview layout stub (elastic#83504)"
  - Adding documentation for global action configuration options (elastic#83557)
  - [Metrics UI] Optimizations for Snapshot and Inventory Metadata (elastic#83596)
  - chore(NA): update lmdb store to v0.8.15 (elastic#83726)
  - [App Search] Engine overview layout stub (elastic#83504)
  - [Workplace Search] Update SourceIcon to match latest changes in ent-search (elastic#83714)
  - [Enterprise Search] Rename React Router helpers (elastic#83718)
  - [Maps] Add 'crossed' & 'exited' events to tracking alert (elastic#82463)
  - Updating code-owners to use new core/app-services team names (elastic#83731)
  - Add Managed label to data streams and a view switch for the table (elastic#83049)
  - [Maps] Add query bar inputs to geo threshold alerts tracked points & boundaries (elastic#80871)
  - fix(NA): search examples kibana version declaration (elastic#83182)
  - ...
…kibana into task-manager/retry-by-schedule * 'task-manager/retry-by-schedule' of github.com:gmmorris/kibana:
💚 Build Succeeded
Metrics [docs]
History
To update your PR or re-run it, just comment with:
…rring tasks (elastic#83682) This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of elastic#39349). In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
…rring tasks (#83682) (#83800) This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349). In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
* master:
  - skip "Dashboards linked by a drilldown are both copied to a space" (elastic#83824)
  - [alerts] adds action group and date to mustache template variables for actions (elastic#83195)
  - skip flaky suite (elastic#79389)
  - [DOCS] Reallocates limitations to point-of-use (elastic#79582)
  - [Enterprise Search] Engine overview layout stub (elastic#83756)
  - Disable exporting/importing of templates. Optimize pitch images a bit (elastic#83098)
  - [DOCS] Consolidates plugins (elastic#83712)
  - [ML] Space management UI (elastic#83320)
  - test just part of the message to avoid updates (elastic#83703)
  - [Data Table] Remove extra column in split mode (elastic#83193)
  - Improve snapshot error messages (elastic#83785)
  - skip flaky suite (elastic#83773)
  - skip flaky suite (elastic#83771)
  - skip flaky suite (elastic#65278)
  - skip flaky suite (elastic#83793)
  - [Task Manager] Ensures retries are inferred from the schedule of recurring tasks (elastic#83682)
  - [index patterns] improve index pattern cache (elastic#83368)
  - [Fleet] Rename ingestManager plugin ID fleet (elastic#83200)
  - fixed pagination in connectors list (elastic#83638)
Summary
Closes #83490
This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.

Checklist
Delete any items that are not applicable to this PR.
- [ ] Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
- [ ] Documentation was added for features that require explanation or tutorials
- [ ] Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)
- [ ] Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)
- [ ] This renders correctly on smaller devices using a responsive layout. (You can test this in your browser)
- [ ] This was checked for cross-browser compatibility

For maintainers