
schedule retry based on schedule on recurring tasks #83682

Merged

merged 5 commits into elastic:master from gmmorris:task-manager/retry-by-schedule

Nov 19, 2020

Conversation

gmmorris
Contributor

@gmmorris gmmorris commented Nov 18, 2020

Summary

Closes #83490

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default timeout of the task type.
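The behavioural change can be illustrated with a minimal sketch. The names mirror the Task Manager code touched by this PR, but the types and the `intervalFromNow`/`parseIntervalMs` helpers here are simplified stand-ins, not the actual implementation:

```typescript
// Simplified, hypothetical types standing in for Task Manager's real ones.
type TaskInstance = { schedule?: { interval: string } };
type TaskDefinition = { timeout: string };

// Parse interval strings like "30s", "5m", or "1h" into milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^(\d+)([smh])$/.exec(interval);
  if (!match) throw new Error(`Invalid interval: ${interval}`);
  const unitMs = { s: 1_000, m: 60_000, h: 3_600_000 }[match[2] as 's' | 'm' | 'h'];
  return Number(match[1]) * unitMs;
}

function intervalFromNow(interval: string, now: number): Date {
  return new Date(now + parseIntervalMs(interval));
}

// Before this PR: a recurring task's retryAt used the definition's timeout.
function retryAtBefore(i: TaskInstance, d: TaskDefinition, now: number): Date | null {
  return i.schedule ? intervalFromNow(d.timeout, now) : null;
}

// After this PR: a recurring task's retryAt is its next scheduled run.
function retryAtAfter(i: TaskInstance, d: TaskDefinition, now: number): Date | null {
  return i.schedule ? intervalFromNow(i.schedule.interval, now) : null;
}
```

For a task on a 10s schedule whose type has a 5m timeout, the old code would set `retryAt` five minutes out, while the fixed code sets it ten seconds out.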

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@gmmorris gmmorris added Feature:Task Manager release_note:fix Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.11.0 v8.0.0 labels Nov 18, 2020
@gmmorris gmmorris marked this pull request as ready for review November 18, 2020 18:30
@gmmorris gmmorris requested a review from a team as a code owner November 18, 2020 18:30
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@@ -260,7 +260,7 @@ export class TaskManagerRunner implements TaskRunner {
       startedAt: now,
       attempts,
       retryAt: this.instance.schedule
-        ? intervalFromNow(this.definition.timeout)!
+        ? intervalFromNow(this.instance.schedule!.interval)!

I'm curious what your thoughts are on my comment here: #83490 (comment).

It seems like the retryAt should be the next scheduled run or the timeout window (whichever is greater), but that may break the timeout logic...

The only concern I had with it is that an alert running every 5 seconds or less may get retried too early (in case it takes longer than the interval to run).

Member

If someone is running a task that takes longer than X seconds, and is scheduled to run on an X second interval ... I dunno, all bets are off?

I'd guess we'll eventually make the retry logic more complicated anyway (e.g. add exponential backoff?), so we'll be in this again later, with even more interesting cases!

For now, the fix in this PR seems right to me.

Contributor

If someone is running a task that takes longer than X seconds

My scenario would be a task that usually runs fast but takes longer when Elasticsearch is under pressure. I don't have a strong preference either way, but wanted to make sure we discuss the possibility.

Contributor Author

@gmmorris gmmorris Nov 19, 2020


Yeah I went back and forth on this...

It wouldn't break the timeout logic as best I can tell, but it has a potential pitfall: as Patrick noted, if a timeout is longer than the schedule, it will always trump the schedule. It also occurred to me that if we ever introduce a default timeout that is higher than a schedule specified by the user, then it will win.
I decided to leave it an open question for now rather than introduce a new behaviour that I'm unsure about.

That said, having read your scenario @mikecote I'm less sure about it... 🤔

So, here's the thing: if a timeout is specified, then we should allow the task to run that long before retrying, so we should use the higher of both. But it's worth noting - alerting tasks do not have a timeout, and we don't allow implementers to specify one, so alerting tasks will use the default timeout of 5m, which means their retryAt will never be their schedule...

Perhaps we need to change the default timeout behaviour to equal the schedule when it's specified, otherwise it will be 5m.

The end result would be:

  1. A task type with no timeout or schedule will have a retryAt of 5m (default timeout).
  2. A task type with a timeout will have a retryAt of the specified timeout (plus the added attempts * 5m).
  3. A task type with a schedule will have a retryAt of the specified schedule, so an overrunning 10s task will short circuit at 10s. But it also means a 1h task that stalled and just never completes will retry on its next scheduled time.
  4. A task with both a timeout and a schedule will have a retryAt of Math.max(task.timeout, task.schedule), so a fast task with a long timeout (such as schedule:10s and timeout:30s) will be allowed to run for 30s before retrying, but a schedule:5m task with a shorter timeout:30s will only retry after 5m.

Perhaps we should also add support for specified timeout in alert definitions?

Does that sound right?
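The rule in scenario 4 could be sketched roughly as below. This is a hypothetical illustration, not Task Manager code; it assumes interval strings like "30s"/"5m" and applies the 5m default when no timeout is specified:

```typescript
// Hypothetical sketch of the "higher of timeout or schedule" retryAt rule.
const DEFAULT_TIMEOUT_MS = 5 * 60_000; // assumed 5m default timeout

// Parse interval strings like "30s", "5m", or "1h" into milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^(\d+)([smh])$/.exec(interval);
  if (!match) throw new Error(`Invalid interval: ${interval}`);
  return Number(match[1]) * { s: 1_000, m: 60_000, h: 3_600_000 }[match[2] as 's' | 'm' | 'h'];
}

// Retry delay for a task with an optional timeout and optional schedule:
// the greater of the two wins, falling back to the default timeout.
function proposedRetryDelayMs(timeout?: string, schedule?: string): number {
  const timeoutMs = timeout ? parseIntervalMs(timeout) : DEFAULT_TIMEOUT_MS;
  const scheduleMs = schedule ? parseIntervalMs(schedule) : 0;
  return Math.max(timeoutMs, scheduleMs);
}
```

Under this sketch, schedule:10s with timeout:30s retries after 30s, while schedule:5m with timeout:30s retries after 5m, matching scenario 4 above.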

Contributor

I'm thinking "no timeout = default timeout" in the scenarios above. So any task with a schedule would fall into option 4 otherwise fall into option 2.

The reason I'm thinking this path is because "schedules" are a recommended next run (startAt + schedule): if the task took longer to run than the schedule, it gets re-scheduled to run right away. Similar to how we don't do true cron-like scheduling.

Note / question: This may want to do some sort of Math.max(now, intervalFromDate(...)), otherwise it could re-schedule itself at a higher level in the queue to run again if the task took longer than the interval to run.
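A sketch of that clamp (hypothetical helper names; `intervalFromDate` here is assumed to simply add an interval to a date, which may differ from the real helper):

```typescript
// Hypothetical stand-in for Task Manager's intervalFromDate helper.
function intervalFromDate(date: Date, intervalMs: number): Date {
  return new Date(date.getTime() + intervalMs);
}

// Clamp retryAt to "now" so a task that ran longer than its interval
// retries immediately instead of getting a retryAt in the past (which
// would place it higher in the queue than intended).
function clampedRetryAt(startedAt: Date, intervalMs: number, now: Date): Date {
  const scheduled = intervalFromDate(startedAt, intervalMs);
  return new Date(Math.max(now.getTime(), scheduled.getTime()));
}
```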

Perhaps we should also add support for specified timeout in alert definitions?

This is definitely something we should support at some point, if it's easy to add at the same time, I'm +1 on it otherwise an issue is perfectly fine.

Contributor Author

👍

Member

@pmuellr pmuellr left a comment


LGTM


Contributor

@mikecote mikecote left a comment


Finished my review and changes LGTM!

@gmmorris
Contributor Author

@elasticmachine merge upstream

kibanamachine and others added 4 commits November 18, 2020 19:39
* master: (60 commits)
  Forward any registry cache-control header for files (elastic#83680)
  Revert "[Alerting] Add `alert.updatedAt` field to represent date of last user edit (elastic#83578)"
  [Security Solution][Detections] Fix adding an action to detection rules (elastic#83722)
  Make expectSnapshot available in all functional test runs (elastic#82932)
  Skip failing cypress test
  Increase bulk request timeout during esArchiver load (elastic#83657)
  [data.search] Server-side background session service (elastic#81099)
  [maps] convert VectorStyleEditor to TS (elastic#83582)
  Revert "[App Search] Engine overview layout stub (elastic#83504)"
  Adding documentation for global action configuration options (elastic#83557)
  [Metrics UI] Optimizations for Snapshot and Inventory Metadata (elastic#83596)
  chore(NA): update lmdb store to v0.8.15 (elastic#83726)
  [App Search] Engine overview layout stub (elastic#83504)
  [Workplace Search] Update SourceIcon to match latest changes in ent-search (elastic#83714)
  [Enterprise Search] Rename React Router helpers (elastic#83718)
  [Maps] Add 'crossed' & 'exited' events to tracking alert (elastic#82463)
  Updating code-owners to use new core/app-services team names (elastic#83731)
  Add Managed label to data streams and a view switch for the table (elastic#83049)
  [Maps] Add query bar inputs to geo threshold alerts tracked points & boundaries (elastic#80871)
  fix(NA): search examples kibana version declaration (elastic#83182)
  ...
…kibana into task-manager/retry-by-schedule

* 'task-manager/retry-by-schedule' of github.com:gmmorris/kibana:
@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged


To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@gmmorris gmmorris merged commit 3b0215c into elastic:master Nov 19, 2020
gmmorris added a commit to gmmorris/kibana that referenced this pull request Nov 19, 2020
…rring tasks (elastic#83682)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of elastic#39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
gmmorris added a commit that referenced this pull request Nov 19, 2020
…rring tasks (#83682) (#83800)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
gmmorris added a commit to gmmorris/kibana that referenced this pull request Nov 19, 2020
* master:
  skip "Dashboards linked by a drilldown are both copied to a space" (elastic#83824)
  [alerts] adds action group and date to mustache template variables for actions (elastic#83195)
  skip flaky suite (elastic#79389)
  [DOCS] Reallocates limitations to point-of-use (elastic#79582)
  [Enterprise Search] Engine overview layout stub (elastic#83756)
  Disable exporting/importing of templates.  Optimize pitch images a bit (elastic#83098)
  [DOCS] Consolidates plugins (elastic#83712)
  [ML] Space management UI (elastic#83320)
  test just part of the message to avoid updates (elastic#83703)
  [Data Table] Remove extra column in split mode (elastic#83193)
  Improve snapshot error messages (elastic#83785)
  skip flaky suite (elastic#83773)
  skip flaky suite (elastic#83771)
  skip flaky suite (elastic#65278)
  skip flaky suite (elastic#83793)
  [Task Manager] Ensures retries are inferred from the schedule of recurring tasks (elastic#83682)
  [index patterns] improve index pattern cache (elastic#83368)
  [Fleet] Rename ingestManager plugin ID fleet (elastic#83200)
  fixed pagination in connectors list (elastic#83638)
chrisronline pushed a commit to chrisronline/kibana that referenced this pull request Nov 19, 2020
…rring tasks (elastic#83682)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of elastic#39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
Successfully merging this pull request may close these issues.

Alerts that timeout during execution don't respect their schedule
5 participants