
schedule retry based on schedule on recurring tasks #83682

Merged

merged 5 commits into elastic:master from gmmorris:task-manager/retry-by-schedule

Nov 19, 2020

Conversation

gmmorris
Contributor

@gmmorris gmmorris commented Nov 18, 2020

Summary

Closes #83490

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default timeout of the task type.
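The behavioural change can be illustrated with a minimal sketch. The names mirror the Task Manager code touched by this PR, but the types and the `intervalFromNow`/`parseIntervalMs` helpers here are simplified stand-ins, not the actual implementation:

```typescript
// Simplified, hypothetical types standing in for Task Manager's real ones.
type TaskInstance = { schedule?: { interval: string } };
type TaskDefinition = { timeout: string };

// Parse interval strings like "30s", "5m", or "1h" into milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^(\d+)([smh])$/.exec(interval);
  if (!match) throw new Error(`Invalid interval: ${interval}`);
  const unitMs = { s: 1_000, m: 60_000, h: 3_600_000 }[match[2] as 's' | 'm' | 'h'];
  return Number(match[1]) * unitMs;
}

function intervalFromNow(interval: string, now: number): Date {
  return new Date(now + parseIntervalMs(interval));
}

// Before this PR: a recurring task's retryAt used the definition's timeout.
function retryAtBefore(i: TaskInstance, d: TaskDefinition, now: number): Date | null {
  return i.schedule ? intervalFromNow(d.timeout, now) : null;
}

// After this PR: a recurring task's retryAt is its next scheduled run.
function retryAtAfter(i: TaskInstance, d: TaskDefinition, now: number): Date | null {
  return i.schedule ? intervalFromNow(i.schedule.interval, now) : null;
}
```

For a task on a 10s schedule whose type has a 5m timeout, the old code would set `retryAt` five minutes out, while the fixed code sets it ten seconds out.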

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@gmmorris gmmorris added Feature:Task Manager release_note:fix Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.11.0 v8.0.0 labels Nov 18, 2020
@gmmorris gmmorris marked this pull request as ready for review November 18, 2020 18:30
@gmmorris gmmorris requested a review from a team as a code owner November 18, 2020 18:30
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@@ -260,7 +260,7 @@ export class TaskManagerRunner implements TaskRunner {
       startedAt: now,
       attempts,
       retryAt: this.instance.schedule
-        ? intervalFromNow(this.definition.timeout)!
+        ? intervalFromNow(this.instance.schedule!.interval)!

I'm curious what your thoughts are on my comment here: #83490 (comment).

It seems like the retryAt should be the next scheduled run or the timeout window (whichever is greater), but that may break the timeout logic...

The only concern I had with it is that an alert running every 5 seconds or less may get retried too early (in case it takes longer than the interval to run).

Member

If someone is running a task that takes longer than X seconds, and is scheduled to run on an X second interval ... I dunno, all bets are off?

I'd guess we'll eventually make the retry logic more complicated anyway (e.g. add exponential backoff?), so we'll be in this again later, with even more interesting cases!

For now, the fix in this PR seems right to me.

Contributor

If someone is running a task that takes longer than X seconds

My scenario would be a task that usually runs fast but takes longer when Elasticsearch is under pressure. I don't have a strong preference either way, but wanted to make sure we discuss the possibility.

Contributor Author

@gmmorris gmmorris Nov 19, 2020


Yeah I went back and forth on this...

It wouldn't break the timeout logic as best I can tell, but it has a potential pitfall: as Patrick noted, if a timeout is longer than the schedule, it will always trump the schedule. It also occurred to me that if we ever introduce a default timeout that is higher than a schedule specified by the user, then it will win.
I decided to leave it an open question for now rather than introduce a new behaviour that I'm unsure about.

That said, having read your scenario @mikecote I'm less sure about it... 🤔

So, here's the thing: if a timeout is specified, then we should allow the task to run that long before retrying, so we should use the higher of both. But it's worth noting - alerting tasks do not have a timeout, and we don't allow implementers to specify one, so alerting tasks will use the default timeout of 5m, which means their retryAt will never be their schedule...

Perhaps we need to change the default timeout behaviour to equal the schedule when it's specified, otherwise it will be 5m.

The end result would be:

  1. A task type with no timeout or schedule will have a retryAt of 5m (default timeout).
  2. A task type with a timeout will have a retryAt of the specified timeout (plus the added attempts * 5m).
  3. A task type with a schedule will have a retryAt of the specified schedule, so an overrunning 10s task will short circuit at 10s. But it also means a 1h task that stalled and just never completes will retry on its next scheduled time.
  4. A task with both a timeout and a schedule will have a retryAt of Math.max(task.timeout, task.schedule), so a fast task with a long timeout (such as schedule:10s and timeout:30s) will be allowed to run for 30s before retrying, but a schedule:5m task with a shorter timeout:30s will only retry after 5m.

Perhaps we should also add support for specified timeout in alert definitions?

Does that sound right?
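The rule in scenario 4 could be sketched roughly as below. This is a hypothetical illustration, not Task Manager code; it assumes interval strings like "30s"/"5m" and applies the 5m default when no timeout is specified:

```typescript
// Hypothetical sketch of the "higher of timeout or schedule" retryAt rule.
const DEFAULT_TIMEOUT_MS = 5 * 60_000; // assumed 5m default timeout

// Parse interval strings like "30s", "5m", or "1h" into milliseconds.
function parseIntervalMs(interval: string): number {
  const match = /^(\d+)([smh])$/.exec(interval);
  if (!match) throw new Error(`Invalid interval: ${interval}`);
  return Number(match[1]) * { s: 1_000, m: 60_000, h: 3_600_000 }[match[2] as 's' | 'm' | 'h'];
}

// Retry delay for a task with an optional timeout and optional schedule:
// the greater of the two wins, falling back to the default timeout.
function proposedRetryDelayMs(timeout?: string, schedule?: string): number {
  const timeoutMs = timeout ? parseIntervalMs(timeout) : DEFAULT_TIMEOUT_MS;
  const scheduleMs = schedule ? parseIntervalMs(schedule) : 0;
  return Math.max(timeoutMs, scheduleMs);
}
```

Under this sketch, schedule:10s with timeout:30s retries after 30s, while schedule:5m with timeout:30s retries after 5m, matching scenario 4 above.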

Contributor

I'm thinking "no timeout = default timeout" in the scenarios above. So any task with a schedule would fall into option 4 otherwise fall into option 2.

The reason I'm thinking this path is because "schedules" are a recommended next run (startAt + schedule): if the task took longer to run than the schedule, it gets re-scheduled to run right away. Similar to how we don't do true cron-like scheduling.

Note / question: This may want to do some sort of Math.max(now, intervalFromDate(...)), otherwise it could re-schedule itself at a higher level in the queue to run again if the task took longer than the interval to run.
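A sketch of that clamp (hypothetical helper names; `intervalFromDate` here is assumed to simply add an interval to a date, which may differ from the real helper):

```typescript
// Hypothetical stand-in for Task Manager's intervalFromDate helper.
function intervalFromDate(date: Date, intervalMs: number): Date {
  return new Date(date.getTime() + intervalMs);
}

// Clamp retryAt to "now" so a task that ran longer than its interval
// retries immediately instead of getting a retryAt in the past (which
// would place it higher in the queue than intended).
function clampedRetryAt(startedAt: Date, intervalMs: number, now: Date): Date {
  const scheduled = intervalFromDate(startedAt, intervalMs);
  return new Date(Math.max(now.getTime(), scheduled.getTime()));
}
```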

Perhaps we should also add support for specified timeout in alert definitions?

This is definitely something we should support at some point, if it's easy to add at the same time, I'm +1 on it otherwise an issue is perfectly fine.

Contributor Author

👍

Member

@pmuellr pmuellr left a comment


LGTM


Contributor

@mikecote mikecote left a comment


Finished my review and changes LGTM!

@gmmorris
Contributor Author

@elasticmachine merge upstream

kibanamachine and others added 4 commits November 18, 2020 19:39
* master: (60 commits)
  Forward any registry cache-control header for files (elastic#83680)
  Revert "[Alerting] Add `alert.updatedAt` field to represent date of last user edit (elastic#83578)"
  [Security Solution][Detections] Fix adding an action to detection rules (elastic#83722)
  Make expectSnapshot available in all functional test runs (elastic#82932)
  Skip failing cypress test
  Increase bulk request timeout during esArchiver load (elastic#83657)
  [data.search] Server-side background session service (elastic#81099)
  [maps] convert VectorStyleEditor to TS (elastic#83582)
  Revert "[App Search] Engine overview layout stub (elastic#83504)"
  Adding documentation for global action configuration options (elastic#83557)
  [Metrics UI] Optimizations for Snapshot and Inventory Metadata (elastic#83596)
  chore(NA): update lmdb store to v0.8.15 (elastic#83726)
  [App Search] Engine overview layout stub (elastic#83504)
  [Workplace Search] Update SourceIcon to match latest changes in ent-search (elastic#83714)
  [Enterprise Search] Rename React Router helpers (elastic#83718)
  [Maps] Add 'crossed' & 'exited' events to tracking alert (elastic#82463)
  Updating code-owners to use new core/app-services team names (elastic#83731)
  Add Managed label to data streams and a view switch for the table (elastic#83049)
  [Maps] Add query bar inputs to geo threshold alerts tracked points & boundaries (elastic#80871)
  fix(NA): search examples kibana version declaration (elastic#83182)
  ...
…kibana into task-manager/retry-by-schedule

* 'task-manager/retry-by-schedule' of github.com:gmmorris/kibana:
@kibanamachine
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged


To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@gmmorris gmmorris merged commit 3b0215c into elastic:master Nov 19, 2020
gmmorris added a commit to gmmorris/kibana that referenced this pull request Nov 19, 2020
…rring tasks (elastic#83682)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of elastic#39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
gmmorris added a commit that referenced this pull request Nov 19, 2020
…rring tasks (#83682) (#83800)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of #39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
gmmorris added a commit to gmmorris/kibana that referenced this pull request Nov 19, 2020
* master:
  skip "Dashboards linked by a drilldown are both copied to a space" (elastic#83824)
  [alerts] adds action group and date to mustache template variables for actions (elastic#83195)
  skip flaky suite (elastic#79389)
  [DOCS] Reallocates limitations to point-of-use (elastic#79582)
  [Enterprise Search] Engine overview layout stub (elastic#83756)
  Disable exporting/importing of templates.  Optimize pitch images a bit (elastic#83098)
  [DOCS] Consolidates plugins (elastic#83712)
  [ML] Space management UI (elastic#83320)
  test just part of the message to avoid updates (elastic#83703)
  [Data Table] Remove extra column in split mode (elastic#83193)
  Improve snapshot error messages (elastic#83785)
  skip flaky suite (elastic#83773)
  skip flaky suite (elastic#83771)
  skip flaky suite (elastic#65278)
  skip flaky suite (elastic#83793)
  [Task Manager] Ensures retries are inferred from the schedule of recurring tasks (elastic#83682)
  [index patterns] improve index pattern cache (elastic#83368)
  [Fleet] Rename ingestManager plugin ID fleet (elastic#83200)
  fixed pagination in connectors list (elastic#83638)
chrisronline pushed a commit to chrisronline/kibana that referenced this pull request Nov 19, 2020
…rring tasks (elastic#83682)

This addresses a bug in Task Manager in the task timeout behaviour. When a recurring task's `retryAt` field is set (which happens at task run), it is currently scheduled to the task definition's `timeout` value, but the original intention was for these tasks to retry on their next scheduled run (originally identified as part of elastic#39349).

In this PR we ensure recurring task retries are scheduled according to their recurring schedule, rather than the default `timeout` of the task type.
Successfully merging this pull request may close these issues.

Alerts that timeout during execution don't respect their schedule
5 participants