
[Task Manager] Monitors the Task Manager Poller and automatically recovers from failure #75420

Merged (7 commits) on Aug 20, 2020

Conversation

@gmmorris (Contributor) commented Aug 19, 2020

Summary

closes #74785

Introduces a monitor around the Task Manager poller which pipes through all values emitted by the poller and recovers from poller failures or stalls.
This monitor does the following:

  1. Catches errors thrown by the poller, recovers by proxying the error to a handler, and continues listening to the poller.
  2. Reacts to the poller's `error` (caused by uncaught errors) and `completion` events by starting a new poller and piping its events through to any previous subscribers (in our case, Task Manager itself).
  3. Tracks the rate at which the poller emits events (these can be both work events and `No Task` events, so polling and finding no work still counts as an emission), times out when the gap between events grows too long (suggesting the poller has hung), and replaces the poller with a new one.

We're not aware of any clear case where the monitor should actually need to restart the poller - restarting is strictly an error scenario, and all known cases have been addressed.
The goal of introducing this monitor is to act as an insurance policy in case an unexpected error breaks the poller in a long-running production environment.
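
To make the behaviour concrete, here is a minimal sketch of how such a monitor could be wired up with RxJS (assuming RxJS 7.2+). It is an approximation of the approach described above rather than the actual implementation, and the names `monitorPoller` and `MonitorOptions` are hypothetical:

```ts
import { Observable, defer, throwError, catchError, repeat, retry, timeout } from 'rxjs';

interface MonitorOptions {
  // Milliseconds allowed between emissions before the poller is considered hung; 0 disables the check.
  inactivityTimeout: number;
  // Handler that poller errors are proxied to before the poller is replaced.
  onError: (error: Error) => void;
}

// Wraps a poller factory so that thrown errors, unexpected completion, and
// inactivity all result in a fresh poller being created, while subscribers
// keep receiving events from the same outer Observable.
function monitorPoller<T>(
  pollerFactory: () => Observable<T>,
  { inactivityTimeout, onError }: MonitorOptions
): Observable<T> {
  return defer(() => {
    const poller$ = pollerFactory();
    // Only watch for inactivity when a positive interval is configured.
    return inactivityTimeout > 0 ? poller$.pipe(timeout(inactivityTimeout)) : poller$;
  }).pipe(
    // Proxy the failure to the handler, then rethrow so retry() below
    // replaces the broken poller with a new one.
    catchError((error: Error) => {
      onError(error);
      return throwError(() => error);
    }),
    retry(), // resubscribe (creating a new poller via defer) after errors, including inactivity timeouts
    repeat() // resubscribe if the poller completes unexpectedly
  );
}
```

A subscriber would then call something like `monitorPoller(() => pollForWork(), { inactivityTimeout: 60_000, onError: (e) => logger.error(e.message) }).subscribe(handleEvent)` (where `pollForWork`, `logger`, and `handleEvent` are placeholders) and keep receiving events across poller restarts.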

Checklist

Delete any items that are not applicable to this PR.

For maintainers

* master: (112 commits)
  [Ingest Manager] Fix agent config rollout rate limit to use constants (elastic#75364)
  Update Node.js to version 10.22.0 (elastic#75254)
  [ML] Anomaly Explorer / Single Metric Viewer: Fix error reporting for annotations. (elastic#74953)
  [Discover] Fix histogram cloud tests (elastic#75268)
  Uiactions to navigate to visualize or maps (elastic#74121)
  Use prefix search invis editor field/agg combo box (elastic#75290)
  Fix docs in trigger alerting UI (elastic#75363)
  [SIEM] Fixes search bar Cypress test (elastic#74833)
  Add libnss3.so to Dockerfile template (reporting) (elastic#75370)
  [Discover] Create field_button and add popovers to sidebar (elastic#73226)
  [Reporting] Network Policy: Do not throw from the intercept handler (elastic#75105)
  [Reporting] Increase capture.timeouts.openUrl to 1 minute (elastic#75207)
  Allow routes to specify the idle socket timeout in addition to the payload timeout (elastic#73730)
  [src/dev/build] remove node-version from snapshots (elastic#75303)
  [ENDPOINT] Reintroduced tabs to endpoint management and migrated pages to use common security components (elastic#74886)
  [Canvas] Remove dependency on legacy expressions APIs (elastic#74885)
  Skip failing test in CI (elastic#75266)
  [Task Manager] time out work when it overruns in poller (elastic#74980)
  [Drilldowns] misc improvements & fixes (elastic#75276)
  Small README note on bumping memory for builds (elastic#75247)
  ...
@gmmorris gmmorris added Feature:Task Manager release_note:enhancement Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 v8.0.0 labels Aug 19, 2020
@gmmorris gmmorris marked this pull request as ready for review August 19, 2020 12:31
@gmmorris gmmorris requested a review from a team as a code owner August 19, 2020 12:31
@elasticmachine (Contributor) commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member) left a comment:

LGTM, added a nit comment about inactivityTimeout === 0


// by default don't monitor inactivity as not all observables are expected
// to emit at any kind of fixed interval
const DEFAULT_INACTIVITY_TIMEOUT = 0;
@pmuellr (Member) commented on this snippet:

It doesn't look like we would ever see the inactivityTimeout === 0 in task manager (or maybe I missed it), so this seems like a generalization we don't really need. OTOH, to get rid of the === 0 checks for it, we'd have to validate that it IS > 0, so ... and there doesn't seem to be much overhead in leaving it the way it is.

@gmmorris (Contributor, author):

Haha yeah, I had exactly the same thought process.
I'll keep it around for now as it made sense from a "utility" perspective and the code read a bit better that way.
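
For context on the trade-off discussed in this thread, here is a hypothetical sketch (not the Kibana code) of the two options: keep the `0` sentinel and branch on it, or drop it and validate that the value is positive.

```ts
import { Observable, timeout } from 'rxjs';

// Option kept in this PR: treat 0 as "don't monitor inactivity" and branch on it.
function withOptionalInactivityCheck<T>(source$: Observable<T>, inactivityTimeout: number): Observable<T> {
  return inactivityTimeout === 0 ? source$ : source$.pipe(timeout(inactivityTimeout));
}

// Alternative weighed above: drop the sentinel and require a positive value.
function withRequiredInactivityCheck<T>(source$: Observable<T>, inactivityTimeout: number): Observable<T> {
  if (inactivityTimeout <= 0) {
    throw new Error('inactivityTimeout must be greater than 0');
  }
  return source$.pipe(timeout(inactivityTimeout));
}
```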

@mikecote mikecote self-requested a review August 20, 2020 13:26
@mikecote (Contributor) left a comment:

Changes LGTM!

x-pack/plugins/task_manager/server/task_manager.ts (review comment resolved; outdated)
@kibanamachine (Contributor) commented:

💚 Build Succeeded

Build metrics

✅ unchanged


To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@gmmorris gmmorris merged commit 5308cc7 into elastic:master Aug 20, 2020
gmmorris added a commit to gmmorris/kibana that referenced this pull request Aug 21, 2020
…overs from failure (elastic#75420)

gmmorris added a commit that referenced this pull request Aug 21, 2020
…overs from failure (#75420) (#75626)

thomasneirynck pushed a commit to thomasneirynck/kibana that referenced this pull request Aug 21, 2020
…overs from failure (elastic#75420)


Successfully merging this pull request may close these issues.

[Alerting] Task Manager doesn't automatically recover if polling fails