
[Task Manager] Monitors the Task Manager Poller and automatically recovers from failure #75420

Merged (7 commits) on Aug 20, 2020

Conversation

@gmmorris (Contributor) commented Aug 19, 2020

Summary

closes #74785

Introduces a monitor around the Task Manager poller which pipes through all values emitted by the poller and recovers from poller failures or stalls.
This monitor does the following:

  1. Catches errors thrown by the poller, recovers by proxying the error to a handler, and continues listening to the poller.
  2. Reacts to the poller's `error` (caused by uncaught errors) and `completion` events by starting a new poller and piping its events through to any previous subscribers (in our case, Task Manager itself).
  3. Tracks the rate at which the poller emits events (these can be both work events and `No Task` events, so polling and finding no work still counts as an emission), times out when the gap between events grows too long (suggesting the poller has hung), and replaces the poller with a new one.

We're not aware of any clear case where the monitor should actually need to restart the poller - restarting is strictly an error scenario, and all known cases have been addressed.
The goal of introducing this monitor is to act as an insurance policy in case an unexpected error breaks the poller in a long-running production environment.
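
To make the behaviour concrete, here is a minimal sketch of how such a monitor could be wired up with RxJS (assuming RxJS 7.2+). It is an approximation of the approach described above rather than the actual implementation, and the names `monitorPoller` and `MonitorOptions` are hypothetical:

```ts
import { Observable, defer, throwError, catchError, repeat, retry, timeout } from 'rxjs';

interface MonitorOptions {
  // Milliseconds allowed between emissions before the poller is considered hung; 0 disables the check.
  inactivityTimeout: number;
  // Handler that poller errors are proxied to before the poller is replaced.
  onError: (error: Error) => void;
}

// Wraps a poller factory so that thrown errors, unexpected completion, and
// inactivity all result in a fresh poller being created, while subscribers
// keep receiving events from the same outer Observable.
function monitorPoller<T>(
  pollerFactory: () => Observable<T>,
  { inactivityTimeout, onError }: MonitorOptions
): Observable<T> {
  return defer(() => {
    const poller$ = pollerFactory();
    // Only watch for inactivity when a positive interval is configured.
    return inactivityTimeout > 0 ? poller$.pipe(timeout(inactivityTimeout)) : poller$;
  }).pipe(
    // Proxy the failure to the handler, then rethrow so retry() below
    // replaces the broken poller with a new one.
    catchError((error: Error) => {
      onError(error);
      return throwError(() => error);
    }),
    retry(), // resubscribe (creating a new poller via defer) after errors, including inactivity timeouts
    repeat() // resubscribe if the poller completes unexpectedly
  );
}
```

A subscriber would then call something like `monitorPoller(() => pollForWork(), { inactivityTimeout: 60_000, onError: (e) => logger.error(e.message) }).subscribe(handleEvent)` (where `pollForWork`, `logger`, and `handleEvent` are placeholders) and keep receiving events across poller restarts.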

Checklist

Delete any items that are not applicable to this PR.

For maintainers

* master: (112 commits)
  [Ingest Manager] Fix agent config rollout rate limit to use constants (elastic#75364)
  Update Node.js to version 10.22.0 (elastic#75254)
  [ML] Anomaly Explorer / Single Metric Viewer: Fix error reporting for annotations. (elastic#74953)
  [Discover] Fix histogram cloud tests (elastic#75268)
  Uiactions to navigate to visualize or maps (elastic#74121)
  Use prefix search invis editor field/agg combo box (elastic#75290)
  Fix docs in trigger alerting UI (elastic#75363)
  [SIEM] Fixes search bar Cypress test (elastic#74833)
  Add libnss3.so to Dockerfile template (reporting) (elastic#75370)
  [Discover] Create field_button and add popovers to sidebar (elastic#73226)
  [Reporting] Network Policy: Do not throw from the intercept handler (elastic#75105)
  [Reporting] Increase capture.timeouts.openUrl to 1 minute (elastic#75207)
  Allow routes to specify the idle socket timeout in addition to the payload timeout (elastic#73730)
  [src/dev/build] remove node-version from snapshots (elastic#75303)
  [ENDPOINT] Reintroduced tabs to endpoint management and migrated pages to use common security components (elastic#74886)
  [Canvas] Remove dependency on legacy expressions APIs (elastic#74885)
  Skip failing test in CI (elastic#75266)
  [Task Manager] time out work when it overruns in poller (elastic#74980)
  [Drilldowns] misc improvements & fixes (elastic#75276)
  Small README note on bumping memory for builds (elastic#75247)
  ...
@gmmorris gmmorris added Feature:Task Manager release_note:enhancement Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.10.0 v8.0.0 labels Aug 19, 2020
@gmmorris gmmorris marked this pull request as ready for review August 19, 2020 12:31
@gmmorris gmmorris requested a review from a team as a code owner August 19, 2020 12:31
@elasticmachine (Contributor) commented:

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member) left a comment:

LGTM, added a nit comment about inactivityTimeout === 0


// by default don't monitor inactivity as not all observables are expected
// to emit at any kind of fixed interval
const DEFAULT_INACTIVITY_TIMEOUT = 0;
@pmuellr (Member) commented on this snippet:

It doesn't look like we would ever see the inactivityTimeout === 0 in task manager (or maybe I missed it), so this seems like a generalization we don't really need. OTOH, to get rid of the === 0 checks for it, we'd have to validate that it IS > 0, so ... and there doesn't seem to be much overhead in leaving it the way it is.

@gmmorris (Contributor, author):

Haha yeah, I had exactly the same thought process.
I'll keep it around for now as it made sense from a "utility" perspective and the code read a bit better that way.
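
For context on the trade-off discussed in this thread, here is a hypothetical sketch (not the Kibana code) of the two options: keep the `0` sentinel and branch on it, or drop it and validate that the value is positive.

```ts
import { Observable, timeout } from 'rxjs';

// Option kept in this PR: treat 0 as "don't monitor inactivity" and branch on it.
function withOptionalInactivityCheck<T>(source$: Observable<T>, inactivityTimeout: number): Observable<T> {
  return inactivityTimeout === 0 ? source$ : source$.pipe(timeout(inactivityTimeout));
}

// Alternative weighed above: drop the sentinel and require a positive value.
function withRequiredInactivityCheck<T>(source$: Observable<T>, inactivityTimeout: number): Observable<T> {
  if (inactivityTimeout <= 0) {
    throw new Error('inactivityTimeout must be greater than 0');
  }
  return source$.pipe(timeout(inactivityTimeout));
}
```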

@mikecote mikecote self-requested a review August 20, 2020 13:26
@mikecote (Contributor) left a comment:

Changes LGTM!

x-pack/plugins/task_manager/server/task_manager.ts (review comment resolved; outdated)
@kibanamachine (Contributor) commented:

💚 Build Succeeded

Build metrics

✅ unchanged


To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@gmmorris gmmorris merged commit 5308cc7 into elastic:master Aug 20, 2020
gmmorris added a commit to gmmorris/kibana that referenced this pull request Aug 21, 2020
…overs from failure (elastic#75420)

gmmorris added a commit that referenced this pull request Aug 21, 2020
…overs from failure (#75420) (#75626)

thomasneirynck pushed a commit to thomasneirynck/kibana that referenced this pull request Aug 21, 2020
…overs from failure (elastic#75420)


Successfully merging this pull request may close these issues.

[Alerting] Task Manager doesn't automatically recover if polling fails