[Task Manager] Monitors the Task Manager Poller and automatically recovers from failure #75420
Conversation
* master: (112 commits)
  * [Ingest Manager] Fix agent config rollout rate limit to use constants (elastic#75364)
  * Update Node.js to version 10.22.0 (elastic#75254)
  * [ML] Anomaly Explorer / Single Metric Viewer: Fix error reporting for annotations. (elastic#74953)
  * [Discover] Fix histogram cloud tests (elastic#75268)
  * Uiactions to navigate to visualize or maps (elastic#74121)
  * Use prefix search in vis editor field/agg combo box (elastic#75290)
  * Fix docs in trigger alerting UI (elastic#75363)
  * [SIEM] Fixes search bar Cypress test (elastic#74833)
  * Add libnss3.so to Dockerfile template (reporting) (elastic#75370)
  * [Discover] Create field_button and add popovers to sidebar (elastic#73226)
  * [Reporting] Network Policy: Do not throw from the intercept handler (elastic#75105)
  * [Reporting] Increase capture.timeouts.openUrl to 1 minute (elastic#75207)
  * Allow routes to specify the idle socket timeout in addition to the payload timeout (elastic#73730)
  * [src/dev/build] remove node-version from snapshots (elastic#75303)
  * [ENDPOINT] Reintroduced tabs to endpoint management and migrated pages to use common security components (elastic#74886)
  * [Canvas] Remove dependency on legacy expressions APIs (elastic#74885)
  * Skip failing test in CI (elastic#75266)
  * [Task Manager] time out work when it overruns in poller (elastic#74980)
  * [Drilldowns] misc improvements & fixes (elastic#75276)
  * Small README note on bumping memory for builds (elastic#75247)
  * ...
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
LGTM, added a nit comment about `inactivityTimeout === 0`
```ts
// by default don't monitor inactivity as not all observables are expected
// to emit at any kind of fixed interval
const DEFAULT_INACTIVITY_TIMEOUT = 0;
```
It doesn't look like we would ever see `inactivityTimeout === 0` in task manager (or maybe I missed it), so this seems like a generalization we don't really need. OTOH, to get rid of the `=== 0` checks for it, we'd have to validate that it IS > 0, so ... and there doesn't seem to be much overhead in leaving it the way it is.
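For illustration only, here is a minimal sketch of what the `=== 0` guard being discussed could look like, assuming an RxJS-based helper; this is not the actual Kibana code and the `monitorInactivity` name is hypothetical:

```ts
import { Observable } from 'rxjs';
import { timeout } from 'rxjs/operators';

// Hypothetical helper: applies an inactivity timeout to a stream, where a
// timeout of 0 (the DEFAULT_INACTIVITY_TIMEOUT above) means "don't monitor
// inactivity" and the stream is passed through untouched.
function monitorInactivity<T>(
  source$: Observable<T>,
  inactivityTimeout: number
): Observable<T> {
  return inactivityTimeout === 0
    ? source$
    : source$.pipe(timeout(inactivityTimeout));
}
```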
Haha yeah, I had exactly the same thought process.
I'll keep it around for now as it made sense from a "utility" perspective and the code read a bit better that way.
Changes LGTM!
💚 Build Succeeded
Build metrics
History
To update your PR or re-run it, just comment with:
…overs from failure (elastic#75420)

Introduces a monitor around the Task Manager poller which pipes through all values emitted by the poller and recovers from poller failures or stalls. This monitor does the following:
1. Catches errors thrown by the poller and recovers by proxying the error to a handler and continuing to listen to the poller.
2. Reacts to the poller `error` (caused by uncaught errors) and `completion` events by starting a new poller and piping its events through to any previous subscribers (in our case, Task Manager itself).
3. Tracks the rate at which the poller emits events (these can be both work events and `No Task` events, so polling and finding no work still counts as an emitted event) and times out when this rate gets too long (suggesting the poller has hung), replacing the poller with a new one.

We're not aware of any clear cases where Task Manager should actually get restarted by the monitor - this is definitely an error case and we have addressed all known cases. The goal of introducing this monitor is to act as an insurance policy in case an unexpected error case breaks the poller in a long-running production environment.
…overs from failure (#75420) (#75626)

Introduces a monitor around the Task Manager poller which pipes through all values emitted by the poller and recovers from poller failures or stalls. This monitor does the following:
1. Catches errors thrown by the poller and recovers by proxying the error to a handler and continuing to listen to the poller.
2. Reacts to the poller `error` (caused by uncaught errors) and `completion` events by starting a new poller and piping its events through to any previous subscribers (in our case, Task Manager itself).
3. Tracks the rate at which the poller emits events (these can be both work events and `No Task` events, so polling and finding no work still counts as an emitted event) and times out when this rate gets too long (suggesting the poller has hung), replacing the poller with a new one.

We're not aware of any clear cases where Task Manager should actually get restarted by the monitor - this is definitely an error case and we have addressed all known cases. The goal of introducing this monitor is to act as an insurance policy in case an unexpected error case breaks the poller in a long-running production environment.
Summary
closes #74785
Introduces a monitor around the Task Manager poller which pipes through all values emitted by the poller and recovers from poller failures or stalls.

This monitor does the following:

1. Catches errors thrown by the poller and recovers by proxying the error to a handler and continuing to listen to the poller.
2. Reacts to the poller `error` (caused by uncaught errors) and `completion` events by starting a new poller and piping its events through to any previous subscribers (in our case, Task Manager itself).
3. Tracks the rate at which the poller emits events (these can be both work events and `No Task` events, so polling and finding no work still counts as an emitted event) and times out when this rate gets too long (suggesting the poller has hung), replacing the poller with a new one.

We're not aware of any clear cases where Task Manager should actually get restarted by the monitor - this is definitely an error case and we have addressed all known cases.

The goal of introducing this monitor is to act as an insurance policy in case an unexpected error case breaks the poller in a long-running production environment.
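To make that behaviour more concrete, here is a rough, simplified sketch of such a monitor, assuming an RxJS-based poller (the diff comments above refer to observables); the names `monitorPoller`, `createPoller`, and `onError` are illustrative and not the actual Task Manager API:

```ts
import { Observable, defer, throwError } from 'rxjs';
import { catchError, timeout, retry, repeat } from 'rxjs/operators';

interface MonitorOptions<T> {
  // Factory, so that every (re)subscription creates a brand new poller.
  createPoller: () => Observable<T>;
  // Handler the monitor proxies poller errors to before restarting.
  onError: (error: Error) => void;
  // Longest silence (ms) tolerated before the poller is considered hung.
  inactivityTimeout: number;
}

// Illustrative monitor: pipes the poller's values through to subscribers and
// replaces the poller when it errors, completes, or stops emitting.
function monitorPoller<T>({
  createPoller,
  onError,
  inactivityTimeout,
}: MonitorOptions<T>): Observable<T> {
  return defer(() => createPoller()).pipe(
    // Treat a long silence as a hung poller and fail the stream.
    timeout(inactivityTimeout),
    // Proxy the failure to a handler, then rethrow so retry() kicks in.
    catchError((error) => {
      onError(error);
      return throwError(error);
    }),
    // On error, resubscribe - defer() builds a fresh poller each time.
    retry(),
    // On completion, do the same.
    repeat()
  );
}
```

The real plugin tracks the poller's emission rate rather than applying a plain timeout, but the restart-on-error/completion/stall shape is the same idea.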
Checklist
Delete any items that are not applicable to this PR.
For maintainers