[Task Manager] we don't have sufficient observability into Task Manager's runtime operations #77456
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Note: I started in on "event logifying" Task Manager in this PR - #75152 - and the results seemed interesting. I'm guessing we'd need to add some more events, and perhaps some kind of hourly summary event would be useful as well.
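For illustration only, here is a rough sketch of what such an hourly summary event might carry; the field names are hypothetical and not taken from #75152:

```ts
// Hypothetical shape for an hourly Task Manager summary written to the event
// log; all field names are illustrative, not taken from PR #75152.
interface TaskManagerHourlySummaryEvent {
  timestamp: string;          // end of the hour being summarized (ISO string)
  polls: number;              // how many times the task poller ran
  tasksClaimed: number;       // tasks picked up for execution
  tasksCompleted: number;     // tasks that finished successfully
  tasksFailed: number;        // tasks that threw or timed out
  byTaskType: Record<string, { completed: number; failed: number }>;
}
```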
I'm still trying to come up with a definition of "system is unhealthy" for the monitoring endpoint, as I'm thinking it should return 200 when it's healthy and 500 when it's unhealthy (that's the easiest thing for an external monitoring system to latch on to). This is tricky because some systems might have a relatively empty Task Manager, where execution frequency isn't a valid metric, but in others, if no task has executed in the past 30s, you probably want to be alerted. Here are some thoughts for when we should return 500:
I have a few more ideas, but I'm still assessing them for viability... just wanted to jot this down 🤔.
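As a rough illustration of the 200/500 idea above (not the actual implementation), the decision could take the workload into account so that an empty deployment isn't flagged; `HealthStats` and the 30s threshold are assumptions:

```ts
// Minimal sketch: decide the HTTP status for a health endpoint so external
// monitors can alert on any non-2xx response. Field names and the threshold
// are assumptions, not Task Manager's real internals.
interface HealthStats {
  lastPollAt: number;         // epoch ms of the most recent task poll
  scheduledTaskCount: number; // how many tasks exist in the index
}

function healthStatusCode(stats: HealthStats, now = Date.now()): 200 | 500 {
  const pollIsStale = now - stats.lastPollAt > 30_000;
  // Only treat a stale poll as unhealthy when there is actually work
  // scheduled, so a relatively empty Task Manager isn't flagged.
  const hasWork = stats.scheduledTaskCount > 0;
  return pollIsStale && hasWork ? 500 : 200;
}
```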
I think 500 is also the status for internal server error, so we wouldn't be able to distinguish between an error in the health check and a negative result from the health check. I'm not sure what other "health" services typically return in this case; I'd think we would want to avoid any 50x's though, as there are other 50x's that get returned for proxy/gateway errors - including from our cloud deployments.
It would be nice to include some kind of indication of whether the queue is getting overloaded. Not sure what the measurement would be: basically how many tasks that can run, whose …
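Purely as a hypothetical illustration of one possible measurement (the comment above deliberately leaves it open), an "overloaded queue" signal could compare overdue, claimable tasks against the capacity of a single poll cycle:

```ts
// Illustrative only: one candidate "queue pressure" measure. The field names
// and the interpretation are assumptions, not a defined Task Manager metric.
interface QueueSnapshot {
  overdueClaimableTasks: number; // runnable tasks already past their scheduled run time
  maxTasksPerPoll: number;       // how many tasks one polling cycle can claim
}

function queuePressure(s: QueueSnapshot): number {
  // A value > 1 means there is more overdue work than one poll can drain;
  // sustained growth would suggest the queue is falling behind.
  return s.overdueClaimableTasks / Math.max(s.maxTasksPerPoll, 1);
}
```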
That's actually been my past experience, which is why I went with that 🤷. Usually the system and the health endpoint are one and the same and owned by the same people... though it's worth considering that this is less true for us.
Broadly, that's what you'll get from the …
Looking at how we implement the Elasticsearch Health endpoint, it seems we always return 200 if we can, even when the service is RED. I looked at the Nagios Elasticsearch module, and even though it does check for 50Xs, it also checks in detail for the R/Y/G indicators, so if it’s good enough for Nagios 😆
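A minimal sketch of that Elasticsearch-style alternative: always answer 200 when the endpoint itself works, and put a red/yellow/green verdict in the body for monitors like Nagios to inspect. The field names here are illustrative only:

```ts
// Sketch of an "always 200, status in the body" health response; the shape is
// an assumption, not the actual Task Manager payload.
type HealthColor = 'green' | 'yellow' | 'red';

interface HealthBody {
  status: HealthColor;  // the actual health verdict lives in the body
  lastPollAt: string;   // ISO timestamp of the most recent poll
  details?: string;     // human-readable reason when not green
}

function healthResponse(body: HealthBody): { statusCode: 200; body: HealthBody } {
  // 5xx is reserved for genuine endpoint failures (and proxy/gateway errors),
  // so a RED service still returns 200 with status: 'red'.
  return { statusCode: 200, body };
}
```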
OK, I've updated the description of the PR, but I'm copying it over here too. There are still some open questions from my perspective:
UPDATE: I've removed the huge description I had here, as it's now in a README in the code.
I would prefer that Task Manager only go red/yellow if its own system is having difficulties, and not reflect the status of the tasks themselves, because I'd like those task errors to be reported by the task owner. But that puts a burden on task owners to do bookkeeping, provide status info via some endpoint, etc. Probably too much to ask.

The main reason not to co-mingle Task Manager's internal status with concrete task type status is that we won't immediately know who is at fault if it goes red/yellow. It seems like the comment re: granularity points to the answer - provide an additional sub-status per task type. I don't think an additional endpoint is required just to return overall status info; the current status shapes could accommodate this. Or perhaps, to keep things a bit "cleaner", we'd have a status endpoint for TM "internal" status and a separate one that reports summary information on the task types, perhaps even returning a single task type's info via a path / query string param - as suggested in the referenced comment ^^^.

Also agree this isn't needed for GA. I'd like to get some kind of summary info per task type into the TM status, but I think we can defer deciding whether that summary info warrants labeling with some "error level" until we get more familiar with the data being returned.
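To make that suggestion concrete, here is one hypothetical shape for a status payload that keeps Task Manager's internal health separate from per-task-type summaries, deliberately without an error level on the per-type data:

```ts
// Illustrative only: a status shape separating TM "internal" health from
// per-task-type summaries, as discussed above. Names are assumptions.
type Level = 'OK' | 'warn' | 'error';

interface TaskManagerStatus {
  // Task Manager's own health: polling, claiming, capacity.
  level: Level;
  summary: string;
  // Per-task-type summaries, reported as raw numbers without an error level
  // until we know how the data should be interpreted.
  taskTypes: Record<
    string,
    { scheduled: number; overdue: number; failures: number }
  >;
}
```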
Calculating the "down time" was becoming quite complex, as different tasks end at different times, and it's quite hard to reliably provide a number for this that is actually actionable. For now, I've decided to drop this feature from the health output until we figure out how to make it useful.
We don't have enough observability into the behavior of Task Manager to properly investigate issues when an SDH is opened.
In order to have confidence in going GA we need a way of ascertaining the following about a deployment:
- Workload of Task Manager:
  - overdue tasks that should have run by now
- Runtime statistics, including:
  - How much "dead time" is Task Manager experiencing? (Dropped: #77456 (comment))

Most of the above stats could be broken down by Task Type, and perhaps also by space 🤔
We want to make access to this information easy, but we don't want it to be too noisy either.
It's simplest to begin by logging these stats to the server log at 'debug' level, at a fixed cadence.
We will also add an HTTP route that can be curled and will stream stats out at a higher cadence. This endpoint will enable external monitoring of Task Manager as a whole, and possibly expose more granular stats.
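A sketch of how both channels might be wired (assumed names, not the actual plugin code): the same stats collector feeds fixed-cadence debug logging and an HTTP handler that a route such as `/api/task_manager/_health` could delegate to:

```ts
// Assumed shapes and wiring for illustration only.
interface MonitoringStats {
  workload: Record<string, { scheduled: number; overdue: number }>;
  runtime: { lastPollAt: string };
}

function startMonitoring(
  logger: { debug(msg: string): void },
  collect: () => MonitoringStats,
  intervalMs = 60_000
) {
  // Fixed-cadence debug logging of the latest stats snapshot.
  const timer = setInterval(
    () => logger.debug(`Task Manager stats: ${JSON.stringify(collect())}`),
    intervalMs
  );
  // Handler a monitoring route could delegate to, e.g.
  //   curl -s http://localhost:5601/api/task_manager/_health
  const handleRequest = () => ({ statusCode: 200, body: collect() });
  return { stop: () => clearInterval(timer), handleRequest };
}
```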