
[Task Manager] we don't have sufficient observability into Task Manager's runtime operations #77456

Closed
8 of 9 tasks
gmmorris opened this issue Sep 15, 2020 · 11 comments · Fixed by #77868

@gmmorris
Contributor

gmmorris commented Sep 15, 2020

We don't have enough observability into the behavior of Task Manager to properly investigate issues when an SDH is opened.

In order to have confidence going GA, we need a way of ascertaining the following about a deployment:

  • Basic configuration of the current deployment (polling, workers, etc.)

Workload of Task Manager:

  • how many tasks are scheduled in the system
  • of what type
  • execution density in a timeframe (maybe the next hour?), including overdue tasks that should have run by now

Runtime statistics, including:

  • task drift (how long after its scheduled runAt a task actually ran), as mean/median
  • polling cycle results
  • task execution failure rates

Most of the above stats could be broken down by Task Type, and perhaps also by space 🤔

We want to make access to this information easy, but we don't want it to be too noisy either.
It's simplest to begin with the server log in 'debug' mode as a logging target at a fixed cadence.
We will also add an HTTP route that can be curled and will stream stats out at a higher cadence. This endpoint will enable external monitoring of Task Manager as a whole, and possibly more granular stats.
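To make the two consumption paths concrete, here's a minimal sketch; the stats shape, the `collectStats()` helper, the logger stand-in, and the cadences are all hypothetical rather than Task Manager's actual implementation.

```ts
// Illustrative only: the real stats shape and route registration live in the
// Task Manager plugin; this just sketches the two proposed consumption paths.
interface TaskManagerStats {
  configuration: { pollInterval: number; maxWorkers: number };
  workload: {
    totalScheduled: number;
    byTaskType: Record<string, number>;
    overdue: number;
    densityNextHour: number[]; // scheduled executions bucketed across the coming hour
  };
  runtime: {
    driftMs: { mean: number; median: number };
    lastSuccessfulPoll: string;
  };
}

declare function collectStats(): TaskManagerStats;   // hypothetical stats collector
declare const logger: { debug(msg: string): void };  // stand-in for the server logger

// 1. Log to the server log in 'debug' mode at a fixed cadence.
const LOG_CADENCE_MS = 60_000; // assumed cadence
setInterval(() => logger.debug(JSON.stringify(collectStats())), LOG_CADENCE_MS);

// 2. Serve the same payload from an HTTP route that external monitoring can curl
//    at a higher cadence (route registration details omitted).
export async function statsHandler(): Promise<{ status: number; body: TaskManagerStats }> {
  return { status: 200, body: collectStats() };
}
```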

@gmmorris gmmorris added the Feature:Task Manager and Team:ResponseOps labels Sep 15, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr
Member

pmuellr commented Sep 15, 2020

Note: I started on "event logifying" Task Manager in PR #75152. The results seemed interesting; I'm guessing we'd need to add some more events, and perhaps some kind of hourly summary event would be useful as well.
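As an illustration of what such a summary event might contain, here's a rough sketch that rolls per-execution events up by task type; the event shape and the `writeEvent()` helper are hypothetical, not the event_log plugin's actual API.

```ts
// Hypothetical sketch: roll an hour's worth of per-execution events up into one summary event.
interface TaskRunEvent {
  taskType: string;
  outcome: 'success' | 'failure';
  durationMs: number;
}

declare function writeEvent(action: string, payload: unknown): void; // stand-in for the event log writer

export function writeHourlySummary(events: TaskRunEvent[]): void {
  const byType: Record<string, { runs: number; failures: number; totalMs: number }> = {};
  for (const e of events) {
    const bucket = (byType[e.taskType] ??= { runs: 0, failures: 0, totalMs: 0 });
    bucket.runs += 1;
    bucket.totalMs += e.durationMs;
    if (e.outcome === 'failure') bucket.failures += 1;
  }
  writeEvent('task-manager:hourly-summary', byType);
}
```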

@gmmorris
Contributor Author

I'm still trying to come up with a definition of "system is unhealthy" for the monitoring endpoint, as I'm thinking it should return 200 when it's healthy and 500 when it's unhealthy (that's the easiest thing for an external monitoring system to latch on to).

This is tricky because some systems might have a relatively empty Task Manager, where execution frequency isn't a valid metric, but in others, if no task has executed in the past 30s, you probably want to be alerted.

Here are some thoughts for when we should return 500:

  • No task polling cycle has completed in the past "poll interval + buffer"
  • mean/median task drift has exceeded a predefined threshold
  • task failure events have exceeded a predefined threshold

I have a few more ideas, but I'm still assessing them for viability... just wanted to jot this down 🤔 (see the sketch below).
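Here's a rough sketch of how those rules might map to a status code; the thresholds, buffer, and input shape are assumptions, not values from the eventual implementation.

```ts
// Illustrative only: maps the proposed "unhealthy" rules to a 200/500 response code.
interface HealthInputs {
  lastPollCompletedAt: number; // epoch ms of the last completed polling cycle
  pollIntervalMs: number;
  meanDriftMs: number;
  medianDriftMs: number;
  recentTaskFailures: number;  // failure events within the sampling window
}

const POLL_BUFFER_MS = 5_000;      // assumed buffer
const DRIFT_THRESHOLD_MS = 60_000; // assumed threshold
const FAILURE_THRESHOLD = 10;      // assumed threshold

export function healthStatusCode(now: number, h: HealthInputs): 200 | 500 {
  const pollingStalled = now - h.lastPollCompletedAt > h.pollIntervalMs + POLL_BUFFER_MS;
  const driftTooHigh = h.meanDriftMs > DRIFT_THRESHOLD_MS || h.medianDriftMs > DRIFT_THRESHOLD_MS;
  const tooManyFailures = h.recentTaskFailures > FAILURE_THRESHOLD;
  return pollingStalled || driftTooHigh || tooManyFailures ? 500 : 200;
}
```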

@pmuellr
Member

pmuellr commented Sep 28, 2020

I think 500 is also the status for internal server error, so we wouldn't be able to distinguish between an error in the health check and a negative result from the health check. I'm not sure what other "health" services typically return in this case. I'd think we'd want to avoid any 50x's though, as there are other 50x's that get returned for proxy/gateway errors, including from our cloud deployments.

@pmuellr
Member

pmuellr commented Sep 28, 2020

It would be nice to include some kind of indication of whether the queue is getting overloaded. I'm not sure what the measurement would be: basically, the number of runnable tasks whose runAt is earlier than the current time is the signal. Perhaps if we can measure that and it keeps going up over time, that's the indication. Or maybe get the oldest task that should have run "by now", and if it's more than X minutes old, that's the indication.
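One way to get at this would be to count claimable tasks whose runAt is already in the past. The sketch below uses the v7 Elasticsearch JS client; the index name and field paths are assumptions rather than values taken from the Task Manager source.

```ts
// Illustrative only: count tasks that should already have run. Tracking this number
// over time (or the age of the oldest overdue task) would signal a backed-up queue.
import { Client } from '@elastic/elasticsearch';

export async function countOverdueTasks(es: Client): Promise<number> {
  const response = await es.count({
    index: '.kibana_task_manager', // assumed index name
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'idle' } },         // assumed field/value
            { range: { 'task.runAt': { lte: 'now' } } }, // scheduled in the past
          ],
        },
      },
    },
  });
  return response.body.count;
}
```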

@gmmorris
Contributor Author

I'm not sure what other "health" services typically return in this case

That's actually been my past experience, which is why I went with that 🤷
I'm open to suggestions, but in general the idea is - if the health check is returning 500, you need to investigate why your system is unhealthy.

Usually the system and the health endpoint are one and the same and owned by the same people... it's worth considering that this is less true for us.

@gmmorris
Contributor Author

The number of runnable tasks whose runAt is earlier than the current time is the signal. Perhaps if we can measure that and it keeps going up over time, that's the indication. Or maybe get the oldest task that should have run "by now", and if it's more than X minutes old, that's the indication.

Broadly, that's what you'll get from the drift metric (how long after the task's scheduled runAt the task actually ran, with both mean and median of this) and the workload (which includes a broad status by taskType, but I'm also including some time-related stats, such as overdue count and execution density in the coming hour [a configurable time frame, defaulting to an hour]).
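For reference, a minimal sketch of the drift calculation described above; the sample shape is illustrative.

```ts
// Drift = how long after a task's scheduled runAt it actually started running.
export function driftStats(samples: Array<{ scheduledRunAt: number; startedAt: number }>) {
  if (samples.length === 0) {
    return { meanMs: 0, medianMs: 0 };
  }
  const drifts = samples.map((s) => s.startedAt - s.scheduledRunAt).sort((a, b) => a - b);
  const meanMs = drifts.reduce((sum, d) => sum + d, 0) / drifts.length;
  const mid = Math.floor(drifts.length / 2);
  const medianMs = drifts.length % 2 ? drifts[mid] : (drifts[mid - 1] + drifts[mid]) / 2;
  return { meanMs, medianMs };
}
```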

@gmmorris
Contributor Author

Looking at how we implement the Elasticsearch health endpoint, it seems we always return 200 if we can, even when the service is RED.
With that in mind, I think we'll stick to what the Stack already does and add R/Y/G indicators in our endpoint as well.
Clients will then have to write some code on their end to differentiate.

I looked at the Nagios Elasticsearch module, and even though it does check for 50Xs, it also checks in detail for the R/Y/G indicators, so if it's good enough for Nagios 😆
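Roughly, an external monitor would then do something like the sketch below against an endpoint that always answers 200 with an R/Y/G status in the body; the endpoint path and status values here are assumptions.

```ts
// Illustrative only: interpret an always-200 health endpoint the way Nagios-style
// checks interpret the Elasticsearch cluster health colours.
export async function checkTaskManagerHealth(kibanaUrl: string): Promise<'OK' | 'WARN' | 'ERROR'> {
  const res = await fetch(`${kibanaUrl}/api/task_manager/_health`); // assumed path
  if (!res.ok) {
    // A 50x here most likely means Kibana or a proxy/gateway in front of it is down,
    // not that the health check itself evaluated to "unhealthy".
    return 'ERROR';
  }
  const { status } = (await res.json()) as { status: 'green' | 'yellow' | 'red' };
  return status === 'green' ? 'OK' : status === 'yellow' ? 'WARN' : 'ERROR';
}
```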

@gmmorris gmmorris reopened this Sep 28, 2020
@gmmorris gmmorris mentioned this issue Oct 1, 2020
@gmmorris
Contributor Author

gmmorris commented Oct 5, 2020

OK, I've updated the description of the PR but copying over here too:

There are still some open questions from my perspective:

  1. When should we return an error status? Currently it will only go red if polling fails, not if we see tasks failing.
  2. When should we return a warning status? Perhaps if we see that a certain Task Type is erroring at a certain rate while the others are fine?
  3. Granularity. It occurs to me that different customers will want to monitor different things, such as wanting to know when alerting:siem tasks are failing, but not caring if alerting:index_threshold tasks are failing. They could fetch the stats, dig into them themselves, and decide on their own. But I have this idea that we could provide this by attaching a status to specific metrics. For example, you could have OK on the stats overall, but warning on a specific Task Type that's failing at a higher rate than some configured threshold. We could even provide a sub API endpoint where you can filter down to the stats you care about, and we would then infer the overall status from the ones you've filtered down to. But this feels like it could be a follow-up PR and isn't needed for GA. (A hypothetical shape for this is sketched below.)

UPDATE: Removed the huge description I had here as it's now in a README in the code:

https://github.com/elastic/kibana/blob/273d58d22bdf30de0b05709cbdd174275bb8c40e/x-pack/plugins/task_manager/server/MONITORING.md
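On the granularity point (3) above, a hypothetical shape for attaching a status both overall and per task type could look like the following; this is not the shape documented in MONITORING.md.

```ts
// Hypothetical shape only: an overall status derived from per-task-type sub-statuses.
type HealthStatus = 'OK' | 'warn' | 'error';

interface TaskTypeStats {
  status: HealthStatus;  // e.g. "warn" if this type's failure rate exceeds its threshold
  scheduled: number;
  failureRate: number;   // failures / executions over the sampling window
}

interface TaskManagerHealth {
  status: HealthStatus;                      // overall, inferred from the stats below
  byTaskType: Record<string, TaskTypeStats>; // e.g. "alerting:siem", "alerting:index_threshold"
}
```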

@pmuellr
Member

pmuellr commented Oct 13, 2020

  • When should we return an error status? ...
  • When should we return a warning status? ...
  • Granularity. ...

I would prefer that task manager only go red/yellow if its own system is having difficulties, and not reflect the status of the tasks themselves, because I'd like those task errors to be reported by the task owner. But that puts a burden on those task owners to do bookkeeping, provide status info via some endpoint, etc. Probably too much to ask.

The main reason not to co-mingle task manager internal status with concrete task type status is that we won't immediately know who is at fault if it goes red/yellow.

It seems like the comment re: granularity points to the answer - provide an additional sub-status per task type. I don't think an additional endpoint is required just to return overall status info; it seems like the current status shapes could accommodate this.

Or perhaps, to keep things a bit "cleaner", we'd have a status endpoint for TM "internal" status and a separate one that reported summary information on the task types, perhaps even returning a single task type's info via a path / query string param - as suggested in the referenced comment ^^^.

Also agree this isn't needed for GA. I'd like to get some kind of summary info per task type into the TM status, but I think we can defer deciding whether that summary info warrants labeling with some "error level" until we get more familiar with the data being returned.

@gmmorris
Contributor Author

Calculating the "down time" was becoming quite complex as different tasks end different times and it's quite hard to reliably and usefully provide a number for this that is actually actionable.

For now I've decided to drop this feature from the health output until we figure out how to make it useful.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022