[Task Manager] we don't have sufficient observability into Task Manager's runtime operations #77456
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Note: I started in on "event logifying" Task Manager in this PR - #75152 - and the results seemed interesting. I'm guessing we'd need to add some more events, and perhaps some kind of hourly summary event would be useful as well.
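For illustration only, here is a rough sketch of what such an hourly summary event might carry; the field names are hypothetical and not taken from #75152:

```ts
// Hypothetical shape for an hourly Task Manager summary written to the event
// log; all field names are illustrative, not taken from PR #75152.
interface TaskManagerHourlySummaryEvent {
  timestamp: string;          // end of the hour being summarized (ISO string)
  polls: number;              // how many times the task poller ran
  tasksClaimed: number;       // tasks picked up for execution
  tasksCompleted: number;     // tasks that finished successfully
  tasksFailed: number;        // tasks that threw or timed out
  byTaskType: Record<string, { completed: number; failed: number }>;
}
```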
I'm still trying to come up with a definition of "system is unhealthy" for the monitoring endpoint, as I'm thinking it should return 200 when it's healthy and 500 when it's unhealthy (that's the easiest thing for an external monitoring system to latch on to). This is tricky because some systems might have a relatively empty Task Manager, where execution frequency isn't a valid metric, but in others, if no task has executed in the past 30s, you probably want to be alerted. Here are some thoughts for when we should return 500:
I have a few more ideas, but I'm still assessing them for viability... just wanted to jot this down 🤔.
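As a rough illustration of the 200/500 idea above (not the actual implementation), the decision could take the workload into account so that an empty deployment isn't flagged; `HealthStats` and the 30s threshold are assumptions:

```ts
// Minimal sketch: decide the HTTP status for a health endpoint so external
// monitors can alert on any non-2xx response. Field names and the threshold
// are assumptions, not Task Manager's real internals.
interface HealthStats {
  lastPollAt: number;         // epoch ms of the most recent task poll
  scheduledTaskCount: number; // how many tasks exist in the index
}

function healthStatusCode(stats: HealthStats, now = Date.now()): 200 | 500 {
  const pollIsStale = now - stats.lastPollAt > 30_000;
  // Only treat a stale poll as unhealthy when there is actually work
  // scheduled, so a relatively empty Task Manager isn't flagged.
  const hasWork = stats.scheduledTaskCount > 0;
  return pollIsStale && hasWork ? 500 : 200;
}
```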
I think 500 is also the status for internal server error, so we wouldn't be able to distinguish between an error in the health check and a negative result from the health check. I'm not sure what other "health" services typically return in this case; I'd think we would want to avoid any 50x's though, as there are other 50x's that get returned for proxy/gateway errors - including from our cloud deployments.
It would be nice to include some kind of indication of whether the queue is getting overloaded. Not sure what the measurement would be: basically how many tasks that can run, whose …
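Purely as a hypothetical illustration of one possible measurement (the comment above deliberately leaves it open), an "overloaded queue" signal could compare overdue, claimable tasks against the capacity of a single poll cycle:

```ts
// Illustrative only: one candidate "queue pressure" measure. The field names
// and the interpretation are assumptions, not a defined Task Manager metric.
interface QueueSnapshot {
  overdueClaimableTasks: number; // runnable tasks already past their scheduled run time
  maxTasksPerPoll: number;       // how many tasks one polling cycle can claim
}

function queuePressure(s: QueueSnapshot): number {
  // A value > 1 means there is more overdue work than one poll can drain;
  // sustained growth would suggest the queue is falling behind.
  return s.overdueClaimableTasks / Math.max(s.maxTasksPerPoll, 1);
}
```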
That's actually been my past experience, which is why I went with that 🤷. Usually the system and the health endpoint are one and the same and owned by the same people... though it's worth considering that this is less true for us.
Broadly, that's what you'll get from the …
Looking at how we implement the Elasticsearch Health endpoint, it seems we always return 200 if we can, even when the service is RED. I looked at the Nagios Elasticsearch module, and even though it does check for 50Xs, it also checks in detail for the R/Y/G indicators, so if it’s good enough for Nagios 😆
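A minimal sketch of that Elasticsearch-style alternative: always answer 200 when the endpoint itself works, and put a red/yellow/green verdict in the body for monitors like Nagios to inspect. The field names here are illustrative only:

```ts
// Sketch of an "always 200, status in the body" health response; the shape is
// an assumption, not the actual Task Manager payload.
type HealthColor = 'green' | 'yellow' | 'red';

interface HealthBody {
  status: HealthColor;  // the actual health verdict lives in the body
  lastPollAt: string;   // ISO timestamp of the most recent poll
  details?: string;     // human-readable reason when not green
}

function healthResponse(body: HealthBody): { statusCode: 200; body: HealthBody } {
  // 5xx is reserved for genuine endpoint failures (and proxy/gateway errors),
  // so a RED service still returns 200 with status: 'red'.
  return { statusCode: 200, body };
}
```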
OK, I've updated the description of the PR, but I'm copying it over here too. There are still some open questions from my perspective:
UPDATE: I've removed the huge description I had here, as it's now in a README in the code.
I would prefer that Task Manager only go red/yellow if its own system is having difficulties, and not reflect the status of the tasks themselves, because I'd like those task errors to be reported by the task owner. But that puts a burden on task owners to do bookkeeping, provide status info via some endpoint, etc. Probably too much to ask.

The main reason not to co-mingle Task Manager's internal status with concrete task type status is that we won't immediately know who is at fault if it goes red/yellow. It seems like the comment re: granularity points to the answer - provide an additional sub-status per task type. I don't think an additional endpoint is required just to return overall status info; the current status shapes could accommodate this. Or perhaps, to keep things a bit "cleaner", we'd have a status endpoint for TM "internal" status and a separate one that reports summary information on the task types, perhaps even returning a single task type's info via a path / query string param - as suggested in the referenced comment ^^^.

Also agree this isn't needed for GA. I'd like to get some kind of summary info per task type into the TM status, but I think we can defer deciding whether that summary info warrants labeling with some "error level" until we get more familiar with the data being returned.
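To make that suggestion concrete, here is one hypothetical shape for a status payload that keeps Task Manager's internal health separate from per-task-type summaries, deliberately without an error level on the per-type data:

```ts
// Illustrative only: a status shape separating TM "internal" health from
// per-task-type summaries, as discussed above. Names are assumptions.
type Level = 'OK' | 'warn' | 'error';

interface TaskManagerStatus {
  // Task Manager's own health: polling, claiming, capacity.
  level: Level;
  summary: string;
  // Per-task-type summaries, reported as raw numbers without an error level
  // until we know how the data should be interpreted.
  taskTypes: Record<
    string,
    { scheduled: number; overdue: number; failures: number }
  >;
}
```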
Calculating the "down time" was becoming quite complex, as different tasks end at different times, and it's quite hard to reliably provide a number for this that is actually actionable. For now, I've decided to drop this feature from the health output until we figure out how to make it useful.
We don't have enough observability into the behavior of Task Manager to properly investigate issues when an SDH is opened.
In order to have confidence in going GA we need a way of ascertaining the following about a deployment:
- Workload of Task Manager:
  - overdue tasks that should have run by now
- Runtime statistics, including:
  - How much "dead time" is Task Manager experiencing? (Dropped: #77456 (comment))

Most of the above stats could be broken down by Task Type, and perhaps also by space 🤔
We want to make access to this information easy, but we don't want it to be too noisy either.
It's simplest to begin by logging these stats to the server log at 'debug' level, at a fixed cadence.
We will also add an HTTP route that can be curled and will stream stats out at a higher cadence. This endpoint will enable external monitoring of Task Manager as a whole, and possibly expose more granular stats.
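A sketch of how both channels might be wired (assumed names, not the actual plugin code): the same stats collector feeds fixed-cadence debug logging and an HTTP handler that a route such as `/api/task_manager/_health` could delegate to:

```ts
// Assumed shapes and wiring for illustration only.
interface MonitoringStats {
  workload: Record<string, { scheduled: number; overdue: number }>;
  runtime: { lastPollAt: string };
}

function startMonitoring(
  logger: { debug(msg: string): void },
  collect: () => MonitoringStats,
  intervalMs = 60_000
) {
  // Fixed-cadence debug logging of the latest stats snapshot.
  const timer = setInterval(
    () => logger.debug(`Task Manager stats: ${JSON.stringify(collect())}`),
    intervalMs
  );
  // Handler a monitoring route could delegate to, e.g.
  //   curl -s http://localhost:5601/api/task_manager/_health
  const handleRequest = () => ({ statusCode: 200, body: collect() });
  return { stop: () => clearInterval(timer), handleRequest };
}
```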