[alerting] rule tasks that fail 3 times are never run again, with no indication in the rule #116321
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Here's a PR for 7.16 regarding actions, not rules, which is very tangentially related - #109655. In that one we are "cleaning up" these failed action tasks. You could imagine something similar for rules, but it would be kinda sad since we'd have to do it in a cleanup task on a schedule, vs getting immediate feedback - a task that looks for these failed rule tasks and presumably sets the last execution status to an error condition with, hopefully, a relevant message. It could also disable / re-enable the rule, if it was marked enabled. But perhaps it should just be disabled; otherwise, if whatever error is being hit keeps getting hit, it would essentially ignore the maxAttempts configuration altogether.
Upon further research, the user has 12 rules in a similar state; the last execution date recorded in each rule (presumably the last successful task run) was on 9/12 - 9/13, so it seems like there was some kind of systemic problem around that time ...
At this stage it isn't clear what led to this - whoever picks this up should use it as a research issue to identify how it might have happened and then bring it back to the team to propose work to remediate.
I'm thinking this may be timeout related and occur after timing out 3x in a row.
For non-recurring tasks, task manager makes 3 (or the configured maxAttempts) attempts and then marks the task as `failed` (see kibana/x-pack/plugins/task_manager/server/task_running/task_runner.test.ts, lines 1221 to 1254 in 237d68d).
For recurring tasks, task manager only marks the task as `failed` when the error thrown is unrecoverable (see kibana/x-pack/plugins/task_manager/server/task_running/task_runner.ts, lines 443 to 475 in 3742d46).
Separately, for all tasks, the number of attempts is incremented when the task is marked as running, before the task body actually executes.
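To summarize that behavior, here is a rough TypeScript paraphrase; the names (`afterTaskRunError`, `AfterFailure`) are made up for illustration and this is not task manager's actual API:

```ts
// Rough paraphrase of what happens to a task after its run throws,
// based on the behavior described in the comments above.
type AfterFailure =
  | { action: 'retry'; runAt: Date }        // non-recurring task, attempts remain
  | { action: 'reschedule'; runAt: Date }   // recurring task, run again next interval
  | { action: 'markFailed' };               // never picked up again

function afterTaskRunError(opts: {
  isRecurring: boolean;      // task has a schedule.interval
  isUnrecoverable: boolean;  // the task runner threw an unrecoverable error
  attempts: number;          // already incremented when the task was marked as running
  maxAttempts: number;       // xpack.task_manager.max_attempts, default 3
  nextRunAt: Date;
}): AfterFailure {
  if (opts.isUnrecoverable) {
    // Unrecoverable errors fail the task outright, recurring or not.
    return { action: 'markFailed' };
  }
  if (opts.isRecurring) {
    // Recurring tasks are rescheduled rather than failed on ordinary errors.
    return { action: 'reschedule', runAt: opts.nextRunAt };
  }
  if (opts.attempts < opts.maxAttempts) {
    return { action: 'retry', runAt: opts.nextRunAt };
  }
  // Non-recurring task out of attempts: marked failed and never run again.
  return { action: 'markFailed' };
}
```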
The only type of unrecoverable error that the alerting task runner throws is the one raised when the rule saved object can't be found, which doesn't obviously apply here. I'm wondering if it's possible that it's a coincidence that the attempts count matches maxAttempts and something else is marking these tasks as failed.
I haven't verified, but I also noticed this script (https://github.com/elastic/kibana/blob/main/x-pack/plugins/task_manager/server/queries/task_claiming.ts#L403-L412) which marks tasks as failed. Is it possible this script applies to recurring tasks and causes them to get into a `failed` state?
@mikecote That script uses this logic:
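Paraphrased in TypeScript below; this is a sketch of the decision that the claiming update-by-query script encodes, with approximate field names, not a copy of the actual painless script:

```ts
// Sketch of the claim/fail decision made when task manager claims available tasks.
// Field names approximate the task document; `claimedById` stands in for the
// "claim specific task ids" escape hatch in the real script.
interface TaskDoc {
  taskType: string;
  attempts: number;
  status: string;
  schedule?: { interval: string }; // present for recurring tasks, absent for one-shot tasks
}

function claimOrFail(
  task: TaskDoc,
  maxAttempts: number,
  claimedById: boolean
): 'claiming' | 'failed' {
  // Recurring tasks (schedule present) keep getting claimed no matter how many
  // attempts they have accumulated; one-shot tasks only while attempts < maxAttempts.
  if (task.schedule != null || task.attempts < maxAttempts || claimedById) {
    return 'claiming';
  }
  // No schedule and out of attempts: the task is marked failed and never claimed again.
  return 'failed';
}
```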
So recurring tasks with a schedule interval should not be marked as failed.
Thanks for bringing up that script @mikecote! Here's something interesting. When I run 7.14 locally and create a rule, the task associated with the rule has a `schedule` field set to the rule's interval. In the issue summary, @pmuellr pasted a task manager task with no `schedule` field at all. So I think you are right in that the script is what is marking the task as failed, but how did the schedule interval disappear from the alerting task?!?
It looks like for rules created in older versions of Kibana (7.7 - 7.10), the associated task manager task does not contain a `schedule` field.
So I created 2 rules on a 7.10 deployment, 1 that runs frequently (every minute) and 1 that runs less frequently (every day). Then I updated the deployment to 7.11. After the upgrade, the rule that runs frequently ran and its associated task was updated to include the `schedule` field, while the task for the rule that hadn't run yet still had no schedule. So it seems like the conditions that could lead to this state (rule tasks with no `schedule` that get marked as `failed`) are: the rule was created before 7.11, the deployment was upgraded, and the task then hit maxAttempts failures before completing a run that would have written the schedule onto it (illustrated below).
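To make the difference concrete, here are illustrative task document shapes for the two cases; these are made up for illustration, not copied from the user's cluster, and the rule type is an arbitrary example:

```ts
// Task for a pre-7.11 rule that never ran after the upgrade: no schedule, so once
// attempts reaches maxAttempts the claim script marks it failed forever.
const pre711Task = {
  taskType: 'alerting:.index-threshold', // example rule type
  status: 'failed',
  attempts: 3,
  // note: no `schedule` field; the interval only lives in the rule saved object
};

// Task after the rule has run at least once on 7.11+: the schedule is written onto
// the task, so the claim script will never mark it failed.
const post711Task = {
  taskType: 'alerting:.index-threshold',
  status: 'idle',
  attempts: 0,
  schedule: { interval: '1m' },
};
```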
Just a comment that although this is resolved for rules created post 7.11, we are still marking a task as `failed` when it has no schedule and has exhausted its attempts - see kibana/x-pack/plugins/task_manager/server/queries/mark_available_tasks_as_claimed.test.ts, lines 47 to 60 in b555a22.
@ymao1 nice find! It makes sense now thinking of it, because the task documents get "migrated" / have their interval set after the rule runs, so if a rule didn't run after an upgrade and its task hit maxAttempts, I can see how it would then get marked as failed. It sounds like a tricky fix to ensure these tasks don't get marked as failed?
@mikecote I think we could possibly update the update-by-query logic so these schedule-less alerting tasks don't get marked as failed. Do you think this scenario is common enough to add a fix for it?
For some reason I was thinking it would be as easy as adding a migration to "fix" these failed rules, by clearing the failures and ensuring they have a schedule interval. But I guess ... what interval? Maybe we need to mark the tasks with a new field "needs an interval!", and then let alerting fix it in a cleanup task or the next time it runs the rule executor.
Since this issue is a research issue, I think we could come up with a proposal on how we envision this getting fixed and create a follow-up issue to prioritize accordingly. I think @ymao1 and @pmuellr you're onto something; maybe some per-task-type logic, maybe task types can define themselves as single occurrences (run once) or recurring occurrences (run on an interval) and leverage that in the updateByQuery. We may end up prioritizing the fix sooner if our focus for 8.1 through 8.3 turns out to be ensuring alerting rules operate continually. This would be an edge case, so it may warrant a lower spot on the list, but it's worth attention for sure.
I think it's important to note that this scenario shouldn't happen on 7.11+ rules. They should never be marked as failed, and they will continue getting claimed because they have a `schedule`. I think if we want to go back and fix pre-7.11 rules, perhaps we could have a task manager migration that looks for alerting task types with no schedule, resets their number of attempts to 0, and resets their status to `idle`.
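A minimal sketch of what such a migration could look like; the function name and the migration wiring (and the exact import path, which varies by version) are assumptions here, only the intent of resetting attempts/status for schedule-less alerting tasks comes from the comment above:

```ts
import type { SavedObjectMigrationFn, SavedObjectUnsanitizedDoc } from 'kibana/server';

// Hypothetical task manager saved object migration: for alerting tasks that have
// no schedule and were marked failed, clear the attempts and put them back to idle
// so the claiming query will pick them up again.
export const resetFailedScheduleLessAlertingTasks: SavedObjectMigrationFn<any, any> = (
  doc: SavedObjectUnsanitizedDoc<any>
) => {
  const task = doc.attributes;
  const isAlertingTask =
    typeof task.taskType === 'string' && task.taskType.startsWith('alerting:');
  if (isAlertingTask && task.schedule == null && task.status === 'failed') {
    return { ...doc, attributes: { ...task, attempts: 0, status: 'idle' } };
  }
  return doc;
};
```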
Ya, this sounds better than my idea! They will have a YUUUGE task delay/drift once they run! My only concern is that if one ends up failing again 3 times, the interval still wouldn't be set, and so it would be in a zombie state again. Maybe add a check when we run the rule: if the schedule isn't in the task, do a one-time update on it to set it?
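A sketch of that one-time backfill check; `ensureTaskHasSchedule`, `taskStore.update`, and the parameter shapes are placeholders for illustration, not existing alerting APIs:

```ts
// When the alerting task runner executes a rule whose task has no schedule,
// copy the rule's interval onto the task so it can't fall back into the
// "no schedule + maxAttempts" zombie state.
async function ensureTaskHasSchedule(
  taskInstance: { id: string; schedule?: { interval: string } },
  ruleInterval: string,
  taskStore: { update(id: string, changes: Record<string, unknown>): Promise<unknown> }
): Promise<void> {
  if (taskInstance.schedule == null) {
    await taskStore.update(taskInstance.id, { schedule: { interval: ruleInterval } });
  }
}
```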
That makes sense and sounds pretty straightforward! I'll create a follow-up issue for these items.
Created #117593 to implement the suggestions from this research issue. Closing this issue.
Kibana version: 7.14.1 (reported by user)
User encountered a situation where the task document indicated that it failed with 3 attempts, but the rule doesn't seem to indicate any failure status.
The "three strikes you're out" processing centers around the concept of "max_attempts":
kibana/x-pack/plugins/task_manager/server/config.ts, lines 46 to 50 in b17e01d
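Roughly what those lines define, paraphrased rather than copied from config.ts: the setting surfaces as xpack.task_manager.max_attempts and defaults to 3.

```ts
import { schema } from '@kbn/config-schema';

// Approximation of the task manager config schema entry for max_attempts.
export const configSchema = schema.object({
  max_attempts: schema.number({
    defaultValue: 3,
    min: 1,
  }),
});
```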
Some data from the docs:
task manager doc
rule doc
I'm not super familiar with how we handle these task manager errors, but just looking at these docs makes me wonder if we are updating the rule when the task fails like this. I'd guess we aren't. But clearly we should be. Perhaps we can't, because the task manager failure is occurring before or after the rule runs, so the rule doesn't have a chance to update its own status? If so, it seems like gathering the "status" on a rule should also look at the task manager doc to see whether task manager has "given up" on the task because of the number of attempts. In the end, if we've marked a task as "will not run again", we need to make sure the resource that created the task is informed of this, so it can reflect that back in the resource status.
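A sketch of that idea; `getRule`, `getTaskForRule`, and the status shapes are placeholders for illustration, not existing alerting APIs:

```ts
// When computing a rule's execution status, also consult the underlying task
// manager document so a "given up" task surfaces as an error on the rule.
async function resolveRuleStatus(
  ruleId: string,
  deps: {
    getRule(id: string): Promise<{ executionStatus: { status: string } }>;
    getTaskForRule(id: string): Promise<{ status: string; attempts: number }>;
  }
): Promise<{ status: string; reason?: string }> {
  const [rule, task] = await Promise.all([deps.getRule(ruleId), deps.getTaskForRule(ruleId)]);
  if (task.status === 'failed') {
    // Task manager will never run this task again; reflect that instead of the
    // stale last-execution status stored on the rule.
    return {
      status: 'error',
      reason: `task failed after ${task.attempts} attempts and will not retry`,
    };
  }
  return rule.executionStatus;
}
```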
Note also the user has updated the rule (hence the `updatedAt: 2021-10-22T10:16` property), in hopes of getting the rule to "fire" - they thought the rule stopped recognizing its alerting conditions, but as near as I can tell, it was just not running at all.

I added an `estimate:needs-research` label, as I think we'll need to research how the task got into this state before proposing remediation work.