[alerting] rule tasks that fail 3 times are never run again, with no indication in the rule #116321

Closed · pmuellr opened this issue Oct 26, 2021 · 20 comments
Labels: bug, estimate:needs-research, Feature:Alerting/RulesFramework, Feature:Task Manager, Team:ResponseOps

pmuellr (Member) commented Oct 26, 2021

Kibana version: 7.14.1 (reported by user)

User encountered a situation where the task document indicated that it failed with 3 attempts, but the rule doesn't seem to indicate any failure status.

The "three strikes you're out" processing centers around the concept of "max_attempts":

  /* The maximum number of times a task will be attempted before being abandoned as failed */
  max_attempts: schema.number({
    defaultValue: 3,
    min: 1,
  }),

Some data from the docs:

task manager doc
      {
        "_index" : ".kibana_task_manager_7.14.1_001",
        "_type" : "_doc",
        "_id" : "task:5885e3d0-da5f-11eb-8c0e-9782f1ad0c93",
        "_score" : 1.0,
        "_source" : {
          "migrationVersion" : {
            "task" : "7.6.0"
          },
          "task" : {
            "taskType" : "[[elided - pmuellr]]",
            "retryAt" : "2021-09-12T21:30:05.288Z",
            "runAt" : "2021-09-12T20:44:02.033Z",
            "scope" : [
              "alerting"
            ],
            "startedAt" : "2021-09-12T21:10:05.287Z",
            "state" : """{"alertInstances":{},"previousStartedAt":"2021-09-12T20:43:02.033Z"}""",
            "params" : "[[elided - pmuellr]]",
            "ownerId" : "kibana:1ef3f039-cb28-486b-9046-01a41adb06e8",
            "scheduledAt" : "2021-07-01T11:27:40.684Z",
            "attempts" : 3,
            "status" : "failed"
          },
          "references" : [ ],
          "updated_at" : "2021-09-12T21:10:05.339Z",
          "coreMigrationVersion" : "7.14.1",
          "type" : "task"
        }
      }
rule doc
        {
            "id": "80cb3a73-b149-4780-a290-4acb9cf06ccd",
            "notifyWhen": "onActiveAlert",
            "params": { // elided
            },
            "consumer": "alerts",
            "alertTypeId": "elided",
            "schedule": {
                "interval": "1m"
            },
            "actions": [ // elided
            ],
            "tags": [ // elided
            ],
            "name": "elided",
            "enabled": true,
            "throttle": null,
            "apiKeyOwner": "elided",
            "createdBy": "elided",
            "updatedBy": "elided",
            "muteAll": false,
            "mutedInstanceIds": [],
            "updatedAt": "2021-10-22T10:16:03.999Z",
            "createdAt": "2021-07-01T11:27:40.384Z",
            "scheduledTaskId": "5885e3d0-da5f-11eb-8c0e-9782f1ad0c93",
            "executionStatus": {
                "lastExecutionDate": "2021-09-12T20:43:02.778Z", 
                "status": "ok"
            }
        },

I'm not super familiar with how we handle these task manager errors, but looking at these docs makes me wonder whether we update the rule at all when the task fails like this. I'd guess we don't, but clearly we should. Perhaps we can't, because the task manager failure occurs before or after the rule runs, so the rule never gets a chance to update its own status? If so, it seems like gathering the "status" for a rule should also look at the task manager doc to see whether task manager has "given up" on the task because of the number of attempts. In the end, if we've marked a task as "will not run again", we need to make sure the resource that created the task is informed, so it can reflect this back in the resource status.
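As a rough illustration of that last point (a sketch only; the getTask callback stands in for however we'd fetch the task doc, e.g. via the task manager start contract, and the field names come from the docs below):

// Sketch: if task manager has marked the rule's task as failed, reflect that in
// the rule's execution status instead of returning the stale stored value.
interface RuleExecutionStatus {
  status: string;
  lastExecutionDate: string;
  error?: { reason: string; message: string };
}

interface TaskDocLike {
  status: string;
  attempts: number;
}

export async function resolveRuleExecutionStatus(
  getTask: (id: string) => Promise<TaskDocLike>, // stand-in for fetching the task doc
  scheduledTaskId: string,
  storedStatus: RuleExecutionStatus
): Promise<RuleExecutionStatus> {
  const task = await getTask(scheduledTaskId);
  if (task.status === 'failed') {
    // Task manager has given up on this task and will never claim it again;
    // surface that instead of the stale status stored on the rule.
    return {
      ...storedStatus,
      status: 'error',
      error: {
        reason: 'execute',
        message: `Rule task was marked as failed after ${task.attempts} attempts and will not run again`,
      },
    };
  }
  return storedStatus;
}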

Note also that the user has updated the rule (hence the updatedAt: 2021-10-22T10:16 property) in hopes of getting the rule to "fire" - they thought the rule had stopped recognizing its alerting conditions, but as near as I can tell, it was just not running at all.

I added an estimate:needs-research label, as I think we'll need to:

  • repro
  • find out if there is an existing way to get these types of failures reflected back in the rule status; if there is, why didn't that happen here
  • if we don't reflect this task status back in the rule status, we'll need to invent a way to do that
pmuellr added the bug, Feature:Task Manager, Team:ResponseOps, Feature:Alerting/RulesFramework, and estimate:needs-research labels Oct 26, 2021
elasticmachine (Contributor) commented
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

pmuellr (Member, Author) commented Oct 26, 2021

Here's a PR for 7.16 regarding actions, not rules, which is very tangentially related: #109655. In that one we are "cleaning up" these failed action tasks.

You could imagine something similar for rules, but it would be kinda sad since we'd have to do it in a scheduled cleanup task rather than getting immediate feedback: a task that looks for these failed rule tasks and presumably sets the last execution status to an error condition with, hopefully, a relevant message. It could also disable / re-enable the rule, if it was marked enabled. But perhaps it should just be disabled; otherwise, if whatever error is being hit keeps getting hit, we'd essentially be ignoring the maxAttempts configuration altogether.
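For concreteness, a minimal sketch of what that scheduled cleanup task could look like, assuming the task manager plugin's registerTaskDefinitions / ensureScheduled contracts; the task type name and the markFailedRuleTasksInRules helper are made up for illustration:

import type { TaskManagerSetupContract, TaskManagerStartContract } from '../../task_manager/server'; // path approximate

// Hypothetical helper: find task docs with taskType "alerting:*" and status "failed",
// then write an error executionStatus onto the corresponding rules (and maybe disable them).
declare function markFailedRuleTasksInRules(): Promise<void>;

export function registerFailedRuleCleanupTask(taskManager: TaskManagerSetupContract) {
  taskManager.registerTaskDefinitions({
    'alerting:failed-rule-task-cleanup': {
      title: 'Reflect failed rule tasks back into rule status',
      createTaskRunner: () => ({
        run: async () => {
          await markFailedRuleTasksInRules();
          return { state: {} };
        },
      }),
    },
  });
}

export async function scheduleFailedRuleCleanupTask(taskManager: TaskManagerStartContract) {
  await taskManager.ensureScheduled({
    id: 'alerting:failed-rule-task-cleanup',
    taskType: 'alerting:failed-rule-task-cleanup',
    schedule: { interval: '1h' },
    params: {},
    state: {},
  });
}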

pmuellr (Member, Author) commented Oct 26, 2021

Upon further research, the user has 12 rules in a similar state; the last execution date for them recorded in the rule (presumably last successful task run) was on 9/12 - 9/13, so it seems like there was some kind of systemic problem around that time ...

gmmorris (Contributor) commented Nov 1, 2021

At this stage it isn't clear what led to this - whoever picks this up should use it as a research issue to identify how it might have happened and then bring it back to the team to propose work to remediate.

mikecote (Contributor) commented Nov 1, 2021

I'm thinking this may be timeout related and occur after timing out 3x in a row.

ymao1 self-assigned this Nov 1, 2021
ymao1 (Contributor) commented Nov 2, 2021

For non-recurring tasks, task manager makes 3 (or the configured maxAttempts) attempts and then marks the task as failed. However, rules are recurring tasks and, as this test specifically calls out, they are not marked as failed even if maxAttempts is reached:

test(`Doesn't fail recurring tasks when maxAttempts reached`, async () => {
  const id = _.random(1, 20).toString();
  const initialAttempts = 3;
  const intervalSeconds = 10;
  const { runner, store } = await readyToRunStageSetup({
    instance: {
      id,
      attempts: initialAttempts,
      schedule: { interval: `${intervalSeconds}s` },
      startedAt: new Date(),
    },
    definitions: {
      bar: {
        title: 'Bar!',
        maxAttempts: 3,
        createTaskRunner: () => ({
          run: async () => {
            throw new Error();
          },
        }),
      },
    },
  });
  await runner.run();
  expect(store.update).toHaveBeenCalledTimes(1);
  const instance = store.update.mock.calls[0][0];
  expect(instance.attempts).toEqual(3);
  expect(instance.status).toEqual('idle');
  expect(instance.runAt.getTime()).toEqual(
    new Date(Date.now() + intervalSeconds * 1000).getTime()
  );
});

For recurring tasks, task manager only marks the task as failed if running it results in an unrecoverable error; otherwise it tries to reschedule the task:

private rescheduleFailedRun = (
  failureResult: FailedRunResult
): Result<SuccessfulRunResult, FailedTaskResult> => {
  const { state, error } = failureResult;
  if (this.shouldTryToScheduleRetry() && !isUnrecoverableError(error)) {
    // if we're retrying, keep the number of attempts
    const { schedule, attempts } = this.instance.task;
    const reschedule = failureResult.runAt
      ? { runAt: failureResult.runAt }
      : failureResult.schedule
      ? { schedule: failureResult.schedule }
      : schedule
      ? { schedule }
      : // when result.error is truthy, then we're retrying because it failed
        {
          runAt: this.getRetryDelay({
            attempts,
            error,
          }),
        };
    if (reschedule.runAt || reschedule.schedule) {
      return asOk({
        state,
        attempts,
        ...reschedule,
      });
    }
  }
  // scheduling a retry isn't possible, mark task as failed
  return asErr({ status: TaskStatus.Failed });
};

Separately, for all tasks, the number of attempts is incremented in the markTaskAsRunning() function which, even if it encounters an error, will still increment the number of attempts:

}, await this.releaseClaimAndIncrementAttempts());

The only type of unrecoverable error that the alerting task runner throws is the Saved object not found error. Every other type of error that could be thrown as part of the alerting task (whether inside the alerting task runner or inside a rule executor) looks to be caught and returned to the Task Manager as an ok result.
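To make that split concrete, here's a simplified sketch of the pattern - not the actual alerting task runner code; executeRule and isSavedObjectNotFoundError are stand-ins:

import { throwUnrecoverableError } from '../../task_manager/server'; // path approximate

declare function executeRule(params: Record<string, unknown>): Promise<Record<string, unknown>>; // stand-in
declare function isSavedObjectNotFoundError(err: unknown): boolean; // stand-in

export async function runRuleTask(taskInstance: {
  params: Record<string, unknown>;
  state: Record<string, unknown>;
}) {
  try {
    const state = await executeRule(taskInstance.params);
    return { state };
  } catch (err) {
    if (isSavedObjectNotFoundError(err)) {
      // The one case surfaced as unrecoverable: the task is not retried and is marked "failed".
      throwUnrecoverableError(err);
    }
    // Every other error is recorded in the rule's executionStatus; the task run
    // still resolves as "ok", so task manager reschedules it on its interval.
    return { state: { ...taskInstance.state, previousStartedAt: new Date().toISOString() } };
  }
}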

I'm wondering if it's possible that it's a coincidence that attempts = 3 when the task was marked as failed. If a cluster is experiencing instability, could task manager have had trouble marking the tasks as running, incrementing the attempt count each time until it finally grabbed the task, only to get an unrecoverable error? Or, as @mikecote suggested, if rule types are consistently exceeding the timeout, they become up for grabs after the 5m timeout interval and the next task manager to pick up the task will increment the number of attempts. However, the failed status would still need to be triggered by an unrecoverable error.

mikecote (Contributor) commented Nov 2, 2021

Or as @mikecote suggested, if rule types are consistently exceeding the timeout, they become up for grabs after the 5m timeout interval and the next task manager to pick up the task will increment the number of attempts. However, the failed status would still need to be triggered by an unrecoverable error

I haven't verified but I also noticed this script (https://github.com/elastic/kibana/blob/main/x-pack/plugins/task_manager/server/queries/task_claiming.ts#L403-L412) which marks tasks as failed. Is it possible this script applies to recurring tasks and causes them to get into a failed state?

ymao1 (Contributor) commented Nov 2, 2021

@mikecote That script uses this logic:

if (ctx._source.task.schedule != null || ctx._source.task.attempts < 
  params.taskMaxAttempts[ctx._source.task.taskType] || params.claimTasksById.contains(ctx._id)) {
    ${markAsClaimingScript}
} else {
    ctx._source.task.status = "failed";
}

So recurring tasks with a schedule interval should not be marked as failed
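Restating that condition in TypeScript for readability (a paraphrase of the painless script above, not code that exists anywhere):

function shouldClaimInsteadOfFail(
  task: { schedule?: { interval: string } | null; attempts: number; taskType: string; id: string },
  taskMaxAttempts: Record<string, number>,
  claimTasksById: string[]
): boolean {
  return (
    task.schedule != null ||                          // recurring tasks are always claimable
    task.attempts < taskMaxAttempts[task.taskType] || // one-shot tasks with attempts remaining
    claimTasksById.includes(task.id)                  // tasks explicitly claimed by id
  );
  // anything that doesn't match falls through to: status = "failed"
}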

ymao1 (Contributor) commented Nov 2, 2021

Thanks for bringing up that script @mikecote!

Here's something interesting. When I run 7.14 locally and create a rule, the task associated with the rule has a schedule.interval field that reflects the rule interval.

In the issue summary, @pmuellr pasted a task manager task with attempts:3 and status:failed that has no schedule.interval field. I went back to the source of this data, and it looks like all of the failed task manager tasks (which otherwise are consistent with alerting tasks) are missing this schedule.interval field.

So I think you are right in that the script is what is marking the task as failed, but how did the schedule interval disappear from the alerting task?!?

ymao1 (Contributor) commented Nov 3, 2021

It looks like for rules created in older versions of Kibana (7.7 - 7.10), the associated task manager task does not contain a schedule.interval field, so the mark-task-as-failed script (added in 7.11) will mark them as failed if the max number of attempts is reached, even though they are recurring tasks. Starting in 7.11, schedule.interval is written out to the task manager doc for created rules (with this PR).
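For anyone trying to confirm whether they're affected, something like this should find the tasks in question (a sketch using the 8.x Elasticsearch JS client request shape; the index pattern and field names come from the task doc pasted in the issue summary):

import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' }); // adjust node/auth for your cluster

// Alerting tasks that have been marked failed and are missing schedule.interval,
// i.e. rule tasks created before 7.11 that task manager has given up on.
export async function findZombieRuleTasks() {
  const result = await client.search({
    index: '.kibana_task_manager*',
    size: 100,
    query: {
      bool: {
        filter: [
          { prefix: { 'task.taskType': 'alerting:' } },
          { term: { 'task.status': 'failed' } },
        ],
        must_not: [{ exists: { field: 'task.schedule.interval' } }],
      },
    },
  });
  return result.hits.hits;
}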

ymao1 (Contributor) commented Nov 3, 2021

So I created 2 rules on a 7.10 deployment, 1 that runs frequently (every minute) and 1 that runs less frequently (every day). Then I updated the deployment to 7.11. After the upgrade, the rule that runs frequently ran and the associated task was updated to include the schedule.interval field. The rule that runs infrequently was not scheduled to run, so its associated task remained the same, missing the schedule.interval field.

So it seems like the conditions that could lead to this state (rule tasks with attempts:3 and status:failed) are the following:

  • rules created prior to 7.11
  • rule did not run after the upgrade (or was unable to successfully run) - this means the task manager doc was never updated with the schedule.interval field
    • in the original user issue, they had 12 rules running at 1m interval that ended up in this state, so it's possible that they had connection issues for a short amount of time that led to the number of attempts incrementing rapidly...although without the original logs, it is hard to be certain
  • the markTaskAsFailed script kicked in on the 4th attempt after this and marked the task as failed because it doesn't realize it's a recurring task

ymao1 (Contributor) commented Nov 4, 2021

Just a comment that although post-7.11 we mark the task as failed if it has no schedule and has reached its maxAttempts, prior to this, in 7.10, task manager would still not claim these types of tasks. So if a pre-7.11 alerting rule reached attempts:3, it would stop getting claimed by task manager:

expect(
  asUpdateByQuery({
    query: mustBeAllOf(
      // Either a task with idle status and runAt <= now or
      // status running or claiming with a retryAt <= now.
      shouldBeOneOf(IdleTaskWithExpiredRunAt, RunningOrClaimingTaskWithExpiredRetryAt),
      // Either the task has a schedule or the attempts < the maximum configured
      shouldBeOneOf<ExistsFilter | TermFilter | RangeFilter>(
        TaskWithSchedule,
        ...Object.entries(definitions).map(([type, { maxAttempts }]) =>
          taskWithLessThanMaxAttempts(type, maxAttempts || defaultMaxAttempts)
        )
      )
    ),

mikecote (Contributor) commented Nov 4, 2021

@ymao1 nice find! It makes sense now, thinking about it, because the task documents get "migrated" / "their interval set" after the rule runs - so if they didn't run after an upgrade and hit maxAttempts, I can see how they would then get marked as failed.

It sounds like a tricky fix to ensure these tasks don't get marked as failed?

ymao1 (Contributor) commented Nov 4, 2021

@mikecote I think we could possibly update the updateByQuery to take the task type into consideration and never mark alerting tasks as failed?

Do you think this scenario is common enough to add a fix for it?

pmuellr (Member, Author) commented Nov 4, 2021

It sounds like a tricky fix to ensure these tasks don't get marked as failed?

For some reason I was thinking it would be as easy as adding a migration to "fix" these failed rules, by clearing the fails and ensuring they have a scheduled interval. But I guess ... what interval? Maybe we need to mark the tasks with a new field "needs an interval!", and then let alerting fix it in a cleanup task or next time it runs the rule executor.

mikecote (Contributor) commented Nov 4, 2021

@mikecote I think we could possibly update the updateByQuery to take the task type into consideration and never mark alerting tasks as failed?

Do you think this scenario is common enough to add a fix for it?

It sounds like a tricky fix to ensure these tasks don't get marked as failed?

For some reason I was thinking it would be as easy as adding a migration to "fix" these failed rules, by clearing the fails and ensuring they have a scheduled interval. But I guess ... what interval? Maybe we need to mark the tasks with a new field "needs an interval!", and then let alerting fix it in a cleanup task or next time it runs the rule executor.

Since this issue is a research issue, I think we could come up with a proposal on how we envision this getting fixed and create a follow-up issue to prioritize accordingly.

I think @ymao1 and @pmuellr you're onto something: maybe some per-task-type logic - task types could define themselves as single occurrence (run once) or recurring (run on an interval) and leverage that in the updateByQuery.

We may end up prioritizing the fix soon if our focus for 8.1 through 8.3 ends up being to ensure alerting rules operate continually. This is an edge case, so it may rank lower on the list, but it's worth attention for sure.

ymao1 (Contributor) commented Nov 4, 2021

I think it's important to note that this scenario shouldn't happen on 7.11+ rules. They should never be marked as failed and they will continue getting claimed because they have a schedule field.

I think if we want to go back and fix pre-7.11 rules, perhaps we could have a task manager migration that looks for alerting task types with no schedule and resets their number of attempts to 0 and resets their status to idle. This would allow task manager to start claiming these rules again and if they run successfully, the document would get updated correctly. The edge case in this scenario would be if task manager starts claiming them but they never run successfully...the task manager doc would never get updated to correctly reflect the schedule....
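A rough sketch of what that migration could look like, assuming the task saved-object shape shown earlier in this issue (this is not an actual shipped migration):

interface TaskSO {
  attributes: {
    taskType: string;
    schedule?: { interval: string };
    attempts: number;
    status: string;
    [key: string]: unknown;
  };
  [key: string]: unknown;
}

// Reset pre-7.11 alerting tasks that were wrongly marked as failed so task
// manager starts claiming them again; a later successful run (or the check
// discussed below) would then write the schedule back onto the task doc.
export function resetFailedAlertingTasks(doc: TaskSO): TaskSO {
  const { taskType, schedule, status } = doc.attributes;
  if (taskType.startsWith('alerting:') && schedule?.interval == null && status === 'failed') {
    return {
      ...doc,
      attributes: { ...doc.attributes, attempts: 0, status: 'idle' },
    };
  }
  return doc;
}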

pmuellr (Member, Author) commented Nov 4, 2021

perhaps we could have a task manager migration that looks for alerting task types with no schedule and resets their number of attempts to 0 and resets their status to idle

Ya, this sounds better than my idea! They will have a YUUUGE task delay/drift once they run! My only concern is that if it ends up failing again 3 times, the interval still wouldn't be set, and so it would be in a zombie state again. Maybe add a check when we run the rule: if the schedule isn't in the task, do a one-time update on it to set it?
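Something like this, sketched against the general shape of a task runner's run result (loadRule and executeRule are stand-ins, and the exact result shape in the alerting task runner may differ):

interface IntervalSchedule {
  interval: string;
}

declare function loadRule(): Promise<{ schedule: IntervalSchedule }>; // stand-in: fetch the rule SO for this task
declare function executeRule(rule: unknown, state: Record<string, unknown>): Promise<Record<string, unknown>>; // stand-in

export async function runRuleTaskAndBackfillSchedule(taskInstance: {
  state: Record<string, unknown>;
  schedule?: IntervalSchedule;
}) {
  const rule = await loadRule();
  const state = await executeRule(rule, taskInstance.state);
  return {
    state,
    // If the task doc predates 7.11 and has no schedule, returning one here lets
    // task manager persist schedule.interval, so the task can never hit the
    // "mark as failed" path again.
    schedule: taskInstance.schedule ?? { interval: rule.schedule.interval },
  };
}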

ymao1 (Contributor) commented Nov 4, 2021

Maybe add a check when we run the rule, if the schedule isn't in the task, do a one-time update on it to set it?

That makes sense and sounds pretty straightforward! I'll create a follow-up issue for these items.

ymao1 (Contributor) commented Nov 4, 2021

Created #117593 to implement the suggestions from this research issue. Closing this issue.

ymao1 closed this as completed Nov 4, 2021
kobelb added the needs-team label Jan 31, 2022
botelastic bot removed the needs-team label Jan 31, 2022