
v3.5.3: Workflow processing fails to complete due to incomplete WorkflowTaskResult from interrupted pod #12993

Closed
3 of 4 tasks
jesseanttila-cai opened this issue Apr 29, 2024 · 12 comments · Fixed by #13454 or #13798 · May be fixed by #13051
Labels: area/controller, area/executor, P1, solution/suggested, type/bug, type/regression

Comments

@jesseanttila-cai

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

As part of #12402 (included from v3.5.3 onwards), workflow pod wait-container behavior was changed to create a placeholder (incomplete) WorkflowTaskResult before waiting for the main-container to complete.

// Create a new empty (placeholder) task result with LabelKeyReportOutputsCompleted set to false.
wfExecutor.InitializeOutput(bgCtx)
// Wait for main container to complete
err := wfExecutor.Wait(ctx)

The WorkflowTaskResult is finalized after output artifacts, logs etc. have been handled:

defer wfExecutor.FinalizeOutput(bgCtx) // Ensures the LabelKeyReportOutputsCompleted is set to true.

If the wait-container is interrupted in a way that prevents FinalizeOutput from being called (e.g. pod deletion without sufficient grace period), an incomplete WorkflowTaskResult remains with the workflows.argoproj.io/report-outputs-completed label set to false. Retries of the same task will produce additional WorkflowTaskResults and will not mark the previous one complete. This leaves the workflow stuck in Processing state until the WorkflowTaskResult is manually edited to mark it complete.
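
For reference, since WorkflowTaskResult is a namespaced custom resource carrying the labels shown in the controller logs below, any leftover incomplete results can be listed with a label selector (the namespace and workflow name are placeholders):

kubectl get workflowtaskresults.argoproj.io -n <namespace> \
  -l workflows.argoproj.io/workflow=<workflow-name>,workflows.argoproj.io/report-outputs-completed=false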

The reproduction example workflow simulates forced pod deletion using a pod that deletes itself, leaving behind an incomplete WorkflowTaskResult. The included workflow controller log snippet shows the resulting processing loop.

This issue may be one of the causes of #12103.

Version

v3.5.3

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: workflow-hang-example-
spec:
  entrypoint: self-delete
  templates:
    - name: self-delete
      retryStrategy:
        limit: "1"
        retryPolicy: Always
      podSpecPatch: |
        containers:
          - name: main
            env:
            - name: RETRY_COUNT
              value: "{{retries}}"
        terminationGracePeriodSeconds: 0
      container:
        image: bitnami/kubectl:latest
        env:
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
        # Simulate interruption via forced pod deletion on first attempt, complete successfully on retry
        command: ["sh", "-c", "sleep 10; [ $RETRY_COUNT -eq 0 ] && kubectl delete pod $POD_NAME; sleep 10"]
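
To reproduce, the manifest above can be submitted as-is; a minimal invocation with the argo CLI might look like this (the file and namespace names are placeholders):

argo submit -n <namespace> --watch workflow-hang-example.yaml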

Logs from the workflow controller

time="2024-04-29T10:31:58.427Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=26673983 namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.428Z" level=info msg="Task-result reconciliation" namespace=ext-namespace numObjs=2 workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="task result:\n&WorkflowTaskResult{ObjectMeta:{workflow-hang-example-hf6nh-1013521106  ext-namespace  7ad71c97-1148-4c59-aea4-84e8c56d19c4 26673606 1 2024-04-29 10:29:57 +0000 UTC <nil> <nil> map[workflows.argoproj.io/report-outputs-completed:false workflows.argoproj.io/workflow:workflow-hang-example-hf6nh] map[] [{argoproj.io/v1alpha1 Workflow workflow-hang-example-hf6nh af4266ff-e25f-4e4d-b073-dee4b6778da7 <nil> <nil>}] [] [{argoexec Update argoproj.io/v1alpha1 2024-04-29 10:29:57 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:workflows.argoproj.io/report-outputs-completed\":{},\"f:workflows.argoproj.io/workflow\":{}},\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"af4266ff-e25f-4e4d-b073-dee4b6778da7\\\"}\":{}}}} }]},NodeResult:NodeResult{Phase:,Message:,Outputs:nil,Progress:,},}" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="task result name:\nworkflow-hang-example-hf6nh-1013521106" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="Marking task result incomplete workflow-hang-example-hf6nh-1013521106" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=info msg="task-result changed" namespace=ext-namespace nodeID=workflow-hang-example-hf6nh-1013521106 workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="task result:\n&WorkflowTaskResult{ObjectMeta:{workflow-hang-example-hf6nh-3630579407  ext-namespace  3fe40d94-0f39-45ee-aff6-902f6086b578 26673958 2 2024-04-29 10:30:36 +0000 UTC <nil> <nil> map[workflows.argoproj.io/report-outputs-completed:true workflows.argoproj.io/workflow:workflow-hang-example-hf6nh] map[] [{argoproj.io/v1alpha1 Workflow workflow-hang-example-hf6nh af4266ff-e25f-4e4d-b073-dee4b6778da7 <nil> <nil>}] [] [{argoexec Update argoproj.io/v1alpha1 2024-04-29 10:30:57 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:workflows.argoproj.io/report-outputs-completed\":{},\"f:workflows.argoproj.io/workflow\":{}},\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"af4266ff-e25f-4e4d-b073-dee4b6778da7\\\"}\":{}}},\"f:outputs\":{}} }]},NodeResult:NodeResult{Phase:,Message:,Outputs:&Outputs{Parameters:[]Parameter{},Artifacts:[]Artifact{},Result:nil,ExitCode:nil,},Progress:,},}" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="task result name:\nworkflow-hang-example-hf6nh-3630579407" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="Marking task result complete workflow-hang-example-hf6nh-3630579407" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=info msg="task-result changed" namespace=ext-namespace nodeID=workflow-hang-example-hf6nh-3630579407 workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="Skipping artifact GC" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="Evaluating node workflow-hang-example-hf6nh: template: *v1alpha1.WorkflowStep (self-delete), boundaryID: " namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.429Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=ext-namespace,name=workflow-hang-example-hf6nh)" tmpl="*v1alpha1.WorkflowStep (self-delete)"
time="2024-04-29T10:31:58.429Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=ext-namespace,name=workflow-hang-example-hf6nh)" tmpl="*v1alpha1.WorkflowStep (self-delete)"
time="2024-04-29T10:31:58.429Z" level=debug msg="Getting the template by name: self-delete" base="*v1alpha1.Workflow (namespace=ext-namespace,name=workflow-hang-example-hf6nh)" tmpl="*v1alpha1.WorkflowStep (self-delete)"
time="2024-04-29T10:31:58.430Z" level=debug msg="Node workflow-hang-example-hf6nh already completed" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.430Z" level=info msg="TaskSet Reconciliation" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.430Z" level=info msg=reconcileAgentPod namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.430Z" level=debug msg="Task results completion status: map[workflow-hang-example-hf6nh-1013521106:false workflow-hang-example-hf6nh-3630579407:true]" namespace=ext-namespace workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.430Z" level=debug msg="taskresults of workflow are incomplete or still have daemon nodes, so can't mark workflow completed" fromPhase=Running namespace=ext-namespace toPhase=Succeeded workflow=workflow-hang-example-hf6nh
time="2024-04-29T10:31:58.443Z" level=info msg="Workflow update successful" namespace=ext-namespace phase=Running resourceVersion=26673983 workflow=workflow-hang-example-hf6nh

Logs from your workflow's wait container

N/A (wait container must be forcibly interrupted for this issue to appear)
@jesseanttila-cai jesseanttila-cai added type/bug type/regression Regression from previous behavior (a specific type of bug) labels Apr 29, 2024
@jswxstw
Member

jswxstw commented Apr 29, 2024

It seems unreliable to determine whether the task result has been successfully reported through the AnnotationKeyReportOutputsCompleted.

@shuangkun
Member

After reproducing it, I can confirm there are indeed some problems; we need to think about how to fix them.

@alexlatchford

Yeah, we are just concluding an RCA after a cohort of workflows got "stuck" in the Running state as the workflow-controller upgrade was rolled out from v3.4.16 to v3.5.6. We reached the same conclusion, that #11947 was likely the root cause, but #12402 seems related too! In either case, it looks like it's isolated to a transient issue on our end, caused by an incompatibility between the argoexec:v3.4.16 sidecars inside the pods of workflows that were active at the time of the upgrade and the new workflow-controller:v3.5.6 control plane. Thanks for the original report, and thanks to the maintainers for all the effort here!

@agilgur5 agilgur5 added this to the v3.5.x patches milestone May 13, 2024
shuangkun added a commit to shuangkun/argo-workflows that referenced this issue May 14, 2024
…. Fixes:argoproj#12993

Signed-off-by: shuangkun <tsk2013uestc@163.com>
@agilgur5 agilgur5 added the P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important label Jun 9, 2024
@agilgur5 agilgur5 changed the title Workflow processing fails to complete due to invalid WorkflowTaskResult from interrupted pod Workflow processing fails to complete due to incomplete WorkflowTaskResult from interrupted pod Jun 10, 2024
@juliev0
Contributor

juliev0 commented Jun 17, 2024

Is the primary case for this one in which the Controller itself issues a SIGKILL to the container because the container isn't responding fast enough to a SIGTERM?

If so, it seems like we need to indicate on the Controller side that if we are doing a SIGKILL not to wait for the WorkflowTaskResult for that task, right?

@jesseanttila-cai
Author

Is the primary case for this one in which the Controller itself issues a SIGKILL to the container because the container isn't responding fast enough to a SIGTERM?

The root cause for the pod interruption in my case was primarily related to node deprovisioning, as the environment where this issue appeared consistently uses EC2 Spot Instances along with an aggressive deprovisioning strategy for underutilized nodes. I believe that node-pressure eviction could also cause this issue, along with any other external condition that would result in a non-graceful pod termination.

@juliev0
Contributor

juliev0 commented Jun 18, 2024

Got it. That's interesting that node-deprovisioning and node-pressure eviction would result in SIGKILL rather than SIGTERM.

@Joibel
Member

Joibel commented Jul 22, 2024

Got it. That's interesting that node-deprovisioning and node-pressure eviction would result in SIGKILL rather than SIGTERM.

There are projects like NTH (the AWS Node Termination Handler) which should get you a SIGTERM with some time to do some work. Perhaps they're not being used here. For non-graceful termination, though, we still need a solution.

I think we need to consider "Pod gone away" after a reasonable period (to allow for propagation of the WorkflowTaskResult) to be a pod failure and mark the outputs as completed (with error) to allow the workflow to fail/retry.

@agilgur5 agilgur5 added area/controller Controller issues, panics P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority and removed P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important labels Jul 23, 2024
@roelarents
Contributor

roelarents commented Jul 26, 2024

Good find. We encounter this sometimes in long-running workflows.

This leaves the workflow stuck in Processing state until the WorkflowTaskResult is manually edited to mark it complete.

How exactly can I manually edit this? Edit: ah, WorkflowTaskResult is a separate custom resource in Kubernetes that can be updated (change the workflows.argoproj.io/report-outputs-completed label from "false" to "true").
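
For example, something along these lines should flip the label on the stuck task result (the resource name and namespace are placeholders):

kubectl label workflowtaskresults.argoproj.io <taskresult-name> -n <namespace> \
  workflows.argoproj.io/report-outputs-completed=true --overwrite

Once the label reads "true", the controller should be able to finish reconciling the workflow on its next pass.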

@agilgur5

For non-graceful termination though, we still need a solution.

I think we need to consider "Pod gone away" after a reasonable period (to allow for propogation of the WorkflowTaskResult) to be pod failure and mark the outputs as completed (with error) to allow the workflow to fail/retry.

Agreed, I said something very similar in #13066 (comment) and proposed a similar approach in #13344 (comment). The proposed PR #13051 would not necessarily resolve this, since a finalizer does not prevent every form of non-graceful termination (e.g. OOMs and the like happen at the kubelet/node level, not at the control plane/k8s resource level).

I don't have time to take this on, @Joibel would you be able to tackle this when you're back from vacation? Or if someone else wants to tackle it in the meantime, that would work too. See my proposed approach in #13344 (comment) for more details.

@agilgur5 agilgur5 added the solution/suggested A solution to the bug has been suggested. Someone needs to implement it. label Jul 30, 2024
@Joibel
Member

Joibel commented Jul 30, 2024

I don't want to block someone else, but yes, otherwise I have it at the top of the list to do.

@ericblackburn

ericblackburn commented Aug 2, 2024

Is the primary case for this one in which the Controller itself issues a SIGKILL to the container because the container isn't responding fast enough to a SIGTERM?

If so, it seems like we need to indicate on the Controller side that if we are doing a SIGKILL not to wait for the WorkflowTaskResult for that task, right?

I am seeing this issue in v3.5.8 with just a kubectl delete pod. The reproduction case is (commands sketched below):

  1. submit a workflow
  2. kubectl delete the pod for the running node
  3. workflow becomes stuck in a perpetual Running status
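
Roughly, as shell commands (the workflow file and pod name are placeholders):

argo submit -n <namespace> my-workflow.yaml
kubectl delete pod -n <namespace> <pod-of-the-running-node>
argo get -n <namespace> @latest   # the workflow stays in Running indefinitely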

@Joibel
Member

Joibel commented Aug 2, 2024

@isubasinghe and I are working on a fix

@Joibel Joibel closed this as completed in 36b7a72 Aug 15, 2024
Joibel pushed a commit to pipekit/argo-workflows that referenced this issue Sep 19, 2024
…3454)

Signed-off-by: isubasinghe <isitha@pipekit.io>
Joibel pushed a commit that referenced this issue Sep 20, 2024
Signed-off-by: isubasinghe <isitha@pipekit.io>
@agilgur5 agilgur5 changed the title Workflow processing fails to complete due to incomplete WorkflowTaskResult from interrupted pod v3.5.3+: Workflow processing fails to complete due to incomplete WorkflowTaskResult from interrupted pod Oct 8, 2024
@agilgur5 agilgur5 changed the title v3.5.3+: Workflow processing fails to complete due to incomplete WorkflowTaskResult from interrupted pod v3.5.3: Workflow processing fails to complete due to incomplete WorkflowTaskResult from interrupted pod Oct 8, 2024
isubasinghe added a commit that referenced this issue Oct 28, 2024
#13533 (#13798)

Signed-off-by: isubasinghe <isitha@pipekit.io>
isubasinghe added a commit that referenced this issue Oct 30, 2024
#13533 (#13798)

Signed-off-by: isubasinghe <isitha@pipekit.io>