v3.4.11: workflow stuck Running state, but only pod is Completed #12103

Comments
So, to summarize, the Pod is "Completed" but the Step and the Workflow are both still showing as "Running", correct? I'm imagining that the Controller is failing to process it (especially as it has surpassed the activeDeadlineSeconds).
Correct. There is something in the workflow controller logs below that caught my eye and makes me think it's missing retry logic when receiving a transient error from the k8s control plane:
We've had this happen on 8 jobs yesterday. The problem is exacerbated by the Deadline, and it snowballs onto following jobs as they get stuck in Pending.
@ZeidAqemia @zqhi71 do your controller logs have any errors like the ones in #12103 (comment)? What % of workflows are affected? What version are you running? For me on v3.4.11 this affects less than 0.01% of workflows.
v3.5.2 here and it's 100% of workflows where one task OOMs.
Hi guys. We found the default controller settings are not suitable for thousands of CronWorkflows. When we adjusted --cronworkflow-worker, qps, and burst, the CronWorkflows worked fine. If someone has the same problem, adjusting the settings following this document (https://argo-workflows.readthedocs.io/en/latest/scaling/) may help.
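For reference, a sketch of what those settings might look like as args on the workflow-controller Deployment. The values and exact flag names are assumptions; check `workflow-controller --help` and the scaling docs for your version:

```yaml
# Illustrative scaling settings on the workflow-controller Deployment.
# Flag names and values are assumptions; verify against your Argo version.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          args:
            - --workflow-workers=64        # parallel workflow reconciliation workers
            - --cron-workflow-workers=32   # workers for CronWorkflow scheduling
            - --qps=50                     # k8s API client queries per second
            - --burst=100                  # k8s API client burst
```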
This seems related to Hang on "Workflow processing has been postponed due to max parallelism limit" #11808. I'm seeing the same issue.
Hi @tooptoop4
It still recurs, but rarely: roughly 1 in 20,000 workflow runs. Did you see the log in #12103 (comment)?
Yes, but in my case we are facing the same stuck-in-Running state.
Hi @tooptoop4, we tried #12233 but it doesn't help in my case.
Any ideas? @jswxstw @shuangkun
After reading the background of the issue, it seems that your situation is different from the others (not similar to #12993). I see a lot of logs like:
We're also seeing this issue, only in 3.5.x versions. I initially tried to upgrade and saw this issue on an earlier 3.5.x. It's been a month or so, so I tried again with 3.5.8, and I'm still seeing the issue. This is with any workflow I try to run - steps, DAGs, containers - both invoked from workflow templates and crons (although I doubt that matters).
@sstaley-hioscar, could you verify that what you see in the wait container logs from one of these runs confirms it is using WorkflowTaskResults, and that the controller itself has the appropriate RBAC to read WorkflowTaskResults, since you've said you have custom RBAC?
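A couple of commands that might help with that verification; namespace and service account names here are placeholders:

```bash
# Look for WorkflowTaskResult activity in the wait container of one of the stuck pods
kubectl logs <pod-name> -c wait -n <workflow-namespace> | grep -i taskresult

# Check that the controller's service account is allowed to read WorkflowTaskResults
kubectl auth can-i list workflowtaskresults.argoproj.io \
  --as=system:serviceaccount:<argo-namespace>:<controller-serviceaccount> \
  -n <workflow-namespace>
```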
Here are some logs from the wait container:
It looks like it's using the wrong version. I'll look into that.
@Joibel It looks like that was the issue. I can delete some of those example templates to prevent cluttering up this thread, if you like. It looks like I'm running into a new error though: workflows are now attempting to use the service account of the namespace the workflow is running in to patch pods:
which wasn't what our RBAC was set up for in the previous version. Is this expected new behavior, or is there a configuration I need to set to make the controller use its own token for these API calls?
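Not an authoritative answer, but for comparison, a sketch of a namespace-scoped Role/RoleBinding along the lines of the documented executor RBAC, with pods patch added since that is the call failing above. Names and the bound service account are assumptions; check the Workflow RBAC docs for your version:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: executor              # placeholder name
  namespace: my-workflows     # namespace the workflow pods run in (assumption)
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["workflowtaskresults"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["patch"]          # only if your executor is still patching pods, as above
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: executor
  namespace: my-workflows
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: executor
subjects:
  - kind: ServiceAccount
    name: default             # the service account the workflow pods run as (assumption)
    namespace: my-workflows
```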
@sstaley-hioscar This is the root cause: the wait container is running an older version. When upgrading from version 3.5.1 or below directly to version 3.5.5 or above, if there are running workflows in the cluster, these workflows will get stuck in the Running state.
If you're upgrading from a version which does not record taskResultsCompletionStatus, you will hit this. It is a consequence of the choice made in #12537: the missing completion status blocks the controller from making any progress, and means upgrades over this boundary with running workflows will always fail to complete in-flight workflows.
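For anyone checking whether a stuck workflow is in this state, something like the following may help (field and label names as discussed in this thread; verify against your version):

```bash
# Per-node completion status tracked on the Workflow; missing or false entries
# keep the workflow in the Running phase
kubectl get wf <workflow-name> -n <namespace> \
  -o jsonpath='{.status.taskResultsCompletionStatus}'

# WorkflowTaskResults created by the wait containers for that workflow
kubectl get workflowtaskresults.argoproj.io -n <namespace> \
  -l workflows.argoproj.io/workflow=<workflow-name> --show-labels
```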
I found another issue: the task of the wf had been cleaned up, but the wf is still running.
The taskResultsCompletionStatus of the wf is:
The logs of the argo controller are:
I have analyzed this issue before, in #12103 (comment). According to my analysis, this will only affect the label handled in argo-workflows/workflow/controller/taskresult.go (lines 66 to 73 in d7495b8).
I can't think of a scenario where taskResultsCompletionStatus would be missing, since the WorkflowTaskResult is always created by the wait container.
I'll let Alan check in more detail; sorry I didn't go through the whole thread too closely, I just thought I'd answer an outstanding question I stumbled upon.
We are having a similar issue: when, in a fan-out, only a few jobs fail, the workflow stays in the Running state, despite the fact that the exit handler has been called.
We have seen similar problems. In our case, in a workflow with steps where at least one step retried, we see sporadic occurrences where all steps appear as completed successfully and the workflow shows as green in the UI, but the workflow itself never leaves the Running state. We have seen this behaviour occasionally, in sporadic workflows, ever since upgrading to 3.5.6.
In 3.4 and 3.5, #13454 is supposed to mitigate this and some other scenarios where a Pod may have gone away but the WorkflowTaskResults are not marked as completed. I'm unsure this is the last fix for these issues.
#13454 cannot fix issues like #13373 or the one mentioned in #12103 (comment); see argo-workflows/workflow/controller/operator.go, line 2349 in 2192344.
#12574 (comment) @shuangkun Can you take a look at this?
I don't understand why this issue is marked as only P3. Surely this should be treated as a Priority 1 issue?
@jswxstw I discussed your comment above and the OOM scenario with @isubasinghe last night during the contributor meeting, and he suspected that #13454 missed the scenario of a Pod that failed/errored but was not yet GC'd (so it exists, it's not yet absent). Retry scenarios might also need some specific code (Pod errored but the user has a retryStrategy).
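For context, a minimal sketch of the retry scenario being described, i.e. a template with a retryStrategy that retries on errored pods. Names and image are illustrative only:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-on-error-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "2"
        retryPolicy: OnError   # retry when the step errors (as opposed to fails)
      container:
        image: alpine:3.19
        command: [sh, -c, "echo hello"]
```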
@agilgur5 Pod errored but the user has a retryStrategy
#12993, which covered the general case and was mostly fixed by #13454, was a P1.
I have submitted a new PR, #13491, to fix this scenario.
We are having a similar issue with one of our workflows. It is a very simple container, but it is not finalizing even though it has completed. The output of the wait container shows this:
Update: I found out why our pods were hanging. We use Linkerd, and disabling sidecar injection on the workflow pods resolved it:

```yaml
metadata:
  annotations:
    linkerd.io/inject: disabled
```
The Pod wouldn't be Completed while the injected sidecar was still running; see the "Sidecar Injection" page of the docs.
We had a similar issue and used the following kill command for Argo Workflows:
With this pod annotation added to the workflow, the Linkerd proxy is killed after the workflow finishes.
Would you like to add that annotation to the examples in the "Sidecar Injection" page I mentioned above?
@zhucan Can you create a new issue for it?
@jswxstw sure
To be clear for other readers who find themselves here: this issue pre-dates the v3.5.3-v3.5.10 WorkflowTaskResults issues (i.e. #12993 et al). If you're here with a similar v3.5.x issue, that is different; please update to v3.5.11+. If you still have a bug, please file a separate issue with a reproduction. I've edited the issue description to mention this since we're getting several unrelated comments. I may also lock this issue as "Off-topic" as such, but leave it open. I believe only Approver+ could comment then, which would be suboptimal given the other contributors here, so I'd prefer not to do that.
edited by agilgur5: the symptoms of this issue are similar to issues in v3.5.3-v3.5.10, but are different. If you have similar symptoms in v3.5.x, please update to v3.5.11+. You likely had #12993 or similar v3.5.x issues.
If you still have similar symptoms after updating to v3.5.11+, please file a new bug report with a reproduction.
This bug was written against v3.4.11 and so has a different root cause than those issues, despite the similar symptoms.
Pre-requisites
What happened/what you expected to happen?
the workflow has been running for more than 20 hours, even though i have activeDeadlineSeconds set at 12 hours.
the workflow just has a single step, which also shows as 'running' in the argo ui. but looking at its logs shows that it has completed the code that i expect for that step and also shows
time="2023-10-29T00:53:09 UTC" level=info msg="sub-process exited" argo=true error="<nil>"
at the end of the main log. the pod itself for that step is in Completed state. there are other workflows that have completed as expected during this time, and no other workflows running right now. note this exact workflow has successfully run 1000s of times in the past, so i know my spec/permissions are correct.
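For illustration only, since the reporter's actual spec wasn't shared: a hypothetical single-step workflow of the shape described above, with a 12-hour activeDeadlineSeconds.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: single-step-
spec:
  entrypoint: main
  activeDeadlineSeconds: 43200   # 12 hours, as described above
  templates:
    - name: main
      container:
        image: alpine:3.19                 # placeholder image
        command: [sh, -c, "echo done"]     # placeholder for the real workload
```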
Version
3.4.11
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from in your workflow's wait container