Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resource type steps that timeout due to activeDeadlineSeconds reached make the workflow remain stuck in Running forever #13937

Open
3 of 4 tasks
alexxed opened this issue Nov 25, 2024 · 6 comments
Labels
area/templates/resource type/bug type/regression Regression from previous behavior (a specific type of bug)

Comments

@alexxed
Copy link

alexxed commented Nov 25, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Related to what #12985, #12544 that was supposed to fix, I see there are still bugs.

When a workflow step of type Resource times out due to activeDeadlineSeconds reached, the associated workflowtaskresult remains with workflows.argoproj.io/report-outputs-completed: "false" and blocks the completion of the workflow.

Version(s)

v3.6.0

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

Here's a bash script that you can run that reproduces the issue.

#!/bin/bash

cat << EOF > test-ns.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: test-ns
EOF

cat << EOF > test-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: test-ns
  name: test-pod
spec:
  containers:
    - name: app
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]
      command: ["/bin/sh", "-c", "sleep 600"]
      image: busybox:latest
EOF

cat << EOF > rbac.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete
  namespace: test-ns
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-delete
  namespace: test-ns
subjects:
- kind: ServiceAccount
  name: pod-delete
roleRef:
  kind: Role
  name: pod-delete
  apiGroup: rbac.authorization.k8s.io
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-delete
  namespace: test-ns
rules:
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - delete
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - watch
  - patch
  - delete
  - list
- apiGroups:
  - ""
  resources:
  - pods/log
  verbs:
  - get
  - watch
- apiGroups:
  - "argoproj.io"
  resources:
  - workflowtaskresults
  verbs:
  - create
  - patch
EOF

cat << EOF > workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  namespace: test-ns
  name: test-workflow
spec:
  ttlStrategy:
    secondsAfterCompletion: 600
    secondsAfterSuccess: 600
    secondsAfterFailure: 600
  serviceAccountName: pod-delete
  entrypoint: main
  templates:
  - name: main
    activeDeadlineSeconds: 20
    retryStrategy:
      limit: 2
      retryPolicy: "Always"
    resource:
      action: delete
      manifest: |
        apiVersion: v1
        kind: Pod
        metadata:
          namespace: test-ns
          name: test-pod
EOF

kubectl delete -f test-ns.yaml 
kubectl delete -f test-pod.yaml
kubectl delete -f rbac.yaml
kubectl delete -f workflow.yaml
kubectl apply -f test-ns.yaml 
kubectl apply -f rbac.yaml
kubectl apply -f test-pod.yaml 
sleep 10

argo submit --wait --log workflow.yaml --namespace test-ns &
ARGO_PID=$!
bash -c "while true; do kubectl get pods -n test-ns; sleep 5; done" &
GET_PODS_PID=$!
sleep 60
argo get test-workflow --namespace test-ns
kubectl get workflow test-workflow -n test-ns -o yaml
kubectl get workflowtaskresult -n test-ns -o yaml
kill -9 $ARGO_PID
kill -9 $GET_PODS_PID

Logs from the workflow controller

14:11:40.690Z msg="Processing workflow" Phase= ResourceVersion=1361850195 namespace=test-ns workflow=test-workflow
14:11:40.696Z msg="Task-result reconciliation" namespace=test-ns numObjs=0 workflow=test-workflow
14:11:40.696Z msg="Updated phase  -> Running" namespace=test-ns workflow=test-workflow
14:11:40.696Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=test-ns workflow=test-workflow
14:11:40.696Z msg="was unable to obtain node for , letting display name to be nodeName" namespace=test-ns workflow=test-workflow
14:11:40.696Z msg="Retry node test-workflow initialized Running" namespace=test-ns workflow=test-workflow
14:11:40.696Z msg="was unable to obtain node for , letting display name to be nodeName" namespace=test-ns workflow=test-workflow
14:11:40.696Z msg="Pod node test-workflow-111726068 initialized Pending" namespace=test-ns workflow=test-workflow
14:11:40.696Z" level=warning msg="couldn't get boundaryTemplate through nodeName test-workflow(0)" namespace=test-ns workflow=test-workflow
14:11:40.754Z msg="Created pod: test-workflow(0) (test-workflow-main-111726068)" namespace=test-ns workflow=test-workflow
14:11:40.754Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:11:40.754Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:11:40.766Z msg="Workflow update successful" namespace=test-ns phase=Running resourceVersion=1361850205 workflow=test-workflow
14:11:41.691Z msg="Processing workflow" Phase=Running ResourceVersion=1361850205 namespace=test-ns workflow=test-workflow
14:11:41.692Z msg="Task-result reconciliation" namespace=test-ns numObjs=0 workflow=test-workflow
14:11:41.692Z msg="node changed" namespace=test-ns new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=test-workflow-111726068 old.message= old.phase=Pending old.progress=0/1 workflow=test-workflow
14:11:41.692Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:11:41.692Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:11:41.701Z msg="Workflow update successful" namespace=test-ns phase=Running resourceVersion=1361850263 workflow=test-workflow
14:11:42.702Z msg="Processing workflow" Phase=Running ResourceVersion=1361850263 namespace=test-ns workflow=test-workflow
14:11:42.703Z msg="Task-result reconciliation" namespace=test-ns numObjs=0 workflow=test-workflow
14:11:42.703Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-111726068 workflow=test-workflow
14:11:42.703Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:11:42.703Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:11:45.518Z msg="Processing workflow" Phase=Running ResourceVersion=1361850263 namespace=test-ns workflow=test-workflow
14:11:45.519Z msg="Task-result reconciliation" namespace=test-ns numObjs=1 workflow=test-workflow
14:11:45.519Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-111726068 workflow=test-workflow
14:11:45.519Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:11:45.519Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:11:46.531Z msg="Processing workflow" Phase=Running ResourceVersion=1361850263 namespace=test-ns workflow=test-workflow
14:11:46.531Z msg="Task-result reconciliation" namespace=test-ns numObjs=1 workflow=test-workflow
14:11:46.532Z msg="node changed" namespace=test-ns new.message= new.phase=Running new.progress=0/1 nodeID=test-workflow-111726068 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=test-workflow
14:11:46.532Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:11:46.532Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:11:46.539Z msg="Workflow update successful" namespace=test-ns phase=Running resourceVersion=1361850439 workflow=test-workflow
14:11:47.541Z msg="Processing workflow" Phase=Running ResourceVersion=1361850439 namespace=test-ns workflow=test-workflow
14:11:47.542Z msg="Task-result reconciliation" namespace=test-ns numObjs=1 workflow=test-workflow
14:11:47.542Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-111726068 workflow=test-workflow
14:11:47.542Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:11:47.542Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:01.427Z msg="Processing workflow" Phase=Running ResourceVersion=1361850439 namespace=test-ns workflow=test-workflow
14:12:01.428Z msg="Task-result reconciliation" namespace=test-ns numObjs=1 workflow=test-workflow
14:12:01.428Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-111726068 workflow=test-workflow
14:12:01.429Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:01.429Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:04.596Z msg="Processing workflow" Phase=Running ResourceVersion=1361850439 namespace=test-ns workflow=test-workflow
14:12:04.597Z msg="Task-result reconciliation" namespace=test-ns numObjs=1 workflow=test-workflow
14:12:04.597Z msg="Pod failed: Pod was active on the node longer than the specified deadline" displayName="test-workflow(0)" namespace=test-ns pod=test-workflow-main-111726068 templateName=main workflow=test-workflow
14:12:04.597Z msg="node changed" namespace=test-ns new.message="Pod was active on the node longer than the specified deadline" new.phase=Failed new.progress=0/1 nodeID=test-workflow-111726068 old.message= old.phase=Running old.progress=0/1 workflow=test-workflow
14:12:04.597Z msg="Retry Policy: Always (onFailed: true, onError true)" namespace=test-ns workflow=test-workflow
14:12:04.598Z msg="1 child nodes of test-workflow failed. Trying again..." namespace=test-ns workflow=test-workflow
14:12:04.599Z msg="was unable to obtain node for , letting display name to be nodeName" namespace=test-ns workflow=test-workflow
14:12:04.599Z msg="Pod node test-workflow-3802551985 initialized Pending" namespace=test-ns workflow=test-workflow
14:12:04.599Z" level=warning msg="couldn't get boundaryTemplate through nodeName test-workflow(1)" namespace=test-ns workflow=test-workflow
14:12:04.654Z msg="Created pod: test-workflow(1) (test-workflow-main-3802551985)" namespace=test-ns workflow=test-workflow
14:12:04.654Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:04.654Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:04.674Z msg="Workflow update successful" namespace=test-ns phase=Running resourceVersion=1361851090 workflow=test-workflow
14:12:04.681Z msg="cleaning up pod" action=labelPodCompleted key=test-ns/test-workflow-main-111726068/labelPodCompleted
14:12:05.654Z msg="Processing workflow" Phase=Running ResourceVersion=1361851090 namespace=test-ns workflow=test-workflow
14:12:05.656Z msg="Task-result reconciliation" namespace=test-ns numObjs=1 workflow=test-workflow
14:12:05.656Z msg="node changed" namespace=test-ns new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=test-workflow-3802551985 old.message= old.phase=Pending old.progress=0/1 workflow=test-workflow
14:12:05.657Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:05.657Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:05.667Z msg="Workflow update successful" namespace=test-ns phase=Running resourceVersion=1361851171 workflow=test-workflow
14:12:06.657Z msg="Processing workflow" Phase=Running ResourceVersion=1361851171 namespace=test-ns workflow=test-workflow
14:12:06.659Z msg="Task-result reconciliation" namespace=test-ns numObjs=1 workflow=test-workflow
14:12:06.659Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:06.660Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:06.660Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:07.664Z msg="Processing workflow" Phase=Running ResourceVersion=1361851171 namespace=test-ns workflow=test-workflow
14:12:07.665Z msg="Task-result reconciliation" namespace=test-ns numObjs=1 workflow=test-workflow
14:12:07.665Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:07.666Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:07.666Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:08.667Z msg="Processing workflow" Phase=Running ResourceVersion=1361851171 namespace=test-ns workflow=test-workflow
14:12:08.668Z msg="Task-result reconciliation" namespace=test-ns numObjs=1 workflow=test-workflow
14:12:08.668Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:08.669Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:08.669Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:09.669Z msg="Processing workflow" Phase=Running ResourceVersion=1361851171 namespace=test-ns workflow=test-workflow
14:12:09.670Z msg="Task-result reconciliation" namespace=test-ns numObjs=1 workflow=test-workflow
14:12:09.670Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:09.670Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:09.670Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:10.671Z msg="Processing workflow" Phase=Running ResourceVersion=1361851171 namespace=test-ns workflow=test-workflow
14:12:10.672Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:10.672Z msg="node changed" namespace=test-ns new.message= new.phase=Running new.progress=0/1 nodeID=test-workflow-3802551985 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=test-workflow
14:12:10.672Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:10.672Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:10.682Z msg="Workflow update successful" namespace=test-ns phase=Running resourceVersion=1361851333 workflow=test-workflow
14:12:11.673Z msg="Processing workflow" Phase=Running ResourceVersion=1361851333 namespace=test-ns workflow=test-workflow
14:12:11.674Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:11.674Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:11.675Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:11.675Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:12.676Z msg="Processing workflow" Phase=Running ResourceVersion=1361851333 namespace=test-ns workflow=test-workflow
14:12:12.677Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:12.677Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:12.677Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:12.677Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:13.678Z msg="Processing workflow" Phase=Running ResourceVersion=1361851333 namespace=test-ns workflow=test-workflow
14:12:13.681Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:13.681Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:13.682Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:13.682Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:14.683Z msg="Processing workflow" Phase=Running ResourceVersion=1361851333 namespace=test-ns workflow=test-workflow
14:12:14.686Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:14.686Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:14.686Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:14.687Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:15.687Z msg="Processing workflow" Phase=Running ResourceVersion=1361851333 namespace=test-ns workflow=test-workflow
14:12:15.688Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:15.688Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:15.689Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:15.689Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:16.690Z msg="Processing workflow" Phase=Running ResourceVersion=1361851333 namespace=test-ns workflow=test-workflow
14:12:16.691Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:16.691Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:16.691Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:16.691Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:17.692Z msg="Processing workflow" Phase=Running ResourceVersion=1361851333 namespace=test-ns workflow=test-workflow
14:12:17.693Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:17.693Z msg="node changed" namespace=test-ns new.message= new.phase=Running new.progress=0/1 nodeID=test-workflow-3802551985 old.message= old.phase=Running old.progress=0/1 workflow=test-workflow
14:12:17.694Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:17.694Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:17.700Z msg="cleaning up pod" action=terminateContainers key=test-ns/test-workflow-main-3802551985/terminateContainers
14:12:17.703Z msg="Workflow update successful" namespace=test-ns phase=Running resourceVersion=1361851604 workflow=test-workflow
14:12:18.694Z msg="Processing workflow" Phase=Running ResourceVersion=1361851604 namespace=test-ns workflow=test-workflow
14:12:18.695Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:18.695Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:18.696Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:18.696Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:18.701Z msg="cleaning up pod" action=terminateContainers key=test-ns/test-workflow-main-3802551985/terminateContainers
14:12:19.697Z msg="Processing workflow" Phase=Running ResourceVersion=1361851604 namespace=test-ns workflow=test-workflow
14:12:19.699Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:19.701Z msg="node changed" namespace=test-ns new.message= new.phase=Succeeded new.progress=0/1 nodeID=test-workflow-3802551985 old.message= old.phase=Running old.progress=0/1 workflow=test-workflow
14:12:19.701Z msg="node test-workflow phase Running -> Succeeded" namespace=test-ns workflow=test-workflow
14:12:19.701Z msg="node test-workflow finished: 2024-11-25 14:12:19.701403602 +0000 UTC" namespace=test-ns workflow=test-workflow
14:12:19.701Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:19.701Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:19.710Z msg="Workflow update successful" namespace=test-ns phase=Running resourceVersion=1361851670 workflow=test-workflow
14:12:20.702Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:20.703Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:20.703Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:20.703Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:20.703Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:21.713Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:21.714Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:21.714Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:21.714Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:21.715Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:22.715Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:22.716Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:22.716Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:22.717Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:22.717Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:23.717Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:23.718Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:23.718Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:23.719Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:23.719Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:24.712Z msg="cleaning up pod" action=deletePod key=test-ns/test-workflow-main-3802551985/deletePod
14:12:24.720Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:24.722Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:24.723Z msg="node unchanged" namespace=test-ns nodeID=test-workflow-3802551985 workflow=test-workflow
14:12:24.723Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:24.723Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:25.723Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:25.727Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:25.728Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:25.728Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:26.728Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:26.728Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:26.729Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:26.729Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:27.730Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:27.730Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:27.731Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:27.731Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:28.732Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:28.733Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:28.733Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:28.734Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:29.734Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:29.735Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:29.736Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:29.736Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:30.736Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:30.738Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:30.738Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:30.738Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:31.738Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:31.739Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:31.740Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:31.740Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:32.741Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:32.742Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:32.743Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:32.743Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:33.744Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:33.745Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:33.745Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:33.745Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:34.748Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:34.749Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:34.749Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:34.749Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:35.750Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:35.751Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:35.752Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:35.752Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:36.752Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:36.753Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:36.754Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:36.754Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:37.754Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:37.755Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:37.755Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:37.755Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:38.756Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:38.758Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:38.759Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:38.759Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:39.760Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:39.761Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:39.761Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:39.761Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:40.762Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:40.764Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:40.764Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:40.764Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:41.765Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:41.766Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:41.766Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:41.766Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:42.766Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:42.768Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:42.768Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:42.768Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:43.769Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:43.770Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:43.770Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:43.770Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:44.771Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:44.772Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:44.772Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:44.772Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:45.773Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:45.774Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:45.774Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:45.775Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:46.776Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:46.778Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:46.778Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:46.778Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:47.701Z msg="cleaning up pod" action=killContainers key=test-ns/test-workflow-main-3802551985/killContainers
14:12:47.781Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:47.783Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:47.788Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:47.788Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:48.784Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:48.786Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:48.786Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:48.786Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:49.787Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:49.790Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:49.790Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:49.791Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow
14:12:50.791Z msg="Processing workflow" Phase=Running ResourceVersion=1361851670 namespace=test-ns workflow=test-workflow
14:12:50.792Z msg="Task-result reconciliation" namespace=test-ns numObjs=2 workflow=test-workflow
14:12:50.792Z msg="TaskSet Reconciliation" namespace=test-ns workflow=test-workflow
14:12:50.792Z msg=reconcileAgentPod namespace=test-ns workflow=test-workflow

Logs from in your workflow's wait container

Output from the bash script above

bash blocking-workflow.sh | /usr/local/bin/ts
16:11:23 namespace/test-ns created
16:11:26 serviceaccount/pod-delete created
16:11:27 rolebinding.rbac.authorization.k8s.io/pod-delete created
16:11:27 role.rbac.authorization.k8s.io/pod-delete created
16:11:29 pod/test-pod created
16:11:40 Name:                test-workflow
16:11:40 Namespace:           test-ns
16:11:40 ServiceAccount:      pod-delete
16:11:40 Status:              Pending
16:11:40 Created:             Mon Nov 25 16:11:40 +0200 (now)
16:11:40 Progress:
16:11:41 NAME                           READY   STATUS     RESTARTS   AGE
16:11:41 test-pod                       1/1     Running    0          12s
16:11:41 test-workflow-main-111726068   0/1     Init:0/1   0          1s
16:11:45 test-workflow-main-111726068: 14:11:45.028Z msg="Starting Workflow Executor" version=v3.6.0
16:11:45 test-workflow-main-111726068: 14:11:45.037Z msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
16:11:45 test-workflow-main-111726068: 14:11:45.037Z msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=test-ns podName=test-workflow-main-111726068 templateName=main version="&Version{Version:v3.6.0,BuildDate:2024-11-14T19:02:53Z,GitCommit:b26ed4aa4dee395844531efa4a76a022183bec22,GitTag:v3.6.0,GitTreeState:clean,GoVersion:go1.23.3,Compiler:gc,Platform:linux/amd64,}"
16:11:45 test-workflow-main-111726068: 14:11:45.053Z msg="Loading manifest to /tmp/manifest.yaml"
16:11:45 test-workflow-main-111726068: 14:11:45.053Z msg="kubectl delete --ignore-not-found -f /tmp/manifest.yaml -o name"
# 16:11:47 NAME                           READY   STATUS        RESTARTS   AGE
# 16:11:47 test-pod                       1/1     Terminating   0          18s
# 16:11:47 test-workflow-main-111726068   1/1     Running       0          7s
# 16:11:54 NAME                           READY   STATUS        RESTARTS   AGE
# 16:11:54 test-pod                       1/1     Terminating   0          25s
# 16:11:54 test-workflow-main-111726068   1/1     Running       0          14s
# 16:12:00 NAME                           READY   STATUS        RESTARTS   AGE
# 16:12:00 test-pod                       1/1     Terminating   0          31s
# 16:12:00 test-workflow-main-111726068   1/1     Running       0          20s
16:12:02 test-workflow-main-111726068: 14:12:01.990Z msg="sub-process exited" argo=true error="<nil>"
16:12:02 test-workflow-main-111726068: Error: exit status 1
# 16:12:07 NAME                            READY   STATUS        RESTARTS   AGE
# 16:12:07 test-pod                        1/1     Terminating   0          38s
# 16:12:07 test-workflow-main-111726068    0/1     Error         0          27s
# 16:12:07 test-workflow-main-3802551985   0/1     Init:0/1      0          3s
16:12:10 test-workflow-main-3802551985: 14:12:10.086Z msg="Starting Workflow Executor" version=v3.6.0
16:12:10 test-workflow-main-3802551985: 14:12:10.093Z msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
16:12:10 test-workflow-main-3802551985: 14:12:10.094Z msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=test-ns podName=test-workflow-main-3802551985 templateName=main version="&Version{Version:v3.6.0,BuildDate:2024-11-14T19:02:53Z,GitCommit:b26ed4aa4dee395844531efa4a76a022183bec22,GitTag:v3.6.0,GitTreeState:clean,GoVersion:go1.23.3,Compiler:gc,Platform:linux/amd64,}"
16:12:10 test-workflow-main-3802551985: 14:12:10.112Z msg="Loading manifest to /tmp/manifest.yaml"
16:12:10 test-workflow-main-3802551985: 14:12:10.112Z msg="kubectl delete --ignore-not-found -f /tmp/manifest.yaml -o name"
# 16:12:13 NAME                            READY   STATUS        RESTARTS   AGE
# 16:12:13 test-pod                        1/1     Terminating   0          44s
# 16:12:13 test-workflow-main-111726068    0/1     Error         0          33s
# 16:12:13 test-workflow-main-3802551985   1/1     Running       0          9s
16:12:17 test-workflow-main-3802551985: 14:12:16.931Z msg="sub-process exited" argo=true error="<nil>"
# 16:12:19 NAME                            READY   STATUS      RESTARTS   AGE
# 16:12:19 test-workflow-main-111726068    0/1     Error       0          39s
# 16:12:19 test-workflow-main-3802551985   0/1     Completed   0          15s
# 16:12:26 NAME                           READY   STATUS   RESTARTS   AGE
# 16:12:26 test-workflow-main-111726068   0/1     Error    0          46s
# 16:12:32 NAME                           READY   STATUS   RESTARTS   AGE
# 16:12:32 test-workflow-main-111726068   0/1     Error    0          52s
# 16:12:39 NAME                           READY   STATUS   RESTARTS   AGE
# 16:12:39 test-workflow-main-111726068   0/1     Error    0          59s

The workflow is still running, although its steps are not:

Name:                test-workflow
Namespace:           test-ns
ServiceAccount:      pod-delete
Status:              Running
Conditions:
 PodRunning          False
Created:             Mon Nov 25 16:11:40 +0200 (1 minute ago)
Started:             Mon Nov 25 16:11:40 +0200 (1 minute ago)
Duration:            1 minute 0 seconds
Progress:            1/2
ResourcesDuration:   1s*(1 cpu),27s*(100Mi memory)
16:12:40
STEP                   TEMPLATE  PODNAME                        DURATION  MESSAGE
 ✔ test-workflow       main
 ├─✖ test-workflow(0)  main      test-workflow-main-111726068   22s       Pod was active on the node longer than the specified deadline
 └─✔ test-workflow(1)  main      test-workflow-main-3802551985  12s
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  annotations:
    workflows.argoproj.io/pod-name-format: v2
  creationTimestamp: "2024-11-25T14:11:40Z"
  generation: 9
  labels:
    workflows.argoproj.io/completed: "false"
    workflows.argoproj.io/phase: Running
  name: test-workflow
  namespace: test-ns
  resourceVersion: "1361851670"
  uid: 62e84ef2-985c-489e-b32c-56cb4e404ab8
spec:
  arguments: {}
  entrypoint: main
  podGC:
    strategy: OnPodSuccess
  serviceAccountName: pod-delete
  templates:
  - activeDeadlineSeconds: 20
    inputs: {}
    metadata: {}
    name: main
    outputs: {}
    resource:
      action: delete
      manifest: |
        apiVersion: v1
        kind: Pod
        metadata:
          namespace: test-ns
          name: test-pod
    retryStrategy:
      limit: 2
      retryPolicy: Always
  tolerations:
  - key: kaas.acquia.io/pool
    operator: Equal
    value: jobs
  ttlStrategy:
    secondsAfterCompletion: 600
    secondsAfterFailure: 600
    secondsAfterSuccess: 600
status:
  artifactGCStatus:
    notSpecified: true
  artifactRepositoryRef:
    artifactRepository: {}
    default: true
  conditions:
  - status: "False"
    type: PodRunning
  finishedAt: null
  nodes:
    test-workflow:
      children:
      - test-workflow-111726068
      - test-workflow-3802551985
      displayName: test-workflow
      finishedAt: "2024-11-25T14:12:19Z"
      id: test-workflow
      name: test-workflow
      outputs:
        exitCode: "0"
      phase: Succeeded
      progress: 1/2
      resourcesDuration:
        cpu: 1
        memory: 27
      startedAt: "2024-11-25T14:11:40Z"
      templateName: main
      templateScope: local/test-workflow
      type: Retry
    test-workflow-111726068:
      displayName: test-workflow(0)
      finishedAt: "2024-11-25T14:12:02Z"
      hostNodeName: snip
      id: test-workflow-111726068
      message: Pod was active on the node longer than the specified deadline
      name: test-workflow(0)
      nodeFlag:
        retried: true
      outputs:
        exitCode: "1"
      phase: Failed
      progress: 0/1
      resourcesDuration:
        cpu: 1
        memory: 19
      startedAt: "2024-11-25T14:11:40Z"
      templateName: main
      templateScope: local/test-workflow
      type: Pod
    test-workflow-3802551985:
      displayName: test-workflow(1)
      finishedAt: "2024-11-25T14:12:16Z"
      hostNodeName: snip
      id: test-workflow-3802551985
      name: test-workflow(1)
      nodeFlag:
        retried: true
      outputs:
        exitCode: "0"
      phase: Succeeded
      progress: 1/1
      resourcesDuration:
        cpu: 0
        memory: 8
      startedAt: "2024-11-25T14:12:04Z"
      templateName: main
      templateScope: local/test-workflow
      type: Pod
  phase: Running
  progress: 1/2
  resourcesDuration:
    cpu: 1
    memory: 27
  startedAt: "2024-11-25T14:11:40Z"
  taskResultsCompletionStatus:
    test-workflow-111726068: false
    test-workflow-3802551985: true

workflowtaskresult objects, the one corresponding to the pod that timed out has workflows.argoproj.io/report-outputs-completed: "false"; if I set it to true, the workflow will finish

apiVersion: v1
items:
- apiVersion: argoproj.io/v1alpha1
  kind: WorkflowTaskResult
  metadata:
    creationTimestamp: "2024-11-25T14:11:45Z"
    generation: 1
    labels:
      workflows.argoproj.io/report-outputs-completed: "false"
      workflows.argoproj.io/workflow: test-workflow
    name: test-workflow-111726068
    namespace: test-ns
    ownerReferences:
    - apiVersion: argoproj.io/v1alpha1
      kind: Workflow
      name: test-workflow
      uid: 62e84ef2-985c-489e-b32c-56cb4e404ab8
    resourceVersion: "1361850398"
    uid: eeb3a1aa-0b47-4663-891d-32b74d8972ad
- apiVersion: argoproj.io/v1alpha1
  kind: WorkflowTaskResult
  metadata:
    creationTimestamp: "2024-11-25T14:12:10Z"
    generation: 1
    labels:
      workflows.argoproj.io/report-outputs-completed: "true"
      workflows.argoproj.io/workflow: test-workflow
    name: test-workflow-3802551985
    namespace: test-ns
    ownerReferences:
    - apiVersion: argoproj.io/v1alpha1
      kind: Workflow
      name: test-workflow
      uid: 62e84ef2-985c-489e-b32c-56cb4e404ab8
    resourceVersion: "1361851556"
    uid: 46f1a561-e4de-43d7-972f-e88e80de562e
kind: List
metadata:
  resourceVersion: ""
@jswxstw
Copy link
Member

jswxstw commented Nov 26, 2024

This issue is similar to #13373, #13491 fixed the problem caused by the abnormal exit of wait container, but the data/resource template does not use wait container, so this issue still exists.

@ramandeepsharma
Copy link

I am unable to replicate this issue on v3.5.4 version, however able to replicate it on current latest versionv3.6.0.

@blkperl blkperl added the type/regression Regression from previous behavior (a specific type of bug) label Nov 27, 2024
@jswxstw
Copy link
Member

jswxstw commented Nov 28, 2024

This issue cannot be consistently reproduced because main container can usually exit normally even after being terminated due to a timeout.
Therefore, I added a sleep in FinalizeOutput to simulate the scenario of an abnormal exit, and the results were consistent with my analysis above, I will submit a PR to fix it later.

@jswxstw
Copy link
Member

jswxstw commented Dec 2, 2024

#13798 should have fixed this issue, the workflow will eventually change to a Failed state (default is after 2 minutes).

@isubasinghe
Copy link
Member

@jswxstw bit of a tangent, I was curious if we could use k8s to drive the reconciliation instead, mostly just using ownerRefs to witness a finalizer being added in the task result. This feels like a more robust solution?

@jswxstw
Copy link
Member

jswxstw commented Dec 2, 2024

@isubasinghe I don't quite understand the solution you mentioned.
I think the fundamental issue is that when a pod fails or is deleted, we cannot know whether it successfully reported the results; it could have been interrupted at any point. How would your solution help this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/templates/resource type/bug type/regression Regression from previous behavior (a specific type of bug)
Projects
None yet
Development

No branches or pull requests

6 participants