
v3.5.8: workflow stuck in Running, but only pod exited with OOMKilled (exit code 137) #13373

Closed
4 tasks done
zhucan opened this issue Jul 22, 2024 · 17 comments · Fixed by #13491
Labels
area/controller (Controller issues, panics), type/bug, type/regression (Regression from previous behavior, a specific type of bug)

Comments

@zhucan

zhucan commented Jul 22, 2024

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

If the pod exited with "OOMKilled (exit code 137)" and was then cleaned up by the GC controller, we expect the status of the workflow to be "Error", not Running.

Version(s)

v3.5.8

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Set a resource limit on the Workflow, then have the step use more memory than the limit so that the pod is OOMKilled.
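
For illustration, a minimal sketch of such a Workflow (assumed details: the image, memory sizes, and the Python one-liner are illustrative, not the actual spec from this report). The main container tries to allocate far more memory than its 64Mi limit, so the kubelet terminates it with OOMKilled (exit code 137):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: oomkilled-repro-
spec:
  entrypoint: hog
  templates:
    - name: hog
      container:
        image: python:3.12-alpine
        command: [python, -c]
        args: ["x = bytearray(512 * 1024 * 1024); print(len(x))"]  # tries to allocate ~512Mi
        resources:
          limits:
            memory: 64Mi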

Logs from the workflow controller

time="2024-07-22T08:07:05.052Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1290040021 namespace=argo-simu-simulation-platform workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.069Z" level=info msg="Task-result reconciliation" namespace=argo-simu-simulation-platform numObjs=1 workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.072Z" level=debug msg="task result:\n&WorkflowTaskResult{ObjectMeta:{simulation-platform-new-scenario-183614-1557733-prod-fqjrw  argo-simu-simulation-platform  9e46d1e0-3ccf-4c7d-92e5-e252fb8fa700 1290022481 1 2024-07-20 10:35:55 +0000 UTC <nil> <nil> map[workflows.argoproj.io/controller-instanceid:argo-simu workflows.argoproj.io/report-outputs-completed:false workflows.argoproj.io/workflow:simulation-platform-new-scenario-183614-1557733-prod-fqjrw] map[] [{argoproj.io/v1alpha1 Workflow simulation-platform-new-scenario-183614-1557733-prod-fqjrw 90e473b7-57f1-4fbe-b044-96352b232605 <nil> <nil>}] []  [{argoexec Update argoproj.io/v1alpha1 2024-07-20 10:35:55 +0000 UTC FieldsV1 {\"f:metadata\":{\"f:labels\":{\".\":{},\"f:workflows.argoproj.io/controller-instanceid\":{},\"f:workflows.argoproj.io/report-outputs-completed\":{},\"f:workflows.argoproj.io/workflow\":{}},\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"90e473b7-57f1-4fbe-b044-96352b232605\\\"}\":{}}}} }]},NodeResult:NodeResult{Phase:,Message:,Outputs:nil,Progress:,},}" namespace=argo-simu-simulation-platform workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.072Z" level=debug msg="task result name:\nsimulation-platform-new-scenario-183614-1557733-prod-fqjrw" namespace=argo-simu-simulation-platform workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.076Z" level=debug msg="Marking task result incomplete simulation-platform-new-scenario-183614-1557733-prod-fqjrw" namespace=argo-simu-simulation-platform workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.076Z" level=info msg="task-result changed" namespace=argo-simu-simulation-platform nodeID=simulation-platform-new-scenario-183614-1557733-prod-fqjrw workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.076Z" level=debug msg="Skipping artifact GC" namespace=argo-simu-simulation-platform workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.076Z" level=debug msg="Evaluating node simulation-platform-new-scenario-183614-1557733-prod-fqjrw: template: *v1alpha1.WorkflowStep (run-gpu), boundaryID: " namespace=argo-simu-simulation-platform workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.076Z" level=debug msg="Resolving the template" base="*v1alpha1.Workflow (namespace=argo-simu-simulation-platform,name=simulation-platform-new-scenario-183614-1557733-prod-fqjrw)" tmpl="*v1alpha1.WorkflowStep (run-gpu)"
time="2024-07-22T08:07:05.076Z" level=debug msg="Getting the template" base="*v1alpha1.Workflow (namespace=argo-simu-simulation-platform,name=simulation-platform-new-scenario-183614-1557733-prod-fqjrw)" tmpl="*v1alpha1.WorkflowStep (run-gpu)"
time="2024-07-22T08:07:05.076Z" level=debug msg="Getting the template by name: run-gpu" base="*v1alpha1.Workflow (namespace=argo-simu-simulation-platform,name=simulation-platform-new-scenario-183614-1557733-prod-fqjrw)" tmpl="*v1alpha1.WorkflowStep (run-gpu)"
time="2024-07-22T08:07:05.077Z" level=debug msg="Node simulation-platform-new-scenario-183614-1557733-prod-fqjrw already completed" namespace=argo-simu-simulation-platform workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.077Z" level=info msg="TaskSet Reconciliation" namespace=argo-simu-simulation-platform workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.077Z" level=info msg=reconcileAgentPod namespace=argo-simu-simulation-platform workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.077Z" level=debug msg="Task results completion status: map[simulation-platform-new-scenario-183614-1557733-prod-fqjrw:false]" namespace=argo-simu-simulation-platform workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.077Z" level=debug msg="taskresults of workflow are incomplete or still have daemon nodes, so can't mark workflow completed" fromPhase=Running namespace=argo-simu-simulation-platform toPhase=Error workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw
time="2024-07-22T08:07:05.085Z" level=info msg="Workflow update successful" namespace=argo-simu-simulation-platform phase=Running resourceVersion=1290040021 workflow=simulation-platform-new-scenario-183614-1557733-prod-fqjrw

Logs from in your workflow's wait container

The errored pod was already cleaned up by the GC controller.
@zhucan
Author

zhucan commented Jul 22, 2024

If the status of the task's node is Error, we need to treat the task result as completed, like this:

func (ws *WorkflowStatus) TaskResultsInProgress() bool {
	for taskName, value := range ws.TaskResultsCompletionStatus {
		if node, ok := ws.Nodes[taskName]; ok {
			if node.Phase == NodeError || node.Phase == NodeFailed {
				return false
			}
		}
		if !value {
			return true
		}
	}
	return false
}
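
For illustration only, a self-contained toy version of this proposal (the types here are simplified stand-ins; the real WorkflowStatus, Nodes, and NodePhase live in the Argo codebase and differ in detail). It shows that, with the early return above, an errored node whose task result was never reported no longer keeps the workflow from being marked completed:

package main

import "fmt"

// Simplified stand-ins for the Argo types referenced in the proposal.
type NodePhase string

const (
	NodeError  NodePhase = "Error"
	NodeFailed NodePhase = "Failed"
)

type NodeStatus struct{ Phase NodePhase }

type WorkflowStatus struct {
	Nodes                       map[string]NodeStatus
	TaskResultsCompletionStatus map[string]bool
}

// Mirrors the proposed change: a node that ended in Error/Failed stops blocking completion.
func (ws *WorkflowStatus) TaskResultsInProgress() bool {
	for taskName, value := range ws.TaskResultsCompletionStatus {
		if node, ok := ws.Nodes[taskName]; ok {
			if node.Phase == NodeError || node.Phase == NodeFailed {
				return false
			}
		}
		if !value {
			return true
		}
	}
	return false
}

func main() {
	ws := WorkflowStatus{
		Nodes:                       map[string]NodeStatus{"oom-node": {Phase: NodeError}},
		TaskResultsCompletionStatus: map[string]bool{"oom-node": false}, // wait container never reported completion
	}
	// Prints false with the proposed early return; without it, the stale false entry
	// would keep returning true and the workflow would stay Running.
	fmt.Println(ws.TaskResultsInProgress())
}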

@zhucan
Author

zhucan commented Jul 22, 2024

[screenshot]

@jswxstw
Member

jswxstw commented Jul 22, 2024

Same as #12993.

@agilgur5 agilgur5 changed the title from "workflow stuck in Running state, even though the only pod for it is exited with 'OOMKilled (exit code 137)'" to "workflow stuck in Running, but only pod exited with OOMKilled (exit code 137)" Jul 23, 2024
@agilgur5 agilgur5 added the area/controller (Controller issues, panics) label Jul 23, 2024
@agilgur5

agilgur5 commented Jul 23, 2024

Yea same root cause as #12993, so let's consolidate the two

Also, for back-link reference, this issue's title refers to #12103. #12103 (comment) refers to an OOM as well, but unclear that those two are the same issue, might just be related symptoms

Set a resource limit on the Workflow, then have the step use more memory than the limit so that the pod is OOMKilled.

Also a minimal, reproducible Workflow is nonetheless required for issues, even if very simple. Otherwise someone else has to write one from scratch in order to reproduce. When one is provided, it is a simple copy+paste, confirm, and debug.

@agilgur5 agilgur5 added the solution/superseded (This PR or issue has been superseded by another one, slightly different from a duplicate) and type/regression (Regression from previous behavior, a specific type of bug) labels Jul 23, 2024
@jswxstw
Member

jswxstw commented Jul 26, 2024

This issue seems to be a little different: the pod state is already Error, so taskResultIncomplete should be false and the workflow shouldn't get stuck 🤔.

// Check whether its taskresult is in an incompleted state.
if newState.Succeeded() && woc.wf.Status.IsTaskResultIncomplete(node.ID) {
	woc.log.WithFields(log.Fields{"nodeID": newState.ID}).Debug("Taskresult of the node not yet completed")
	taskResultIncomplete = true
	return
}
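
As a side note, a tiny self-contained sketch of the guard quoted above (assumed, simplified semantics; not the actual controller code): because the condition requires the node to have Succeeded, an Error node never sets taskResultIncomplete here, even though its completion-map entry is still false.

package main

import "fmt"

// Simplified model of the quoted condition: taskResultIncomplete only becomes true
// for a Succeeded node whose completion-map entry is still false.
func taskResultIncomplete(succeeded bool, completion map[string]bool, nodeID string) bool {
	done, found := completion[nodeID]
	return succeeded && found && !done
}

func main() {
	// Entry mirrors the map[...:false] visible in the controller log in the issue body;
	// the wait container was OOMKilled before it could report completion.
	completion := map[string]bool{"simulation-platform-new-scenario-183614-1557733-prod-fqjrw": false}
	fmt.Println(taskResultIncomplete(false, completion, "simulation-platform-new-scenario-183614-1557733-prod-fqjrw")) // false: the node errored, so the guard is skipped
}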

@jswxstw
Member

jswxstw commented Aug 20, 2024

@zhucan @agilgur5 I think this issue should be reopened.
#12993 has been closed by #13454, but this issue has not been solved, as analyzed in #12103 (comment).

@jswxstw
Member

jswxstw commented Aug 21, 2024

@zhucan @agilgur5 I think this issue should be reopened. #12993 has been closed by #13454, but this issue has not been solved, as analyzed in #12103 (comment).

@yonirab #13454 did not fix this issue, so this issue may be reopened; I'm working on it.

@yonirab
Contributor

yonirab commented Aug 21, 2024

Thanks @jswxstw !

With 3.5.5 we are seeing multiple scenarios in which a workflow with one or more steps that retry one or more times (e.g. due to OOMKilled) and eventually succeed shows as green in the UI (both the workflow itself and the steps that retried), but is still reported as Running, e.g. in the list of workflows at /workflows in the UI or via kubectl -n argo get workflow.

This seems to be a serious regression in 3.5, and it causes major problems for any downstream systems that monitor the status of Argo workflows.

@agilgur5 agilgur5 reopened this Aug 21, 2024
@agilgur5

agilgur5 commented Aug 21, 2024

@zhucan @agilgur5 I think this issue should be reopened. #12993 has been closed by #13454, but this issue has not been solved, as analyzed in #12103 (comment).

@jswxstw You should have permission to re-open yourself since you're a Member these days. Feel free to do so if you have a strong suspicion like this

@agilgur5 agilgur5 removed the solution/superseded (This PR or issue has been superseded by another one, slightly different from a duplicate) label Aug 21, 2024
@jswxstw
Member

jswxstw commented Aug 21, 2024

It’s strange that after becoming a member, I don’t seem to have gained any new permissions.

@yonirab
Contributor

yonirab commented Aug 22, 2024

@jswxstw I'm attaching some more info about our scenario (which is not something I can easily reproduce).

The Argo UI shows the workflow as green, with one step that was OOMKilled on its first attempt and then retried successfully:

[screenshot: argo-ui-wf-41e31955e6]

The same workflow appears with a blue circle in the UI workflows list:

[screenshot]

Expanding the workflow in the UI workflows list shows the following:

[screenshot]

Querying the workflow from the argo cli gives the following:

$ argo get wf-41e31955e6
Name:                wf-41e31955e6
Namespace:           default
ServiceAccount:      unset (will run with the default ServiceAccount)
Status:              Running
Conditions:
 PodRunning          False
Created:             Thu Aug 22 09:28:00 +0300 (6 hours ago)
Started:             Thu Aug 22 09:28:00 +0300 (6 hours ago)
Duration:            6 hours 30 minutes
Progress:            1/2
ResourcesDuration:   697h45m25s*(100Mi memory),1h15m4s*(1 cpu)

STEP                          TEMPLATE       PODNAME                   DURATION  MESSAGE
 ✔ wf-41e31955e6(0)           wf
 └───✔ single-cell-root       run-task-ev-1
     ├─✖ single-cell-root(0)  run-task-ev-1  wf-41e31955e6-3397268171  8m        OOMKilled (exit code 137)
     └─✔ single-cell-root(1)  run-task-ev-1  wf-41e31955e6-4001409550  6h

Similarly, kubectl shows the workflow as Running:

$ kubectl get workflows wf-41e31955e6
NAME            STATUS    AGE     MESSAGE
wf-41e31955e6   Running   6h41m

Here are logs from the workflow-controller that contain wf-41e31955e6, from around the time that the retry succeeded.
Posting all the workflow-controller logs is impossible (this is from a production system with a lot of traffic):

ERROR 2024-08-22T12:45:53.578220496Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.578Z" level=info msg="task-result changed" namespace=default nodeID=wf-41e31955e6-3397268171 workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.578306314Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.578Z" level=info msg="node changed" namespace=default new.message= new.phase=Succeeded new.progress=0/1 nodeID=wf-41e31955e6-4001409550 old.message= old.phase=Running old.progress=0/1 workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.579737185Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.579Z" level=info msg="node wf-41e31955e6-2785872888 phase Running -> Succeeded" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.579746953Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.579Z" level=info msg="node wf-41e31955e6-2785872888 finished: 2024-08-22 12:45:53.579674813 +0000 UTC" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.579750923Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.579Z" level=info msg="Step group node wf-41e31955e6-1948630224 successful" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.579753485Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.579Z" level=info msg="node wf-41e31955e6-1948630224 phase Running -> Succeeded" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.579755874Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.579Z" level=info msg="node wf-41e31955e6-1948630224 finished: 2024-08-22 12:45:53.579711266 +0000 UTC" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.579760798Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.579Z" level=info msg="Outbound nodes of wf-41e31955e6-2785872888 is [wf-41e31955e6-4001409550]" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.579763304Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.579Z" level=info msg="Outbound nodes of wf-41e31955e6-284159950 is [wf-41e31955e6-4001409550]" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.579765929Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.579Z" level=info msg="node wf-41e31955e6-284159950 phase Running -> Succeeded" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.579790489Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.579Z" level=info msg="node wf-41e31955e6-284159950 finished: 2024-08-22 12:45:53.579752442 +0000 UTC" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.580059975Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.579Z" level=info msg="node wf-41e31955e6 phase Running -> Succeeded" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.580069677Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.580Z" level=info msg="node wf-41e31955e6 finished: 2024-08-22 12:45:53.58000251 +0000 UTC" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.580073236Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.580Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.580075832Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.580Z" level=info msg=reconcileAgentPod namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.591732481Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.591Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=1638411713 workflow=wf-41e31955e6
ERROR 2024-08-22T12:45:53.599138838Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:45:53.599Z" level=info msg="cleaning up pod" action=labelPodCompleted key=default/wf-41e31955e6-4001409550/labelPodCompleted
ERROR 2024-08-22T12:46:53.594341387Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:46:53.594Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1638411713 namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:46:53.595875226Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:46:53.595Z" level=info msg="Task-result reconciliation" namespace=default numObjs=2 workflow=wf-41e31955e6
ERROR 2024-08-22T12:46:53.595911115Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:46:53.595Z" level=info msg="task-result changed" namespace=default nodeID=wf-41e31955e6-3397268171 workflow=wf-41e31955e6
ERROR 2024-08-22T12:46:53.595929980Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:46:53.595Z" level=info msg="task-result changed" namespace=default nodeID=wf-41e31955e6-4001409550 workflow=wf-41e31955e6
ERROR 2024-08-22T12:46:53.596235226Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:46:53.596Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:46:53.596257572Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:46:53.596Z" level=info msg=reconcileAgentPod namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:46:53.606740237Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:46:53.606Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=1638411713 workflow=wf-41e31955e6
ERROR 2024-08-22T12:48:50.829705516Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:48:50.829Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1638411713 namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:48:50.831185654Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:48:50.831Z" level=info msg="Task-result reconciliation" namespace=default numObjs=2 workflow=wf-41e31955e6
ERROR 2024-08-22T12:48:50.831210913Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:48:50.831Z" level=info msg="task-result changed" namespace=default nodeID=wf-41e31955e6-3397268171 workflow=wf-41e31955e6
ERROR 2024-08-22T12:48:50.831214696Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:48:50.831Z" level=info msg="task-result changed" namespace=default nodeID=wf-41e31955e6-4001409550 workflow=wf-41e31955e6
ERROR 2024-08-22T12:48:50.831516498Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:48:50.831Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:48:50.831530286Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:48:50.831Z" level=info msg=reconcileAgentPod namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:48:50.841673446Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:48:50.841Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=1638411713 workflow=wf-41e31955e6
ERROR 2024-08-22T12:58:42.852848110Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:58:42.852Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1638411713 namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:58:42.854809261Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:58:42.854Z" level=info msg="Task-result reconciliation" namespace=default numObjs=2 workflow=wf-41e31955e6
ERROR 2024-08-22T12:58:42.854853455Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:58:42.854Z" level=info msg="task-result changed" namespace=default nodeID=wf-41e31955e6-4001409550 workflow=wf-41e31955e6
ERROR 2024-08-22T12:58:42.854859187Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:58:42.854Z" level=info msg="task-result changed" namespace=default nodeID=wf-41e31955e6-3397268171 workflow=wf-41e31955e6
ERROR 2024-08-22T12:58:42.855192787Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:58:42.855Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:58:42.855212567Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:58:42.855Z" level=info msg=reconcileAgentPod namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T12:58:42.865836261Z [resource.labels.containerName: workflow-controller] time="2024-08-22T12:58:42.865Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=1638411713 workflow=wf-41e31955e6
ERROR 2024-08-22T13:01:54.403939859Z [resource.labels.containerName: workflow-controller] time="2024-08-22T13:01:54.403Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=1638411713 namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T13:01:54.405429011Z [resource.labels.containerName: workflow-controller] time="2024-08-22T13:01:54.405Z" level=info msg="Task-result reconciliation" namespace=default numObjs=2 workflow=wf-41e31955e6
ERROR 2024-08-22T13:01:54.405444743Z [resource.labels.containerName: workflow-controller] time="2024-08-22T13:01:54.405Z" level=info msg="task-result changed" namespace=default nodeID=wf-41e31955e6-3397268171 workflow=wf-41e31955e6
ERROR 2024-08-22T13:01:54.405448177Z [resource.labels.containerName: workflow-controller] time="2024-08-22T13:01:54.405Z" level=info msg="task-result changed" namespace=default nodeID=wf-41e31955e6-4001409550 workflow=wf-41e31955e6
ERROR 2024-08-22T13:01:54.405832945Z [resource.labels.containerName: workflow-controller] time="2024-08-22T13:01:54.405Z" level=info msg="TaskSet Reconciliation" namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T13:01:54.405866637Z [resource.labels.containerName: workflow-controller] time="2024-08-22T13:01:54.405Z" level=info msg=reconcileAgentPod namespace=default workflow=wf-41e31955e6
ERROR 2024-08-22T13:01:54.415177571Z [resource.labels.containerName: workflow-controller] time="2024-08-22T13:01:54.415Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=1638411713 workflow=wf-41e31955e6

I don't see any logs at all matching "taskresults of workflow are incomplete or still have daemon nodes, so can't mark workflow completed"

@jswxstw
Member

jswxstw commented Aug 22, 2024

@yonirab Can you describe the pod wf-41e31955e6-3397268171?

I don't see any logs at all matching "taskresults of workflow are incomplete or still have daemon nodes, so can't mark workflow completed"

This log is only emitted at debug level, so it won't appear unless the controller is running with --loglevel debug.

@yonirab
Contributor

yonirab commented Aug 22, 2024

@yonirab Can you describe the pod wf-41e31955e6-3397268171?

Unfortunately, that pod seems to be gone:

$ kubectl describe pod wf-41e31955e6-3397268171
Error from server (NotFound): pods "wf-41e31955e6-3397268171" not found

I can describe the retry pod, though:

Name:             wf-41e31955e6-4001409550
Namespace:        default
Priority:         0
Service Account:  data-access-public
Node:             gke-cyto-cc-producti-n1-m208-c32-s150-90ef6078-w4f4/10.132.0.125
Start Time:       Thu, 22 Aug 2024 09:37:10 +0300
Labels:           cyto-cc.io/root-owner=tom.kaufman
                  cyto-cc.io/root-workflow-id=wf-41e31955e6
                  cyto-cc.io/scope=public
                  cyto-cc.io/task-id=0
                  cyto-cc.io/task-name=single-cell-root
                  cyto-cc.io/workflow-id=wf-41e31955e6
                  workflows.argoproj.io/completed=true
                  workflows.argoproj.io/workflow=wf-41e31955e6
Annotations:      cyto-cc.io/cyto-cc-ui: https://cyto-cc.cytoreason.com/workflow/wf-41e31955e6/0
                  kubectl.kubernetes.io/default-container: main
                  kubernetes.io/limit-ranger:
                    LimitRanger plugin set: cpu request for container wait; cpu request for init container init; cpu request for init container preparedata
                  workflows.argoproj.io/node-id: wf-41e31955e6-4001409550
                  workflows.argoproj.io/node-name: wf-41e31955e6(0)[0].single-cell-root(1)
Status:           Succeeded
IP:               10.4.1.6
IPs:
  IP:           10.4.1.6
Controlled By:  Workflow/wf-41e31955e6
Init Containers:
  init:
    Container ID:  containerd://f79a5e4979a01c7105c254d751dd8f054b38cfdaceb224e8ee811aa3b1203831
    Image:         quay.io/argoproj/argoexec:v3.5.6
    Image ID:      quay.io/argoproj/argoexec@sha256:c7405360797347aee20cf252c2c0cbed045077e58bf572042e118acefc74e94e
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      init
      --loglevel
      info
      --log-format
      text
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 22 Aug 2024 09:37:10 +0300
      Finished:     Thu, 22 Aug 2024 09:37:10 +0300
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  11Gi
    Environment:
      ARGO_POD_NAME:                      wf-41e31955e6-4001409550 (v1:metadata.name)
      ARGO_POD_UID:                        (v1:metadata.uid)
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 wf-41e31955e6
      ARGO_WORKFLOW_UID:                  2c1af2ae-955b-472f-8a08-5193ce2ec941
      ARGO_CONTAINER_NAME:                init
      ARGO_TEMPLATE:                      {"name":"run-task-ev-1","inputs":{"parameters":[{"name":"task-id","value":"0"},{"name":"workflow-id","value":"wf-41e31955e6"},{"name":"root-owner","value":"tom.kaufman"},{"name":"cyto-cc-ui","value":"https://cyto-cc.cytoreason.com/workflow/wf-41e31955e6/0"},{"name":"in-files","value":"cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST"},{"name":"in-output-files","value":"undefined"},{"name":"in-shared-asset-files","value":""},{"name":"output-dir","value":"./output/"},{"name":"image","value":"eu.gcr.io/cytoreason/cd-singlecell-controller@sha256:892be327d683c07dcfc683051ebe10124d441ca6d4e3f7bfaa8b8a6bfa912c51"},{"name":"cmd","value":"python src/api.py run"},{"name":"memreqnum","value":"1"},{"name":"memrequnit","value":"Gi"},{"name":"initmemreqnum","value":"1"},{"name":"initmemrequnit","value":"Gi"},{"name":"cpureq","value":"100m"},{"name":"numgpus","value":"0"},{"name":"gputype","value":"0"}]},"outputs":{"artifacts":[{"name":"out-dir","path":"./output/","gcs":{"bucket":"cyto_cc","serviceAccountKeySecret":{"name":"storage-secrets","key":"cyto-cc-storage-service-account-keys.json"},"key":"cyto-cc-production/workflows/wf-41e31955e6/outputs/0"},"archive":{"none":{}}}]},"nodeSelector":{"nodetype":"WORKFLOW"},"metadata":{"annotations":{"cyto-cc.io/cyto-cc-ui":"https://cyto-cc.cytoreason.com/workflow/wf-41e31955e6/0"},"labels":{"cyto-cc.io/root-owner":"tom.kaufman","cyto-cc.io/root-workflow-id":"wf-41e31955e6","cyto-cc.io/scope":"public","cyto-cc.io/task-id":"0","cyto-cc.io/task-name":"single-cell-root","cyto-cc.io/workflow-id":"wf-41e31955e6"}},"container":{"name":"","image":"eu.gcr.io/cytoreason/cd-singlecell-controller@sha256:892be327d683c07dcfc683051ebe10124d441ca6d4e3f7bfaa8b8a6bfa912c51","command":["sh","-c"],"args":["echo \"$(date +%FT%T%Z): starting command for $CURRENT_WORKFLOW_ID/$CURRENT_TASK_ID ($CURRENT_TASK_NAME) in pod $POD_NAME (retry $CURRENT_RETRY)\"\nOUTDIR=\"$(dirname ./output/)/$(basename ./output/)\";\nmkdir -p ./output/;\npython src/api.py run 2\u003e\u00261\nexit_code=$?\necho \"$(date +%FT%T%Z): finished command for $CURRENT_WORKFLOW_ID/$CURRENT_TASK_ID ($CURRENT_TASK_NAME) in pod $POD_NAME (retry $CURRENT_RETRY), exit_code=$exit_code\"\nexit 
$exit_code\n"],"envFrom":[{"configMapRef":{"name":"cr-configmap"}}],"env":[{"name":"CLIENT_ID","valueFrom":{"secretKeyRef":{"name":"storage-secrets","key":"CLIENT_ID_public"}}},{"name":"CLIENT_SECRET","valueFrom":{"secretKeyRef":{"name":"storage-secrets","key":"CLIENT_SECRET_public"}}},{"name":"CLIENT","value":"Machine"},{"name":"TENANT","value":"Production"},{"name":"SERVER_URL","value":"https://cyto-cc.cytoreason.com"},{"name":"SERVICE_ACCOUNT","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"GCS_AUTH_FILE","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"CREATED_BY_WORKFLOW","value":"wf-41e31955e6"},{"name":"GOOGLE_APPLICATION_CREDENTIALS","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"DATA_ACCESS","value":"public"},{"name":"POD_NAME","value":"wf-41e31955e6-4001409550"},{"name":"CURRENT_RETRY","value":"1"},{"name":"CURRENT_WORKFLOW_ID","value":"wf-41e31955e6"},{"name":"CURRENT_TASK_ID","value":"0"},{"name":"CURRENT_TASK_NAME","value":"single-cell-root"}],"resources":{},"volumeMounts":[{"name":"cyto-cc","mountPath":"/cyto_cc"},{"name":"storage-credentials","readOnly":true,"mountPath":"/secrets/storage"}],"imagePullPolicy":"IfNotPresent"},"initContainers":[{"name":"preparedata","image":"google/cloud-sdk:latest","command":["bash","-c"],"args":["export CYTO_CC_DIR=/cyto_cc\nexport INPUTS_ROOTDIR=\"$CYTO_CC_DIR\"/inputs\nmkdir -pv \"$INPUTS_ROOTDIR\"\nexport CACHE_ROOTDIR=\"$CYTO_CC_DIR\"/cache\nmkdir -pv \"$CACHE_ROOTDIR\"\n\ngcloud -v\ngsutil version -l\n\nif [[ \"cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST\" == *\"INPUT_FILES_LIST\" ]]; then\n  echo \"Testing for local inputs list file at gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST...\"\n  gsutil -q stat gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST\n  if [ $? -eq 0 ]; then\n      INPUT_FILES_LIST=/tmp/INPUT_FILES_LIST\n      echo \"Copying input files list from bucket to $INPUT_FILES_LIST...\"\n      gsutil cp gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST $INPUT_FILES_LIST\n      echo \"Listing $INPUT_FILES_LIST...\"\n      ls -l $INPUT_FILES_LIST\n      echo \"Printing contents of $INPUT_FILES_LIST...\"\n      cat $INPUT_FILES_LIST\n      echo \"Looping over contents of $INPUT_FILES_LIST...\"\n      cat $INPUT_FILES_LIST | awk -F, -v rootdir=\"$INPUTS_ROOTDIR\" '{ \\\n        printf(\"echo input file record: %s\\n\",$0); \\\n        printf(\"gsutil ls -l gs://cyto_cc/%s\\n\",$1); \\\n        printf(\"gsutil cp gs://cyto_cc/%s %s/%s\\n\",$1,rootdir,$2); \\\n      }' | bash\n  else\n      echo \"Error testing for local inputs file, exiting...\"\n      exit 126\n  fi\nfi\n\necho \"Copying shared asset input files...\"\nfor in_file in ; do \n  IFS=',' read item1 item2 \u003c\u003c\u003c \"$in_file\"\n  gsutil -m cp -r \"$item1\" \"$INPUTS_ROOTDIR\"/\"$item2\";\ndone;\n\nif [[ \"undefined\" == *\"INPUT_OUTPUT_FILES_LIST\" ]]; then\n  echo \"Testing for output inputs list file at gs://cyto_cc/undefined...\"\n  gsutil -q stat gs://cyto_cc/undefined\n  if [ $? 
-eq 0 ]; then\n      INPUT_OUTPUT_FILES_LIST=/tmp/INPUT_OUTPUT_FILES_LIST\n      echo \"Copying output inputs files list from bucket to $INPUT_OUTPUT_FILES_LIST...\"\n      gsutil cp gs://cyto_cc/undefined $INPUT_OUTPUT_FILES_LIST\n      echo \"Listing $INPUT_OUTPUT_FILES_LIST...\"\n      ls -l $INPUT_OUTPUT_FILES_LIST\n      echo \"Printing contents of $INPUT_OUTPUT_FILES_LIST...\"\n      cat $INPUT_OUTPUT_FILES_LIST\n      echo \"Looping over contents of $INPUT_OUTPUT_FILES_LIST...\"\n      for in_file in $(cat $INPUT_OUTPUT_FILES_LIST); do \n        IFS=',' read source target_dir \u003c\u003c\u003c \"$in_file\"\n        mkdir -pv \"$INPUTS_ROOTDIR\"/\"$target_dir\";\n        gsutil -m cp -r \"$source\" \"$INPUTS_ROOTDIR\"/\"$target_dir\";\n        dir=$(basename \"$source\");\n        inputs_dir=$(dirname \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\");\n        mv -v \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\"/* \"$inputs_dir\";\n        rmdir -v \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\";\n        wf_id=$(echo \"$source\" | awk -F/ '{print $6}');\n        t_id=$(echo \"$source\" | awk -F/ '{print $8}');\n        outputs_cache=\"$CACHE_ROOTDIR\"/\"$wf_id\"/outputs;\n        mkdir -pv \"$outputs_cache\";\n        ln -s \"$inputs_dir\" \"$outputs_cache\"/\"$t_id\";\n      done;                  \n  else\n    echo \"Error testing for output inputs list file, exiting...\"\n    exit 126\n  fi\nfi\nls -alR \"$CYTO_CC_DIR\"\n"],"resources":{},"volumeMounts":[{"name":"cyto-cc","mountPath":"/cyto_cc"},{"name":"storage-credentials","readOnly":true,"mountPath":"/secrets/storage"}],"imagePullPolicy":"IfNotPresent"}],"archiveLocation":{"archiveLogs":true,"gcs":{"bucket":"cyto_cc","serviceAccountKeySecret":{"name":"storage-secrets","key":"cyto-cc-storage-service-account-keys.json"},"key":"cyto-cc-production/workflows/wf-41e31955e6/logs/0"}},"retryStrategy":{"limit":"6","retryPolicy":"Always","expression":"lastRetry.status == \"Error\" or sprig.contains(lastRetry.message, \"imminent node shutdown\") or (lastRetry.status == \"Failed\" and asInt(lastRetry.exitCode) not in [1,2,127])"},"podSpecPatch":"containers:\n- name: main\n  resources:\n    #{{gpu_limit}}\n    requests:\n      memory: 11Gi\n      cpu: 100m\n\nserviceAccountName: data-access-public\n\ninitContainers:\n- name: init\n  resources:\n    requests:\n      memory: 11Gi\n- name: preparedata\n  resources:\n    requests:\n      memory: 11Gi\n        \n"}
      ARGO_NODE_ID:                       wf-41e31955e6-4001409550
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      2024-08-23T06:28:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /argo/secret/storage-secrets from storage-secrets (ro)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x95zm (ro)
  preparedata:
    Container ID:  containerd://7bd2fd3a260251ed93c3cda48e282fd9c17ff9ac658e46188f7b8eda2797dc0b
    Image:         google/cloud-sdk:latest
    Image ID:      docker.io/google/cloud-sdk@sha256:cc86a2c2a9c0f4b88f678ad30fb54b9f597982056160faae52e45e2ed43f320e
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      export CYTO_CC_DIR=/cyto_cc
      export INPUTS_ROOTDIR="$CYTO_CC_DIR"/inputs
      mkdir -pv "$INPUTS_ROOTDIR"
      export CACHE_ROOTDIR="$CYTO_CC_DIR"/cache
      mkdir -pv "$CACHE_ROOTDIR"
      
      gcloud -v
      gsutil version -l
      
      if [[ "cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST" == *"INPUT_FILES_LIST" ]]; then
        echo "Testing for local inputs list file at gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST..."
        gsutil -q stat gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST
        if [ $? -eq 0 ]; then
            INPUT_FILES_LIST=/tmp/INPUT_FILES_LIST
            echo "Copying input files list from bucket to $INPUT_FILES_LIST..."
            gsutil cp gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST $INPUT_FILES_LIST
            echo "Listing $INPUT_FILES_LIST..."
            ls -l $INPUT_FILES_LIST
            echo "Printing contents of $INPUT_FILES_LIST..."
            cat $INPUT_FILES_LIST
            echo "Looping over contents of $INPUT_FILES_LIST..."
            cat $INPUT_FILES_LIST | awk -F, -v rootdir="$INPUTS_ROOTDIR" '{ \
              printf("echo input file record: %s\n",$0); \
              printf("gsutil ls -l gs://cyto_cc/%s\n",$1); \
              printf("gsutil cp gs://cyto_cc/%s %s/%s\n",$1,rootdir,$2); \
            }' | bash
        else
            echo "Error testing for local inputs file, exiting..."
            exit 126
        fi
      fi
      
      echo "Copying shared asset input files..."
      for in_file in ; do 
        IFS=',' read item1 item2 <<< "$in_file"
        gsutil -m cp -r "$item1" "$INPUTS_ROOTDIR"/"$item2";
      done;
      
      if [[ "undefined" == *"INPUT_OUTPUT_FILES_LIST" ]]; then
        echo "Testing for output inputs list file at gs://cyto_cc/undefined..."
        gsutil -q stat gs://cyto_cc/undefined
        if [ $? -eq 0 ]; then
            INPUT_OUTPUT_FILES_LIST=/tmp/INPUT_OUTPUT_FILES_LIST
            echo "Copying output inputs files list from bucket to $INPUT_OUTPUT_FILES_LIST..."
            gsutil cp gs://cyto_cc/undefined $INPUT_OUTPUT_FILES_LIST
            echo "Listing $INPUT_OUTPUT_FILES_LIST..."
            ls -l $INPUT_OUTPUT_FILES_LIST
            echo "Printing contents of $INPUT_OUTPUT_FILES_LIST..."
            cat $INPUT_OUTPUT_FILES_LIST
            echo "Looping over contents of $INPUT_OUTPUT_FILES_LIST..."
            for in_file in $(cat $INPUT_OUTPUT_FILES_LIST); do 
              IFS=',' read source target_dir <<< "$in_file"
              mkdir -pv "$INPUTS_ROOTDIR"/"$target_dir";
              gsutil -m cp -r "$source" "$INPUTS_ROOTDIR"/"$target_dir";
              dir=$(basename "$source");
              inputs_dir=$(dirname "$INPUTS_ROOTDIR"/"$target_dir"/"$dir");
              mv -v "$INPUTS_ROOTDIR"/"$target_dir"/"$dir"/* "$inputs_dir";
              rmdir -v "$INPUTS_ROOTDIR"/"$target_dir"/"$dir";
              wf_id=$(echo "$source" | awk -F/ '{print $6}');
              t_id=$(echo "$source" | awk -F/ '{print $8}');
              outputs_cache="$CACHE_ROOTDIR"/"$wf_id"/outputs;
              mkdir -pv "$outputs_cache";
              ln -s "$inputs_dir" "$outputs_cache"/"$t_id";
            done;                  
        else
          echo "Error testing for output inputs list file, exiting..."
          exit 126
        fi
      fi
      ls -alR "$CYTO_CC_DIR"
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 22 Aug 2024 09:37:12 +0300
      Finished:     Thu, 22 Aug 2024 09:37:23 +0300
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  11Gi
    Environment:
      ARGO_CONTAINER_NAME:                preparedata
      ARGO_TEMPLATE:                      {"name":"run-task-ev-1","inputs":{"parameters":[{"name":"task-id","value":"0"},{"name":"workflow-id","value":"wf-41e31955e6"},{"name":"root-owner","value":"tom.kaufman"},{"name":"cyto-cc-ui","value":"https://cyto-cc.cytoreason.com/workflow/wf-41e31955e6/0"},{"name":"in-files","value":"cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST"},{"name":"in-output-files","value":"undefined"},{"name":"in-shared-asset-files","value":""},{"name":"output-dir","value":"./output/"},{"name":"image","value":"eu.gcr.io/cytoreason/cd-singlecell-controller@sha256:892be327d683c07dcfc683051ebe10124d441ca6d4e3f7bfaa8b8a6bfa912c51"},{"name":"cmd","value":"python src/api.py run"},{"name":"memreqnum","value":"1"},{"name":"memrequnit","value":"Gi"},{"name":"initmemreqnum","value":"1"},{"name":"initmemrequnit","value":"Gi"},{"name":"cpureq","value":"100m"},{"name":"numgpus","value":"0"},{"name":"gputype","value":"0"}]},"outputs":{"artifacts":[{"name":"out-dir","path":"./output/","gcs":{"bucket":"cyto_cc","serviceAccountKeySecret":{"name":"storage-secrets","key":"cyto-cc-storage-service-account-keys.json"},"key":"cyto-cc-production/workflows/wf-41e31955e6/outputs/0"},"archive":{"none":{}}}]},"nodeSelector":{"nodetype":"WORKFLOW"},"metadata":{"annotations":{"cyto-cc.io/cyto-cc-ui":"https://cyto-cc.cytoreason.com/workflow/wf-41e31955e6/0"},"labels":{"cyto-cc.io/root-owner":"tom.kaufman","cyto-cc.io/root-workflow-id":"wf-41e31955e6","cyto-cc.io/scope":"public","cyto-cc.io/task-id":"0","cyto-cc.io/task-name":"single-cell-root","cyto-cc.io/workflow-id":"wf-41e31955e6"}},"container":{"name":"","image":"eu.gcr.io/cytoreason/cd-singlecell-controller@sha256:892be327d683c07dcfc683051ebe10124d441ca6d4e3f7bfaa8b8a6bfa912c51","command":["sh","-c"],"args":["echo \"$(date +%FT%T%Z): starting command for $CURRENT_WORKFLOW_ID/$CURRENT_TASK_ID ($CURRENT_TASK_NAME) in pod $POD_NAME (retry $CURRENT_RETRY)\"\nOUTDIR=\"$(dirname ./output/)/$(basename ./output/)\";\nmkdir -p ./output/;\npython src/api.py run 2\u003e\u00261\nexit_code=$?\necho \"$(date +%FT%T%Z): finished command for $CURRENT_WORKFLOW_ID/$CURRENT_TASK_ID ($CURRENT_TASK_NAME) in pod $POD_NAME (retry $CURRENT_RETRY), exit_code=$exit_code\"\nexit 
$exit_code\n"],"envFrom":[{"configMapRef":{"name":"cr-configmap"}}],"env":[{"name":"CLIENT_ID","valueFrom":{"secretKeyRef":{"name":"storage-secrets","key":"CLIENT_ID_public"}}},{"name":"CLIENT_SECRET","valueFrom":{"secretKeyRef":{"name":"storage-secrets","key":"CLIENT_SECRET_public"}}},{"name":"CLIENT","value":"Machine"},{"name":"TENANT","value":"Production"},{"name":"SERVER_URL","value":"https://cyto-cc.cytoreason.com"},{"name":"SERVICE_ACCOUNT","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"GCS_AUTH_FILE","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"CREATED_BY_WORKFLOW","value":"wf-41e31955e6"},{"name":"GOOGLE_APPLICATION_CREDENTIALS","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"DATA_ACCESS","value":"public"},{"name":"POD_NAME","value":"wf-41e31955e6-4001409550"},{"name":"CURRENT_RETRY","value":"1"},{"name":"CURRENT_WORKFLOW_ID","value":"wf-41e31955e6"},{"name":"CURRENT_TASK_ID","value":"0"},{"name":"CURRENT_TASK_NAME","value":"single-cell-root"}],"resources":{},"volumeMounts":[{"name":"cyto-cc","mountPath":"/cyto_cc"},{"name":"storage-credentials","readOnly":true,"mountPath":"/secrets/storage"}],"imagePullPolicy":"IfNotPresent"},"initContainers":[{"name":"preparedata","image":"google/cloud-sdk:latest","command":["bash","-c"],"args":["export CYTO_CC_DIR=/cyto_cc\nexport INPUTS_ROOTDIR=\"$CYTO_CC_DIR\"/inputs\nmkdir -pv \"$INPUTS_ROOTDIR\"\nexport CACHE_ROOTDIR=\"$CYTO_CC_DIR\"/cache\nmkdir -pv \"$CACHE_ROOTDIR\"\n\ngcloud -v\ngsutil version -l\n\nif [[ \"cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST\" == *\"INPUT_FILES_LIST\" ]]; then\n  echo \"Testing for local inputs list file at gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST...\"\n  gsutil -q stat gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST\n  if [ $? -eq 0 ]; then\n      INPUT_FILES_LIST=/tmp/INPUT_FILES_LIST\n      echo \"Copying input files list from bucket to $INPUT_FILES_LIST...\"\n      gsutil cp gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST $INPUT_FILES_LIST\n      echo \"Listing $INPUT_FILES_LIST...\"\n      ls -l $INPUT_FILES_LIST\n      echo \"Printing contents of $INPUT_FILES_LIST...\"\n      cat $INPUT_FILES_LIST\n      echo \"Looping over contents of $INPUT_FILES_LIST...\"\n      cat $INPUT_FILES_LIST | awk -F, -v rootdir=\"$INPUTS_ROOTDIR\" '{ \\\n        printf(\"echo input file record: %s\\n\",$0); \\\n        printf(\"gsutil ls -l gs://cyto_cc/%s\\n\",$1); \\\n        printf(\"gsutil cp gs://cyto_cc/%s %s/%s\\n\",$1,rootdir,$2); \\\n      }' | bash\n  else\n      echo \"Error testing for local inputs file, exiting...\"\n      exit 126\n  fi\nfi\n\necho \"Copying shared asset input files...\"\nfor in_file in ; do \n  IFS=',' read item1 item2 \u003c\u003c\u003c \"$in_file\"\n  gsutil -m cp -r \"$item1\" \"$INPUTS_ROOTDIR\"/\"$item2\";\ndone;\n\nif [[ \"undefined\" == *\"INPUT_OUTPUT_FILES_LIST\" ]]; then\n  echo \"Testing for output inputs list file at gs://cyto_cc/undefined...\"\n  gsutil -q stat gs://cyto_cc/undefined\n  if [ $? 
-eq 0 ]; then\n      INPUT_OUTPUT_FILES_LIST=/tmp/INPUT_OUTPUT_FILES_LIST\n      echo \"Copying output inputs files list from bucket to $INPUT_OUTPUT_FILES_LIST...\"\n      gsutil cp gs://cyto_cc/undefined $INPUT_OUTPUT_FILES_LIST\n      echo \"Listing $INPUT_OUTPUT_FILES_LIST...\"\n      ls -l $INPUT_OUTPUT_FILES_LIST\n      echo \"Printing contents of $INPUT_OUTPUT_FILES_LIST...\"\n      cat $INPUT_OUTPUT_FILES_LIST\n      echo \"Looping over contents of $INPUT_OUTPUT_FILES_LIST...\"\n      for in_file in $(cat $INPUT_OUTPUT_FILES_LIST); do \n        IFS=',' read source target_dir \u003c\u003c\u003c \"$in_file\"\n        mkdir -pv \"$INPUTS_ROOTDIR\"/\"$target_dir\";\n        gsutil -m cp -r \"$source\" \"$INPUTS_ROOTDIR\"/\"$target_dir\";\n        dir=$(basename \"$source\");\n        inputs_dir=$(dirname \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\");\n        mv -v \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\"/* \"$inputs_dir\";\n        rmdir -v \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\";\n        wf_id=$(echo \"$source\" | awk -F/ '{print $6}');\n        t_id=$(echo \"$source\" | awk -F/ '{print $8}');\n        outputs_cache=\"$CACHE_ROOTDIR\"/\"$wf_id\"/outputs;\n        mkdir -pv \"$outputs_cache\";\n        ln -s \"$inputs_dir\" \"$outputs_cache\"/\"$t_id\";\n      done;                  \n  else\n    echo \"Error testing for output inputs list file, exiting...\"\n    exit 126\n  fi\nfi\nls -alR \"$CYTO_CC_DIR\"\n"],"resources":{},"volumeMounts":[{"name":"cyto-cc","mountPath":"/cyto_cc"},{"name":"storage-credentials","readOnly":true,"mountPath":"/secrets/storage"}],"imagePullPolicy":"IfNotPresent"}],"archiveLocation":{"archiveLogs":true,"gcs":{"bucket":"cyto_cc","serviceAccountKeySecret":{"name":"storage-secrets","key":"cyto-cc-storage-service-account-keys.json"},"key":"cyto-cc-production/workflows/wf-41e31955e6/logs/0"}},"retryStrategy":{"limit":"6","retryPolicy":"Always","expression":"lastRetry.status == \"Error\" or sprig.contains(lastRetry.message, \"imminent node shutdown\") or (lastRetry.status == \"Failed\" and asInt(lastRetry.exitCode) not in [1,2,127])"},"podSpecPatch":"containers:\n- name: main\n  resources:\n    #{{gpu_limit}}\n    requests:\n      memory: 11Gi\n      cpu: 100m\n\nserviceAccountName: data-access-public\n\ninitContainers:\n- name: init\n  resources:\n    requests:\n      memory: 11Gi\n- name: preparedata\n  resources:\n    requests:\n      memory: 11Gi\n        \n"}
      ARGO_NODE_ID:                       wf-41e31955e6-4001409550
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      2024-08-23T06:28:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /cyto_cc from cyto-cc (rw)
      /secrets/storage from storage-credentials (ro)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x95zm (ro)
Containers:
  wait:
    Container ID:  containerd://b947af5b64fe19a3ee220505ecfa61dbd989aa885a478078bfc3a44a3c0be8c8
    Image:         quay.io/argoproj/argoexec:v3.5.6
    Image ID:      quay.io/argoproj/argoexec@sha256:c7405360797347aee20cf252c2c0cbed045077e58bf572042e118acefc74e94e
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      wait
      --loglevel
      info
      --log-format
      text
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 22 Aug 2024 09:37:26 +0300
      Finished:     Thu, 22 Aug 2024 15:44:53 +0300
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:  100m
    Environment:
      ARGO_POD_NAME:                      wf-41e31955e6-4001409550 (v1:metadata.name)
      ARGO_POD_UID:                        (v1:metadata.uid)
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 wf-41e31955e6
      ARGO_WORKFLOW_UID:                  2c1af2ae-955b-472f-8a08-5193ce2ec941
      ARGO_CONTAINER_NAME:                wait
      ARGO_TEMPLATE:                      {"name":"run-task-ev-1","inputs":{"parameters":[{"name":"task-id","value":"0"},{"name":"workflow-id","value":"wf-41e31955e6"},{"name":"root-owner","value":"tom.kaufman"},{"name":"cyto-cc-ui","value":"https://cyto-cc.cytoreason.com/workflow/wf-41e31955e6/0"},{"name":"in-files","value":"cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST"},{"name":"in-output-files","value":"undefined"},{"name":"in-shared-asset-files","value":""},{"name":"output-dir","value":"./output/"},{"name":"image","value":"eu.gcr.io/cytoreason/cd-singlecell-controller@sha256:892be327d683c07dcfc683051ebe10124d441ca6d4e3f7bfaa8b8a6bfa912c51"},{"name":"cmd","value":"python src/api.py run"},{"name":"memreqnum","value":"1"},{"name":"memrequnit","value":"Gi"},{"name":"initmemreqnum","value":"1"},{"name":"initmemrequnit","value":"Gi"},{"name":"cpureq","value":"100m"},{"name":"numgpus","value":"0"},{"name":"gputype","value":"0"}]},"outputs":{"artifacts":[{"name":"out-dir","path":"./output/","gcs":{"bucket":"cyto_cc","serviceAccountKeySecret":{"name":"storage-secrets","key":"cyto-cc-storage-service-account-keys.json"},"key":"cyto-cc-production/workflows/wf-41e31955e6/outputs/0"},"archive":{"none":{}}}]},"nodeSelector":{"nodetype":"WORKFLOW"},"metadata":{"annotations":{"cyto-cc.io/cyto-cc-ui":"https://cyto-cc.cytoreason.com/workflow/wf-41e31955e6/0"},"labels":{"cyto-cc.io/root-owner":"tom.kaufman","cyto-cc.io/root-workflow-id":"wf-41e31955e6","cyto-cc.io/scope":"public","cyto-cc.io/task-id":"0","cyto-cc.io/task-name":"single-cell-root","cyto-cc.io/workflow-id":"wf-41e31955e6"}},"container":{"name":"","image":"eu.gcr.io/cytoreason/cd-singlecell-controller@sha256:892be327d683c07dcfc683051ebe10124d441ca6d4e3f7bfaa8b8a6bfa912c51","command":["sh","-c"],"args":["echo \"$(date +%FT%T%Z): starting command for $CURRENT_WORKFLOW_ID/$CURRENT_TASK_ID ($CURRENT_TASK_NAME) in pod $POD_NAME (retry $CURRENT_RETRY)\"\nOUTDIR=\"$(dirname ./output/)/$(basename ./output/)\";\nmkdir -p ./output/;\npython src/api.py run 2\u003e\u00261\nexit_code=$?\necho \"$(date +%FT%T%Z): finished command for $CURRENT_WORKFLOW_ID/$CURRENT_TASK_ID ($CURRENT_TASK_NAME) in pod $POD_NAME (retry $CURRENT_RETRY), exit_code=$exit_code\"\nexit 
$exit_code\n"],"envFrom":[{"configMapRef":{"name":"cr-configmap"}}],"env":[{"name":"CLIENT_ID","valueFrom":{"secretKeyRef":{"name":"storage-secrets","key":"CLIENT_ID_public"}}},{"name":"CLIENT_SECRET","valueFrom":{"secretKeyRef":{"name":"storage-secrets","key":"CLIENT_SECRET_public"}}},{"name":"CLIENT","value":"Machine"},{"name":"TENANT","value":"Production"},{"name":"SERVER_URL","value":"https://cyto-cc.cytoreason.com"},{"name":"SERVICE_ACCOUNT","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"GCS_AUTH_FILE","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"CREATED_BY_WORKFLOW","value":"wf-41e31955e6"},{"name":"GOOGLE_APPLICATION_CREDENTIALS","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"DATA_ACCESS","value":"public"},{"name":"POD_NAME","value":"wf-41e31955e6-4001409550"},{"name":"CURRENT_RETRY","value":"1"},{"name":"CURRENT_WORKFLOW_ID","value":"wf-41e31955e6"},{"name":"CURRENT_TASK_ID","value":"0"},{"name":"CURRENT_TASK_NAME","value":"single-cell-root"}],"resources":{},"volumeMounts":[{"name":"cyto-cc","mountPath":"/cyto_cc"},{"name":"storage-credentials","readOnly":true,"mountPath":"/secrets/storage"}],"imagePullPolicy":"IfNotPresent"},"initContainers":[{"name":"preparedata","image":"google/cloud-sdk:latest","command":["bash","-c"],"args":["export CYTO_CC_DIR=/cyto_cc\nexport INPUTS_ROOTDIR=\"$CYTO_CC_DIR\"/inputs\nmkdir -pv \"$INPUTS_ROOTDIR\"\nexport CACHE_ROOTDIR=\"$CYTO_CC_DIR\"/cache\nmkdir -pv \"$CACHE_ROOTDIR\"\n\ngcloud -v\ngsutil version -l\n\nif [[ \"cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST\" == *\"INPUT_FILES_LIST\" ]]; then\n  echo \"Testing for local inputs list file at gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST...\"\n  gsutil -q stat gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST\n  if [ $? -eq 0 ]; then\n      INPUT_FILES_LIST=/tmp/INPUT_FILES_LIST\n      echo \"Copying input files list from bucket to $INPUT_FILES_LIST...\"\n      gsutil cp gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST $INPUT_FILES_LIST\n      echo \"Listing $INPUT_FILES_LIST...\"\n      ls -l $INPUT_FILES_LIST\n      echo \"Printing contents of $INPUT_FILES_LIST...\"\n      cat $INPUT_FILES_LIST\n      echo \"Looping over contents of $INPUT_FILES_LIST...\"\n      cat $INPUT_FILES_LIST | awk -F, -v rootdir=\"$INPUTS_ROOTDIR\" '{ \\\n        printf(\"echo input file record: %s\\n\",$0); \\\n        printf(\"gsutil ls -l gs://cyto_cc/%s\\n\",$1); \\\n        printf(\"gsutil cp gs://cyto_cc/%s %s/%s\\n\",$1,rootdir,$2); \\\n      }' | bash\n  else\n      echo \"Error testing for local inputs file, exiting...\"\n      exit 126\n  fi\nfi\n\necho \"Copying shared asset input files...\"\nfor in_file in ; do \n  IFS=',' read item1 item2 \u003c\u003c\u003c \"$in_file\"\n  gsutil -m cp -r \"$item1\" \"$INPUTS_ROOTDIR\"/\"$item2\";\ndone;\n\nif [[ \"undefined\" == *\"INPUT_OUTPUT_FILES_LIST\" ]]; then\n  echo \"Testing for output inputs list file at gs://cyto_cc/undefined...\"\n  gsutil -q stat gs://cyto_cc/undefined\n  if [ $? 
-eq 0 ]; then\n      INPUT_OUTPUT_FILES_LIST=/tmp/INPUT_OUTPUT_FILES_LIST\n      echo \"Copying output inputs files list from bucket to $INPUT_OUTPUT_FILES_LIST...\"\n      gsutil cp gs://cyto_cc/undefined $INPUT_OUTPUT_FILES_LIST\n      echo \"Listing $INPUT_OUTPUT_FILES_LIST...\"\n      ls -l $INPUT_OUTPUT_FILES_LIST\n      echo \"Printing contents of $INPUT_OUTPUT_FILES_LIST...\"\n      cat $INPUT_OUTPUT_FILES_LIST\n      echo \"Looping over contents of $INPUT_OUTPUT_FILES_LIST...\"\n      for in_file in $(cat $INPUT_OUTPUT_FILES_LIST); do \n        IFS=',' read source target_dir \u003c\u003c\u003c \"$in_file\"\n        mkdir -pv \"$INPUTS_ROOTDIR\"/\"$target_dir\";\n        gsutil -m cp -r \"$source\" \"$INPUTS_ROOTDIR\"/\"$target_dir\";\n        dir=$(basename \"$source\");\n        inputs_dir=$(dirname \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\");\n        mv -v \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\"/* \"$inputs_dir\";\n        rmdir -v \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\";\n        wf_id=$(echo \"$source\" | awk -F/ '{print $6}');\n        t_id=$(echo \"$source\" | awk -F/ '{print $8}');\n        outputs_cache=\"$CACHE_ROOTDIR\"/\"$wf_id\"/outputs;\n        mkdir -pv \"$outputs_cache\";\n        ln -s \"$inputs_dir\" \"$outputs_cache\"/\"$t_id\";\n      done;                  \n  else\n    echo \"Error testing for output inputs list file, exiting...\"\n    exit 126\n  fi\nfi\nls -alR \"$CYTO_CC_DIR\"\n"],"resources":{},"volumeMounts":[{"name":"cyto-cc","mountPath":"/cyto_cc"},{"name":"storage-credentials","readOnly":true,"mountPath":"/secrets/storage"}],"imagePullPolicy":"IfNotPresent"}],"archiveLocation":{"archiveLogs":true,"gcs":{"bucket":"cyto_cc","serviceAccountKeySecret":{"name":"storage-secrets","key":"cyto-cc-storage-service-account-keys.json"},"key":"cyto-cc-production/workflows/wf-41e31955e6/logs/0"}},"retryStrategy":{"limit":"6","retryPolicy":"Always","expression":"lastRetry.status == \"Error\" or sprig.contains(lastRetry.message, \"imminent node shutdown\") or (lastRetry.status == \"Failed\" and asInt(lastRetry.exitCode) not in [1,2,127])"},"podSpecPatch":"containers:\n- name: main\n  resources:\n    #{{gpu_limit}}\n    requests:\n      memory: 11Gi\n      cpu: 100m\n\nserviceAccountName: data-access-public\n\ninitContainers:\n- name: init\n  resources:\n    requests:\n      memory: 11Gi\n- name: preparedata\n  resources:\n    requests:\n      memory: 11Gi\n        \n"}
      ARGO_NODE_ID:                       wf-41e31955e6-4001409550
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      2024-08-23T06:28:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /argo/secret/storage-secrets from storage-secrets (ro)
      /mainctrfs/cyto_cc from cyto-cc (rw)
      /mainctrfs/secrets/storage from storage-credentials (rw)
      /tmp from tmp-dir-argo (rw,path="0")
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x95zm (ro)
  main:
    Container ID:  containerd://b6a6116e2c0531c312d70bcbb4154cce38744c9581c3695d5f98191ede05b427
    Image:         eu.gcr.io/cytoreason/cd-singlecell-controller@sha256:892be327d683c07dcfc683051ebe10124d441ca6d4e3f7bfaa8b8a6bfa912c51
    Image ID:      eu.gcr.io/cytoreason/cd-singlecell-controller@sha256:892be327d683c07dcfc683051ebe10124d441ca6d4e3f7bfaa8b8a6bfa912c51
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --loglevel
      info
      --log-format
      text
      --
      sh
      -c
    Args:
      echo "$(date +%FT%T%Z): starting command for $CURRENT_WORKFLOW_ID/$CURRENT_TASK_ID ($CURRENT_TASK_NAME) in pod $POD_NAME (retry $CURRENT_RETRY)"
      OUTDIR="$(dirname ./output/)/$(basename ./output/)";
      mkdir -p ./output/;
      python src/api.py run 2>&1
      exit_code=$?
      echo "$(date +%FT%T%Z): finished command for $CURRENT_WORKFLOW_ID/$CURRENT_TASK_ID ($CURRENT_TASK_NAME) in pod $POD_NAME (retry $CURRENT_RETRY), exit_code=$exit_code"
      exit $exit_code
      
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 22 Aug 2024 09:37:28 +0300
      Finished:     Thu, 22 Aug 2024 15:44:53 +0300
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  11Gi
    Environment Variables from:
      cr-configmap  ConfigMap  Optional: false
    Environment:
      CLIENT_ID:                          <set to the key 'CLIENT_ID_public' in secret 'storage-secrets'>      Optional: false
      CLIENT_SECRET:                      <set to the key 'CLIENT_SECRET_public' in secret 'storage-secrets'>  Optional: false
      CLIENT:                             Machine
      TENANT:                             Production
      SERVER_URL:                         https://cyto-cc.cytoreason.com
      SERVICE_ACCOUNT:                    /secrets/storage/cyto-cc-storage-service-account-keys.json
      GCS_AUTH_FILE:                      /secrets/storage/cyto-cc-storage-service-account-keys.json
      CREATED_BY_WORKFLOW:                wf-41e31955e6
      GOOGLE_APPLICATION_CREDENTIALS:     /secrets/storage/cyto-cc-storage-service-account-keys.json
      DATA_ACCESS:                        public
      POD_NAME:                           wf-41e31955e6-4001409550
      CURRENT_RETRY:                      1
      CURRENT_WORKFLOW_ID:                wf-41e31955e6
      CURRENT_TASK_ID:                    0
      CURRENT_TASK_NAME:                  single-cell-root
      ARGO_CONTAINER_NAME:                main
      ARGO_TEMPLATE:                      {"name":"run-task-ev-1","inputs":{"parameters":[{"name":"task-id","value":"0"},{"name":"workflow-id","value":"wf-41e31955e6"},{"name":"root-owner","value":"tom.kaufman"},{"name":"cyto-cc-ui","value":"https://cyto-cc.cytoreason.com/workflow/wf-41e31955e6/0"},{"name":"in-files","value":"cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST"},{"name":"in-output-files","value":"undefined"},{"name":"in-shared-asset-files","value":""},{"name":"output-dir","value":"./output/"},{"name":"image","value":"eu.gcr.io/cytoreason/cd-singlecell-controller@sha256:892be327d683c07dcfc683051ebe10124d441ca6d4e3f7bfaa8b8a6bfa912c51"},{"name":"cmd","value":"python src/api.py run"},{"name":"memreqnum","value":"1"},{"name":"memrequnit","value":"Gi"},{"name":"initmemreqnum","value":"1"},{"name":"initmemrequnit","value":"Gi"},{"name":"cpureq","value":"100m"},{"name":"numgpus","value":"0"},{"name":"gputype","value":"0"}]},"outputs":{"artifacts":[{"name":"out-dir","path":"./output/","gcs":{"bucket":"cyto_cc","serviceAccountKeySecret":{"name":"storage-secrets","key":"cyto-cc-storage-service-account-keys.json"},"key":"cyto-cc-production/workflows/wf-41e31955e6/outputs/0"},"archive":{"none":{}}}]},"nodeSelector":{"nodetype":"WORKFLOW"},"metadata":{"annotations":{"cyto-cc.io/cyto-cc-ui":"https://cyto-cc.cytoreason.com/workflow/wf-41e31955e6/0"},"labels":{"cyto-cc.io/root-owner":"tom.kaufman","cyto-cc.io/root-workflow-id":"wf-41e31955e6","cyto-cc.io/scope":"public","cyto-cc.io/task-id":"0","cyto-cc.io/task-name":"single-cell-root","cyto-cc.io/workflow-id":"wf-41e31955e6"}},"container":{"name":"","image":"eu.gcr.io/cytoreason/cd-singlecell-controller@sha256:892be327d683c07dcfc683051ebe10124d441ca6d4e3f7bfaa8b8a6bfa912c51","command":["sh","-c"],"args":["echo \"$(date +%FT%T%Z): starting command for $CURRENT_WORKFLOW_ID/$CURRENT_TASK_ID ($CURRENT_TASK_NAME) in pod $POD_NAME (retry $CURRENT_RETRY)\"\nOUTDIR=\"$(dirname ./output/)/$(basename ./output/)\";\nmkdir -p ./output/;\npython src/api.py run 2\u003e\u00261\nexit_code=$?\necho \"$(date +%FT%T%Z): finished command for $CURRENT_WORKFLOW_ID/$CURRENT_TASK_ID ($CURRENT_TASK_NAME) in pod $POD_NAME (retry $CURRENT_RETRY), exit_code=$exit_code\"\nexit 
$exit_code\n"],"envFrom":[{"configMapRef":{"name":"cr-configmap"}}],"env":[{"name":"CLIENT_ID","valueFrom":{"secretKeyRef":{"name":"storage-secrets","key":"CLIENT_ID_public"}}},{"name":"CLIENT_SECRET","valueFrom":{"secretKeyRef":{"name":"storage-secrets","key":"CLIENT_SECRET_public"}}},{"name":"CLIENT","value":"Machine"},{"name":"TENANT","value":"Production"},{"name":"SERVER_URL","value":"https://cyto-cc.cytoreason.com"},{"name":"SERVICE_ACCOUNT","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"GCS_AUTH_FILE","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"CREATED_BY_WORKFLOW","value":"wf-41e31955e6"},{"name":"GOOGLE_APPLICATION_CREDENTIALS","value":"/secrets/storage/cyto-cc-storage-service-account-keys.json"},{"name":"DATA_ACCESS","value":"public"},{"name":"POD_NAME","value":"wf-41e31955e6-4001409550"},{"name":"CURRENT_RETRY","value":"1"},{"name":"CURRENT_WORKFLOW_ID","value":"wf-41e31955e6"},{"name":"CURRENT_TASK_ID","value":"0"},{"name":"CURRENT_TASK_NAME","value":"single-cell-root"}],"resources":{},"volumeMounts":[{"name":"cyto-cc","mountPath":"/cyto_cc"},{"name":"storage-credentials","readOnly":true,"mountPath":"/secrets/storage"}],"imagePullPolicy":"IfNotPresent"},"initContainers":[{"name":"preparedata","image":"google/cloud-sdk:latest","command":["bash","-c"],"args":["export CYTO_CC_DIR=/cyto_cc\nexport INPUTS_ROOTDIR=\"$CYTO_CC_DIR\"/inputs\nmkdir -pv \"$INPUTS_ROOTDIR\"\nexport CACHE_ROOTDIR=\"$CYTO_CC_DIR\"/cache\nmkdir -pv \"$CACHE_ROOTDIR\"\n\ngcloud -v\ngsutil version -l\n\nif [[ \"cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST\" == *\"INPUT_FILES_LIST\" ]]; then\n  echo \"Testing for local inputs list file at gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST...\"\n  gsutil -q stat gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST\n  if [ $? -eq 0 ]; then\n      INPUT_FILES_LIST=/tmp/INPUT_FILES_LIST\n      echo \"Copying input files list from bucket to $INPUT_FILES_LIST...\"\n      gsutil cp gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST $INPUT_FILES_LIST\n      echo \"Listing $INPUT_FILES_LIST...\"\n      ls -l $INPUT_FILES_LIST\n      echo \"Printing contents of $INPUT_FILES_LIST...\"\n      cat $INPUT_FILES_LIST\n      echo \"Looping over contents of $INPUT_FILES_LIST...\"\n      cat $INPUT_FILES_LIST | awk -F, -v rootdir=\"$INPUTS_ROOTDIR\" '{ \\\n        printf(\"echo input file record: %s\\n\",$0); \\\n        printf(\"gsutil ls -l gs://cyto_cc/%s\\n\",$1); \\\n        printf(\"gsutil cp gs://cyto_cc/%s %s/%s\\n\",$1,rootdir,$2); \\\n      }' | bash\n  else\n      echo \"Error testing for local inputs file, exiting...\"\n      exit 126\n  fi\nfi\n\necho \"Copying shared asset input files...\"\nfor in_file in ; do \n  IFS=',' read item1 item2 \u003c\u003c\u003c \"$in_file\"\n  gsutil -m cp -r \"$item1\" \"$INPUTS_ROOTDIR\"/\"$item2\";\ndone;\n\nif [[ \"undefined\" == *\"INPUT_OUTPUT_FILES_LIST\" ]]; then\n  echo \"Testing for output inputs list file at gs://cyto_cc/undefined...\"\n  gsutil -q stat gs://cyto_cc/undefined\n  if [ $? 
-eq 0 ]; then\n      INPUT_OUTPUT_FILES_LIST=/tmp/INPUT_OUTPUT_FILES_LIST\n      echo \"Copying output inputs files list from bucket to $INPUT_OUTPUT_FILES_LIST...\"\n      gsutil cp gs://cyto_cc/undefined $INPUT_OUTPUT_FILES_LIST\n      echo \"Listing $INPUT_OUTPUT_FILES_LIST...\"\n      ls -l $INPUT_OUTPUT_FILES_LIST\n      echo \"Printing contents of $INPUT_OUTPUT_FILES_LIST...\"\n      cat $INPUT_OUTPUT_FILES_LIST\n      echo \"Looping over contents of $INPUT_OUTPUT_FILES_LIST...\"\n      for in_file in $(cat $INPUT_OUTPUT_FILES_LIST); do \n        IFS=',' read source target_dir \u003c\u003c\u003c \"$in_file\"\n        mkdir -pv \"$INPUTS_ROOTDIR\"/\"$target_dir\";\n        gsutil -m cp -r \"$source\" \"$INPUTS_ROOTDIR\"/\"$target_dir\";\n        dir=$(basename \"$source\");\n        inputs_dir=$(dirname \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\");\n        mv -v \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\"/* \"$inputs_dir\";\n        rmdir -v \"$INPUTS_ROOTDIR\"/\"$target_dir\"/\"$dir\";\n        wf_id=$(echo \"$source\" | awk -F/ '{print $6}');\n        t_id=$(echo \"$source\" | awk -F/ '{print $8}');\n        outputs_cache=\"$CACHE_ROOTDIR\"/\"$wf_id\"/outputs;\n        mkdir -pv \"$outputs_cache\";\n        ln -s \"$inputs_dir\" \"$outputs_cache\"/\"$t_id\";\n      done;                  \n  else\n    echo \"Error testing for output inputs list file, exiting...\"\n    exit 126\n  fi\nfi\nls -alR \"$CYTO_CC_DIR\"\n"],"resources":{},"volumeMounts":[{"name":"cyto-cc","mountPath":"/cyto_cc"},{"name":"storage-credentials","readOnly":true,"mountPath":"/secrets/storage"}],"imagePullPolicy":"IfNotPresent"}],"archiveLocation":{"archiveLogs":true,"gcs":{"bucket":"cyto_cc","serviceAccountKeySecret":{"name":"storage-secrets","key":"cyto-cc-storage-service-account-keys.json"},"key":"cyto-cc-production/workflows/wf-41e31955e6/logs/0"}},"retryStrategy":{"limit":"6","retryPolicy":"Always","expression":"lastRetry.status == \"Error\" or sprig.contains(lastRetry.message, \"imminent node shutdown\") or (lastRetry.status == \"Failed\" and asInt(lastRetry.exitCode) not in [1,2,127])"},"podSpecPatch":"containers:\n- name: main\n  resources:\n    #{{gpu_limit}}\n    requests:\n      memory: 11Gi\n      cpu: 100m\n\nserviceAccountName: data-access-public\n\ninitContainers:\n- name: init\n  resources:\n    requests:\n      memory: 11Gi\n- name: preparedata\n  resources:\n    requests:\n      memory: 11Gi\n        \n"}
      ARGO_NODE_ID:                       wf-41e31955e6-4001409550
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      2024-08-23T06:28:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /cyto_cc from cyto-cc (rw)
      /secrets/storage from storage-credentials (ro)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x95zm (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  var-run-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  tmp-dir-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  cyto-cc:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  storage-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  storage-secrets
    Optional:    false
  storage-secrets:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  storage-secrets
    Optional:    false
  kube-api-access-x95zm:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              nodetype=WORKFLOW
Tolerations:                 WORKFLOW=BASIC:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nodetype=WORKFLOW:NoSchedule
Events:                      <none>

@yonirab
Contributor

yonirab commented Aug 22, 2024

I can however get the logs for pod wf-41e31955e6-3397268171.

We can see from the end of these logs that both main and wait were running.
main did not complete, so I think it is the one that got OOMKilled.
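
For reference, when the failed pod still exists (i.e. it has not been garbage-collected yet), the kill reason can be read straight off its container statuses. A minimal sketch, using the pod name and namespace from this thread:

kubectl get pod wf-41e31955e6-3397268171 -n default \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state.terminated.reason}{" (exit "}{.state.terminated.exitCode}{")"}{"\n"}{end}'
# an OOM-killed main container would show up as: main: OOMKilled (exit 137)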

ERROR 2024-08-22T06:28:01.550152336Z [resource.labels.containerName: init] time="2024-08-22T06:28:01.549Z" level=info msg="Starting Workflow Executor" version=v3.5.6
ERROR 2024-08-22T06:28:01.554005551Z [resource.labels.containerName: init] time="2024-08-22T06:28:01.553Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
ERROR 2024-08-22T06:28:01.554029147Z [resource.labels.containerName: init] time="2024-08-22T06:28:01.553Z" level=info msg="Executor initialized" deadline="2024-08-23 06:28:00 +0000 UTC" includeScriptOutput=false namespace=default podName=wf-41e31955e6-3397268171 templateName=run-task-ev-1 version="&Version{Version:v3.5.6,BuildDate:2024-04-19T20:54:43Z,GitCommit:555030053825dd61689a086cb3c2da329419325a,GitTag:v3.5.6,GitTreeState:clean,GoVersion:go1.21.9,Compiler:gc,Platform:linux/amd64,}"
ERROR 2024-08-22T06:28:01.665699834Z [resource.labels.containerName: init] time="2024-08-22T06:28:01.665Z" level=info msg="Start loading input artifacts..."
ERROR 2024-08-22T06:28:01.665847926Z [resource.labels.containerName: init] time="2024-08-22T06:28:01.665Z" level=info msg="Alloc=9876 TotalAlloc=13500 Sys=25189 NumGC=3 Goroutines=4"
INFO 2024-08-22T06:28:02.305226944Z [resource.labels.containerName: preparedata] mkdir: created directory '/cyto_cc/inputs'
INFO 2024-08-22T06:28:02.307017260Z [resource.labels.containerName: preparedata] mkdir: created directory '/cyto_cc/cache'
INFO 2024-08-22T06:28:02.938278159Z [resource.labels.containerName: preparedata] Google Cloud SDK 489.0.0
INFO 2024-08-22T06:28:02.938305508Z [resource.labels.containerName: preparedata] alpha 2024.08.16
INFO 2024-08-22T06:28:02.938310216Z [resource.labels.containerName: preparedata] app-engine-go 1.9.76
INFO 2024-08-22T06:28:02.938313860Z [resource.labels.containerName: preparedata] app-engine-java 2.0.29
INFO 2024-08-22T06:28:02.938318522Z [resource.labels.containerName: preparedata] app-engine-python 1.9.113
INFO 2024-08-22T06:28:02.938322107Z [resource.labels.containerName: preparedata] app-engine-python-extras 1.9.107
INFO 2024-08-22T06:28:02.938326007Z [resource.labels.containerName: preparedata] beta 2024.08.16
INFO 2024-08-22T06:28:02.938329339Z [resource.labels.containerName: preparedata] bigtable
INFO 2024-08-22T06:28:02.938333122Z [resource.labels.containerName: preparedata] bq 2.1.8
INFO 2024-08-22T06:28:02.938336670Z [resource.labels.containerName: preparedata] bundled-python3-unix 3.11.9
INFO 2024-08-22T06:28:02.938340177Z [resource.labels.containerName: preparedata] cbt 1.21.0
INFO 2024-08-22T06:28:02.938343526Z [resource.labels.containerName: preparedata] cloud-datastore-emulator 2.3.1
INFO 2024-08-22T06:28:02.938347287Z [resource.labels.containerName: preparedata] cloud-firestore-emulator 1.19.8
INFO 2024-08-22T06:28:02.938351128Z [resource.labels.containerName: preparedata] cloud-spanner-emulator 1.5.22
INFO 2024-08-22T06:28:02.938354514Z [resource.labels.containerName: preparedata] core 2024.08.16
INFO 2024-08-22T06:28:02.938357785Z [resource.labels.containerName: preparedata] gcloud-crc32c 1.0.0
INFO 2024-08-22T06:28:02.938361538Z [resource.labels.containerName: preparedata] gke-gcloud-auth-plugin 0.5.9
INFO 2024-08-22T06:28:02.938365571Z [resource.labels.containerName: preparedata] gsutil 5.30
INFO 2024-08-22T06:28:02.938369461Z [resource.labels.containerName: preparedata] kpt 1.0.0-beta.50
INFO 2024-08-22T06:28:02.938372850Z [resource.labels.containerName: preparedata] kubectl 1.28.12
INFO 2024-08-22T06:28:02.938376541Z [resource.labels.containerName: preparedata] local-extract 1.5.10
INFO 2024-08-22T06:28:02.938380173Z [resource.labels.containerName: preparedata] pubsub-emulator 0.8.14
INFO 2024-08-22T06:28:04.514698506Z [resource.labels.containerName: preparedata] gsutil version: 5.30
INFO 2024-08-22T06:28:04.514730071Z [resource.labels.containerName: preparedata] checksum: 9996ebfa53b330353ea6981e3143d15b (OK)
INFO 2024-08-22T06:28:04.514735911Z [resource.labels.containerName: preparedata] boto version: 2.49.0
INFO 2024-08-22T06:28:04.514741574Z [resource.labels.containerName: preparedata] python version: 3.11.9 (main, Jul 27 2024, 03:07:42) [Clang 18.1.8 ]
INFO 2024-08-22T06:28:04.514745435Z [resource.labels.containerName: preparedata] OS: Linux 6.1.85+
INFO 2024-08-22T06:28:04.514749876Z [resource.labels.containerName: preparedata] multiprocessing available: True
INFO 2024-08-22T06:28:04.514753296Z [resource.labels.containerName: preparedata] using cloud sdk: True
INFO 2024-08-22T06:28:04.514757025Z [resource.labels.containerName: preparedata] pass cloud sdk credentials to gsutil: True
INFO 2024-08-22T06:28:04.514760542Z [resource.labels.containerName: preparedata] config path(s): No config found
INFO 2024-08-22T06:28:04.514764548Z [resource.labels.containerName: preparedata] gsutil path: /usr/lib/google-cloud-sdk/bin/gsutil
INFO 2024-08-22T06:28:04.514768393Z [resource.labels.containerName: preparedata] compiled crcmod: True
INFO 2024-08-22T06:28:04.514772118Z [resource.labels.containerName: preparedata] installed via package manager: False
INFO 2024-08-22T06:28:04.514776042Z [resource.labels.containerName: preparedata] editable install: False
INFO 2024-08-22T06:28:04.514780420Z [resource.labels.containerName: preparedata] shim enabled: False
INFO 2024-08-22T06:28:04.742427870Z [resource.labels.containerName: preparedata] Testing for local inputs list file at gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST...
INFO 2024-08-22T06:28:06.428217356Z [resource.labels.containerName: preparedata] Copying input files list from bucket to /tmp/INPUT_FILES_LIST...
ERROR 2024-08-22T06:28:07.816816138Z [resource.labels.containerName: preparedata] Copying gs://cyto_cc/cyto-cc-production/workflows/wf-41e31955e6/local-input-lists/0/INPUT_FILES_LIST...
ERROR 2024-08-22T06:28:07.914597134Z [resource.labels.containerName: preparedata] / [0 files][ 0.0 B/ 83.0 B] / [1 files][ 83.0 B/ 83.0 B]
ERROR 2024-08-22T06:28:07.914631240Z [resource.labels.containerName: preparedata] Operation completed over 1 objects/83.0 B.
INFO 2024-08-22T06:28:08.150320108Z [resource.labels.containerName: preparedata] Listing /tmp/INPUT_FILES_LIST...
INFO 2024-08-22T06:28:08.152226613Z [resource.labels.containerName: preparedata] -rw-r--r-- 1 root root 83 Aug 22 06:28 /tmp/INPUT_FILES_LIST
INFO 2024-08-22T06:28:08.152446775Z [resource.labels.containerName: preparedata] Printing contents of /tmp/INPUT_FILES_LIST...
INFO 2024-08-22T06:28:08.153548982Z [resource.labels.containerName: preparedata] cyto-cc-production/local_files/b33c7e146959d12fd9f3b0581e41b35c,configuration.json
INFO 2024-08-22T06:28:08.153730714Z [resource.labels.containerName: preparedata] Looping over contents of /tmp/INPUT_FILES_LIST...
INFO 2024-08-22T06:28:08.156460224Z [resource.labels.containerName: preparedata] input file record: cyto-cc-production/local_files/b33c7e146959d12fd9f3b0581e41b35c,configuration.json
INFO 2024-08-22T06:28:09.505601790Z [resource.labels.containerName: preparedata] 3581 2024-08-19T14:33:33Z gs://cyto_cc/cyto-cc-production/local_files/b33c7e146959d12fd9f3b0581e41b35c
INFO 2024-08-22T06:28:09.505636611Z [resource.labels.containerName: preparedata] TOTAL: 1 objects, 3581 bytes (3.5 KiB)
ERROR 2024-08-22T06:28:11.202001109Z [resource.labels.containerName: preparedata] Copying gs://cyto_cc/cyto-cc-production/local_files/b33c7e146959d12fd9f3b0581e41b35c...
ERROR 2024-08-22T06:28:11.414958985Z [resource.labels.containerName: preparedata] / [0 files][ 0.0 B/ 3.5 KiB] / [1 files][ 3.5 KiB/ 3.5 KiB]
ERROR 2024-08-22T06:28:11.414992636Z [resource.labels.containerName: preparedata] Operation completed over 1 objects/3.5 KiB.
INFO 2024-08-22T06:28:11.651558403Z [resource.labels.containerName: preparedata] Copying shared asset input files...
INFO 2024-08-22T06:28:11.653497335Z [resource.labels.containerName: preparedata] /cyto_cc:
INFO 2024-08-22T06:28:11.653515434Z [resource.labels.containerName: preparedata] total 16
INFO 2024-08-22T06:28:11.653523117Z [resource.labels.containerName: preparedata] drwxrwxrwx 4 root root 4096 Aug 22 06:28 .
INFO 2024-08-22T06:28:11.653528515Z [resource.labels.containerName: preparedata] drwxr-xr-x 1 root root 4096 Aug 22 06:28 ..
INFO 2024-08-22T06:28:11.653533418Z [resource.labels.containerName: preparedata] drwxr-xr-x 2 root root 4096 Aug 22 06:28 cache
INFO 2024-08-22T06:28:11.653538431Z [resource.labels.containerName: preparedata] drwxr-xr-x 2 root root 4096 Aug 22 06:28 inputs
INFO 2024-08-22T06:28:11.653543281Z [resource.labels.containerName: preparedata] {}
INFO 2024-08-22T06:28:11.653548727Z [resource.labels.containerName: preparedata] /cyto_cc/cache:
INFO 2024-08-22T06:28:11.653553790Z [resource.labels.containerName: preparedata] total 8
INFO 2024-08-22T06:28:11.653559497Z [resource.labels.containerName: preparedata] drwxr-xr-x 2 root root 4096 Aug 22 06:28 .
INFO 2024-08-22T06:28:11.653565357Z [resource.labels.containerName: preparedata] drwxrwxrwx 4 root root 4096 Aug 22 06:28 ..
INFO 2024-08-22T06:28:11.653570669Z [resource.labels.containerName: preparedata] {}
INFO 2024-08-22T06:28:11.653581166Z [resource.labels.containerName: preparedata] /cyto_cc/inputs:
INFO 2024-08-22T06:28:11.653587694Z [resource.labels.containerName: preparedata] total 12
INFO 2024-08-22T06:28:11.653592669Z [resource.labels.containerName: preparedata] drwxr-xr-x 2 root root 4096 Aug 22 06:28 .
INFO 2024-08-22T06:28:11.653599754Z [resource.labels.containerName: preparedata] drwxrwxrwx 4 root root 4096 Aug 22 06:28 ..
INFO 2024-08-22T06:28:11.653606040Z [resource.labels.containerName: preparedata] -rw-r--r-- 1 root root 3581 Aug 22 06:28 configuration.json
ERROR 2024-08-22T06:28:12.354050346Z [resource.labels.containerName: wait] time="2024-08-22T06:28:12.353Z" level=info msg="Starting Workflow Executor" version=v3.5.6
ERROR 2024-08-22T06:28:12.358365284Z [resource.labels.containerName: wait] time="2024-08-22T06:28:12.358Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
ERROR 2024-08-22T06:28:12.358387800Z [resource.labels.containerName: wait] time="2024-08-22T06:28:12.358Z" level=info msg="Executor initialized" deadline="2024-08-23 06:28:00 +0000 UTC" includeScriptOutput=false namespace=default podName=wf-41e31955e6-3397268171 templateName=run-task-ev-1 version="&Version{Version:v3.5.6,BuildDate:2024-04-19T20:54:43Z,GitCommit:555030053825dd61689a086cb3c2da329419325a,GitTag:v3.5.6,GitTreeState:clean,GoVersion:go1.21.9,Compiler:gc,Platform:linux/amd64,}"
ERROR 2024-08-22T06:28:12.371515528Z [resource.labels.containerName: wait] time="2024-08-22T06:28:12.371Z" level=info msg="Starting deadline monitor"
ERROR 2024-08-22T06:28:15.551186211Z [resource.labels.containerName: main] time="2024-08-22T06:28:15.538Z" level=info msg="capturing logs" argo=true
INFO 2024-08-22T06:28:15.711030659Z [resource.labels.containerName: main] 2024-08-22T06:28:15UTC: starting command for wf-41e31955e6/0 (single-cell-root) in pod wf-41e31955e6-3397268171 (retry 0)
INFO 2024-08-22T06:28:27.424138164Z [resource.labels.containerName: main] 2024-08-22 06:28:27.423 | INFO | controller:run_sc_pipeline:31 - Loaded SC configurations for datasets: ['GSE184362']
INFO 2024-08-22T06:28:27.424176896Z [resource.labels.containerName: main] 2024-08-22 06:28:27.423 | INFO | controller:run_sc_pipeline:37 - Building DAG for dataset 'GSE184362'...
INFO 2024-08-22T06:28:27.532669075Z [resource.labels.containerName: main] 2024-08-22 06:28:27,532 [MainThread ] [INFO ] getting fresh token
INFO 2024-08-22T06:28:27.532706545Z [resource.labels.containerName: main] 2024-08-22 06:28:27,532 [MainThread ] [INFO ] getting token
INFO 2024-08-22T06:28:29.276542281Z [resource.labels.containerName: main] 2024-08-22 06:28:29.276 | INFO | controller:run_sc_pipeline:92 - Scheduled 8 tasks for dataset 'GSE184362' in "wf-9915539f1e"
ERROR 2024-08-22T06:33:12.359457025Z [resource.labels.containerName: wait] time="2024-08-22T06:33:12.359Z" level=info msg="Alloc=7597 TotalAlloc=14127 Sys=29797 NumGC=6 Goroutines=8"

@jswxstw
Member

jswxstw commented Aug 22, 2024

Unfortunately that POD seems to be gone:

Your pod is gone, #13454 will fix your problem.

We can see from the end of these logs that both main and wait were running.
main did not complete, so I think it is the one that got OOMKilled.

The wait container being interrupted is the root cause. Check the Workflow object; you will likely see this at the end of its status:

taskResultsCompletionStatus:
    wf-41e31955e6-3397268171: false
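
A quick way to pull that field off the live Workflow object (sketch only; workflow name and namespace are the ones from this thread, and wf is the short name of the Workflow CRD):

kubectl get wf wf-41e31955e6 -n default -o jsonpath='{.status.taskResultsCompletionStatus}'
# any node still reported as false keeps the workflow stuck in Running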

@yonirab The PR I submitted will fix this issue. In the meantime, you can increase the resource limits for the wait container to mitigate it.
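
One way to do that is through the executor block of the controller ConfigMap, which is applied to the init and wait containers of new workflow pods. Sketch only: it assumes a default install in the argo namespace and overwrites any existing executor settings, so merge by hand if you already customize them:

kubectl -n argo patch configmap workflow-controller-configmap --type merge -p '
{"data": {"executor": "resources:\n  requests:\n    cpu: 100m\n    memory: 128Mi\n  limits:\n    cpu: 500m\n    memory: 512Mi\n"}}'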

jswxstw added a commit to jswxstw/argo-workflows that referenced this issue Aug 23, 2024
…gracefully. Fixes argoproj#13373

Signed-off-by: oninowang <oninowang@tencent.com>
jswxstw added a commit to jswxstw/argo-workflows that referenced this issue Aug 23, 2024
…inated not gracefully. Fixes argoproj#13373

Signed-off-by: oninowang <oninowang@tencent.com>
@yonirab
Contributor

yonirab commented Aug 23, 2024

@jswxstw Bingo! Here's what I see at the end of the workflow spec:

"taskResultsCompletionStatus": {
      "wf-41e31955e6-3397268171": false,
      "wf-41e31955e6-4001409550": true
    }

Looking forward to a release with your fix to hopefully see the end of this problem!
Thank you very much!

@EladProject - FYI

@agilgur5 agilgur5 added this to the v3.5.x patches milestone Aug 24, 2024
jswxstw added a commit to jswxstw/argo-workflows that referenced this issue Aug 26, 2024
…inated not gracefully. Fixes argoproj#13373

Signed-off-by: oninowang <oninowang@tencent.com>
@jswxstw
Member

jswxstw commented Aug 28, 2024

Unfortunately that POD seems to be gone:

Your pod is gone, #13454 will fix your problem.

I made a mistake. #13454 may not fix this problem.

if !foundPod && !node.Completed() {

In taskResultReconciliation, it marks the node as failed after a timeout and marks the WorkflowTaskResult as completed, but only when the pod is absent and the node has not yet completed.

woc.markNodeError(node.Name, errors.New("", "pod deleted"))

Unfortunately, in podReconciliation the node is marked as Error as soon as the pod is absent and the node was not recently started.

The logic in these two sections is almost identical, with only two differences:

cc @agilgur5 @Joibel @isubasinghe

@Joibel Joibel closed this as completed in 3d41fb2 Aug 29, 2024
Joibel pushed a commit to pipekit/argo-workflows that referenced this issue Sep 19, 2024
…fully. Fixes argoproj#13373 (argoproj#13491)

Signed-off-by: oninowang <oninowang@tencent.com>
Joibel pushed a commit that referenced this issue Sep 20, 2024
…fully. Fixes #13373 (#13491)

Signed-off-by: oninowang <oninowang@tencent.com>
@agilgur5 agilgur5 changed the title workflow stuck in Running, but only pod exited with OOMKilled (exit code 137) v3.5.8: workflow stuck in Running, but only pod exited with OOMKilled (exit code 137) Oct 8, 2024