v3.5.8: workflow stuck in Running, but only pod exited with OOMKilled (exit code 137) #13373
Comments
If the status of the node task is error, we need to set the task completed, like this:

func (ws *WorkflowStatus) TaskResultsInProgress() bool {
    for taskName, value := range ws.TaskResultsCompletionStatus {
        if node, ok := ws.Nodes[taskName]; ok {
            // If the node has already errored or failed, its task result should not keep the workflow in progress.
            if node.Phase == NodeError || node.Phase == NodeFailed {
                return false
            }
        }
        if !value {
            return true
        }
    }
    return false
}
Same as #12993.
Yeah, same root cause as #12993, so let's consolidate the two. Also, for back-link reference, this issue's title refers to #12103. #12103 (comment) refers to an OOM as well, but it's unclear whether those two are the same issue; they might just be related symptoms.
Also, a minimal, reproducible Workflow is nonetheless required for issues, even if it's very simple. Otherwise someone else has to write one from scratch in order to reproduce. When one is provided, it's a simple copy+paste, confirm, and debug.
This issue seems to be a little different; the pod state is already handled by argo-workflows/workflow/controller/operator.go, lines 1142 to 1147 (at 709d0d0).
@zhucan @agilgur5 I think this issue should be reopened.
Thanks @jswxstw! With 3.5.5 we are seeing multiple scenarios in which a workflow with one or more steps that retry one or more times (e.g. due to OOMKilled) and eventually succeed shows as green in the UI (both the workflow itself and the steps that retried), but is perceived to still be Running. This seems to be a serious regression in 3.5, which causes very serious problems for any downstream systems monitoring the status of Argo workflows.
@jswxstw You should have permission to re-open this yourself, since you're a Member these days. Feel free to do so if you have a strong suspicion like this.
It's strange that after becoming a member, I didn't find that I had any new permissions.
@jswxstw I'm attaching some more info about our scenario (which is not something I can easily reproduce).
The Argo UI shows the workflow as green, with one step that was OOMKilled on its first attempt and then retried successfully.
The same workflow appears with a blue circle in the UI workflows list.
Expanding the workflow in the UI workflows list shows the following:
Querying the workflow from the argo CLI gives the following:
Similarly
Here are logs from the workflow-controller that contain
I don't see any logs at all matching
@yonirab Can you describe the pod
That log line is only printed at debug level.
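(For anyone who also does not see that line: since it is only emitted at debug level, the controller needs to run with debug logging. A minimal sketch of the relevant part of the workflow-controller Deployment, assuming the standard quay.io image; adjust names and namespace to your install.)

# Fragment of the workflow-controller Deployment spec, not a complete manifest.
containers:
  - name: workflow-controller
    image: quay.io/argoproj/workflow-controller:v3.5.8
    args:
      - --loglevel
      - debug   # default is "info"; debug makes the log line referenced above visible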
I can describe the retry pod, though:
I can however get the logs for pod
We can see from the end of these logs that both
Your pod is gone; #13454 will fix your problem.
taskResultsCompletionStatus:
  wf-41e31955e6-3397268171: false

@yonirab The PR I submitted will fix this issue, and I think you can increase the resource limit for
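(For context, the resource limit jswxstw refers to lives in the Workflow template's container spec. A minimal sketch with placeholder names and values; the right limit depends on the workload.)

# Where the memory limit lives in a Workflow template (names and values below are placeholders).
templates:
  - name: main
    container:
      image: my-registry/my-step:latest   # placeholder image
      resources:
        requests:
          memory: 1Gi
        limits:
          memory: 2Gi   # raise this if the step keeps getting OOMKilled (exit code 137)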
@jswxstw Bingo! Here's what I see at the end of the workflow's status:

"taskResultsCompletionStatus": {
  "wf-41e31955e6-3397268171": false,
  "wf-41e31955e6-4001409550": true
}

Looking forward to a release with your fix to hopefully see the end of this problem! @EladProject - FYI
I made a mistake. #13454 may not fix this problem.
It marks the node as failed after a timeout, and marks the WorkflowTaskResult as completed only when the pod is absent and the node has not been completed; see argo-workflows/workflow/controller/operator.go, line 1241 (at 983c6ca).
Unfortunately, the node will be marked as
The logic in these two sections is almost identical, with only two differences:
Pre-requisites
I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.

What happened? What did you expect to happen?
If the pod exited with "OOMKilled (exit code 137)" and was then cleaned up by the GC controller, we expect the status of the workflow to be "Error", not Running.
Version(s)
v3.5.8
Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Set a resource (memory) limit on the workflow, use more memory than the limit, and the pod is killed with OOMKilled; see the sketch below.
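A minimal sketch of the kind of Workflow that reproduces this (placeholder image and sizes, assuming a pod GC strategy that deletes the pod as soon as it completes): a single step with a small memory limit that deliberately allocates more than the limit, so the container is OOMKilled with exit code 137.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: oom-repro-
spec:
  entrypoint: eat-memory
  podGC:
    strategy: OnPodCompletion   # pod is cleaned up right after it finishes, as described above
  templates:
    - name: eat-memory
      container:
        image: python:3.11-alpine
        # Allocate ~512Mi while the limit is 128Mi, so the kernel OOM-kills the container (exit code 137).
        command: [python, -c, "x = bytearray(512 * 1024 * 1024); print(len(x))"]
        resources:
          requests:
            memory: 64Mi
          limits:
            memory: 128Mi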
Logs from the workflow controller
Logs from your workflow's wait container