v3.5.3: Workflow processing fails to complete due to incomplete WorkflowTaskResult from interrupted pod #12993
Comments
It seems unreliable to determine whether the task result has been successfully reported through the …

After reproducing it, there are indeed some problems, and we need to think about how to fix them.

Yeah, we are just concluding an RCA after we had a cohort of workflows "stuck" in Running state as the upgrade of the …
…. Fixes: argoproj#12993. Signed-off-by: shuangkun <tsk2013uestc@163.com>
Is the primary case for this one in which the Controller itself issues a SIGKILL to the container because the container isn't responding fast enough to a SIGTERM? If so, it seems like we need to indicate on the Controller side that if we are doing a SIGKILL not to wait for the WorkflowTaskResult for that task, right?

The root cause for the pod interruption in my case was primarily related to node deprovisioning, as the environment where this issue appeared consistently uses EC2 Spot Instances along with an aggressive deprovisioning strategy for underutilized nodes. I believe that node-pressure eviction could also cause this issue, along with any other external condition that would result in a non-graceful pod termination.

Got it. That's interesting that node-deprovisioning and node-pressure eviction would result in SIGKILL rather than SIGTERM.
There are projects like NTH which should get you a SIGTERM with some time to do some work. Perhaps they're not being used here. For non-graceful termination, though, we still need a solution. I think we need to consider "Pod gone away" after a reasonable period (to allow for propagation of the WorkflowTaskResult) to be a pod failure and mark the outputs as completed (with error) to allow the workflow to fail/retry.
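As an illustration of that proposal only (the function, its parameters, and the two-minute grace value below are assumptions, not controller code), the decision could look roughly like this:

```go
package main

import (
	"fmt"
	"time"
)

// taskResultPropagationGrace is an assumed window that allows a final
// WorkflowTaskResult update to propagate after the pod disappears.
const taskResultPropagationGrace = 2 * time.Minute

// outputsReportingDone is a hypothetical helper: it returns true when the
// controller could stop waiting for outputs from a task, either because the
// result was marked complete or because the pod is gone and the grace window
// has elapsed (at which point the node would be marked failed/errored so the
// workflow can fail or retry instead of hanging).
func outputsReportingDone(reportOutputsCompleted, podGone bool, podGoneSince time.Time) bool {
	if reportOutputsCompleted {
		return true
	}
	return podGone && time.Since(podGoneSince) > taskResultPropagationGrace
}

func main() {
	// Example: the pod vanished three minutes ago and never finalized its result.
	fmt.Println(outputsReportingDone(false, true, time.Now().Add(-3*time.Minute)))
}
```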
Good find. We encounter this sometimes in long-running workflows.
Agreed, I said something very similar in #13066 (comment) and proposed a similar approach in #13344 (comment). The proposed PR #13051 would not necessarily resolve this as a finalizer does not necessarily prevent non-graceful termination (e.g. OOMs and the like are kubelet/node level and not control plane/k8s resource level). I don't have time to take this on, @Joibel would you be able to tackle this when you're back from vacation? Or if someone else wants to tackle it in the meantime, that would work too. See my proposed approach in #13344 (comment) for more details.

I don't want to block someone else, but yes, otherwise I have it at the top of the list to do.
I am seeing this issue in 3.5.8 with just a `kubectl delete pod`. Reproduction case is: …
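The reproduction steps in that comment are truncated above; as a rough sketch of this class of reproduction (the namespace, workflow spec, and sleep duration below are assumptions, not necessarily the original case), submitting a long-sleeping workflow and force-deleting its pod produces the same kind of interruption:

```shell
# Create a trivial workflow whose main container just sleeps.
kubectl -n argo create -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sleepy-
spec:
  entrypoint: sleep
  templates:
    - name: sleep
      container:
        image: alpine:3.19
        command: [sh, -c, "sleep 300"]
EOF

# Force-delete the workflow's pod with no grace period so the wait container
# cannot finalize its WorkflowTaskResult.
kubectl -n argo delete pod -l workflows.argoproj.io/workflow --grace-period=0 --force
```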
@isubasinghe and I are working on a fix.
…3454) Signed-off-by: isubasinghe <isitha@pipekit.io>
Pre-requisites

- I have tested with the `:latest` image tag (i.e. `quay.io/argoproj/workflow-controller:latest`) and can confirm the issue still exists on `:latest`. If not, I have explained why, in detail, in my description below.

What happened/what you expected to happen?
As part of #12402 (included from v3.5.3 onwards), workflow pod `wait`-container behavior was changed to create a placeholder (incomplete) WorkflowTaskResult before waiting for the `main`-container to complete:

argo-workflows/cmd/argoexec/commands/wait.go, lines 38 to 42 at 0fdf745

The WorkflowTaskResult is finalized after output artifacts, logs, etc. have been handled:

argo-workflows/cmd/argoexec/commands/wait.go, line 34 at 0fdf745
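The sequencing described above can be illustrated with a small self-contained sketch (the types and helpers below are stand-ins rather than Argo's executor API; only the label name and the notion of finalization come from this report):

```go
package main

import (
	"fmt"
	"time"
)

const completedLabel = "workflows.argoproj.io/report-outputs-completed"

// taskResult stands in for a WorkflowTaskResult's labels; it is not Argo's type.
type taskResult struct{ labels map[string]string }

// initializeOutput mirrors the placeholder creation: the result exists before
// the main container has finished, but is marked incomplete.
func initializeOutput() *taskResult {
	return &taskResult{labels: map[string]string{completedLabel: "false"}}
}

// finalizeOutput mirrors the final step: once artifacts, logs, etc. have been
// handled, the result is marked complete.
func finalizeOutput(r *taskResult) { r.labels[completedLabel] = "true" }

func main() {
	r := initializeOutput()
	// If the wait container is killed before this deferred call runs,
	// the placeholder stays at report-outputs-completed=false forever.
	defer finalizeOutput(r)

	time.Sleep(10 * time.Millisecond) // stand-in for waiting on the main container
	fmt.Println(r.labels)
}
```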
If the `wait`-container is interrupted in a way that prevents `FinalizeOutput` from being called (e.g. pod deletion without sufficient grace period), an incomplete WorkflowTaskResult remains with the `workflows.argoproj.io/report-outputs-completed` label set to `false`. Retries of the same task will produce additional WorkflowTaskResults and will not mark the previous one complete. This leaves the workflow stuck in `Processing` state until the WorkflowTaskResult is manually edited to mark it complete.

The reproduction example workflow simulates forced pod deletion using a pod that deletes itself, leaving behind an incomplete WorkflowTaskResult. The included workflow controller log snippet shows the resulting processing loop.
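For reference, the manual edit mentioned above can be done with kubectl along these lines (a sketch only; the namespace and the specific task result name are assumptions):

```shell
# List task results that were left marked incomplete.
kubectl -n argo get workflowtaskresults.argoproj.io \
  -l workflows.argoproj.io/report-outputs-completed=false

# Mark a stuck result complete so the controller can finish processing the workflow.
kubectl -n argo label workflowtaskresults.argoproj.io <task-result-name> \
  workflows.argoproj.io/report-outputs-completed=true --overwrite
```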
This issue may be one of the causes of #12103.
Version
v3.5.3
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container