Pod's phase is failed, but the workflow node status is succeeded. #3879
@juchaosong can you try it on the latest build, v2.9.5?
Can you provide the Argo controller logs?
We use Alibaba Cloud ECI pods, which work something like virtual-kubelet. The pod failed because there was no stock available. I'm not sure whether Alibaba Cloud should change the container state to terminated instead of waiting. Does Kubernetes standardize that when a pod fails, the container state must change to terminated?
Yes, one of the containers should be in the terminated state for a pod failure.
But did you see that any node was removed on your cluster?
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
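The lifecycle expectation above can be sketched as a small conformance check. This is a minimal sketch with simplified stand-in types (the real types live in `k8s.io/api/core/v1`; the type and function names here are hypothetical):

```go
package main

import "fmt"

// Minimal stand-ins for the k8s.io/api/core/v1 status types; only the
// fields relevant to this issue are modelled.
type ContainerStateTerminated struct {
	ExitCode int32
}

type ContainerState struct {
	Terminated *ContainerStateTerminated
}

type ContainerStatus struct {
	Name  string
	State ContainerState
}

type PodStatus struct {
	Phase             string // "Pending", "Running", "Succeeded", "Failed", ...
	ContainerStatuses []ContainerStatus
}

// conformsToLifecycle encodes the expectation above: a pod reporting
// the Failed phase should have at least one container in the
// Terminated state.
func conformsToLifecycle(s PodStatus) bool {
	if s.Phase != "Failed" {
		return true
	}
	for _, cs := range s.ContainerStatuses {
		if cs.State.Terminated != nil {
			return true
		}
	}
	return false
}

func main() {
	// The shape reported by the virtual-kubelet pod in this issue:
	// phase Failed, but the container never left the waiting state.
	bad := PodStatus{
		Phase: "Failed",
		ContainerStatuses: []ContainerStatus{
			{Name: "main"}, // State.Terminated left nil
		},
	}
	fmt.Println(conformsToLifecycle(bad)) // false
}
```

A provider such as virtual-kubelet that reports `Phase: Failed` while leaving every container's `Terminated` state nil would fail this check, which matches the pod status pasted in this issue.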
The node that hosts the failed pod is not a real node; it's created by virtual-kubelet. If the pod information is wrong, it may be a problem with the virtual-kubelet implementation.
It looks like virtual-kubelet will not set the terminated state if the container is not running.
I think I encountered a similar issue on GKE, running Argo 2.10.0. Last night I submitted a workflow with 1024 pods in parallel, processing data and uploading to Google Cloud Storage (via the gsutil command rather than artifact outputs). This morning I noticed the whole workflow was done, yet one file was missing. I looked into the logs and realized the controller somehow thought the node was done in a very short time. I tried to find the pod's log but couldn't find anything. I think the pod never even got a chance to be fully up and running before it died, yet somehow the Argo controller thinks the pod succeeded. Here's the log for the controller:
Normally each of these nodes would take 20 minutes to finish, but this node went from start to finish in only 1 minute, which is certainly not right. I cannot reproduce it, and there isn't much other log I can find. This is all I can provide.
Can you find controller logs like these?
Available for testing in v2.11.0-rc1. |
Summary
What happened/what you expected to happen?
Pod's phase is failed, but the workflow node status is succeeded.
Diagnostics
What version of Argo Workflows are you running?
v2.7.0
The pod's yaml
The workflow node status yaml
You can see the pod's phase is failed, but the workflow node status is succeeded.
According to https://github.com/argoproj/argo/blob/v2.7.0/workflow/controller/operator.go#L1070-L1074, because the containerStatuses' terminated field is null, the node is updated to succeeded: https://github.com/argoproj/argo/blob/v2.7.0/workflow/controller/operator.go#L1165.
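The failure-detection logic described above can be illustrated with a short sketch. This is a hypothetical simplification of the linked operator code, with made-up type and function names, not the actual Argo source:

```go
package main

import "fmt"

// Simplified stand-in for the k8s.io/api/core/v1 terminated state;
// only the fields relevant here.
type Terminated struct {
	ExitCode int32
	Message  string
}

type ContainerStatus struct {
	Name       string
	Terminated *Terminated // nil unless the container has terminated
}

// assessFailure is a hypothetical simplification of the controller
// logic linked above: it derives a failure message only from containers
// whose state is Terminated with a non-zero exit code. If every
// Terminated field is nil (as in the pod status pasted in this issue),
// it finds nothing, and the caller falls through to marking the
// workflow node succeeded even though the pod phase is Failed.
func assessFailure(statuses []ContainerStatus) (string, bool) {
	for _, cs := range statuses {
		if t := cs.Terminated; t != nil && t.ExitCode != 0 {
			return fmt.Sprintf("container %q failed with exit code %d: %s",
				cs.Name, t.ExitCode, t.Message), true
		}
	}
	return "", false
}

func main() {
	// The problematic shape: the pod phase is Failed, but the
	// container's Terminated state was never set.
	msg, failed := assessFailure([]ContainerStatus{{Name: "main", Terminated: nil}})
	fmt.Printf("failed=%v msg=%q\n", failed, msg) // failed=false msg=""
}
```

Under this reading, the bug surfaces whenever a provider reports `Phase: Failed` without ever transitioning a container to the terminated state: the loop above finds no terminated container to blame, so no failure is inferred.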
I'm not sure whether this is an Argo bug or an incorrect pod container status.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.