v3.4.11: workflow stuck Running state, but only pod is Completed #12103

Comments
So, to summarize, the Pod is "Completed" but the Step and the Workflow are both still showing as "Running", correct? I'm imagining that the Controller is failing to process it (especially as it has surpassed the activeDeadlineSeconds).
Correct. There is something in the workflow controller logs below that caught my eye and makes me think it's missing retry logic when receiving a transient error from the k8s control plane:
We've had this happen on 8 jobs yesterday. The problem is exacerbated by the Deadline, and it snowballs onto following jobs as they get stuck in Pending.
@ZeidAqemia @zqhi71 do your controller logs have any errors like the ones in #12103 (comment)? What % of workflows are affected? What version are you running? For me on v3.4.11 this affects less than 0.01% of workflows.
v3.5.2 here and it's 100% of workflows where one task OOMs.
Hi guys. We found the default controller settings are not suitable for thousands of CronWorkflows. When we adjusted --cronworkflow-worker, qps, and burst, the CronWorkflows worked fine. If someone has the same problem, adjusting the settings following this document (https://argo-workflows.readthedocs.io/en/latest/scaling/) may help.
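For reference, a sketch of what those settings might look like as args on the workflow-controller Deployment. The values and exact flag names are assumptions; check `workflow-controller --help` and the scaling docs for your version:

```yaml
# Illustrative scaling settings on the workflow-controller Deployment.
# Flag names and values are assumptions; verify against your Argo version.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          args:
            - --workflow-workers=64        # parallel workflow reconciliation workers
            - --cron-workflow-workers=32   # workers for CronWorkflow scheduling
            - --qps=50                     # k8s API client queries per second
            - --burst=100                  # k8s API client burst
```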
This seems related to Hang on "Workflow processing has been postponed due to max parallelism limit" #11808. I'm seeing the same issue.
Hi @tooptoop4
It still recurs, but rarely: roughly 1 in 20,000 workflow runs. Did you see the log in #12103 (comment)?
Yes, but in my case we are facing the same stuck-in-Running state.
Hi @tooptoop4, we tried #12233 but it doesn't help in my case.
Any ideas? @jswxstw @shuangkun
After reading the background of the issue, it seems that your situation is different from the others (not similar to #12993). I see a lot of logs like:
We're also seeing this issue, only in 3.5.x versions. I initially tried to upgrade and saw this issue on an earlier 3.5.x. It's been a month or so, so I tried again with 3.5.8, and I'm still seeing the issue. This is with any workflow I try to run - steps, DAGs, containers - both invoked from workflow templates and crons (although I doubt that matters).
@sstaley-hioscar, could you verify that what you see in the wait container logs from one of these runs confirms it is using WorkflowTaskResults, and that the controller itself has the appropriate RBAC to read WorkflowTaskResults, since you've said you have custom RBAC?
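A couple of commands that might help with that verification; namespace and service account names here are placeholders:

```bash
# Look for WorkflowTaskResult activity in the wait container of one of the stuck pods
kubectl logs <pod-name> -c wait -n <workflow-namespace> | grep -i taskresult

# Check that the controller's service account is allowed to read WorkflowTaskResults
kubectl auth can-i list workflowtaskresults.argoproj.io \
  --as=system:serviceaccount:<argo-namespace>:<controller-serviceaccount> \
  -n <workflow-namespace>
```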
Here are some logs from the wait container:
It looks like it's using the wrong version. I'll look into that.
@Joibel It looks like that was the issue. I can delete some of those example templates to prevent cluttering up this thread, if you like. It looks like I'm running into a new error though: workflows are now attempting to use the service account of the namespace the workflow is running in to patch pods:
which wasn't what our RBAC was set up for in the previous version. Is this expected new behavior, or is there a configuration I need to set to make the controller use its own token for these API calls?
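Not an authoritative answer, but for comparison, a sketch of a namespace-scoped Role/RoleBinding along the lines of the documented executor RBAC, with pods patch added since that is the call failing above. Names and the bound service account are assumptions; check the Workflow RBAC docs for your version:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: executor              # placeholder name
  namespace: my-workflows     # namespace the workflow pods run in (assumption)
rules:
  - apiGroups: ["argoproj.io"]
    resources: ["workflowtaskresults"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["patch"]          # only if your executor is still patching pods, as above
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: executor
  namespace: my-workflows
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: executor
subjects:
  - kind: ServiceAccount
    name: default             # the service account the workflow pods run as (assumption)
    namespace: my-workflows
```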
@sstaley-hioscar This is the root cause: the wait container is running an older version. When upgrading from version 3.5.1 or below directly to version 3.5.5 or above, if there are running workflows in the cluster, these workflows will get stuck in the Running state.
If you're upgrading from a version which does not record taskResultsCompletionStatus, you will hit this. It is a consequence of the choice made in #12537: the missing completion status blocks the controller from making any progress, and means upgrades over this boundary with running workflows will always fail to complete in-flight workflows.
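For anyone checking whether a stuck workflow is in this state, something like the following may help (field and label names as discussed in this thread; verify against your version):

```bash
# Per-node completion status tracked on the Workflow; missing or false entries
# keep the workflow in the Running phase
kubectl get wf <workflow-name> -n <namespace> \
  -o jsonpath='{.status.taskResultsCompletionStatus}'

# WorkflowTaskResults created by the wait containers for that workflow
kubectl get workflowtaskresults.argoproj.io -n <namespace> \
  -l workflows.argoproj.io/workflow=<workflow-name> --show-labels
```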
I found another issue: the task of the wf had been cleaned up, but the wf is still running.
The taskResultsCompletionStatus of the wf is:
The logs of the argo controller are:
I have analyzed this issue before, in #12103 (comment). According to my analysis, this will only affect the label handled in argo-workflows/workflow/controller/taskresult.go (lines 66 to 73 in d7495b8).
I can't think of a scenario where taskResultsCompletionStatus would be missing, since the WorkflowTaskResult is always created by the wait container.
I'll let Alan check in more detail; sorry I didn't go through the whole thread too closely, I just thought I'd answer an outstanding question I stumbled upon.
We are having a similar issue: when, in a fan-out, only a few jobs fail, the workflow stays in the Running state, despite the fact that the exit handler has been called.
We have seen similar problems. In our case, in a workflow with steps where at least one step retried, we see sporadic occurrences where all steps appear as completed successfully and the workflow shows as green in the UI, but the workflow itself never leaves the Running state. We have seen this behaviour occasionally, in sporadic workflows, ever since upgrading to 3.5.6.
In 3.4 and 3.5, #13454 is supposed to mitigate this and some other scenarios where a Pod may have gone away but the WorkflowTaskResults are not marked as completed. I'm unsure this is the last fix for these issues.
#13454 cannot fix issues like #13373 or the one mentioned in #12103 (comment); see argo-workflows/workflow/controller/operator.go, line 2349 in 2192344.
#12574 (comment) @shuangkun Can you take a look at this?
I don't understand why this issue is marked as only P3. Surely this should be treated as a Priority 1 issue?
@jswxstw I discussed your comment above and the OOM scenario with @isubasinghe last night during the contributor meeting, and he suspected that #13454 missed the scenario of a Pod that failed/errored but was not yet GC'd (so it exists, it's not yet absent). Retry scenarios might also need some specific code (Pod errored but the user has a retryStrategy).
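For context, a minimal sketch of the retry scenario being described, i.e. a template with a retryStrategy that retries on errored pods. Names and image are illustrative only:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-on-error-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "2"
        retryPolicy: OnError   # retry when the step errors (as opposed to fails)
      container:
        image: alpine:3.19
        command: [sh, -c, "echo hello"]
```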
@agilgur5 Pod errored but the user has a retryStrategy
#12993, which covered the general case and was mostly fixed by #13454, was a P1.
I have submitted a new PR, #13491, to fix this scenario.
We are having a similar issue with one of our workflows. It is a very simple container, but it is not finalizing even though it has completed. The output of the wait container shows this:
Update: I found out why our pods were hanging. We use Linkerd, and disabling sidecar injection on the workflow pods resolved it:

```yaml
metadata:
  annotations:
    linkerd.io/inject: disabled
```
The Pod wouldn't be Completed while the injected sidecar was still running; see the "Sidecar Injection" page of the docs.
We had a similar issue and used the following kill command for Argo Workflows:
With this pod annotation added to the workflow, the Linkerd proxy is killed after the workflow finishes.
Would you like to add that annotation to the examples in the "Sidecar Injection" page I mentioned above?
@zhucan Can you create a new issue for it?
@jswxstw sure
To be clear for other readers who find themselves here: this issue pre-dates the v3.5.3-v3.5.10 WorkflowTaskResults issues (i.e. #12993 et al). If you're here with a similar v3.5.x issue, that is different; please update to v3.5.11+. If you still have a bug, please file a separate issue with a reproduction. I've edited the issue description to mention this since we're getting several unrelated comments. I may also lock this issue as "Off-topic" as such, but leave it open. I believe only Approver+ could comment then, which would be suboptimal given the other contributors here, so I'd prefer not to do that.
edited by agilgur5: the symptoms of this issue are similar to issues in v3.5.3-v3.5.10, but are different. If you have similar symptoms in v3.5.x, please update to v3.5.11+. You likely had #12993 or similar v3.5.x issues.
If you still have similar symptoms after updating to v3.5.11+, please file a new bug report with a reproduction.
This bug was written against v3.4.11 and so has a different root cause than those issues, despite the similar symptoms.
Pre-requisites
What happened/what you expected to happen?
the workflow has been running for more than 20 hours, even though i have activeDeadlineSeconds set at 12 hours.
the workflow just has a single step, which also shows as 'running' in the argo ui. but looking at its logs shows that it has completed the code that i expect for that step and also shows
time="2023-10-29T00:53:09 UTC" level=info msg="sub-process exited" argo=true error="<nil>"
at the end of the main log. the pod itself for that step is in Completed state. there are other workflows that have completed as expected during this time, and no other workflows running right now. note this exact workflow has successfully run 1000s of times in the past, so i know my spec/permissions are correct.
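For illustration only, since the reporter's actual spec wasn't shared: a hypothetical single-step workflow of the shape described above, with a 12-hour activeDeadlineSeconds.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: single-step-
spec:
  entrypoint: main
  activeDeadlineSeconds: 43200   # 12 hours, as described above
  templates:
    - name: main
      container:
        image: alpine:3.19                 # placeholder image
        command: [sh, -c, "echo done"]     # placeholder for the real workload
```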
Version
3.4.11
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from in your workflow's wait container