Zombie workflows (aka "stuck workflows") #4560
Testing this on
Further checks (about every 10 minutes) show that the node count is going up slowly, then pausing for several minutes with the same symptom [1]. Eventually the work stalls around item 1993.
[1]
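For anyone watching a run like this, one quick way to check whether the node count is still advancing is to poll the workflow status (a sketch only; the workflow name, the `argo` namespace, and the use of `jq` are assumptions, and the count will be empty if node status has been offloaded to the database):

```bash
# Print the number of nodes recorded in the workflow status every 10 minutes.
while true; do
  count=$(kubectl get wf my-large-workflow -n argo -o json | jq '.status.nodes | length')
  echo "$(date -u +%FT%TZ) nodes=${count}"
  sleep 600
done
```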
Thank you. Can you try this?
Firstly, can you try
I did have
Seeing same behaviour on
By the way, I do have a couple of "zombie" workflows sitting here since yesterday, in hopes that one of these tests brings them back to life (I don't care about them otherwise, but they can be useful in this way). So far they haven't budged.
Here's another perhaps relevant point: for a workflow that is both (1) stuck, and (2) has a pod that's running but is "frozen" (for a reason unrelated to Argo, e.g. the Python process in the container locks up), I can get the workflow "unstuck" temporarily, to some degree, by forcing the frozen pod to fail. I confirmed this hack works on 4 of my stuck workflows. All 4 of them had exactly one frozen pod (likely a coincidence). If a stuck workflow doesn't have any frozen pods, I have no way to manually kick it this way. To be sure, I am not suggesting that frozen containers are causing the workflows to get stuck; this is merely a "lucky" workaround.
@alexec I'm currently on your
I'm going to try
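The comment above doesn't include the exact command used to force the frozen pod to fail. A minimal sketch of one way to do it with plain kubectl, assuming a hypothetical workflow/pod name and the `argo` namespace (this force-deletes the pod rather than failing its container, so it only illustrates the general idea):

```bash
# List the pods belonging to the stuck workflow (workflow name is hypothetical).
kubectl get pods -n argo -l workflows.argoproj.io/workflow=my-stuck-workflow

# Force-delete the frozen pod so the controller sees a terminal pod event.
kubectl delete pod my-stuck-workflow-1234567890 -n argo --grace-period=0 --force
```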
on
everything just lit up like a Christmas tree! 😻 All "stuck" steps started up (even from a 2-day-old workflow). @alexec not to jinx it, but I think we have a winner!
Can I play this back:
Correct!
We were able to replicate the findings here - we can verify that the CPU usage is way down with hundreds of concurrent workflows running.
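For anyone comparing before/after, one way to spot-check the controller's CPU usage is with `kubectl top` (a sketch; assumes metrics-server is installed, the `argo` namespace, and the default `app: workflow-controller` label on the controller pod):

```bash
# Show current CPU/memory usage of the workflow controller pod(s).
kubectl top pod -n argo -l app=workflow-controller
```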
@tomgoren can you confirm which engineering build you ran please? |
I've created a POC engineering build that offloads and archives workflows to S3 instead of MySQL or Postgres. My hypothesis is that offloading there may be faster for users running large (5000+ node) workflows, and that archiving there may be more useful to many users. On top of this, it may be cheaper for many users. I challenge you to prove me wrong. #4582
Offloading to S3 would be very interesting and welcome. Will try it when I can! Over the past weekend we ended up running workflows of enormous size, on the
Excellent. Can you share a screenshot?
We don't have those workflows in the cluster anymore, and I didn't take a screenshot! I'll try to run one of the larger workflows again over the holidays to take the screenshot.
I think this is now fixed.
A "zombie workflow" is one that starts but does not complete. Pods are scheduled and run to completion, but the workflow is not subsequently updated.
It is as if the workflow controller never sees the pod changes.
Impacted users:

- All users have been running very large workflows.

Typically:

- `insignificant pod change` is not seen in the controller logs.
- `Deadline exceeded` is seen in the logs. Increasing the CPU and memory on the Kubernetes master node may fix this.

Things that don't appear to work (rejected hypotheses):
- `--burst` or `--qps` (QPS) settings.
- `--workflow-workers` or `--pod-workers` settings. These only impact concurrent processing.
- `ALL_POD_CHANGES_SIGNIFICANT=true`. Hypothesis: we're missing significant pod changes.
- `INFORMER_WRITE_BACK=false`.
Questions:

Users should try the following (see the deployment sketch after this list):

- `argoproj/workflow-controller:v2.11.7` - this is faster than v2.11.6 and all previous versions. Suitable for production.
- `argoproj/workflow-controller:latest`.
- `argoproj/workflow-controller:mot` with env `MAX_OPERATION_TIME=30s`. Make sure it logs `defaultRequeueTime=30s maxOperationTime=30s`. Hypothesis: we need more time to schedule pods.
- `argoproj/workflow-controller:easyjson`. Hypothesis: JSON marshaling is very slow.

If none of this works then we need to investigate deeper.
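A minimal sketch of how one of these builds could be rolled out with plain kubectl, taking the `mot` image as an example (the deployment/container name `workflow-controller` and the `argo` namespace are assumptions about a standard install, not part of this issue):

```bash
# Switch the controller to the "mot" engineering build.
kubectl set image deployment/workflow-controller \
  workflow-controller=argoproj/workflow-controller:mot -n argo

# Give the controller more time per reconciliation, per the hypothesis above.
kubectl set env deployment/workflow-controller MAX_OPERATION_TIME=30s -n argo

# Confirm the expected log line appears once the new pod starts.
kubectl logs deploy/workflow-controller -n argo | grep "maxOperationTime=30s"
```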
Related: