Zombie Workflows #4398
There are probably two bugs here:
Can you please add the logs from before the restart of the controller?
Can you see if …
Logs grepped for the workflow name, from BEFORE the restart, while the workflow is in zombie mode:
No output from …
Engineering build: …
Zombie keeps living:
You'll need to run a new workflow to see if …
Oops, okay! That is fully automated! I will provide it as soon as a newly submitted workflow finishes... We have looong running workflows ;)
This one still turned into a zombie (logs grepped for the workflow name):
But it seems like they are cleaned up from time to time!
@TekTimmy I'm assuming that …
Correct, there are still zombie workflows marked as "Running". And yes, that 30m sync seems to work, but even within those 30 minutes the zombies are blocking our queue. Let me know how I can help; we could also have a call or something.
That is good news. I'm going to make another fix and send it to you shortly.
Side note: maybe similar to #4048
Side note: I'm interested in how you want the progress to be shown in the CLI. Do you think this would be a good default? If you want to submit a PR, it would be as easy as adding one line to workflow_types.go.
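For illustration, a minimal sketch of the kind of change hinted at above, assuming the suggestion is to expose a progress field on the workflow status in workflow_types.go; the field name, type, and JSON tag here are assumptions, not the actual proposed line.

```go
// Sketch only: an assumed shape for a progress field on the workflow status
// in workflow_types.go. Field name, type, and tags are illustrative.
type WorkflowStatus struct {
	// ... existing status fields ...

	// Progress is a human-readable "done/total" indicator (e.g. "3/10")
	// that the CLI could print alongside the phase.
	Progress string `json:"progress,omitempty"`
}
```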
Can you please try …
Have set …
Going to monitor the zombies now.
Thanks for the hint! I'm pretty interested in supporting this project, but my …
The workflow controller stopped logging shortly after the error message.
Again with …
The error is gone, zombies are still summoned, and there is no log entry at all about those zombie workflows in the workflow-controller logs =/
Thank you for helping with the testing. I'm really keen to nail down a fix for this.
Can you tell me what labels the zombies have? Do they have a …
Thank YOU for developing! I'm happy that I can help.
These are all the labels:
chromosome: "1"
sample-dataset-id: 1A-0-C1_v1
sample-id: 1A-0-C1
type: wgs
workflow: gvc-annotation
workflows.argoproj.io/creator: system-serviceaccount-v1-0-mocca-cron
workflows.argoproj.io/phase: Running
Oh shit, does our own "workflow" label interfere?
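As a side illustration of how the controller-managed labels above can be used to spot zombies, here is a sketch that lists all Workflow objects still labelled `workflows.argoproj.io/phase=Running` via the Kubernetes dynamic client. The namespace and kubeconfig handling are assumptions; this is not how the controller itself works, just one way to query the cluster.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeconfig at the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// GroupVersionResource of the Argo Workflow CRD.
	wfGVR := schema.GroupVersionResource{
		Group:    "argoproj.io",
		Version:  "v1alpha1",
		Resource: "workflows",
	}

	// List workflows the controller has labelled as Running; workflows that
	// finished long ago but still show up here are the "zombies" discussed
	// above. "my-namespace" is a placeholder.
	wfs, err := client.Resource(wfGVR).Namespace("my-namespace").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "workflows.argoproj.io/phase=Running"})
	if err != nil {
		panic(err)
	}
	for _, wf := range wfs.Items {
		fmt.Println(wf.GetName())
	}
}
```

The user-defined labels (chromosome, sample-id, workflow, ...) are ignored by such a selector, which is one way to check whether the custom `workflow` label could be interfering; `kubectl get wf -l workflows.argoproj.io/phase=Running` should give the same list from the command line.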
We have just switched to winter time (-1 hour), so it's nearly midnight for me in the EST time zone... I have to get some sleep. Looking forward to testing some more tomorrow.
This doesn't make sense:
No archived workflow should ever be running. They should always be completed.
I'm starting to think something else is interfering. Can you run the workflow, but follow it using …
After discussions with alexec and some attempts to solve it, zombie workflows still occur from time to time, but they get cleaned up within an acceptable time. I was not able to reproduce a workflow stuck in phase …
I'm going to re-open. I think we should actually clean them up a bit quicker.
Summary
In our cluster, Workflows occur that seem to be forgotten / ignored by the workflow controller. We find Workflows in status "Running" that have either successfully finished all steps or have a failed step. This causes Workflows to stay "Running" forever and block resources.
When the workflow-controller pod is killed (and then automatically restarts), all workflows get cleaned up accordingly.
I expect the workflow controller to handle those workflows correctly. Could this be caused by insufficient resources for the workflow controller? Do you have scaling suggestions for ~800 Running workflows?
I get the Zombie workflows:
When running this command:
Diagnostics
AWS EKS 1.17 with Argo:latest
Workflow Controller Options:
Cleaned YAML output:
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.