Workflow steps fail with a 'pod deleted' message. #4565
Comments
Can you please try running your controller with |
Can confirm that we are seeing zero occurrences of this with |
If I create an engineering build - can you please test it? |
Certainly! Will it also include what we discussed re: #4560? |
Please can you try |
@alexec should I also remove the |
Yes. Remove |
Still seeing ~3% of pods being deleted on |
@ebr - no - that is an unsupported development flag. This is a blocking issue for v2.12 and does need to be fixed. |
I've just updated the build with a longer grace period and more logging. Could you please try it out? |
no longer seeing this on |
can you tell me if you see |
No |
That's really odd. You should absolutely see it. What is your log level? |
Log level |
there are definitely dozens of pods being started and completed, and none of the given strings appear in the controller logs.... I double checked that I'm still on |
Do you see |
No, I'm not seeing that. but I'm no longer seeing deleted pods either |
Ok, I did not |
Why am I asking again? Seeing |
Sounds good, I will try right now before signing off for the day. thanks for all your attention to this! |
I wasn't able to reproduce this at all (still on |
Update: we did see quite a lot of deleted pods through the night. It's hard to say exactly how many because we were running over 12,000 pods, but it's definitely in the thousands. This is still on |
Not fixed - but less of a problem? |
Ah - the |
Done. |
hey @alexec - thank you - I was offline for a couple of days, but will test it as soon as I can. It is unlikely to happen today.
This is correct. Definitely not as reproducible anymore. For what it's worth, my team has been running jobs on the |
Some more deleted pods today; I switched to
The workflows seem to be running successfully and without changes though. I'm debating whether to revert to |
It is Thanksgiving at the moment - I'll look in more detail on Monday. |
No worries and no rush - happy Thanksgiving! I actually did change to I've been at Am I the only user reporting this? Perhaps it would be helpful for someone else to also try reproducing the issue, just to be sure it's not a red herring caused by something related to our EKS cluster setup (I don't see how it could be, but stranger things have happened). |
@ebr I think I've heard of one or two users mention this (but not reporting this by creating a ticket), so I assume they are not seriously impacted. Let me know how you get on with |
I believe this has been likely fixed/mitigated on master. |
We can confirm the issue gone in our runs on a build off the commit 625e3ce in master. |
Hi @alexec, we are seeing this as well. What was the fix here? Or, what can we add to our workflow files to introduce the additional grace period? |
+1 to this. @alexec, I -- and I'm sure others -- would appreciate a general explanation of what the issue and corresponding fixes ended up being. That'd help teams like mine who run into this issue figure out if upgrading our version of Argo is the right remediation. I understand though that this issue was closed nearly two years ago, which might complicate things. :P |
Helpful, thank you. |
Summary
Maybe related to #3381?
Some of the workflow steps end up in Error state, with 'pod deleted'. I am not sure which of the following data points are relevant, but listing all observations:
- PodGC: strategy: OnPodSuccess.
- withItems loop
Diagnostics
What Kubernetes provider are you using?
docker
What version of Argo Workflows are you running?
v2.12.0rc2 for all components
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
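For readers unfamiliar with the two settings named in the observations above, a minimal sketch of a workflow that combines a pod GC strategy of OnPodSuccess with a withItems loop might look like the following. This is not the reporter's actual workflow: the names (pod-deleted-sketch-, main, echo-item), the image, and the item values are hypothetical, and nothing here is claimed to reproduce the issue.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pod-deleted-sketch-   # hypothetical name
spec:
  entrypoint: main
  # Delete each pod as soon as it succeeds (the strategy mentioned in the report).
  podGC:
    strategy: OnPodSuccess
  templates:
    - name: main
      steps:
        # One step fanned out over a list; each item runs in its own pod.
        - - name: per-item
            template: echo-item
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
            withItems: [1, 2, 3]      # hypothetical items
    - name: echo-item                 # hypothetical template
      inputs:
        parameters:
          - name: item
      container:
        image: alpine:3.12            # hypothetical image
        command: [echo, "{{inputs.parameters.item}}"]
```

With this shape, every item in the loop gets its own pod, and each pod becomes eligible for deletion once it succeeds, which is why the reporter listed both settings as possibly relevant.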