
Workflow steps fail with a 'pod deleted' message. #4565

Closed
ebr opened this issue Nov 19, 2020 · 40 comments
Assignees: alexec
Labels: type/bug, type/regression (regression from previous behavior, a specific type of bug)
Milestone: v2.12


ebr commented Nov 19, 2020

Summary

Maybe related to #3381?

Some of the workflow steps end up in an Error state with a pod deleted message. I am not sure which of the following data points are relevant, but here are all of our observations:

  • the workflow uses PodGC: strategy: OnPodSuccess (a minimal sketch of the spec shape is below, after this list)
  • we are seeing this for ~5% of workflow steps
  • affected steps are part of a withItems loop
  • the workflow is not large: ~170 to 300 concurrent nodes
  • this has been observed since deploying v2.12.0rc2 yesterday (including the v2.12.0rc2 executor image); we were previously on v2.11.6 and briefly on v2.11.7, and did not see this there
  • k8s events confirm the pods ran to completion
  • cluster scaling has been ruled out as the cause: this is observed on multiple k8s nodes, all of which are still running
  • we have not tried the same workflow without PodGC yet
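
A minimal sketch of the shape of the workflow, in case it helps reproduce (template names, image, and the item list here are placeholders, not our real spec):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pod-deleted-repro-
spec:
  entrypoint: main
  # pods are garbage-collected as soon as they succeed
  podGC:
    strategy: OnPodSuccess
  templates:
    - name: main
      steps:
        - - name: work
            template: do-work
            arguments:
              parameters:
                - name: item
                  value: "{{item}}"
            # the real workflow fans out to ~170-300 concurrent items
            withItems: ["a", "b", "c"]
    - name: do-work
      inputs:
        parameters:
          - name: item
      container:
        image: alpine:3.12
        command: [sh, -c, "echo {{inputs.parameters.item}}"]
```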

Diagnostics

What Kubernetes provider are you using?

docker

What version of Argo Workflows are you running?

v2.12.0rc2 for all components


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@ebr ebr added the type/bug label Nov 19, 2020

alexec commented Nov 19, 2020

I'm betting this is caused by #4024, maybe when it interacts with #4025.


alexec commented Nov 19, 2020

Can you please try running your controller with INFORMER_WRITE_BACK=false?
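
For example, something roughly like this on the controller Deployment should do it (the Deployment name and layout here assume a default install; adjust to your setup):

```yaml
# Sketch of a strategic-merge patch for the workflow-controller Deployment
# that sets the env var (names assume a default install).
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: INFORMER_WRITE_BACK
              value: "false"
```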

@alexec alexec added the type/regression Regression from previous behavior (a specific type of bug) label Nov 19, 2020
@alexec alexec added this to the v2.12 milestone Nov 19, 2020
@alexec alexec self-assigned this Nov 19, 2020

ebr commented Nov 19, 2020

Can confirm that we are seeing zero occurrences of this with INFORMER_WRITE_BACK=false, on a 10x larger workflow. We'll leave it like this for now. Are there any performance/reliability issues to look out for with this flag?


alexec commented Nov 19, 2020

INFORMER_WRITE_BACK=false only points to a solution; it is not the solution itself, and it comes with a performance cost.


alexec commented Nov 19, 2020

If I create an engineering build, can you please test it?


ebr commented Nov 19, 2020

Certainly! Will it also include what we discussed re: #4560?


alexec commented Nov 19, 2020

Please can you try argoproj/workflow-controller:grace and check whether allowing a short grace-period before marking node as error shows up in the logs?
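
Switching to the engineering build is just a matter of pointing the controller Deployment at that image, roughly like this (again assuming a default install):

```yaml
# Sketch of a patch to run the engineering build; imagePullPolicy: Always
# matters because the :grace tag is mutable and gets re-pushed.
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          image: argoproj/workflow-controller:grace
          imagePullPolicy: Always
```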


ebr commented Nov 19, 2020

@alexec should I also remove the INFORMER_WRITE_BACK envvar?


alexec commented Nov 19, 2020

Yes. Remove INFORMER_WRITE_BACK.


ebr commented Nov 19, 2020

Still seeing ~3% of pods being deleted on :grace, and no log messages mentioning grace-period. I'd like to revert to 2.12.0rc2 and set INFORMER_WRITE_BACK=false for now; is that safe aside from the performance cost?


alexec commented Nov 19, 2020

@ebr - no - that is an unsupported development flag.

This is a blocking issue for v2.12 and does need to be fixed.


alexec commented Nov 19, 2020

I've just updated the build with a longer grace period and more logging. Could you please try it out?


ebr commented Nov 19, 2020

No longer seeing this on :grace! But no log messages either.


alexec commented Nov 19, 2020

Can you tell me if you see recentlyStarted in your logs?


ebr commented Nov 19, 2020

No recentlyStarted. I'm going to try reproducing this again on 2.12.0rc2.


alexec commented Nov 19, 2020

That's really odd. You should absolutely see it. What is your log level? info? Is your imagePullPolicy set to Always?


ebr commented Nov 19, 2020

Log level is debug, as we set during our Zoom troubleshooting, and I haven't changed it; imagePullPolicy is Always.
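
For reference, the relevant part of our controller Deployment looks roughly like this (lightly edited, other args omitted; this assumes the log level is passed via the controller's --loglevel flag):

```yaml
# Sketch of the relevant part of the workflow-controller Deployment
# (lightly edited, other args omitted).
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          image: argoproj/workflow-controller:grace
          imagePullPolicy: Always
          args:
            - --loglevel
            - debug
```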


ebr commented Nov 19, 2020

There are definitely dozens of pods being started and completed, and none of the given strings appear in the controller logs. I double-checked that I'm still on :grace.


alexec commented Nov 19, 2020

Do you see "Pod for node deleted" in the logs? You should not.


ebr commented Nov 19, 2020

No, I'm not seeing that, but I'm no longer seeing deleted pods either.


alexec commented Nov 19, 2020

OK, I had not pushed the very latest version to Docker Hub. I have now pushed it, if you have time to test and check for recentlyStarted in your logs.


alexec commented Nov 19, 2020

Why am I asking again? Seeing recentlyStarted: true will give me confidence that the fix is correct, and that it wasn't something else that changed.


ebr commented Nov 19, 2020

Sounds good, I will try right now before signing off for the day. Thanks for all your attention to this!


ebr commented Nov 20, 2020

I wasn't able to reproduce this at all (still on :grace), and I see none of the above log messages.
Tomorrow I'm going to try reproducing this again on 2.12.0rc2.
A silly thought perhaps, but could this have been caused by another workflow? The only thing that changed throughout the day (aside from my testing) was that we deleted some previously failed workflows.


ebr commented Nov 20, 2020

Update: we did see quite a lot of deleted pods through the night. It's hard to say how many exactly because we were running over 12,000 pods, but it's definitely in the thousands. This is still on :grace. So I guess it's not as immediately reproducible, but it's definitely still happening.
I can't say whether there were any log messages with recentlyStarted: true, as the controller is pumping out such a massive log volume that they've already been rotated out.


alexec commented Nov 23, 2020

Not fixed - but less of a problem?


alexec commented Nov 23, 2020

Ah - the :grace version does not have any fixes from :mot (which is now merged to :latest). If I create a new engineering build of :grace that includes those, could you test them together?


alexec commented Nov 23, 2020

Done. argoproj/workflow-controller:grace has both fixes.


ebr commented Nov 24, 2020

Hey @alexec, thank you. I was offline for a couple of days, but will test it as soon as I can. It is unlikely to happen today.

> Not fixed - but less of a problem?

This is correct. It's definitely not as reproducible anymore. For what it's worth, my team has been running jobs on the :mot build since Friday (4 days), and many very large workflows have completed perfectly, while some smaller ones experienced pod deletion. There doesn't seem to be any systematic pattern to it; it's very intermittent. I'll try the new :grace build when it's safe to do so; we're running some really critical workloads right now, so I can't promise to do it soon, unfortunately.


ebr commented Nov 25, 2020

Some more deleted pods today; I switched to :grace. Seeing the following:


  • significant increase in RAM usage of the workflow controller
  • still some deleted pods, very few and very intermittent
  • still not seeing any log messages like recentlyStarted: true or grace, but given the frequent controller restarts and infrequent pod deletions, I can't be sure.

The workflows seem to be running successfully and without changes though, so I'm debating whether to revert to :mot or just leave it alone. The increased RAM usage is not an issue for us.


alexec commented Nov 25, 2020

  • recentlyStarted was logged at debug level. I've increased this to info level and published a new :grace image.
  • I've added RECENTLY_STARTED_POD_DURATION=10s; you could try setting this to a higher value, e.g. RECENTLY_STARTED_POD_DURATION=30s, to see if that improves the issue (see the sketch at the end of this comment).
  • :mot is merged to master and the fix is on :latest.
  • I would not expect to see increased RAM usage. Are you able to test :latest so we can be sure that the issue is with :grace and not something else?

It is Thanksgiving at the moment - I'll look in more detail on Monday.
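
The env var can be set roughly like this (same caveats as before about the Deployment name; the value is just an example to experiment with):

```yaml
# Sketch of a strategic-merge patch for the workflow-controller Deployment
# that lengthens the recently-started-pod window (example value).
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: RECENTLY_STARTED_POD_DURATION
              value: "30s"
```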


ebr commented Nov 25, 2020

No worries and no rush - happy Thanksgiving! I actually did change to :latest shortly after posting the last reply, as I wasn't comfortable with the spurious restarts. And yes, on :latest the restarts stopped, and the RAM usage went back down and flattened. I noticed that on :grace the RAM usage increased proportionately to the number of workflow pods in the cluster, which is not the case on :latest.

I've been at debug log level the entire time, so I should've seen the recentlyStarted messages, but perhaps the issue is too intermittent. We don't have log aggregation in place, and tailing the logs doesn't seem to be sufficient; there's just too much volume relative to the frequency of these occurrences. Anyway, I'm going to avoid changing builds at least today and tomorrow, so this may have to wait until Monday for me as well.

Am I the only user reporting this? Perhaps it would be helpful for someone else to also try reproducing the issue, just to be sure it's not a red herring caused by something related to our EKS cluster setup (I don't see how it could be, but stranger things have happened).


alexec commented Nov 30, 2020

@ebr I think I've heard one or two users mention this (but not report it by creating a ticket), so I assume they are not seriously impacted.

Let me know how you get on with :grace with RECENTLY_STARTED_POD_DURATION=30s.


alexec commented Dec 16, 2020

I believe this has likely been fixed/mitigated on master.

@alexec alexec closed this as completed Dec 16, 2020
@vladlosev

We can confirm the issue is gone in our runs on a build off commit 625e3ce on master.


dynamicmindset commented Feb 11, 2021

Hello, we faced the same issue with Argo v2.12.8.
Garbage collection is set to "onPodCompletion".
Affected steps are part of a withItems loop, and it happens with both 100 and 50 parallelism (we haven't tried less).

[screenshot: 36 out of 2000 steps failed with "pod deleted"]

If I change garbage collection to wait until the end of the workflow, it does not reproduce anymore.
Do you have any suggestions on how I can prevent this from happening?
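
For reference, the change that avoids it for us is deferring pod GC until the workflow finishes, e.g. (a sketch, not our full spec):

```yaml
# Sketch: defer pod garbage collection until the whole workflow completes,
# at the cost of keeping completed pods around until then.
spec:
  podGC:
    strategy: OnWorkflowCompletion
```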

@anshumanbh

Hi @alexec, we are seeing this as well. What was the fix here? Or, what can we add to our workflow files to introduce the additional grace period?


jinnovation commented Mar 16, 2022

> Hi @alexec, we are seeing this as well. What was the fix here? Or, what can we add to our workflow files to introduce the additional grace period?

+1 to this. @alexec, I (and I'm sure others) would appreciate a general explanation of what the issue and the corresponding fixes ended up being. That'd help teams like mine who run into this issue figure out whether upgrading our version of Argo is the right remediation.

I understand though that this issue was closed nearly two years ago, which might complicate things. :P


alexec commented Mar 16, 2022
