v2.9.0 workflow-controller OOM Killed #3400
Comments
Any chance this is related to a fix in v2.9.1? "fix: Running pods are garaged in PodGC onSuccess" (the PodGC setting in question is sketched after this thread)
@jjtroberts you say v2.9.0 in the description, but v2.9.1 in your message. Can you confirm whether you have tested with both v2.9.0 and v2.9.1, please?
I saw the fix in v2.9.1 release only after posting this bug and am currently testing that version. The OOM killer occurred in v2.9.0. I'll update once I know more about v2.9.1 results.
Thank you.
At the end of the recursive step in v2.9.1, memory still exceeds node capacity:
Fix in v2.9.2 |
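For context on the quoted fix, the PodGC behaviour it touches is configured in the workflow spec. Below is a minimal, illustrative sketch, not the reporter's attached template: the podGC field names follow the Argo Workflows API, while the workflow name, image, and command are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: podgc-sketch-        # hypothetical name, not the reporter's workflow
spec:
  entrypoint: main
  podGC:
    strategy: OnPodSuccess           # garbage-collect each pod as soon as it succeeds
  templates:
    - name: main
      container:
        image: alpine:3.12
        command: [sh, -c, "echo done"]
```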
Checklist:
What happened:
We upgraded from v2.4.3 to v2.9.0, and our heaviest process hung because the workflow-controller consumed all available memory on the node, causing the OOM killer to reap the pod 200+ times. When I attempted to terminate the workflow, Argo reported that it had terminated the workflow successfully, but the status never changed; I had to delete the workflow to remove it altogether. There were no running pods in the client namespace at the time.
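Not part of the original report: while the leak is being fixed, one hedged mitigation is to put a memory limit on the controller itself, so a runaway controller is OOM-killed at the pod level instead of starving the node. A minimal patch sketch, assuming the stock install layout (namespace argo, deployment and container both named workflow-controller); the values are placeholders, not recommendations.

```yaml
# Strategic-merge patch fragment for the workflow-controller Deployment.
# Apply with kubectl patch or a kustomize overlay; numbers below are illustrative only.
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          resources:
            requests:
              memory: 256Mi
            limits:
              memory: 2Gi   # placeholder ceiling; size to your cluster
```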
What you expected to happen:
The workflow-controller should have remained as stable as it had in v2.4.3.
How to reproduce it (as minimally and precisely as possible):
Run the attached workflow with substitute images using argo-server v2.9.0 on Kubernetes v1.17.2.
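The attached workflow.txt is not reproduced here. For readers unfamiliar with the "recursive step" mentioned in the comments, the sketch below is a generic, hypothetical recursive Argo workflow (modelled on the upstream coinflip-recursive pattern) with the same PodGC setting enabled; it is not the reporter's actual template, and all names and images are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: recursive-podgc-sketch-
spec:
  entrypoint: loop
  podGC:
    strategy: OnPodSuccess             # same PodGC setting as in the quoted fix
  templates:
    - name: loop
      steps:
        - - name: flip
            template: flip-coin
        - - name: recurse              # the template invokes itself until the flip lands "heads"
            template: loop
            when: "{{steps.flip.outputs.result}} == tails"
    - name: flip-coin
      script:
        image: python:3.8-alpine
        command: [python]
        source: |
          import random
          print("heads" if random.randint(0, 1) else "tails")
```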
Anything else we need to know?:
Environment:
Other debugging information (if applicable):
syslog: messages.txt
workflow-controller logs: workflow_controller_logs_v2.9.0.txt
workflow template: workflow.txt
node memory stats:
You can see where it ran out of memory last night around 10:00-10:30 PM CST and was terminated; cron then kicked the workflow off again the next morning, and memory grew steadily until the controller consumed everything available on the node.
Message from the maintainers:
If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.