v2.10/v2.11/latest(21st Sep): Too many warn & error messages in Argo Workflow Controller (msg="error in entry template execution" error="Deadline exceeded") #4048
Comments
Maybe fixed by #3905. Does the workflow have a DAG with 1000s of items? |
@sarabala1979 Yes, we do. If these messages are expected, does that mean I should ignore these warnings and errors? @alexec No, the workflow does not have a DAG with 1000s of items. It has a DAG with about 8 items. |
When the deadline is exceeded it may be possible for the workflow to get corrupted, so the reconciliation aborts. Reconciliation is allowed 10s and can take longer if, for example:
As you are running many workflows, you are probably flooding the Kubernetes API. Have you tried limiting work in progress, e.g. using parallelism or semaphores (see the sketch below)? We are working on various improvements in v2.11 and v2.12 to support large scale. |
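A minimal sketch of the parallelism suggestion above, assuming the `spec.parallelism` field documented for Argo Workflows; the workflow itself is illustrative only and not taken from this thread:

```sh
# Illustrative only: cap a single workflow at 2 concurrent pods via
# spec.parallelism, submitted with an inline heredoc.
cat <<'EOF' | kubectl create -f -
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-limited-
spec:
  entrypoint: main
  parallelism: 2              # at most 2 pods of this workflow run at once
  templates:
  - name: main
    steps:
    - - name: sleep
        template: sleep
        withSequence:
          count: "8"          # fan out 8 steps; only 2 run concurrently
  - name: sleep
    container:
      image: alpine:3.12
      command: [sh, -c, "sleep 10"]
EOF
```

A controller-wide cap (the `parallelism` key in the workflow-controller ConfigMap) or semaphores can serve the same purpose; the workflow-level field is shown here only because it is the smallest self-contained example.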
@alexec Yes, I can see that Kubernetes API requests are taking a long time during my large workload execution. I can also see that my workflows are getting stuck in the Running state even though no Pod is actually in the Running state.
What kind of improvements are you making to support large scale? |
I would upgrade to v2.11.0 and try with that version. I'm not sure we will have fully fixed this problem until v2.12 mind you. |
@alexec Even after upgrading to v2.11.0 I can still see these warnings and errors.
This is when I started 100 concurrent workflows, each with about 6-7 Pods. |
This workflow is interesting. I got 3 such failed workflows in a setup of 200 concurrently running workflows.
I have attached all logs related to this workflow for further debugging. Highlights of the logs:
@alexec Can you tell me what this means? Looking at the code: https://github.com/argoproj/argo/blob/5be254425e3bb98850b31a2ae59f66953468d890/workflow/controller/steps.go#L279 |
What I understood from my observations: some workflows are stuck in the Pending state forever and the controller is not picking them up at all; some are failing randomly because the workflow controller is not able to commit state even though the DAG task or sequential step succeeded; and the controller is taking more time to process workflows due to slowness in the K8s APIs. The slowness is expected and that's fine, but the inconsistent state of workflows worries me a lot. @sarabala1979 Any help here? |
Can anyone please provide some executable reproduction steps, i.e. a workflow that reproduces this behaviour? It is extremely time-consuming (and often impractical) to fix bugs where this has not been provided. |
This is the line of code deeming failure: |
@alexec: Workflow:
Submit 1000 of them at once - load_argo.sh
Using an Azure Kubernetes cluster. |
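The load_argo.sh script itself is not included in this thread; the following is only a hypothetical reconstruction of what such a bulk-submission loop might look like, assuming the argo CLI is installed and the attached workflow is saved locally as workflow.yaml:

```sh
# Hypothetical sketch, not the original load_argo.sh: submit 1000 copies of
# the workflow as fast as possible to reproduce the controller load.
for i in $(seq 1 1000); do
  argo submit workflow.yaml &
done
wait   # block until every submission has been sent
```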
Do you have time for a Zoom? |
Hey @alexec - sure. Let me know a time that works for you. |
Hi @alexec - I am in the IST time zone. I was also wondering how we will connect. You most likely can't see my email address; I can't see yours at least. Is there some way I can message you? |
I'm in PST, so we do not have overlapping working hours with IST AFAIK - so no Zoom. |
We'll adjust the timing at our end. Please let us know your availability or email address for a Zoom call, and I'll send you an invite. It will be very helpful if we are able to nail down the exact issue. |
@BlackRider97 which TZ are you in please? Could you do between 4:30pm and 6pm PST today? Or between 9am and 10am tomorrow (Thursday)? |
@alexec @AbhishekMallick I am in the IST (GMT+5:30) time zone. @alexec Please let me know how I can send you a Zoom call invite. |
I'm busy at 9am, but free at 9:30am. Find me here: https://intuit.zoom.us/j/9471219298?pwd=L3dwWklkclZNUk1VYkJBOTE0SGREdz09 |
Looks like the |
This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further. |
Can you confirm whether the issue resolved straight away, or only after 30m? |
It appears to have resolved right away. |
Can you please try |
So far so good using the |
Is it resolved straight away? |
Yes, it started to resolve very quickly after applying the changes. Curious, what was the cause of the issue? |
Can I confirm that they were resolved using :latest or using :no-sig? |
Initially trying |
Ok. So we can shelve :no-sig. |
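For anyone following along, switching the controller between the tags discussed above can be done roughly as follows; the deployment and container names are assumed to match the default install manifests, so adjust the namespace and tag as needed:

```sh
# Sketch only: point the workflow controller at a different image tag
# (e.g. :latest) and let the deployment roll out the change.
kubectl -n argo set image deployment/workflow-controller \
  workflow-controller=argoproj/workflow-controller:latest
```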
The issue description and comments sound exactly like what we're experiencing on 2.11.6: "stuck" workflow steps without running pods. @alexec we briefly chatted in Slack and you suggested upgrading to 2.11.7 - would |
@ebr try them both. |
Still an issue with 2.11.7 (I can share some screenshots privately or Zoom). I also raised |
You should be able to revert, but I'd recommend you try in a test env. |
We are not lucky enough to have a test env :) We're in the middle of a massive production run right now, so we're doing it live. |
Does it come unstuck after 20m (at which time there is a resync)? |
Unfortunately, no. E.g. we have a couple of wfs right now that have been "stuck" this way with 0 pods (because they have been GC'd), but no new pods are being created for those wfs. Other wfs (like the chunked-up ones I described above) are happily creating pods though. |
Can we Zoom? I'd like to investigate online. |
That would be fantastic. Could we organize it for tomorrow, as it's getting late here (EST)? I'll make sure to keep the stuck workflows around (though it's reproducible, so that's not an issue). I can message you on Slack to organize, if that's ok. |
I'm in PST. |
We're seeing the exact same issues in 2.11 as well when submitting hundreds of workflows. If possible, I'd be delighted to join the call. |
Ping me on the Slack channel. |
Closing as we have #4560 now. |
Summary
Too many warning and error messages inside Argo workflow controllers
Argo workflow controller logs
$ kubectl logs --tail=20 workflow-controller-cb99d68cf-znssr
Workflows are getting stuck after some time and are not completing even after 12+ hours, while normal execution takes around 1 minute.
I am creating almost 1000 workflows, each containing 4 Pods, in a short span of time. There are enough worker nodes to do the processing, so there are no issues as such from the K8s cluster side.
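A quick way to gauge how often the controller hits this error is to filter its logs for the message in the issue title; this assumes the controller runs in the argo namespace as a deployment named workflow-controller:

```sh
# Pull recent controller logs and count the recurring "Deadline exceeded" entries.
kubectl -n argo logs deploy/workflow-controller --tail=5000 \
  | grep -c 'Deadline exceeded'
```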
Diagnostics
What version of Argo Workflows are you running?
Argo v2.10.1
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.