Wait container stuck in multi-root workflows in 2.5-rc8 to 2.5.0 #2261
Comments
@tcolgate FYI |
@Arkanayan do you have an example of this problem? E.g. in the examples folder? |
Sorry @alexec, I am not able to reproduce this issue with the examples. |
If we cannot repro an issue, we cannot fix it. Please let me know if you're able to provide YAML that repros the issue. Do you believe the issue appeared in v2.5.0-rc8, i.e. it was not present in v2.5.0-rc7? If that is the case, we can reach out to the committers of that release and see what they think. |
I've just encountered this on v2.5.1; I'll try to provide a minimal example. In our case, it seems to be potentially related to #1493. EDIT: Still investigating. It may be related to our use of tqdm (https://github.com/tqdm/tqdm), which we did not turn off. I believe it produces quite a lot of output because it writes to stderr, and there has been a recent change to the handling of stderr. EDIT 2: Disabling tqdm seems to fix it for us. I now believe the problem occurs when you write a lot to stderr. |
I'm having this as well.
Running argo 2.6.0-rc2, the last line just repeats... EDIT: Disabling the |
It might be worth manually running the docker logs command on the host, check if it returns and how much data it returns. I don't /think/ the NopCloser use could cause this. We could wrap stderr and stdout in a custom Closer that closes both, to be absolutely sure. |
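The "custom Closer that closes both" idea mentioned above could look roughly like the Go sketch below. It is only an illustration, not code from the Argo repository; the names multiCloser and newMultiCloser are made up for this example. Closing the returned value closes both underlying streams, whatever reader is used to combine them.

```go
package wraputil

import "io"

// multiCloser combines a reader over two streams with a Close that
// closes both underlying streams.
type multiCloser struct {
	io.Reader
	closers []io.Closer
}

// Close closes every wrapped stream and returns the first error encountered.
func (m *multiCloser) Close() error {
	var firstErr error
	for _, c := range m.closers {
		if err := c.Close(); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}

// newMultiCloser combines stdout and stderr into one ReadCloser whose Close
// method closes both of them, so neither pipe is left dangling.
func newMultiCloser(stdout, stderr io.ReadCloser) io.ReadCloser {
	return &multiCloser{
		Reader:  io.MultiReader(stdout, stderr),
		closers: []io.Closer{stdout, stderr},
	}
}
```

Note that this only addresses the closing side; as the later comments show, the hang turned out to be about how the two streams are read, not how they are closed.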
We are seeing a similar issue to the one mentioned by @rmgpinto (we're running …). Some of our workflows don't terminate due to a sidecar stuck in … We have just set … Here's the …
|
It could be worthwhile to share our docker version: |
Running |
Can someone test this?
…On Tue, 3 Mar 2020, 18:37 Alex Collins wrote:
Closed #2261 via #2345.
|
I can test with my workflows. |
Can this bug fix be merged into v2.5 and v2.6? It blocks us from using the latest stable versions. |
We probably won't update v2.5, as it is now end of life. It should be back-ported to v2.6.1. |
Containers are still stuck in 2.6.1. |
Can you do a `ps -ef` in the running executor container and send us the result? Does it show docker logs still running, or any zombie processes?
…On Thu, 5 Mar 2020 at 08:45, Ricardo Gândara Pinto wrote:
Containers are still stuck in 2.6.1.
|
if I execute |
Right, I think I'm getting closer. MultiReader reads all inputs sequentially, so I think once we exceed the buffer amount on stderr, docker logs blocks on the write and doesn't exit, so we never finish reading. (Stupidly, I didn't check the MultiReader docs; it has been a while since I used it.) |
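To make the sequential-read behaviour concrete, here is a minimal Go sketch of the pattern being described; it is not the argoexec code, and the docker logs invocation and container ID are placeholders. io.MultiReader drains its first reader to EOF before touching the second, so a child process that blocks writing to a full stderr pipe never exits, and the first stream never reaches EOF.

```go
package main

import (
	"io"
	"os"
	"os/exec"
)

func main() {
	// Placeholder command: anything that writes heavily to both streams.
	cmd := exec.Command("docker", "logs", "--follow", "some-container-id")

	stdout, _ := cmd.StdoutPipe()
	stderr, _ := cmd.StderrPipe()
	_ = cmd.Start()

	// io.MultiReader reads stdout to EOF before it ever reads stderr.
	// If the child fills its stderr pipe buffer, it blocks on that write
	// and never exits, so stdout never hits EOF and this Copy stalls.
	combined := io.MultiReader(stdout, stderr)
	_, _ = io.Copy(os.Stdout, combined)

	_ = cmd.Wait()
}
```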
This PR fixes argoproj#2261. MultiReader concatenates the two streams rather than combining them, which meant that we blocked on stdin if stderr was not completely buffered. In addition, this implementation ensures that we call Wait() on the command, and any error is returned by Close(). The godoc for exec.Cmd's Std(out|err)Pipe confirms that we do not need to close those pipes. This PR also ensures that we do not leak goroutines in the event that cmd.Start() fails.
tested this with the following workflow:
I didn't check to see if that hangs on 2.6.1, but the old code hung on the equivalent with some local testing. |
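For context, here is a rough sketch of the concurrent approach the PR description outlines, under the assumption that the goal is simply to drain stdout and stderr at the same time so neither stream can block the other. It is not the actual #2368 code: it returns the command's error from a single call rather than threading it through Close() as the PR describes.

```go
package logs

import (
	"io"
	"os/exec"
	"sync"
)

// syncWriter serialises writes from the two copy goroutines.
type syncWriter struct {
	mu sync.Mutex
	w  io.Writer
}

func (s *syncWriter) Write(p []byte) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.w.Write(p)
}

// runCombined starts cmd, drains stdout and stderr concurrently into w so that
// neither stream can block the other, then calls Wait and returns its error
// (or the first copy error).
func runCombined(cmd *exec.Cmd, w io.Writer) error {
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	stderr, err := cmd.StderrPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err // nothing has been spawned yet, so no goroutines leak
	}

	out := &syncWriter{w: w}
	errs := make([]error, 2)
	var wg sync.WaitGroup
	for i, src := range []io.Reader{stdout, stderr} {
		wg.Add(1)
		go func(i int, r io.Reader) {
			defer wg.Done()
			_, errs[i] = io.Copy(out, r)
		}(i, src)
	}
	wg.Wait() // all reads must complete before Wait is called

	if err := cmd.Wait(); err != nil {
		return err
	}
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}
```

The key difference from the MultiReader version is that both pipes are consumed as data arrives, so a burst of stderr output can no longer wedge the whole read.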
I tested this workflow with 2.6.1. The issue still exists.
$ kubectl exec -it issue2261 -c wait -- ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 15:44 ? 00:00:00 argoexec wait
root 21 1 0 15:45 ? 00:00:00 docker logs 3f25da9882f9ffbba5b4
root 33 0 0 15:47 pts/0 00:00:00 ps -ef |
@tigerwings yes, I expected that to fail with 2.6.1, but I've tested against #2368, and it works with that patch applied. (This ticket needs reopening.) |
We upgraded yesterday from 2.4.x to 2.6.0 and faced the same issue for some of our workflows. |
I can confirm that v2.6.2 fixed this issue for us, thanks! |
I can confirm as well. Thanks @tcolgate ! |
Well, I also did break it in the first place :) |
Only people that do nothing won't break anything :) |
Checklist:
What happened:
In multi-root workflows in a single pod, the wait container gets stuck. It is not possible to terminate the workflow. The only way to stop it is to delete it.
The logs of the stuck container
The processes running on the container are
If I kill the docker logs process with id 22, it exits successfully.
Environment:
Argo version: 2.5-rc8 to 2.5
Kubernetes version:
Maybe it was introduced in this commit #2136