Wait container stuck in multi-root workflows in 2.5-rc8 to 2.5.0 #2261

Closed · Arkanayan opened this issue Feb 19, 2020 · 26 comments · Fixed by #2345 or #2368
Labels: type/bug, type/regression (Regression from previous behavior, a specific type of bug)

@Arkanayan

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:
In multi-root workflows in a single pod, the wait container gets stuck. It is not possible to terminate the workflow. The only way to stop it is to delete it.

The logs of the stuck container:

$ kubectl logs workflow-zdkm8-552438975 -n argo -c wait
time="2020-02-19T06:06:48Z" level=info msg="Creating a docker executor"
time="2020-02-19T06:06:48Z" level=info msg="Executor (version: v0.0.0+unknown, build_date: 1970-01-01T00:00:00Z) initialized (pod: <redacted> )}}}"
time="2020-02-19T06:06:48Z" level=info msg="Waiting on main container"
time="2020-02-19T06:06:49Z" level=info msg="main container started with container ID: 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting annotations monitor"
time="2020-02-19T06:06:49Z" level=info msg="Execution control set from API: {2020-02-19 10:06:47 +0000 UTC false}"
time="2020-02-19T06:06:49Z" level=info msg="docker wait 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting deadline monitor"
time="2020-02-19T06:06:53Z" level=info msg="Main container completed"
time="2020-02-19T06:06:53Z" level=info msg="Saving logs"
time="2020-02-19T06:06:53Z" level=info msg="[docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6]"
time="2020-02-19T06:06:53Z" level=info msg="Annotations monitor stopped"
time="2020-02-19T06:06:53Z" level=info msg="Deadline monitor stopped"

The processes running in the container are:

root@pod-m8-552438975:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.4 145816 34196 ?        Ssl  06:06   0:00 argoexec wait
root        22  0.0  0.3  45672 28684 ?        Sl   06:06   0:00 docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6
root        41  0.0  0.0  18136  3188 pts/0    Ss   06:11   0:00 /bin/bash
root        54  0.0  0.0  36636  2844 pts/0    R+   06:12   0:00 ps aux

If I kill the docker logs process (PID 22), the wait container exits successfully.

root@pod-zdkm8-552438975:/# kill 22
root@pod-zdkm8-552438975:/# command terminated with exit code 137

Environment:

  • Argo version: 2.5-rc8 to 2.5

  • Kubernetes version:

clientVersion:
  buildDate: "2019-09-18T14:36:53Z"
  compiler: gc
  gitCommit: 2bd9643cee5b3b3a5ecbd3af49d09018f0773c77
  gitTreeState: clean
  gitVersion: v1.16.0
  goVersion: go1.12.9
  major: "1"
  minor: "16"
  platform: linux/amd64
serverVersion:
  buildDate: "2019-11-13T11:11:50Z"
  compiler: gc
  gitCommit: 7015f71e75f670eb9e7ebd4b5749639d42e20079
  gitTreeState: clean
  gitVersion: v1.15.6
  goVersion: go1.12.12
  major: "1"
  minor: "15"
  platform: linux/

Maybe it was introduced by this commit: #2136

@simster7 simster7 self-assigned this Feb 19, 2020
@alexec (Contributor) commented Feb 19, 2020

@tcolgate FYI

@alexec alexec added this to the v2.5 milestone Feb 19, 2020
@alexec alexec added the type/regression Regression from previous behavior (a specific type of bug) label Feb 19, 2020
@alexec alexec assigned alexec and unassigned simster7 Feb 19, 2020
@alexec (Contributor) commented Feb 19, 2020

@Arkanayan do you have an example of this problem? E.g. in the examples folder?

@Arkanayan (Author)

Sorry @alexec, I am not able to reproduce this issue with the examples.

@alexec (Contributor) commented Feb 24, 2020

If we can't repro an issue, we cannot fix it. Please let me know if you're able to provide YAML that repros the issue.

Do you believe the issue appeared in v2.5.0-rc8, i.e. it was not present in v2.5.0-rc7? If that is the case, we can reach out to the committers of that release and see what they think.

@MrSaints commented Feb 24, 2020

I've just encountered this on v2.5.1, I'll try to provide a minimal example.

In our case, it seems to be potentially related to: #1493

EDIT: Still investigating. It may be related to our use of tqdm (https://github.com/tqdm/tqdm), which we did not turn off. I believe it emits quite a lot of output because it writes to stderr, and there has been a recent change to the handling of stderr.

EDIT 2: Disabling tqdm seems to fix it for us. I now believe the problem occurs when a container writes a lot to stderr.

@rmgpinto commented Feb 25, 2020

I'm having this as well.
Wait container logs:

time="2020-02-25T01:02:00Z" level=info msg="Waiting on main container"
time="2020-02-25T01:02:05Z" level=info msg="main container started with container ID: e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394"
time="2020-02-25T01:02:05Z" level=info msg="Starting annotations monitor"
time="2020-02-25T01:02:06Z" level=info msg="docker wait e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394"
time="2020-02-25T01:02:06Z" level=info msg="Starting deadline monitor"
time="2020-02-25T01:07:00Z" level=info msg="Alloc=3200 TotalAlloc=9568 Sys=68674 NumGC=5 Goroutines=11"
time="2020-02-25T01:10:51Z" level=info msg="Main container completed"
time="2020-02-25T01:10:51Z" level=info msg="Saving logs"
time="2020-02-25T01:10:51Z" level=info msg="Annotations monitor stopped"
time="2020-02-25T01:10:51Z" level=info msg="[docker logs e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394]"
time="2020-02-25T01:10:52Z" level=info msg="Deadline monitor stopped"
time="2020-02-25T01:12:00Z" level=info msg="Alloc=3309 TotalAlloc=9685 Sys=70080 NumGC=7 Goroutines=6"
time="2020-02-25T01:17:00Z" level=info msg="Alloc=3217 TotalAlloc=9686 Sys=70080 NumGC=10 Goroutines=6"

Running argo 2.6.0-rc2.

The last line just repeats from then on...

EDIT: Disabling archiveLogs seems to "fix it" for some workflows.
I agree with @MrSaints: writing too much output seems to interfere with the wait container, since the workflows that remained Running were the ones that wrote a lot of logs.

@tcolgate (Contributor)

It might be worth manually running the docker logs command on the host, check if it returns and how much data it returns. I don't /think/ the NopCloser use could cause this. We could wrap stderr and stdout in a custom Closer that closes both, to be absolutely sure.
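
For illustration, here is a minimal Go sketch of the kind of wrapper described above. It is not the actual argoexec code and the type and helper names are hypothetical; it is simply an io.ReadCloser whose Close releases both underlying pipes instead of using a NopCloser:

package logstream

import "io"

// multiCloseReader pairs a combined reader with the closers of the underlying
// stdout and stderr pipes, so a single Close releases both of them.
type multiCloseReader struct {
    io.Reader
    closers []io.Closer
}

// Close closes every wrapped stream and returns the first error encountered.
func (m *multiCloseReader) Close() error {
    var firstErr error
    for _, c := range m.closers {
        if err := c.Close(); err != nil && firstErr == nil {
            firstErr = err
        }
    }
    return firstErr
}

// wrapWithClosers is a hypothetical helper: it returns a ReadCloser whose
// Close actually closes the supplied pipes.
func wrapWithClosers(r io.Reader, closers ...io.Closer) io.ReadCloser {
    return &multiCloseReader{Reader: r, closers: closers}
}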

@jpugliesi commented Feb 28, 2020

We are seeing a similar issue to the one mentioned by @rmgpinto (we're running v2.5.0).

Some of our workflows don't terminate due to a sidecar stuck in the Running state. In our case, the sidecar is tensorboard. We've been running these workflows for many months without issues, so this behavior is new.

We have just set archiveLogs: false at the workflow-controller and workflow-template level - I'll report back if this mitigates the issue.

Here are the wait container's logs, although they're not much more interesting than the logs @rmgpinto shared:

time="2020-02-28T06:03:23Z" level=info msg="Main container completed"
time="2020-02-28T06:03:23Z" level=info msg="Saving logs"
time="2020-02-28T06:03:23Z" level=info msg="Annotations monitor stopped"
time="2020-02-28T06:03:23Z" level=info msg="[docker logs 6a2ef3074dc0a844737f8ba0da1bedb81d556a51acb9afc65d5a590d67649b03]"
time="2020-02-28T06:03:24Z" level=info msg="Deadline monitor stopped"
time="2020-02-28T06:07:59Z" level=info msg="Alloc=2398 TotalAlloc=5488 Sys=70592 NumGC=49 Goroutines=4"
time="2020-02-28T06:12:59Z" level=info msg="Alloc=2398 TotalAlloc=5492 Sys=70592 NumGC=51 Goroutines=4"
time="2020-02-28T06:17:59Z" level=info msg="Alloc=2398 TotalAlloc=5497 Sys=70592 NumGC=54 Goroutines=4"
...

@rmgpinto commented Mar 1, 2020

It could be worthwhile to share our Docker version: 18.09.9-ce.

@Arkanayan (Author) commented Mar 2, 2020

> It might be worth manually running the docker logs command on the host, check if it returns and how much data it returns. I don't /think/ the NopCloser use could cause this. We could wrap stderr and stdout in a custom Closer that closes both, to be absolutely sure.

Running docker logs in the wait container works as intended: it instantly returns the logs and exits.

@tcolgate (Contributor) commented Mar 3, 2020 via email

@rmgpinto commented Mar 3, 2020

I can test with my workflows.
I'm using the Docker images; can you make a 2.6.1-rc?
Thanks

@tigerwings

Can this bug fix be merged into v2.5 and v2.6? It blocks us from using the latest stable versions.

@alexec (Contributor) commented Mar 4, 2020

We probably won't update v2.5, as it is now end of life. It should be back-ported to v2.6.1.

@rmgpinto commented Mar 5, 2020

Containers are still stuck in 2.6.1.

@tcolgate (Contributor) commented Mar 5, 2020 via email

@rmgpinto commented Mar 5, 2020

$ kubectl -n argo exec -it test-workflow-nrf7h-741316927 -c wait -- ps -ef                          
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 09:46 ?        00:00:00 argoexec wait
root        23     1  0 09:55 ?        00:00:00 docker logs 6d57f49eaa0af8ebf36d
root        47     0  0 09:57 pts/0    00:00:00 ps -ef

If I execute kubectl -n argo exec -it test-workflow-nrf7h-741316927 -c wait -- docker logs 6d57f49eaa0af8ebf36d, it shows the logs and exits.

@tcolgate (Contributor) commented Mar 5, 2020

Right, I think I'm getting closer. MultiReader reads all of its inputs sequentially, so I think once we exceed the pipe buffer on stderr, docker logs blocks on the write and doesn't exit, so we never finish reading stdout. (Stupidly, I didn't check the MultiReader docs; it has been a while since I used it.)
Need to try to recreate it, but I should have a fix real soon.
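
To make the failure mode concrete, here is a small standalone Go sketch, an illustration based on the explanation above rather than the executor source, that reproduces the hang with a child process that floods stderr. Killing the child unblocks the read, which matches the kill 22 observation earlier in the thread:

package main

import (
    "io"
    "os"
    "os/exec"
)

// io.MultiReader reads its inputs one after another, so stderr is only read
// after stdout reaches EOF. If the child writes more to stderr than the pipe
// buffer holds, it blocks on that write, never exits, stdout never reaches
// EOF, and this program hangs (on purpose, to reproduce the symptom).
func main() {
    cmd := exec.Command("sh", "-c", "for i in $(seq 100000); do date >&2; done; echo done")
    stdout, _ := cmd.StdoutPipe()
    stderr, _ := cmd.StderrPipe()
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    combined := io.MultiReader(stdout, stderr) // concatenates, does not interleave
    io.Copy(os.Stdout, combined)               // blocks here, mirroring the stuck "docker logs"
    cmd.Wait()
}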

tcolgate pushed a commit to tcolgate/argo that referenced this issue Mar 5, 2020
This PR fixes argoproj#2261. MultiReader concatenates, rather than combines, the two
streams; this meant that we blocked reading stdout if stderr was not
completely buffered.

In addition, this implementation ensures that we call Wait() on the
command, and any error is returned by Close(). The godoc for
exec.Cmd.Std(out|err)Pipe confirms that we do not need to close those
pipes.

This PR also ensures that we do not leak goroutines in the event that
cmd.Start() fails.
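
For readers following along, below is a rough Go sketch of the pattern the commit message describes: drain stdout and stderr concurrently so neither pipe buffer can block the child, surface the command's exit error from Close(), and start no goroutines until Start() has succeeded. It is only an approximation of the idea, not the patch itself, and every name in it is hypothetical:

package logstream

import (
    "io"
    "os/exec"
    "sync"
)

// cmdStream exposes the combined output of a running command as an
// io.ReadCloser; Close waits for the command and returns its error.
type cmdStream struct {
    cmd *exec.Cmd
    pr  *io.PipeReader
}

func (c *cmdStream) Read(p []byte) (int, error) { return c.pr.Read(p) }

func (c *cmdStream) Close() error {
    err := c.cmd.Wait() // surfaces the command's exit status to the caller
    c.pr.Close()
    return err
}

// combinedOutput starts cmd and copies stdout and stderr into one pipe
// concurrently, so neither OS pipe buffer can fill up and block the child.
func combinedOutput(cmd *exec.Cmd) (io.ReadCloser, error) {
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return nil, err
    }
    stderr, err := cmd.StderrPipe()
    if err != nil {
        return nil, err
    }
    pr, pw := io.Pipe()
    if err := cmd.Start(); err != nil {
        pw.Close() // no goroutines have been started yet, so nothing leaks
        return nil, err
    }
    var wg sync.WaitGroup
    for _, src := range []io.Reader{stdout, stderr} {
        wg.Add(1)
        go func(r io.Reader) {
            defer wg.Done()
            io.Copy(pw, r) // io.Pipe serializes concurrent writers
        }(src)
    }
    go func() {
        wg.Wait()
        pw.Close() // reader sees EOF once both streams are drained
    }()
    return &cmdStream{cmd: cmd, pr: pr}, nil
}

A caller would read the returned stream to EOF (for example with io.Copy) and then call Close() to pick up the command's exit status.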

@tcolgate (Contributor) commented Mar 5, 2020

Tested this with the following workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name:  wftestthing.3
  namespace: argo
spec:
  entrypoint: build
  templates:
  - container:
      args:
      - "for i in $(seq 100000); do date > /dev/stderr;  done "
      command:
      - sh
      - -c
      image: alpine:latest
    name: build

I didn't check to see if that hangs on 2.6.1, but the old code hung on the equivalent with some local testing.

@tigerwings

> Tested this with the following workflow:
>
> apiVersion: argoproj.io/v1alpha1
> kind: Workflow
> metadata:
>   name:  wftestthing.3
>   namespace: argo
> spec:
>   entrypoint: build
>   templates:
>   - container:
>       args:
>       - "for i in $(seq 100000); do date > /dev/stderr;  done "
>       command:
>       - sh
>       - -c
>       image: alpine:latest
>     name: build
>
> I didn't check to see if that hangs on 2.6.1, but the old code hung on the equivalent with some local testing.

I tested this workflow with 2.6.1. The issue still exists.

$ kubectl exec -it issue2261 -c wait -- ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 15:44 ?        00:00:00 argoexec wait
root        21     1  0 15:45 ?        00:00:00 docker logs 3f25da9882f9ffbba5b4
root        33     0  0 15:47 pts/0    00:00:00 ps -ef

@tcolgate (Contributor) commented Mar 5, 2020

@tigerwings Yes, I did expect that to fail with 2.6.1, but I've tested against #2368 and it works with that patch applied. (This ticket needs reopening.)

@dguendisch

We upgraded yesterday from 2.4.x to 2.6.0 and faced the same issue for some of our workflows: the wait container got stuck and the process output showed docker logs running...
Looking forward to Argo v2.6.2.

@dguendisch

I can confirm that v2.6.2 fixed this issue for us, thanks!

@rmgpinto

I can confirm as well.

Thanks @tcolgate !

@tcolgate (Contributor)

Well, I also did break it in the first place :)

@dguendisch

Only people that do nothing won't break anything :)
