Wait container stuck in multi-root workflows in 2.5-rc8 to 2.5.0 #2261

Closed · Arkanayan opened this issue Feb 19, 2020 · 26 comments · Fixed by #2345 or #2368
Labels: type/bug, type/regression (Regression from previous behavior, a specific type of bug)

@Arkanayan

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:
In multi-root workflows in a single pod, the wait container gets stuck. It is not possible to terminate the workflow. The only way to stop it is to delete it.

The logs of the stuck container:

$ kubectl logs workflow-zdkm8-552438975 -n argo -c wait
time="2020-02-19T06:06:48Z" level=info msg="Creating a docker executor"
time="2020-02-19T06:06:48Z" level=info msg="Executor (version: v0.0.0+unknown, build_date: 1970-01-01T00:00:00Z) initialized (pod: <redacted> )}}}"
time="2020-02-19T06:06:48Z" level=info msg="Waiting on main container"
time="2020-02-19T06:06:49Z" level=info msg="main container started with container ID: 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting annotations monitor"
time="2020-02-19T06:06:49Z" level=info msg="Execution control set from API: {2020-02-19 10:06:47 +0000 UTC false}"
time="2020-02-19T06:06:49Z" level=info msg="docker wait 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting deadline monitor"
time="2020-02-19T06:06:53Z" level=info msg="Main container completed"
time="2020-02-19T06:06:53Z" level=info msg="Saving logs"
time="2020-02-19T06:06:53Z" level=info msg="[docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6]"
time="2020-02-19T06:06:53Z" level=info msg="Annotations monitor stopped"
time="2020-02-19T06:06:53Z" level=info msg="Deadline monitor stopped"

The processes running in the container are:

root@pod-m8-552438975:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.4 145816 34196 ?        Ssl  06:06   0:00 argoexec wait
root        22  0.0  0.3  45672 28684 ?        Sl   06:06   0:00 docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6
root        41  0.0  0.0  18136  3188 pts/0    Ss   06:11   0:00 /bin/bash
root        54  0.0  0.0  36636  2844 pts/0    R+   06:12   0:00 ps aux

If I kill the docker logs process (PID 22), the wait container exits successfully.

root@pod-zdkm8-552438975:/# kill 22
root@pod-zdkm8-552438975:/# command terminated with exit code 137

Environment:

  • Argo version: 2.5-rc8 to 2.5

  • Kubernetes version:

clientVersion:
  buildDate: "2019-09-18T14:36:53Z"
  compiler: gc
  gitCommit: 2bd9643cee5b3b3a5ecbd3af49d09018f0773c77
  gitTreeState: clean
  gitVersion: v1.16.0
  goVersion: go1.12.9
  major: "1"
  minor: "16"
  platform: linux/amd64
serverVersion:
  buildDate: "2019-11-13T11:11:50Z"
  compiler: gc
  gitCommit: 7015f71e75f670eb9e7ebd4b5749639d42e20079
  gitTreeState: clean
  gitVersion: v1.15.6
  goVersion: go1.12.12
  major: "1"
  minor: "15"
  platform: linux/

Maybe it was introduced by this commit: #2136

@simster7 simster7 self-assigned this Feb 19, 2020
@alexec (Contributor) commented Feb 19, 2020

@tcolgate FYI

@alexec alexec added this to the v2.5 milestone Feb 19, 2020
@alexec alexec added the type/regression Regression from previous behavior (a specific type of bug) label Feb 19, 2020
@alexec alexec assigned alexec and unassigned simster7 Feb 19, 2020
@alexec (Contributor) commented Feb 19, 2020

@Arkanayan do you have an example of this problem? E.g. in the examples folder?

@Arkanayan (Author)

Sorry @alexec, I am not able to reproduce this issue with the examples.

@alexec (Contributor) commented Feb 24, 2020

If we can't repro an issue, we cannot fix it. Please let me know if you're able to provide YAML that repros the issue.

Do you believe the issue appeared in v2.5.0-rc8, i.e. it was not present in v2.5.0-rc7? If that is the case, we can reach out to the committers of that release and see what they think.

@MrSaints commented Feb 24, 2020

I've just encountered this on v2.5.1, I'll try to provide a minimal example.

In our case, it seems to be potentially related to: #1493

EDIT: Still investigating. It may be related to our use of tqdm (https://github.com/tqdm/tqdm), which we did not turn off. I believe it emits quite a lot of output because it writes to stderr, and there has been a recent change to the handling of stderr.

EDIT 2: Disabling tqdm seems to fix it for us. I now believe the problem occurs when a container writes a lot to stderr.

@rmgpinto commented Feb 25, 2020

I'm having this as well.
Wait container logs:

time="2020-02-25T01:02:00Z" level=info msg="Waiting on main container"
time="2020-02-25T01:02:05Z" level=info msg="main container started with container ID: e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394"
time="2020-02-25T01:02:05Z" level=info msg="Starting annotations monitor"
time="2020-02-25T01:02:06Z" level=info msg="docker wait e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394"
time="2020-02-25T01:02:06Z" level=info msg="Starting deadline monitor"
time="2020-02-25T01:07:00Z" level=info msg="Alloc=3200 TotalAlloc=9568 Sys=68674 NumGC=5 Goroutines=11"
time="2020-02-25T01:10:51Z" level=info msg="Main container completed"
time="2020-02-25T01:10:51Z" level=info msg="Saving logs"
time="2020-02-25T01:10:51Z" level=info msg="Annotations monitor stopped"
time="2020-02-25T01:10:51Z" level=info msg="[docker logs e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394]"
time="2020-02-25T01:10:52Z" level=info msg="Deadline monitor stopped"
time="2020-02-25T01:12:00Z" level=info msg="Alloc=3309 TotalAlloc=9685 Sys=70080 NumGC=7 Goroutines=6"
time="2020-02-25T01:17:00Z" level=info msg="Alloc=3217 TotalAlloc=9686 Sys=70080 NumGC=10 Goroutines=6"

Running argo 2.6.0-rc2.

The last line just repeats from then on...

EDIT: Disabling archiveLogs seems to "fix it" for some workflows.
I agree with @MrSaints: writing too much output seems to interfere with the wait container, since the workflows that remained Running were the ones that wrote a lot of logs.

@tcolgate (Contributor)

It might be worth manually running the docker logs command on the host, check if it returns and how much data it returns. I don't /think/ the NopCloser use could cause this. We could wrap stderr and stdout in a custom Closer that closes both, to be absolutely sure.
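
For illustration, here is a minimal Go sketch of the kind of wrapper described above. It is not the actual argoexec code and the type and helper names are hypothetical; it is simply an io.ReadCloser whose Close releases both underlying pipes instead of using a NopCloser:

package logstream

import "io"

// multiCloseReader pairs a combined reader with the closers of the underlying
// stdout and stderr pipes, so a single Close releases both of them.
type multiCloseReader struct {
    io.Reader
    closers []io.Closer
}

// Close closes every wrapped stream and returns the first error encountered.
func (m *multiCloseReader) Close() error {
    var firstErr error
    for _, c := range m.closers {
        if err := c.Close(); err != nil && firstErr == nil {
            firstErr = err
        }
    }
    return firstErr
}

// wrapWithClosers is a hypothetical helper: it returns a ReadCloser whose
// Close actually closes the supplied pipes.
func wrapWithClosers(r io.Reader, closers ...io.Closer) io.ReadCloser {
    return &multiCloseReader{Reader: r, closers: closers}
}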

@jpugliesi commented Feb 28, 2020

We are seeing a similar issue to the one mentioned by @rmgpinto (we're running v2.5.0).

Some of our workflows don't terminate due to a sidecar stuck in the Running state. In our case, the sidecar is tensorboard. We've been running these workflows for many months without issues, so this behavior is new.

We have just set archiveLogs: false at the workflow-controller and workflow-template level - I'll report back if this mitigates the issue.

Here are the wait container's logs, although they're not much more interesting than the logs @rmgpinto shared:

time="2020-02-28T06:03:23Z" level=info msg="Main container completed"
time="2020-02-28T06:03:23Z" level=info msg="Saving logs"
time="2020-02-28T06:03:23Z" level=info msg="Annotations monitor stopped"
time="2020-02-28T06:03:23Z" level=info msg="[docker logs 6a2ef3074dc0a844737f8ba0da1bedb81d556a51acb9afc65d5a590d67649b03]"
time="2020-02-28T06:03:24Z" level=info msg="Deadline monitor stopped"
time="2020-02-28T06:07:59Z" level=info msg="Alloc=2398 TotalAlloc=5488 Sys=70592 NumGC=49 Goroutines=4"
time="2020-02-28T06:12:59Z" level=info msg="Alloc=2398 TotalAlloc=5492 Sys=70592 NumGC=51 Goroutines=4"
time="2020-02-28T06:17:59Z" level=info msg="Alloc=2398 TotalAlloc=5497 Sys=70592 NumGC=54 Goroutines=4"
...

@rmgpinto commented Mar 1, 2020

It could be worthwhile to share our Docker version: 18.09.9-ce.

@Arkanayan (Author) commented Mar 2, 2020

> It might be worth manually running the docker logs command on the host, check if it returns and how much data it returns. I don't /think/ the NopCloser use could cause this. We could wrap stderr and stdout in a custom Closer that closes both, to be absolutely sure.

Running docker logs in the wait container works as intended: it instantly returns the logs and exits.

@tcolgate (Contributor) commented Mar 3, 2020 via email

@rmgpinto commented Mar 3, 2020

I can test with my workflows.
I'm using the Docker images; can you make a 2.6.1-rc?
Thanks

@tigerwings

Can this bug fix be merged into v2.5 and v2.6? It blocks us from using the latest stable versions.

@alexec (Contributor) commented Mar 4, 2020

We probably won't update v2.5, as it is now end of life. It should be back-ported to v2.6.1.

@rmgpinto commented Mar 5, 2020

Containers are still stuck in 2.6.1.

@tcolgate (Contributor) commented Mar 5, 2020 via email

@rmgpinto commented Mar 5, 2020

$ kubectl -n argo exec -it test-workflow-nrf7h-741316927 -c wait -- ps -ef                          
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 09:46 ?        00:00:00 argoexec wait
root        23     1  0 09:55 ?        00:00:00 docker logs 6d57f49eaa0af8ebf36d
root        47     0  0 09:57 pts/0    00:00:00 ps -ef

If I execute kubectl -n argo exec -it test-workflow-nrf7h-741316927 -c wait -- docker logs 6d57f49eaa0af8ebf36d, it shows the logs and exits.

@tcolgate (Contributor) commented Mar 5, 2020

Right, I think I'm getting closer. MultiReader reads all of its inputs sequentially, so I think once we exceed the pipe buffer on stderr, docker logs blocks on the write and doesn't exit, so we never finish reading stdout. (Stupidly, I didn't check the MultiReader docs; it has been a while since I used it.)
Need to try to recreate it, but I should have a fix real soon.
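
To make the failure mode concrete, here is a small standalone Go sketch, an illustration based on the explanation above rather than the executor source, that reproduces the hang with a child process that floods stderr. Killing the child unblocks the read, which matches the kill 22 observation earlier in the thread:

package main

import (
    "io"
    "os"
    "os/exec"
)

// io.MultiReader reads its inputs one after another, so stderr is only read
// after stdout reaches EOF. If the child writes more to stderr than the pipe
// buffer holds, it blocks on that write, never exits, stdout never reaches
// EOF, and this program hangs (on purpose, to reproduce the symptom).
func main() {
    cmd := exec.Command("sh", "-c", "for i in $(seq 100000); do date >&2; done; echo done")
    stdout, _ := cmd.StdoutPipe()
    stderr, _ := cmd.StderrPipe()
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    combined := io.MultiReader(stdout, stderr) // concatenates, does not interleave
    io.Copy(os.Stdout, combined)               // blocks here, mirroring the stuck "docker logs"
    cmd.Wait()
}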

tcolgate pushed a commit to tcolgate/argo that referenced this issue Mar 5, 2020
This PR fixes argoproj#2261. MultiReader concatenates, rather than combines, the two
streams; this meant that we blocked reading stdout if stderr was not
completely buffered.

In addition, this implementation ensures that we call Wait() on the
command, and any error is returned by Close(). The godoc for
exec.Cmd.Std(out|err)Pipe confirms that we do not need to close those
pipes.

This PR also ensures that we do not leak goroutines in the event that
cmd.Start() fails.
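
For readers following along, below is a rough Go sketch of the pattern the commit message describes: drain stdout and stderr concurrently so neither pipe buffer can block the child, surface the command's exit error from Close(), and start no goroutines until Start() has succeeded. It is only an approximation of the idea, not the patch itself, and every name in it is hypothetical:

package logstream

import (
    "io"
    "os/exec"
    "sync"
)

// cmdStream exposes the combined output of a running command as an
// io.ReadCloser; Close waits for the command and returns its error.
type cmdStream struct {
    cmd *exec.Cmd
    pr  *io.PipeReader
}

func (c *cmdStream) Read(p []byte) (int, error) { return c.pr.Read(p) }

func (c *cmdStream) Close() error {
    err := c.cmd.Wait() // surfaces the command's exit status to the caller
    c.pr.Close()
    return err
}

// combinedOutput starts cmd and copies stdout and stderr into one pipe
// concurrently, so neither OS pipe buffer can fill up and block the child.
func combinedOutput(cmd *exec.Cmd) (io.ReadCloser, error) {
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return nil, err
    }
    stderr, err := cmd.StderrPipe()
    if err != nil {
        return nil, err
    }
    pr, pw := io.Pipe()
    if err := cmd.Start(); err != nil {
        pw.Close() // no goroutines have been started yet, so nothing leaks
        return nil, err
    }
    var wg sync.WaitGroup
    for _, src := range []io.Reader{stdout, stderr} {
        wg.Add(1)
        go func(r io.Reader) {
            defer wg.Done()
            io.Copy(pw, r) // io.Pipe serializes concurrent writers
        }(src)
    }
    go func() {
        wg.Wait()
        pw.Close() // reader sees EOF once both streams are drained
    }()
    return &cmdStream{cmd: cmd, pr: pr}, nil
}

A caller would read the returned stream to EOF (for example with io.Copy) and then call Close() to pick up the command's exit status.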

@tcolgate (Contributor) commented Mar 5, 2020

Tested this with the following workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name:  wftestthing.3
  namespace: argo
spec:
  entrypoint: build
  templates:
  - container:
      args:
      - "for i in $(seq 100000); do date > /dev/stderr;  done "
      command:
      - sh
      - -c
      image: alpine:latest
    name: build

I didn't check to see if that hangs on 2.6.1, but the old code hung on the equivalent with some local testing.

@tigerwings

> Tested this with the following workflow:
>
> apiVersion: argoproj.io/v1alpha1
> kind: Workflow
> metadata:
>   name:  wftestthing.3
>   namespace: argo
> spec:
>   entrypoint: build
>   templates:
>   - container:
>       args:
>       - "for i in $(seq 100000); do date > /dev/stderr;  done "
>       command:
>       - sh
>       - -c
>       image: alpine:latest
>     name: build
>
> I didn't check to see if that hangs on 2.6.1, but the old code hung on the equivalent with some local testing.

I tested this workflow with 2.6.1. The issue still exists.

$ kubectl exec -it issue2261 -c wait -- ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 15:44 ?        00:00:00 argoexec wait
root        21     1  0 15:45 ?        00:00:00 docker logs 3f25da9882f9ffbba5b4
root        33     0  0 15:47 pts/0    00:00:00 ps -ef

@tcolgate (Contributor) commented Mar 5, 2020

@tigerwings Yes, I did expect that to fail with 2.6.1, but I've tested against #2368 and it works with that patch applied. (This ticket needs reopening.)

@dguendisch

We upgraded yesterday from 2.4.x to 2.6.0 and faced the same issue for some of our workflows: the wait container got stuck and the process output showed docker logs running...
Looking forward to Argo v2.6.2.

@dguendisch

I can confirm that v2.6.2 fixed this issue for us, thanks!

@rmgpinto

I can confirm as well.

Thanks @tcolgate !

@tcolgate (Contributor)

Well, I also did break it in the first place :)

@dguendisch

Only people that do nothing won't break anything :)
