[BUG] Argo workflow wait container hanging when using artifact repository #1493

Closed
jbmcfarlin31 opened this issue Jul 23, 2019 · 3 comments

Comments

@jbmcfarlin31

Is this a BUG REPORT or FEATURE REQUEST?:
Bug

What happened:
One of our workflow steps generates lengthy output, which we send to a MinIO artifact repository. The output is roughly 812 KB, but it could be much larger.

The main container in the step executes successfully and exits with status code 0. However, the wait container, although it appears to have run successfully, hangs in a Running state.

This is not the case every time... sometimes the wait container finishes quickly, sometimes it hangs, and sometimes it finishes after hanging for 15+ minutes.

What you expected to happen:
The wait container should finish quickly every time.

How to reproduce it (as minimally and precisely as possible):
Have a workflow step that generates output and submits that output to an artifact repository.

Anything else we need to know?:

Environment:

  • Argo version: v2.3.0
$ argo version
  • Kubernetes version: v1.13.5
$ kubectl version -o yaml

Other debugging information (if applicable):

  • workflow-controller logs:
$ kubectl logs -n kube-system $(kubectl get pods -l app=workflow-controller -n kube-system -o name)

wait container logs:

time="2019-07-23T17:26:27Z" level=info msg="Waiting on main container"
time="2019-07-23T17:26:28Z" level=info msg="main container started with container ID: fbc32520ba040fbf375f555e099c825f3eb2e9c73d5070076b22d2cdf65cd7ef"
time="2019-07-23T17:26:28Z" level=info msg="Starting annotations monitor"
time="2019-07-23T17:26:28Z" level=info msg="Starting deadline monitor"
time="2019-07-23T17:26:28Z" level=info msg="docker wait fbc32520ba040fbf375f555e099c825f3eb2e9c73d5070076b22d2cdf65cd7ef"
time="2019-07-23T17:26:38Z" level=info msg="/argo/podmetadata/annotations updated"
time="2019-07-23T17:26:40Z" level=info msg="Main container completed"
time="2019-07-23T17:26:40Z" level=info msg="No sidecars"
time="2019-07-23T17:26:40Z" level=info msg="No output parameters"
time="2019-07-23T17:26:40Z" level=info msg="Annotations monitor stopped"
time="2019-07-23T17:26:40Z" level=info msg="Saving output artifacts"
time="2019-07-23T17:26:40Z" level=info msg="Staging artifact: scan-results"
time="2019-07-23T17:26:40Z" level=info msg="Copying /results/scan_results.json from container base image layer to /argo/outputs/artifacts/scan-results.tgz"
time="2019-07-23T17:26:40Z" level=info msg="Archiving fbc32520ba040fbf375f555e099c825f3eb2e9c73d5070076b22d2cdf65cd7ef:/results/scan_results.json to /argo/outputs/artifacts/scan-results.tgz"
time="2019-07-23T17:26:40Z" level=info msg="sh -c docker cp -a fbc32520ba040fbf375f555e099c825f3eb2e9c73d5070076b22d2cdf65cd7ef:/results/scan_results.json - | gzip > /argo/outputs/artifacts/scan-results.tgz"
time="2019-07-23T17:26:40Z" level=info msg="Archiving completed"
time="2019-07-23T17:26:40Z" level=info msg="S3 Save path: /argo/outputs/artifacts/scan-results.tgz, key: argo-wf-/convert-compose-clsjg/convert-compose-clsjg-170263088/scan-results.tgz"
time="2019-07-23T17:26:40Z" level=info msg="Creating minio client <minio_client> using static credentials"
time="2019-07-23T17:26:40Z" level=info msg="Saving from /argo/outputs/artifacts/scan-results.tgz to s3 (endpoint: minio_client, bucket: argo-workflow-bucket, key: argo-wf-/convert-compose-clsjg/convert-compose-clsjg-170263088/scan-results.tgz)"
time="2019-07-23T17:26:40Z" level=info msg="Deadline monitor stopped"
time="2019-07-23T17:31:27Z" level=info msg="Alloc=3541 TotalAlloc=11704 Sys=70078 NumGC=6 Goroutines=9"
time="2019-07-23T17:36:27Z" level=info msg="Alloc=3541 TotalAlloc=11706 Sys=70078 NumGC=8 Goroutines=9"
time="2019-07-23T17:41:27Z" level=info msg="Alloc=3642 TotalAlloc=11863 Sys=70078 NumGC=10 Goroutines=9"
time="2019-07-23T17:46:27Z" level=info msg="Alloc=3642 TotalAlloc=11866 Sys=70078 NumGC=12 Goroutines=9"
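
To put a number on the stall, here is a minimal sketch that parses the `time=` and `msg=` fields from log lines like the ones above and reports the longest silence between two consecutive messages. The helper names (`parse`, `largest_gap`) and the regular expression are illustrative assumptions, not part of Argo; the sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Assumption: each line carries time="..." and msg="..." fields,
# as in the wait container log above, with no quotes inside msg.
LINE_RE = re.compile(r'time="([^"]+)".*?msg="([^"]*)"')


def parse(lines):
    """Return (timestamp, message) pairs for lines that match LINE_RE."""
    entries = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%SZ")
            entries.append((ts, m.group(2)))
    return entries


def largest_gap(entries):
    """Return (gap, message_before, message_after) for the longest silence."""
    best = None
    for (t0, m0), (t1, m1) in zip(entries, entries[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, m0, m1)
    return best


# Three lines taken from the log above.
sample = [
    'time="2019-07-23T17:26:40Z" level=info msg="Saving output artifacts"',
    'time="2019-07-23T17:26:40Z" level=info msg="Deadline monitor stopped"',
    'time="2019-07-23T17:31:27Z" level=info msg="Alloc=3541 TotalAlloc=11704 Sys=70078 NumGC=6 Goroutines=9"',
]

gap, before, after = largest_gap(parse(sample))
print(gap, "|", before, "->", after)
```

Run against a full log, this should point at the interval between the "Saving from ... to s3" step and the eventual "Successfully saved file" message, which is where the wait container appears to sit idle.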
@jbmcfarlin31
Author

Just to add, here is the output of the rest of the wait container logs. This wait container took roughly 55 minutes to complete:

time="2019-07-23T19:41:05Z" level=info msg="Alloc=3546 TotalAlloc=11740 Sys=70078 NumGC=9 Goroutines=9"
time="2019-07-23T19:46:05Z" level=info msg="Alloc=3647 TotalAlloc=11897 Sys=70078 NumGC=11 Goroutines=9"
time="2019-07-23T19:51:05Z" level=info msg="Alloc=3647 TotalAlloc=11900 Sys=70078 NumGC=13 Goroutines=9"
time="2019-07-23T19:56:05Z" level=info msg="Alloc=3647 TotalAlloc=12055 Sys=70078 NumGC=15 Goroutines=9"
time="2019-07-23T20:01:05Z" level=info msg="Alloc=3647 TotalAlloc=12057 Sys=70078 NumGC=17 Goroutines=9"
time="2019-07-23T20:06:05Z" level=info msg="Alloc=3648 TotalAlloc=12204 Sys=70078 NumGC=19 Goroutines=9"
time="2019-07-23T20:11:05Z" level=info msg="Alloc=3648 TotalAlloc=12206 Sys=70078 NumGC=21 Goroutines=9"
time="2019-07-23T20:16:05Z" level=info msg="Alloc=3648 TotalAlloc=12360 Sys=70078 NumGC=23 Goroutines=9"
time="2019-07-23T20:21:05Z" level=info msg="Alloc=3648 TotalAlloc=12362 Sys=70078 NumGC=25 Goroutines=9"
time="2019-07-23T20:21:32Z" level=info msg="Successfully saved file: /argo/outputs/artifacts/scan-results.tgz"
time="2019-07-23T20:21:32Z" level=info msg="Annotating pod with output"
time="2019-07-23T20:21:32Z" level=info msg="Alloc=4132 TotalAlloc=12847 Sys=70078 NumGC=25 Goroutines=9"

When Argo compresses the artifact with gzip, it shrinks to roughly 30 KB. The traffic from Argo to MinIO stays on the same network, so it does not leave the environment.

@madhuresh04

Is this resolved in version 2.6.3?

@stale

stale bot commented Jul 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jul 1, 2020
@stale stale bot closed this as completed Jul 8, 2020

2 participants