
v3.3: Retrieving pod artifacts using Web UI from S3 is slow #8257
Closed
chriswilley opened this issue Mar 26, 2022 · 34 comments · Fixed by #8394
Assignees: alexec
Labels: area/artifacts · type/bug · type/regression

Comments

@chriswilley

Summary

What happened/what you expected to happen?

After a pod completes, clicking any input or output artifact (such as the main container log) in the Web UI takes about a full minute to pull from S3. On v3.1.12 it was very quick. Nothing about the network or EKS configuration has changed.

What version are you running?
v3.3.0


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

alexec added the area/artifacts label Mar 27, 2022
@alexec (Contributor) commented Mar 27, 2022

How slow is “slow”?

@chriswilley (Author)

I just timed pulling a 2KB log file and it took 2m 2s to appear.
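For anyone trying to quantify "slow": the download can also be timed directly against the Argo Server's artifact endpoint, taking the browser out of the picture. A minimal sketch, assuming the server is reachable on localhost:2746; the namespace, workflow name, and node ID are placeholders:

# Time one artifact download end-to-end (all <...> values are placeholders).
# -k tolerates the self-signed certificate of a default argo-server install.
curl -sk -o /dev/null -w 'total: %{time_total}s\n' \
  "https://localhost:2746/artifacts/<namespace>/<workflow-name>/<node-id>/main-logs"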

alexec added the type/regression label Mar 27, 2022
@alexec (Contributor) commented Apr 4, 2022

Have you tested in v3.2?

@a3lk4i commented Apr 4, 2022

I've got the same issue and it's only in v3.3.0 and v3.3.1. v3.2.* had no issue.

@alexec (Contributor) commented Apr 4, 2022

Let me get this right:

  • v3.1 - fast
  • v3.2 - fast
  • v3.3 - slow

@a3lk4i commented Apr 4, 2022

yes, that's right.

@alexec (Contributor) commented Apr 4, 2022

Candidates for blame: see the go.mod dependency diff posted below (Apr 13).

@alexec (Contributor) commented Apr 4, 2022

Does it always take 2m2s? Are you using a browser?

@chriswilley (Author)

I'd have to do more timings, but anecdotally they all take about the same time.

@chriswilley (Author)

And yes to browser. (Google Chrome)

@a3lk4i commented Apr 4, 2022

Looks like it. These are the logs from the wait container where the artifacts (container logs in this case) are uploaded

time="2022-04-04T15:58:19.454Z" level=info msg="Starting deadline monitor"
time="2022-04-04T16:01:03.480Z" level=info msg="Main container completed"
time="2022-04-04T16:01:03.480Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2022-04-04T16:01:03.480Z" level=info msg="Saving logs"
time="2022-04-04T16:01:03.481Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: <redacted>/main.log"
time="2022-04-04T16:01:03.481Z" level=info msg="Creating minio client using IAM role"
time="2022-04-04T16:01:03.481Z" level=info msg="Saving file to s3" bucket=<redacted> endpoint=s3.amazonaws.com key=<redacted>/main.log path=/tmp/argo/outputs/logs/main.log
time="2022-04-04T16:03:07.717Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2022-04-04T16:03:07.717Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2022-04-04T16:03:07.717Z" level=info msg="No output parameters"
time="2022-04-04T16:03:07.717Z" level=info msg="No output artifacts"
time="2022-04-04T16:03:07.736Z" level=info msg="Create workflowtaskresults 201"
time="2022-04-04T16:03:07.737Z" level=info msg="Killing sidecars []"
time="2022-04-04T16:03:07.737Z" level=info msg="Alloc=5688 TotalAlloc=14812 Sys=23506 NumGC=7 Goroutines=10"

The main.log file is about 300KB here
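The timestamps above tell the story: "Saving file to s3" at 16:01:03 and the next log line at 16:03:07, so uploading a ~300KB file took roughly 2m4s inside the wait container itself. As a quick sketch, the same pair of timestamps can be pulled from any pod's wait container (pod name and namespace are placeholders):

# Print the start and end of the artifact upload from the wait container.
kubectl logs <pod-name> -c wait -n <namespace> \
  | grep -E 'Saving file to s3|Successfully saved file'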

@alexec (Contributor) commented Apr 4, 2022

If they all take the exact same time, that hints at a problem in one specific place in the Argo Server.

@alexec (Contributor) commented Apr 4, 2022

Can you attach the argo-server logs?

@a3lk4i commented Apr 4, 2022

During the pod run

time="2022-04-04T15:58:18.145Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:27.204Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:27.212Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T16:00:27Z" grpc.time_ms=10.506 span.kind=server system=grpc
time="2022-04-04T16:00:27.547Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:39.579Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WorkflowLogs grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T16:00:27Z" grpc.time_ms=12034.497 span.kind=server system=grpc
time="2022-04-04T16:02:15.428Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T15:58:18Z" grpc.time_ms=237286.6 span.kind=server system=grpc
time="2022-04-04T16:03:18.092Z" level=info msg="Alloc=12073 TotalAlloc=5976661 Sys=29265 NumGC=7960 Goroutines=93"
time="2022-04-04T16:08:18.092Z" level=info msg="Alloc=11273 TotalAlloc=5978609 Sys=29265 NumGC=7963 Goroutines=93"
time="2022-04-04T16:11:22.522Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T15:57:36Z" grpc.time_ms=825619.3 span.kind=server system=grpc
time="2022-04-04T15:58:18.145Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:27.204Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:27.212Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T16:00:27Z" grpc.time_ms=10.506 span.kind=server system=grpc
time="2022-04-04T16:00:27.547Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:00:39.579Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WorkflowLogs grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T16:00:27Z" grpc.time_ms=12034.497 span.kind=server system=grpc
time="2022-04-04T16:02:15.428Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T15:58:18Z" grpc.time_ms=237286.6 span.kind=server system=grpc
time="2022-04-04T16:03:18.092Z" level=info msg="Alloc=12073 TotalAlloc=5976661 Sys=29265 NumGC=7960 Goroutines=93"
time="2022-04-04T16:08:18.092Z" level=info msg="Alloc=11273 TotalAlloc=5978609 Sys=29265 NumGC=7963 Goroutines=93"
time="2022-04-04T16:11:22.522Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2022-04-04T15:57:36Z" grpc.time_ms=825619.3 span.kind=server system=grpc

When getting logs from the UI

time="2022-04-04T16:59:10.573Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:59:10.582Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListArchivedWorkflowLabelKeys grpc.service=workflowarchive.ArchivedWorkflowService grpc.start_time="2022-04-04T16:59:10Z" grpc.time_ms=14.643 span.kind=server system=grpc
time="2022-04-04T16:59:10.589Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListArchivedWorkflows grpc.service=workflowarchive.ArchivedWorkflowService grpc.start_time="2022-04-04T16:59:10Z" grpc.time_ms=18.937 span.kind=server system=grpc
time="2022-04-04T16:59:12.847Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T16:59:12.854Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetArchivedWorkflow grpc.service=workflowarchive.ArchivedWorkflowService grpc.start_time="2022-04-04T16:59:12Z" grpc.time_ms=9.938 span.kind=server system=grpc
time="2022-04-04T17:00:07.981Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T17:00:07.986Z" level=info msg="Download artifact" artifactName=main-logs isInput=false nodeId=<redacted> uid=d8fea3cc-b687-4373-a20a-0a9259157e31
time="2022-04-04T17:00:07.986Z" level=info msg="S3 Load path: /tmp/artifact3631983470, key: <redacted>/main.log"
time="2022-04-04T17:00:07.986Z" level=info msg="Creating minio client using IAM role"
time="2022-04-04T17:00:07.987Z" level=info msg="Getting file from s3" bucket=<redacted> endpoint=s3.amazonaws.com key=<redacted>/main.log path=/tmp/artifact3631983470
time="2022-04-04T17:01:07.986Z" level=info msg="using the default service account for user" subject=CgQzMDYzEgllZy1naXRodWI
time="2022-04-04T17:01:07.990Z" level=info msg="Download artifact" artifactName=main-logs isInput=false nodeId=<redacted> uid=d8fea3cc-b687-4373-a20a-0a9259157e31
time="2022-04-04T17:01:07.990Z" level=info msg="S3 Load path: /tmp/artifact2233602876, key: <redacted>/main.log"
time="2022-04-04T17:01:07.990Z" level=info msg="Creating minio client using IAM role"
time="2022-04-04T17:01:07.990Z" level=info msg="Getting file from s3" bucket=<redacted> endpoint=s3.amazonaws.com key=<redacted>/main.log path=/tmp/artifact2233602876
2022/04/04 17:02:11 http: superfluous response.WriteHeader call from github.com/argoproj/argo-workflows/v3/server/artifacts.(*ArtifactServer).returnArtifact (artifact_server.go:225)
2022/04/04 17:03:10 http: superfluous response.WriteHeader call from github.com/argoproj/argo-workflows/v3/server/artifacts.(*ArtifactServer).returnArtifact (artifact_server.go:225)

Maybe due to the timeouts set on my ingress, I never see the logs.
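The "superfluous response.WriteHeader" lines are consistent with that theory: Go's net/http emits that warning when a handler calls WriteHeader after the response has already been committed, which is what happens if the server tries to report an error after the client (here, the ingress) has given up on the slow download. While debugging, raising the ingress read timeout should at least let the transfer finish. A sketch, assuming the NGINX ingress controller; the ingress name and namespace are placeholders:

# Give the proxied artifact download up to 5 minutes before the ingress cuts it.
kubectl annotate ingress <argo-server-ingress> -n <namespace> \
  nginx.ingress.kubernetes.io/proxy-read-timeout=300 --overwrite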

@alexec (Contributor) commented Apr 7, 2022

Do you have the same problem when connecting directly, rather than via ingress or LB?

@chriswilley (Author)

All my testing has been done when connecting through kubectl proxy direct to the cluster.
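A plain port-forward is another way to rule the ingress/LB out entirely; a sketch, assuming the standard argo namespace and Service name:

# Talk to the argo-server Service directly, bypassing any ingress or LB.
kubectl -n argo port-forward svc/argo-server 2746:2746
# then open https://localhost:2746 in the browser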

@alexec (Contributor) commented Apr 11, 2022

I think we need to do live debugging.

Could you please set up 30m?

https://bit.ly/book-30m-with-argo-team

@chriswilley (Author)

Zoom is not accepting my password

@chriswilley (Author)

I believe I'm in the waiting room now

@alexec (Contributor) commented Apr 11, 2022

Hmm, I don't see your booking, maybe you got booked with someone else. Could you do 10:30am PT tomorrow?

@chriswilley (Author)

Yes

@alexec (Contributor) commented Apr 11, 2022

Cool. Are you able to book that slot tomorrow?

@chriswilley (Author)

It says there are no slots available

@alexec (Contributor) commented Apr 11, 2022

11am or 11:30am?

@chriswilley (Author)

do you see a booking for me at 1pm EST Wednesday?

@chriswilley (Author)

Are you still available now?

alexec changed the title from "Retrieving pod artifacts using Web UI from S3 is slow" to "v3.3: Retrieving pod artifacts using Web UI from S3 is slow" Apr 13, 2022
@alexec (Contributor) commented Apr 13, 2022

git log --oneline v3.2.9..v3.3.0 -- go.mod

f0fb0d5 chore(deps): bump github.com/minio/minio-go/v7 from 7.0.15 to 7.0.23 (#8003)
aea6c39 chore(deps): bump github.com/gorilla/websocket from 1.4.2 to 1.5.0 (#7991)
89d7cc3 chore(deps): bump github.com/tidwall/gjson from 1.13.0 to 1.14.0 (#7992)
6076524 chore: Upgrade to Kubernetes v0.23. Fixes #6750 (#7832)
980f2fe chore(deps): bump github.com/gavv/httpexpect/v2 from 2.2.0 to 2.3.1 (#7979)
5e45cd9 chore(deps): bump github.com/antonmedv/expr from 1.8.9 to 1.9.0 (#7967)
79fc4a9 (tag: v3.3.0-rc6, tag: v3.3.0-rc5) chore(deps): bump upper.io/db.v3 (#7939)
848effc chore(deps): bump github.com/aliyun/aliyun-oss-go-sdk (#7919)
044389b chore(deps): bump github.com/aliyun/aliyun-oss-go-sdk (#7901)
dc82f3f chore(deps): bump github.com/prometheus/client_golang (#7880)
69c5bc7 fix: Fix go-jose dep. Fixes #7814 (#7874)
1d7a177 chore(deps): bump github.com/soheilhy/cmux from 0.1.4 to 0.1.5 (#7848)
2797707 chore(deps): bump github.com/go-openapi/spec from 0.20.2 to 0.20.4 (#7817)
74d1bbe chore(deps): bump cloud.google.com/go/storage from 1.19.0 to 1.20.0 (#7747)
1277f05 chore(deps): bump gopkg.in/square/go-jose.v2 from 2.5.1 to 2.6.0 (#7740)
7e6f2c0 chore(deps): bump github.com/valyala/fasttemplate from 1.1.0 to 1.2.1 (#7727)
877d656 chore(deps): bump cloud.google.com/go/storage from 1.10.0 to 1.19.0 (#7714)
0f4c484 chore(deps): bump gopkg.in/go-playground/webhooks.v5 (#7704)
4664aea chore(deps): bump google.golang.org/grpc from v1.38.0 to v1.41.1 (#7658)
ecf2cec chore(deps): bump github.com/grpc-ecosystem/go-grpc-middleware (#7679)
5591534 chore(deps): bump github.com/prometheus/common from 0.26.0 to 0.32.1 (#7660)
d7bc123 chore: Upgrade tidwall/gjson to improve security (#7237)
a6ce659 feat!: Remove the hidden flag verify from argo submit (#7158)
f9e554d fix: Relative submodules in git artifacts. Fixes #7141 (#7162)
8d7ca73 feat: Upgrade to Golang 1.17 (#7029)
d815031 chore: Upgrade argo event 1.15 (#6932)
606011b build: simplify using Golang 1.16 (#6893)
c962bb1 fix: upgrade sprig to v3.2.2 (#6876)

@alexec (Contributor) commented Apr 13, 2022

rm

alexec added a commit that referenced this issue Apr 13, 2022
@alexec (Contributor) commented Apr 13, 2022

@chriswilley I've created an engineering build that adds additional informational logging. This should help diagnose where the problem is. It will take 1h for the image to be published. The tag will be :dev-log. Can you also test with :latest to make sure that this issue has not been accidentally fixed?

You only need to change the argo-server image tag. Leave all other configuration unchanged.
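Swapping the tag is a one-liner. A sketch, assuming the quick-start manifests where argo-server runs the quay.io/argoproj/argocli image in the argo namespace:

# Point only the server at the engineering build; the controller stays as-is.
kubectl -n argo set image deployment/argo-server \
  argo-server=quay.io/argoproj/argocli:dev-log
kubectl -n argo rollout status deployment/argo-server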

alexec self-assigned this Apr 13, 2022
@chriswilley (Author)

Will do. Thanks.

@tooptoop4 (Contributor)

I think I found the cause: minio-go 7.0.23 introduced a regression that is fixed in 7.0.24 (minio/minio-go#1626).
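That lines up with the v3.2.9..v3.3.0 diff above, where #8003 bumped minio-go from 7.0.15 to 7.0.23. For anyone building from source, picking up the fixed release would look roughly like this, run from the repo root:

# Bump minio-go past the regression (fix landed in 7.0.24), then tidy.
go get github.com/minio/minio-go/v7@v7.0.24
go mod tidy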

@alexec (Contributor) commented Apr 18, 2022

#8401

alexec mentioned this issue Apr 18, 2022
@alexec (Contributor) commented Apr 18, 2022

I've cherry-picked a0148c1 to fix #8257.
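The backport itself is an ordinary cherry-pick onto the maintenance branch; a sketch, assuming the branch follows the project's usual release-3.3 naming:

# Carry the fix from the default branch onto the 3.3 release line.
git checkout release-3.3
git cherry-pick a0148c1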

@alexec (Contributor) commented Apr 18, 2022

Will be fixed in v3.3.2

alexec closed this as completed Apr 18, 2022
alexec added a commit that referenced this issue Apr 18, 2022 (#8394)