Handle idle timeouts more gracefully #5124

Closed
pires opened this issue Feb 16, 2021 · 5 comments

pires commented Feb 16, 2021

Summary

Refs #4524 (closed, but the supposed fix apparently never landed?). cc @dcherman

When you're using an ingress controller that has an idle timeout configured, it's possible that no events occur within that period, which results in the UI throwing an error because the workflow-events stream has been closed. In my cluster, I use ingress-nginx, which has a default idle timeout of 60s.

Since these streams are expected to be very long-lived connections, we should consider one of the following:

Send a piece of data periodically if none has been sent. This is not optimal imo, since we'd need to filter it out on the client, and it still may not solve the problem if the user configures an idle timeout shorter than the interval at which we send data.

Retry the connection on the front-end at least once. If the connection is successfully re-established, then it's a candidate to retry again if/when the error occurs. If the connection fails to be re-established, throw our existing error since that might indicate a loss of network connectivity or problems with the argo-server pod.

In both cases, we should also provide a nicer way to retry when this occurs rather than reloading the page.
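
The second option above (retry once, and keep retrying only while reconnects succeed) could be sketched as a small state machine. This is only an illustration under assumptions: `ReconnectPolicy`, `StreamEvent`, and `Action` are hypothetical names, not part of the Argo UI, and the actual reconnect and error display are left to the caller.

```typescript
// Hypothetical sketch, not the Argo UI's actual code.
type StreamEvent = "open" | "error";
type Action = "none" | "reconnect" | "show-error";

class ReconnectPolicy {
    // True while the most recent connection attempt has opened successfully.
    private connected = false;

    onEvent(event: StreamEvent): Action {
        if (event === "open") {
            // The stream is (re-)established, so a later drop may be retried.
            this.connected = true;
            return "none";
        }
        if (this.connected) {
            // An established stream dropped, possibly due to an idle timeout:
            // retry once, and forget that we were connected.
            this.connected = false;
            return "reconnect";
        }
        // The connection attempt itself failed: surface the existing error.
        return "show-error";
    }
}
```

With this shape, a drop after a successful open yields `"reconnect"`, while an error that follows a reconnect attempt without an intervening `"open"` yields `"show-error"`, which matches the case where the network or the argo-server pod is actually unavailable.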

Diagnostics

What Kubernetes provider are you using?

Private, Kubernetes 1.19

What version of Argo Workflows are you running?

v3.0.0-rc1 but also verified on 2.12.8.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

pires added the type/bug label Feb 16, 2021

Author

pires commented Feb 16, 2021

This bug is not a blocker, but it is truly annoying: things that are working and showing progress disappear from the screen for ~10s, show up, and disappear again, rinse and repeat.

Contributor

alexec commented Feb 16, 2021

I think @dcherman is fixing this as we speak.

Author

pires commented Feb 16, 2021

Cool! Is it still expected to land on v3.0.0?

Contributor

alexec commented Feb 16, 2021

I hope so.

dcherman added a commit to dcherman/argo that referenced this issue Feb 16, 2021
When operating behind load balancers / ingress controllers that have
an idle timeout configured, it's not uncommon to get disconnected and have
an error shown in the UI if you're looking at a relatively inactive workflow
or workflow list.

In the SSE spec, :\n is a sequence that you can send to the client which
should be ignored by the client, so we can use that to periodically send
something in the response without affecting the code in the UI at all.

Fixes argoproj#5006
Fixes argoproj#5124

Signed-off-by: Daniel Herman <dherman@factset.com>
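
To illustrate why a `:` keep-alive line is invisible to the UI, here is a minimal sketch of SSE line handling. It is an assumption-laden illustration, not the browser's `EventSource` implementation and not Argo's code: `parseSseData` is a hypothetical helper that handles only `data:` fields and comment lines.

```typescript
// Hypothetical helper: returns the data payload of each event in an SSE stream.
function parseSseData(stream: string): string[] {
    const events: string[] = [];
    let dataLines: string[] = [];
    for (const line of stream.split("\n")) {
        if (line === "") {
            // A blank line ends an event: dispatch it if any data was buffered.
            if (dataLines.length > 0) {
                events.push(dataLines.join("\n"));
                dataLines = [];
            }
        } else if (line.charAt(0) === ":") {
            // Comment line (for example a ":" keep-alive): ignored entirely.
            continue;
        } else if (line.slice(0, 5) === "data:") {
            // Strip the field name and one optional leading space.
            const value = line.slice(5);
            dataLines.push(value.charAt(0) === " " ? value.slice(1) : value);
        }
    }
    return events;
}
```

For example, `parseSseData("data: a\n\n:\n:\ndata: b\n\n")` returns `["a", "b"]`: the two comment lines keep bytes flowing through the proxy but produce no events on the client.
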
Member

This should be fixed in master; I forgot to add a ref to this issue in the commit.

#5101

alexec closed this as completed Feb 22, 2021