Handle idle timeouts more gracefully #5124

pires · 2021-02-16T18:24:51Z

Summary

Refs #4524 (a closed issue whose supposed fix never landed?). cc @dcherman

When you're using an ingress controller that has an idle timeout configured, it's possible that no events occur within that period, in which case the workflow-events stream is closed and the UI throws an error. In my cluster, I use ingress-nginx, which has a default idle timeout of 60s.

Since it's expected that these streams are very long lived connections, we should consider one of the following:

Send a piece of data periodically if none has been sent. This is not optimal imo since we'd need to filter this out on the client, and it still may not solve the problem if the user configures an idle timeout shorter than the interval at which we send data.

Retry the connection on the front-end at least once. If the connection is successfully re-established, then it's a candidate to retry again if/when the error occurs. If the connection fails to be re-established, throw our existing error since that might indicate a loss of network connectivity or problems with the argo-server pod.

In both cases, we should also provide a nicer way to retry when this occurs rather than reloading the page.
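
To make the retry option concrete, here is a minimal sketch of that policy in Python (the real UI is TypeScript, so treat this as language-neutral pseudocode that happens to run). The names `consume_with_retry`, `connect`, `handle`, and `StreamClosed` are hypothetical and not part of Argo; the sketch assumes `connect` raises `ConnectionError` when it cannot establish a stream.

```python
from typing import Callable, Iterable


class StreamClosed(Exception):
    """Hypothetical error: an established stream was dropped (e.g. idle timeout)."""


def consume_with_retry(
    connect: Callable[[], Iterable[str]],
    handle: Callable[[str], None],
) -> None:
    """Consume events, reconnecting whenever an established stream drops.

    `connect` is assumed to raise ConnectionError if it cannot establish
    a stream; that error is deliberately not caught here.
    """
    stream = connect()  # an initial failure propagates unchanged
    while True:
        try:
            for event in stream:
                handle(event)
            return  # the server ended the stream normally
        except StreamClosed:
            # The dropped connection had been established, so it earns one
            # reconnect attempt. If connect() raises here, the caller sees
            # the error, matching the "throw our existing error" case.
            stream = connect()
```

A stream that keeps reconnecting and dropping would loop indefinitely in this sketch, so a real implementation would probably want a delay or backoff between attempts.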

Diagnostics

What Kubernetes provider are you using?

Private, Kubernetes 1.19

What version of Argo Workflows are you running?

v3.0.0-rc1 but also verified on 2.12.8.

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

pires · 2021-02-16T18:25:37Z

This bug is not a blocker but is truly annoying, as things that are working and showing progress disappear from the screen for ~10s, show up and disappear again, rinse and repeat.

alexec · 2021-02-16T18:26:59Z

I think @dcherman is fixing this as we speak.

pires · 2021-02-16T19:09:32Z

Cool! Is it still expected to land on v3.0.0?

alexec · 2021-02-16T19:46:41Z

I hope so.

When operating behind load balancers or ingress controllers that have an idle timeout configured, it's not uncommon to get disconnected and have an error shown in the UI if you're looking at a relatively inactive workflow or workflow list.

In the SSE spec, a line that starts with a colon (such as :\n) is a comment that the client ignores, so we can send it periodically in the response without affecting the code in the UI at all.

Fixes argoproj#5006
Fixes argoproj#5124

Signed-off-by: Daniel Herman <dherman@factset.com>
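
As a rough illustration of the keep-alive idea in this commit message, the following Python sketch emits an SSE comment line whenever no event arrives within an idle window, and shows a simplified client that skips such lines. It is not the Go code from the actual change; `sse_frames`, `parse_data`, `idle_seconds`, and the `None` end-of-stream sentinel are all assumptions made for the example.

```python
import queue
from typing import Iterator, List, Optional

KEEPALIVE = ":\n"  # SSE comment line; a conforming client ignores it


def sse_frames(
    events: "queue.Queue[Optional[str]]", idle_seconds: float
) -> Iterator[str]:
    """Yield SSE frames, inserting a comment when the queue stays idle.

    A None item is a hypothetical sentinel meaning "end of stream".
    """
    while True:
        try:
            item = events.get(timeout=idle_seconds)
        except queue.Empty:
            # Nothing to send within the window: emit a comment so that
            # proxies see traffic and do not close the connection.
            yield KEEPALIVE
            continue
        if item is None:
            return
        yield f"data: {item}\n\n"


def parse_data(raw: str) -> List[str]:
    """Simplified client view: collect `data:` payloads, skip comments.

    Real SSE parsing also joins multi-line data and handles other fields;
    this only shows that comment lines never reach the application.
    """
    payloads = []
    for line in raw.split("\n"):
        if line.startswith(":"):
            continue  # comment line, ignored
        if line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # drop the single optional leading space
            payloads.append(value)
    return payloads
```

For the keep-alive to help, `idle_seconds` has to be shorter than the proxy's idle timeout, for example well under the 60s default mentioned for ingress-nginx in this issue.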

dcherman · 2021-02-22T19:05:32Z

This should be fixed in master, forgot to add a ref to this issue in the commit.

#5101

pires added the type/bug label Feb 16, 2021

alexec closed this as completed Feb 22, 2021
