Recurring UI problems when watching the workflow running #5006

sylock · 2021-02-03T17:18:56Z

Summary

When watching a just-started workflow, I get a UI error every 10-20 seconds. This currently makes the product unusable in production, because you are interrupted every few seconds while browsing the UI.

Screenshot of the UI:

From the console:

In text:

GET https://argo-dev.apps.argo-wf.acc.cloud.smals.be/api/v1/workflow-events/openshift-odc?listOptions.fieldSelector=metadata.name=node-network-loss-latest-cfmn8&listOptions.resourceVersion=0 net::ERR_INCOMPLETE_CHUNKED_ENCODING 200 (OK)

Diagnostics

What Kubernetes provider are you using?
Openshift 4.5.5 with kubernetes 1.18.3

What version of Argo Workflows are you running?
2.12.7

The Argo server runs in server authentication mode.

I should add that, since I use OpenShift, I use the PNS executor, and the Argo server is exposed through a route:
https://docs.openshift.com/container-platform/4.5/networking/routes/route-configuration.html

The error is not workflow related. I have the exact same behavior whatever workflow is in use.

Logs of argo server at the moment of the error:

time="2021-02-03T17:25:18.941Z" level=info msg="finished streaming call with code NotFound" error="rpc error: code = NotFound desc = workflows.argoproj.io \"node-network-loss-dev-rx9xp\" not found" grpc.code=NotFound grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-02-03T17:25:18Z" grpc.time_ms=7.023 span.kind=server system=grpc
time="2021-02-03T17:25:20.120Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-02-03T17:24:50Z" grpc.time_ms=30020.98 span.kind=server system=grpc
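One reading of the second log line, offered as an observation rather than a diagnosis: `grpc.time_ms=30020.98` means the `WatchWorkflows` stream ended about 30 seconds after it started, which would be consistent with something on the path closing the connection on a timer. The sketch below pulls that value out of a logfmt-style line. `parseLogfmt` is a hypothetical helper written for this illustration, not part of Argo Workflows.

```javascript
// Hypothetical helper: parse a logfmt-style line (key=value pairs,
// values optionally double-quoted with backslash escapes) into an object.
function parseLogfmt(line) {
  const fields = {};
  const pair = /([^\s=]+)=("(?:[^"\\]|\\.)*"|\S*)/g;
  let m;
  while ((m = pair.exec(line)) !== null) {
    let value = m[2];
    if (value.length >= 2 && value.startsWith('"') && value.endsWith('"')) {
      // Strip the surrounding quotes and undo backslash escapes.
      value = value.slice(1, -1).replace(/\\(.)/g, "$1");
    }
    fields[m[1]] = value;
  }
  return fields;
}

// The second server log line quoted above.
const okLine =
  'time="2021-02-03T17:25:20.120Z" level=info ' +
  'msg="finished streaming call with code OK" grpc.code=OK ' +
  'grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService ' +
  'grpc.start_time="2021-02-03T17:24:50Z" grpc.time_ms=30020.98 ' +
  'span.kind=server system=grpc';

const okFields = parseLogfmt(okLine);
// Duration of the streaming call in seconds.
const streamSeconds = Number(okFields["grpc.time_ms"]) / 1000;
console.log(okFields["grpc.method"], "lasted", streamSeconds.toFixed(1), "s");
```

Comparing this duration against the idle timeout configured on the route or reverse proxy is one way to check whether the two line up.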

The failing request, copied from the browser web console:

fetch("https://argo-dev.apps.argo-wf.acc.<OBFUSCATED_DOMAIN>/api/v1/workflow-events/openshift-odc?listOptions.fieldSelector=metadata.name=node-network-loss-latest-cfmn8&listOptions.resourceVersion=0", {
  "headers": {
    "accept": "text/event-stream",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "no-cache",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "sec-gpc": "1"
  },
  "referrer": "https://argo-dev.apps.argo-wf.acc.<OBFUSCATED_DOMAIN>/workflows/openshift-odc/node-network-loss-latest-cfmn8?tab=workflow",
  "referrerPolicy": "strict-origin-when-cross-origin",
  "body": null,
  "method": "GET",
  "mode": "cors",
  "credentials": "include"
});

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

alexec · 2021-02-03T17:45:31Z

No one else is complaining about this issue. Similar issues are a problem when running Nginx or other proxies in front of Argo Workflows.

You can try argoproj/argocli:latest. If it is still a problem with that version, then you most likely have some network issues and should speak to your network operator.

sylock · 2021-02-03T18:01:33Z

I am on latest.

I tried with the browser proxy disabled and I still get the same error.

There is a reverse proxy in front of the argo-server, but I guess you support this configuration since it is the usual architecture for any production-grade setup. That reverse proxy works for absolutely everything (except that request). If you have any hint that could be useful for troubleshooting, it would be of great help (maybe a debug mode in argo server, or anything else?).

austinpray-mixpanel · 2021-02-10T20:17:15Z

Seeing the same issue periodically. Haven't looked into it.

In the network inspector this shows up as a 500 error with a TTFB of more than 1.3 minutes.

Kube 1.15 on GKE running this https://github.com/argoproj/argo-workflows/blob/v2.12.8/manifests/install.yaml

When operating behind load balancers / ingress controllers that have an idle timeout configured, it's not uncommon to get disconnected and have an error shown in the UI if you're looking at a relatively inactive workflow or workflow list. In the SSE spec, `:\n` is a sequence that you can send to the client which should be ignored by the client, so we can use that to periodically send something in the response without affecting the code in the UI at all.

Fixes argoproj#5006
Fixes argoproj#5124

Signed-off-by: Daniel Herman <dherman@factset.com>
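The keepalive idea described in the commit message can be sketched as follows. This is a minimal illustration, not the code from the merged pull request: `startKeepalive` and `parseEventStream` are hypothetical names, the 15-second interval in the usage note is an assumed value, and the parser is a simplified reading of the SSE rules that handles only `data` fields.

```javascript
// An SSE comment line: a line starting with ":" is ignored by clients.
const KEEPALIVE_COMMENT = ":\n";

// Hypothetical helper: call write(":\n") every intervalMs milliseconds so
// that an idle stream still carries traffic. Returns a function that stops
// the timer. The timer functions are injectable to make testing easy.
function startKeepalive(write, intervalMs,
                        setIntervalFn = setInterval,
                        clearIntervalFn = clearInterval) {
  const handle = setIntervalFn(() => write(KEEPALIVE_COMMENT), intervalMs);
  return () => clearIntervalFn(handle);
}

// Simplified client-side view: collect the data payload of each event.
// Comment lines are skipped, and a blank line dispatches the pending event.
function parseEventStream(text) {
  const events = [];
  let data = [];
  for (const line of text.split(/\r\n|\n|\r/)) {
    if (line === "") {
      if (data.length > 0) events.push(data.join("\n"));
      data = [];
    } else if (line.startsWith(":")) {
      continue; // keepalive or other comment: nothing reaches the UI
    } else {
      const idx = line.indexOf(":");
      const field = idx === -1 ? line : line.slice(0, idx);
      let value = idx === -1 ? "" : line.slice(idx + 1);
      if (value.startsWith(" ")) value = value.slice(1);
      if (field === "data") data.push(value);
    }
  }
  return events;
}

// Two keepalives, one real event, one more keepalive.
const demoStream = ':\n:\ndata: {"phase":"Running"}\n\n:\n';
console.log(parseEventStream(demoStream)); // one event; comments are dropped
```

On a server, `write` would be the response's write function, for example `startKeepalive((chunk) => res.write(chunk), 15000)` in a Node HTTP handler. The interval has to be shorter than the shortest idle timeout of any proxy on the path, otherwise the connection is still dropped.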

middlestone · 2021-08-24T07:24:44Z

Strangely, I still have this problem on version 3.1.4, which already has the patch merged. Is there anything I am missing?

tsunamishaun · 2021-11-12T21:03:11Z

@middlestone I found that my Gloo gateway was blocking EventSource calls. I removed the timeout and the problem went away.

sylock added the type/bug label Feb 3, 2021

sylock mentioned this issue Feb 3, 2021

Argo cli skip-tls-verify does not skip as expected #5008

Closed

dcherman mentioned this issue Feb 12, 2021

fix: send periodic keepalive packets on eventstream connections #5094

Closed

dcherman mentioned this issue Feb 13, 2021

fix: send periodic keepalive packets on eventstream connections #5101

Merged

dcherman mentioned this issue Feb 13, 2021

fix: send periodic keepalive packets on eventstream connections argoproj/pkg#35

Merged

alexec closed this as completed in #5101 Feb 22, 2021
