Recurring UI problems when watching the workflow running #5006

Closed
sylock opened this issue Feb 3, 2021 · 5 comments · Fixed by #5101

@sylock
Contributor

sylock commented Feb 3, 2021

Summary

When watching a workflow that has just started, I get a UI error every 10-20 seconds. This currently makes the product unusable in production, because browsing the UI is interrupted every few seconds.

Screenshot of the UI:
2021-02-03_18h10_43

From the console:
2021-02-03_18h10_34
In text:

GET https://argo-dev.apps.argo-wf.acc.cloud.smals.be/api/v1/workflow-events/openshift-odc?listOptions.fieldSelector=metadata.name=node-network-loss-latest-cfmn8&listOptions.resourceVersion=0 net::ERR_INCOMPLETE_CHUNKED_ENCODING 200 (OK)

Diagnostics

What Kubernetes provider are you using?
OpenShift 4.5.5 with Kubernetes 1.18.3

What version of Argo Workflows are you running?
2.12.7

The Argo server is running in server authentication mode.

I should add that, since I am on OpenShift, I use the PNS executor, and the Argo server is exposed through a route:
https://docs.openshift.com/container-platform/4.5/networking/routes/route-configuration.html

The error is not specific to one workflow. I see exactly the same behavior whichever workflow is running.

Logs of argo server at the moment of the error:

time="2021-02-03T17:25:18.941Z" level=info msg="finished streaming call with code NotFound" error="rpc error: code = NotFound desc = workflows.argoproj.io \"node-network-loss-dev-rx9xp\" not found" grpc.code=NotFound grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-02-03T17:25:18Z" grpc.time_ms=7.023 span.kind=server system=grpc
time="2021-02-03T17:25:20.120Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-02-03T17:24:50Z" grpc.time_ms=30020.98 span.kind=server system=grpc
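
One way to read the second log line, offered as an assumption rather than a confirmed diagnosis: a WatchWorkflows call that ends with code OK after almost exactly 30 seconds is what an idle timeout somewhere in front of the server would tend to produce. The sketch below pulls grpc.time_ms out of such lines and flags calls that end close to a suspected timeout. The helper names, the 30000 ms default and the 500 ms tolerance are hypothetical; substitute whatever your proxy is actually configured with.

```javascript
// Hypothetical helper: parse one logfmt-style line (key=value pairs,
// values either bare tokens or double-quoted with \" escapes).
function parseLogfmt(line) {
  const fields = {};
  const re = /([^\s=]+)=(?:"((?:[^"\\]|\\.)*)"|(\S*))/g;
  let m;
  while ((m = re.exec(line)) !== null) {
    // m[2] is set for quoted values, m[3] for bare ones.
    fields[m[1]] = m[2] !== undefined ? m[2].replace(/\\"/g, '"') : m[3];
  }
  return fields;
}

// Hypothetical check: did a watch call end cleanly after roughly
// idleTimeoutMs? Both thresholds are assumptions, not Argo defaults.
function looksLikeIdleTimeout(line, idleTimeoutMs = 30000, toleranceMs = 500) {
  const f = parseLogfmt(line);
  const elapsed = Number(f["grpc.time_ms"]); // NaN if the field is missing
  return (
    f["grpc.method"] === "WatchWorkflows" &&
    f["grpc.code"] === "OK" &&
    Math.abs(elapsed - idleTimeoutMs) <= toleranceMs
  );
}

// The second log line from this report.
const sample =
  'time="2021-02-03T17:25:20.120Z" level=info msg="finished streaming call with code OK" ' +
  'grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService ' +
  'grpc.start_time="2021-02-03T17:24:50Z" grpc.time_ms=30020.98 span.kind=server system=grpc';

console.log(looksLikeIdleTimeout(sample)); // true
```

A match only says the duration is suspicious. It does not identify which component closed the stream.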

The failing request, copied from the browser web console:

fetch("https://argo-dev.apps.argo-wf.acc.<OBFUSCATED_DOMAIN>/api/v1/workflow-events/openshift-odc?listOptions.fieldSelector=metadata.name=node-network-loss-latest-cfmn8&listOptions.resourceVersion=0", {
  "headers": {
    "accept": "text/event-stream",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "no-cache",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "sec-gpc": "1"
  },
  "referrer": "https://argo-dev.apps.argo-wf.acc.<OBFUSCATED_DOMAIN>/workflows/openshift-odc/node-network-loss-latest-cfmn8?tab=workflow",
  "referrerPolicy": "strict-origin-when-cross-origin",
  "body": null,
  "method": "GET",
  "mode": "cors",
  "credentials": "include"
});

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@sylock sylock added the type/bug label Feb 3, 2021
@alexec
Contributor

alexec commented Feb 3, 2021

No one else is complaining about this issue. Similar issues are a problem when running Nginx or other proxies in front of Argo Workflows.

You can try argoproj/argocli:latest. If it is still a problem with that version, then you most likely have some network issues and should speak to your network operator.

@sylock
Contributor Author

sylock commented Feb 3, 2021

I am on latest.

I tried with the browser proxy disabled and I still get the same error.

There is a reverse proxy in front of the argo-server, but I assume you support this configuration, since it is the usual architecture for any production-grade setup. That reverse proxy works for absolutely everything except that request. If you have any hints that could be useful for troubleshooting, that would be a great help (maybe a debug mode in the Argo server, or anything else?).

@austinpray-mixpanel

austinpray-mixpanel commented Feb 10, 2021

Seeing the same issue periodically. Haven't looked into it.

Screen Shot 2021-02-10 at 1 36 43 PM

In the network inspector this shows up as a 500 error with a TTFB of more than 1.3 minutes.

Kubernetes 1.15 on GKE, running this manifest: https://github.com/argoproj/argo-workflows/blob/v2.12.8/manifests/install.yaml

dcherman added a commit to dcherman/argo that referenced this issue Feb 12, 2021
When Argo is operating behind load balancers / ingress controllers that have
an idle timeout configured, it's not uncommon to get disconnected and have
an error shown in the UI if you're looking at a relatively inactive workflow
or workflow list.

In the SSE spec, `:\n` is a sequence that you can send to the client which
should be ignored by the client, so we can use that to periodically send
something in the response without affecting the code in the UI at all.

Fixes argoproj#5006

Signed-off-by: Daniel Herman <dherman@factset.com>
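
To illustrate why a periodic `:\n` is harmless to the UI, here is a minimal sketch of event-stream parsing in JavaScript. It is a simplified reading of the SSE rules (only `data:` fields and comment lines are handled). It is not the parser used by browsers or by the Argo UI, and it is not the code from the commit above. `parseEventStream` is a hypothetical name.

```javascript
// Simplified event-stream parser (assumption: only "data:" fields matter;
// "event:", "id:" and "retry:" are skipped for brevity).
function parseEventStream(text) {
  const events = [];
  let data = [];
  for (const line of text.split(/\r\n|\n|\r/)) {
    if (line === "") {
      // A blank line ends the current event; dispatch it if it has data.
      if (data.length > 0) events.push(data.join("\n"));
      data = [];
    } else if (line.startsWith(":")) {
      // Comment line: ignored. This is what makes ":\n" usable as a keepalive.
      continue;
    } else if (line.startsWith("data:")) {
      let value = line.slice(5);
      if (value.startsWith(" ")) value = value.slice(1); // strip one leading space
      data.push(value);
    }
  }
  return events;
}

// The same two events, without and with keepalive comments in between.
const plain = "data: a\n\ndata: b\n\n";
const withKeepalives = ":\ndata: a\n\n:\n:\ndata: b\n\n";

console.log(parseEventStream(plain));
console.log(parseEventStream(withKeepalives));
```

Both calls yield the same list of events. The comment lines put bytes on the wire, which is what resets an idle timer in a proxy, without producing anything the consumer has to handle.
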
dcherman added a commit to dcherman/pkg that referenced this issue Feb 13, 2021
When operating behind load balancers / ingress controllers that have
an idle timeout configured, it's not uncommon to get disconnected and have
an error shown in the UI if you're looking at a relatively inactive workflow
or workflow list.

In the SSE spec, `:\n` is a sequence that you can send to the client which
should be ignored by the client, so we can use that to periodically send
something in the response without affecting the code in the UI at all.

argoproj/argo-workflows#5006
Signed-off-by: Daniel Herman <dherman@factset.com>
dcherman added a commit to dcherman/argo that referenced this issue Feb 16, 2021
When operating behind load balancers / ingress controllers that have
an idle timeout configured, it's not uncommon to get disconnected and have
an error shown in the UI if you're looking at a relatively inactive workflow
or workflow list.

In the SSE spec, `:\n` is a sequence that you can send to the client which
should be ignored by the client, so we can use that to periodically send
something in the response without affecting the code in the UI at all.

Fixes argoproj#5006
Fixes argoproj#5124

Signed-off-by: Daniel Herman <dherman@factset.com>
@middlestone

Strangely, I still have this problem on version 3.1.4, which already includes the patch. Is there anything I am missing?

@tsunamishaun

@middlestone I found that my Gloo gateway was blocking EventSource calls. I removed the timeout and the problem went away.
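
The comment above does not say which kind of timeout was removed, so what follows illustrates a general point and does not describe Gloo. Keepalive writes reset a proxy's idle timer, but they do nothing against a cap on total request duration. That could explain still seeing drops on a version that sends keepalives. The sketch below simulates a hypothetical proxy with both kinds of limit. Every name and number in it is an assumption chosen for the example.

```javascript
// Write times for a keepalive sent every intervalMs, up to and including horizonMs.
function everyMs(intervalMs, horizonMs) {
  const times = [];
  for (let t = intervalMs; t <= horizonMs; t += intervalMs) times.push(t);
  return times;
}

// Simulate a hypothetical proxy: return the time (ms after the connection
// opened) at which it would cut the stream, or null if the stream survives
// until horizonMs. Writes arriving exactly at a deadline count as in time.
function firstDropMs(writeTimesMs, horizonMs,
                     { idleTimeoutMs = Infinity, requestTimeoutMs = Infinity } = {}) {
  let lastActivity = 0; // the connection opens at t = 0
  const checkpoints = writeTimesMs
    .filter((t) => t <= horizonMs)
    .sort((a, b) => a - b);
  checkpoints.push(horizonMs); // also check the gap up to the end of the window
  for (const t of checkpoints) {
    // The stream dies at whichever limit is reached first.
    const deadline = Math.min(lastActivity + idleTimeoutMs, requestTimeoutMs);
    if (t > deadline) return deadline;
    lastActivity = t;
  }
  return null;
}

const horizon = 120000;
const keepalives = everyMs(15000, horizon);

// Quiet stream behind a 30 s idle timeout: dropped at 30 s.
console.log(firstDropMs([], horizon, { idleTimeoutMs: 30000 })); // 30000
// Same proxy, keepalive every 15 s: survives the whole window.
console.log(firstDropMs(keepalives, horizon, { idleTimeoutMs: 30000 })); // null
// Add a 60 s cap on total request time: keepalives no longer help.
console.log(firstDropMs(keepalives, horizon,
  { idleTimeoutMs: 30000, requestTimeoutMs: 60000 })); // 60000
```

If drops persist on a version that sends keepalives, a total request or stream duration limit on the gateway is one thing worth checking.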
