Malformed WorkflowTemplate crashes whole UI #3666

Closed
4 tasks done
SebastianGoeb opened this issue Aug 4, 2020 · 10 comments
@SebastianGoeb

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:
Deployed a malformed WorkflowTemplate with kubectl (really helm). The K8s API server still accepted it (it apparently fit the CRD). This broke the WorkflowTemplate UI list view (http://localhost:2746/workflow-templates), see screenshot below.

argo cli would have caught it:

# argo template create -n argo src/test/resources/workflow-template.yaml
020/08/04 17:32:17 Failed to parse workflow template: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal object into Go struct field Arguments.spec.templates.dag.tasks.arguments.parameters of type []v1alpha1.Parameter

but we will be deploying our workflow (-template) together with our application as a single helm chart because we want the declarative resource management, so argo cli isn't really an option.

What you expected to happen:
The WorkflowTemplate UI's list view should at least load and show me any WorkflowTemplates that are indeed syntactically correct. Put another way, I expect not to be prevented from working with my valid WorkflowTemplates just because yesterday's newly deployed WorkflowTemplate is broken.

Particularly troubling is that the entire UI, including sidebar, disappears. Since it's only an ajax api request (http://localhost:2746/api/v1/workflow-templates/argo) that fails, I would expect to at least get a sensible error within the WorkflowTemplate UI about what has gone wrong. Ideally I would even be allowed to see the valid WorkflowTemplates and be informed about the broken one separately, maybe with a warning icon on the list item. While this could presumably be fixed by making the api and frontend smarter, another 500 will crop up in the future and break the UI somewhere else, and I would love not to be kicked out of the UI entirely whenever the server fails to provide some data.

On that note, the error page seems to have a double-redirect which breaks the browser back button, so I can't even get back to the main page that I was on previously. Should I post a separate issue about that?

How to reproduce it (as minimally and precisely as possible):

  1. start with a blank Docker for Mac Kubernetes environment (haven't tried other setups, but I suspect any Kubernetes environment should work)
  2. set up Argo
kubectl create ns argo
kubectl -n argo apply -f https://raw.githubusercontent.com/argoproj/argo/stable/manifests/quick-start-minimal.yaml
kubectl port-forward -n argo deploy/argo-server web
  3. deploy a WorkflowTemplate that Kubernetes accepts (fits CRD spec), but is technically malformed in one place (maybe CRD spec isn't strict enough?)
# workflow-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: some-template
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: print
            template: print
            arguments:
              parameters:
                # this should be
                # parameters:
                #   - name: someParam
                #     value: "Hello, world!"
                someParam: "Hello, world!"
    - name: print
      script:
        image:  busybox
        command: [sh]
        script: |
          echo input was: {{inputs.parameters.someParam}}
kubectl -n argo apply -f workflow-template.yaml
  4. visit http://localhost:2746/workflow-templates
  5. watch the entire UI disappear behind an unfriendly API 500 error page:

Screenshot 2020-08-04 at 17 29 54

Anything else we need to know?:

Environment:
Docker for Mac 2.3.0.4 (46911) with Kubernetes enabled

  • Argo version:
argo: v2.9.4+20d2ace.dirty
  BuildDate: 2020-07-25T08:30:49Z
  GitCommit: 20d2ace3d5344db68ce1bc2a250bbb1ba9862613
  GitTreeState: dirty
  GitTag: v2.9.4
  GoVersion: go1.14.5
  Compiler: gc
  Platform: darwin/amd64
  • Kubernetes version :
clientVersion:
  buildDate: "2019-12-13T11:51:44Z"
  compiler: gc
  gitCommit: 70132b0f130acc0bed193d9ba59dd186f0e634cf
  gitTreeState: clean
  gitVersion: v1.17.0
  goVersion: go1.13.4
  major: "1"
  minor: "17"
  platform: darwin/amd64
serverVersion:
  buildDate: "2020-01-15T08:18:29Z"
  compiler: gc
  gitCommit: e7f962ba86f4ce7033828210ca3556393c377bcc
  gitTreeState: clean
  gitVersion: v1.16.6-beta.0
  goVersion: go1.13.5
  major: "1"
  minor: 16+
  platform: linux/amd64

Logs

  • server logs:

surprisingly, the server doesn't log the stacktrace seen in the UI.

time="2020-08-04T15:49:34Z" level=info authModes="[server client]" baseHRef=/ managedNamespace=argo namespace=argo secure=false
time="2020-08-04T15:49:34Z" level=warning msg="You are running in insecure mode. Learn how to enable transport layer security: https://github.com/argoproj/argo/blob/master/docs/tls.md"
time="2020-08-04T15:49:34Z" level=info msg="config map" name=workflow-controller-configmap
time="2020-08-04T15:49:34Z" level=info msg="SSO disabled"
time="2020-08-04T15:49:34Z" level=info msg="Starting Argo Server" version=v2.9.3
time="2020-08-04T15:49:34Z" level=info msg="Argo Server started successfully on http://localhost:2746"
  • workflow-controller logs:

but the workflow-controller does (partly at least)

...
E0804 15:50:42.669594       1 reflector.go:125] pkg/mod/k8s.io/client-go@v0.0.0-20191225075139-73fd2ddc9180/tools/cache/reflector.go:98: Failed to list *v1alpha1.WorkflowTemplate: v1alpha1.WorkflowTemplateList.Items: v1alpha1.WorkflowTemplates: v1alpha1.WorkflowTemplate.Spec: v1alpha1.WorkflowTemplateSpec.WorkflowSpec: Templates: []v1alpha1.Template: v1alpha1.Template.DAG: v1alpha1.DAGTemplate.Tasks: []v1alpha1.DAGTask: v1alpha1.DAGTask.Arguments: v1alpha1.Arguments.Parameters: []v1alpha1.Parameter: decode slice: expect [ or n, but found {, error found in #10 byte of ...|ameters":{"someParam|..., bigger context ...|es":[{"dag":{"tasks":[{"arguments":{"parameters":{"someParam":"Hello, world!"}},"name":"print","temp|...
...

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@simster7 simster7 self-assigned this Aug 4, 2020
@simster7 simster7 added the ui label Aug 4, 2020
@simster7
Member

simster7 commented Aug 4, 2020

Looking into this

@simster7
Member

simster7 commented Aug 4, 2020

Was able to reproduce

@simster7
Member

simster7 commented Aug 4, 2020

Seems like this error is propagated directly from the Kubernetes client when we make our List request here:

https://github.com/argoproj/argo/blob/dbb39368295cbc0ef886e78236338572c37607a1/server/workflowtemplate/workflow_template_server.go#L67

This might limit our ability to respond gracefully to this error, such as being able to display non-malformed Workflow templates or provide a more graceful explanation of the issue. Investigating more.
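One way to picture a more graceful response is to defer decoding of each list item, so that a single malformed item is reported on its own instead of failing the whole list. The sketch below is only an illustration under stated assumptions: the `Template` and `Parameter` types are simplified, hypothetical stand-ins for the Argo types, `decodeTolerant` is a made-up helper, and the sample JSON is hand-written. It does not reflect how the typed Kubernetes client used by the server actually returns data; the server would presumably need the items in an untyped form first.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Parameter is a simplified, hypothetical stand-in for v1alpha1.Parameter.
type Parameter struct {
	Name  string `json:"name"`
	Value string `json:"value,omitempty"`
}

// Template is a simplified, hypothetical stand-in for a WorkflowTemplate.
type Template struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Spec struct {
		Arguments struct {
			Parameters []Parameter `json:"parameters,omitempty"`
		} `json:"arguments"`
	} `json:"spec"`
}

// rawList keeps every item as raw JSON so each one can be decoded separately.
type rawList struct {
	Items []json.RawMessage `json:"items"`
}

// decodeTolerant returns the templates that decode cleanly, plus one
// error for each item that does not.
func decodeTolerant(data []byte) ([]Template, []error) {
	var list rawList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, []error{err}
	}
	var good []Template
	var bad []error
	for i, raw := range list.Items {
		var t Template
		if err := json.Unmarshal(raw, &t); err != nil {
			bad = append(bad, fmt.Errorf("item %d: %w", i, err))
			continue
		}
		good = append(good, t)
	}
	return good, bad
}

// sample holds one well-formed item and one with map-shaped parameters.
const sample = `{"items":[
  {"metadata":{"name":"good"},"spec":{"arguments":{"parameters":[{"name":"p","value":"v"}]}}},
  {"metadata":{"name":"some-template"},"spec":{"arguments":{"parameters":{"someParam":"Hello, world!"}}}}
]}`

func main() {
	good, bad := decodeTolerant([]byte(sample))
	for _, t := range good {
		fmt.Println("ok:", t.Metadata.Name)
	}
	for _, err := range bad {
		fmt.Println("skipped:", err)
	}
}
```

With something along these lines, a list endpoint could return the valid templates together with a list of warnings, which is roughly the behaviour asked for in the issue description.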

@alexec
Contributor

alexec commented Aug 4, 2020

The UI fails to this error page. I think this is a known issue, see #3454.

@alexec
Contributor

alexec commented Aug 4, 2020

If the 500 error originates in the Kubernetes API, then the 500 is probably not fixable. I do think we should be more graceful in handling these errors, i.e. we shouldn't dump the user to an unusable UI.

@SebastianGoeb
Author

Haven't used the k8s go-client myself, so I don't know if the CRD yaml is generated from go classes or something, but is it possible to make the CRD spec stricter, so it isn't possible to submit a malformed workflow template in the first place? It seems to me this particular 500 arises because what the K8s API server accepts and what the go client accepts, by way of its json parser, don't quite match. To be honest, I wouldn't expect to be allowed to insert someParam: "Hello, world!" under parameters in the first place, if the spec doesn't say that's a valid key.
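The mismatch can be reproduced outside of Kubernetes with Go's standard encoding/json package. This is a minimal sketch, not the Argo code: `Parameter` and `Arguments` are simplified, hypothetical stand-ins for the v1alpha1 types, and `decodeArguments` is a made-up helper. It shows that a typed decoder rejects a map where it expects a list, while a server that does not validate against a strict schema would store the same document as-is.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Parameter is a simplified, hypothetical stand-in for v1alpha1.Parameter.
type Parameter struct {
	Name  string `json:"name"`
	Value string `json:"value,omitempty"`
}

// Arguments is a simplified, hypothetical stand-in for v1alpha1.Arguments.
type Arguments struct {
	Parameters []Parameter `json:"parameters,omitempty"`
}

// decodeArguments decodes raw JSON into the typed Arguments struct.
func decodeArguments(raw string) (Arguments, error) {
	var args Arguments
	err := json.Unmarshal([]byte(raw), &args)
	return args, err
}

func main() {
	// The shape from this issue: "parameters" is a map, not a list.
	_, err := decodeArguments(`{"parameters":{"someParam":"Hello, world!"}}`)
	fmt.Println("malformed:", err)

	// The intended shape: a list of name/value pairs.
	args, err := decodeArguments(`{"parameters":[{"name":"someParam","value":"Hello, world!"}]}`)
	fmt.Println("well-formed:", args.Parameters, err)
}
```

The first call returns a type error similar in spirit to the one the argo cli printed, and the second decodes into a single parameter.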

@alexec
Contributor

alexec commented Aug 4, 2020

@SebastianGoeb in theory, yes; in practice, no.

We provide CRDs with validation, but we don't use them by default.

Why?

  1. We need to be able to run multiple installations of Argo Workflows in a single cluster.
  2. CRD validation cannot pick up on many common errors (e.g. missing template names, invalid parameters).

Essentially, if you need a robust system, you must lint your manifests.

The argo template lint commands allow you to do this.

@SebastianGoeb
Author

If it isn't fixable by reconfiguration, maybe it's something that should be bumped up to the go-client itself? Because I would expect it to deserialize anything that the server accepts, i.e. anything that is allowed by the CRD's OpenAPI spec. Or, conversely, if it refuses to deserialize a CR, then I would expect the server to refuse it too.

@SebastianGoeb
Author

Ah, I see. That's a shame. But maybe that could be built into our CI workflow.

@alexec
Contributor

alexec commented Sep 2, 2020

Available for testing in v2.11.0-rc1.
