Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WatchNames: return errors via WebSocket #141

Merged
merged 2 commits into from
Mar 1, 2024

Conversation

moio
Copy link
Contributor

@moio moio commented Dec 22, 2023

Align behavior with plain Watch, which also does the same. UI code actually expects the same behavior in both cases.

Fixes rancher/rancher#41809

Align behavior with plain Watch, which also does the same.

Fixes rancher/rancher#41809

Signed-off-by: Silvio Moioli <silvio@moioli.net>
Copy link
Contributor

@rmweir rmweir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@MbolotSuse MbolotSuse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment on the safety of returning the error event without doing any processing of the object that it contains.

I also have an overall question on the solution: my first glance would say that the ideal solution would be to immediately return a resource.error event when we are given an invalid resource version, rather than a resource.start, then a resource.error, then a resource.stop. I think that the idea here is just to align this with standard watch rather than re-evaluate the API behavior, but wanted to ask just to be sure - would you mind elaborating on why you choose the start -> error -> stop flow here?

pkg/stores/proxy/proxy_store.go Show resolved Hide resolved
@moio
Copy link
Contributor Author

moio commented Jan 9, 2024

I also have an overall question on the solution: my first glance would say that the ideal solution would be to immediately return a resource.error event when we are given an invalid resource version, rather than a resource.start, then a resource.error, then a resource.stop. I think that the idea here is just to align this with standard watch rather than re-evaluate the API behavior, but wanted to ask just to be sure - would you mind elaborating on why you choose the start -> error -> stop flow here?

As you mentioned, here I am proposing to align behavior with the Watch function. UI code does not differentiate the two cases, as from the UI perspective a watch call is a watch call - only Steve makes a distinction between Watch and WatchNames internally depending on permissions.

I think you are right in pointing out this particular flow could be optimized to be less chatty, but that would need to be done for Watch and WatchNames, aligning client UI code in the process. To me, that's a valid optimization/refactoring but it is a separate effort/PR from just fixing the bug. It should be done later, separately, and subject to an evaluation of importance - as this is could be a pretty rare case.

@moio
Copy link
Contributor Author

moio commented Jan 10, 2024

@MbolotSuse following our discussions can you approve and merge this?

Relatedly: user reported success with the patch in the meantime.

@MbolotSuse
Copy link
Contributor

@moio I've resolved the outstanding thread on the functionality (and I think that we can merge this PR without further changes there), however, I'm not sure that the tests are working. Would you mind giving that a look?

Signed-off-by: Silvio Moioli <silvio@moioli.net>
@moio moio force-pushed the watchnames_propagate_errors branch from 2365164 to d138622 Compare January 23, 2024 10:26
@moio moio requested a review from MbolotSuse January 23, 2024 10:27
@moio
Copy link
Contributor Author

moio commented Jan 23, 2024

@MbolotSuse please review once again and, if all looks OK,merge this directly. Thanks in advance

@moio
Copy link
Contributor Author

moio commented Feb 5, 2024

Partially related, about your question which was left pending:

listAndWatch currently has a case that looks like it functions much the same as the case that you fixed does (i.e. debug the error and don't send on the channel). Would you mind seeing if you need to fix that as well?

I took a deeper look. To recap, listAndWatch is the function producing events, including errors, consumed by WatchNames, the the subject of this PR. WatchNames, in turn, forwards events to the UI and this PR is about forwarding errors to the UI too, so that the UI can act on them. Watch does the same.

Therefore when listAndWatch receives an error (from the Kubernetes watch operation) it has two options:

  • log and ignore or
  • send it to the UI, by forwarding it WatchNames/Watch

At this point listAndWatch is forwarding all errors that can be type-asserted to *v1.Status - which is recommended but not guaranteed by the API:

https://github.com/kubernetes/apimachinery/blob/f14778da5523847e4c07346e3161a4b4f6c9186e/pkg/watch/watch.go#L67-L69

I think that approach makes sense - if Steve is at least able to extract a string or a code from the error then there is a chance the UI can do something about it, therefore forward (eg. the "resource too old" situation that kicked off this PR).

For an error whose type is completely unknown, there is no way to tell if it makes sense to forward to the UI or not and even if it does, there is no way to extract a meaningful string or code that the UI could check and react on a priori. It will need intervention from a Steve programmer anyway to add code for a specific type when such a case is found, and the best way to serve that future programmer is to have a reliable log.

What could be argued, IMO, is to raise the log level from debug to info or even error to increase the visibility of those cases, if they are important at all. ATM I do not have reasons to change that, but others might.

Please help me double-check the reasoning above. If it makes sense, I will not open a follow-up PR.

@MbolotSuse MbolotSuse removed the request for review from KevinJoiner February 5, 2024 15:59
moio added a commit to rancher/rancher that referenced this pull request Mar 1, 2024
includes rancher/steve#141

Signed-off-by: Silvio Moioli <silvio@moioli.net>
@MbolotSuse MbolotSuse merged commit 7913f27 into rancher:master Mar 1, 2024
1 check passed
@moio moio deleted the watchnames_propagate_errors branch March 4, 2024 13:00
moio added a commit to moio/rancher that referenced this pull request Mar 5, 2024
Signed-off-by: Silvio Moioli <silvio@moioli.net>
moio added a commit to rancher/rancher that referenced this pull request Mar 7, 2024
Signed-off-by: Silvio Moioli <silvio@moioli.net>
moio added a commit that referenced this pull request Jun 3, 2024
Signed-off-by: Silvio Moioli <silvio@moioli.net>
moio added a commit that referenced this pull request Jun 3, 2024
Signed-off-by: Silvio Moioli <silvio@moioli.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SURE-7122] Excessive WebSocket activity when watching resources with permission by name
3 participants