Make watches debuggable #49
lgtm, interesting that there aren't even logged errors for many of these things already.
pkg/stores/proxy/proxy_store.go
Outdated
@@ -327,6 +342,12 @@ func (s *Store) listAndWatch(apiOp *types.APIRequest, client dynamic.ResourceInt
	eg.Go(func() error {
		for event := range watcher.ResultChan() {
			if event.Type == watch.Error {
				if status, ok := event.Object.(*metav1.Status); ok {
					logrus.Debugf("event watch error: %s", status.Message)
					returnErr(errors.Wrapf(err, "event watch error: %+v", event), result)
I can reproduce this issue and tested it with your PR. I found that steve (your PR version) cannot return the error message `resource version is too old` to the UI WebSocket. That is because `err` is nil, which causes `errors.Wrapf` to return nil, so the message the UI expects in the event object is never returned.
In addition, note that rancher/apiserver already sends resource.stop event notifications. Given the current handling of resource.error and resource.stop in rancher/dashboard, sending a resource.error event as well may have other effects.
This PR will cause the backend to send both error and stop events, so the above issue may not be resolved.
LGTM
eeeb17b to 97f0eac
I've updated this to address the nil error issue that @Jason-ZW pointed out. I need more time to see how emitting both resource.stop and resource.error will behave with the UI. It may be that we should only emit one or the other, or change the UI to only respond to one.
@Jason-ZW I was able to test these changes again and found that, with the error message properly emitted, the dashboard successfully calls resyncWatch as you pointed out. This actually seems to fix the issue on its own, because the resource list is requested again and the new watch uses the correct resource version from that fresh list. So we might not need a dashboard change.
LGTM
@ly5156 Could you please confirm this information?
LGTM
LGTM
Add debug logs and send websocket messages when the watch is closed unexpectedly. In addition to being helpful for debugging, the dashboard specifically looks for a `resource.error` event containing the string "too old" in order to trigger the watch to be resynced with a refreshed revision number. Without this error returned, the dashboard will only see `resource.stop` events and never change its behavior, continuing to try to restart the watch with an incorrect resource version.
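The resync decision described above can be sketched as a small predicate. This is only an illustration written in Go for consistency with the rest of the PR; the real dashboard is a separate codebase, and the `message` type with its `Name` and `Data` fields is hypothetical, not the actual websocket payload shape.

```go
package main

import (
	"fmt"
	"strings"
)

// message is a hypothetical, simplified view of a websocket notification.
type message struct {
	Name string // event name, e.g. "resource.error" or "resource.stop"
	Data string // assumed field carrying the error text, if any
}

// shouldResync reports whether a client following the described behavior
// would refresh its list and restart the watch with a new resource version.
func shouldResync(m message) bool {
	return m.Name == "resource.error" && strings.Contains(m.Data, "too old")
}

func main() {
	fmt.Println(shouldResync(message{Name: "resource.error", Data: "event watch error: too old resource version"}))
	fmt.Println(shouldResync(message{Name: "resource.stop"}))
}
```

Under these assumptions, a lone `resource.stop` never triggers a resync, which matches the failure mode the description mentions.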
By default, a watch times out after 30 minutes. For debugging purposes, it's convenient if this can be decreased. Add an environment variable CATTLE_WATCH_TIMEOUT_SECONDS to enable setting the timeout in seconds.
97f0eac to 11fe86a
Updated one of the commit messages and the PR description, since we are now using this to fix the issue instead of just for debugging.
LGTM
Return websocket error and add logging for watches
Add debug logs and send websocket messages when the watch is closed
unexpectedly.
In addition to being helpful for debugging, the dashboard specifically
looks for a `resource.error` event containing the string "too old" in
order to trigger the watch to be resynced with a refreshed revision
number. Without this error returned, the dashboard will only see
`resource.stop` events and never change its behavior, continuing to try
to restart the watch with an incorrect resource version.
Make watch timeout configurable
By default, a watch times out after 30 minutes. For debugging purposes,
it's convenient if this can be decreased. Add an environment variable
CATTLE_WATCH_TIMEOUT_SECONDS to enable setting the timeout in seconds.
rancher/rancher#37627