Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make watches debuggable #49

Merged
merged 2 commits into from
Jun 14, 2022
Merged

Conversation

cmurphy
Copy link
Contributor

@cmurphy cmurphy commented May 20, 2022

Return websocket error and add logging for watches

Add debug logs and send websocket messages when the watch is closed
unexpectedly.

In addition to being helpful for debugging, the dashboard specifically
looks for a resource.error event containing the string "too old" in
order to trigger the watch to be resynced with a refreshed revision
number. Without this error returned, the dashboard will only see
resource.stop events and never change its behavior, continuing to try
to restart the watch with an incorrect resource version.

Make watch timeout configurable

By default, a watch times out after 30 minutes. For debugging purposes,
it's convenient if this can be decreased. Add an environment variable
CATTLE_WATCH_TIMEOUT_SECONDS to enable setting the timeout in seconds.

rancher/rancher#37627

cbron
cbron previously approved these changes May 20, 2022
Copy link
Contributor

@cbron cbron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, interesting that there aren't even logged errors for many of these things already.

@@ -327,6 +342,12 @@ func (s *Store) listAndWatch(apiOp *types.APIRequest, client dynamic.ResourceInt
eg.Go(func() error {
for event := range watcher.ResultChan() {
if event.Type == watch.Error {
if status, ok := event.Object.(*metav1.Status); ok {
logrus.Debugf("event watch error: %s", status.Message)
returnErr(errors.Wrapf(err, "event watch error: %+v", event), result)
Copy link

@Jason-ZW Jason-ZW May 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can reproduce this issue and tested with your PR, found that steve(your PR version) can not return the error message: resource version is too old for UI WebSocket, that because err is nil causing errors.Wrapf to return nil, The message expected by the UI in the event object cannot be returned because errors.Wrapf(err) is nil.

Copy link

@Jason-ZW Jason-ZW May 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition, we need to pay attention that rancher/apiserver will send resource.stop event notifications, combined with the current logic of rancher/dashboard resource.error & rancher/dashboard resource.stop, possibly sending resource.error event will have other effects.

@ly5156
Copy link

ly5156 commented May 24, 2022

This pr will cause the backend to send error and stop events

  1. As the UI will re-watch the resource (possibly with the wrong resourceVersion) 5 seconds after the stop event is fired: https://github.com/rancher/dashboard/blob/master/shell/plugins/steve/subscribe.js#L460

  2. It will also determine if there is a too old string in the message in the error event, and if so it will resyncWatch the resource again: https://github.com/rancher/dashboard/blob/master/shell/plugins/steve/subscribe.js#L437

So the above issue may not be resolved

rmweir
rmweir previously approved these changes May 25, 2022
Copy link
Contributor

@rmweir rmweir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmurphy cmurphy dismissed stale reviews from rmweir and cbron via 97f0eac June 2, 2022 00:16
@cmurphy cmurphy force-pushed the watch-logging-head branch from eeeb17b to 97f0eac Compare June 2, 2022 00:16
@cmurphy
Copy link
Contributor Author

cmurphy commented Jun 2, 2022

I've updated this to address the nil error issue that @Jason-ZW pointed out. I need more time to see how emitting both resource.stop and resource.error will behave with the UI. It may be that we should only emit one or the other, or change the UI to only respond to one.

@cmurphy
Copy link
Contributor Author

cmurphy commented Jun 2, 2022

@Jason-ZW I was able to test out these changes again and found that with the error message properly emitted the dashboard successfully calls resyncWatch like you pointed out, and this actually seems to fix the issue on its own because the resource list is called again and the new watch uses the correct resource version from that fresh list. So it's possible we might not need a dashboard change.

@cmurphy cmurphy requested review from cbron and rmweir June 2, 2022 22:08
rmweir
rmweir previously approved these changes Jun 2, 2022
Copy link
Contributor

@rmweir rmweir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Jason-ZW
Copy link

Jason-ZW commented Jun 5, 2022

@Jason-ZW I was able to test out these changes again and found that with the error message properly emitted the dashboard successfully calls resyncWatch like you pointed out, and this actually seems to fix the issue on its own because the resource list is called again and the new watch uses the correct resource version from that fresh list. So it's possible we might not need a dashboard change.

@ly5156 Cloud you please confirm this information?

@ly5156
Copy link

ly5156 commented Jun 6, 2022

LGTM

1 similar comment
@Jason-ZW
Copy link

Jason-ZW commented Jun 6, 2022

LGTM

cbron
cbron previously approved these changes Jun 6, 2022
cmurphy added 2 commits June 13, 2022 15:54
Add debug logs and send websocket messages when the watch is closed
unexpectedly.

In addition to being helpful for debugging, the dashboard specifically
looks for a `resource.error` event containing the string "too old" in
order to trigger the watch to be resynced with a refreshed revision
number.  Without this error returned, the dashboard will only see
`resource.stop` events and never change its behavior, continuing to try
to restart the watch with an incorrect resource version.
By default, a watch times out after 30 minutes. For debugging purposes,
it's convenient if this can be decreased. Add an environment variable
CATTLE_WATCH_TIMEOUT_SECONDS to enable setting the timeout in seconds.
@cmurphy cmurphy dismissed stale reviews from cbron and rmweir via 11fe86a June 13, 2022 23:02
@cmurphy cmurphy force-pushed the watch-logging-head branch from 97f0eac to 11fe86a Compare June 13, 2022 23:02
@cmurphy
Copy link
Contributor Author

cmurphy commented Jun 13, 2022

Updated one of the commit messages and PR description since we are now using this to fix the issue instead of just for debugging.

@cmurphy cmurphy requested review from cbron and rmweir June 13, 2022 23:03
Copy link
Contributor

@rmweir rmweir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmurphy cmurphy merged commit 22b03d3 into rancher:master Jun 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants