Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactors sockets and moves reconnect logic and error handling into it. #7459

Closed
wants to merge 1 commit into from
Closed

Conversation

ghost
Copy link

@ghost ghost commented Nov 15, 2022

Summary

Fixes #5997
Socket subscriptions did not maintain adequate information about their current state to self-heal from an error state. Most notably, there was no information as to which resourceVersion or the age of the resourceVersion being stored in vuex that would allow for a quick recovery if a subscription (watch) was stopped. Subscriptions (watches) were tracked in a store key "started" which didn't adequately describe what was being stored, and a subscription's (watch's) error state was being kept in the key "inError" which also doesn't adequately describe that the key is for use only for subscriptions (watches).

A resourceVersion remains valid to use to resubscribe only if it is less than 5 minutes old, these resourceVersions are not necessarily sequential so they can't be guessed. Current logic was to use the latest resourceVersion from a single resource in a collection rather than the collection resourceVersion, causing issues creating the new subscription (watch).

Occurred changes and/or fixed issues

Websocket subscriptions for resources are referred to as "watches" in the Kubernetes API Concepts documentation so they are referred to as such in code and in this description from here on out.

Moves logic for maintaining resource watches into the socket object itself and extends it so that the socket maintains those subscriptions by itself to consolidate logic and location of socket relevant values.

Technical notes summary

A lot of code removed from shell/plugins/steve/subscribe.js and replaced in shell/utils/socket.js which should hopefully make sockets more modular (this lays a decent amount of groundwork for advanced webworkers as well). Changes outside of these files should be relatively minimal and there shouldn't be much in the way of user impact aside from more stable websocket watches.

Areas or cases that should be tested

  • High socket traffic
  • Watches the receive frequent "resource.stop" messages should resubscribe by themselves
  • Switching between clusters should work properly
  • resource.error messages should be handled gracefully, if an error uses the word "too old" then that watch should stop (if not already) and rewatch.

Areas which could experience regressions

This is a major change to sockets and watches so those should be thoroughly tested

@github-actions github-actions bot assigned ghost Nov 15, 2022
@ghost ghost marked this pull request as draft November 15, 2022 16:32
@ghost ghost requested review from richard-cox, nwmac and mantis-toboggan-md November 16, 2022 13:48
@github-actions github-actions bot added this to the v2.7.1 milestone Nov 16, 2022
@ghost ghost marked this pull request as ready for review November 16, 2022 13:50
@catherineluse catherineluse self-requested a review November 22, 2022 01:51
@catherineluse
Copy link
Contributor

@Sean-McQ I'd like to review this PR, but I have found it difficult to reproduce the original issue using Colleen's steps. Do you have any tips to make it easier to repro the original issue?

@ghost ghost requested a review from rak-phillip November 22, 2022 17:31
Copy link
Member

@richard-cox richard-cox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't done a deep dive in to this yet, but making shell/utils/socket.js steve/norman specific breaks the other places it's used (epinio, harvester, streaming workload logs, shell to deployment). Easiest place to see is Workload list --> Deployment action menu --> Execute Shell

@ghost ghost closed this Nov 28, 2022
@ghost
Copy link
Author

ghost commented Nov 28, 2022

For the most part, Steve/Norman are pretty fundamental to the way we use websockets so I still think it makes sense for some of their requirements to be baked into them but I do see where/how that breaks shells atm. Gonna work on a fix for that and put in an update here.

@ghost ghost reopened this Nov 28, 2022
@ghost ghost requested a review from richard-cox November 30, 2022 03:15
@ghost
Copy link
Author

ghost commented Nov 30, 2022

As requested, new changes to the socket have been moved to a new class derived from the original socket class (ResourceWatcher) and the ResourceWatcher is in use by the subscribe Vuex methods.

Copy link
Member

@richard-cox richard-cox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't totally finished reviewing, but hit a functionally blocking bug (see comment in resourceWatcher send(data) { and will pick up the rest tomorrow

const INSECURE = 'ws://';
const SECURE = 'wss://';

const STATE_DISCONNECTED = 'disconnected';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 8 - 18 are core socket stuff already defined in the root socket class

export const EVENT_CONNECT_ERROR = 'connect_error';
export const EVENT_DISCONNECT_ERROR = 'disconnect_error';

export const NO_WATCH = 'NO_WATCH';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These and WATCHSTATUSES don't need to be exported

constructor(url, autoReconnect = true, frameTimeout = null, protocol = null, maxTries = null) {
super(url, autoReconnect, frameTimeout, protocol, maxTries);

this.maintenanceInterval = setInterval(() => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maintenanceInterval is never stopped. I need to investigate this more, but we should probably wire this in to the connect/disconnect


setUrl(url) {
this.baseUrl = self.location.origin + url.replace('subscribe', '');
if ( !url.match(/wss?:\/\//) ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could swap this section below out for just super.setUrl(url)

}
}

isConnecting() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed

}
}

this.dispatchEvent(new CustomEvent(EVENT_MESSAGE, { detail: event }));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line and lines 246, 247 and 248 are super._onmessage. Could that be used instead and placed either first or last? That would keep all common code in the root class and extended functionality in parent classe's method together

debug(state, on) {
state.debugSocket = on !== false;
},

resetSubscriptions(state) {
clear(state.started);
const watcheKeys = Object.keys(state.socket.watches);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this throws an exception when going from the Virtualization Management product to a cluster's explorer product

// refactoring this to use new specific methods for subscribing and unsubscribing to resource collections which will make recovering stopped subscriptions a little bit easier.
send(data) {
// we're going to use these values from the data we're sending to determine if we're subscribing or unsubscribing
if (data.resourceType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was never true for me (data is always a string), so this.watches is never initialised

@richard-cox
Copy link
Member

Closing in favour of #7719 / #7671. There'll be another PR up that covers moving rancher socket subscription management into a worker

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resource watch re-subscription uses wrong resource version and floods the k8s API server
2 participants