
Moves sockets into the advanced worker #7760

Merged
37 commits merged into rancher:master on Jan 13, 2023

Conversation


@ghost ghost commented Dec 19, 2022

Summary

Fixes #7894


@ghost ghost requested a review from richard-cox December 19, 2022 15:31
@github-actions github-actions bot assigned ghost Dec 19, 2022
@ghost ghost marked this pull request as draft December 19, 2022 15:32
@ghost ghost marked this pull request as ready for review December 20, 2022 18:15

@richard-cox richard-cox left a comment

  • shell/plugins/steve/subscribe.js
    • Some docs/comments at the top are needed to explain that subscribe handles resource socket things either directly with sockets, or via a worker (basic and advanced)
    • 'ws.resource.change'(ctx, msg) contains an if block that has been removed in master - Allow for partial count updates via websocket #7647 (which removed special handling of COUNTS given changes in the backend). This PR needs aligning with that one
    • 'ws.resource.change'(ctx, msg) & 'ws.resource.remove'(ctx, msg) have additional code to handle resource aliases which is skipped if the advanced worker is used
    • It should be a little clearer which fns apply to which mode (socket, basic worker, advanced worker). I think this pretty much means the following:
      • Affects the following - rehydrateSubscribe, reconnectWatches, opened, closed, error, send, sendImmediate, ws.resource.stop, ws.resource.start, ws.resource.error
      • Could do this via docs, or by throwing an exception if the mode is advanced worker? (see the sketch at the end of this comment)
    • To confirm, we don't need to do anything special for resyncWatch?
    • How is ws.ping handled in the advanced worker? (update - doesn't look like this is passed through from the advanced worker)
  • I'm still not sure I understand why resourceWatcher needs to update state on an interval instead of directly when needed. It introduces a lot of complexity (and delay) and the benefit isn't clear.
  • I've only given this a brief test but hit two issues.
    • issue 1
      • In browser 1 bring up the Deployments list
      • In browser 2 go to the same list, edit a deployment and scale it up
      • In browser 1 the deployment stats should change, however they remain at the previous value
    • issue 2
      • Switching between clusters should close and re-open the socket, but this doesn't happen (probably due to unsubscribe not being wired in)

My TODOs

  • All advanced worker testing
  • Test normal socket behaviour plus basic worker
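
A minimal sketch of the "throw if advanced worker" guard suggested above, assuming a getter along the lines of the isAdvancedWorker flag mentioned in the commit notes below; the names and wiring are illustrative, not the final implementation:

```js
// Hypothetical guard for socket-only actions in shell/plugins/steve/subscribe.js.
// When the advanced worker owns the socket these actions shouldn't be reachable
// from the UI thread, so fail loudly rather than silently doing the wrong thing.
function assertNotAdvancedWorker(getters, actionName) {
  if (getters['isAdvancedWorker']) { // assumed getter name
    throw new Error(`'${ actionName }' should not be called when the advanced worker handles the socket`);
  }
}

// Example of wrapping one of the socket-only actions listed above
const actions = {
  reconnectWatches({ getters }) {
    assertNotAdvancedWorker(getters, 'reconnectWatches');
    // ...existing reconnect logic would run here...
  }
};
```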

Review threads (outdated, resolved):
  • shell/utils/resourceWatcher.js
  • shell/plugins/dashboard-store/mutations.js
@ghost ghost requested a review from richard-cox January 5, 2023 10:32
Sean and others added 22 commits January 5, 2023 10:21
- Fixes for switching cluster
  - includes using common getPerformanceSetting
  - avoid new code to unsub before socket disconnect
- handle `watch` `stop` requests
- lots of TODOs (questions, work, checks, tests, etc)
- use common
- isAdvancedWorker should only be true for cluster store
- advancedWorker to be wired in
- sockets use an incremented local var for id
- when we nuke the socket file within the worker this resets, so they all have an id of 1
- work around this by applying the unix time (see the sketch after this commit list)
- seen in dex cluster explorer dashboard
- count cards would be removed when a partial counts response was received
- getters canWatch and watchStarted are now worked around (they look at state in the UI thread)
  - we now don't call resource.stop or resource.start in subscription
- tidied up `forgetType`
- moved clearFromQueue from steve mutations into subscription mutations (better location)
- added and removed some TODOs
- fixed watch (stop handler should be higher up, include force watch handling)
- This change mutates input in a function, which is bad...
- but ensures the reference isn't broken, which is needed to maintain similar functionality as before
- Seen when creating or viewing clusters
- this probably would have been a problem if the worker wasn't nuked
- however, as the code's there, let's make it safe

Also added a `trace` feature in the advanced worker; will probably bring it out to other places as well
- Ensure that we handle the case where the advanced worker was created but the resource watcher wasn't
- ... but fix case where this was happening (aka ensure that a blank cluster context is ignored)
- This will help test normal flow (when advanced worker is disabled)
- Note - setting is now in a bag. This may help us better support further settings (enable client side pagination, etc)
  ```
  advancedWorker: { enabled: false },
  ```
…oard and events not re-subbed

- Ensure we block default handling of resource.start (keep state in resource watcher)
…ew file

- this avoids bringing class files into the worker
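
The socket id workaround noted above ("apply the unix time") might look something like the following hedged sketch; the class and variable names are illustrative, not the actual socket code:

```js
// Hypothetical sketch of the id workaround described in the commit notes above.
// A plain incrementing counter resets to 1 whenever the socket module is
// re-evaluated inside the worker, so two different sockets could share an id.
// Mixing in the unix time keeps ids unique across those resets.
let socketCounter = 0;

class Socket {
  constructor(url) {
    this.url = url;
    // e.g. counter 1 at 1673568000000 -> "1673568000000-1"
    this.id = `${ Date.now() }-${ ++socketCounter }`;
  }
}
```
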
@richard-cox richard-cox added this to the v2.7.next1 milestone Jan 10, 2023
- Remove `syncWatch` (do the watch/unwatch straight away)
- Test/Fix re-sub on reconnect
- Test/Fix growls on disconnect
richard-cox and others added 9 commits January 10, 2023 19:44
- including clean of workerQueue on resource.stop (this is SUPER defensive)
- ensure podsByNamespace is updated on batchChange

TODO
- the final update to the pod is ignored
- removing a namespace cleans the cache correctly
- disabling advanced worker still works
- ensure podsByNamespace is updated on batchChange

Tested / Fixed
- the final update to the pod is ignored
- removing a namespace cleans the cache correctly
- disabling advanced worker still works

richard-cox commented Jan 12, 2023

Summary of Dev Testing

To Do

Disconnect/Reconnect

  • Disconnect socket / socket down --> nav to a new resource (aka watch is queued). Socket comes up. Confirm a resource.start message is successfully received for the new resource
    • Couldn't test this; when navigating to a new page the initial request will fail, meaning the watch won't start. The watch can't be moved ahead of the request due to a missing revision

Complete

Stress Testing

  • System with lots of constantly churning resources

Flood Testing

  • more targeted and testable flooding of socket events (from local bash script)

Resources update

  • Browser 1 on a deployment detail page, pods list visible. Browser 2 --> delete a pod. Browser 1's pods list should show the change: the new pod appears and the old pod is removed, including any change of pod states
    • Same as above, but redeploy the deployment from the deployments list

Batch updates work as expected

  • Cluster dashboard, events list visible, total events is at 500 and no more
    • Note - if there aren't enough events to hit the events limit, edit configureType(EVENT, { limit: 500 });

Settings/Old Way

  • Disable the feature, ensure all other tests (or a subset of them) similarly work fine

Specific types of events

  • Cluster dashboard --> any other cluster page. Certain types should be forgotten, aka unwatch is called, resulting in a resource.stop socket message
    • Returning to the cluster dashboard should re-watch the same types, aka a resource.start socket message
  • Cluster dashboard, events list visible. Click on an event and confirm that a resource.stop for events is received and that the specific event id is resource.started. Return to the cluster dashboard and ensure the reverse happens
  • Before sending a socket message, bork the resourceType such that a resource.error 'failed to find schema' message is sent
  • Ensure resource.create, resource.change, resource.remove events are processed correctly for the schema related to a new CRD (note - change cronspec.type from string to integer to manufacture a resource.change... check if this results in an actual schema change and, if not, that the UI thread does not receive the update)
  • Ensure different types of watches work fine (resource.start, resource.stop, resource.change events all work as expected)

Disconnect/Reconnect

  • Browser 1 on the cluster dashboard. Browser 2 on the local cluster: redeploy the cattle-cluster-agent. The cluster socket should close, socket re-connect messages should appear in the console in reducing amounts, and when the agent comes back up the cluster socket should open and stay open. Note - there should be messages to re-sub to anything that was sub'd... however all will fail. This is the same as on master
    • Same as above, but kill the pod within the cattle-cluster-agent deployment

Regression

Unit tests

  • Write unit tests for the batchChanges function. This will be good going forward, and should also cover lots of weird edge cases (a sketch follows below)
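
A hedged sketch of what such a test might look like, assuming a Jest-style runner; the stand-in mutation below mirrors the `state.batchChanges[type][id] = change` shape mentioned later in this thread and is purely illustrative, not the real steve plugin code:

```js
// Illustrative only: a stand-in with the assumed batchChanges behaviour,
// used so the test shape is clear without depending on the real module.
const batchChanges = (state, changes) => {
  changes.forEach(({ type, id, ...change }) => {
    state.batchChanges[type] = state.batchChanges[type] || {};
    state.batchChanges[type][id] = change;
  });
};

describe('batchChanges', () => {
  it('keeps the latest change per type/id', () => {
    const state = { batchChanges: {} };

    batchChanges(state, [
      { type: 'pod', id: 'ns/a', revision: 1 },
      { type: 'pod', id: 'ns/a', revision: 2 }
    ]);

    expect(state.batchChanges.pod['ns/a'].revision).toBe(2);
  });

  it('tracks multiple types independently', () => {
    const state = { batchChanges: {} };

    batchChanges(state, [
      { type: 'pod', id: 'ns/a', revision: 1 },
      { type: 'event', id: 'ns/b', revision: 7 }
    ]);

    expect(Object.keys(state.batchChanges)).toEqual(['pod', 'event']);
  });
});
```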

richard-cox and others added 5 commits January 12, 2023 20:26
- batchChanges fixes
  - fix 'index is 0' issues (!/!!index; see the illustration after these commit notes)
  - only `set` if we have to
  - ensure we set the correct index after pushing to list
  - ensure map is updated after reducing list size with limit
- podsByNamespace fixes
  - ensure when we replace... we don't use the same referenced object
- general service resource fixes
  - ensure service's pods list stays up to date with store
- resourceCache - store the hash instead of the whole object. This means a longer load time but reduces the memory footprint
- resourceWatcher
  - don't re-sub on socket reconnect if watcher is in error
  - don't sub if watcher is in error
  - don't unwatch for 'failed to find schema' and 'too old' errors
    - unwatching would clear the error; we want to keep it to ensure we don't watch again
- Remove #5997 comments, follow on work #7917
Much more scope for some crazy content
- disable logging by default
- initWorker comes in too late to affect initial trace, so just rely on the `debug` to toggle at runtime
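
The "index is 0" fix mentioned in the commit notes above refers to the classic falsy-zero pitfall; a generic illustration (not the actual dashboard code):

```js
// Generic illustration of the falsy-zero pitfall called out above.
const list = [{ id: 'a' }, { id: 'b' }];
const index = list.findIndex((entry) => entry.id === 'a'); // 0

// Buggy: a match at position 0 is falsy, so this wrongly treats it as "not found"
if (!index) {
  // ...would take the "missing" branch even though the entry exists
}

// Correct: only -1 means "not found"
if (index === -1) {
  list.push({ id: 'a' });
} else {
  list[index] = { id: 'a', updated: true }; // safe update, even at index 0
}
```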

@richard-cox richard-cox left a comment

Given the below is done, this is good to squash merge

Work for next release in #7917

@ghost ghost merged commit ca1b810 into rancher:master Jan 13, 2023
n313893254 added a commit to harvester/dashboard that referenced this pull request Jan 28, 2023

catz commented Mar 21, 2024

Hello @richard-cox,

I've been checking the code in the advanced worker and noticed that there seems to be a missing piece for the COUNT entity (`state.batchChanges[type][id] = change;`).

I suppose this works fine for standard resources, but not for counts. We need to merge previously saved counts; otherwise, only the last counts will be included in the batch changes to send.

(Perhaps something similar to this could work: #7647)
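
A hedged sketch of the kind of merge described above, assuming COUNT changes carry a `counts` map keyed by resource type and that the COUNT type id is 'count' (both assumptions; the real shapes live in the steve plugin):

```js
// Hypothetical merge for COUNT batch changes, per the comment above.
// Instead of overwriting, fold new partial counts into the previously queued
// ones so earlier partial updates aren't lost before the batch is flushed.
function queueChange(state, { type, id, change }) {
  state.batchChanges[type] = state.batchChanges[type] || {};

  if (type === 'count') { // assumed type name for the COUNT entity
    const existing = state.batchChanges[type][id];

    state.batchChanges[type][id] = existing ? {
      ...existing,
      counts: { ...existing.counts, ...change.counts } // merge partial counts
    } : change;
  } else {
    // standard resources: the last change wins
    state.batchChanges[type][id] = change;
  }
}
```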

Development

Successfully merging this pull request may close these issues.

Performance: Move cluster based socket work into web worker (step 1)
2 participants