
Moves sockets into the advanced worker #7760

Merged
37 commits merged into rancher:master on Jan 13, 2023

Conversation


@ghost ghost commented Dec 19, 2022

Summary

Fixes #7894


@ghost ghost requested a review from richard-cox December 19, 2022 15:31
@github-actions github-actions bot assigned ghost Dec 19, 2022
@ghost ghost marked this pull request as draft December 19, 2022 15:32
@ghost ghost marked this pull request as ready for review December 20, 2022 18:15

@richard-cox richard-cox left a comment

  • shell/plugins/steve/subscribe.js
    • Some docs/comments at the top are needed to explain that subscribe handles resource socket things either directly with sockets, or via a worker (basic and advanced)
    • 'ws.resource.change'(ctx, msg) contains an if block that has been removed in master - Allow for partial count updates via websocket #7647 (which removed special handling of COUNTS given changes in the backend). This PR needs aligning with that one
    • 'ws.resource.change'(ctx, msg) & 'ws.resource.remove'(ctx, msg) have additional code to handle resource aliases which is skipped if the advanced worker is used
    • It should be a little clearer which fns apply to which mode (socket, basic worker, advanced worker). I think this pretty much means the following:
      • Affects the following - rehydrateSubscribe, reconnectWatches, opened, closed, error, send, sendImmediate, ws.resource.stop, ws.resource.start, ws.resource.error
      • Could do this via docs, or by throwing an exception if the mode is advanced worker? (see the sketch at the end of this comment)
    • To confirm, we don't need to do anything special for resyncWatch?
    • How is ws.ping handled in the advanced worker? (update - doesn't look like this is passed through from the advanced worker)
  • I'm still not sure I understand why resourceWatcher needs to update state on an interval instead of directly when needed. It introduces a lot of complexity (and delay) and the benefit isn't clear.
  • I've only given this a brief test but hit two issues.
    • issue 1
      • In browser 1 bring up the Deployments list
      • In browser 2 go to the same list, edit a deployment and scale it up
      • In browser 1 the deployment stats should change, however they remain at the previous value
    • issue 2
      • Switching between clusters should close and re-open the socket, but this doesn't happen (probably due to unsubscribe not being wired in)

My TODOs

  • All advanced worker testing
  • Test normal socket behaviour plus basic worker
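
A minimal sketch of the "throw if advanced worker" guard suggested above, assuming a getter along the lines of the isAdvancedWorker flag mentioned in the commit notes below; the names and wiring are illustrative, not the final implementation:

```js
// Hypothetical guard for socket-only actions in shell/plugins/steve/subscribe.js.
// When the advanced worker owns the socket these actions shouldn't be reachable
// from the UI thread, so fail loudly rather than silently doing the wrong thing.
function assertNotAdvancedWorker(getters, actionName) {
  if (getters['isAdvancedWorker']) { // assumed getter name
    throw new Error(`'${ actionName }' should not be called when the advanced worker handles the socket`);
  }
}

// Example of wrapping one of the socket-only actions listed above
const actions = {
  reconnectWatches({ getters }) {
    assertNotAdvancedWorker(getters, 'reconnectWatches');
    // ...existing reconnect logic would run here...
  }
};
```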

Review threads (outdated, resolved):
  • shell/utils/resourceWatcher.js
  • shell/plugins/dashboard-store/mutations.js
@ghost ghost requested a review from richard-cox January 5, 2023 10:32
Sean and others added 22 commits January 5, 2023 10:21
- Fixes for switching cluster
  - includes using common getPerformanceSetting
  - avoid new code to unsub before socket disconnect
- handle `watch` `stop` requests
- lots of TODOs (questions, work, checks, tests, etc)
- use common
- isAdvancedWorker should only be true for cluster store
- advancedWorker to be wired in
- sockets use an incremented local var for id
- when we nuke the socket file within the worker this resets, so they all have an id of 1
- work around this by applying the unix time (see the sketch after this commit list)
- seen in dex cluster explorer dashboard
- count cards would be removed when a partial counts response was received
- getters canWatch and watchStarted are now worked around (they look at state in the UI thread)
  - we now don't call resource.stop or resource.start in subscription
- tidied up `forgetType`
- moved clearFromQueue from steve mutations into subscription mutations (better location)
- added and removed some TODOs
- fixed watch (stop handler should be higher up, include force watch handling)
- This change mutates input in a function, which is bad...
- but ensures the reference isn't broken, which is needed to maintain similar functionality as before
- Seen when creating or viewing clusters
- this probably would have been a problem if the worker wasn't nuked
- however, as the code's there, let's make it safe

Also added a `trace` feature in the advanced worker; will probably bring it out to other places as well
- Ensure that we handle the case where the advanced worker was created but the resource watcher wasn't
- ... but fix case where this was happening (aka ensure that a blank cluster context is ignored)
- This will help test normal flow (when advanced worker is disabled)
- Note - setting is now in a bag. This may help us better support further settings (enable client side pagination, etc)
  ```
  advancedWorker: { enabled: false },
  ```
…oard and events not re-subbed

- Ensure we block default handling of resource.start (keep state in resource watcher)
…ew file

- this avoids bringing class files into the worker
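
The socket id workaround noted above ("apply the unix time") might look something like the following hedged sketch; the class and variable names are illustrative, not the actual socket code:

```js
// Hypothetical sketch of the id workaround described in the commit notes above.
// A plain incrementing counter resets to 1 whenever the socket module is
// re-evaluated inside the worker, so two different sockets could share an id.
// Mixing in the unix time keeps ids unique across those resets.
let socketCounter = 0;

class Socket {
  constructor(url) {
    this.url = url;
    // e.g. counter 1 at 1673568000000 -> "1673568000000-1"
    this.id = `${ Date.now() }-${ ++socketCounter }`;
  }
}
```
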
@richard-cox richard-cox added this to the v2.7.next1 milestone Jan 10, 2023
- Remove `syncWatch` (do the watch/unwatch straight away)
- Test/Fix re-sub on reconnect
- Test/Fix growls on disconnect
richard-cox and others added 9 commits January 10, 2023 19:44
- including clean of workerQueue on resource.stop (this is SUPER defensive)
- ensure podsByNamespace is updated on batchChange

TODO
- the final update to the pod is ignored
- removing a namespace cleans the cache correctly
- disabling advanced worker still works
- ensure podsByNamespace is updated on batchChange

Tested / Fixed
- the final update to the pod is ignored
- removing a namespace cleans the cache correctly
- disabling advanced worker still works

richard-cox commented Jan 12, 2023

Summary of Dev Testing

To Do

Disconnect/Reconnect

  • Disconnect socket / socket down --> nav to a new resource (aka watch is queued). Socket comes up. Confirm a resource.start message is successfully received for the new resource
    • Couldn't test this; when navigating to a new page the initial request will fail, meaning the watch won't start. The watch can't be moved ahead of the request due to a missing revision

Complete

Stress Testing

  • System with lots of constantly churning resources

Flood Testing

  • more targeted and testable flooding of socket events (from local bash script)

Resources update

  • Browser 1 on a deployment detail page, pods list visible. Browser 2 --> delete a pod. Browser 1's pods list should show the change: the new pod appears and the old pod is removed, including any change of pod states
    • Same as above, but redeploy the deployment from the deployments list

Batch updates work as expected

  • Cluster dashboard, events list visible, total events is at 500 and no more
    • Note - if there aren't enough events to hit the events limit, edit configureType(EVENT, { limit: 500 });

Settings/Old Way

  • Disable the feature, ensure all other tests (or a subset of them) similarly work fine

Specific types of events

  • Cluster dashboard --> any other cluster page. Certain types should be forgotten, aka unwatch is called, resulting in a resource.stop socket message
    • Returning to the cluster dashboard should re-watch the same types, aka a resource.start socket message
  • Cluster dashboard, events list visible. Click on an event and confirm that a resource.stop for events is received and that the specific event id is resource.started. Return to the cluster dashboard and ensure the reverse happens
  • Before sending a socket message, bork the resourceType such that a resource.error 'failed to find schema' message is sent
  • Ensure resource.create, resource.change, resource.remove events are processed correctly for the schema related to a new CRD (note - change cronspec.type from string to integer to manufacture a resource.change... check if this results in an actual schema change and, if not, that the UI thread does not receive the update)
  • Ensure different types of watches work fine (resource.start, resource.stop, resource.change events all work as expected)

Disconnect/Reconnect

  • Browser 1 on the cluster dashboard. Browser 2 on the local cluster: redeploy the cattle-cluster-agent. The cluster socket should close, socket re-connect messages should appear in the console in reducing amounts, and when the agent comes back up the cluster socket should open and stay open. Note - there should be messages to re-sub to anything that was sub'd... however all will fail. This is the same as on master
    • Same as above, but kill the pod within the cattle-cluster-agent deployment

Regression

Unit tests

  • Write unit tests for the batchChanges function. This will be good going forward, and should also cover lots of weird edge cases (a sketch follows below)
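
A hedged sketch of what such a test might look like, assuming a Jest-style runner; the stand-in mutation below mirrors the `state.batchChanges[type][id] = change` shape mentioned later in this thread and is purely illustrative, not the real steve plugin code:

```js
// Illustrative only: a stand-in with the assumed batchChanges behaviour,
// used so the test shape is clear without depending on the real module.
const batchChanges = (state, changes) => {
  changes.forEach(({ type, id, ...change }) => {
    state.batchChanges[type] = state.batchChanges[type] || {};
    state.batchChanges[type][id] = change;
  });
};

describe('batchChanges', () => {
  it('keeps the latest change per type/id', () => {
    const state = { batchChanges: {} };

    batchChanges(state, [
      { type: 'pod', id: 'ns/a', revision: 1 },
      { type: 'pod', id: 'ns/a', revision: 2 }
    ]);

    expect(state.batchChanges.pod['ns/a'].revision).toBe(2);
  });

  it('tracks multiple types independently', () => {
    const state = { batchChanges: {} };

    batchChanges(state, [
      { type: 'pod', id: 'ns/a', revision: 1 },
      { type: 'event', id: 'ns/b', revision: 7 }
    ]);

    expect(Object.keys(state.batchChanges)).toEqual(['pod', 'event']);
  });
});
```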

richard-cox and others added 5 commits January 12, 2023 20:26
- batchChanges fixes
  - fix 'index is 0' issues (!/!!index; see the illustration after these commit notes)
  - only `set` if we have to
  - ensure we set the correct index after pushing to list
  - ensure map is updated after reducing list size with limit
- podsByNamespace fixes
  - ensure when we replace... we don't use the same referenced object
- general service resource fixes
  - ensure service's pods list stays up to date with store
- resourceCache - store the hash instead of the whole object. This means a longer load time but reduces the memory footprint
- resourceWatcher
  - don't re-sub on socket reconnect if watcher is in error
  - don't sub if watcher is in error
  - don't unwatch for 'failed to find schema' and 'too old' errors
    - unwatching would clear the error; we want to keep it to ensure we don't watch again
- Remove #5997 comments, follow on work #7917
Much more scope for some crazy content
- disable logging by default
- initWorker comes in too late to affect initial trace, so just rely on the `debug` to toggle at runtime
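
The "index is 0" fix mentioned in the commit notes above refers to the classic falsy-zero pitfall; a generic illustration (not the actual dashboard code):

```js
// Generic illustration of the falsy-zero pitfall called out above.
const list = [{ id: 'a' }, { id: 'b' }];
const index = list.findIndex((entry) => entry.id === 'a'); // 0

// Buggy: a match at position 0 is falsy, so this wrongly treats it as "not found"
if (!index) {
  // ...would take the "missing" branch even though the entry exists
}

// Correct: only -1 means "not found"
if (index === -1) {
  list.push({ id: 'a' });
} else {
  list[index] = { id: 'a', updated: true }; // safe update, even at index 0
}
```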

@richard-cox richard-cox left a comment

Given the below is done, this is good to squash merge

Work for next release in #7917

@ghost ghost merged commit ca1b810 into rancher:master Jan 13, 2023
n313893254 added a commit to harvester/dashboard that referenced this pull request Jan 28, 2023

catz commented Mar 21, 2024

Hello @richard-cox,

I've been checking the code in the advanced worker and noticed that there seems to be a missing piece for the COUNT entity (`state.batchChanges[type][id] = change;`).

I suppose this works fine for standard resources, but not for counts. We need to merge previously saved counts; otherwise, only the last counts will be included in the batch changes to send.

(Perhaps something similar to this could work: #7647)
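
A hedged sketch of the kind of merge described above, assuming COUNT changes carry a `counts` map keyed by resource type and that the COUNT type id is 'count' (both assumptions; the real shapes live in the steve plugin):

```js
// Hypothetical merge for COUNT batch changes, per the comment above.
// Instead of overwriting, fold new partial counts into the previously queued
// ones so earlier partial updates aren't lost before the batch is flushed.
function queueChange(state, { type, id, change }) {
  state.batchChanges[type] = state.batchChanges[type] || {};

  if (type === 'count') { // assumed type name for the COUNT entity
    const existing = state.batchChanges[type][id];

    state.batchChanges[type][id] = existing ? {
      ...existing,
      counts: { ...existing.counts, ...change.counts } // merge partial counts
    } : change;
  } else {
    // standard resources: the last change wins
    state.batchChanges[type][id] = change;
  }
}
```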

Development

Successfully merging this pull request may close these issues.

Performance: Move cluster based socket work into web worker (step 1)
2 participants