Fix stale management cluster resources #8224

Merged
merged 3 commits on Feb 23, 2023

Conversation

@richard-cox (Member) commented Feb 21, 2023

Summary

Fixes #7819
Fixes #7815

Occurred changes and/or fixed issues

There are a number of places where screen content goes 'stale' when creating an RKE2 DO cluster, adding/removing machine pools (deployments) and scaling a pool (deployment) up/down. Stale content covers

  • Cluster overall State
  • Freshly created pool does not show new machine
  • Machine from a removed pool remained
  • Pool's Machine summary bar graph not updating
  • Pool scaled down still shows removed machine
  • Created cluster shows empty pools without deployments

The only one of these that can be reproduced consistently is to scale down the last remaining machine in a second pool (the deleted machine stays in the second pool), then, from the same cluster state, edit the cluster to have one machine in the second pool and navigate quickly to the cluster detail page (the new machine does not show in pool two). For the others, it's just about getting lucky/unlucky when creating/editing/removing clusters and pools, and scaling pools.

From what I can tell there are three causes:

  1. We don't receive the relevant resource.create, resource.change, resource.remove message
    • Probably missed due to the lag between resource.stop and re-subscribing
  2. We do receive the relevant resource.create, resource.change, resource.remove message, but the UI doesn't update
    • Mystery Vue reactivity issue
  3. There are errors in shell/models/provisioning.cattle.io.cluster.js get unavailableMachines
    • Sometimes the machine we iterate over does not contain a status or status.condition
    • Must be 'crap in' somewhere

Fixes

  1. Missing socket messages - e73c55c
    • This is all about reducing the time between the backend telling us to resource.stop a resource sub and us trying to start it again.
    • resource.stop is caused by a number of things; more detail in the commit message
    • the new fix means we no longer wait 5 seconds; we re-sub straight away with whatever revision we have in the store
      • if the revision isn't too old (the stop probably came from a change of permission) it should mean we get all required resource change messages promptly
      • if the revision is too old (the stop probably came from the 30 min socket death) it means we'll re-fetch the entire state and re-watch from that revision
    • these changes were possible because the backend now gives us the correct `too old` message
  2. UI doesn't update following socket messages - 6ce6632
    • I think this is a machine/node list issue: we were adding/removing a fake entry to ensure a pool group was shown. This didn't change the footprint of the list (real one removed --> fake one added, fake one removed --> real one added), so a cached, stale set of rows was shown
  3. shell/models/provisioning.cattle.io.cluster.js get unavailableMachines - 90ccf3f
    • Make these null safe (a hedged sketch follows this list)
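A minimal sketch of the kind of null-safe guard fix 3 describes. The getter name comes from the PR; the machine/condition shape shown here is an assumption, not the actual getter body.

```js
// shell/models/provisioning.cattle.io.cluster.js (illustrative sketch only)
get unavailableMachines() {
  // Machines that have just arrived over the socket may not have a status
  // or status.conditions yet, so guard each step of the access.
  return (this.machines || []).filter((machine) => {
    const conditions = machine?.status?.conditions || [];

    return conditions.some((c) => c.type === 'Ready' && c.status !== 'True');
  });
}
```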

Technical notes summary

Workings out - rancher/rancher#40558 (comment)

Areas or cases that should be tested

  • RKE1/RKE2 Cluster Create
    • Single node, multiple nodes/groups
  • RKE1/RKE2 Pool Scaling
    • Up / Down in a pool, including to zero left
  • RKE1/RKE2 Pool Create/Delete

Areas which could experience regressions

  • Anywhere that cluster information in cluster management or the home page is shown
    • The initial data should be fine, but anything that changes afterwards could be affected
  • This includes a partial revert of the fix for "Resource watch re-subscription uses wrong resource version and floods the k8s API server" #5997.
    • We now fall back on the potentially dodgy revision
      • If it's good there's no problem. If it's not we will re-fetch the entire lot (which we were doing pre-2.7.0)
      • This fix is better at ensuring info is kept up to date on screen, rather than potentially making an additional socket start that will fail

- fix issue where ..
  - state 1 - X machines + Y fake machines = total
  - state 2 - X+1 machines + Y-1 fake machines = same total
- same total meant sortable table `arrangedRows` value wasn't updating
- fix is to ensure the sort generation changes so `arrangedRows` doesn't return the cached rows
- this is the same method used for the project/namespace list
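A hedged sketch of the idea behind this fix: if the sort generation key includes something that changes when a real machine replaces a fake placeholder (or vice versa), the sortable table can't keep serving the cached `arrangedRows`. The property and function names here are illustrative, not the actual dashboard code.

```js
// Illustrative only: a sort-generation key that changes when the mix of
// real vs fake (placeholder) machines changes, even if the total number
// of rows stays the same.
function sortGenerationFn(rows) {
  const real = rows.filter((row) => !row.isFake).length;
  const fake = rows.length - real;

  // state 1 (X real + Y fake) and state 2 (X+1 real + Y-1 fake) now
  // produce different keys, so the cached arrangedRows are discarded.
  return `machines-${ real }-${ fake }`;
}
```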
…missed resource changes

- changes cover create, change and remove
- resource.stop events happen
  - when we unsub
  - after socket errors (that rancher sends, like revision `too old`)
  - after resource type permissions change
- there would be a gap between resource.stop (fetch latest revision, wait 5 seconds) and resource.start
- this could lead to missed resource changes and stale info on screen
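A rough sketch of the flow this commit describes, written as Vuex-style actions. The handler and getter names are assumptions for illustration; the real logic lives in the dashboard's steve subscribe plugin.

```js
// Illustrative only, not the actual subscribe plugin code.

// Backend told us to stop watching a type (unsub, socket error, permission
// change): re-watch straight away with the revision already in the store,
// instead of fetching the latest revision and waiting 5 seconds.
async function onResourceStop({ dispatch, getters }, { type }) {
  const revision = getters.storedRevision(type);

  await dispatch('watch', { type, revision });
}

// Backend rejected that revision as `too old`: re-fetch the whole type so
// the screen is brought back up to date, then watch from the new revision.
async function onRevisionTooOld({ dispatch }, { type }) {
  await dispatch('findAll', { type, opt: { force: true } });
  await dispatch('watch', { type });
}
```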

Linking a couple of pertinent changes

- forceWatch partially implemented - rancher@14862b2#diff-42632b5ed3c30e60abade8a67748b16d45e0778091713dd71a46d4bbe9211d2c
- too old originally removed https://github.com/rancher/dashboard/pull/3743/files
  - this was implemented before the backend fixed their spam

Note - resource.stop can be forced with CATTLE_WATCH_TIMEOUT_SECONDS=300 (on v1 will resource.stop every 5 mins)
Note - `too old` can be forced by editing the resource.stop handler with
      // const revision = type === '' ? undefined : 1;
      // dispatch('watch', { ...obj, revision });
@richard-cox richard-cox self-assigned this Feb 21, 2023
@richard-cox richard-cox added this to the v2.7.2 milestone Feb 21, 2023
@github-actions github-actions bot added this to the v2.7.2 milestone Feb 21, 2023
@richard-cox richard-cox marked this pull request as ready for review February 21, 2023 18:09