Fix stale management cluster resources #8224

Merged
merged 3 commits on Feb 23, 2023

Conversation

@richard-cox (Member) commented Feb 21, 2023

Summary

Fixes #7819
Fixes #7815

Occurred changes and/or fixed issues

There are a number of places where screen content goes 'stale' when creating an RKE2 DO cluster, adding/removing machine pools (deployments) and scaling a pool (deployment) up/down. Stale content covers

  • Cluster overall State
  • Freshly created pool does not show new machine
  • Machine from a removed pool remained
  • Pool's Machine summary bar graph not updating
  • Pool scaled down still shows removed machine
  • Created cluster shows empty pools without deployments

The only one of these that can be reproduced consistently is to scale down the last remaining machine in a second pool (the deleted machine stays in the second pool), then, from the same cluster state, edit the cluster to have one machine in the second pool and navigate quickly to the cluster detail page (the new machine does not show in pool two). For the others, it's just about getting lucky/unlucky when creating/editing/removing clusters and pools, and scaling pools.

From what I can tell there are three causes:

  1. We don't receive the relevant resource.create, resource.change, resource.remove message
    • Probably missed due to the lag between resource.stop and re-subscribing
  2. We do receive the relevant resource.create, resource.change, resource.remove message, but the UI doesn't update
    • Mystery Vue reactivity issue
  3. There are errors in shell/models/provisioning.cattle.io.cluster.js get unavailableMachines
    • Sometimes the machine we iterate over does not contain a status or status.condition
    • Must be 'crap in' somewhere

Fixes

  1. Missing socket messages - e73c55c
    • This is all about reducing the time between the backend telling us to resource.stop a resource sub and us trying to start it again.
    • resource.stop is caused by a number of things; more detail in the commit message
    • the new fix means we no longer wait 5 seconds; we re-sub straight away with whatever revision we have in the store
      • if the revision isn't too old (the stop probably came from a change of permission) it should mean we get all required resource change messages promptly
      • if the revision is too old (the stop probably came from the 30 min socket death) it means we'll re-fetch the entire state and re-watch from that revision
    • these changes were possible because the backend now gives us the correct `too old` message
  2. UI doesn't update following socket messages - 6ce6632
    • I think this is a machine/node list issue: we were adding/removing a fake entry to ensure a pool group was shown. This didn't change the footprint of the list (real one removed --> fake one added, fake one removed --> real one added), so a cached, stale set of rows was shown
  3. shell/models/provisioning.cattle.io.cluster.js get unavailableMachines - 90ccf3f
    • Make these null safe (a hedged sketch follows this list)
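A minimal sketch of the kind of null-safe guard fix 3 describes. The getter name comes from the PR; the machine/condition shape shown here is an assumption, not the actual getter body.

```js
// shell/models/provisioning.cattle.io.cluster.js (illustrative sketch only)
get unavailableMachines() {
  // Machines that have just arrived over the socket may not have a status
  // or status.conditions yet, so guard each step of the access.
  return (this.machines || []).filter((machine) => {
    const conditions = machine?.status?.conditions || [];

    return conditions.some((c) => c.type === 'Ready' && c.status !== 'True');
  });
}
```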

Technical notes summary

Workings out - rancher/rancher#40558 (comment)

Areas or cases that should be tested

  • RKE1/RKE2 Cluster Create
    • Single node, multiple nodes/groups
  • RKE1/RKE2 Pool Scaling
    • Up / Down in a pool, including to zero left
  • RKE1/RKE2 Pool Create/Delete

Areas which could experience regressions

  • Anywhere that cluster information in cluster management or the home page is shown
    • The initial data should be fine, but anything that changes afterwards could be affected
  • This includes a partial revert of the fix for "Resource watch re-subscription uses wrong resource version and floods the k8s API server" #5997.
    • We now fall back on the potentially dodgy revision
      • If it's good there's no problem. If it's not we will re-fetch the entire lot (which we were doing pre-2.7.0)
      • This fix is better at ensuring info is kept up to date on screen, rather than potentially making an additional socket start that will fail

- fix issue where ..
  - state 1 - X machines + Y fake machines = total
  - state 2 - X+1 machines + Y-1 fake machines = same total
- same total meant sortable table `arrangedRows` value wasn't updating
- fix is to ensure the sort generation changes so `arrangedRows` doesn't return the cached rows
- this is the same method used for the project/namespace list
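A hedged sketch of the idea behind this fix: if the sort generation key includes something that changes when a real machine replaces a fake placeholder (or vice versa), the sortable table can't keep serving the cached `arrangedRows`. The property and function names here are illustrative, not the actual dashboard code.

```js
// Illustrative only: a sort-generation key that changes when the mix of
// real vs fake (placeholder) machines changes, even if the total number
// of rows stays the same.
function sortGenerationFn(rows) {
  const real = rows.filter((row) => !row.isFake).length;
  const fake = rows.length - real;

  // state 1 (X real + Y fake) and state 2 (X+1 real + Y-1 fake) now
  // produce different keys, so the cached arrangedRows are discarded.
  return `machines-${ real }-${ fake }`;
}
```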
…missed resource changes

- changes cover create, change and remove
- resource.stop events happen
  - when we unsub
  - after socket errors (that rancher sends, like revision `too old`)
  - after resource type permissions change
- there would be a gap between resource.stop (fetch latest revision, wait 5 seconds) and resource.start
- this could lead to missed resource changes and stale info on screen
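A rough sketch of the flow this commit describes, written as Vuex-style actions. The handler and getter names are assumptions for illustration; the real logic lives in the dashboard's steve subscribe plugin.

```js
// Illustrative only, not the actual subscribe plugin code.

// Backend told us to stop watching a type (unsub, socket error, permission
// change): re-watch straight away with the revision already in the store,
// instead of fetching the latest revision and waiting 5 seconds.
async function onResourceStop({ dispatch, getters }, { type }) {
  const revision = getters.storedRevision(type);

  await dispatch('watch', { type, revision });
}

// Backend rejected that revision as `too old`: re-fetch the whole type so
// the screen is brought back up to date, then watch from the new revision.
async function onRevisionTooOld({ dispatch }, { type }) {
  await dispatch('findAll', { type, opt: { force: true } });
  await dispatch('watch', { type });
}
```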

Linking a couple of pertinent changes

- forceWatch partially implemented - rancher@14862b2#diff-42632b5ed3c30e60abade8a67748b16d45e0778091713dd71a46d4bbe9211d2c
- too old originally removed https://github.com/rancher/dashboard/pull/3743/files
  - this was implemented before the backend fixed their spam

Note - resource.stop can be forced with CATTLE_WATCH_TIMEOUT_SECONDS=300 (on v1 will resource.stop every 5 mins)
Note - `too old` can be forced by editing the resource.stop handler with
      // const revision = type === '' ? undefined : 1;
      // dispatch('watch', { ...obj, revision });
@richard-cox richard-cox self-assigned this Feb 21, 2023
@richard-cox richard-cox added this to the v2.7.2 milestone Feb 21, 2023
@github-actions github-actions bot added this to the v2.7.2 milestone Feb 21, 2023
@richard-cox richard-cox marked this pull request as ready for review February 21, 2023 18:09