Resource watch re-subscription uses wrong resource version and floods the k8s API server #5997
Comments
Connects to internal reference SURE-3208, and potentially others (SURE-4484).
Whoever takes this on, @cmurphy is happy to help.
As noted in the linked PR (#6024), we have a fix for this via Harvester, but it needs to be reviewed first.
@rak-phillip to look this over and sync with @cmurphy to see if the existing PR helps out here or if other work is required. Then we'd need to test.
Can we also get an estimate of when this would have started happening? Has it existed since the beginning of dashboard, or was it introduced at some point in time?
I spoke with @cmurphy offline and I think we should merge rancher/steve#49 because it looks like it fixes the issue that we're experiencing today. We can still make some changes to Dashboard to improve on the behavior described above, but the change to Steve should resolve this issue:
@cmurphy Was able to reproduce after creating/deleting around 30 deployments and waiting roughly 45 minutes. Observed websockets with sockId=1 and sockId=3 starting and stopping various watches (websocket1_websocket3_startstop.mp4). Observed cattle-agent debug logs confirming that watchers were being opened and closed repeatedly. Observed a very high rate of WATCH requests on the Kubernetes API server.
I tried to repro this recently and couldn't - I'm assuming that's because Colleen's PR was merged. I forced an incorrect resource version to be used - a random number - and confirmed that the socket has self-corrected and is still able to watch the resource. Or at least I THINK that is what this screenshot is showing:

@rak-phillip That's a great doc that you linked. In response to your comment:
This part of the linked doc makes me think the dashboard shouldn't be comparing resource versions with Math.max() at all:
I think there should probably be a new tech debt issue opened about making some minimum enhancements. I agree that we should accept the suggestion quoted above.

Side note on tools

I was also curious about this part of those upstream docs that mention tooling for watching K8s resources:

I wonder why we have an in-house implementation of findAll, or if there is a better tool that can be used with Vue. I found this example of watching a K8s collection with a Javascript client, and it looks simpler than what we have: https://github.com/kubernetes-client/javascript/blob/master/examples/cache-example.js#L13 If we used that client, it could help us differentiate between lists and individual resources.

Side note on performance

Before reading the linked docs, I thought a big reason the UI takes a long time to load was that etcd doesn't support pagination. But in fact, the K8s docs say the K8s API server does support pagination to make UIs faster and reduce server load:
Since the K8s API supports pagination, there should probably be another performance issue about looking into pagination for the list views.
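To make that follow-up concrete, here is a minimal sketch of chunked list requests: the Kubernetes API accepts a `limit` parameter and returns an opaque `continue` token while more pages remain. The base URL and response field names below are assumptions for illustration, not the dashboard's actual API layer.

```
// Sketch: page through a large collection with limit/continue instead of
// fetching everything in one request. URL shape and response fields are
// assumptions for illustration only.
const BASE = 'https://<server>/k8s/clusters/<cluster>/v1'; // placeholder

async function listAllPaged(resourceType: string, pageSize = 100): Promise<unknown[]> {
  const items: unknown[] = [];
  let continueToken: string | undefined;

  do {
    const params = new URLSearchParams({ limit: String(pageSize) });
    if (continueToken) params.set('continue', continueToken);

    const res = await fetch(`${BASE}/${resourceType}?${params}`, {
      headers: { Accept: 'application/json' },
    });
    const page = await res.json();

    // steve lists put items in `data`; the raw k8s API uses `items`.
    items.push(...(page.data ?? page.items ?? []));
    // An opaque continue token is returned while more pages remain.
    continueToken = page.continue ?? page.metadata?.continue;
  } while (continueToken);

  return items;
}
```

Each page could also be rendered as it arrives instead of waiting for the whole collection, which is where most of the perceived speed-up would come from.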
The flood of requests was corrected, but the dashboard still exhibits incorrect behavior that can be reproduced following the instructions above. This is what it looks like:
Even though this now isn't directly impacting the user, I maintain that step 3 is not the right behavior, and that the dashboard should use a valid number from the start instead of having to re-correct after feedback from the backend.
Manually reopening as the GitHub bot has failed to do so.
2.6.x backport: #7779
Actual backport ticket: #7707 (#7779 closed as duplicate)
* Moves sockets into the advanced worker
* worker can die peacefully now, making switching between clusters work.
* Make waitFor generic, wire in to waitForTestFn
* General Changes
  - Fixes for switching cluster - includes using common getPerformanceSetting
  - avoid new code to unsub before socket disconnect
  - handle `watch` `stop` requests
  - lots of TODO's (questions, work, checks, tests, etc)
  - use common
* Switch socket fixes
  - isAdvancedWorker should only be true for cluster store
  - advancedWorker to be wired in
* Fix socket id for cluster workers
  - sockets use an incremented local var for id
  - when we nuke the socket file within the worker this resets, so they all have an id of 1
  - work around this by applying the unix time
* Fix handling of new partial counts response
  - seen in dex cluster explorer dashboard
  - count cards would be removed when the partial counts response was received
* Make resourceWatcher the sole location for watch state
  - getters canWatch, watchStarted are now worked around (they look at state in the UI thread)
  - we now don't call resource.stop or resource.start in subscription
  - tidied up `forgetType`
  - moved clearFromQueue from steve mutations into subscription mutations (better location)
  - added and removed some TODOs
  - fixed watch (stop handler should be higher up, include force watch handling)
* pushes the csrf value into the worker and adds it to fetch request headers.
* refactors batchChanges to address ref concerns and be more performant
* Maintain schema reference whilst updating
  - This change mutates input in a function, which is bad...
  - but ensures the reference isn't broken, which is needed to maintain similar functionality as before
* Fix waitForTestFn
  - Seen when creating or viewing clusters
* On unwatch, ensure any pending watch requests are removed from the queue
  - this probably would have been a problem if the worker wasn't nuked
  - however, as the code's there, let's make it safe
  - Also added `trace` feature in the advanced worker, will probably bring it out to other places as well
* Fix navigation from cluster manager world to any cluster
  - Ensure that we handle the case where the advanced worker was created but the resource watcher wasn't
  - ... but fix the case where this was happening (aka ensure that a blank cluster context is ignored)
* Tidy some TODOs
* Add perf settings page
  - This will help test normal flow (when the advanced worker is disabled)
  - Note - setting is now in a bag. This may help us better support further settings (enable client side pagination, etc)
    ```
    advancedWorker: { enabled: false },
    ```
* FIX - Nav from cluster dashboard --> specific event --> cluster dashboard and events not re-subbed
  - Ensure we block default handling of resource.start (keep state in resource watcher)
* Tidying up some TODOs
* Adds in a cache and uses it to validate SCHEMA messages before batching.
* Forgot to actually save CSRF to the resourceWatcher when instantiated.
* an empty resource in a batchChange to signal remove
* Move addSchemaIndexFields to, and created removeSchemaIndexFields in, a new file
  - this avoids bringing class files into the worker
* Fix disconnect/reconnect
  - Remove `syncWatch` (do the watch/unwatch straight away)
  - Test/Fix re-sub on reconnect
  - Test/Fix growls on disconnect
* Tidying up some TODO's
  - including cleaning of workerQueue on resource.stop (this is SUPER defensive)
* batchChanges will now handle aliases
* Fix pods list - WIP
  - ensure podsByNamespace is updated on batchChange
  - TODO
    - the final update to the pod is ignored
    - removing a namespace cleans the cache correctly
    - disabling advanced worker still works
* Fix pods list - fixes
  - ensure podsByNamespace is updated on batchChange
  - Tested / Fixed
    - the final update to the pod is ignored
    - removing a namespace cleans the cache correctly
    - disabling advanced worker still works
* Tidying TODOs
* Remove default same-origin header - https://developer.mozilla.org/en-US/docs/Web/API/Request/credentials
* Fixed TODO description
* Refactor subscribe, make it clear which vuex feature relates to what
* Lots of Fixes
  - batchChanges fixes
    - fix index is 0 issues (!/!!index)
    - only `set` if we have to
    - ensure we set the correct index after pushing to list
    - ensure map is updated after reducing list size with limit
  - podsByNamespace fixes
    - ensure when we replace... we don't use the same referenced object
  - general service resource fixes
    - ensure service's pods list stays up to date with store
* Multiple improvements/fixes
  - resourceCache
    - store the hash instead of the whole object. This means longer load time but reduces memory footprint
  - resourceWatcher
    - don't re-sub on socket reconnect if watcher is in error
    - don't sub if watcher is in error
    - don't unwatch for 'failed to find schema' and 'too old' errors - this clears the error, we want to keep it to ensure we don't watch
  - Remove #5997 comments, follow on work #7917
* toggle debug, remap alias types, cleaned up comments and console
* Unit tests for batchChanges
  - Much more scope for some crazy content
* Logging tweaks
  - disable logging by default
  - initWorker comes in too late to affect initial trace, so just rely on the `debug` to toggle at runtime

Co-authored-by: Richard Cox <richard.cox@suse.com>
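For anyone skimming those notes, here is a rough sketch of the batching idea they describe (accumulate websocket changes in the worker, then flush them to the UI thread in one message). The names, shapes, and flush interval are illustrative, not the dashboard's actual implementation.

```
// Rough sketch of batching resource changes inside a web worker before
// handing them to the UI thread in one message. Names are illustrative only.
type ChangeKind = 'create' | 'change' | 'remove';

// Keyed by resource type, then by resource id; an empty object signals removal.
type BatchChanges = Record<string, Record<string, object>>;

const batch: BatchChanges = {};
let flushTimer: ReturnType<typeof setTimeout> | null = null;

function queueChange(kind: ChangeKind, type: string, id: string, resource: object): void {
  batch[type] = batch[type] || {};
  // Mirror the "empty resource signals remove" convention from the notes above.
  batch[type][id] = kind === 'remove' ? {} : resource;

  // Debounce: flush at most every 500 ms instead of once per websocket message.
  if (!flushTimer) {
    flushTimer = setTimeout(flush, 500);
  }
}

function flush(): void {
  flushTimer = null;
  // Inside a worker, postMessage hands the accumulated batch to the UI thread,
  // which can then apply it to the store in a single mutation.
  (self as any).postMessage({ batchChanges: batch });
  for (const type of Object.keys(batch)) delete batch[type];
}
```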
I have retested this and, from my understanding of the issue and the merged PRs, it should be closed. I am not seeing repetitions of the same websocket IDs, and it seems that the stop -> start watch behavior now uses sensible version numbers.
I'm running Rancher 2.6.11 and I'm not convinced that the backported fix in #7719 has actually resolved the problem. We are still seeing surges of cattle-cluster-agent watch start/stop events. These surges are correlated with high CPU and memory utilization on kube-apiserver, as well as high latency in the Rancher UI and kubectl commands.
@cmurphy should I open a new issue about this, or can/should we reopen this one? Or should we abandon 2.6.x and try to upgrade to 2.7.2+?
@skaven81 apologies for the delayed response. It would be a good idea to upgrade to 2.7.3 but I don't know if it would help with this issue. The main fix to the watch subscription spam issue was actually in rancher/steve#49 which went into 2.6.7, so this should have already been addressed. In your graphs, it looks like there are predictable surges which resolve themselves in a regular pattern. In this original issue, the re-subscription attempts were increasing at a constant rate after the initial timeout, so I would expect the graphs to continue on a straight path until the pod crashes and the line would fall off a cliff to zero. I think the surges you are seeing are probably normal and the performance issues you are seeing on those intervals could be alleviated by adding more memory and CPU. If you disagree, I would recommend opening a new issue because the symptoms you've described don't fully match the original issue.
That's not "predictable surges" shown in the graph. The entire width of that zigzag line is a single event, not a series of short ones. The behavior we see is that a Rancher UI browser session will hit some sort of corner-case condition (probably something similar to what was fixed in rancher/steve#49 with watch subscriptions timing out), and at that point we see a huge spike of kube-apiserver calls from the cattle-cluster-agent using the agent/v0.0.0 userAgent string. The kube-apiserver calls have the "acting as" metadata associated, so we can see that the calls are being executed on behalf of the user whose browser has gone haywire. Interestingly, there is no additional traffic coming from the web browser -- the traffic is entirely (or mostly) between cattle-cluster-agent and kube-apiserver in the downstream cluster. Not between the browser and Rancher, and not between Rancher and the cattle-cluster-agent. The zigzags you see in the charts are a zoomed-in view of one of these events, showing how the cattle-cluster-agent begins logging a series of watch stop/start events while simultaneously hammering the kube-apiserver in the same pattern.
Yeah, that's definitely not what we see. It has always been this sudden increase of kube-apiserver traffic for a single user (associated with one browser session) which then holds steady until the user closes their browser or refreshes the tab, at which point I guess the watches get re-established and kube-apiserver traffic for that user drops back to a normal level. I'll open a new case for this.
Thanks!
@cmurphy I have created rancher/rancher#41663 to track the other related issue.
Setup
Describe the bug
The Vue dashboard uses websockets to subscribe to kubernetes events in the downstream cluster. The watch has a default timeout of 30 minutes. If the dashboard is left open for longer than that, the watch will expire and the dashboard will attempt to resubscribe. However, the watch request requires a resource version to start the watch from, and the dashboard is providing an invalid version, causing the subscription to fail. The page must be refreshed in order to properly re-start the watch. In extreme cases at large scale, this causes the kubernetes API server to become overloaded.
To Reproduce
Create a downstream cluster. Turn on debug logging for the agent by setting `CATTLE_DEBUG` to `true` in the environment variable configuration for the cluster. Turn on verbose logging for the kubernetes API server by editing the YAML and adding a section under rancher_kubernetes_engine_config->services->kube-api (a sketch of one possible form is shown after these steps). When the cluster is running, navigate to "Deployments" under "Workloads". Create at least one workload, like a Deployment or StatefulSet, or deploy an App, and come back to the Workloads tab. Open the Network tab.
EDIT: reproducing this depends on having the List resource version be extremely skewed from the resource version of the item that is picked as a base. To reproduce this consistently, create AND delete several (i.e. tens to hundreds of) Deployments, THEN navigate to the Workloads tab and refresh. If the watch is started while Deployments are being created and deleted, the resource version used may come from one of the websocket updates and the problem will not be visible; if there is no activity happening, it will be picked from the last list response instead.
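The exact contents of the kube-api section are not shown above; assuming an RKE-provisioned cluster, a verbose-logging section could plausibly look like the following (the log level value is an example, not taken from the original report):

```
rancher_kubernetes_engine_config:
  services:
    kube-api:
      extra_args:
        v: "6"
```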
Result
In the Network tab, you can see the initial requests for the list of objects relevant to Workloads, for example:
➡️
In one of the websocket requests (for me it's always `wss://<server>/k8s/clusters/<cluster>/v1/subscribe?sockId=3`), you can see an outgoing websocket message to start a watch for that resource type and version:
🔼
{"resourceType":"apps.statefulset","resourceVersion":"400478"}
After 30 minutes, the watch is closed (which is expected, because of the default timeout):
🔽
{name: "resource.stop", resourceType: "apps.statefulset", data: {type: "subscribe", links: {}}}
We then see a new outgoing watch request, but now for a completely different resource version:
🔼
{"resourceType":"apps.statefulset","resourceVersion":"17159"}
The watch immediately fails and is closed:
🔽
{name: "resource.stop", resourceType: "apps.statefulset", data: {type: "subscribe", links: {}}}
and the cycle continues.
In the cluster agent debug logs, you can see the watch being stopped and started on repeat:
On the kubernetes API server, you will see it being hammered by repeated GET requests with the URL including `watch=true`.
It turns out this resourceVersion is from one individual item in the list of resources; see the JSON above. This is not the revision number returned for the collection by steve. Steve fails to initialize the watch because the resource version is "too old".
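For context, a watch started from a resourceVersion that the API server has already compacted is normally rejected with a 410 ERROR event roughly like the one below (the version numbers are illustrative, reusing the ones from this report):

```
{
  "type": "ERROR",
  "object": {
    "kind": "Status",
    "apiVersion": "v1",
    "status": "Failure",
    "message": "too old resource version: 17159 (400478)",
    "reason": "Expired",
    "code": 410
  }
}
```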
Note that kubernetes uses the field `resourceVersion` both for the version of the List and for the versions of the individual resources, where steve changes the List's `resourceVersion` to `revision`. When re-subscribing after a timeout, the revision number for the whole collection needs to be used, not the resource version number of any one item in the collection.
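In dashboard terms, a minimal sketch of that rule could look like the following. It assumes the list response exposes the collection revision and that change events may carry one too; the function and field names are illustrative, not the actual store code.

```
// Illustrative sketch only (not the dashboard's actual store code):
// track the collection-level revision and use it when re-subscribing.
const latestRevision: Record<string, string> = {};

// When the initial list response arrives, remember the collection revision
// (steve exposes it as `revision` on the list; raw k8s puts it in the
// List object's metadata.resourceVersion).
function onListLoaded(resourceType: string, list: { revision: string }): void {
  latestRevision[resourceType] = list.revision;
}

// If change events carry the revision they were observed at, keep the stored
// value current so a later re-subscribe starts from "now".
function onResourceEvent(resourceType: string, event: { revision?: string }): void {
  if (event.revision) latestRevision[resourceType] = event.revision;
}

// On a `resource.stop` caused by the watch timeout, re-subscribe from the
// collection revision rather than from any single item's resourceVersion.
function resubscribe(socket: WebSocket, resourceType: string): void {
  socket.send(JSON.stringify({
    resourceType,
    resourceVersion: latestRevision[resourceType],
  }));
}
```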
Expected Result
After a timeout, a new watch is started successfully and is ongoing until the next timeout.
Screenshots
Additional context
This may not be 100% reproducible because sometimes the item resource version is close enough to the collection revision that the watch still works, but this is an accident. If you deploy additional workloads and apps without making updates to any of them, eventually the collection revision number will get significantly bigger while the original item version number stays the same, and you should see the issue occur.
To help with debugging, I've proposed a change to steve which will:

* send a `resource.error` event through the websocket if the connection is closed irregularly
* allow the watch timeout to be configured, so that `CATTLE_WATCH_TIMEOUT_SECONDS` can be set to, e.g., 300 (5 minutes) in the cluster agent environment.

The agent image with this change is published as `cmurpheus/rancher-agent:v2.6-head-0e2fdc-steve-debug`. EDIT: new image with latest changes at `cmurpheus/rancher-agent:v2.6-head-f4e1de7-steve-debug`.
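If you want to try the shorter timeout, one way to set that variable on the downstream cluster agent (assuming the usual cattle-cluster-agent deployment in the cattle-system namespace) is:

```
kubectl -n cattle-system set env deployment/cattle-cluster-agent CATTLE_WATCH_TIMEOUT_SECONDS=300
```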